CS 4641 Spring 2020
Gavin Lo, Maria Padilla Fuentes, Jonathan Womack, Megha Tippur and Jerius Smith
We created a silent spelling system that aims to recognize language from silent utterances. This system is useful for people with speech impairments, for people in noisy environments, and for people in social situations that restrict speech. Speech impairments may result from ALS, tracheostomies, or deafness. Noisy environments limit the effectiveness of technologies that rely on speech as input, such as conversational AI and dictation tools. Social situations, like meetings, and private information may require a subtler interface than speech. Researchers have explored using different sensors, like surface electromyography and video, to capture data that encodes speech. We build on existing speech recognition research by using Hidden Markov Models to recognize silent utterances at the letter level.
In 1989, Steve Young developed the HTK Hidden Markov Model Toolkit at Cambridge University, primarily for speech recognition research. Interest in the software soon spread to other speech laboratories and universities, which also adopted it for pedagogical purposes. By 1994, Young had described the design and philosophy underlying HTK in detail.
By 2003, the burgeoning field of wearable computing saw increased use of gesture recognition as a tool for interaction. However, building systems for gesture recognition required significant knowledge and effort. Westeyn developed the Georgia Tech Gesture Toolkit on top of HTK to ease the development of such systems. The Georgia Tech Gesture Toolkit trains models that recognize gestures both in real time and offline.
More recently, a silent speech recognition system was developed by Li. The system relied on the CompleteSpeech SmartPalate to collect data and a support vector machine to recognize speech from the collected data. The system had a 21-word vocabulary and was evaluated through user studies with offline and online recognition for native and non-native users. Li also examined interaction with the silent spelling system in two settings: sitting and walking. Their results indicated that, for native speakers, the system achieved high accuracy and an information transfer rate comparable to that of a mouse and touchscreen.
Figure 1: SmartPalate Device used in Li's silent speech system.
In an effort to develop our silent speller system, we trained a Hidden Markov Model to recognize silent utterances of letters using a dataset collected by Naoki Kimura and optimized our pipeline with respect to speed and memory.
The CompleteSpeech SmartPalate is a retainer fitted with 124 capacitive sensors. Each sensor records a zero or one at a frequency of 100 Hz. Kimura used the SmartPalate to collect the dataset used to train and test our Hidden Markov Models. The dataset consisted of 20 samples of each letter of the alphabet. Each sample was collected over the course of one second and consisted of 100 frames of 124-dimensional binary vectors.
Figure 2-1: Visualization of the capacitive sensors activated over the course of uttering the letter "A".
Figure 2-2: Visualization of the 124 capacitive sensors in the SmartPalate.
Figure 2-3: Visualization of the capacitive sensors activated over the course of uttering the letter "B".
Figure 2-4: Plot of the capacitive sensors activated over the course of uttering the letter "A".
Figure 2-5: Plot of the capacitive sensors activated over the course of uttering the letter "B".
Figure 2-6: Plot of the capacitive sensors activated over the course of uttering the letter "W".
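To make the dataset's shape concrete, the following sketch builds a placeholder array with the dimensions described above. The loader and random data are hypothetical, since the report does not specify the on-disk format; only the shapes (26 letters, 20 samples each, 100 frames of 124 binary sensor values) come from the text.

```python
import numpy as np

# Placeholder standing in for the collected SmartPalate recordings.
# Each sample is a (100, 124) binary array: 100 frames (1 second at
# 100 Hz) of 124 capacitive sensor readings.
rng = np.random.default_rng(0)

n_letters, n_samples = 26, 20
frames, sensors = 100, 124

data = rng.integers(0, 2, size=(n_letters, n_samples, frames, sensors))

print(data.shape)  # (26, 20, 100, 124)
```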
The algorithm we used to classify the dataset was a Hidden Markov Model. Specifically, we used a left-to-right Hidden Markov Model, which results in an HMM structure like the one in Figure 2-7 below:
Figure 2-7: A three state left-to-right Hidden Markov Model with arrows representing transition probabilities.
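The left-to-right constraint means each state may only transition to itself or to a later state. A minimal sketch of such a transition matrix is shown below; the probability values are illustrative, not trained values from our models.

```python
import numpy as np

# Transition matrix for a three-state left-to-right HMM: each state may
# stay put or advance to the next state; backward transitions have zero
# probability (the lower triangle is all zeros).
A = np.array([
    [0.6, 0.4, 0.0],  # state 1 -> {1, 2}
    [0.0, 0.7, 0.3],  # state 2 -> {2, 3}
    [0.0, 0.0, 1.0],  # state 3 is absorbing
])

# Sanity checks: rows sum to 1 and the left-to-right constraint holds.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(np.tril(A, k=-1), 0.0)
```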
Figure 2-8: Viterbi trellis of possible outcomes and state transitions through time.
The HTK Toolkit constructs the Viterbi trellis and trains the model on the data we provide.
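HTK performs this decoding internally, but the idea behind the trellis can be sketched directly. The following is a minimal Viterbi decoder for a discrete-observation HMM, written for illustration only; it is not HTK's implementation, and the toy model at the bottom is hypothetical.

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Most likely state path for a discrete-observation HMM.

    obs: sequence of observation indices
    A:   (S, S) transition probabilities
    B:   (S, O) emission probabilities
    pi:  (S,) initial state distribution
    """
    S, T = A.shape[0], len(obs)
    delta = np.zeros((T, S))           # best log-probability ending in each state
    psi = np.zeros((T, S), dtype=int)  # backpointers through the trellis
    with np.errstate(divide="ignore"):
        logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA  # (from-state, to-state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    # Trace the best path backwards through the backpointers.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy two-state left-to-right model with two observation symbols.
A = np.array([[0.7, 0.3], [0.0, 1.0]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([1.0, 0.0])
print(viterbi([0, 0, 1, 1], A, B, pi))  # [0, 0, 1, 1]
```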
Since our goal is to create a real-time silent spelling system, we needed to optimize its speed. We did this by subsampling the data, either by summing multiple timesteps into a single timestep or by grouping the capacitive sensors into sections.
To subsample over time, we summed every five consecutive 124-dimensional vectors into a single 124-dimensional vector with values between 0 and 5.
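This time-subsampling step can be sketched with a reshape-and-sum, assuming one utterance is stored as a (100, 124) binary array as described above; the random sample here is a placeholder for real SmartPalate data.

```python
import numpy as np

# Placeholder utterance: 100 frames of 124-dimensional binary readings.
rng = np.random.default_rng(0)
sample = rng.integers(0, 2, size=(100, 124))

# Sum every 5 consecutive frames into one frame: (100, 124) -> (20, 124),
# with each entry now an activation count between 0 and 5.
window = 5
subsampled = sample.reshape(-1, window, sample.shape[1]).sum(axis=1)

print(subsampled.shape)  # (20, 124)
```

No activations are lost in this reduction, only the timing within each five-frame window.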
Subsampling by section converted each 124-dimensional vector into a 10-dimensional vector whose values represent the number of capacitive sensors activated in each group.
Figure 2-9: Groupings of capacitive sensors for subsampling by section.
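The per-section reduction can be sketched as follows. The actual sensor-to-section assignment is given in Figure 2-9 and is not reproduced here, so this sketch splits the 124 sensors into 10 roughly equal contiguous groups purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
frame = rng.integers(0, 2, size=124)  # one 124-dimensional binary frame

# Hypothetical grouping: 10 roughly equal contiguous sections standing in
# for the groupings shown in Figure 2-9.
sections = np.array_split(np.arange(124), 10)

# Reduce the frame to a 10-dimensional vector of per-section activation counts.
reduced = np.array([frame[idx].sum() for idx in sections])

print(reduced.shape)  # (10,)
```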
The performance of our models varied as we tuned subsampling and number of states.
Figure 2-10: Accuracy and speed of model when subsampled through time.
The number of frames in each subsample affected the accuracy and speed of our HMMs. Increasing the number of frames in each subsample reduces the amount of information used to train the model, which decreases the time to train the model. However, subsampling the data over time loses temporal information, resulting in decreased prediction accuracy.
Figure 2-11: Accuracy and speed of model with varied number of states.
The number of hidden states in the HMMs also affected the speed and accuracy of our models. Unsurprisingly, as we increased the complexity of our models, the prediction accuracy increased. However, it is important to consider whether the model generalizes, which we evaluate later with cross validation.
For subsampling by section, we grouped the capacitive sensors into 10 sections. With this subsampling, the model ran in 7.9 seconds and achieved 66 percent accuracy.
To evaluate our recognizer's accuracy, we performed 4-fold stratified cross validation on the ten-state model without subsampling through time. The overall accuracy of the model was 79.82%.
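A stratified split ensures each fold contains the same proportion of every letter. The sketch below builds such a 4-fold split by hand for the dataset's 26 letters with 20 samples each; the shuffling and round-robin assignment are illustrative, not the exact procedure we ran.

```python
import numpy as np

# Labels 0..25 stand in for the 26 letters, with 20 samples per letter.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(26), 20)

n_folds = 4
folds = [[] for _ in range(n_folds)]
for letter in np.unique(labels):
    idx = rng.permutation(np.where(labels == letter)[0])
    # Deal this letter's 20 samples evenly across folds: 5 per fold.
    for i, chunk in enumerate(np.array_split(idx, n_folds)):
        folds[i].extend(chunk.tolist())

# Each fold holds 130 samples: 5 utterances of each of the 26 letters,
# so every fold mirrors the letter distribution of the full dataset.
for fold in folds:
    counts = np.bincount(labels[fold], minlength=26)
    assert (counts == 5).all()
print([len(f) for f in folds])  # [130, 130, 130, 130]
```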
Figure 2-12: Confusion Matrix showing the accuracy of predictions as a proportion.
The accuracy of our models was determined by their offline classification performance. While our model's accuracy could not yet support a practical silent spelling system, it provides preliminary results with promise for improvement. The information we gathered about optimizing our models, by modifying our datasets through subsampling and by adjusting our model's architecture (states and skips), will be crucial to further developments in this area of research. Specifically, these optimizations will enable future silent spelling systems to perform letter recognition in real time. In our experiments, subsampling through time was limited by the length of each data sample and the number of states in our HMM; this limitation will be relaxed as the length of each data sample increases, which will occur with the introduction of context training and a statistical grammar. The speed of our models was additionally limited by verbose output logs on our machines. In a production-level silent spelling system, such logs will be removed to support real-time recognition.
Moving forward, we plan on increasing the accuracy of our recognizer by introducing context training and a statistical grammar, measures that have been shown to decrease the error rate by a factor of eight in practice. Context training and a statistical grammar will require a dataset that has sequences of data representing phrases or words silently uttered letter by letter.