Chinese Lipreading with RNN
29 Jun 2018 (This is a brief summary of my undergraduate thesis project, which won a university-level excellence award.)
Features of my thesis
- Migrating cutting-edge lipreading research results to Chinese lipreading practice
The lipreading methodology is greatly inspired by the following article:
Wand, Michael, Jan Koutník, and Jürgen Schmidhuber. “Lipreading with long short-term memory.” Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016.
- Proposing a valid automatic lip feature segmentation method that allows the lipreading system to operate on real-world problems
- The experiment confirms that the system functions well on a small-scale real-world Chinese lipreading problem
System architecture
Figure: the architecture of the lipreading system. The English translations of the labels are: 1. frames, 2. face detection, 3. fitting contour model, 4. segmentation, 5. vocal segments, 7. Chinese words and sentences.
The frames of the video are fed into the face detection and face alignment modules, which convert the original video into a lip feature sequence. This sequence is then segmented into two types of segments: vocal segments, in which the speaker is speaking, and silent segments, in which the speaker stays silent. The vocal segments are fed into the RNN for classification, which outputs the corresponding Chinese words and sentences.
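The sketch below shows how these stages might be wired together in Python. The three component callables are hypothetical placeholders standing in for the face alignment, segmentation and RNN modules sketched in the sections that follow.

```python
# A minimal sketch of the data flow described above. The three component
# functions are hypothetical placeholders for the face alignment, segmentation
# and RNN modules sketched in the sections below.
from typing import Callable, List, Sequence, Tuple
import numpy as np

def lipread(
    frames: Sequence[np.ndarray],
    extract_lip_features: Callable[[np.ndarray], np.ndarray],
    find_vocal_spans: Callable[[np.ndarray], List[Tuple[int, int]]],
    classify_segment: Callable[[np.ndarray], str],
) -> List[str]:
    """Map a sequence of video frames to a sequence of recognised words."""
    # 1. Face detection + alignment: one lip feature vector per frame.
    features = np.stack([extract_lip_features(frame) for frame in frames])
    # 2. Segmentation: keep only the vocal (speaking) spans of the sequence.
    vocal_spans = find_vocal_spans(features)
    # 3. Classification: the RNN labels each vocal segment with a word.
    return [classify_segment(features[start:end]) for start, end in vocal_spans]
```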
- Using HOG + SVM to achieve face detection
The implementation of face detection is provided by dlib. The robustness of the face detection module has been verified with experiments.
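As a minimal illustration, dlib's pretrained frontal face detector is a HOG + linear SVM model; the image path below is a placeholder, not a file from the thesis.

```python
# A minimal sketch of HOG + SVM face detection with dlib; the image path is a
# placeholder.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()   # pretrained HOG + linear SVM

frame = cv2.cvtColor(cv2.imread("frame_000.png"), cv2.COLOR_BGR2RGB)
faces = detector(frame, 1)                     # 1 = upsample the image once
for rect in faces:
    print(rect.left(), rect.top(), rect.right(), rect.bottom())
```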
- Using gradient boosting and random forest regression to achieve face alignment and lip feature extraction
The basic idea of face alignment is acquired from the following article:
Kazemi, Vahid, and Josephine Sullivan. “One millisecond face alignment with an ensemble of regression trees.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.
The implementation of face alignment is provided by dlib as well. The robustness of the face alignment module has been verified with experiments.
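For illustration, the sketch below uses dlib's 68-point shape predictor (an ensemble-of-regression-trees model in the spirit of Kazemi and Sullivan) and keeps the mouth landmarks (points 48-67) as the per-frame lip feature; the model file name and image path are placeholders.

```python
# A minimal sketch of lip feature extraction with dlib's 68-point shape
# predictor; the model file and image path are placeholders.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

frame = cv2.cvtColor(cv2.imread("frame_000.png"), cv2.COLOR_BGR2RGB)
rect = detector(frame, 1)[0]            # assume exactly one face per frame
shape = predictor(frame, rect)          # 68 facial landmarks
# Points 48-67 of the 68-point model outline the mouth; flatten them into a
# single lip feature vector for this frame.
lips = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
lip_feature = lips.flatten().astype(np.float32)
```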
- Using PCA to extract numerical features from lip feature segments, which are subsequently used to segment the entire lip feature sequence
It is desirable to quantify the distance between an unknown lip segment and a silent segment: using both the value of this distance and its gradient, the lip feature sequence can be segmented more accurately.
A figure that demonstrates the result of automatic lip segmentation. The horizontal axis indicates the index of lip feature segments in the video; the vertical axis indicates the common logarithm of the distance between a segment and a silent segment. Green dots indicate silent segments; red dots indicate vocal segments. The vertical lines are the segment boundaries generated by the automatic lip segmentation method.
In our experiment, the accuracy of lip feature segmentation on the validation set is 95.21%.
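A hedged sketch of this idea is shown below: lip features are projected with PCA, each short window is compared against a reference silent window, and the distance together with its gradient marks the vocal spans. The window length, distance measure and thresholds are illustrative assumptions, not the exact values used in the thesis.

```python
# A sketch of PCA-based segmentation: project lip features, measure each
# window's distance to a reference silent window, and use the distance and
# its gradient to flag vocal windows. Window size, distance measure and
# thresholds are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

def vocal_mask(features, silent_reference, window=5,
               dist_thresh=1.0, grad_thresh=0.5):
    """features: (n_frames, dim) lip features; silent_reference: (k, dim)."""
    pca = PCA(n_components=4).fit(silent_reference)
    silent_center = pca.transform(silent_reference).mean(axis=0)

    n_windows = len(features) // window
    dists = np.empty(n_windows)
    for i in range(n_windows):
        seg = pca.transform(features[i * window:(i + 1) * window])
        dists[i] = np.linalg.norm(seg.mean(axis=0) - silent_center)

    log_dist = np.log10(dists + 1e-9)   # the quantity plotted in the figure above
    grad = np.gradient(log_dist)        # sharp changes mark segment boundaries
    return (log_dist > dist_thresh) | (np.abs(grad) > grad_thresh)
```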
- Using an LSTM network to achieve lipreading
In our experiment, the accuracy of the LSTM network on the training set is 97.04%; the accuracy on the validation set is 87.50%.
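A minimal sketch of such a classifier in Keras is given below; the layer size, masking scheme and optimizer are assumptions for illustration, not the exact configuration from the thesis.

```python
# A minimal sketch of a shallow LSTM word classifier over per-frame lip
# features; layer sizes and hyper-parameters are illustrative assumptions.
import tensorflow as tf

NUM_WORDS = 9      # 3 two-character + 3 three-character + 3 four-character words
FEATURE_DIM = 40   # e.g. 20 mouth landmarks x 2 coordinates per frame

model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(None, FEATURE_DIM)),
    tf.keras.layers.LSTM(64),                     # a single (shallow) LSTM layer
    tf.keras.layers.Dense(NUM_WORDS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```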
Corpus
Because public Chinese lipreading corpora are rare, the corpus was recorded by the author himself. There is only one speaker, so all experiments are speaker-dependent. The corpus contains three 2-character words, three 3-character words, and three 4-character words. In the training set, each word is repeated at least 10 times; in the validation set, each word is repeated exactly twice.
Overall Performance
If a lipreading result is counted as correct only when both the segmentation result and the LSTM output are correct, the system achieves 84.85% accuracy on the training set and 91.67% accuracy on the validation set.
If the accuracy is instead measured by the angle between word vectors, the angle for the training set is 10.07° and the angle for the validation set is 16.10°.
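For reference, the angle between two vectors is the arccosine of their normalized inner product. The snippet below assumes the "word vector" is a count vector of recognized words over the nine-word vocabulary, which is an interpretation rather than a detail stated in the thesis.

```python
# Angle between two word vectors in degrees. The assumed meaning of "word
# vector" (counts over the nine-word vocabulary) is an interpretation, not a
# detail from the thesis.
import numpy as np

def angle_deg(u, v):
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```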
Conclusion
- A shallow LSTM network works well on small-scale Chinese lipreading problems
- The overall performance of the system indicates that it is able to address real-world Chinese lipreading problems
- Due to the limitations of equipment and corpus, only small-scale experiments were carried out. Further large-scale experiments are required to examine the capability of the system.