American Journal of Signal Processing
p-ISSN: 2165-9354 e-ISSN: 2165-9362
2016; 6(1): 19-23
doi:10.5923/j.ajsp.20160601.03

Yuki Takashima 1, Toru Nakashika 2, Tetsuya Takiguchi 1, Yasuo Ariki 1
1Graduate School of System Informatics, Kobe University, Kobe, Japan
2Graduate School of Information Systems, The University of Electro-Communications, Chofu, Japan
Correspondence to: Yuki Takashima, Graduate School of System Informatics, Kobe University, Kobe, Japan.
Copyright © 2016 Scientific & Academic Publishing. All Rights Reserved.
This work is licensed under the Creative Commons Attribution International License (CC BY).
http://creativecommons.org/licenses/by/4.0/

In this paper, we discuss speech recognition for persons with articulation disorders resulting from athetoid cerebral palsy. Because the speech style of a person with this type of articulation disorder is quite different from that of a physically unimpaired person, a conventional speaker-independent acoustic model trained on unimpaired speech is of little use for recognizing it; a speaker-dependent model for a person with an articulation disorder is therefore necessary. In our previous work, a feature extraction method using a convolutional neural network was proposed to deal with small local fluctuations of dysarthric speech, and its effectiveness was shown in a word recognition task. The neural network needs a training label (teaching signal) to be trained with back-propagation, and the previous method used the results of forced alignment with HMMs as the training label. However, because the phoneme boundaries in an utterance by a dysarthric speaker are ambiguous, it is difficult to obtain a correct alignment, and if a wrong alignment is used, the network may be inadequately trained. Therefore, we propose a probabilistic phoneme labeling method based on the Gaussian distribution. In contrast to the general approach, we treat the phoneme label as a soft label; that is, the proposed label takes continuous values. This approach is effective for dysarthric speech, in which the phoneme boundaries are ambiguous. The effectiveness of this method has been confirmed by comparison with forced alignment.
Keywords: Articulation disorders, Feature extraction, Convolutional neural network, Bottleneck feature, Phoneme labeling
Cite this paper: Yuki Takashima, Toru Nakashika, Tetsuya Takiguchi, Yasuo Ariki, Phone Labeling Based on the Probabilistic Representation for Dysarthric Speech Recognition, American Journal of Signal Processing, Vol. 6 No. 1, 2016, pp. 19-23. doi: 10.5923/j.ajsp.20160601.03.
Figure 1. Example of a spectrogram for /ikioi/ spoken by a physically unimpaired person
Figure 2. Example of a spectrogram for /ikioi/ spoken by a person with an articulation disorder using forced alignment (top) and manual alignment (bottom)
Let $s_t$ denote the random variable that indicates a phoneme at time $t$. For example, $s_t = k$ indicates that the phone label at time $t$ is the $k$-th phoneme in an utterance. In this paper, the probability $P(s_t = k)$ is defined as follows:

$$P(s_t = k) = \frac{\mathcal{N}(t;\, \mu_k, \sigma_k^2)}{\sum_{k'=1}^{K} \mathcal{N}(t;\, \mu_{k'}, \sigma_{k'}^2)} \qquad (1)$$

where $\mathcal{N}(\,\cdot\,;\mu,\sigma^2)$ is the Gaussian probability density function with mean $\mu$ and variance $\sigma^2$, and $K$ and $k$ are the number of phonemes included in the utterance and its index, respectively. In this paper, $\mu_k$ and $\sigma_k$ are the center of the $k$-th phoneme duration and its spread, and are defined as follows, respectively:

$$\mu_k = \frac{b_{k-1} + b_k}{2} \qquad (2)$$

$$\sigma_k = \gamma \,(b_k - b_{k-1}) \qquad (3)$$

where $b_k$ is the boundary time between the $k$-th phoneme and the $(k+1)$-th phoneme, and $\gamma$ is the non-negative hyperparameter that controls the variance. ($b_0$ and $b_K$ are the start time and the end time of the utterance, respectively.) In our approach, we first give the phoneme boundary for each utterance. However, this is not the correct boundary but only an approximate one. Next, the phoneme duration and its variance are set up from the phoneme boundaries using (2) and (3). Finally, the existence probabilities for all frames are calculated using (1), and the obtained probabilities are regarded as the phoneme labels. From the characteristics of the Gaussian distribution, the probability is high around the mean and decreases with distance from it. Consequently, we expect the soft phoneme labeling to give a good representation not only of the steady state around the center of the phoneme duration but also of the unclear phoneme boundary.

Figure 3. Illustration of the proposed Gaussian labeling for an utterance /ki/ in Japanese
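The labeling procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name, the frame-index time axis, and the assumption that the spread is $\sigma_k = \gamma \times$ (phoneme duration) are ours, and the per-frame normalization over phonemes is assumed so that each frame's soft label is a proper distribution.

```python
import numpy as np

def gaussian_soft_labels(boundaries, n_frames, gamma=0.4):
    """Soft phoneme labels from approximate boundaries (hypothetical sketch).

    boundaries : [b_0, b_1, ..., b_K] frame indices; b_0 and b_K are the
                 start and end of the utterance, b_k the approximate boundary
                 between the k-th and (k+1)-th phonemes.
    Returns an (n_frames, K) matrix whose rows each sum to 1.
    """
    b = np.asarray(boundaries, dtype=float)
    mu = (b[:-1] + b[1:]) / 2.0           # eq. (2): center of each phoneme
    sigma = gamma * (b[1:] - b[:-1])      # eq. (3): spread scales with duration
    t = np.arange(n_frames, dtype=float)[:, None]
    # Gaussian density N(t; mu_k, sigma_k^2) for every frame/phoneme pair
    dens = np.exp(-0.5 * ((t - mu) / sigma) ** 2) / sigma
    # eq. (1): normalize over phonemes so each frame gets a distribution
    return dens / dens.sum(axis=1, keepdims=True)

# Three phonemes over a 100-frame utterance, with rough boundaries at 30 and 55
labels = gaussian_soft_labels([0, 30, 55, 100], n_frames=100)
print(labels.shape)  # (100, 3)
```

Near a boundary frame (e.g. around frame 30 here), two columns of `labels` carry comparable mass, which is exactly the ambiguity the hard one-hot alignment cannot express.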
We examined the hyperparameter $\gamma$ in (3), which controls the variance. The best performance over the test data was obtained at $\gamma = 0.4$.
Fig. 4 depicts the difference between hard labeling and soft labeling, showing an example of the training label, the manual alignment, and the proposed alignment. In the proposed alignment, a phone label transitions to another one gradually. Fig. 5 shows the experimental results using the probabilistic (soft) phoneme labeling method (“Gaussian”), compared with the conventional MFCC features and forced alignment (“forced”). Fig. 5 also shows the results for manual alignment (“manual”). In this experiment, $\gamma$ is set to 0.4 in (3). The alignment obtained from our method provided better recognition accuracy than both forced alignment and manual alignment. We consider that the network is trained more flexibly by using the soft labels.

Figure 5. Word recognition accuracy for utterances of “speaker A” using each phoneme labeling method and conventional MFCC features
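The flexibility attributed to soft labels above can be made concrete with the cross-entropy training criterion. The toy numbers below are hypothetical and this is not the authors' CNN; it only illustrates how a soft target, unlike a one-hot target, does not force the network to put all probability mass on a single phoneme at an uncertain boundary frame.

```python
import numpy as np

def cross_entropy(pred, target, eps=1e-12):
    """Cross-entropy between a predicted distribution and a (possibly soft) target."""
    return -np.sum(target * np.log(pred + eps))

pred = np.array([0.6, 0.3, 0.1])    # network output for one frame
hard = np.array([1.0, 0.0, 0.0])    # one-hot label from forced alignment
soft = np.array([0.7, 0.25, 0.05])  # Gaussian soft label near a boundary

# The hard target penalizes any mass placed off the chosen phoneme, even
# when the boundary itself is uncertain; the soft target instead rewards
# matching the whole distribution across neighboring phonemes.
print(cross_entropy(pred, hard))
print(cross_entropy(pred, soft))
```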
For “speaker B,” the hyperparameter $\gamma$ was tuned in the same way, and the best performance over the test data was obtained at the tuned value.
Fig. 6 shows the experimental results for “speaker B.” When using manual alignment, the recognition accuracies did not improve compared with forced alignment. This is because, for “speaker B,” the forced alignment was more accurate than the manual alignment: it may be too difficult to carry out alignment manually due to the unclear phoneme boundaries, and even a few mistakes can degrade performance because of the limited (small amount of) training data. Nevertheless, as shown in Fig. 6, the probabilistic labeling method achieved the best accuracy under every condition.

Figure 6. Word recognition accuracy for utterances of “speaker B” using each phoneme labeling method and conventional MFCC features