[Deep.In. Article] AdaSpeech2: Adaptive Text to Speech with Untranscribed Data - DeepBrainAI

Updated on July 31, 2024 | Technology
Published March 2, 2022
Deep Learning Team: Colin
Abstract

Like the AdaSpeech model we looked at last time, existing TTS adaptation methods use paired text-speech data to synthesize the voice of a specific speaker. In practice, however, it is difficult to prepare such paired data, so adapting the TTS model with untranscribed speech alone would be far more efficient. The most straightforward approach is to transcribe the speech with an automatic speech recognition (ASR) system, but ASR is hard to apply in certain situations and its recognition accuracy is not always high enough, which can degrade the final adaptation performance. Other work has tried to solve this by jointly training the TTS pipeline with an adaptation module, but such models cannot easily be combined with other commercial TTS systems.

AdaSpeech2 designs an additional module that can be plugged into any TTS model to enable learning from untranscribed speech (pluggable), and the resulting model produces results equivalent to a TTS model fully adapted on paired text-speech data (effective).

Summary for Busy People
  • An additional module is attached to the AdaSpeech architecture so that the model can adapt to a specific speaker using only speech data.
  • The mel encoder's latent space is trained to match the phoneme encoder's latent space, so the mel decoder receives the same features whether the input comes from text or from speech. This makes the approach suitable when only speech data is available to feed into the pre-trained TTS model.
  • AdaSpeech2's adaptation method can be attached to any TTS model and achieves performance comparable to models adapted to a specific speaker with paired text-speech data.

Model Structure

AdaSpeech2 uses AdaSpeech, which consists of a phoneme encoder and a mel-spectrogram decoder, as its backbone model. Acoustic condition modeling and conditional layer normalization are used just as in the original AdaSpeech, but they are omitted from the figure above for simplicity. On top of this backbone, a mel-spectrogram encoder that takes speech as input and encodes it is added, and an L2 loss is applied so that its output becomes similar to the output of the phoneme encoder. The detailed training process is explained below, and the code sketch that follows shows the overall layout in simplified form.
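To make the layout concrete, here is a minimal, hypothetical PyTorch sketch of the three components. The class and attribute names are our own simplification; the actual AdaSpeech encoders and decoder are Transformer stacks with acoustic condition modeling and conditional layer normalization, which are reduced here to simple placeholders.

```python
import torch
import torch.nn as nn

class SketchEncoder(nn.Module):
    """Stand-in for the phoneme encoder or the mel-spectrogram encoder
    (both are Transformer stacks in the actual model)."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden_dim)

    def forward(self, x):          # x: (batch, time, in_dim)
        return self.proj(x)        # latent sequence: (batch, time, hidden_dim)

class SketchDecoder(nn.Module):
    """Stand-in for the mel-spectrogram decoder; the LayerNorm named
    `cond_layer_norm` is only a placeholder for AdaSpeech's conditional layer norm."""
    def __init__(self, hidden_dim, n_mels):
        super().__init__()
        self.cond_layer_norm = nn.LayerNorm(hidden_dim)
        self.proj = nn.Linear(hidden_dim, n_mels)

    def forward(self, h):          # h: (batch, time, hidden_dim)
        return self.proj(self.cond_layer_norm(h))

class AdaSpeech2Sketch(nn.Module):
    def __init__(self, phoneme_dim=256, n_mels=80, hidden_dim=256):
        super().__init__()
        self.phoneme_encoder = SketchEncoder(phoneme_dim, hidden_dim)  # part of the source TTS model
        self.mel_encoder = SketchEncoder(n_mels, hidden_dim)           # pluggable module added by AdaSpeech2
        self.mel_decoder = SketchDecoder(hidden_dim, n_mels)           # part of the source TTS model

    def forward_from_speech(self, mel):
        # Auto-encoding path used for untranscribed adaptation (Step 3 below).
        return self.mel_decoder(self.mel_encoder(mel))
```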

Training and Inference Process

Step 1. Source Model Training

First of all, it is important to train the source TTS model well. The phoneme encoder and mel-spectrogram decoder of the AdaSpeech model are trained on a sufficient amount of paired text-speech data, where the duration information used to expand the phoneme encoder's output to the length of the mel-spectrogram is obtained with the Montreal Forced Aligner (MFA). A rough sketch of one such training iteration is shown below.
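As a rough illustration only (using the simplified AdaSpeech2Sketch module above, batch size 1, and a plain L1 mel reconstruction loss, which are our assumptions rather than the paper's exact training recipe), one Step-1 iteration could look like this, with `durations` holding the number of mel frames per phoneme as produced by MFA:

```python
import torch
import torch.nn.functional as F

def source_training_step(model, optimizer, phoneme_feats, durations, target_mel):
    """One simplified source-model update on a single paired text-speech example.

    phoneme_feats: (1, n_phonemes, phoneme_dim) phoneme-level features
    durations:     (n_phonemes,) integer frame counts per phoneme, from MFA
    target_mel:    (1, n_frames, n_mels) ground-truth mel-spectrogram
    """
    hidden = model.phoneme_encoder(phoneme_feats)                 # (1, n_phonemes, hidden)
    # Length regulation: repeat each phoneme latent by its duration so the
    # sequence length matches the mel-spectrogram.
    expanded = torch.repeat_interleave(hidden, durations, dim=1)  # (1, n_frames, hidden)
    pred_mel = model.mel_decoder(expanded)

    loss = F.l1_loss(pred_mel, target_mel)                        # mel reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```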

Step 2. Mel Encoder Alignment

If you have a well-trained source model, attach a mel-spectrogram encoder for untranscribed speech adaptation. Its role is to produce the features that will enter the mel-spectrogram decoder while auto-encoding the speech, and its latent space must be aligned with that of the phoneme encoder because it has to produce the same output as the features obtained from the transcription (text). So, while running TTS training once more on paired text-speech data, the L2 loss between the sequence from the phoneme encoder and the sequence from the mel-spectrogram encoder is computed and minimized, which aligns the two latent spaces. This method can be called pluggable because it does not retrain the entire architecture: the parameters of the source model are frozen and only the parameters of the mel-spectrogram encoder are updated, as in the sketch below.
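Under the same simplified sketch, the Step-2 objective could be written as follows. The optimizer is assumed to hold only the mel encoder's parameters (for example `torch.optim.Adam(model.mel_encoder.parameters())`), so the frozen source model is never updated.

```python
import torch
import torch.nn.functional as F

def mel_encoder_alignment_step(model, optimizer, phoneme_feats, durations, mel):
    """Train only the mel encoder so its latents match the phoneme encoder's."""
    # Freeze the source model: phoneme encoder and mel decoder.
    for p in model.phoneme_encoder.parameters():
        p.requires_grad_(False)
    for p in model.mel_decoder.parameters():
        p.requires_grad_(False)

    # Target latents from the (frozen) phoneme encoder, length-regulated to
    # the mel-spectrogram length so the two sequences can be compared.
    with torch.no_grad():
        target = torch.repeat_interleave(
            model.phoneme_encoder(phoneme_feats), durations, dim=1)

    latent = model.mel_encoder(mel)            # (1, n_frames, hidden)
    loss = F.mse_loss(latent, target)          # L2 loss aligning the two latent spaces
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```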

Step 3. Untranscribed Speech Adaptation

Now fine-tune the model using only the (untranscribed) speech data of the specific speaker you want to synthesize. Since the input speech is reconstructed back into speech through the mel-spectrogram encoder and the mel-spectrogram decoder, this amounts to speech reconstruction by auto-encoding; within the source model, only the conditional layer normalization of the mel-spectrogram decoder is updated, which keeps the computation minimal. A sketch of this adaptation loop follows.
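Again under the simplified model above: only the decoder parameters belonging to the (placeholder) conditional layer normalization are passed to the optimizer, and the loss is plain mel reconstruction. The `cond_layer_norm` name filter matches the placeholder in the sketch, not AdaSpeech's actual attribute names.

```python
import torch
import torch.nn.functional as F

def adapt_to_speaker(model, speaker_mels, lr=1e-4, steps=200):
    """Fine-tune on untranscribed mel-spectrograms of the target speaker."""
    # Update only the conditional-layer-norm parameters of the decoder.
    cln_params = [p for name, p in model.mel_decoder.named_parameters()
                  if "cond_layer_norm" in name]
    for p in cln_params:
        p.requires_grad_(True)     # re-enable, since Step 2 froze the decoder
    optimizer = torch.optim.Adam(cln_params, lr=lr)

    for step in range(steps):
        mel = speaker_mels[step % len(speaker_mels)]       # (1, n_frames, n_mels)
        pred = model.forward_from_speech(mel)              # speech -> latent -> speech
        loss = F.l1_loss(pred, mel)                        # auto-encoding reconstruction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```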

Step 4. Inference

Once all of the above adaptation steps have been completed, the model can mimic the voice of the target speaker: when text is entered, it passes through the phoneme encoder, which was not fine-tuned, and the partially fine-tuned mel-spectrogram decoder, as sketched below.
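In simplified terms (and assuming phoneme-level durations come from a duration predictor, which the sketch above does not include), inference uses only the text path:

```python
import torch

@torch.no_grad()
def synthesize(model, phoneme_feats, predicted_durations):
    """Text path only: the mel encoder is not used at inference time."""
    hidden = model.phoneme_encoder(phoneme_feats)                      # not fine-tuned
    hidden = torch.repeat_interleave(hidden, predicted_durations, dim=1)
    mel = model.mel_decoder(hidden)     # decoder with adapted conditional layer norm
    return mel                          # pass to a vocoder to obtain a waveform
```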

Experiment Results
Adaptation Voice Quality

In Table 1, joint training, in which the phoneme encoder and the mel-spectrogram encoder are learned at the same time, is used as the baseline for this experiment; the results indicate that the strategy of training the phoneme encoder and the mel-spectrogram encoder sequentially is superior.

In addition, the performance of the AdaSpeech backbone and the PPG-based model was regarded as an upper bound on the performance of AdaSpeech2, so they were included in the comparison. The MOS and SMOS results show that AdaSpeech2 synthesizes voices of almost the same quality as these upper-bound models.

Analyses on Adaptation Strategy

 

An ablation study was conducted to evaluate whether the strategies described in the training process above actually contribute to the model's performance. The results show that voice quality deteriorates if the L2 loss between the outputs of the phoneme encoder and the mel-spectrogram encoder is removed, or if the mel-spectrogram encoder is also updated during the fine-tuning step.

 

Varying Adaptation Data

When the number of adaptation speech samples is below 20, synthesis quality improves significantly as the amount of data increases, but beyond that point there is no significant further improvement.

 

Conclusion and Opinion

Machine learning engineers who train TTS models know that data quality determines synthesis quality, so they put a lot of effort into collecting and preprocessing data. Ordinarily, to synthesize a new speaker's voice, the speaker's recordings and their transcriptions must be collected in pairs and the TTS model retrained from scratch; with the AdaSpeech2 method, only the speech data needs to be collected and the model fine-tuned. Another advantage is that it is easy to apply in practice because it can be combined with any TTS model.

For further research on AdaSpeech2, it would be an interesting topic to observe how performance changes when a different distance function, such as cosine similarity, is used as the alignment constraint instead of the L2 loss.
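As a minimal sketch of what that variant might look like (our own illustration, not something evaluated in the paper), the L2 term in Step 2 could be swapped for a cosine-distance loss between the two latent sequences:

```python
import torch.nn.functional as F

def cosine_alignment_loss(mel_latent, phoneme_latent):
    """1 - cosine similarity, averaged over time steps and the batch."""
    cos = F.cosine_similarity(mel_latent, phoneme_latent, dim=-1)   # (batch, time)
    return (1.0 - cos).mean()
```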

Next time, we will introduce the last paper of the AdaSpeech series.

Reference

(1) [AdaSpeech2 paper] AdaSpeech 2: Adaptive text to speech with untranscribed data

(2) [AdaSpeech2 demo] https://speechresearch.github.io/adaspeech2/

 
