Authors : 오태양, 정기혁, 강홍구
Year : 2020
Publisher / Conference : Summer Conference of the Institute of Electronics and Information Engineers (IEIE)
Pages : 980-982
In this paper, we improve the speech quality of a multi-speaker text-to-speech (TTS) system by adding two embedding networks that represent speaker and speaking-style characteristics. The speaker embedding is extracted from a d-vector based encoder, and the speaking-style embedding from a global style token (GST) encoder. Since the two encoders complement each other in representing speaker and speaking-style characteristics, the quality of the synthesized speech improves. Subjective listening tests show that our proposed model outperforms the d-vector based Tacotron2 system.
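A minimal sketch of how such conditioning is commonly implemented: the fixed-length speaker (d-vector) and style (GST) embeddings are broadcast across time and concatenated with the text-encoder outputs before attention and decoding. The dimensions and the concatenation strategy here are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper).
T, enc_dim = 50, 256          # text-encoder timesteps and width
spk_dim, style_dim = 128, 128  # d-vector and GST embedding sizes

def condition_encoder_outputs(enc_out, d_vector, style_embedding):
    """Tile the utterance-level speaker and style embeddings across
    all timesteps and concatenate them with the encoder outputs,
    a common way to condition a Tacotron2-style decoder."""
    steps = enc_out.shape[0]
    spk = np.tile(d_vector, (steps, 1))           # (T, spk_dim)
    sty = np.tile(style_embedding, (steps, 1))    # (T, style_dim)
    return np.concatenate([enc_out, spk, sty], axis=-1)

enc_out = rng.standard_normal((T, enc_dim))
d_vec = rng.standard_normal(spk_dim)
gst = rng.standard_normal(style_dim)

cond = condition_encoder_outputs(enc_out, d_vec, gst)
print(cond.shape)  # (50, 512)
```

The decoder then attends over the conditioned sequence, so every decoding step sees both who is speaking (d-vector) and how they are speaking (GST weights).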