A Fast and Lightweight Text-To-Speech Model with Spectrum and Waveform Alignment Algorithms

International Conference
2021~
Posted by: 한혜원
Posted: 2021-08-30 11:05
Authors : Kihyuk Jeong, Huu-Kim Nguyen, Hong-Goo Kang

Year : 2021

Publisher / Conference : EUSIPCO

Research area : Speech Signal Processing, Text-to-Speech

In this paper, we propose a fast and lightweight text-to-speech (TTS) model that generates high-quality speech even in CPU-only environments. By leveraging the front-end architecture of FastSpeech2, we adopt an effective generative adversarial network (GAN) framework for waveform synthesis, which enables training the proposed model in a fully end-to-end manner. Since the waveform generator consists of small convolutional networks, its inference speed is tremendously fast and the number of network parameters can be reduced by half compared to the FastSpeech2 model. However, the generated waveform segments are often not time-aligned with the reference ones because synthesis relies on predicted durations, which reduces the reliability of the discriminator module in the GAN framework. To solve this time misalignment problem, we propose a waveform alignment algorithm that synchronizes timing information between the reference and generated waveforms. In addition to the waveform alignment task, we include an auxiliary mel-spectrogram prediction task to further enhance perceptual quality. Since this task is only required for training, it does not increase the computational complexity during the inference stage. Objective and subjective experimental results show that the synthesized quality of the proposed model is comparable to that of conventional approaches.
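The abstract does not spell out the waveform alignment algorithm, so the following is only a minimal sketch of the underlying timing problem, not the authors' method. It assumes that the generated and reference utterances share the same phoneme sequence, that per-phoneme durations in frames are available for both (positive values), and that time can be warped linearly inside each phoneme. The function names `map_time` and `aligned_reference_start` and the `hop_length` parameter are hypothetical, introduced here for illustration.

```python
import numpy as np


def map_time(t_gen, pred_dur, ref_dur):
    """Map a frame position on the generated timeline to the reference timeline.

    Assumption (not from the paper): both utterances have the same phonemes,
    durations are positive, and time is warped linearly within each phoneme.
    """
    # Phoneme boundaries on each timeline, starting at frame 0.
    pred_edges = np.concatenate(([0.0], np.cumsum(pred_dur, dtype=float)))
    ref_edges = np.concatenate(([0.0], np.cumsum(ref_dur, dtype=float)))
    # Piecewise-linear mapping between corresponding phoneme boundaries.
    return float(np.interp(t_gen, pred_edges, ref_edges))


def aligned_reference_start(start_frame, pred_dur, ref_dur, hop_length):
    """Return the reference sample index matching a generated-segment start frame."""
    ref_frame = map_time(start_frame, pred_dur, ref_dur)
    return int(round(ref_frame * hop_length))


# Toy example: three phonemes whose predicted durations differ from the reference.
pred_dur = [2, 4, 2]  # predicted durations in frames (hypothetical values)
ref_dur = [4, 2, 4]   # reference durations in frames (hypothetical values)
ref_start = aligned_reference_start(4, pred_dur, ref_dur, hop_length=256)
```

Under these assumptions, a segment cropped from the generated waveform at frame 4 would be compared against the reference segment starting at `ref_start`, rather than at the same raw sample index, so that a discriminator sees the same phonetic content in both inputs.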
Total: 368
7 International Conference Huu-Kim Nguyen, Kihyuk Jeong, Seyun Um, Min-Jae Hwang, Eunwoo Song, Hong-Goo Kang "LiteTTS: A Decoder-free Light-weight Text-to-wave Synthesis Based on Generative Adversarial Networks" in INTERSPEECH, 2021
6 International Conference Zainab Alhakeem, Yoohwan Kwon, Hong-Goo Kang "Disentangled Representations for Arabic Dialect Identification based on Supervised Clustering with Triplet Loss" in EUSIPCO, 2021
5 International Conference Miseul Kim, Minh-Tri Ho, Hong-Goo Kang "Self-supervised Complex Network for Machine Sound Anomaly Detection" in EUSIPCO, 2021
4 International Conference Kihyuk Jeong, Huu-Kim Nguyen, Hong-Goo Kang "A Fast and Lightweight Text-To-Speech Model with Spectrum and Waveform Alignment Algorithms" in EUSIPCO, 2021
3 International Conference Jiyoung Lee*, Soo-Whan Chung*, Sunok Kim, Hong-Goo Kang**, Kwanghoon Sohn** "Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation" in CVPR, 2021
2 International Conference Huu-Kim Nguyen, Kihyuk Jeong, Hong-Goo Kang "Fast and Lightweight Speech Synthesis Model based on FastSpeech2" in ITC-CSCC, 2021
1 International Conference Yoohwan Kwon*, Hee-Soo Heo*, Bong-Jin Lee, Joon Son Chung "The ins and outs of speaker recognition: lessons from VoxSRC 2020" in ICASSP, 2021