Papers

Effective Spectral and Excitation Modeling Techniques for LSTM-RNN-Based Speech Synthesis Systems

International Journal
2016~2020
작성자
이진영
작성일
2017-11-01 22:07
조회
1623
Authors : Eunwoo Song, Frank K. Soong, Hong-Goo Kang

Year : 2017

Publisher / Conference : IEEE/ACM Transactions on Audio, Speech, and Language Processing

Volume : 25, issue 11

Page : 2152-2161

In this paper, we report research results on modeling the parameters of an improved time-frequency trajectory excitation (ITFTE) and spectral envelopes of an LPC vocoder with a long short-term memory (LSTM)-based recurrent neural network (RNN) for high-quality text-to-speech (TTS) systems. The ITFTE vocoder has been shown to significantly improve the perceptual quality of statistical parameter-based TTS systems in our prior works. However, a simple feed-forward deep neural network (DNN) with a finite window length is inadequate to capture the time evolution of the ITFTE parameters. We propose to use the LSTM to exploit the time-varying nature of both trajectories of the excitation and filter parameters, where the LSTM is implemented to use the linguistic text input and to predict both ITFTE and LPC parameters holistically. In the case of LPC parameters, we further enhance the generated spectrum by applying LP bandwidth expansion and line spectral frequency-sharpening filters. These filters are not only beneficial for reducing unstable synthesis filter conditions but also advantageous toward minimizing the muffling problem in the generated spectrum. Experimental results have shown that the proposed LSTM-RNN system with the ITFTE vocoder significantly outperforms both similarly configured band aperiodicity-based systems and our best prior DNN-trainecounterpart, both objectively and subjectively.
전체 355
18 International Conference Hyungchan Yoon, Changhwan Kim, Eunwoo Song, Hyun-Wook Yoon, Hong-Goo Kang "Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech" in INTERSPEECH, 2023
17 International Conference Huu-Kim Nguyen, Kihyuk Jeong, Seyun Um, Min-Jae Hwang, Eunwoo Song, Hong-Goo Kang "LiteTTS: A Decoder-free Light-weight Text-to-wave Synthesis Based on Generative Adversarial Networks" in INTERSPEECH, 2021
16 International Conference Suhyeon Oh, Hyungseob Lim, Kyungguen Byun, Min-Jae Hwang, Eunwoo Song, Hong-Goo Kang "ExcitGlow: Improving a WaveGlow-based Neural Vocoder with Linear Prediction Analysis" in APSIPA (*awarded Best Paper), 2020
15 International Conference Min-Jae Hwang, Frank Soong, Eunwoo Song, Xi Wang, Hyeonjoo Kang, Hong-Goo Kang "LP-WaveNet: Linear Prediction-based WaveNet Speech Synthesis" in APSIPA, 2020
14 International Conference Min-Jae Hwang, Eunwoo Song, Ryuichi Yamamoto, Frank Soong, Hong-Goo Kang "Improving LPCNet-based Text-to-Speech with Linear Prediction-structured Mixture Density Network" in ICASSP, 2020
13 International Conference Kyungguen Byun, Eunwoo Song, Jinseob Kim, Jae-Min Kim, Hong-Goo Kang "Excitation-by-SampleRNN Model for Text-to-Speech" in ITC-CSCC, 2019
12 International Conference Min-Jae Hwang, Eunwoo Song, Jin-Seob Kim, Hong-Goo Kang "A Unified Framework for the Generation of Glottal Signals in Deep Learning-based Parametric Speech Synthesis Systems" in INTERSPEECH, 2018
11 International Conference Min-Jae Hwang, Eunwoo Song, Kyungguen Byun, Hong-Goo Kang "Modeling-by-Generation-Structured Noise Compensation Algorithm for Glottal Vocoding Speech Synthesis System" in ICASSP, 2018
10 International Conference Eunwoo Song, Frank K. Soong, Hong-Goo Kang "Perceptual quality and modeling accuracy of excitation parameters in DLSTM-based speech synthesis systems" in ASRU, 2017
9 International Journal Eunwoo Song, Frank K. Soong, Hong-Goo Kang "Effective Spectral and Excitation Modeling Techniques for LSTM-RNN-Based Speech Synthesis Systems" in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.25, issue 11, pp.2152-2161, 2017