Papers

Effective Spectral and Excitation Modeling Techniques for LSTM-RNN-Based Speech Synthesis Systems

International Journal
2016~2020
Posted by : 이진영
Date posted : 2017-11-01 22:07
Authors : Eunwoo Song, Frank K. Soong, Hong-Goo Kang

Year : 2017

Publisher / Conference : IEEE/ACM Transactions on Audio, Speech, and Language Processing

Volume : 25, issue 11

Page : 2152-2161

In this paper, we report research results on modeling the parameters of an improved time-frequency trajectory excitation (ITFTE) and spectral envelopes of an LPC vocoder with a long short-term memory (LSTM)-based recurrent neural network (RNN) for high-quality text-to-speech (TTS) systems. The ITFTE vocoder has been shown to significantly improve the perceptual quality of statistical parameter-based TTS systems in our prior works. However, a simple feed-forward deep neural network (DNN) with a finite window length is inadequate to capture the time evolution of the ITFTE parameters. We propose to use the LSTM to exploit the time-varying nature of both trajectories of the excitation and filter parameters, where the LSTM is implemented to use the linguistic text input and to predict both ITFTE and LPC parameters holistically. In the case of LPC parameters, we further enhance the generated spectrum by applying LP bandwidth expansion and line spectral frequency-sharpening filters. These filters are not only beneficial for reducing unstable synthesis filter conditions but also advantageous toward minimizing the muffling problem in the generated spectrum. Experimental results have shown that the proposed LSTM-RNN system with the ITFTE vocoder significantly outperforms both similarly configured band aperiodicity-based systems and our best prior DNN-trained counterpart, both objectively and subjectively.
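As a rough illustration of the LP bandwidth expansion step mentioned in the abstract, the sketch below uses the textbook form of the technique: the k-th LPC coefficient is scaled by gamma**k, which moves every pole of the synthesis filter 1/A(z) toward the origin by the factor gamma. The abstract does not give the paper's actual settings, so the function names, the value gamma = 0.98, and the second-order test filter are all assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np

def bandwidth_expand(lpc, gamma=0.98):
    """Scale the k-th LPC coefficient by gamma**k.

    `lpc` holds [1, a_1, ..., a_p] for A(z) = 1 + sum_k a_k z^-k.
    gamma = 0.98 is an assumed value, not taken from the paper.
    """
    lpc = np.asarray(lpc, dtype=float)
    return lpc * gamma ** np.arange(len(lpc))

def max_pole_radius(lpc):
    """Largest pole magnitude of the synthesis filter 1/A(z)."""
    return np.max(np.abs(np.roots(lpc)))

# Hypothetical 2nd-order filter with a pole pair close to the unit circle.
r, theta = 0.99, 0.3 * np.pi
a = np.array([1.0, -2.0 * r * np.cos(theta), r ** 2])

a_bw = bandwidth_expand(a, gamma=0.98)

print(max_pole_radius(a))     # pole radius before expansion
print(max_pole_radius(a_bw))  # pole radius after expansion (scaled by gamma)
```

Because each pole radius is multiplied by gamma < 1, a filter whose poles sit near the unit circle gains stability margin, which is consistent with the abstract's remark about reducing unstable synthesis filter conditions. The line spectral frequency-sharpening filter is not sketched here, since the abstract gives no detail on its form.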
Total 355
54 International Journal Seung-Chul Shin, Jinkyu Lee, Soyeon Choe, Hyuk In Yang, Jihee Min, Ki-Yong Ahn, Justin Y. Jeon, Hong-Goo Kang "Dry Electrode-Based Body Fat Estimation System with Anthropometric Data for Use in a Wearable Device" in Sensors, vol.19, issue 9, 2019
53 International Journal Jinkyu Lee, Jan Skoglund, Turaj Shabestary, Hong-Goo Kang "Phase-Sensitive Joint Learning Algorithms for Deep Learning-Based Speech Enhancement" in IEEE Signal Processing Letters, vol.25, issue 8, pp.1276-1280, 2018
52 International Journal JeeSok Lee, Soo-Whan Chung, Min-Seok Choi, Hong-Goo Kang "Generic uniform search grid generation algorithm for far-field source localization" in The Journal of the Acoustical Society of America, vol.143, 2018
51 International Journal Min-Jae Hwang, JeeSok Lee, MiSuk Lee, Hong-Goo Kang "SVD-Based Adaptive QIM Watermarking on Stereo Audio Signals" in IEEE Transactions on Multimedia, vol.20, issue 1, pp.45-54, 2018
50 International Journal Eunwoo Song, Frank K. Soong, Hong-Goo Kang "Effective Spectral and Excitation Modeling Techniques for LSTM-RNN-Based Speech Synthesis Systems" in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.25, issue 11, pp.2152-2161, 2017
49 International Journal Ho Seon Shin, Tim Fingscheidt, Hong-Goo Kang "A Priori SNR Estimation Using Air- and Bone-Conduction Microphones" in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.23, issue 11, pp.2015-2025, 2015
48 International Journal Taegyu Lee, Hyun Oh Oh, Jeongil Seo, Young-Cheol Park, Dae Hee Youn "Scalable Multiband Binaural Renderer for MPEG-H 3D Audio" in IEEE Journal of Selected Topics in Signal Processing, vol.9, issue 5, pp.907-920, 2015
47 International Journal Taegyu Lee, Yonghyun Baek, Young-Cheol Park, Dae Hee Youn "Stereo upmix-based binaural auralization for mobile devices" in IEEE Transactions on Consumer Electronics, vol.60, issue 3, pp.411-419, 2014
46 International Journal Soonho Baek, Hong-Goo Kang "Selection of spectral compressive operator for vector Taylor series-based model adaptation in noisy environments" in The Journal of the Acoustical Society of America, vol.135, 2014
45 International Journal Jae-Mo Yang, Hong-Goo Kang "Online Speech Dereverberation Algorithm Based on Adaptive Multichannel Linear Prediction" in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.22, issue 3, pp.608-619, 2014