Effective Spectral and Excitation Modeling Techniques for LSTM-RNN-Based Speech Synthesis Systems

International Journal
2017-11-01 22:07
Authors : Eunwoo Song, Frank K. Soong, Hong-Goo Kang

Year : 2017

Publisher / Conference : IEEE/ACM Transactions on Audio, Speech, and Language Processing

Volume : 25, issue 11

Page : 2152-2161

In this paper, we report research results on modeling the parameters of an improved time-frequency trajectory excitation (ITFTE) and spectral envelopes of an LPC vocoder with a long short-term memory (LSTM)-based recurrent neural network (RNN) for high-quality text-to-speech (TTS) systems. The ITFTE vocoder has been shown to significantly improve the perceptual quality of statistical parameter-based TTS systems in our prior works. However, a simple feed-forward deep neural network (DNN) with a finite window length is inadequate to capture the time evolution of the ITFTE parameters. We propose to use the LSTM to exploit the time-varying nature of both trajectories of the excitation and filter parameters, where the LSTM is implemented to use the linguistic text input and to predict both ITFTE and LPC parameters holistically. In the case of LPC parameters, we further enhance the generated spectrum by applying LP bandwidth expansion and line spectral frequency-sharpening filters. These filters are not only beneficial for reducing unstable synthesis filter conditions but also advantageous toward minimizing the muffling problem in the generated spectrum. Experimental results have shown that the proposed LSTM-RNN system with the ITFTE vocoder significantly outperforms both similarly configured band aperiodicity-based systems and our best prior DNN-trainecounterpart, both objectively and subjectively.
전체 344
64 International Journal Zhenyu Piao, Hyungseob Lim, Miseul Kim, Hong-goo Kang "PDF-NET: Pitch-adaptive Dynamic Filter Network for Intra-gender Speaker Verification" in APSIPA ASC, 2023
63 International Journal Hyungchan Yoon, Changhwan Kim, Seyun Um, Hyun-Wook Yoon, Hong-Goo Kang "SC-CNN: Effective Speaker Conditioning Method for Zero-Shot Multi-Speaker Text-to-Speech Systems" in IEEE Signal Processing Letters, vol.30, pp.593-597, 2023
62 International Journal Jinyoung Lee, Hong-Goo Kang "Real-Time Neural Speech Enhancement Based on Temporal Refinement Network and Channel-Wise Gating Methods" in Digital Signal Processing, vol.133, 2023
61 International Journal Taemin Kim, Yejee Shin, Kyowon Kang, Kiho Kim, Gwanho Kim, Yunsu Byeon, Hwayeon Kim, Yuyan Gao, Jeong Ryong Lee, Geonhui Son, Taeseong Kim, Yohan Jun, Jihyun Kim, Jinyoung Lee, Seyun Um, Yoohwan Kwon, Byung Gwan Son, Myeongki Cho, Mingyu Sang, Jongwoon Shin, Kyubeen Kim, Jungmin Suh, Heekyeong Choi, Seokjun Hong, Huanyu Cheng, Hong-Goo Kang, Dosik Hwang & Ki Jun Yu "Ultrathin crystalline-silicon-based strain gauges with deep learning algorithms for silent speech interfaces" in Nature Communications, vol.13, 2022
60 International Journal Jinyoung Lee, Hong-Goo Kang "Two-Stage Refinement of Magnitude and Complex Spectra for Real-Time Speech Enhancement" in IEEE Signal Processing Letters, vol.29, pp.2188-2192, 2022
59 International Journal Kyungguen Byun, Seyun Um, Hong-Goo Kang "Length-Normalized Representation Learning for Speech Signals" in IEEE Access, vol.10, pp.60362-60372, 2022
58 International Journal Young-Sun Joo, Hanbin Bae, Young-Ik Kim, Hoon-Young Cho, Hong-Goo Kang "Effective Emotion Transplantation in an End-to-End Text-to-Speech System" in IEEE Access, vol.8, pp.161713-161719, 2020
57 International Journal Soo-Whan Chung, Joon Son Chung, Hong Goo Kang "Perfect Match: Self-Supervised Embeddings for Cross-Modal Retrieval" in IEEE Journal of Selected Topics in Signal Processing, vol.14, issue 3, 2020
56 International Journal Ohsung Kwon, Inseon Jang, ChungHyun Ahn, Hong-Goo Kang "An Effective Style Token Weight Control Technique for End-to-End Emotional Speech Synthesis" in IEEE Signal Processing Letters, vol.26, issue 9, pp.1383-1387, 2019
55 International Journal Jinkyu Lee, Hong-Goo Kang "A Joint Learning Algorithm for Complex-Valued T-F Masks in Deep Learning-Based Single-Channel Speech Enhancement Systems" in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.27, issue 6, pp.1098-1108, 2019