SC-CNN: Effective Speaker Conditioning Method for Zero-Shot Multi-Speaker Text-to-Speech Systems
This letter proposes an effective speaker-conditioning method that is applicable to zero-shot multi-speaker text-to-speech (ZSM-TTS) systems. Based on the inductive bias in the speech generation task, inwhich local context information in text/phoneme sequences heavily affect the speaker characteristics of the output speech, we propose a Speaker-Conditional Convolutional Neural Network (SC-CNN) for the ZSM-TTS task. SC-CNN first predicts convolutional kernels from each learned speaker embedding, then applies 1-D convolutions to phoneme sequences with the predicted kernels. It utilizes the aforementioned inductive bias and effectively models the characteristic of speech by providing the speaker-specific local context in phonetic domain. We also build both FastSpeech2 and VITS-based ZSM-TTS systems to verify its superiority over conventional speaker conditioning methods. The results confirm that the models with SC-CNN outperform the recent ZSM-TTS models in terms of both subjective and objective measurements.
|6||International Journal||Zainab Alhakeem, Se-In Jang, Hong-Goo Kang "Disentangled Representations in Local-Global Contexts for Arabic Dialect Identification" in Transactions on Audio, Speech, and Language Processing, 2024|
|5||International Journal||Hyungchan Yoon, Changhwan Kim, Seyun Um, Hyun-Wook Yoon, Hong-Goo Kang "SC-CNN: Effective Speaker Conditioning Method for Zero-Shot Multi-Speaker Text-to-Speech Systems" in IEEE Signal Processing Letters, vol.30, pp.593-597, 2023|
|4||International Journal||Jinyoung Lee, Hong-Goo Kang "Real-Time Neural Speech Enhancement Based on Temporal Refinement Network and Channel-Wise Gating Methods" in Digital Signal Processing, vol.133, 2023|
|3||International Journal||Taemin Kim, Yejee Shin, Kyowon Kang, Kiho Kim, Gwanho Kim, Yunsu Byeon, Hwayeon Kim, Yuyan Gao, Jeong Ryong Lee, Geonhui Son, Taeseong Kim, Yohan Jun, Jihyun Kim, Jinyoung Lee, Seyun Um, Yoohwan Kwon, Byung Gwan Son, Myeongki Cho, Mingyu Sang, Jongwoon Shin, Kyubeen Kim, Jungmin Suh, Heekyeong Choi, Seokjun Hong, Huanyu Cheng, Hong-Goo Kang, Dosik Hwang & Ki Jun Yu "Ultrathin crystalline-silicon-based strain gauges with deep learning algorithms for silent speech interfaces" in Nature Communications, vol.13, 2022|
|2||International Journal||Jinyoung Lee, Hong-Goo Kang "Two-Stage Refinement of Magnitude and Complex Spectra for Real-Time Speech Enhancement" in IEEE Signal Processing Letters, vol.29, pp.2188-2192, 2022|
|1||International Journal||Kyungguen Byun, Seyun Um, Hong-Goo Kang "Length-Normalized Representation Learning for Speech Signals" in IEEE Access, vol.10, pp.60362-60372, 2022|