Speech Technology Summer Research Seminar
Date: 4 June 2018 (Monday)
Time: 10:00 - 15:00
Venue: Room TB180 (Joensuu Science Park) and F211 (Kuopio)
Morning Session (Invited Talks by NII Researchers)
10:00 - 11:00 Talk 1: Overview of Research at NII & Voice Conversion Challenge 2018 (Junichi Yamagishi)
11:00 - 11:30 Talk 2: Autoregressive Neural Models for Statistical Parametric Speech Synthesis (Xin Wang)
11:30 - 12:00 Talk 3: Speaker Adaptation for DNN Speech Synthesis (Luong Hieu Thi)
Afternoon Session (by UEF researchers), 13:00 - 15:00
* A Spoofing Benchmark for the 2018 Voice Conversion Challenge: Leveraging from Spoofing Countermeasures for Speech Artifact Assessment (Tomi Kinnunen)
* Perceptual Evaluation of the Effectiveness of Voice Disguise by Age Modification (Rosa Gonzalez Hautamäki)
* Supervector Compression Strategies to Speed up I-Vector System Development (Ville Vestman)
* t-DCF: a Detection Cost Function for the Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification (Tomi Kinnunen)
* A Regression Model of Recurrent Deep Neural Networks for Noise Robust Estimation of the Fundamental Frequency Contour of Speech (Akihiro Kato)
* Staircase Network: Structural Language Identification via Hierarchical Attentive Units (Trung Ngo Trong)
ABSTRACTS AND BIOGRAPHIES OF THE INVITED TALKS
Talk 1: Overview of Research at NII & Voice Conversion Challenge 2018
Voice conversion (VC) is a technique for transforming the speaker identity of a source speech waveform into a different one while preserving the linguistic information of the source. VC has great potential for various new applications, such as speaking aids for individuals with vocal impairments (e.g., dysarthric patients), voice changers that generate various types of expressive speech, novel vocal effects for singing voices, silent speech interfaces, and accent conversion for computer-assisted language learning.
In this talk, I introduce the Voice Conversion Challenge 2018, designed as a follow-up to the 2016 edition, with the aim of providing a common framework for evaluating and comparing different state-of-the-art VC systems. As an update to the previous challenge, we considered both parallel and non-parallel data, which form the Hub and Spoke tasks, respectively. A total of 23 teams from around the world submitted their systems, and 11 of them additionally participated in the optional Spoke task.
A large-scale crowdsourced perceptual evaluation with 267 subjects was then carried out to rate the submitted converted speech in terms of naturalness and similarity to the target speaker identity. The results show the remarkable progress the field has made with the rise of new speech generation paradigms such as WaveNet. One of the submitted VC systems obtained an average score of 4.1 on the five-point quality scale, and about 80% of its converted speech samples were judged by listeners to be the same as the target speakers.
In this talk, I will also introduce a few large projects that we are conducting at the National Institute of Informatics, Japan.
Junichi Yamagishi received the Ph.D. degree from the Tokyo Institute of Technology in 2006 for a thesis that pioneered speaker-adaptive speech synthesis. He is currently an Associate Professor with the National Institute of Informatics, Tokyo, Japan, and also a Senior Research Fellow with the Centre for Speech Technology Research, University of Edinburgh, Edinburgh, U.K. Since 2006, he has authored and co-authored more than 180 refereed papers in international journals and conferences. He received the Tejima Prize for the best Ph.D. thesis of the Tokyo Institute of Technology in 2007. He was awarded the Itakura Prize from the Acoustical Society of Japan in 2010, the Kiyasu Special Industrial Achievement Award from the Information Processing Society of Japan in 2013, the Young Scientists' Prize from the Minister of Education, Science and Technology in 2014, and the JSPS Prize in 2016. He was one of the organizers of the special sessions on "Spoofing and Countermeasures for Automatic Speaker Verification" at Interspeech 2013, the "ASVspoof evaluation" at Interspeech 2015, the "Voice Conversion Challenge 2016" at Interspeech 2016, and the "2nd ASVspoof evaluation" at Interspeech 2017. He has been a member of the IEEE Speech and Language Technical Committee. He was an Associate Editor of the IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING and a Lead Guest Editor for the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING special issue on Spoofing and Countermeasures for Automatic Speaker Verification.
Talk 2: Autoregressive Neural Models for Statistical Parametric Speech Synthesis
The main task of statistical parametric speech synthesis (SPSS) is to convert an input sequence of textual features into a target sequence of acoustic features. This task can be approached with various types of neural networks, such as recurrent neural networks (RNNs). Although RNNs perform well in SPSS, their standard architecture and the common way of using them may be sub-optimal. This talk will explain how a plain RNN falls short in modeling the temporal dependency of the target sequence. It will then use the idea of autoregressive (AR) dependency to propose, first, a shallow AR neural model that alleviates the over-smoothing problem, and then a deep AR model that enables random sampling for fundamental frequency (F0) generation in SPSS. Finally, the talk will generalize the shallow AR model to a recent method called the AR normalizing flow (NF) and show how NF can be used for SPSS.
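To make the AR idea concrete, here is a minimal numeric sketch of a shallow autoregressive output: each generated acoustic value depends on the network's frame-level output plus a linear filter over previously generated values. The function name, dimensions, and coefficients are illustrative assumptions, not the talk's actual model.

```python
import numpy as np

def shallow_ar_generate(h, ar_coefs):
    """Generate outputs with a shallow autoregressive dependency:
    o[t] = h[t] + sum_k ar_coefs[k] * o[t-1-k].

    h        : (T,) per-frame outputs of a network (e.g. an RNN)
    ar_coefs : (K,) linear AR filter over previously generated outputs
    """
    T, K = len(h), len(ar_coefs)
    o = np.zeros(T)
    for t in range(T):
        # collect the K most recent generated outputs (fewer at the start)
        past = [o[t - 1 - k] for k in range(K) if t - 1 - k >= 0]
        o[t] = h[t] + sum(c * p for c, p in zip(ar_coefs, past))
    return o

# e.g. constant network output 1.0 with a single AR coefficient 0.5
out = shallow_ar_generate(np.array([1.0, 1.0, 1.0]), np.array([0.5]))
```

Because each frame is conditioned on actually generated previous frames rather than treated as independent given the hidden state, the output trajectory need not collapse toward a mean, which is the intuition behind alleviating over-smoothing.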
Xin Wang received the M.S. degree from the University of Science and Technology of China, Hefei, China, in 2015 for his work on HMM-based speech synthesis. Since October 2015, he has been working toward the Ph.D. degree at the National Institute of Informatics, Tokyo, Japan. His research interests include statistical speech synthesis and machine learning.
Talk 3: Speaker Adaptation for DNN Speech Synthesis
While text-to-speech (TTS) systems have started to achieve outstanding performance, and the best synthetic speech sounds almost indistinguishable from human speech, building such a TTS system still requires a large amount of carefully recorded speech data from a single speaker, together with its transcriptions, which obviously limits the utility and deployment of such systems in new applications. In this talk, I introduce approaches for constructing a multi-speaker speech synthesis system based on a DNN, where a common DNN models multiple speakers' voices at the same time and less data is needed per speaker. I also introduce frameworks to quickly adapt pre-trained DNNs to unseen speakers using a small amount of speech data, with or without the associated transcriptions.
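One common way to share a single network across speakers is to feed a speaker code alongside the linguistic features, so the shared weights model all voices and the code selects among them. The toy sketch below illustrates this with a single linear layer and a one-hot speaker code; all dimensions and names are illustrative assumptions, not the talk's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shared "network": one linear layer mapping the concatenation of
# linguistic features and a speaker code to acoustic features.
n_ling, n_spk, n_acoustic = 8, 3, 5
W = rng.standard_normal((n_acoustic, n_ling + n_spk))

def synthesize(ling_feats, speaker_id):
    """The shared weights W model every speaker; the one-hot speaker
    code selects which voice the same text is rendered in."""
    code = np.zeros(n_spk)
    code[speaker_id] = 1.0
    x = np.concatenate([ling_feats, code])
    return W @ x

x = rng.standard_normal(n_ling)   # same linguistic input for both calls
out_a = synthesize(x, 0)
out_b = synthesize(x, 1)          # different speaker code, different voice
```

Adapting to an unseen speaker then amounts to estimating a new speaker code (or fine-tuning part of the shared network) from a small amount of that speaker's data, rather than training a full per-speaker model.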
Luong Hieu Thi received the M.S. degree from the University of Science, Ho Chi Minh City, Vietnam, in 2016 for his work on multi-speaker speech synthesis, having earlier worked on automatic speech recognition systems. In October 2017, he became a Ph.D. student at the National Institute of Informatics, Tokyo, to further his research on low-resource speaker adaptation for speech synthesis models.