ASVspoof - world-leading research institutes and IT companies are joining forces to combat voice spoofs
Fake data is a key concern in today’s society. It is continually evolving - and nobody can predict how. Beyond fake news, multimedia data such as video, image and voice data has become increasingly easy to generate or manipulate, opening up potential for its misuse, especially in the fields of information security and user privacy. In 2018, so-called DeepFakes - realistic-looking, yet fake videos portraying celebrities - drew particular attention. They showed how deep learning technologies can be used to generate illicit videos or audio recordings of specific target persons.
Concerning voice, Google’s introduction in 2016 of ‘WaveNet’ technology showed the ease with which speech synthesis solutions can generate natural-sounding, but fake speech that most of us would find almost indistinguishable from genuine speech produced by a human. This and other, even more advanced machine learning technology is gradually making its way into our homes and daily lives. Today, it finds application in smart home assistants, audio books, healthcare, public announcement systems and a plethora of other applications. For the most part, the technology is exploited with only positive intentions. Technology is not neutral, though - it can build, or it can destroy - depending on who’s using it. The ability of text-to-speech (TTS) and voice conversion (VC) technology to put words in someone else’s mouth or to clone someone’s voice raises obvious concerns. It is important to understand that, in future, we may no longer be able to judge by ourselves whether what we are watching or listening to is genuine or fake. Society is in urgent need of new tools, perhaps similar to today’s anti-virus systems, that will alert us to fake media, i.e. artificially generated or manipulated video or voice data.
The concerns surrounding fake data and fake media are perhaps most alarming in the field of biometric technologies. These are automatic systems designed to recognise people from their behavioural or biological characteristics. Researchers and vendors of biometric technology have been aware for about two decades of the threat fake media poses to the reliability of biometric technology. The problem of spoofing attacks, also known as presentation attacks, entails one person (a fraudster) masquerading as another person (perhaps you!), so that they can gain illegitimate access to data records (logical access) or restricted/sensitive areas and facilities (physical access). The same spoofing attacks can be used to fool humans too.
In the context of biometric systems, the threat of spoofing is worrying, for numerous applications rely fundamentally upon their reliability and security. Examples include automated border control, home assistants, smart devices, telephone banking, even forensic applications. Biometrics researchers have accordingly sought to devise countermeasure technology which aims to protect biometric systems by detecting and deflecting spoofing attacks. Dedicated anti-spoofing initiatives have been created for all of the mainstream biometric characteristics - face, iris and fingerprint biometrics. Until recently, there was no equivalent initiative in the field of voice biometrics. This is surprising since our voice is the most natural means of communication, be that with other people or with smart technology. The ease with which recordings of our voice can be collected, e.g. using a mobile phone or any other device equipped with a microphone, has resulted in a sharp rise in the use of voice biometrics for person recognition.
In order to assess the vulnerability of voice biometric systems and to develop and calibrate new countermeasures that are capable of protecting against spoofing attacks, researchers need a common ground upon which they can compare different solutions. This takes the form of common benchmark data and evaluation metrics. Without data, there would be no countermeasures, and without common data and metrics, different research teams would be comparing apples and pears. Common data supports the comparison of apples to apples, and is critical to most fields of pattern recognition and machine learning research. It is only with common data and metrics that researchers and technology vendors are able to identify the best performing countermeasure solutions. Until recently, though, these common data and metrics were missing. Enter ASVspoof!
The Automatic Speaker Verification Spoofing and Countermeasures (ASVspoof) initiative was formed in 2013 following the first special session on spoofing and countermeasures held at the community’s flagship INTERSPEECH conference in Lyon, France. The goal was to promote the consideration of spoofing and countermeasures and to attract the broader participation of research colleagues in the voice conversion and speech synthesis communities in order to launch common evaluations. The first such evaluation, ASVspoof 2015, was held at the INTERSPEECH conference in Dresden, Germany. It attracted the participation of 28 research teams from all over the world. The follow-up evaluation, ASVspoof 2017, held once again at the INTERSPEECH conference, this time in Stockholm, Sweden, was even more successful, with over 100 teams registering their participation.
ASVspoof 2019 (www.asvspoof.org), the most recent edition, currently underway, represents the largest and most comprehensive spoofing and countermeasure evaluation to date. Planning for ASVspoof 2019 started almost one year ago and has thus far involved contributions from about 40 researchers and 17 organisations. While ASVspoof remains mostly an academically led initiative, co-organised by EURECOM and Inria in France, the National Institute of Informatics (NII) and NEC in Japan, the University of Eastern Finland and the University of Edinburgh in the UK, the 2019 edition involves substantial data contributions from an impressive array of external partners from both academia and industry: Aalto University (Finland), Academia Sinica (Taiwan), the Adapt Centre (Ireland), DFKI (Germany), HOYA (Japan), iFlytek (China), Google LLC (UK), Nagoya University (Japan), Saarland University (Germany), Trinity College Dublin (Ireland), NTT Communication Science Laboratories (Japan), the Laboratoire Informatique d’Avignon (France) and the University of Science and Technology of China.
The ASVspoof initiative is today one of the most successful of all anti-spoofing initiatives within the entire biometrics community. Although registration is still open at the time of writing, almost 150 registrations have been received thus far from every corner of the globe, from both academic and industrial participants.
What the 2019 evaluation will tell us, we do not know - the results of ASVspoof 2019 will be made public at this year’s INTERSPEECH conference, being held in Graz, Austria in September. The hope, at least, is that progress in anti-spoofing has kept pace with progress in the breadth of technologies that facilitate spoofing: that we can reliably detect the most advanced of today’s TTS and VC technologies and thereby secure the future of automatic speaker verification technology.
Organisers (equal contribution)
Junichi Yamagishi, National Institute of Informatics, JAPAN / University of Edinburgh, UK
Massimiliano Todisco, EURECOM, FRANCE
Md Sahidullah, Inria, FRANCE
Héctor Delgado, EURECOM, FRANCE
Xin Wang, National Institute of Informatics, JAPAN
Nicholas Evans, EURECOM, FRANCE
Tomi Kinnunen, University of Eastern Finland, FINLAND
Kong Aik Lee, NEC Corporation, JAPAN
Ville Vestman, University of Eastern Finland, FINLAND