Multilingual and Code Switching: The Challenges in Future Speech Recognition Technology
In recent years, Automatic Speech Recognition (ASR) has made significant progress in commercial use, and several enterprise-level ASR models based on neural networks have been successfully launched, such as Alexa, Rev, AssemblyAI, and ASAPP.
Back in 2016, Microsoft Research published an article announcing that they had achieved human-level performance (measured using Word Error Rate, WER) on a 25-year-old dataset called Switchboard. The accuracy of ASR is still improving, reaching human level on more datasets and in more use cases.
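For readers unfamiliar with the metric, WER is conventionally computed as the number of word substitutions, deletions and insertions needed to turn the ASR output into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that calculation. The function name `word_error_rate` is chosen here for illustration, and the whitespace tokenization is an assumption; real evaluations usually normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()  # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.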
However, today’s commercial ASR models are mainly trained on English datasets and thus have higher accuracy on English input. Because of data availability and market demand, academia and industry have long focused on English. Although recognition accuracy for commercially popular languages such as French, Spanish, Portuguese, and German is reasonable, there is a long tail of languages with limited training data and relatively low ASR output quality.
Furthermore, most commercial systems support only a single language and cannot handle the multilingual scenarios common in many societies. Multilingualism can take the form of back-to-back languages, as in media programming in bilingual countries. Amazon has recently made strides on this issue with a product that integrates language identification (LID) and ASR. Code-switching (also called cross-language speech), by contrast, is a way of speaking in which an individual combines the words and grammar of two languages in the same sentence.
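To make the idea of code-switching concrete, here is a small sketch that splits a mixed transcript into runs of the same writing system and counts the switch points. It is only an illustration on text, not how any product mentioned above works: the function names are hypothetical, and the script ranges (Hangul syllables, the basic CJK ideograph block, ASCII letters) are a simplifying assumption that only suits language pairs written in different scripts, such as Chinese-English or Korean-English.

```python
from typing import List, Optional, Tuple


def script_of(ch: str) -> Optional[str]:
    """Return a coarse script label, or None for spaces, digits and punctuation."""
    if "\uac00" <= ch <= "\ud7a3":  # Hangul Syllables block
        return "hangul"
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs (basic block only)
        return "han"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return None


def script_runs(text: str) -> List[Tuple[str, str]]:
    """Group consecutive characters of the same script into (script, text) runs."""
    runs: List[List[str]] = []
    for ch in text:
        script = script_of(ch)
        if script is None:
            # Neutral characters stay with the current run, if there is one.
            if runs:
                runs[-1][1] += ch
            continue
        if runs and runs[-1][0] == script:
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chunk.strip()) for script, chunk in runs]


def count_switches(text: str) -> int:
    """Number of points where the script changes within the text."""
    return max(len(script_runs(text)) - 1, 0)


if __name__ == "__main__":
    sentence = "我想订一张 ticket 去上海"
    print(script_runs(sentence))
    print(count_switches(sentence))
```

Script-based splitting cannot separate two languages that share an alphabet (for example Spanish and English), which is one reason spoken code-switching needs dedicated training data rather than simple rules.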
As one of the world’s leading AI data service providers, Datatang has developed 200,000 hours of speech data covering over 60 languages and dialects. This data helps developers build applications that can understand anyone in any language, truly unleashing the power of speech recognition to the world.
The dataset was recorded by 1,245 native Japanese speakers with authentic accents. The recorded texts cover general, interactive, in-car, home and other categories, and are rich in content.
The data volume is 1,044 hours, recorded by 2,038 native Brazilian speakers. The recording text was designed by linguistic experts and covers the general, interactive, in-car and home categories. The texts are manually proofread with high accuracy. The recording devices are mainstream Android phones and iPhones.
The data was recorded by 3,109 native Italian speakers with authentic Italian accents. The recorded content covers a wide range of categories such as general-purpose, interactive, in-car commands and home commands. The recorded text was designed by a language expert and manually proofread with high accuracy. The data matches mainstream Android and Apple phones.
The data was recorded by native Korean speakers. The recorded text is a mixture of Korean and English sentences, covering general scenes and human-computer interaction scenes. It is rich in content and accurately transcribed. It can be used to improve a speech recognition system's performance on Korean-English mixed read speech.
The data was recorded by 1,113 native Chinese speakers with accents covering seven major dialect areas. The recorded text is a mixture of Chinese and English sentences, covering general scenes and human-computer interaction scenes. It is rich in content and accurately transcribed. It can be used to improve a speech recognition system's performance on Chinese-English mixed read speech.
If you would like more details about the datasets or how to acquire them, please feel free to contact us: firstname.lastname@example.org.