Using Accented English Data to Improve Your AI Models

According to reports, Vocalize.ai’s laboratory has tested the speech recognition capability of three voice assistants: Amazon’s Alexa, Apple’s Siri, and Google Assistant. Researchers evaluated how well each assistant understands accented English from three countries: the United States, India, and China.

The results show that Google Assistant clearly surpassed the other two voice assistants in understanding Chinese-accented English. The main reason given for this result is that Google Assistant has been trained on a considerable amount of Chinese-accented English data, while the other two voice assistants have not.

English is an international language, and its accents differ greatly across countries and regions. In some regions, the accent is so strong that it can sound like another language to an unfamiliar listener. If a model is not trained on large amounts of accented English data from different regions, it is very likely that it will fail to recognize these accents.

Currently, automatic speech recognition (ASR) systems achieve high accuracy on standard English accents and can meet the commercial requirements of certain scenarios, but the shortage of accented English speech data has severely restricted research on accented English recognition.
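One way to make the accuracy gap between accents concrete is to compute the word error rate (WER) separately for each accent group. The following is a minimal sketch, not a description of how Vocalize.ai or any vendor runs its tests. The function names, the accent labels, and the sample sentences are all hypothetical and made up for illustration.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Standard Levenshtein distance over word tokens, computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_accent(samples):
    """samples: iterable of (accent, reference, hypothesis) tuples."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for accent, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        errors[accent] += e
        words[accent] += n
    # Pool errors and words per accent before dividing.
    return {a: errors[a] / words[a] for a in words if words[a] > 0}


# Made-up examples for illustration only; not results from any real system.
samples = [
    ("US", "turn on the living room lights", "turn on the living room lights"),
    ("IN", "turn on the living room lights", "turn on the leaving room light"),
    ("CN", "set an alarm for seven", "set alarm for seven"),
]

for accent, wer in sorted(wer_by_accent(samples).items()):
    print(f"{accent}: WER = {wer:.2f}")
```

Pooling errors and reference words per accent, rather than averaging per-sentence rates, keeps short sentences from dominating the score. A large gap between groups on the same recording text suggests the model has seen too little data for that accent.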

As a leading AI data service provider, Datatang offers accented English data from dozens of countries and regions, including the United Kingdom, the United States, China, Germany, and India, covering a variety of pronunciation styles and accents. The data has been phonetically transcribed and annotated for accent and prosody, so it can support research on accented English recognition.

Chinese English Datasets

More than 3,000 native Chinese speakers recorded 100,000 common English sentences. The speakers come from regions including Jiangsu, Shandong, Beijing, and Henan, and the recordings reflect the accents typical of Chinese speakers of English. The recording text consists of commonly used English sentences that are rich in content, wide in scope, and phonemically balanced.

American English Datasets

Nearly 2,000 native American English speakers participated in the recording, with authentic accents. The recording text was designed by language experts around interactive scenarios and covers multiple categories, such as human-machine interaction, in-car, home, and general.

British English Datasets

The data was recorded by 1,651 native British English speakers with authentic accents. The recorded text covers multiple categories, such as human-machine interaction, in-car, home, and general.

German English Datasets

More than 1,000 native German speakers recorded English speech with authentic German accents. The recording text was designed by language experts and covers multiple categories, such as human-machine interaction, in-car, home, and general.

French English Datasets

More than 1,000 native French speakers recorded English speech with authentic French accents. The recording text covers multiple categories, such as human-machine interaction, in-car, home, and general. The recordings were made in a quiet room, and the speakers range in age from 18 to 60.

Indian English Datasets

The data was recorded by more than 2,000 native Indian English speakers with authentic accents. The recorded text covers multiple categories, such as human-machine interaction, in-car, home, and general. The sentence accuracy is over 95%.

Datatang is committed to protecting participants’ interests, safeguarding data security, and respecting participants’ privacy. Datatang has obtained ISO/IEC 27701 (privacy information management), ISO/IEC 27001 (information security management), and ISO 9001 (quality management) certifications.

If the datasets above do not meet your current research needs, Datatang can also provide data customization services for specific populations, scenarios, and languages.


If you need data services, please feel free to contact us: info@datatang.com
