Upgrade Your Speech Recognition Models with Large Scale Data

4 min read · Apr 16


Market data shows that the global smart voice market grew from US$11.03 billion in 2017 to US$26.39 billion in 2021, and reached US$35.12 billion in 2022, maintaining a high growth rate of 33.1%. It is expected to reach US$39.92 billion in 2023.

In the past ten years, speech recognition technology has made great progress. Continuous-speech and speaker-independent real-time recognition systems have been successfully developed in the laboratory, and a large number of speech recognition technologies have reached the deployment stage. However, real-world application scenarios still pose various challenges, which fall mainly into three areas: robustness, low resources, and complex scenarios.

Typical robustness problems include accents and dialects, mixed or multilingual speech, and domain adaptation. Low resources refer to scenarios where deployment resources are limited and annotated data is scarce: the former is typical of end-side devices in AIoT scenarios, which constrain model size and computing power, while the lack of training data is a key factor limiting the development of speech recognition in vertical domains and low-resource languages.

To solve the problem of scarce speech recognition data, Datatang has designed and developed 200,000 hours of speech recognition datasets covering more than 60 languages and dialects, including Chinese Mandarin, English, Japanese, Korean, Hindi, Vietnamese, Arabic, Spanish, French, German, Italian, and Portuguese.

344 People — American English Speech Data by Mobile Phone_Guiding

The dataset contains speech data from 344 American English speakers, all of whom are American locals, with 50 sentences per speaker. The valid data amounts to 9.7 hours, recorded in a quiet environment. The contents cover in-car, smart home, and speech assistant scenarios.

520 Hours — French Speaking English Speech Data by Mobile Phone

1,089 French native speakers participated in the recording with authentic accents. The recording scripts were designed by linguists and cover a wide range of topics, including generic, interactive, in-car, and home scenarios. The text was manually proofread with high accuracy. The recordings are compatible with mainstream Android and Apple phones. The dataset can be applied to automatic speech recognition and machine translation.

211 Hours — German Speech Data by Mobile Phone_Reading

The dataset contains speech data from 327 German native speakers. The recording contents include economics, entertainment, news, colloquial speech, figures, letters, etc. Each sentence contains 10.3 words on average, and each sentence is repeated 1.4 times on average. All texts are manually transcribed to ensure high accuracy.

347 Hours-Italian Speech Data Collected by Mobile Phone

Italian audio data captured by mobile phone, with a total duration of 347 hours. It was recorded by 800 Italian native speakers with a balanced gender ratio; the recording environment is quiet; all texts are manually transcribed with high accuracy. This dataset can be applied to automatic speech recognition, machine translation, and voiceprint recognition.

1,044 Hours — Brazilian Portuguese Speech Data by Mobile Phone

The 1,044 Hours — Brazilian Portuguese Speech Data of natural conversations collected by phone involved more than 2,038 native speakers, with a proper balance of gender ratio and geographical distribution. Speakers chose from topics designed by linguistic experts and conducted conversations on various mobile phones. The audio format is 16kHz, 16bit, uncompressed WAV, and all the speech data was recorded in quiet indoor environments. All the speech audio was manually transcribed with text content, the start and end time of each valid sentence, and speaker identification. The sentence accuracy rate is ≥ 95%.
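When ingesting a dataset like this into a training pipeline, it is common to verify that every file actually matches the stated format (16kHz sample rate, 16-bit samples, uncompressed WAV). A minimal sketch using Python's standard `wave` module — the function name and file path are hypothetical, not part of Datatang's tooling:

```python
import wave

def check_wav_format(path, expected_rate=16000, expected_width=2):
    """Verify a WAV file matches the 16 kHz, 16-bit uncompressed spec.

    Returns (ok, rate, sample_width_bytes, compression_type).
    """
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()       # samples per second
        width = wf.getsampwidth()      # bytes per sample; 2 bytes = 16-bit
        comp = wf.getcomptype()        # "NONE" means uncompressed PCM
        ok = (rate == expected_rate
              and width == expected_width
              and comp == "NONE")
        return ok, rate, width, comp
```

Files that fail this check can be flagged for resampling or excluded before training, which avoids subtle mismatches between the acoustic model's expected input rate and the data.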

759 Hours — Hindi Speech Data by Mobile Phone

The data is 759 hours long and was recorded by 1,425 Indian native speakers with authentic accents. The recording texts were designed by language experts and cover general, interactive, in-car, home, and other categories. The texts were manually proofread with high accuracy. Recording devices are mainstream Android phones and iPhones. The data can be applied to speech recognition, machine translation, and voiceprint recognition.

234 Hours-Japanese Speech Data by Mobile Phone

The dataset was recorded by 799 Japanese locals in quiet indoor places, on streets, and in restaurants. The recordings include 210,000 commonly used written and spoken Japanese sentences. The sentence error rate of the transcriptions is below 5%. Recording devices are mainstream Android phones and iPhones.

About Datatang

Founded in 2011, Datatang is a professional artificial intelligence data service provider committed to providing high-quality training data and data services for global AI companies. Relying on its own data resources, technical advantages, and intensive data-processing experience, Datatang provides data services to more than 1,000 companies and institutions worldwide. Datatang entered the Chinese stock market (NEEQ: 831428) in 2014, becoming the first listed company in China's artificial intelligence data service industry.

If you need data services, please feel free to contact us: info@datatang.com




Off-the-shelf AI training data, on-demand data collection & annotation services