Improving Your AI Models with High-quality Chinese Dialect Data

As AI applications expand, dialect recognition is receiving increasing attention. However, because Chinese dialects differ so much from Mandarin, speech recognition for them is considerably more complicated.

Generally speaking, speech data collection means recording commonly used sentences and words as text, phonetic symbols and audio, then integrating the recorded content into a database. However, China has so many dialects that the volume of data to be collected is massive, and a national dialect database is difficult to establish in a short time.

To support large-scale applications of Chinese dialects, Datatang prepared in advance and has accumulated 25,000 hours of Chinese dialect data, covering the dialect regions of Fujian, Guangdong, Wu, Hunan, Southwest, Northeast and Central Plains, as well as ethnic minorities. The datasets can be delivered in seconds and quickly help to improve the recognition accuracy of AI models. All the datasets were recorded by native speakers who signed authorization agreements.

Cantonese Conversational Speech Data

Nearly 1,000 Cantonese speakers took part in the recording and talked face to face in a natural way. They freely discussed a number of given topics covering a wide range of fields; the speech is natural and fluent and matches real dialogue scenarios.

Minnan Dialect Conversational Speech Data

The dataset was recorded by nearly 1,000 speakers from Fujian Province. Dozens of topics were specified, and the speakers held dialogues on those topics while being recorded. The sentence accuracy rate is 95%.

Sichuan Dialect Conversational Speech Data

A total of 1,730 native Sichuan speakers took part in the recording, talking freely face to face in a natural way across a wide range of fields, with no topics specified. The speech is natural and fluent and matches real dialogue scenarios.

If the data above does not meet the needs of your current research, Datatang also provides data customization services for specific groups of people, specific scenarios, and specific languages to meet customers' diverse data needs.


If you need data services, please feel free to contact us:

Off-the-shelf AI training data, on-demand data collection & annotation services
