Improving Your AI Models with High-Quality Chinese Dialect Data

As AI applications expand, dialect recognition is receiving increasing attention. However, because Chinese dialects differ greatly from Mandarin, speech recognition for Chinese dialects is much more complicated.

Generally speaking, speech data collection means recording commonly used sentences and words as text, phonetic symbols and audio, and integrating the recorded content into a database. However, because China has so many dialects, the amount of data to be collected is massive, and it is difficult to establish a national dialect database in a short time.
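To make the collection process concrete, here is a minimal sketch of how one collected item (text, phonetic symbols and audio) might be stored in a database. The field names, file paths and table layout are illustrative assumptions, not Datatang's actual schema, and the Jyutping romanization is used only as one possible phonetic notation.

```python
import sqlite3
from dataclasses import dataclass, astuple


@dataclass
class DialectEntry:
    """One collected item. Hypothetical fields, not a real vendor schema."""
    dialect: str      # dialect label, e.g. "Cantonese"
    text: str         # written form of the sentence or word
    phonetic: str     # phonetic transcription (notation is an assumption)
    audio_path: str   # location of the recorded audio file (hypothetical)


def build_database(entries):
    """Store the entries in an in-memory SQLite table and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries "
        "(dialect TEXT, text TEXT, phonetic TEXT, audio_path TEXT)"
    )
    conn.executemany(
        "INSERT INTO entries VALUES (?, ?, ?, ?)",
        [astuple(e) for e in entries],
    )
    conn.commit()
    return conn


def count_by_dialect(conn):
    """Return a dict mapping each dialect label to its number of entries."""
    rows = conn.execute(
        "SELECT dialect, COUNT(*) FROM entries GROUP BY dialect"
    )
    return dict(rows.fetchall())


if __name__ == "__main__":
    sample = [
        DialectEntry("Cantonese", "你好", "nei5 hou2", "audio/yue_0001.wav"),
        DialectEntry("Cantonese", "多謝", "do1 ze6", "audio/yue_0002.wav"),
    ]
    print(count_by_dialect(build_database(sample)))
```

Keeping the text, the phonetic transcription and the audio reference in one record means each recording stays linked to its transcript, which is what a speech recognition model needs for training.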

To support large-scale applications of Chinese dialects, Datatang has planned ahead and accumulated 25,000 hours of Chinese dialect data, covering the dialect regions of Fujian, Guangdong, Wu, Hunan, the Southwest, the Northeast and the Central Plains, as well as ethnic minorities. The datasets can be delivered in seconds and can quickly help improve the recognition accuracy of AI models. All datasets are recorded by native speakers who have signed authorization agreements.

Cantonese Conversational Speech Data

Nearly 1,000 Cantonese speakers participated in the recording and conversed face to face in a natural way. They freely discussed a number of given topics covering a wide range of fields. The speech is natural and fluent, in line with real dialogue scenarios.

Minnan Dialect Conversational Speech Data

This dataset was recorded by nearly 1,000 speakers from Fujian Province. Dozens of topics are specified, and the speakers converse on those topics while being recorded. The sentence accuracy rate is 95%.

Sichuan Dialect Conversational Speech Data

A total of 1,730 native Sichuan speakers participated in the recording and talked freely face to face in a natural way, covering a wide range of fields with no topics specified. The speech is natural and fluent, in line with real dialogue scenarios.

If the above data cannot meet the needs of your current research, Datatang also provides data customization services for specific groups of people, specific scenarios, and specific languages to meet customers’ diversified data needs.


