How to Build Conversational AI Models: A Training Data Perspective

Nexdata
Sep 11, 2022

In today's data industry, most speech recognition training data is read speech: speakers read prepared text aloud. Read-speech data is sufficient for relatively simple human-machine interaction scenarios such as mobile phone voice assistants, in-vehicle voice assistants, smart speakers, and smart home appliances.

In these scenarios, the dialogue or command exchanged between the user and the machine is usually a single short sentence. Users often have to pay attention to their speaking speed and pronunciation, so the speech is essentially unnatural. Read-speech data can therefore meet the training needs of the speech recognition algorithm.

However, as speech recognition is deployed in more natural scenarios such as intelligent customer service and intelligent conferences, models trained on read speech perform poorly. In daily life, speakers do not deliberately control their voice or pronunciation, so their speech contains many linked or slurred words, swallowed syllables, distorted pronunciation, and unclear articulation, as well as unconscious fillers such as “um”, “ah”, and “uh”. When several people talk at the same time, more complex phenomena such as interrupted sentences, competition for the turn, and overlapping speech can also occur. As a result, recognition accuracy on this natural dialogue style remains unsatisfactory.

Data is the foundation of artificial intelligence. Making an AI system more accurate requires a training data set that better matches the application scenario, so natural dialogue speech data is now urgently needed in the industry. When Datatang collects natural dialogue speech data, there is no preset corpus at all; only a list of topics is given. Each recorder selects several familiar topics and starts a dialogue, which keeps the speech natural and smooth.

Right now, Datatang has 200,000 hours of off-the-shelf speech data, of which nearly 40,000 hours are natural dialogue speech in languages including English, Japanese, Korean, Hindi, Vietnamese, Arabic, Spanish, French, German, Italian, etc. The speakers come from different regions and cities, and the age and gender coverage is balanced. All the audio has undergone strict manual transcription and quality inspection. The text content, the start and end times of valid sentences, and the identity of the recorder are labeled, and the sentence accuracy rate is over 95%.
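To make the labeling concrete, here is a minimal Python sketch of what one labeled sentence and a sentence accuracy check might look like. The `Utterance` class, its field names, and both helper functions are hypothetical and only illustrate the three labels described above; they are not Datatang's actual delivery format or quality inspection procedure. Sentence accuracy is assumed here to mean the share of sentences whose transcription matches a trusted reference exactly.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """One labeled sentence (hypothetical schema, for illustration only)."""
    speaker_id: str  # identity of the recorder
    start_s: float   # start time of the valid sentence, in seconds
    end_s: float     # end time of the valid sentence, in seconds
    text: str        # manual transcription


def sentence_accuracy(utterances, references):
    """Fraction of sentences whose transcription equals the reference text."""
    if len(utterances) != len(references):
        raise ValueError("need exactly one reference per utterance")
    if not references:
        return 0.0
    correct = sum(
        u.text.strip() == ref.strip() for u, ref in zip(utterances, references)
    )
    return correct / len(references)


def overlapping_pairs(utterances):
    """Index pairs of utterances by different speakers that overlap in time."""
    pairs = []
    for i, a in enumerate(utterances):
        for j in range(i + 1, len(utterances)):
            b = utterances[j]
            different_speakers = a.speaker_id != b.speaker_id
            intervals_overlap = a.start_s < b.end_s and b.start_s < a.end_s
            if different_speakers and intervals_overlap:
                pairs.append((i, j))
    return pairs


# Made-up example data: speaker B starts before speaker A has finished.
utts = [
    Utterance("A", 0.0, 2.5, "so um where did you go"),
    Utterance("B", 2.1, 4.0, "uh to the coast"),
    Utterance("A", 4.2, 5.0, "nice"),
]
refs = ["so um where did you go", "uh to the coast", "nice one"]

print(sentence_accuracy(utts, refs))
print(overlapping_pairs(utts))
```

Storing the speaker identity and sentence boundaries next to the text is what makes it possible to locate overlapping speech, one of the natural dialogue phenomena that read-speech data does not contain.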

American English Conversational Speech Data by Mobile Phone

2,000 speakers participated in the recording and communicated face to face in a natural way. They freely discussed a number of given topics covering a wide range of fields; the speech is natural and fluent, in line with real dialogue scenes. The text is transcribed manually, with high accuracy.

Spanish Conversational Speech Data by Mobile Phone

About 700 speakers participated in the recording and communicated face to face in a natural way. They freely discussed a number of given topics covering a wide range of fields; the speech is natural and fluent, in line with real dialogue scenes. The text is transcribed manually, with high accuracy.

German Conversational Speech Data by Mobile Phone

About 750 speakers participated in the recording and communicated in a natural way. They freely discussed a number of given topics covering a wide range of fields; the speech is natural and fluent, in line with real dialogue scenes. The text is transcribed manually, with high accuracy.

Korean Conversational Speech Data by Mobile Phone

About 700 Korean speakers participated in the recording and communicated face to face in a natural way. They freely discussed a number of given topics covering a wide range of fields; the speech is natural and fluent, in line with real dialogue scenes. The text is transcribed manually, with high accuracy.

Minnan Dialect Conversational Speech Data by Mobile Phone

This 500-hour Minnan dialect conversational speech data set was recorded on mobile phones by about 1,000 native Hokkien speakers from Quanzhou, Zhangzhou and Xiamen. The ratio of males to females is balanced, and multiple age groups are covered. There is no preset corpus for the speech data. To keep the dialogue smooth and natural, the recorders start and record conversations on topics they are familiar with.

End

If you want to know more about the datasets or how to acquire them, please feel free to contact us: info@datatang.com.
