Using High-quality TTS Data to Optimize your AI Models

Nov 8, 2021

Speech synthesis, also known as TTS (Text to Speech), is a technology that artificially generates human speech, converting arbitrary text into standard, fluent speech read aloud in real time. It is an indispensable part of human-machine interaction. Speech recognition technology allows computers to learn to “listen”, while speech synthesis technology allows computers to “speak” like a human.

From map navigation, voice assistants, and news reading to smart customer service, call centers, and public announcements, TTS applications are everywhere in our lives.

Apart from text-to-speech, the research scope of speech synthesis technology also includes singing synthesis, whisper synthesis, dialect synthesis, animal sound synthesis, and more. At present, speech synthesis technology has been successfully applied in many fields.

Unlike traditional broadcast-style TTS, personalized TTS applications are becoming more and more popular. Drawing on extensive experience in speech and text data annotation, Datatang provides high-quality, multi-scenario, and multi-category speech synthesis data solutions.

100 People — Chinese Mandarin Average Tone Speech Synthesis Corpus, General

The corpus is recorded by native Chinese speakers. It covers news, dialogue, audiobooks, poetry, advertising, news broadcasting, and entertainment, and the phonemes and tones are balanced. The word accuracy rate is not less than 99.9%, the phoneme accuracy rate is not less than 99%, and the prosodic accuracy rate is not less than 98%.

19.46 Hours — American English Speech Synthesis Corpus-Female

The corpus is recorded by native American English speakers, with an authentic accent and a sweet voice. The phoneme coverage is balanced. The word accuracy rate is not less than 99%, the phoneme accuracy rate is not less than 98%, and the prosodic accuracy rate is not less than 98%.

10 Hours — Chinese Mandarin Synthesis Corpus-Female, Customer Service

The corpus is recorded by native Chinese speakers, with a lively and friendly voice. The phoneme coverage is balanced. The word accuracy rate is not less than 99.8%, the phoneme accuracy rate is not less than 98%, and the syllable boundary accuracy is not less than 98%.

6.78 Hours — Chinese Mandarin Speech Synthesis Corpus-Female Imitating Children

The corpus is recorded by native Chinese speakers, with an authentic accent and a sweet voice. The phoneme coverage is balanced. The word accuracy rate is not less than 99%.
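
The accuracy figures quoted for the corpora above can be read as simple ratios of correctly annotated units to total units. The Python sketch below is only an illustration of that reading, not Datatang's actual validation procedure: it assumes the reference and annotated label sequences are already aligned one-to-one, and the function names and sample data are hypothetical.

```python
def accuracy_rate(reference, annotated):
    """Fraction of aligned positions where the annotation matches the reference.

    Assumption: both sequences are already aligned and have equal length.
    """
    if len(reference) != len(annotated):
        raise ValueError("sequences must be aligned to the same length")
    if not reference:
        raise ValueError("reference must not be empty")
    matches = sum(r == a for r, a in zip(reference, annotated))
    return matches / len(reference)


def meets_threshold(reference, annotated, minimum):
    """True if the accuracy rate is not less than the given minimum."""
    return accuracy_rate(reference, annotated) >= minimum


# Hypothetical sample: one word out of six is annotated incorrectly.
reference = ["the", "cat", "sat", "on", "the", "mat"]
annotated = ["the", "cat", "sat", "on", "a", "mat"]

rate = accuracy_rate(reference, annotated)
print(f"word accuracy: {rate:.3f}")
print("meets 99% threshold:", meets_threshold(reference, annotated, 0.99))
```

The same check could be applied to phoneme or prosodic labels by passing those sequences instead of words, again assuming they are aligned beforehand.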

With the rapid development of speech synthesis technology, the speech generated by TTS will become more and more natural and vivid. We firmly believe that technological progress will continue to break through conventional obstacles and bring more convenience to our daily lives.

End

If you need data services, please feel free to contact us: info@datatang.com

Written by Nexdata