Build Personalized Speech Synthesis With High-quality Training Data
Speech synthesis, commonly known as text-to-speech (TTS), is a technology that converts arbitrary input text into the corresponding speech. It is an indispensable module in human-computer voice interaction.
Traditional Speech Synthesis
Traditional speech synthesis systems usually consist of two modules: a front end and a back end. The front-end module analyzes the input text and extracts the linguistic information that the back-end module requires. For Chinese synthesis systems, the front end generally includes sub-modules such as Text Normalization (TN), polyphonic word disambiguation, and prosody prediction. The back-end module then generates the speech waveform from the front-end analysis results.
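To make the three front-end stages concrete, here is a minimal Python sketch. It is only an illustration under strong assumptions: the lookup tables, function names, and the punctuation-based break rule are invented for this example, whereas a production front end would learn these decisions from annotated data.

```python
# Toy lookup tables, invented for this sketch; real systems learn these from annotated data.
SYMBOL_READINGS = {"&": "和"}
DEFAULT_READINGS = {"行": "xing2"}
CONTEXT_READINGS = {"行": {"银行": "hang2", "行人": "xing2"}}
BREAK_MARKS = "，。！？"


def normalize(text):
    """Text normalization: replace non-standard symbols with their spoken form."""
    for symbol, reading in SYMBOL_READINGS.items():
        text = text.replace(symbol, reading)
    return text


def disambiguate(text):
    """Polyphone disambiguation: pick a reading from the word around each character."""
    readings = {}
    for i, ch in enumerate(text):
        if ch not in DEFAULT_READINGS:
            continue
        readings[i] = DEFAULT_READINGS[ch]
        for word, reading in CONTEXT_READINGS.get(ch, {}).items():
            # Search only the window of text that could contain position i.
            lo = max(0, i - len(word) + 1)
            if text.find(word, lo, i + len(word)) != -1:
                readings[i] = reading
    return readings


def predict_breaks(text):
    """Prosody prediction (naive): place a prosodic break at each punctuation mark."""
    return [i for i, ch in enumerate(text) if ch in BREAK_MARKS]


def front_end(text):
    """Run the three stages in order and collect what a back end would consume."""
    normalized = normalize(text)
    return {
        "text": normalized,
        "readings": disambiguate(normalized),
        "breaks": predict_breaks(normalized),
    }


result = front_end("今天，行人去银行&邮局。")
print(result)
```

The same character 行 receives a different reading in 行人 and 银行, which is the kind of decision polyphone annotation is meant to support.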
The front end depends on a large amount of annotated data, such as TN annotation, polyphonic word annotation, and prosody annotation, to produce accurate results.
The back end requires high-quality voice libraries recorded by professional speakers. To serve a variety of scenarios, a large number of voice libraries covering diverse timbres and languages are required.
Personalized Speech Synthesis
Personalized speech synthesis usually takes a small amount of speech from the target speaker, possibly of low quality, and uses methods such as transfer learning to train a model capable of synthesizing that speaker’s voice. The usual approach is to train a general speech synthesis model on data from a large number of different speakers, and then fine-tune it on the small amount of target-speaker data.
Applications of personalized speech synthesis are becoming increasingly mature. For example, Baidu Maps lets users record 9 sentences to generate a complete personal speech package and use it in all scenarios of the map.
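The pretrain-then-fine-tune idea can be illustrated with a deliberately tiny Python example. This is a toy under stated assumptions, not a speech model: each "speaker" is reduced to a hypothetical one-dimensional linear mapping, and all names and numbers below are invented for illustration.

```python
import random


def fit(samples, w=0.0, b=0.0, lr=0.1, epochs=300):
    """Gradient descent on mean squared error for the model y = w * x + b."""
    n = len(samples)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(samples, w, b):
    """Mean squared error of the model on a list of (x, y) samples."""
    return sum((w * x + b - y) ** 2 for x, y in samples) / len(samples)


rng = random.Random(0)


def make_speaker(slope, offset, n):
    """Hypothetical speaker: n (input, output) pairs from a linear mapping."""
    return [(x, slope * x + offset) for x in (rng.uniform(-1, 1) for _ in range(n))]


# Step 1: train a general model on many different speakers.
general_data = []
for _ in range(20):
    general_data += make_speaker(rng.uniform(0.8, 1.2), rng.uniform(-0.2, 0.2), 50)
w_general, b_general = fit(general_data)

# Step 2: fine-tune on only 9 samples from the target speaker,
# starting from the general model's parameters instead of from scratch.
target_train = make_speaker(1.5, 0.4, 9)
target_test = make_speaker(1.5, 0.4, 100)
w_target, b_target = fit(target_train, w=w_general, b=b_general)

error_general = mse(target_test, w_general, b_general)
error_target = mse(target_test, w_target, b_target)
print(error_general, error_target)
```

In a real system the model is a neural network with many parameters, and starting from the general model is what lets so little target data suffice.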
Personalized speech synthesis relies on a multi-speaker average model library as essential supporting data. Datatang’s speech synthesis data for general scenarios is divided into three categories:
Single-speaker Synthesis Library
A sound library recorded in a professional recording studio by a single speaker.
Female audio data of American English. It is recorded by a native American English speaker with an authentic accent and a sweet voice. The phoneme coverage is balanced, and a professional phonetician participated in the annotation.
Male audio data of American English. It is recorded by a native American English speaker with an authentic accent. The phoneme coverage is balanced, and a professional phonetician participated in the annotation.
Japanese Synthesis Corpus-Female. It is recorded by a native Japanese speaker with an authentic accent. The phoneme coverage is balanced, and a professional phonetician participated in the annotation.
Multi-speaker Average Model Library
A sound library recorded in a professional recording studio by multiple speakers.
Chinese Mandarin Average Tone Speech Synthesis Corpus, General. It is recorded by native Chinese speakers. It covers news, dialogue, audio books, poetry, advertising, news broadcasting, and entertainment, and the phonemes and tones are balanced. A professional phonetician participated in the annotation.
Chinese Average Tone Speech Synthesis Corpus-Three Styles. It is recorded by native Chinese speakers. The corpus includes customer service, news, and story styles. The syllables, phonemes, and tones are balanced. A professional phonetician participated in the annotation.
•200,475 Sentences — Chinese Text Normalization Data
The dataset covers novels, articles, news, and other categories. The special symbols and Arabic numerals in the sentences are annotated with their Chinese-character readings, for a total of 199,652 sentences and 454,638 annotations.
•319,977 Sentences — Mandarin Polyphone Corpus Data
The dataset covers news, spoken language, and other categories. It includes 603 pronunciations of 266 polyphonic words, in a total of 319,977 sentences.
•200,955 Sentences — Mandarin Prosodic Corpus Data
Texts from news and daily chats are annotated with four levels of prosody.
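As an illustration of what the text normalization data above captures, the following Python sketch rewrites Arabic numerals as Chinese characters with simple rules. The function names and the rules themselves (cardinal reading up to 9999, digit-by-digit reading for longer runs, a percent rule) are assumptions made for this example. Real text is more ambiguous, for example years and phone numbers are read differently from quantities, which is why annotated sentences are needed.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]


def read_integer(n):
    """Read an integer from 0 to 9999 as a Chinese cardinal number."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only handles 0..9999")
    if n == 0:
        return "零"
    s = str(n)
    out = []
    pending_zero = False
    for i, ch in enumerate(s):
        d = int(ch)
        if d == 0:
            pending_zero = True  # emit a single 零 only if a non-zero digit follows
            continue
        if pending_zero:
            out.append("零")
            pending_zero = False
        out.append(DIGITS[d] + UNITS[len(s) - 1 - i])
    result = "".join(out)
    # 10..19 are read 十, 十一, ... rather than 一十, 一十一, ...
    return result[1:] if result.startswith("一十") else result


def _read_run(digits):
    """Long runs or runs with a leading zero are read digit by digit."""
    if len(digits) > 4 or (len(digits) > 1 and digits[0] == "0"):
        return "".join(DIGITS[int(d)] for d in digits)
    return read_integer(int(digits))


def normalize_numbers(text):
    """Replace percentages and digit runs with Chinese-character readings."""
    text = re.sub(r"([0-9]+)%", lambda m: "百分之" + _read_run(m.group(1)), text)
    return re.sub(r"[0-9]+", lambda m: _read_run(m.group(0)), text)


print(normalize_numbers("共有105人，占50%"))
```

A rule set like this cannot decide by itself whether "2023" is a quantity or a year, so a learned model trained on context-annotated sentences is the usual way to resolve such cases.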
As the world’s leading AI data service provider, Datatang has rich sample sound resources, outstanding technical advantages and data processing experience, and supports personalized collection services for designated language, timbre, age, and gender. Meanwhile, Datatang supports data customization services such as audio segmentation, phoneme boundary segmentation (segmentation accuracy of 0.01 seconds), phonetic tagging, prosody tagging, part-of-speech tagging, pitch proofreading, rhythm tagging, and musical score production to fully meet customers’ diverse requirements.
If you need data services, please feel free to contact us: firstname.lastname@example.org.