Build Personalized Speech Synthesis With High-quality Training Data

Speech synthesis, commonly known as Text-to-Speech (TTS), converts arbitrary input text into the corresponding speech and is an indispensable module in human-computer voice interaction.

Traditional Speech Synthesis

Traditional speech synthesis systems usually consist of two modules: a front end and a back end. The front-end module analyzes the input text and extracts the linguistic information that the back-end module requires. For Chinese synthesis systems, the front end generally includes sub-modules such as Text Normalization (TN), polyphonic word disambiguation, and prosody prediction. The back-end module then generates a speech waveform from the front-end analysis results.
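The front-end stages described above can be pictured as a chain of functions whose combined output is handed to the back end. The following is a minimal Python sketch of that data flow only. Every name in it (`LinguisticFeatures`, `run_front_end`, and the stand-in stage functions) is hypothetical and chosen for illustration; it is not taken from any particular TTS system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LinguisticFeatures:
    """What the front end hands to the back end (illustrative fields only)."""
    normalized_text: str
    pronunciations: List[str]
    prosody_breaks: List[int]


def run_front_end(
    text: str,
    normalize: Callable[[str], str],
    pronounce: Callable[[str], List[str]],
    predict_prosody: Callable[[str], List[int]],
) -> LinguisticFeatures:
    """Run TN, pronunciation lookup, and prosody prediction in sequence."""
    normalized = normalize(text)
    return LinguisticFeatures(
        normalized_text=normalized,
        pronunciations=pronounce(normalized),
        prosody_breaks=predict_prosody(normalized),
    )


# Trivial stand-ins for the three stages; real systems use trained models
# or large rule sets built from annotated data.
features = run_front_end(
    "  你好  ",
    normalize=str.strip,
    pronounce=lambda t: ["ni3", "hao3"] if t == "你好" else [],
    predict_prosody=lambda t: [len(t) - 1],  # one break after the last character
)
```

Keeping each stage behind a plain function interface makes it possible to swap a rule-based stage for a model-based one without touching the rest of the chain.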

Front-end technology depends on large amounts of basic annotated data, such as TN annotation, polyphonic word annotation, and prosody annotation, to produce accurate results.

Back-end technology requires high-quality voice libraries recorded by professional speakers. Covering a variety of application scenarios requires many voice libraries with diverse timbres and languages.

Personalized Speech Synthesis

Personalized speech synthesis usually takes a small amount of possibly low-quality speech from a target speaker and uses methods such as transfer learning to train a model capable of synthesizing that speaker’s voice. The usual approach is to train a general speech synthesis model on a large amount of speech from many different speakers, and then fine-tune it with a small amount of the target speaker’s speech.
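The "train a general model, then fine-tune" recipe can be illustrated with a deliberately tiny toy. The sketch below is not a speech model: it fits a one-dimensional linear function with plain SGD, first on plentiful data from several synthetic "speakers" and then on a handful of samples from a new one. All data, names, and hyperparameters are assumptions made for illustration.

```python
import random


def train(weights, data, lr, epochs):
    """Plain SGD on a 1-D linear model y = w * x + b; returns updated (w, b)."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b


def mse(weights, data):
    """Mean squared error of the model on a dataset."""
    w, b = weights
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)


def make_speaker(slope, offset, n):
    """Synthetic 'speaker': n samples from y = slope * x + offset."""
    return [(x, slope * x + offset)
            for x in (rng.uniform(-1, 1) for _ in range(n))]


# Step 1: general model trained on plenty of data from several speakers.
multi_speaker = []
for slope, offset in [(1.8, 0.1), (2.0, -0.1), (2.2, 0.0)]:
    multi_speaker += make_speaker(slope, offset, 200)
rng.shuffle(multi_speaker)
general = train((0.0, 0.0), multi_speaker, lr=0.05, epochs=20)

# Step 2: fine-tune on only a few samples from the target speaker,
# starting from the general model instead of from scratch.
target = make_speaker(2.1, 0.5, 9)
finetuned = train(general, target, lr=0.05, epochs=50)

print("general model on target:   ", mse(general, target))
print("fine-tuned model on target:", mse(finetuned, target))
```

The point of the toy is the starting position: fine-tuning begins from weights that already capture what the speakers have in common, so a few target samples are enough to close the remaining gap.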

Applications of personalized speech synthesis are becoming increasingly mature. Baidu Maps, for example, lets users record 9 sentences to generate a complete personal voice package and use it in all scenarios of the map.

Personalized speech synthesis relies on a multi-speaker average model library as important supporting data. Datatang’s speech synthesis data for general scenarios is divided into three categories:

Single-Speaker Synthesis Library

A sound library recorded in a professional recording studio by a single speaker.

American English Speech Synthesis Corpus-Female

Female American English audio data, recorded by a native American English speaker with an authentic accent and a sweet voice. The phoneme coverage is balanced. A professional phonetician participates in the annotation.

American English Speech Synthesis Corpus-Male

Male American English audio data, recorded by a native American English speaker with an authentic accent. The phoneme coverage is balanced. A professional phonetician participates in the annotation.

Japanese Synthesis Corpus-Female

Female Japanese audio data, recorded by a native Japanese speaker with an authentic accent. The phoneme coverage is balanced. A professional phonetician participates in the annotation.

Multi-speaker Average Model Library

A sound library recorded in a professional recording studio by multiple speakers.

Chinese Mandarin Average Tone Speech Synthesis Corpus, General

Recorded by native Chinese speakers. It covers news, dialogue, audio books, poetry, advertising, news broadcasting, and entertainment, and the phonemes and tones are balanced. A professional phonetician participates in the annotation.

Chinese Average Tone Speech Synthesis Corpus-Three Styles

Recorded by native Chinese speakers. The corpus covers three styles: customer service, news, and story. The syllables, phonemes, and tones are balanced. A professional phonetician participates in the annotation.

Frontend Text

•200,475 Sentences — Chinese Text Normalization Data

The dataset covers novels, articles, news, and other categories. The special symbols and Arabic numerals in the sentences are annotated with their Chinese-character readings, for a total of 199,652 sentences and 454,638 annotations.
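To make the text normalization task concrete, here is a small rule-based Python sketch that rewrites Arabic numerals as Chinese characters. It is an illustration under stated assumptions, not the method used to build the dataset: the function names are hypothetical, only integers are handled, and the single context rule (read digit by digit when followed by 年 or when longer than four digits) is a simplification. Cases that such rules get wrong are exactly what annotated TN data is meant to cover.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]


def int_to_chinese(n: int) -> str:
    """Read an integer in 0..9999 as a Chinese cardinal number."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only handles 0..9999")
    if n == 0:
        return DIGITS[0]
    parts = []
    pending_zero = False
    s = str(n)
    for i, ch in enumerate(s):
        d = int(ch)
        if d == 0:
            pending_zero = True  # emit at most one 零 for a run of zeros
            continue
        if pending_zero:
            parts.append(DIGITS[0])
            pending_zero = False
        parts.append(DIGITS[d] + UNITS[len(s) - 1 - i])
    text = "".join(parts)
    # 10..19 are read "十…" rather than "一十…"
    if text.startswith("一十"):
        text = text[1:]
    return text


def normalize(text: str) -> str:
    """Replace each run of ASCII digits with a Chinese reading."""
    def repl(match):
        number = match.group(0)
        following = text[match.end():match.end() + 1]
        # Assumed rule: years and long digit strings are read digit by digit.
        if following == "年" or len(number) > 4:
            return "".join(DIGITS[int(c)] for c in number)
        return int_to_chinese(int(number))

    return re.sub(r"[0-9]+", repl, text)


print(normalize("2023年有15个项目"))
```

The same digit string can need different readings depending on context (a year, a quantity, a phone number), which is why each occurrence is annotated individually rather than handled by one global rule.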

•319,977 Sentences — Mandarin Polyphone Corpus Data

The dataset covers news, spoken language, and other categories. It includes 603 pronunciations of 266 polyphonic words, for a total of 319,977 sentences.
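Polyphone disambiguation means choosing the right pronunciation of a character from its context. The Python sketch below shows the simplest possible version, a word-lexicon lookup with a default fallback. The lexicon entries, the numbered-tone pinyin notation, and the function name `read_polyphone` are assumptions for illustration; production systems typically train a classifier on annotated sentences like those in the corpus.

```python
# Reading of the polyphonic character inside each known word.
POLYPHONE_WORDS = {
    "银行": "hang2",   # bank
    "行走": "xing2",   # to walk
    "长大": "zhang3",  # to grow up
    "长度": "chang2",  # length
}

# Fallback reading when no known word covers the character.
DEFAULT_READING = {"行": "xing2", "长": "chang2"}


def read_polyphone(sentence: str, index: int) -> str:
    """Return the reading of the polyphonic character at sentence[index]."""
    char = sentence[index]
    if char not in DEFAULT_READING:
        raise ValueError(f"{char!r} is not in this toy polyphone table")
    for word, reading in POLYPHONE_WORDS.items():
        offset = word.find(char)
        if offset < 0:
            continue
        start = index - offset
        # The word must cover the character at exactly this position.
        if start >= 0 and sentence[start:start + len(word)] == word:
            return reading
    return DEFAULT_READING[char]


print(read_polyphone("我去银行", 3))
print(read_polyphone("他在行走", 2))
```

A lexicon like this fails as soon as the character appears in a word or context it has not seen, which is the gap that sentence-level annotation is intended to fill.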

•200,955 Sentences — Mandarin Prosodic Corpus Data

Texts from news and daily chats are annotated with four-level prosody.
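Prosody annotation is often stored as break markers embedded in the text. The sketch below assumes a `#1` to `#4` marker convention that appears in some public Mandarin TTS corpora, where a higher number marks a stronger boundary. The source does not specify the format of this dataset, so both the marker syntax and the function name `parse_prosody` are assumptions for illustration.

```python
import re


def parse_prosody(annotated: str):
    """Split text like '今天#1天气#2很好#4' into (chunk, break_level) pairs."""
    return [
        (chunk, int(level))
        for chunk, level in re.findall(r"([^#]+)#([1-4])", annotated)
    ]


def strip_prosody(annotated: str) -> str:
    """Recover the plain text by removing the break markers."""
    return re.sub(r"#[1-4]", "", annotated)


pairs = parse_prosody("今天#1天气#2很好#4")
plain = strip_prosody("今天#1天气#2很好#4")
```

Separating the plain text from the break levels this way lets the same annotated sentence serve both as model input (the text) and as a training target (the levels).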

As a leading AI data service provider, Datatang has rich voice sample resources, strong technical capabilities, and extensive data processing experience, and it supports personalized collection services for a designated language, timbre, age, and gender. Datatang also supports data customization services such as audio segmentation, phoneme boundary segmentation (with a segmentation accuracy of 0.01 seconds), phonetic tagging, prosody tagging, part-of-speech tagging, pitch proofreading, rhythm tagging, and musical score production to meet customers’ diverse requirements.

If you need data services, please feel free to contact us: info@datatang.com.