How to Use AI to Clone Your Voice

Google recently reported that the latest version of its speech synthesis system, Tacotron 2, produces speech that sounds almost exactly like a human voice. The system consists of two deep neural networks: the first converts text into a spectrogram, and the second generates the corresponding audio from that spectrogram.
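The two-stage data flow described above (text to spectrogram, then spectrogram to audio) can be sketched in a few lines of Python. This is only an illustration of how the stages connect, not Tacotron 2 itself: both functions are toy stand-ins rather than neural networks, and every name and constant here (text_to_spectrogram, spectrogram_to_audio, SAMPLE_RATE, HOP, N_BANDS, FRAMES_PER_CHAR) is an assumption made for the example.

```python
import math
from typing import List

# All constants are illustrative assumptions, not values taken from the article.
SAMPLE_RATE = 16000      # audio samples per second
HOP = 200                # audio samples generated per spectrogram frame
N_BANDS = 80             # frequency bands in each spectrogram frame
FRAMES_PER_CHAR = 4      # toy duration model: frames emitted per character

Spectrogram = List[List[float]]


def text_to_spectrogram(text: str) -> Spectrogram:
    """Stage 1 stand-in: map each character to a few one-hot frames.

    A real system would use a trained network here; this toy version
    only reproduces the shape of the output (frames x bands).
    """
    frames: Spectrogram = []
    for ch in text:
        band = ord(ch) % N_BANDS
        frame = [0.0] * N_BANDS
        frame[band] = 1.0
        frames.extend(list(frame) for _ in range(FRAMES_PER_CHAR))
    return frames


def spectrogram_to_audio(spec: Spectrogram) -> List[float]:
    """Stage 2 stand-in: turn each frame into HOP samples of a sine wave.

    A real vocoder is a trained network; this toy version plays one tone
    per frame, chosen from the frame's strongest band.
    """
    audio: List[float] = []
    for frame in spec:
        band = max(range(N_BANDS), key=lambda b: frame[b])
        freq = 100.0 + 50.0 * band          # hypothetical band-to-Hz mapping
        start = len(audio)
        for n in range(HOP):
            t = (start + n) / SAMPLE_RATE
            audio.append(frame[band] * math.sin(2 * math.pi * freq * t))
    return audio


def synthesize(text: str) -> List[float]:
    """Chain the two stages: text -> spectrogram -> waveform."""
    return spectrogram_to_audio(text_to_spectrogram(text))


if __name__ == "__main__":
    waveform = synthesize("hello")
    print(len(waveform), "samples,", len(waveform) / SAMPLE_RATE, "seconds")
```

The point of the split is that the spectrogram acts as an intermediate representation: the first stage decides what should be said and how long each sound lasts, while the second stage only has to turn that representation into a waveform.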

Text to Speech, or TTS for short, is a technology that artificially generates human speech, converting arbitrary text into standard, fluent speech in real time. TTS draws on many disciplines and technologies, including acoustics, linguistics, digital signal processing, and computer science, and it is a cutting-edge technology in the field of information processing. The main problem it solves is how to convert text into audible speech, that is, how to let machines talk like humans.

According to MarketsandMarkets, the global voice cloning market is likely to grow from $456 million in 2018 to $1.739 billion by 2023.
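To put the two market figures above in perspective, the implied compound annual growth rate (CAGR) can be worked out directly. The short sketch below assumes simple annual compounding over the five years from 2018 to 2023; the function name cagr is just a label chosen for this example.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate under simple annual compounding."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted above, in millions of US dollars.
growth = cagr(456.0, 1739.0, 2023 - 2018)
print(f"Implied CAGR: {growth:.1%}")  # prints "Implied CAGR: 30.7%"
```

In other words, the forecast implies the market growing by roughly 30% per year over that period.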

In personalized human-computer interaction, speech synthesis technology can be applied to custom personal AI assistants, audio reading, and voice systems for people with speech impairments. Speech synthesis can help people with speech impairments practice vocalization and communicate more easily with others. In the field of psychological medicine, restoring the voice of someone who has died could be a great comfort to those traumatized by the loss of a loved one.

As one of the world’s leading AI data service providers, Datatang is committed to overcoming technical bottlenecks and supporting the wider application of TTS technology. Datatang has rich data resources, strong technical capabilities, and extensive experience in data processing, and it supports customized speech data collection by scene, language, age, gender, and speaker.

Security Compliance

To provide customers with safe and compliant data services while ensuring its own security and compliance, Datatang has established a security compliance system for its data business in accordance with the data laws and policies of major countries around the world. At Datatang, data is collected only after the person being recorded has signed an authorization letter.

Recording Studio

Datatang has a professional recording studio equipped with vocal condenser microphones and monitoring equipment. The studio complies with the NR15 acoustic standard: the reverberation time is less than 0.1 seconds and the background noise is below 20 dB. It has been certified by the Building Physics Laboratory of Tsinghua University.

Speaker Resources

Datatang has thousands of speakers and hundreds of professional teams around the world, and supports speech synthesis in multiple languages, including Mandarin Chinese, English, Japanese, and mixed Chinese-English reading. In addition, Datatang offers a variety of timbres, including male, female, and children’s voices. Each timbre is available from several different types of speakers, which meets the requirements of diverse speech synthesis applications.

Quality Assurance

During recording, Datatang uses professional monitoring to ensure recording quality. By consulting experts and research papers, and by referring to the pronunciation of words in various dictionaries, Google Translate, and Baidu Translate, Datatang has compiled a complete set of pronunciation rules and built a pronunciation dictionary.

Off-the-Shelf TTS Speech Datasets

American English Speech Synthesis Corpus-Female

The corpus is recorded by native speakers of American English with authentic accents and pleasant voices. The phonemes and tones are balanced, and professional phoneticians participate in the annotation.

American English Speech Synthesis Corpus-Male

The corpus is recorded by native speakers of American English with authentic accents and pleasant voices. The phonemes and tones are balanced, and professional phoneticians participate in the annotation.

Japanese Synthesis Corpus-Female

The corpus is recorded by native speakers of Japanese with authentic accents and pleasant voices. The phonemes and tones are balanced, and professional phoneticians participate in the annotation.

Chinese-English Mixed Average Tone Speech Synthesis Corpus-Customer Service

The corpus is recorded by native Chinese speakers reading customer service text. The syllables, phonemes, and tones are balanced, and professional phoneticians participate in the annotation.

Chinese Mandarin Synthesis Corpus-Female, Emotional

The corpus is recorded by a native Chinese speaker reading emotional text. The syllables, phonemes, and tones are balanced, and professional phoneticians participate in the annotation.

End

If you need data services, please feel free to contact us: info@datatang.com.

Off-the-shelf AI training data, on-demand data collection & annotation services
