How to Use AI to Clone Your Voice
Recently, Google said that Tacotron 2, the latest version of its speech synthesis system, can synthesize speech that is almost indistinguishable from a human voice. The system consists of two deep neural networks: the first converts text into a spectrogram, and the second generates the corresponding audio from that spectrogram.
Text to Speech, or TTS for short, is a technology that artificially generates human speech, converting arbitrary text into standard, fluent speech in real time. TTS draws on many disciplines, including acoustics, linguistics, digital signal processing, and computer science, and it is a cutting-edge technology in the field of information processing. The main problem it solves is how to convert text into audible sound, that is, how to let machines talk like humans.
According to Markets and Markets, the global voice cloning market is likely to grow from $456 million in 2018 to $1.739 billion by 2023.
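To put those two figures in perspective, the implied compound annual growth rate can be worked out from the endpoints. The sketch below assumes five years of compounding (2018 to 2023); the function name is illustrative and does not come from the cited report.

```python
def compound_annual_growth_rate(start_value: float, end_value: float, years: int) -> float:
    """Return the constant yearly growth rate that turns start_value into end_value."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Market sizes quoted above, in millions of US dollars.
cagr = compound_annual_growth_rate(456, 1739, years=2023 - 2018)
print(f"Implied CAGR: {cagr:.1%}")
```

Under that assumption, the forecast corresponds to growth of roughly 31% per year.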
In personalized human-computer interaction scenarios, speech synthesis technology can be applied to customize personal AI assistants, audio reading, and voice systems for the speech impaired. Speech synthesis can help the speech impaired practice their vocalization and make it easier for them to communicate with others. In the field of psychological medicine, restoring the voice of the deceased could be a great comfort to those who have been traumatized by the loss of a loved one.
As a world-leading AI data service provider, Datatang is committed to overcoming technical bottlenecks and supporting the wider application of TTS technology. Datatang has rich data resources, strong technical capabilities, and extensive experience in data processing, and it supports customized speech data collection by scenario, language, age, gender, and speaker.
To provide customers with safe and compliant data services while ensuring its own security and compliance, Datatang has established a security compliance system for its data business in accordance with the data laws and policies of major countries around the world. At Datatang, data is collected only under an authorization letter signed by the person whose data is being collected.
Datatang has a professional recording studio equipped with vocal condenser microphones and monitoring equipment. The studio complies with the NR15 acoustic standard: the reverberation time is less than 0.1 seconds and the background noise is below 20 dB. It has been certified by the Building Physics Laboratory of Tsinghua University.
Datatang has thousands of speakers and hundreds of professional teams around the world, and supports speech synthesis in multiple languages, such as Mandarin Chinese, English, and Japanese, as well as mixed Chinese-English reading. In addition, Datatang offers a variety of timbres, including male, female, and children's voices. Each timbre covers different types of speakers, which meets the requirements of diverse speech synthesis applications.
During recording, Datatang uses professional monitoring to ensure recording quality. By consulting experts and research papers, and by referring to the pronunciation of words in various dictionaries, Google Translate, and Baidu Translate, Datatang has compiled a complete set of pronunciation rules and built a pronunciation dictionary.
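As a rough illustration of how a pronunciation dictionary is typically used in a TTS pipeline, the sketch below maps words to phoneme sequences and flags words the dictionary does not cover. The tiny lexicon, its ARPAbet-style symbols, and the function names are all hypothetical; they are not Datatang's dictionary or format.

```python
import re

# Hypothetical miniature lexicon: each word maps to one or more pronunciations.
# Symbols are ARPAbet-style and chosen only for illustration.
LEXICON = {
    "speech": [["S", "P", "IY1", "CH"]],
    "data": [["D", "EY1", "T", "AH0"], ["D", "AE1", "T", "AH0"]],
    "read": [["R", "IY1", "D"], ["R", "EH1", "D"]],
}


def transcribe(text: str) -> tuple[list[list[str]], list[str]]:
    """Return the first listed pronunciation of each known word, plus unknown words."""
    pronunciations: list[list[str]] = []
    out_of_vocabulary: list[str] = []
    for word in re.findall(r"[a-z']+", text.lower()):
        variants = LEXICON.get(word)
        if variants:
            # A real system would pick a variant from context; here we take the first.
            pronunciations.append(variants[0])
        else:
            out_of_vocabulary.append(word)
    return pronunciations, out_of_vocabulary


phonemes, unknown = transcribe("Read speech data aloud")
print(phonemes)
print("Needs a dictionary entry:", unknown)
```

Words that land in the unknown list are the ones that would need a new entry or a rule-based fallback, which is where compiled pronunciation rules come into play.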
Off-the-Shelf TTS Speech Datasets
The corpus is recorded by native American English speakers with an authentic accent and a pleasant voice. The phonemes and tones are balanced, and a professional phonetician participates in the annotation.
The data is recorded by native American English speakers with an authentic accent and a pleasant voice. The phonemes and tones are balanced, and a professional phonetician participates in the annotation.
The corpus is recorded by native Japanese speakers with an authentic accent and a pleasant voice. The phonemes and tones are balanced, and a professional phonetician participates in the annotation.
The corpus is recorded by native Chinese speakers reading customer service text. The syllables, phonemes, and tones are balanced, and a professional phonetician participates in the annotation.
The data is recorded by a native Chinese speaker reading emotional text. The syllables, phonemes, and tones are balanced, and a professional phonetician participates in the annotation.
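One simple way to check whether a recording script is phonetically balanced is to count how often each phoneme appears and to list any phoneme in the target inventory that never occurs. The sketch below assumes utterances have already been converted to phoneme sequences; the inventory, the sample utterances, and the function name are hypothetical and are not taken from the datasets described above.

```python
from collections import Counter


def phoneme_stats(utterances: list[list[str]], inventory: list[str]):
    """Count phonemes across utterances and report each one's share and any gaps."""
    counts = Counter(p for utterance in utterances for p in utterance)
    total = sum(counts.values())
    # Share of all phoneme tokens taken by each inventory phoneme.
    shares = {p: (counts[p] / total if total else 0.0) for p in inventory}
    # Inventory phonemes that the script never exercises.
    missing = sorted(set(inventory) - set(counts))
    return counts, shares, missing


# Hypothetical inventory and phoneme-level script, for illustration only.
inventory = ["AA", "IY", "K", "S", "T", "ZH"]
utterances = [["S", "IY", "T"], ["K", "AA", "T"], ["S", "IY"]]

counts, shares, missing = phoneme_stats(utterances, inventory)
for phoneme in inventory:
    print(f"{phoneme}: {counts[phoneme]} occurrences, {shares[phoneme]:.1%} of tokens")
print("Not covered by the script:", missing)
```

A script designer would then add sentences that contain the missing or under-represented phonemes until the distribution is acceptably even.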
If you need data services, please feel free to contact us: firstname.lastname@example.org.