Improve Speech Emotion Recognition with High-Quality Datasets
With the rise of deep learning, emotion recognition methods based on deep neural networks have been widely adopted. Speech Emotion Recognition (SER) is a computer simulation of how humans perceive and understand emotion: a computer analyzes the speech signal, extracts emotional feature values, builds a model that maps those feature values to emotions, and finally classifies the emotion.
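The pipeline described above can be sketched in a few lines. This is a minimal, illustrative example only: the feature values, the emotion set, and the nearest-centroid "model" are placeholders standing in for a real trained classifier, not any system mentioned in this article.

```python
import numpy as np

EMOTIONS = ["neutral", "happy", "angry"]  # placeholder emotion set

def extract_features(waveform: np.ndarray) -> np.ndarray:
    """Placeholder feature values: overall loudness and variability."""
    return np.array([np.mean(np.abs(waveform)), np.std(waveform)])

def recognize_emotion(waveform: np.ndarray, centroids: np.ndarray) -> str:
    """Map feature values to an emotion via the nearest centroid."""
    feats = extract_features(waveform)
    dists = np.linalg.norm(centroids - feats, axis=1)
    return EMOTIONS[int(np.argmin(dists))]

# Centroids standing in for a mapping learned from labeled speech.
centroids = np.array([[0.2, 0.3],   # neutral
                      [0.5, 0.6],   # happy
                      [0.8, 0.9]])  # angry

wave = 0.3 * np.sin(np.linspace(0, 50, 8000))  # toy "utterance"
print(recognize_emotion(wave, centroids))
```

A real system would replace both placeholder stages: the feature extractor with acoustic descriptors or learned embeddings, and the centroid lookup with a trained neural network.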
Speech emotion recognition matters in many scenarios, such as education, medical care, intelligent entertainment, e-commerce, car driving, lie-detection assistance, and human-computer interaction.
● Education
A speech emotion recognition system can grasp students' emotional states in real time: it analyzes and distinguishes the emotions in students' responses as they arrive, so teachers can understand each student's real emotional state promptly, give quick feedback, and adjust the lesson, greatly enhancing classroom effectiveness and students' learning efficiency.
● Customer Service
Ordinary automated customer service can only answer customers' questions mechanically and repetitively; this lack of flexibility makes some customers resistant and drives them away. Speech emotion recognition addresses this directly: when it detects negative fluctuations in a customer's emotions, it switches the conversation to a human agent in time, effectively reducing customer churn.
● Medical Care
When doctors and patients struggle to communicate, a speech emotion recognition system can play an important role. Faced with a patient who has mood swings, resists conversation, or is mentally traumatized and hard to reach, the system responds quickly, analyzes the patient's psychological state in the moment, interacts with the patient emotionally, and helps calm them down.
● Elderly Care
For the elderly living alone at home, a speech emotion system automatically recognizes their emotional fluctuations, communicates with them effectively, and tries to create a healthy living environment through spiritual comfort and whatever help it can provide.
The Challenges of Speech Emotion Recognition
1. Emotional subjectivity and ambiguity
Speech emotion recognition is a relatively young field that lacks an official standard for defining emotions. Different listeners may perceive the emotion of the same utterance differently. Moreover, the emotion within a single utterance often changes over time and is highly subjective, which limits the generality of much research work.
2. Emotion feature extraction and selection
Speakers vary, emotion categories vary, and speech clips differ in length; these problems mean that hand-designed features cannot cover all emotional information. Deep features, on the other hand, are effective but hard to interpret.
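To make "hand-designed features" concrete, the sketch below computes two classic frame-level descriptors, short-time energy and zero-crossing rate, over a synthetic signal. The 25 ms frame / 10 ms hop sizes at 16 kHz are common conventions, not a fixed standard, and the pure tone stands in for real speech.

```python
import numpy as np

def frame_signal(x: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Slice a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def short_time_energy(frames: np.ndarray) -> np.ndarray:
    """Mean squared amplitude per frame (roughly, loudness)."""
    return np.mean(frames ** 2, axis=1)

def zero_crossing_rate(frames: np.ndarray) -> np.ndarray:
    """Fraction of sample pairs per frame where the sign flips."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

sr = 16000
t = np.arange(sr) / sr                      # 1 second of audio
x = np.sin(2 * np.pi * 220 * t)             # a 220 Hz tone as a stand-in
frames = frame_signal(x)
feats = np.stack([short_time_energy(frames),
                  zero_crossing_rate(frames)], axis=1)
print(feats.shape)                          # one (energy, zcr) pair per frame
```

Real systems typically use richer descriptor sets (pitch, spectral, and cepstral statistics), but the limitation discussed above applies to all of them: each descriptor is a fixed human choice and can only capture what it was designed to measure.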
3. Lack of labeled data
The strong performance of deep learning methods depends on a large amount of high-quality labeled data. Because emotion is subjective and ambiguous, labeling speech emotion is time-consuming, labor-intensive, and requires many trained annotators. Collecting large amounts of emotion-annotated data is therefore an urgent problem in the field of speech emotion recognition.
Data is the cornerstone of machine learning, and large-scale, high-quality data is the key to its success. Datatang offers off-the-shelf, multi-scenario speech datasets to fuel the development of speech emotion recognition.
● 1,003 People — Emotional Video Data
This dataset covers multiple races, indoor scenes, age groups, and languages, and multiple emotions (11 types of facial emotion, 15 types of inner emotion). For each sentence in each video, the emotion types (facial and inner), start and end times, and a text transcription were annotated. The dataset can be used for tasks such as emotion recognition and sentiment analysis.
● 20 People-English Emotional Speech Data by Microphone
English emotional audio data captured by microphone: 20 native American English speakers participated in the recording, each contributing 2,100 sentences. The recording scripts cover 10 emotions, including anger, happiness, and sadness. The speech was recorded with a high-fidelity microphone and is therefore of high quality. The dataset is suited to the analysis and detection of emotional speech.
● 13.3 Hours — Chinese Mandarin Synthesis Corpus-Female, Emotional
This corpus was recorded by a native Chinese speaker reading emotional text, with balanced coverage of syllables, phonemes, and tones. Professional phoneticians participated in the annotation. It precisely matches the research and development needs of speech synthesis.
● 30 Movies — Audio and Video Annotation Data
This dataset contains 39,891 audio clips; each clip corresponds to one video file and one JSON file. The annotations cover audio content, role name (main characters), emotion, and timestamps. The dataset can be used for speech recognition, emotional speech analysis, and other tasks.
If you would like more details about these datasets or how to acquire them, please feel free to contact us: firstname.lastname@example.org.