Recently, AI expert Andrew Ng shared his 2022 AI trend forecast on the DeepLearning.AI platform, in which he predicted that multimodal AI will be the future of the field.
Multimodal refers to combining different types of data, such as text, images, audio, and video. Research on multimodal AI dates back decades. In 1989, researchers at Johns Hopkins University and the University of California, San Diego developed a system that could classify vowels based on audio and visual data of people speaking. Over the following two decades, research teams have experimented with multimodal applications such as searching digital video libraries and classifying human emotions from audio-visual data.
In the past, AI models were largely limited to single-modal tasks, such as speech or vision. In 2021, there were many multimodal AI achievements, such as the CLIP and DALL•E models published by OpenAI, which can process text and images at the same time and generate pictures from text prompts. DeepMind’s Perceiver IO can classify text, images, videos, and point clouds.
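The core idea behind models like CLIP is to encode each modality into a shared embedding space and then compare the embeddings, so that an image can be matched against candidate text descriptions. The sketch below illustrates that matching step with hand-made dummy vectors; in a real system the embeddings would come from trained image and text encoders, so every vector and caption here is a hypothetical placeholder.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Dummy embeddings standing in for the outputs of trained image and
# text encoders that project into the same shared space.
image_embedding = [0.9, 0.1, 0.3]
text_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.4],
    "a photo of a car": [0.1, 0.9, 0.2],
}

# Pick the caption whose embedding is closest to the image embedding.
best_caption = max(
    text_embeddings,
    key=lambda caption: cosine_similarity(image_embedding, text_embeddings[caption]),
)
print(best_caption)
```

With these placeholder vectors, the "dog" caption scores higher, which is the same zero-shot classification trick CLIP uses at scale: rank a set of candidate texts by their similarity to the image embedding.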
Although most of these new multimodal systems are still experimental, breakthroughs have been made in practical applications. Facebook says its multimodal detector can flag and remove 97% of abusive and harmful content on the social network; the system classifies image-text pairings as benign or harmful based on 10 data types, including text, images, and videos. Google said it will add multimodal capabilities to its search engine to handle text, audio, image, and video content, which users will be able to use in any of 75 languages.
As one of the world’s leading AI data service providers, Datatang has for many years adhered to the corporate vision of “empowering AI with data and changing the world with intelligence.” Datatang has accumulated 200,000 hours of off-the-shelf speech data, more than 100 image and video datasets, and 2 billion pieces of text data. Datatang strictly complies with the ISO 27701 privacy management system and the ISO 27001 information security management system. All data has been authorized by the individuals it was collected from, so customers can use it with confidence.
Speech Data

The dataset contains 1,000 hours of American English conversational speech recorded by 2,000 native speakers. The speakers start each conversation around a familiar topic to ensure the conversation is smooth and natural.
Indian English audio data captured by mobile phones: 1,012 hours in total, recorded by 2,100 native Indian speakers. The recorded text was designed by linguistic experts and covers generic, interactive, in-car, home, and other categories. The sentence accuracy is over 95%.
The data is recorded by native Korean speakers. The recorded text is a mixture of Korean and English sentences, covering general scenes and human-computer interaction scenes.
The data covers ages from preschool (3–5 years old) to school age (6–12 years old), capturing children’s speech features. The content accurately matches the real-life scenarios in which children speak English.
Image & Video Data
The dataset’s diversity includes multiple scenes, lighting conditions, ages, shooting angles, and poses. For annotation, we use instance segmentation of the human body, and 22 landmarks are annotated for each body.
The data includes both male and female subjects, with an age distribution ranging from children to the elderly. Seven images were collected for each person. The data diversity includes different facial poses, expressions, lighting conditions, and scenes.
The data was collected from 1,078 Chinese people. Data was collected from each subject once a week, six times in total, for a time span of six weeks.
If you need data services, please feel free to contact us: firstname.lastname@example.org.