Leverage High-Quality Data to Power Multimodal AI Training

Recently, AI expert Andrew Ng shared his 2022 AI trend forecast on the DeepLearning.AI platform, predicting that multimodal AI will be the future of the field.

Multimodal refers to different types of data, such as text, images, audio, and video. Research on multimodal AI dates back decades. In 1989, researchers at Johns Hopkins University and the University of California, San Diego developed a system that could classify vowels based on audio and visual data of people speaking. Over the next two decades, research teams have experimented with multimodal applications, such as searching digital video libraries and classifying human emotions based on audio-visual data.

In the past, AI models were largely limited to unimodal tasks, such as speech or vision. In 2021, there were many multimodal AI achievements, such as the CLIP and DALL·E models published by OpenAI, which process text and images jointly and can generate pictures from text prompts. DeepMind's Perceiver IO can classify text, images, videos, and point clouds.
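The core idea behind CLIP-style models is to embed both images and text into a shared vector space, then match them by similarity. Below is a minimal NumPy sketch of that contrastive-matching step, using tiny hand-made vectors rather than real model outputs (actual CLIP embeddings are 512-dimensional and produced by the trained encoders):

```python
import numpy as np

def caption_probabilities(image_emb, text_embs):
    """Score candidate captions against one image, CLIP-style.

    Embeddings are L2-normalized so the dot product is cosine
    similarity; a softmax then turns similarities into a
    probability distribution over the captions.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img
    exp = np.exp(logits - logits.max())  # subtract max for stability
    return exp / exp.sum()

# Toy embeddings standing in for encoder outputs (hypothetical values).
image_emb = np.array([0.9, 0.1, 0.0])          # encoded photo
text_embs = np.array([
    [1.0, 0.0, 0.0],                           # "a photo of a dog"
    [0.0, 1.0, 0.0],                           # "a photo of a cat"
])

probs = caption_probabilities(image_emb, text_embs)
print(probs.argmax())  # caption 0 ("a photo of a dog") matches best
```

This matching step is also what enables zero-shot classification: by phrasing class labels as captions, the model can label images it was never explicitly trained to classify.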

Although most of these new multimodal systems are still experimental, breakthroughs have been made in practical applications. Facebook says its multimodal hate speech detector can flag and remove 97% of abusive and harmful content on the social network, classifying image-text pairings as benign or harmful based on 10 data types including text, images, and videos. Google said it will add multimodal capabilities to its search engine to handle text, audio, image, and video content, available in any of 75 languages.

As a leading AI data service provider, Datatang has for many years adhered to the corporate vision of "empowering AI with data and changing the world with intelligence." Datatang has accumulated 200,000 hours of off-the-shelf speech data, more than 100 image and video datasets, and 2 billion pieces of text data. Datatang strictly complies with the ISO 27701 privacy management system and the ISO 27001 information security management system. All data has been authorized by its contributors, so customers can use it with confidence.

Speech Data

American English Natural Dialogue Speech Data

The dataset contains 1,000 hours of American English conversational speech data, recorded by 2,000 native speakers. The speakers start each conversation around a familiar topic to ensure the conversation flows smoothly and naturally.

Indian English Speech Data by Mobile Phone

Indian English audio data captured by mobile phones, 1,012 hours in total, recorded by 2,100 native Indian speakers. The recorded text was designed by linguistic experts, covering generic, interactive, in-car, home, and other categories. The sentence accuracy is over 95%.

Mixed Speech with Korean and English Data by Mobile Phone

The data is recorded by native Korean speakers. The recorded text is a mixture of Korean and English sentences, covering general scenes and human-computer interaction scenes.

Chinese Children Speaking English Speech Data by Mobile Phone

The data covers ages from preschool (3–5 years old) to school age (6–12 years old), capturing children's speech features. The content accurately matches the real-world scenarios in which children speak English.

Image & Video Data

3D Instance Segmentation and 22 Landmarks Annotation Data of Human Body

The dataset's diversity spans multiple scenes, lighting conditions, ages, shooting angles, and poses. Each human body is annotated with instance segmentation, and 22 landmarks are annotated per body.

Multi-race 7 Expressions Recognition Data

The data includes both male and female subjects, with ages ranging from children to the elderly. Seven images were collected per person. The data diversity covers different facial postures, expressions, lighting conditions, and scenes.

3D Faces Collection Data

The data was collected from 1,078 Chinese subjects. Each subject was recorded once a week, six times in total, for a time span of six weeks.

End

If you need data services, please feel free to contact us: info@datatang.com.

Off-the-shelf AI training data, on-demand data collection & annotation services
