Top 8 Open Source OCR Datasets

Datatang
3 min read · Dec 2, 2022

Optical character recognition (OCR) is the task in which an electronic device, such as a scanner or a digital camera, examines the characters in an image and uses character recognition methods to translate their shapes into machine-readable text. Applications of OCR include automated data entry for business documents, translation apps, online databases, security cameras that automatically recognize license plates, and more.

In this article, I have compiled some commonly used datasets in the field of OCR research.

1. COCO-Text

The COCO-Text dataset contains 63,686 images with 145,859 cropped text instances. It is the first large-scale dataset for text in natural images and also the first dataset to annotate scene text with attributes such as legibility and type of text. However, no lexicon is associated with COCO-Text.
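Attributes such as legibility and text type are typically used to select a subset of instances before training or evaluation. The following is a minimal sketch of that idea. The record layout and the field names (`legibility`, `text_class`, `transcript`) are assumptions made for this example and are not taken from the COCO-Text annotation files, so check the dataset's own documentation for the real schema.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class TextInstance:
    # Hypothetical record layout, for illustration only.
    image_id: int
    transcript: str
    legibility: str  # assumed values: "legible" or "illegible"
    text_class: str  # assumed values: "machine printed" or "handwritten"


def select_instances(
    instances: Iterable[TextInstance],
    legibility: str = "legible",
    text_class: str = "machine printed",
) -> List[TextInstance]:
    """Keep only the instances whose attributes match the requested values."""
    return [
        inst
        for inst in instances
        if inst.legibility == legibility and inst.text_class == text_class
    ]


# Toy records standing in for real annotations.
sample = [
    TextInstance(1, "STOP", "legible", "machine printed"),
    TextInstance(1, "", "illegible", "machine printed"),
    TextInstance(2, "cafe", "legible", "handwritten"),
]

kept = select_instances(sample)
print([inst.transcript for inst in kept])  # ['STOP']
```

Filtering on legibility first is a common choice because illegible instances have no reliable transcription to train or score against.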

2. SynthText (ST)

SynthText (ST) is often described as the ImageNet of OCR. The dataset is synthetic: about 8 million text instances are rendered onto 800,000 images. The text is not simply overlaid; it is processed to blend in with the scene so that it looks more natural in the picture.

3. IIIT5K

The IIIT5K dataset contains 5,000 text instance images: 2,000 for training and 3,000 for testing. It contains words from street scenes and from born-digital images. Every image is associated with a 50-word lexicon and a 1,000-word lexicon. Each lexicon consists of the ground-truth word plus randomly picked words.
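To make the role of a lexicon concrete, here is a minimal sketch of how one could be assembled from a ground-truth word and random distractors, and how a recognizer's raw output could then be snapped to the closest lexicon entry. The function names, the toy vocabulary, and the use of edit distance for matching are assumptions for illustration; this is not the procedure the IIIT5K authors used to build their lexicons.

```python
import random
from typing import List, Sequence


def build_lexicon(
    ground_truth: str, vocabulary: Sequence[str], size: int, seed: int = 0
) -> List[str]:
    """Return `size` words: the ground truth plus randomly picked distractors."""
    rng = random.Random(seed)
    truth = ground_truth.lower()
    # Deduplicate and sort so the sampling is reproducible for a given seed.
    candidates = sorted({w.lower() for w in vocabulary} - {truth})
    if size - 1 > len(candidates):
        raise ValueError("vocabulary is too small for the requested lexicon size")
    lexicon = rng.sample(candidates, size - 1) + [truth]
    rng.shuffle(lexicon)
    return lexicon


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def constrain_to_lexicon(prediction: str, lexicon: Sequence[str]) -> str:
    """Pick the lexicon word closest to the raw prediction."""
    return min(lexicon, key=lambda w: edit_distance(prediction.lower(), w))


# Toy vocabulary standing in for a real word list.
vocabulary = ["hotel", "house", "horse", "market", "street", "coffee", "parking", "school"]
lexicon = build_lexicon("house", vocabulary, size=5)
print(constrain_to_lexicon("h0use", lexicon))  # house
```

A smaller lexicon makes the task easier because fewer distractors can sit close to a noisy prediction, which is why results are usually reported separately for each lexicon size.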

4. SVT (Street View Text)

The SVT dataset contains 350 images: 100 for training and 250 for testing. Some images are severely corrupted by noise, blur, and low resolution. Each image is associated with a 50-word lexicon.

5. CUTE80

The CUTE80 dataset contains 80 high-resolution images with 288 cropped text instances. It focuses on curved text recognition. Most images in CUTE80 have a complex background, perspective distortion, and poor resolution. No lexicon is associated with CUTE80.

6. SVHN

The SVHN dataset contains more than 600,000 digits of house numbers in natural scenes. It was obtained from a large number of street view images using a combination of automated algorithms and the Amazon Mechanical Turk (AMT) framework. The SVHN dataset is typically used for scene digit recognition.

7. RCTW-17

The RCTW-17 dataset contains 12,514 images: 11,514 for training and 1,000 for testing. Most are natural images collected by cameras or mobile phones, while the others are born-digital. Text instances are annotated with labels, fonts, languages, etc.

8. MLT (MLT competition, ICDAR 2019)

The MLT-2019 dataset contains 20,000 images: 10,000 for training (1,000 per language) and 10,000 for testing. It covers ten languages (Arabic, Bangla, Chinese, Devanagari, English, French, German, Italian, Japanese, and Korean), which together represent seven different scripts. The number of images per script is equal.

About Datatang

Founded in 2011, Datatang is a professional artificial intelligence data service provider committed to providing high-quality training data and data services for AI companies worldwide. Drawing on its own data resources, technical strengths, and extensive data processing experience, Datatang provides data services to more than 1,000 companies and institutions worldwide. Datatang entered the Chinese stock market (NEEQ: 831428) in 2014, becoming the first listed company in China’s artificial intelligence data service industry.

If you need data services, please feel free to contact us: info@datatang.com
