Natural Scenes OCR: Use High-quality Training Data to Optimize AI Model

Datatang
Aug 14, 2022

Optical character recognition (OCR) is the process by which electronic devices such as scanners or digital cameras examine the characters in an image and translate their shapes into machine-readable text using character recognition methods.

OCR has a wide range of application scenarios, including natural scenes, handwriting, and document recognition. Natural-scene OCR is one of the most widely used of these and is in high market demand. It is part of people’s daily lives: the text carriers are typically store plaques, stop signs, posters, road signs, cartoons, manhole cover paintings, prompts, warnings, packaging instructions, menus, building signs, etc.

Natural Scene OCR Data Annotation

Depending on how fine-grained the labeling is, annotation is usually divided into text line-level labeling and character-level labeling (word-level labeling is also performed for text in Latin-script languages). The usual labeling method is a text box plus a transcription of the characters inside it. Depending on the task requirements, the text box can be a rectangle or a quadrilateral.
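To make the "text box + transcription" idea concrete, here is a minimal sketch of what one annotation record could look like. The class name, field names, and corner ordering are assumptions made for illustration; they are not Datatang's actual delivery format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class TextAnnotation:
    """Hypothetical record for one labeled piece of text (not an official schema)."""
    level: str          # "line", "word", or "character"
    box: List[Point]    # four corner points; assumed order: clockwise from top-left
    transcription: str  # the characters inside the box

    def to_rectangle(self) -> Tuple[float, float, float, float]:
        """Smallest axis-aligned rectangle (x_min, y_min, x_max, y_max) around the box."""
        xs = [x for x, _ in self.box]
        ys = [y for _, y in self.box]
        return (min(xs), min(ys), max(xs), max(ys))


# A slightly tilted line of text on a stop sign, labeled with a quadrilateral box.
sign = TextAnnotation(
    level="line",
    box=[(10, 20), (110, 30), (108, 60), (8, 50)],
    transcription="STOP",
)
print(sign.to_rectangle())  # (8, 20, 110, 60)
```

A quadrilateral follows tilted text more tightly, while the rectangle derived from it is simpler but also includes some background around the text. That trade-off is one reason the box type depends on the task.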

Challenges of Natural Scene OCR Tasks

From a technical point of view, natural-scene OCR poses the following four difficulties:

● Language

Different countries use different languages, and character shapes vary greatly from one language to another, which makes recognition harder for OCR algorithms.

● Complicated Font

In natural scenes, text is often set in artistic fonts, whose appearance differs considerably from that of standard fonts. Varying text sizes and colors further increase the difficulty of OCR tasks.

● Various Shooting Angles

Most users shoot text with mobile phones. Because shooting habits differ from user to user, photos are taken from a wide range of angles, which challenges the robustness of the OCR algorithm to tilted text.

● Diverse Character Carriers

Text in natural scenes appears on a wide variety of carriers, and some carriers distort the text. For example, food packaging is often deformed, which bends the text and increases the difficulty of OCR tasks.
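For the shooting-angle difficulty in particular, one rough way to see how much tilt a dataset contains is to measure the angle of each annotated text box. The sketch below is only an illustration: the function names, the corner ordering (top-left first, then top-right), and the 10-degree threshold are all assumptions, not part of any Datatang tool.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def top_edge_angle(box: List[Point]) -> float:
    """Angle of the box's top edge in degrees, relative to the horizontal image axis.

    Assumes box[0] is the top-left corner and box[1] is the top-right corner.
    Image y-coordinates grow downward, so a positive angle means the text
    slopes downward to the right.
    """
    (x0, y0), (x1, y1) = box[0], box[1]
    return math.degrees(math.atan2(y1 - y0, x1 - x0))


def is_tilted(box: List[Point], threshold_deg: float = 10.0) -> bool:
    """True if the text line deviates from horizontal by more than the threshold."""
    return abs(top_edge_angle(box)) > threshold_deg


level_box = [(0, 0), (100, 0), (100, 20), (0, 20)]
slanted_box = [(0, 0), (100, 100), (80, 120), (-20, 20)]

print(is_tilted(level_box))    # False
print(is_tilted(slanted_box))  # True
```

Counting how many boxes exceed the threshold gives a quick, approximate picture of how much angle variation the training data covers.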

To address the needs and difficulties of OCR tasks in natural scenes, Datatang has developed a series of datasets covering multiple languages, multiple scenes, and multiple shooting angles.

Natural Scenes OCR Data of 12 Languages

The data covers 12 languages (6 Asian languages and 6 European languages), multiple natural scenes, and multiple photographic angles. The text is annotated with line-level quadrilateral bounding boxes and transcriptions.

English OCR Data in Natural Scenes

This dataset was collected from real scenes in Britain and the United States. It covers multiple scenes, photographic angles, and lighting conditions. The annotation uses line-level, word-level, and character-level bounding boxes (rectangular or quadrilateral), together with text transcription.

Handwriting OCR Data of Japanese and Korean

This dataset was collected from 100 subjects: 50 Japanese, 49 Koreans, and 1 Afghan. The corpus differs from subject to subject. The data covers multiple cellphone models and different corpora.

Hindi OCR Images Data — Images with Annotation and Transcription

The data includes 2,056 images of natural scenes, 1,103 Internet images, and 347 document images. Line-level content is annotated with line-level quadrilateral bounding boxes and text transcription; column-level content is annotated with column-level quadrilateral bounding boxes and text transcription.

Vietnamese OCR Images Data — Images with Annotation and Transcription

The data includes 258 images of natural scenes, 2,553 Internet images, and 2,184 document images. Line-level content is annotated with line-level quadrilateral bounding boxes and text transcription; column-level content is annotated with column-level quadrilateral bounding boxes and text transcription.
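The image counts quoted for the Hindi and Vietnamese datasets can be added up to see how each one is composed. The short sketch below uses only the figures stated above; the dictionary layout and function names are illustrative.

```python
# Image counts as stated in the Hindi and Vietnamese dataset descriptions.
datasets = {
    "Hindi": {"natural scenes": 2056, "Internet": 1103, "document": 347},
    "Vietnamese": {"natural scenes": 258, "Internet": 2553, "document": 2184},
}


def total_images(counts: dict) -> int:
    """Total number of images across all source types."""
    return sum(counts.values())


def share(counts: dict, source: str) -> float:
    """Fraction of the dataset that comes from the given source type."""
    return counts[source] / total_images(counts)


for name, counts in datasets.items():
    print(f"{name}: {total_images(counts)} images, "
          f"{share(counts, 'natural scenes'):.1%} from natural scenes")
```

The totals come to 3,506 images for Hindi and 4,995 for Vietnamese, with natural scenes making up a much larger share of the Hindi data than of the Vietnamese data.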

End

If you would like more details about the datasets or how to acquire them, please feel free to contact us: info@datatang.com.
