Training Strategies for OCR Systems for Historical Documents

This paper presents an overview of training strategies for optical character recognition (OCR) of historical documents. The main issue is the lack of annotated data and the quality of the data that is available. We summarize several ways of preparing synthetic data. The main goal of this paper is to show and compare ways to train a convolutional recurrent neural network (CRNN) classifier using synthetic data and its combination with a real annotated dataset.
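
As an illustration only, the idea of combining synthetic and real annotated data can be sketched as a per-epoch sampling schedule. The three strategy names below, the `build_schedule` helper, and the `(image path, transcription)` sample layout are assumptions made for this sketch; the abstract does not specify which strategies the paper compares, and the CRNN training itself is omitted.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical sample layout: (path to a text-line image, its transcription).
Sample = Tuple[str, str]


def build_schedule(
    synthetic: Sequence[Sample],
    real: Sequence[Sample],
    strategy: str,
    epochs: int = 4,
    seed: int = 0,
) -> List[List[Sample]]:
    """Return one shuffled list of training samples per epoch.

    Assumed strategies (illustrative, not taken from the paper):
      - "synthetic_only": every epoch uses synthetic samples only.
      - "mixed": every epoch uses synthetic and real samples together.
      - "pretrain_then_finetune": the first half of the epochs uses
        synthetic samples, the remaining epochs use real samples.
    """
    rng = random.Random(seed)
    schedule: List[List[Sample]] = []
    for epoch in range(epochs):
        if strategy == "synthetic_only":
            pool = list(synthetic)
        elif strategy == "mixed":
            pool = list(synthetic) + list(real)
        elif strategy == "pretrain_then_finetune":
            pool = list(synthetic) if epoch < epochs // 2 else list(real)
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
        rng.shuffle(pool)  # shuffle within the epoch only
        schedule.append(pool)
    return schedule


# Toy data standing in for rendered synthetic lines and annotated scans.
synthetic = [(f"syn_{i}.png", f"text {i}") for i in range(6)]
real = [(f"real_{i}.png", f"text {i}") for i in range(2)]

schedule = build_schedule(synthetic, real, "pretrain_then_finetune", epochs=4)
mixed = build_schedule(synthetic, real, "mixed", epochs=2)
```

Each inner list would then be batched and fed to whatever CRNN training loop is in use; comparing strategies amounts to training one model per schedule and evaluating all of them on the same held-out set of real annotated lines.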

Domains

Computer Science [cs]

Main file

483292_1_En_30_Chapter.pdf (968.48 KB)

Origin: Files produced by the author(s)
Licence: CC BY 4.0 - Attribution

https://inria.hal.science/hal-02331288

Submitted on: Thursday, October 24, 2019, 12:49:35 PM

Last modification on: Wednesday, May 7, 2025, 1:40:02 PM

Long-term archiving on: Saturday, January 25, 2020, 3:20:16 PM

Dates and versions

hal-02331288, version 1 (24-10-2019)

Identifiers

HAL Id: hal-02331288, version 1
DOI: 10.1007/978-3-030-19823-7_30

Cite

Jiří Martínek, Ladislav Lenc, Pavel Král. Training Strategies for OCR Systems for Historical Documents. 15th IFIP International Conference on Artificial Intelligence Applications and Innovations (AIAI), May 2019, Hersonissos, Greece. pp.362-373, ⟨10.1007/978-3-030-19823-7_30⟩. ⟨hal-02331288⟩