myOCR: Optical Character Recognition for Myanmar language with Post-OCR Error Correction
Abstract
This paper presents myOCR, an Optical Character Recognition (OCR) system for the Myanmar language. It is trained on a synthetic text-image dataset of 25,790 text images covering 14 different font styles. The system uses Convolutional Neural Networks (CNN) for feature extraction, Bidirectional Long Short-Term Memory (BiLSTM) networks for sequence modeling, and Connectionist Temporal Classification (CTC) for decoding, evaluated across training iterations (3,000, 6,000, 9,000) and hidden-state sizes (64, 128, 256). Statistical post-OCR correction combines n-grams (n = 3, 4, 5) and edit distances with the Symmetric Delete spelling correction algorithm (SymSpell). For Neural Machine Translation-based correction, BiLSTM and Transformer models are employed, while the mT5-base and mBART-50 models are used for LLM-based correction. The best base (optical) model, trained for 9,000 iterations, achieved a chrF++ score of over 97.90 and a Word Error Rate (WER) of 9.18%. Transformer-based correction improved its chrF++ to 99.31 and reduced the WER to 0.66%.
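The core idea behind the SymSpell algorithm mentioned above is symmetric deletion: instead of generating all insertions, substitutions, and transpositions of a misspelled token, both the dictionary terms and the query are reduced to their delete variants, and candidates are found by matching deletes against deletes. The sketch below is a minimal, hedged illustration of that principle in plain Python; it is not the paper's implementation, and the function names (`deletes`, `build_index`, `correct`) are illustrative.

```python
def deletes(word, max_dist=1):
    """All strings reachable from `word` by deleting up to max_dist characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_dist):
        # delete one character from every string in the current frontier
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results

def build_index(dictionary, max_dist=1):
    """Map each delete variant of each dictionary term back to the term."""
    index = {}
    for term in dictionary:
        for d in deletes(term, max_dist):
            index.setdefault(d, set()).add(term)
    return index

def correct(word, index, max_dist=1):
    """Return dictionary terms whose delete variants overlap the query's."""
    candidates = set()
    for d in deletes(word, max_dist):
        candidates |= index.get(d, set())
    return candidates
```

For example, with a dictionary containing "hello", the query "helo" shares the delete variant "helo" with it, so "hello" is returned as a candidate. A production system (as in the paper's pipeline) would additionally rank candidates by edit distance and word frequency.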