myOCR: Optical Character Recognition for Myanmar language with Post-OCR Error Correction

dc.contributor.author: Thura Aung
dc.contributor.author: Ye Kyaw Thu
dc.contributor.author: Myat Noe Oo
dc.date.accessioned: 2026-05-08T19:20:30Z
dc.date.issued: 2024-11-11
dc.description.abstract: This paper presents myOCR, an Optical Character Recognition (OCR) system for the Myanmar language. It is trained on a synthetic text-image dataset of 25,790 images spanning 14 different font styles. The system combines Convolutional Neural Networks (CNN) for feature extraction, Bidirectional Long Short-Term Memory (BiLSTM) networks for sequence modeling, and Connectionist Temporal Classification (CTC) for decoding, evaluated across training iterations (3,000, 6,000, 9,000) and hidden-state sizes (64, 128, 256). Statistical post-OCR correction methods involve N-grams (N = 3, 4, 5) and edit distances with the Symmetric Delete spelling correction algorithm (SymSpell). For Neural Machine Translation-based correction, BiLSTM and Transformer models are employed, while the mT5-base and mBART-50 models are used for LLM-based correction. The best base (optical) model, trained for 9,000 iterations, achieved a chrF++ score of over 97.90 and a Word Error Rate (WER) of 9.18%; Transformer-based correction improved its chrF++ to 99.31 and reduced the WER to 0.66%.
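For readers unfamiliar with the WER metric reported in the abstract, the sketch below shows one standard way to compute it: word-level Levenshtein distance normalized by reference length. This is an illustrative implementation, not the evaluation code used by the authors.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words.

    Counts the minimum number of word substitutions, insertions, and
    deletions needed to turn the hypothesis into the reference.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edits to transform the first i reference words
    # into the first j hypothesis words (classic dynamic program).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `wer("a b c", "a x c")` is 1/3 (one substitution over three reference words). A post-OCR correction step lowers WER whenever it repairs more recognition errors than it introduces, which is how the drop from 9.18% to 0.66% should be read.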
dc.identifier.doi: 10.1109/isai-nlp64410.2024.10799448
dc.identifier.uri: https://dspace.kmitl.ac.th/handle/123456789/17573
dc.subject: Handwritten Text Recognition Techniques
dc.subject: Speech Recognition and Synthesis
dc.subject: Computer Science and Engineering
dc.title: myOCR: Optical Character Recognition for Myanmar language with Post-OCR Error Correction
dc.type: Article
