Thai OCR Spelling Correction: Case Study in Thai Historical Document

Abstract

This research focuses on correcting errors produced by optical character recognition (OCR) of Thai historical documents, using as a case study the Bangkok Recorder, one of the earliest printed media in Thailand. The main problems are a high error rate caused by deterioration of the original documents, old Thai spelling, and the complexity of Thai script. The research compares three approaches: (1) prompt-based correction with large language models (LLMs), (2) fine-tuning LLMs on OCR datasets and reference texts, and (3) generating correction rules with LLMs and applying them in practice. The experimental results indicate that fine-tuning LLMs improves accuracy and handles old spellings and Thai numerals better than prompting alone, while the rule-generation method improves the consistency of correction where error patterns are repetitive. Overall, the GPT-4o model outperformed LLaMA-3 Typhoon in both contextual understanding and accuracy, with best precision, recall, and F1 scores of 99.16, 97.62, and 98.38 percent, respectively. These results show that LLMs are promising for Thai OCR spelling correction in the Thai archival document domain.
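As a quick consistency check on the reported scores, the sketch below recomputes F1 as the harmonic mean of the stated precision and recall. It assumes the standard F1 definition; the abstract does not say whether the scores are computed per character, per word, or per correction, and the function name `f1_score` is illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Best scores reported in the abstract, in percent.
reported_precision = 99.16
reported_recall = 97.62

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 2))  # 98.38
```

The recomputed value matches the reported F1 of 98.38 percent to two decimal places.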
