Comprehensive Benchmarking and Analysis of Open Pretrained Thai Speech Recognition Models
Abstract
This paper presents a comprehensive benchmarking and analysis of open pretrained Thai Automatic Speech Recognition (ASR) models, addressing a critical gap in low-resource language ASR development. Our benchmark covers one foundation speech model, three open Thai speech recognition models, and open speech APIs. We evaluate these models, including our fine-tuned Whisper model, across diverse speech types and acoustic environments. The study reveals significant performance gaps between read and spontaneous speech: models perform well in controlled settings but struggle in real-world scenarios. We introduce evaluation datasets for distant speech and noisy podcasts, exposing limitations in the robustness of current models. Our fine-tuned Whisper model performs strongly across various Thai regional dialects, reflecting its targeted training on dialectal data, while the other models show resilience in spontaneous speech scenarios. However, all models degrade substantially in challenging acoustic conditions, indicating the need for more diverse training corpora that capture real-world complexity, including spontaneous speech and far-field acoustic scenarios, to further enhance Thai ASR. This work provides valuable insights for improving ASR in low-resource languages.
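As a point of reference for the benchmarking described above, the sketch below shows how an ASR error metric can be computed. Character error rate (CER) is a common choice for Thai, whose script lacks explicit word boundaries; the transcripts in the example are illustrative placeholders, not data from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences (single-row DP)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                            # deletion
                        dp[j - 1] + 1,                        # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))    # substitution
            prev = cur
    return dp[n]

def cer(reference, hypothesis):
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

# Illustrative check: one deleted character out of an 11-character reference.
print(round(cer("hello world", "helo world"), 3))  # → 0.091
```

The same distance over whitespace-tokenized word lists would yield word error rate (WER), which is more typical for languages with overt word boundaries.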