Mathematical Benchmarking of Convolutional Neural Networks for Thai Dialect Recognition: A Spectrogram Texture Classification Approach

Porawat Visutsak; Duongduen Ongrungruaeng; Surapong Wiriya; Keun Ho Ryu

doi:10.3390/electronics15061271


dc.contributor.author	Porawat Visutsak
dc.contributor.author	Duongduen Ongrungruaeng
dc.contributor.author	Surapong Wiriya
dc.contributor.author	Keun Ho Ryu
dc.date.accessioned	2026-05-08T19:26:35Z
dc.date.issued	2026-03-18
dc.description.abstract	This study rigorously evaluates 13 Convolutional Neural Network (CNN) architectures for Thai dialect recognition. By treating Automatic Speech Recognition (ASR) as a computer vision texture classification task, we processed an extensive 840-hour dataset from the Spoken Language Systems, Chulalongkorn University (SLSCU) corpus. Raw audio from four major dialects, namely Central, Northern (Khummuang), Northeastern (Korat), and Southern (Pattani), was transformed into 2D Mel-spectrograms using the Short-Time Fourier Transform (STFT). We analyzed a diverse range of architectures, including the VGG, Inception, ResNet, DenseNet, and MobileNet families, to establish the optimal trade-off between mathematical complexity and spectral feature extraction. Our experimental results identify NASNet-Mobile as the most effective model, achieving a macro-average F1-score of 0.9425. The analysis suggests that NASNet’s search-optimized cell structure is uniquely capable of capturing the multiscale texture of phonetic formants. In contrast, we observed a catastrophic mode collapse in VGG16 (32.97% accuracy), likely due to excessive parameter bloat, while Xception and MobileNetV2 maintained robust generalization. Confusion matrix analysis reveals high acoustic distinctiveness for Southern Thai (96.7% recall), whereas Northern Thai exhibits significant spectral overlap with Central Thai. These results support the hypothesis that CNNs interpret spectrograms as textures rather than discrete objects, positioning NASNet-Mobile as a high-performance, low-latency baseline for edge-device deployment in resource-constrained environments.
dc.identifier.doi	10.3390/electronics15061271
dc.identifier.uri	https://dspace.kmitl.ac.th/handle/123456789/20695
dc.publisher	Electronics
dc.subject	Speech Recognition and Synthesis
dc.subject	Phonetics and Phonology Research
dc.subject	Authorship Attribution and Profiling
dc.title	Mathematical Benchmarking of Convolutional Neural Networks for Thai Dialect Recognition: A Spectrogram Texture Classification Approach
dc.type	Article
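
The abstract reports a macro-average F1-score and per-dialect recall derived from confusion matrix analysis. As a reading aid, the following is a minimal sketch of how those quantities are conventionally computed from a confusion matrix. The function names and the toy matrix are illustrative assumptions, not taken from the article, and the counts are invented for demonstration only.

```python
from typing import List, Tuple


def per_class_metrics(cm: List[List[int]]) -> List[Tuple[float, float, float]]:
    """Return (precision, recall, f1) for each class.

    Convention assumed here: cm[i][j] is the number of samples whose
    true class is i and whose predicted class is j.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # rest of row k
        fp = sum(cm[i][k] for i in range(n)) - tp   # rest of column k
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


def macro_f1(cm: List[List[int]]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    metrics = per_class_metrics(cm)
    return sum(f1 for _, _, f1 in metrics) / len(metrics)


# Hypothetical four-class matrix (invented counts, NOT results from the paper).
# Row/column order assumed: Central, Northern, Northeastern, Southern.
TOY_CM = [
    [90, 8, 1, 1],
    [12, 85, 2, 1],
    [2, 3, 94, 1],
    [1, 1, 1, 97],
]

if __name__ == "__main__":
    for name, (p, r, f) in zip(
        ["Central", "Northern", "Northeastern", "Southern"],
        per_class_metrics(TOY_CM),
    ):
        print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
    print(f"macro-F1={macro_f1(TOY_CM):.4f}")
```

Because the macro average weights every class equally, a classifier that assigns every sample to a single class on a balanced four-class set scores 25% accuracy but only 0.1 macro-F1, which is why macro-F1 is a stricter summary than accuracy when one class dominates the predictions.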
