Mathematical Benchmarking of Convolutional Neural Networks for Thai Dialect Recognition: A Spectrogram Texture Classification Approach

Porawat Visutsak; Duongduen Ongrungruaeng; Surapong Wiriya; Keun Ho Ryu

doi:10.3390/electronics15061271

Mathematical Benchmarking of Convolutional Neural Networks for Thai Dialect Recognition: A Spectrogram Texture Classification Approach

Date

2026-3-18

Authors

Porawat Visutsak

Duongduen Ongrungruaeng

Surapong Wiriya

Keun Ho Ryu

Publisher

Electronics

Abstract

This study rigorously evaluates 13 Convolutional Neural Network (CNN) architectures for Thai dialect recognition. By treating Automatic Speech Recognition (ASR) as a computer vision texture classification task, we processed an extensive 840-h dataset from the Spoken Language Systems, Chulalongkorn University (SLSCU) corpus. Raw audio from four major dialects—Central, Northern (Khummuang), Northeastern (Korat), and Southern (Pat-tani)—was transformed into 2D Mel-spectrograms using the Short-Time Fourier Transform (STFT). We analyzed a diverse range of architectures, including the VGG, Inception, ResNet, DenseNet, and MobileNet families, to establish the optimal trade-off between mathematical complexity and spectral feature extraction. Our experimental results identify NASNet-Mobile as the most effective model, achieving a macro-average F1-score of 0.9425. The analysis suggests that NASNet’s search-optimized cell structure is uniquely capable of capturing the multiscale texture of phonetic formants. In contrast, we observed a catastrophic mode collapse in VGG16 (32.97% accuracy), likely due to excessive parameter bloat, while Xception and MobileNetV2 maintained robust generalization. Confusion matrix analysis reveals high acoustic distinctiveness for Southern Thai (96.7% recall), whereas Northern Thai exhibits significant spectral overlap with Central Thai. These results support the hypothesis that CNNs interpret spectrograms as textures rather than discrete objects, positioning NASNet-Mobile as a high-performance, low-latency baseline for edge-device deployment in resource-constrained environments.