ThaiScamBench: Toward a Benchmark Dataset for Scam and Phishing Detection in the Thai Language

Nutthakorn Chalaemwongwan

doi:10.1109/icsec67360.2025.11298052

Date

2025-11-02

Authors

Nutthakorn Chalaemwongwan

Abstract

Recent reports from Thai financial regulators reveal a sharp increase in online scams, resulting in major financial damage each year. While English-language research has made considerable progress, Thai-language scam detection remains underexplored and lacks standardized benchmarks. This study establishes *ThaiScamBench (pilot)*, a curated corpus of 1,750 Thai messages labeled as scam or legitimate across seven categories. We provide reproducible baselines (Logistic Regression, Linear SVM) and a rigorous evaluation protocol that addresses class imbalance, Thai–English code-switching, and adversarial obfuscation. Our analysis reveals a dataset bias: every scam message contains a URL and no legitimate message does, which artificially inflates headline metrics. We address this issue by introducing URL-masked and domain hold-out evaluations. We also outline a roadmap toward version 1.0 (around 50k messages), focusing on dataset scaling, robust benchmarking, adversarial stress tests, and PDPA-compliant release artifacts. *ThaiScamBench* establishes the first standardized evaluation for Thai scam detection, enabling transparent comparison and artifact-conscious evaluation, and provides a reproducible foundation for robust Thai-language scam detection.
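To make the URL-masked and domain hold-out evaluations concrete, the following is a minimal sketch of how such preprocessing might look. It is not the benchmark's released code: the regular expression, the `<URL>` placeholder, the function names (`mask_urls`, `extract_domains`, `domain_holdout_split`), and the example messages and domains are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Assumed URL pattern: http(s) or bare "www." links. The match stops at
# whitespace or Thai characters, since Thai text is often written without
# spaces around an embedded link.
URL_RE = re.compile(r"(?:https?://|www\.)[^\s\u0E00-\u0E7F]+", re.IGNORECASE)


def mask_urls(text, token="<URL>"):
    """Replace each URL with a placeholder token.

    This hides the URL string itself (domain, path, obfuscation tricks),
    so a classifier cannot rely on specific URL content.
    """
    return URL_RE.sub(token, text)


def extract_domains(text):
    """Return the set of lower-cased host names mentioned in the text."""
    domains = set()
    for match in URL_RE.findall(text):
        # urlparse only fills in the host when a scheme is present.
        url = match if match.lower().startswith("http") else "http://" + match
        host = urlparse(url).hostname
        if host:
            domains.add(host)
    return domains


def domain_holdout_split(messages, held_out_domains):
    """Send every message that mentions a held-out domain to the test set.

    `messages` is a list of (text, label) pairs. Messages without a
    held-out domain, including those with no URL at all, go to train.
    """
    held = {d.lower() for d in held_out_domains}
    train, test = [], []
    for text, label in messages:
        target = test if extract_domains(text) & held else train
        target.append((text, label))
    return train, test


# Hypothetical example messages (not taken from the dataset).
messages = [
    ("คลิก http://bank-verify.example/login ด่วน", "scam"),
    ("พรุ่งนี้ประชุม 10 โมง", "legitimate"),
    ("รับรางวัลที่ www.prize.example/claim", "scam"),
]
masked = [mask_urls(text) for text, _ in messages]
train, test = domain_holdout_split(messages, {"www.prize.example"})
```

Masking to a placeholder still leaves a signal that a URL was present, so under the bias described above a model could learn the token itself. Removing URLs entirely (passing `token=""`) is a stricter variant of the same idea. Likewise, a test set built only from held-out domains would contain no legitimate messages in a corpus where legitimate texts carry no URLs, so a practical split would also need to reserve a share of URL-free messages for testing.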