ThaiScamBench: Toward a Benchmark Dataset for Scam and Phishing Detection in the Thai Language

Nutthakorn Chalaemwongwan

doi:10.1109/icsec67360.2025.11298052

dc.contributor.author	Nutthakorn Chalaemwongwan
dc.date.accessioned	2026-05-08T19:25:55Z
dc.date.issued	2025-11-2
dc.description.abstract	Recent reports from Thai financial regulators reveal a sharp increase in online scams, resulting in major financial damage each year. While English-language research has made considerable progress, Thai-language scam detection remains underexplored and lacks standardized benchmarks. This study establishes ThaiScamBench (pilot), a curated corpus of 1,750 Thai messages labeled as scam or legitimate across seven categories. We provide reproducible baselines (Logistic Regression, Linear SVM) and a rigorous evaluation protocol that addresses class imbalance, Thai–English code-switching, and adversarial obfuscation. The pilot data exhibit a dataset bias: every scam message contains a URL and no legitimate message does, which artificially inflates headline metrics. We address this issue by introducing URL-masked and domain hold-out evaluations. We outline a roadmap toward version 1.0 (around 50k messages), focusing on dataset scaling, robust benchmarking, adversarial stress tests, and PDPA-compliant release artifacts. ThaiScamBench establishes the first standardized evaluation for Thai scam detection, enabling transparent comparison and artifact-conscious evaluation, and provides a reproducible foundation for robust Thai-language scam detection.
dc.identifier.doi	10.1109/icsec67360.2025.11298052
dc.identifier.uri	https://dspace.kmitl.ac.th/handle/123456789/20335
dc.subject	Spam and Phishing Detection
dc.subject	Cybercrime and Law Enforcement Studies
dc.subject	Hate Speech and Cyberbullying Detection
dc.title	ThaiScamBench: Toward a Benchmark Dataset for Scam and Phishing Detection in the Thai Language
dc.type	Article
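
The abstract names two safeguards against the URL shortcut: URL-masked evaluation and domain hold-out evaluation. The record does not describe how the paper implements either, so the following is only a minimal sketch under stated assumptions: URLs are replaced by a placeholder token before featurization, and "domain hold-out" is taken to mean that all messages sharing a URL domain land in the same split. The names `mask_urls`, `first_domain`, `domain_holdout_split`, the `<URL>` token, and the hash-based assignment are all hypothetical and not taken from the paper; labels are omitted for brevity.

```python
import hashlib
import re
from urllib.parse import urlparse

# Assumed URL pattern and placeholder token; not taken from the paper.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
URL_TOKEN = "<URL>"


def mask_urls(text):
    """Replace every URL with a fixed token so a classifier cannot key on it."""
    return URL_RE.sub(URL_TOKEN, text)


def first_domain(text):
    """Return the hostname of the first URL in the text, or None if there is none."""
    match = URL_RE.search(text)
    if match is None:
        return None
    url = match.group(0).rstrip(".,;:!?)")
    if not url.lower().startswith("http"):
        url = "http://" + url  # give bare "www." links a scheme so urlparse finds the host
    return urlparse(url).hostname


def _bucket(key, test_fraction):
    """Deterministically map a group key to 'train' or 'test' via a hash."""
    digest = int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)
    return "test" if (digest % 100) < test_fraction * 100 else "train"


def domain_holdout_split(messages, test_fraction=0.2):
    """Split messages so that no URL domain appears in both train and test.

    Messages without a URL are grouped by their own text. The returned
    messages are URL-masked.
    """
    train, test = [], []
    for text in messages:
        key = first_domain(text) or text
        target = test if _bucket(key, test_fraction) == "test" else train
        target.append(mask_urls(text))
    return train, test
```

Grouping by domain before splitting keeps a model from scoring well merely by memorizing domains seen during training, and masking removes the URL-presence cue that the abstract identifies as inflating headline metrics.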
