ThaiScamBench: Toward a Benchmark Dataset for Scam and Phishing Detection in the Thai Language

dc.contributor.author: Nutthakorn Chalaemwongwan
dc.date.accessioned: 2026-05-08T19:25:55Z
dc.date.issued: 2025-11-2
dc.description.abstract: Recent reports from Thai financial regulators reveal a sharp increase in online scams, resulting in major financial damage each year. While English-language research has made considerable progress, Thai-language scam detection remains underexplored and lacks standardized benchmarks. This study establishes *ThaiScamBench (pilot)*, a curated corpus of 1,750 Thai messages labeled as scam or legitimate across seven categories. Reproducible baselines (Logistic Regression, Linear SVM) and a rigorous evaluation protocol are provided to address class imbalance, Thai–English code-switching, and adversarial obfuscation. Every scam message in the corpus contains a URL, whereas no legitimate message does; this dataset bias artificially inflates headline metrics. We address the issue by introducing URL-masked and domain hold-out evaluations. We outline a roadmap toward version 1.0 (around 50k messages), focusing on dataset scaling, robust benchmarking, adversarial stress tests, and PDPA-compliant release artifacts. *ThaiScamBench* establishes the first standardized evaluation for Thai scam detection, enabling transparent, artifact-conscious comparison and providing a reproducible foundation for robust Thai-language scam detection.
dc.identifier.doi: 10.1109/icsec67360.2025.11298052
dc.identifier.uri: https://dspace.kmitl.ac.th/handle/123456789/20335
dc.subject: Spam and Phishing Detection
dc.subject: Cybercrime and Law Enforcement Studies
dc.subject: Hate Speech and Cyberbullying Detection
dc.title: ThaiScamBench: Toward a Benchmark Dataset for Scam and Phishing Detection in the Thai Language
dc.type: Article
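
The abstract names two controls for the URL artifact, URL-masked evaluation and domain hold-out evaluation, without giving implementation details in this record. The following is a minimal sketch of what such preprocessing could look like, not the paper's actual code. The regular expression, the `<URL>` placeholder token, and the function names `mask_urls`, `domains`, and `domain_holdout_split` are all assumptions made for illustration. The pattern relies on whitespace to end a URL, so Thai text written directly after a link without a space would be absorbed into the match.

```python
import re
from urllib.parse import urlparse

# Assumed pattern: anything starting with http(s):// or www. up to whitespace.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def mask_urls(text, token="<URL>"):
    """Replace every URL with a placeholder so a classifier cannot
    key on the specific link text (URL-masked evaluation)."""
    return URL_RE.sub(token, text)


def domains(text):
    """Return the set of lowercased host names found in a message."""
    found = set()
    for raw in URL_RE.findall(text):
        url = raw if raw.lower().startswith("http") else "http://" + raw
        try:
            host = urlparse(url).hostname
        except ValueError:  # malformed URL, e.g. unbalanced brackets
            continue
        if host:
            found.add(host.lower())
    return found


def domain_holdout_split(messages, held_out):
    """Send every message that mentions a held-out domain to the test
    side, so those domains are never seen during training."""
    train, test = [], []
    for msg in messages:
        (test if domains(msg) & held_out else train).append(msg)
    return train, test


# Hypothetical example messages; the domains use the reserved .example TLD.
msgs = [
    "คลิก http://bad.example/login เพื่อยืนยัน",
    "โอนเงินแล้ว www.shop.example/pay",
    "พรุ่งนี้เจอกันนะ",
]
train, test = domain_holdout_split(msgs, {"bad.example"})
```

Masking alone does not remove the bias reported in the abstract, because the presence of the placeholder still separates the two classes perfectly in this corpus; it only stops a model from memorizing specific links. The domain hold-out split addresses a different question, namely whether performance transfers to domains not seen during training.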
