DataDecon: Data Cleansing Tools for Large Language Model with Efficient Decontamination Techniques
| dc.contributor.author | Sumeth Yuenyong | |
| dc.contributor.author | Norapat Buppodom | |
| dc.contributor.author | Koravich Sangkaew | |
| dc.contributor.author | Konthee Boonmeeprakob | |
| dc.contributor.author | Prachya Boonkwan | |
| dc.contributor.author | Jillaphat Jaroenkantasima | |
| dc.contributor.author | Pitikorn Khlaisamniang | |
| dc.contributor.author | Anuruth Lertpiya | |
| dc.contributor.author | Apivadee Piyatumrong | |
| dc.contributor.author | Peerawat Rojratchadakorn | |
| dc.contributor.author | Thaweewat Rugsujarit | |
| dc.contributor.author | Teerapol Saengsukhiran | |
| dc.contributor.author | Kriangkrai Saetan | |
| dc.contributor.author | Isada Sukprapa | |
| dc.contributor.author | Thanachot Thavornmongkol | |
| dc.contributor.author | Nucharee Thongthungwong | |
| dc.contributor.author | Patteera Triamamornwooth | |
| dc.contributor.author | Chanon Utupon | |
| dc.contributor.author | Kobkrit Viriyayudhakorn | |
| dc.contributor.author | Phoochit Witchutanon | |
| dc.contributor.author | Sadit Wongprayon | |
| dc.contributor.author | Thepchai Supnithi | |
| dc.date.accessioned | 2026-05-08T19:24:23Z | |
| dc.date.issued | 2024-11-11 | |
| dc.description.abstract | Large language models (LLMs) play an important role in modern NLP technology because they are versatile across a wide array of NLP tasks. However, constructing an LLM is challenging due to concealed construction pipelines, a lack of cleansed datasets, and undisclosed hyperparameter settings, which make the process almost irreproducible. This paper presents an efficient pipeline for constructing an LLM tailored to a low-to-medium-resource language with a high level of data contamination, together with tools to cleanse the dataset. Following our pipeline, we constructed OpenThaiGPT, an LLM for Thai, using only open-source datasets such as CC100, OSCAR, and mC4, and achieved state-of-the-art accuracy on our downstream tasks. We disclose the data statistics and all hyperparameter settings for reproducibility. | |
| dc.identifier.doi | 10.1109/isai-nlp64410.2024.10799278 | |
| dc.identifier.uri | https://dspace.kmitl.ac.th/handle/123456789/19562 | |
| dc.subject | Data Quality and Management | |
| dc.subject | Privacy-Preserving Technologies in Data | |
| dc.title | DataDecon: Data Cleansing Tools for Large Language Model with Efficient Decontamination Techniques | |
| dc.type | Article | |