DataDecon: Data Cleansing Tools for Large Language Model with Efficient Decontamination Techniques

dc.contributor.author: Sumeth Yuenyong
dc.contributor.author: Norapat Buppodom
dc.contributor.author: Koravich Sangkaew
dc.contributor.author: Konthee Boonmeeprakob
dc.contributor.author: Prachya Boonkwan
dc.contributor.author: Jillaphat Jaroenkantasima
dc.contributor.author: Pitikorn Khlaisamniang
dc.contributor.author: Anuruth Lertpiya
dc.contributor.author: Apivadee Piyatumrong
dc.contributor.author: Peerawat Rojratchadakorn
dc.contributor.author: Thaweewat Rugsujarit
dc.contributor.author: Teerapol Saengsukhiran
dc.contributor.author: Kriangkrai Saetan
dc.contributor.author: Isada Sukprapa
dc.contributor.author: Thanachot Thavornmongkol
dc.contributor.author: Nucharee Thongthungwong
dc.contributor.author: Patteera Triamamornwooth
dc.contributor.author: Chanon Utupon
dc.contributor.author: Kobkrit Viriyayudhakorn
dc.contributor.author: Phoochit Witchutanon
dc.contributor.author: Sadit Wongprayon
dc.contributor.author: Thepchai Supnithi
dc.date.accessioned: 2026-05-08T19:24:23Z
dc.date.issued: 2024-11-11
dc.description.abstract: Large language models (LLMs) play an important role in modern NLP because they are versatile across a wide array of NLP tasks. However, constructing an LLM is challenging: construction pipelines are concealed, and cleansed datasets and hyperparameter settings are rarely available, which makes the process almost irreproducible. This paper presents an efficient pipeline for constructing an LLM tailored to a low-to-medium-resource language whose data has a high level of contamination, together with tools to cleanse the dataset. Following our pipeline, we constructed OpenThaiGPT, an LLM for Thai, using only open-source datasets such as CC100, OSCAR, and mC4, and achieved state-of-the-art accuracy on our downstream tasks. We disclose the data statistics and all hyperparameter settings for reproducibility.
dc.identifier.doi: 10.1109/isai-nlp64410.2024.10799278
dc.identifier.uri: https://dspace.kmitl.ac.th/handle/123456789/19562
dc.subject: Data Quality and Management
dc.subject: Privacy-Preserving Technologies in Data
dc.title: DataDecon: Data Cleansing Tools for Large Language Model with Efficient Decontamination Techniques
dc.type: Article
