Pantip Multi-turn Datasets Generating from Thai Large Social Platform Forum Using Sentence Similarity Techniques

dc.contributor.author: Anon Sae-Oueng
dc.contributor.author: Kun Kerdthaisong
dc.contributor.author: Kittisak Sukhantharat
dc.contributor.author: Pakawat Phasook
dc.contributor.author: Piyawat Chuangkrud
dc.contributor.author: Chaianun Damrongrat
dc.contributor.author: Sarawoot Kongyoung
dc.date.accessioned: 2026-05-08T19:24:23Z
dc.date.issued: 2024-11-11
dc.description.abstract: Fine-tuning Large Language Models (LLMs) for specific domains is crucial. However, the lack of open Thai dialogue data presents a major challenge. To address this challenge, this study proposes a novel methodology for extracting and constructing multi-turn conversational data from Pantip, a large existing Thai social platform. Our approach implements semantic matching algorithms to identify and compile both single-turn and multi-turn dialogues. By employing a cosine similarity threshold ≥ 0.3, we obtain contextually coherent conversation pairs directly from the source data. The resulting dataset represents real Thai conversation styles, which could improve how accurately fine-tuned language models reflect Thai social and linguistic norms. Our approach introduces a new way to use existing publicly available data to create training datasets, which is especially valuable for languages with limited resources. Chaotic Pantip datasets can contribute to the development of more culturally attuned and linguistically precise Thai language models, potentially advancing culturally specific natural language processing.
dc.identifier.doi: 10.1109/isai-nlp64410.2024.10799403
dc.identifier.uri: https://dspace.kmitl.ac.th/handle/123456789/19565
dc.subject: Natural Language Processing Techniques
dc.subject: Computational and Text Analysis Methods
dc.subject: Topic Modeling
dc.title: Pantip Multi-turn Datasets Generating from Thai Large Social Platform Forum Using Sentence Similarity Techniques
dc.type: Article
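
The thresholding step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the record does not say which sentence encoder was used or which turns are compared, so the `embed` function below is a toy character-bigram stand-in for a real sentence-embedding model, and `build_dialogue` (a hypothetical helper) assumes each comment is compared with the previous accepted turn. Only the 0.3 cutoff comes from the abstract.

```python
import math
from collections import Counter

THRESHOLD = 0.3  # cosine similarity cutoff stated in the abstract


def embed(text):
    """Toy stand-in for a sentence encoder (assumption): character-bigram counts."""
    cleaned = text.lower().replace(" ", "")
    return Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_dialogue(post, comments, threshold=THRESHOLD):
    """Hypothetical helper: append a comment as the next turn only if it is
    similar enough to the most recently accepted turn."""
    turns = [post]
    for comment in comments:
        if cosine(embed(turns[-1]), embed(comment)) >= threshold:
            turns.append(comment)
    return turns


if __name__ == "__main__":
    thread = build_dialogue(
        "best thai noodle shop",
        ["thai noodle shop near me", "zzzz qqqq"],
    )
    print(thread)
```

In a real setting the bigram counter would be replaced by dense sentence embeddings, and the comparison rule (previous turn, original post, or quoted reply) would follow whatever the paper specifies.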
