Pantip Multi-turn Datasets Generating from Thai Large Social Platform Forum Using Sentence Similarity Techniques

Anon Sae-Oueng; Kun Kerdthaisong; Kittisak Sukhantharat; Pakawat Phasook; Piyawat Chuangkrud; Chaianun Damrongrat; Sarawoot Kongyoung

doi:10.1109/isai-nlp64410.2024.10799403

Date

2024-11-11

Authors

Anon Sae-Oueng

Kun Kerdthaisong

Kittisak Sukhantharat

Abstract

Fine-tuning Large Language Models (LLMs) for specific domains is crucial, but the lack of open Thai dialogue data presents a major challenge. To address it, this study proposes a novel methodology for extracting and constructing multi-turn conversational data from Pantip, a large existing Thai social platform. Our approach applies semantic matching algorithms to identify and compile both single-turn and multi-turn dialogues. By applying a cosine similarity threshold of ≥ 0.3, we obtain contextually coherent conversation pairs directly from the source data. The resulting dataset reflects real Thai conversational styles, which could improve how accurately fine-tuned language models reflect Thai social and linguistic norms. Our approach introduces a new way to use existing publicly available data to create training datasets, which is especially valuable for low-resource languages. Datasets built from otherwise chaotic Pantip data can contribute to the development of more culturally attuned and linguistically precise Thai language models, potentially advancing culturally specific natural language processing.
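
The thresholding step described in the abstract can be illustrated with a minimal sketch. Only the cosine similarity cutoff of 0.3 comes from the abstract; everything else here is an assumption made for illustration. In particular, `embed` is a toy character-bigram vectorizer standing in for whatever sentence encoder the authors used, and the function names `pair_replies` and `build_dialogue`, as well as the rule of comparing each comment with the last accepted turn, are hypothetical.

```python
import math
from collections import Counter

THRESHOLD = 0.3  # cosine similarity cutoff stated in the abstract


def embed(text):
    """Toy stand-in for a sentence encoder (assumption): character-bigram counts.

    Character bigrams need no word segmentation, which is convenient for Thai,
    but a real pipeline would likely use a trained sentence-embedding model.
    """
    chars = text.lower().replace(" ", "")
    return Counter(chars[i:i + 2] for i in range(len(chars) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_replies(post, replies, threshold=THRESHOLD):
    """Single-turn pairs: keep replies whose similarity to the post meets the threshold."""
    post_vec = embed(post)
    return [(post, r) for r in replies if cosine(post_vec, embed(r)) >= threshold]


def build_dialogue(post, comments, threshold=THRESHOLD):
    """Multi-turn chain (hypothetical rule): append a comment when it is
    similar enough to the most recently accepted turn."""
    turns = [post]
    for comment in comments:
        if cosine(embed(turns[-1]), embed(comment)) >= threshold:
            turns.append(comment)
    return turns
```

Calling `pair_replies(post, replies)` returns the (post, reply) pairs that pass the cutoff, and `build_dialogue(post, comments)` returns the ordered list of accepted turns. Swapping `embed` for a real sentence encoder would leave the thresholding logic unchanged.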