Pantip Multi-turn Datasets Generating from Thai Large Social Platform Forum Using Sentence Similarity Techniques

Abstract

Fine-tuning Large Language Models (LLMs) for specific domains is crucial. However, the lack of open Thai dialogue data presents a major challenge. To address this challenge, this study proposes a novel methodology for extracting and constructing multi-turn conversational data from Pantip, a large existing Thai social platform. Our approach applies semantic matching algorithms to identify and compile both single-turn and multi-turn dialogues. By keeping only pairs whose cosine similarity is ≥ 0.3, we obtain contextually coherent conversation pairs directly from the source data. The resulting dataset represents real Thai conversation styles, which could improve how accurately fine-tuned language models reflect Thai social and linguistic norms. Our approach introduces a new way to use existing publicly available data to create training datasets, which is especially valuable for low-resource languages. Datasets built from chaotic Pantip data can contribute to the development of more culturally attuned and linguistically precise Thai language models, potentially advancing culturally specific natural language processing.
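
The thresholding step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the sentence encoder, so the `embed` function below is a hypothetical stand-in that uses character bigram counts; in practice a Thai-capable sentence embedding model would take its place. The names `embed`, `cosine`, and `select_replies` are assumptions made for illustration and do not come from the study. Only the 0.3 cutoff is taken from the abstract.

```python
import math
from collections import Counter

THRESHOLD = 0.3  # cosine similarity cutoff stated in the abstract


def embed(text, n=2):
    """Hypothetical stand-in for a sentence encoder: character n-gram counts."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_replies(post, replies, threshold=THRESHOLD):
    """Keep only replies whose similarity to the post meets the threshold."""
    post_vec = embed(post)
    return [r for r in replies if cosine(post_vec, embed(r)) >= threshold]


if __name__ == "__main__":
    post = "where to eat good noodles in bangkok"
    replies = ["good noodles near bangkok station", "lol"]
    print(select_replies(post, replies))
```

Under these assumptions, an on-topic reply is kept as a candidate dialogue turn, while an unrelated one-word reply shares no bigrams with the post and is dropped.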
