ToxVI: a Multimodal LLM-based Framework for Generating Intervention in Toxic Code-Mixed Videos
Abstract
While considerable research has delved into detecting toxic content in text-based data, video content, particularly in languages other than English, has received less attention. Prior studies have primarily focused on building automated tools to identify online toxic speech but have often overlooked the crucial next steps of mitigating its impact and discouraging future use. By automatically generating interventions that explain why certain content is inappropriate, we can discourage social media users from sharing such material. To bridge this research gap, we propose a novel task: generating interventions for toxic videos in code-mixed languages, going beyond existing methods that combat online toxicity in text and images. We introduce a Toxic Code-Mixed Intervention Video benchmark dataset (ToxCMI), comprising 1,697 code-mixed toxic video utterances sourced from YouTube. Each utterance in this dataset has been meticulously annotated for toxicity and severity and is accompanied by an intervention written in Hindi-English code-mixed language. We develop ToxVI, an advanced multimodal framework designed to generate interventions appropriate to toxic videos by leveraging Large Language Models (LLMs); it comprises three modules: a Modality module, a Cross-Modal Synchronization module, and a Generation module. Our experiments demonstrate that integrating multiple modalities from the videos significantly enhances performance on the proposed task and that our framework outperforms all baselines by a significant margin.