TrafficInternVL: Spatially-Guided Fine-Tuning with Caption Refinement for Fine-Grained Traffic Safety Captioning and Visual Question Answering

dc.contributor.author: Sasin Phimsiri
dc.contributor.author: Sarut Sunpawatr
dc.contributor.author: Riu Cherdchusakulchai
dc.contributor.author: Pornprom Kiawjak
dc.contributor.author: Teepakorn Tosawadi
dc.contributor.author: Suchat Tungjitnob
dc.contributor.author: Visarut Trairattanapa
dc.contributor.author: Supawit Vatathanavaro
dc.contributor.author: Wasu Kudisthalert
dc.contributor.author: Chaitat Utintu
dc.contributor.author: Worawit Saetan
dc.contributor.author: Nathamon Kongsawat
dc.contributor.author: Phawat Borisuitsawat
dc.contributor.author: Kasisdis Mahakijdechachai
dc.contributor.author: Nitipan Su-Inn
dc.contributor.author: Ek Thamwiwatthana
dc.contributor.author: Vasin Suttichaya
dc.date.accessioned: 2026-05-08T19:21:38Z
dc.date.issued: 2025-10-19
dc.description.abstract: Fine-grained traffic understanding requires both detailed visual descriptions and precise answers to safety-critical questions. We present TrafficInternVL, a framework for fine-grained traffic safety description and question answering, developed for the AI City Challenge 2025 Track 2. Our approach is based on the InternVL3-38B vision-language model and integrates four key components: (1) spatially guided visual prompting via bounding-box-based cropping and rendering; (2) adaptive view selection protocols; (3) low-rank adaptation (LoRA) fine-tuning, updating only 1% of model parameters; and (4) caption refinement for intra-scene consistency. Our model achieves a Caption Score of 32.75 (the average of BLEU-4, METEOR, ROUGE-L, and CIDEr) and a VQA accuracy of 83.08%. Code, prompts, and LoRA weights are released at https://github.com/ARV-MLCORE/TrafficInternVL
dc.identifier.doi: 10.1109/iccvw69036.2025.00559
dc.identifier.uri: https://dspace.kmitl.ac.th/handle/123456789/18112
dc.subject: Multimodal Machine Learning Applications
dc.subject: Topic Modeling
dc.subject: Video Analysis and Summarization
dc.title: TrafficInternVL: Spatially-Guided Fine-Tuning with Caption Refinement for Fine-Grained Traffic Safety Captioning and Visual Question Answering
dc.type: Article
