TrafficInternVL: Spatially-Guided Fine-Tuning with Caption Refinement for Fine-Grained Traffic Safety Captioning and Visual Question Answering
| dc.contributor.author | Sasin Phimsiri | |
| dc.contributor.author | Sarut Sunpawatr | |
| dc.contributor.author | Riu Cherdchusakulchai | |
| dc.contributor.author | Pornprom Kiawjak | |
| dc.contributor.author | Teepakorn Tosawadi | |
| dc.contributor.author | Suchat Tungjitnob | |
| dc.contributor.author | Visarut Trairattanapa | |
| dc.contributor.author | Supawit Vatathanavaro | |
| dc.contributor.author | Wasu Kudisthalert | |
| dc.contributor.author | Chaitat Utintu | |
| dc.contributor.author | Worawit Saetan | |
| dc.contributor.author | Nathamon Kongsawat | |
| dc.contributor.author | Phawat Borisuitsawat | |
| dc.contributor.author | Kasisdis Mahakijdechachai | |
| dc.contributor.author | Nitipan Su-Inn | |
| dc.contributor.author | Ek Thamwiwatthana | |
| dc.contributor.author | Vasin Suttichaya | |
| dc.date.accessioned | 2026-05-08T19:21:38Z | |
| dc.date.issued | 2025-10-19 | |
| dc.description.abstract | Fine-grained traffic understanding requires both detailed visual descriptions and precise answers to safety-critical questions. We present TrafficInternVL, a framework for fine-grained traffic safety description and question answering, developed for the AI City Challenge 2025 Track 2. Our approach is based on the InternVL3-38B vision-language model and integrates four key components: (1) spatially guided visual prompting via bounding-box-based cropping and rendering; (2) adaptive view selection protocols; (3) low-rank adaptation (LoRA) fine-tuning, updating only 1% of model parameters; and (4) caption refinement for intra-scene consistency. Our model achieves a Caption Score of 32.75 (the average of BLEU-4, METEOR, ROUGE-L, and CIDEr) and a VQA accuracy of 83.08%. Code, prompts, and LoRA weights are released at https://github.com/ARV-MLCORE/TrafficInternVL | |
| dc.identifier.doi | 10.1109/iccvw69036.2025.00559 | |
| dc.identifier.uri | https://dspace.kmitl.ac.th/handle/123456789/18112 | |
| dc.subject | Multimodal Machine Learning Applications | |
| dc.subject | Topic Modeling | |
| dc.subject | Video Analysis and Summarization | |
| dc.title | TrafficInternVL: Spatially-Guided Fine-Tuning with Caption Refinement for Fine-Grained Traffic Safety Captioning and Visual Question Answering | |
| dc.type | Article |