myFoodQA: A Multimodal Dataset for Evaluating Cultural and Visual Reasoning in Myanmar Gastronomy

dc.contributor.author: Shin Thant Phyo
dc.contributor.author: Pyae Linn
dc.contributor.author: Thet Hmue Khin
dc.contributor.author: Lynn Myat Bhone
dc.contributor.author: Eaint Kay Khaing Kyaw
dc.contributor.author: Ye Kyaw Thu
dc.date.accessioned: 2026-05-08T19:25:53Z
dc.date.issued: 2025-11-05
dc.description.abstract: This paper presents myFoodQA (Myanmar Food Question Answering), the first multimodal benchmark focused on Myanmar's rich gastronomic culture. We constructed a novel dataset of 2,485 question-answer pairs covering 20 distinct dishes, with all data sourced from personal photography and web crawling and subsequently validated by native Burmese speakers for authenticity. The benchmark tests single-image, multi-image, and text-only reasoning, evaluating a model's understanding of ingredients, cultural context, preparation methods, and comparative logic. Our zero-shot evaluation of leading vision-language models (VLMs) reveals a clear performance gap: while the models perform well on text-only tasks, they show a marked deficit in image-based reasoning, which demands both fine-grained visual understanding and deep cultural knowledge. These findings expose the limitations of current models in the Myanmar gastronomic domain. We establish myFoodQA as a foundational resource for advancing culturally aware multimodal AI, particularly in low-resource settings.
dc.identifier.doi: 10.5281/zenodo.17531976
dc.identifier.uri: https://dspace.kmitl.ac.th/handle/123456789/20309
dc.publisher: Zenodo (CERN European Organization for Nuclear Research)
dc.title: myFoodQA: A Multimodal Dataset for Evaluating Cultural and Visual Reasoning in Myanmar Gastronomy
dc.type: Other
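
For concreteness, the zero-shot evaluation described in the abstract could be run with a loop like the minimal Python sketch below. The JSON-lines record layout, the query_vlm helper, and exact-match scoring are all assumptions of ours for illustration; this record does not specify the dataset format, the model interface, or the metric actually used.

```python
import json


def query_vlm(question: str, image_paths: list[str]) -> str:
    """Hypothetical stand-in for a call to the VLM under test
    (e.g. an HTTP request to a hosted model). The record does not
    specify the evaluation interface, so this must be filled in."""
    raise NotImplementedError


def evaluate(qa_file: str) -> float:
    """Exact-match accuracy over one split of the benchmark.

    Assumed (hypothetical) layout: one JSON object per line with a
    'question', a gold 'answer', and a possibly empty 'images' list,
    covering the text-only (0 images), single-image (1 image), and
    multi-image (2+ images) settings described in the abstract.
    """
    correct, total = 0, 0
    with open(qa_file, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            prediction = query_vlm(record["question"], record.get("images", []))
            correct += int(prediction.strip() == record["answer"].strip())
            total += 1
    return correct / total if total else 0.0
```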
