myFoodQA: A Multimodal Dataset for Evaluating Cultural and Visual Reasoning in Myanmar Gastronomy
Publisher
Zenodo (CERN European Organization for Nuclear Research)
Abstract
This paper presents myFoodQA (Myanmar Food Question Answering), the first multimodal benchmark focused on Myanmar's rich gastronomic culture. We constructed a novel dataset containing 2,485 question-answer pairs covering 20 distinct dishes, with all data sourced from personal photography and web crawling and subsequently validated by native Burmese speakers for authenticity. The benchmark is designed to test single-image, multi-image, and text-only reasoning, evaluating a model's understanding of ingredients, cultural context, preparation methods, and comparative logic. Our zero-shot evaluation of leading vision-language models (VLMs) reveals a significant performance gap: while models perform well on text-based tasks, they fall markedly short on image-based reasoning, which requires fine-grained visual understanding and deep cultural knowledge. These findings expose the limitations of current models in the Myanmar gastronomic domain. We establish myFoodQA as a foundational resource for advancing culturally aware multimodal AI, particularly in low-resource settings.