Myanmar Text Grade-Level Prediction Using Statistical and Linguistic Features

Readability assessment supports curriculum design and adaptive learning by estimating the difficulty of a text for readers at different proficiency levels. However, text readability for the Myanmar language remains unexplored, mainly due to the absence of labeled resources and established computational approaches. This paper presents the first comprehensive study on Myanmar text readability classification. We construct a grade-annotated corpus from official Myanmar school textbooks (Grades 1-12) and extract linguistic, statistical, and Myanmar-specific indicators. We then evaluate regression and classification baselines, text-only embeddings, and ensembles. We also adapt three classic readability formulas (LIX, Dale-Chall, Flesch) to Myanmar and show empirically that their score distributions overlap heavily across educational levels. Experimental results show that ensemble-based models achieve the best performance in predicting grade levels, demonstrating the effectiveness of our feature design and modeling framework. This work introduces the first readability dataset, modeling approaches, and benchmark results for the Myanmar language, providing a strong foundation for future research in readability prediction, low-resource natural language processing (NLP), and educational text analysis.