Comparison of Evaluation Metrics for Short Story Generation

dc.contributor.author: Ponrudee Netisopakul
dc.contributor.author: Usanisa Taoto
dc.date.accessioned: 2025-07-21T06:08:25Z
dc.date.issued: 2023-01-01
dc.description.abstract: The aim of this study was to analyze the correlation among different automatic evaluation metrics for text generation. Texts were generated from short stories using four language models: an N-gram model, a Continuous Bag-of-Words (CBOW) model, a Gated Recurrent Unit (GRU) model, and a Generative Pre-trained Transformer 2 (GPT-2) model. All models were trained on a corpus of short Aesop’s fables. The quality of the generated text was measured with several metrics: perplexity, BLEU score, the number of grammatical errors, Self-BLEU score, ROUGE score, BERTScore, and Word Mover’s Distance (WMD). A correlation analysis of these metrics revealed four groups. First, perplexity and grammatical errors were moderately correlated. Second, BLEU, ROUGE, and BERTScore were highly correlated. Third, WMD was negatively correlated with BLEU, ROUGE, and BERTScore. Finally, Self-BLEU, which measures text diversity within a model, did not correlate with the other metrics. In conclusion, evaluating text generation calls for a combination of metrics, each capturing a different aspect of the generated text.
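
As a rough illustration of the methodology the abstract describes, the sketch below scores a few candidate sentences against references with BLEU (via NLTK) and ROUGE-1 (via the rouge-score package), then computes the Pearson correlation between the two metric series across samples. This is a minimal sketch, not the paper's actual pipeline: the sentences are hypothetical stand-ins for generated fables, and the library choices are assumptions.

    # A minimal sketch, assuming NLTK, rouge-score, and SciPy are installed;
    # the (reference, candidate) pairs are hypothetical stand-ins for fable text.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from rouge_score import rouge_scorer
    from scipy.stats import pearsonr

    pairs = [
        ("the fox praised the crow and stole the cheese",
         "the fox flattered the crow and took the cheese"),
        ("the tortoise won the race against the hare",
         "the hare lost the race to the slow tortoise"),
        ("the ant stored food while the grasshopper sang",
         "the ant gathered grain as the grasshopper played"),
        ("the boy cried wolf until nobody believed him",
         "the shepherd boy shouted wolf and was ignored"),
    ]

    smooth = SmoothingFunction().method1
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

    bleu_vals, rouge_vals = [], []
    for ref, cand in pairs:
        # sentence_bleu takes a list of tokenized references and a tokenized hypothesis.
        bleu_vals.append(sentence_bleu([ref.split()], cand.split(),
                                       smoothing_function=smooth))
        # rouge-score's API is score(target, prediction); keep the ROUGE-1 F-measure.
        rouge_vals.append(scorer.score(ref, cand)["rouge1"].fmeasure)

    # Correlate the two metric series across samples, mirroring the study's
    # correlation analysis (which found BLEU and ROUGE highly correlated).
    r, p = pearsonr(bleu_vals, rouge_vals)
    print("BLEU:   ", [round(v, 3) for v in bleu_vals])
    print("ROUGE-1:", [round(v, 3) for v in rouge_vals])
    print(f"Pearson r = {r:.3f} (p = {p:.3f})")

The paper itself correlates many more metrics (perplexity, grammatical errors, Self-BLEU, BERTScore, WMD) over a much larger sample; this sketch only mirrors the BLEU/ROUGE pairing to show the shape of such an analysis.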
dc.identifier.doi: 10.1109/access.2023.3337095
dc.identifier.uri: https://dspace.kmitl.ac.th/handle/123456789/12024
dc.subject: Perplexity
dc.subject: BLEU
dc.subject.classification: Natural Language Processing Techniques
dc.title: Comparison of Evaluation Metrics for Short Story Generation
dc.type: Article
