07 - 12 - 2022

How to evaluate a text generation model: strengths and limitations of popular evaluation metrics

The purpose of this article is to provide a comprehensive description of evaluation methods that can be applied to a text generation task. Three different evaluation techniques are introduced, then used to analyze lyrics written in the Beatles’ style.

Language models are evolving continually in sync with technological developments. With these developments, Natural Language Generation (NLG) allows us to create models that can write in human languages. More than you may be aware, many of the applications we use daily, such as chatbots and machine translation, are based on a text generation model. Building these language models to be “as human as possible” is difficult, since numerous factors, including linguistic structure, grammar, and vocabulary, must be considered.

An important challenge in developing models that generate human-level text is assessing how closely your model’s output resembles text written by humans. In this blog we present some popular evaluation metrics and discuss their strengths and limitations.

The unsupervised nature of these tasks makes the evaluation procedure challenging. However, it is vital to determine whether a trained model performs well. The most frequently used approaches are: human judgment, untrained automatic metrics, and machine-learned metrics. The following overview is based on a survey on the evaluation of text generation.

Human judgment

As these models try to write in human languages and produce text that is valuable to people, the best way to validate the output is human-based. In this scenario, multiple people review the model’s output and give insight into how well it is performing. This can be done with an annotation task: readers receive a guideline that describes how to carry out the evaluation. Even though this type of evaluation is considered important, it has several limitations. Human evaluation is time-consuming and expensive, and since the amount of data to review is often large, it is difficult for an individual to inspect all of the content manually. Also, judgments by different annotators tend to be ambiguous, resulting in unreliable assessments of the quality of the text generation model. Inter-annotator agreement is therefore an important measure of the model’s performance: it indicates whether the task is well-defined and whether the differences in the generated text are consistently noticeable to evaluators. However, this evaluation method is biased, since the assessed quality of the model also depends on the personal beliefs of each annotator, which can make the evaluation results subjective.
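Inter-annotator agreement is often quantified with Cohen’s kappa, which corrects raw agreement for the agreement two annotators would reach by chance. A minimal sketch (the annotator labels below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability of matching if each annotator labelled
    # independently, following their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] / n * freq_b[c] / n for c in freq_a)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical quality judgments ("g" = good, "b" = bad) from two annotators:
ann_a = ["g", "g", "b", "g", "b", "g", "g", "b", "b", "g"]
ann_b = ["g", "b", "b", "g", "b", "g", "g", "g", "b", "g"]
print(round(cohens_kappa(ann_a, ann_b), 3))  # 0.583, i.e. moderate agreement
```

A kappa near 1 means the annotators consistently see the same differences in the generated text; a kappa near 0 means their agreement is no better than chance, a sign the task guideline is too ambiguous.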

Based on these limitations, alternative ways of evaluating NLG models have been developed. To reduce the costs involved in manual judgment and to lessen the ambiguity in judging generated texts, automated metrics have gained popularity for evaluating NLG models.

Untrained Automatic Metrics

With untrained automatic metrics, the effectiveness of language models can be measured. These methods calculate a score that compares an automatically generated text with a human-written reference text. Utilizing these techniques is simple and efficient. There are many different automatic measures, including n-gram overlap metrics, distance-based metrics, diversity metrics, and content overlap metrics. In this blog we focus on n-gram overlap metrics.

N-gram overlap metrics are commonly used for evaluating NLG systems. When you try to evaluate a generated text, your first instinct may be to measure how similar it is to the human reference, and that is exactly what this type of metric does. The overlap between the two texts is calculated from the number of shared word sequences of length n (n-grams). Some well-known metrics based on this approach are: Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), and the Metric for Evaluation of Translation with Explicit Ordering (METEOR).

Despite their popularity, these metrics have some big drawbacks. Most importantly, they are sensitive to lexical variation: when different words with the same meaning are used, the model is penalized because the surface text differs. Since they only look at n-gram overlap, the semantic and syntactic structure of the sentence is not considered. For example,[1] given the reference sentence ‘people like foreign cars’, this type of evaluation will not give a high score to a generated sentence such as ‘consumers prefer imported cars’, but will give a high score to ‘people like visiting foreign countries’. When semantically correct statements are penalized because they deviate from the reference’s surface form, performance is underestimated.
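To see this penalty in action, here is a toy clipped unigram precision, a deliberately simplified stand-in for BLEU (real BLEU also counts higher-order n-grams and applies a brevity penalty), applied to the example sentences above:

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision: the fraction of candidate words that also
    appear in the reference. Counts are clipped so a word cannot match more
    often than it occurs in the reference."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    matches = sum(min(count, ref[word]) for word, count in cand.items())
    return matches / sum(cand.values())

reference = "people like foreign cars"
# The faithful paraphrase shares only "cars" with the reference:
print(unigram_precision("consumers prefer imported cars", reference))        # 0.25
# The semantically wrong sentence reuses three reference words:
print(unigram_precision("people like visiting foreign countries", reference))  # 0.6
```

The paraphrase, although semantically much closer to the reference, scores far lower than the sentence that merely reuses surface words, which is exactly the lexical-variation problem described above.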

The true semantics of an entire sentence cannot be evaluated with n-gram-based metrics. Machine-learned metrics therefore emerged as a way to measure text quality at a deeper level.

Machine-learning based metrics

These metrics are often based on machine-learned models, which are used to measure the similarity between two machine-generated texts or between machine-generated and human-generated texts. Such models can be viewed as digital judges that simulate human interpretation. Well-known machine-learned evaluation metrics are BERT-score and BLEURT. BERT-score can be considered a hybrid approach, as it combines trained elements (embeddings) with handwritten logic (token alignment rules) (Sellam, 2020). The method leverages the pre-trained contextual embeddings from Bidirectional Encoder Representations from Transformers (BERT) and matches words in candidate and reference sentences by cosine similarity. Since that is quite a mouthful, let’s break it down a bit: contextual embeddings generate different vector representations for the same word in different sentences, depending on the surrounding words that form the context of the target word. The BERT model was one of the most important game-changing NLP models, using the attention mechanism to train these contextual embeddings. BERT-score has been shown to correlate well with human judgments in sentence-level and system-level evaluations. There are, however, some elements to consider with this approach. The vector representation allows for a less rigid measure of similarity than exact-string or heuristic matching; depending on the goal or ‘rules’ of the project, this can be a limitation or an advantage.
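The matching step can be sketched with toy vectors: each candidate token is aligned to its most similar reference token by cosine similarity (and vice versa), and the similarities are averaged into precision, recall, and F1. The embeddings below are invented for illustration only; a real BERT-score uses contextual vectors produced by a pre-trained BERT model:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match_f1(cand_vecs, ref_vecs):
    """Greedy cosine matching in the style of BERT-score: precision pairs each
    candidate vector with its best reference vector, recall the reverse."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    return 2 * precision * recall / (precision + recall)

# Invented token embeddings: near-synonyms get nearby vectors.
emb = {
    "people":  (1.0, 0.1, 0.0), "consumers": (0.9, 0.2, 0.1),
    "like":    (0.0, 1.0, 0.1), "prefer":    (0.1, 0.9, 0.2),
    "foreign": (0.2, 0.0, 1.0), "imported":  (0.3, 0.1, 0.9),
    "cars":    (0.5, 0.5, 0.5),
}
ref = [emb[w] for w in "people like foreign cars".split()]
cand = [emb[w] for w in "consumers prefer imported cars".split()]
print(round(greedy_match_f1(cand, ref), 3))  # close to 1, despite almost no word overlap
```

Unlike n-gram overlap, the paraphrase scores close to 1 here because its (invented) embeddings sit near those of the reference, illustrating the less rigid notion of similarity described above.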

Finally, another machine-learned metric is BLEURT, which uses BERT to build a learned representation of the text and is additionally trained to predict human quality judgments; its name combines BERT and BLEU. The selection of the appropriate evaluation method depends on the goal of the project. Depending on the project and the content generated, some of these methods can be considered strict or lenient. It may therefore happen that evaluation scores come out high or low simply because the chosen evaluation metric is not the right one.

Let’s apply some of these metrics to a concrete example. As you can read in our previous blog posts, we trained an NLG model to write lyrics in the style of the Beatles. Obviously, we’re eager to learn whether our model is doing a good job: are we actually Beatly? Based on the results of the two machine-learned evaluation metrics, our model seemed to perform quite well: both BLEURT and BERT-score reported an F-score higher than 0.7, which indicates that the generation quality of our songs was decent.

However, as already mentioned, quantitative metrics like BLEURT and BERT-score have their deficits. Therefore, we also asked a critical crowd of Beatles fans what they think of the lyrics we generated. To collect human observations, we created a questionnaire containing 15 songs: 12 generated by our model and 3 originals. This survey was posted on a social media platform with thousands of Beatles fans, who were asked to tell us what they think of the made-up songs. Based on these results it was evident that the generated songs were not as good as the machine-based scores suggested. Some fans even characterized the songs as “awful” or “written poorly”. The readers also made a good observation: our model often repeats existing Beatles lyrics, which defeats the purpose of text generation. We do have to consider that Beatles fans are less inclined to agree that their heroes can be replaced by AI, so some bias might be present there as well…

In conclusion: with the evolution of technology, NLG has made a great contribution to our daily lives. The unsupervised nature of these models makes the evaluation process challenging. To reduce the costs involved in manual judgment and to lessen the ambiguity in judging generated texts, automated metrics have gained popularity for evaluating NLG models. However, machine-learned evaluation metrics fall short of replicating human judgment in some, even many, circumstances; they cannot fully cover all the qualitative aspects of generated text. Therefore, a human in the loop is still needed most of the time. It is worth noting, though, that human judgment can itself lead to biased conclusions based on the subjective interpretation of the reader.


[1] This example was sourced from 1904.09675 (arxiv.org)

This article is written by:
Konstantina Andronikou
k.andronikou@cmotions.nl
Jurriaan Nagelkerke
j.nagelkerke@cmotions.nl