Research scientist at Lyft Inc.

How does one evaluate the quality of sentence translation (or, more generally, corpus translation) by a machine learning system? The BLEU (Bilingual Evaluation Understudy) score is one widely used metric for this purpose. For the physicists among the readers, this is closely related to the idea of cluster expansions from quantum mechanics. The metric itself is quite easy to define.

Modified Precision Score

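We start by defining what is called the modified precision score for a sentence. Consider an input sentence S0, two reference translations (perhaps by human translators) R01 and R02, and a candidate translation C produced by the ML system. Then the modified precision score for n-grams of length n is defined as

$$p_n = \frac{\sum_{\text{n-gram} \in C} \mathrm{CountClip}(\text{n-gram})}{\sum_{\text{n-gram} \in C} \mathrm{Count}(\text{n-gram})}$$

where Count(n-gram) is the number of times an n-gram appears in the candidate C, and CountClip(n-gram) is that count clipped to the maximum number of times the n-gram appears in any single reference translation, that is, CountClip(n-gram) = min(Count(n-gram), MaxRefCount(n-gram)).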
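To make the definition concrete, here is a minimal Python sketch of the modified precision score. It assumes whitespace-tokenized sentences and the usual convention that each candidate n-gram count is clipped to its maximum count in any single reference; the function names `ngrams` and `modified_precision` are illustrative and not taken from any library.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0  # candidate is shorter than n tokens
    # For each n-gram, the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Clip each candidate count to that maximum, then normalize.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# A degenerate candidate that repeats one common word.
candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
p1 = modified_precision(candidate, references, 1)
print(p1)  # 2/7: "the" occurs at most twice in a single reference
```

Without clipping, this candidate would get a unigram precision of 7/7, since every one of its words appears in a reference. Clipping is what keeps such degenerate output from scoring well.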

BLEU Score

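In its standard form, the BLEU score combines the modified precision scores p_n (the modified precision for n-grams of length n) through a weighted geometric mean and multiplies the result by a brevity penalty BP:

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right), \qquad \mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$

where we are considering n-grams up to length N, the w_n are positive weights that sum to one (typically w_n = 1/N), c is the length of the candidate translation, and r is the effective reference length.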
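As a rough illustration, the sketch below computes a sentence-level BLEU score in Python. It rests on several assumptions that are not spelled out in the text: uniform weights 1/N, the standard brevity penalty exp(1 - r/c) for candidates no longer than the reference, the reference length closest to the candidate length as r, and a score of 0 whenever any n-gram precision is 0 (no smoothing). The names `clipped_precision` and `bleu` are hypothetical helpers, not a library API.

```python
import math
from collections import Counter

def clipped_precision(candidate, references, n):
    """Modified (clipped) n-gram precision for tokenized sentences."""
    cand = Counter(tuple(candidate[i:i + n])
                   for i in range(len(candidate) - n + 1))
    if not cand:
        return 0.0
    best = Counter()
    for ref in references:
        ref_counts = Counter(tuple(ref[i:i + n])
                             for i in range(len(ref) - n + 1))
        best |= ref_counts  # Counter union keeps the per-key maximum
    # Counter intersection takes the per-key minimum, i.e. the clipping step.
    return sum((cand & best).values()) / sum(cand.values())

def bleu(candidate, references, max_n=4):
    """Unsmoothed BLEU with uniform weights (an illustrative sketch)."""
    precisions = [clipped_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # log(0) is undefined; treat the score as 0
    c = len(candidate)
    # Assumed convention: reference length closest to c, ties go to the shorter.
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

candidate = "the cat is on the mat".split()
references = ["the cat sat on the mat".split()]
print(bleu(candidate, references, max_n=2))  # sqrt(5/6 * 3/5), about 0.707
print(bleu(candidate, references))           # 0.0, no 4-gram matches
```

The second call shows why sentence-level use is fragile: a single wrong word in a six-word sentence removes every matching 4-gram, so the unsmoothed score collapses to zero.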

Critique of BLEU Score

There are several reasonable critiques of the BLEU score. It was not really defined for sentence-level scoring (as opposed to scoring an entire corpus). It does not consider meaning: if the reference translation is “This apple is good”, then the two candidate translations “This apple is amazing” and “This apple is bad” receive the same score, even though the first is certainly much better than the second. It also doesn’t work well with certain kinds of languages, and there are many more critiques besides.
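The apple example can be checked directly. The short Python sketch below assumes whitespace tokenization and a single reference, and compares clipped unigram and bigram precisions for the two candidates; `clipped_precision` is an illustrative helper written for this example, not a library function.

```python
from collections import Counter

def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision against a single tokenized reference."""
    cand = Counter(zip(*(candidate[i:] for i in range(n))))
    ref = Counter(zip(*(reference[i:] for i in range(n))))
    # Counter intersection keeps the minimum count per n-gram.
    return sum((cand & ref).values()) / sum(cand.values())

reference = "This apple is good".split()
amazing = "This apple is amazing".split()
bad = "This apple is bad".split()

scores_amazing = [clipped_precision(amazing, reference, n) for n in (1, 2)]
scores_bad = [clipped_precision(bad, reference, n) for n in (1, 2)]
print(scores_amazing)
print(scores_bad)
print(scores_amazing == scores_bad)  # True
```

Both candidates match three of four unigrams and two of three bigrams, and both have the same length as the reference, so any score built only from these quantities cannot tell them apart.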

Most of all, it is important to keep your use case in mind and to have at least some human expert evaluation of the NLP system, particularly if it will be exposed to real users in a product. If you are not fully satisfied with the output of a system evaluated with the BLEU score, you might consider alternative metrics such as the F1 score, WER (word error rate), NIST, STM (subtree metric), and several others.