
Evaluation of machine translation

Various methods for the evaluation of machine translation have been employed. This article focuses on the evaluation of the output of machine translation, rather than on performance or usability evaluation.

Round-trip translation

A typical way for lay people to assess machine translation quality is to translate from a source language to a target language and back to the source language with the same engine. Though intuitively this may seem like a good method of evaluation, it has been shown that round-trip translation is a "poor predictor of quality".[1] The reason why it is such a poor predictor of quality is reasonably intuitive. A round-trip translation is not testing one system, but two systems: the language pair of the engine for translating into the target language, and the language pair translating back from the target language.
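As a rough illustration (not drawn from the sources cited here), round-trip translation amounts to the following sketch, where translate is a hypothetical stand-in for whatever MT engine is being tested:

```python
# A minimal sketch of round-trip translation, assuming a hypothetical
# `translate(text, src, tgt)` callable standing in for any MT engine's API.
def round_trip(text, src, tgt, translate):
    """Translate text into the target language and back with the same engine."""
    forward = translate(text, src=src, tgt=tgt)       # exercises the src -> tgt language pair
    backward = translate(forward, src=tgt, tgt=src)   # exercises the tgt -> src language pair
    return backward

# Comparing round_trip(text, ...) with the original text therefore judges two
# systems at once, which is why it says little about either one in isolation.
```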

Consider the following examples of round-trip translation performed from English to Italian and Portuguese from Somers (2005):

Original text:    Select this link to look at our home page.
Translated:       Selezioni questo collegamento per guardare il nostro Home Page.
Translated back:  Selections this connection in order to watch our Home Page.

Original text:    Tit for tat
Translated:       Melharuco para o tat
Translated back:  Tit for tat

In the first example, where the text is translated into Italian and then back into English, the English text is significantly garbled, but the Italian is a serviceable translation. In the second example, the text translated back into English is perfect, but the Portuguese translation is meaningless: the program interpreted "tit" as a reference to the bird (hence "melharuco") and left "tat", a word it did not understand, untranslated.

While round-trip translation may be useful to generate a "surplus of fun,"[2] the methodology is deficient for serious study of machine translation quality.

Human evaluation

This section covers two large-scale evaluation studies that have had a significant impact on the field: the ALPAC 1966 study and the ARPA study.[3]

Automatic Language Processing Advisory Committee (ALPAC)

One of the constituent parts of the ALPAC report was a study comparing different levels of human translation with machine translation output, using human subjects as judges. The human judges were specially trained for the purpose. The evaluation study compared an MT system translating from Russian into English with human translators, on two variables.

The variables studied were "intelligibility" and "fidelity". Intelligibility was a measure of how "understandable" the sentence was, and was measured on a scale of 1–9. Fidelity was a measure of how much information the translated sentence retained compared to the original, and was measured on a scale of 0–9. Each point on the scale was associated with a textual description. For example, 3 on the intelligibility scale was described as "Generally unintelligible; it tends to read like nonsense but, with a considerable amount of reflection and study, one can at least hypothesize the idea intended by the sentence".[4]

Intelligibility was measured without reference to the original, while fidelity was measured indirectly. The translated sentence was presented first; after the judge had read it and absorbed its content, the original sentence was presented, and the judge was asked to rate it for informativeness. The more new information the original provided at that point, the less the translation had conveyed, and so the lower the quality of the translation.

The study showed that the variables were highly correlated when the human judgment was averaged per sentence. The variation among raters was small, but the researchers recommended that at the very least, three or four raters should be used. The evaluation methodology managed to separate translations by humans from translations by machines with ease.

The study concluded that, "highly reliable assessments can be made of the quality of human and machine translations".[4]

Advanced Research Projects Agency (ARPA)

As part of the Human Language Technologies Program, the Advanced Research Projects Agency (ARPA) created a methodology to evaluate machine translation systems, and continues to perform evaluations based on this methodology. The evaluation programme was instigated in 1991, and continues to this day. Details of the programme can be found in White et al. (1994) and White (1995).

The evaluation programme involved testing several systems based on different theoretical approaches: statistical, rule-based and human-assisted. A number of methods for the evaluation of the output from these systems were tested in 1992, and the most suitable methods were selected for inclusion in the programmes for subsequent years. The methods were: comprehension evaluation, quality panel evaluation, and evaluation based on adequacy and fluency.

Comprehension evaluation aimed to compare systems directly based on the results of multiple-choice comprehension tests, as in Church et al. (1993). The texts chosen were a set of English articles on financial news. These articles were translated by professional translators into a series of languages and then translated back into English by the machine translation systems. This approach was judged inadequate as a standalone method of comparing systems and was abandoned, partly because of the changes in meaning introduced in the process of translating from English.

The idea of quality panel evaluation was to submit translations to a panel of expert native English speakers, who were professional translators, and have them evaluate the translations. The evaluations were done on the basis of a metric modelled on a standard US government metric used to rate human translations. This was good from the point of view that the metric was "externally motivated",[3] since it was not specifically developed for machine translation. However, the quality panel evaluation was very difficult to set up logistically, as it required having a number of experts together in one place for a week or more, and furthermore required them to reach consensus. This method was also abandoned.

Along with a modified form of the comprehension evaluation (re-styled as informativeness evaluation), the most popular method was to obtain ratings from monolingual judges for segments of a document. The judges were presented with a segment and asked to rate it on two variables, adequacy and fluency. Adequacy is a rating of how much of the information in the original is transferred to the translation, and fluency is a rating of how good the English is. This technique was found to cover the relevant parts of the quality panel evaluation while being easier to deploy, as it did not require expert judgment.

Measuring systems based on adequacy and fluency, along with informativeness, is now the standard methodology for the ARPA evaluation program.[5]

Automatic evaluation

In the context of this article, a metric is a measurement. A metric that evaluates machine translation output represents the quality of the output. The quality of a translation is inherently subjective; there is no objective or quantifiable "good". Therefore, any metric must assign quality scores that correlate with human judgments of quality. That is, a metric should give high scores to translations that humans score highly, and low scores to those that humans score poorly. Human judgment is the benchmark for assessing automatic metrics, as humans are the end-users of any translation output.

The measure of evaluation for metrics is correlation with human judgment. This is generally done at two levels: at the sentence level, where scores are calculated by the metric for a set of translated sentences and then correlated against human judgments for the same sentences, and at the corpus level, where scores over the sentences are aggregated for both human judgments and metric judgments and these aggregate scores are then correlated. Figures for correlation at the sentence level are rarely reported, although Banerjee et al. (2005) do give correlation figures showing that, at least for their metric, sentence-level correlation is substantially worse than corpus-level correlation.
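As an illustration of how such correlations are typically computed (the scores below are invented placeholders, not data from Banerjee et al.), a minimal sketch using Pearson's r might look like this:

```python
# Minimal sketch of metric-human correlation using Pearson's r from SciPy.
from scipy.stats import pearsonr

# One metric score and one human judgment per translated sentence (toy data).
metric_scores = [0.42, 0.35, 0.61, 0.58, 0.29]
human_scores  = [3.0, 2.5, 4.0, 4.5, 2.0]

# Sentence-level correlation: correlate the per-sentence scores directly.
sentence_r, _ = pearsonr(metric_scores, human_scores)
print(f"sentence-level Pearson r = {sentence_r:.3f}")

# Corpus-level correlation would instead aggregate the scores (e.g. one
# averaged metric score and one averaged human score per MT system) and
# correlate those aggregates across systems.
```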

While not widely reported, it has been noted that the genre, or domain, of a text has an effect on the correlation obtained when using metrics. Coughlin (2003) reports that comparing the candidate text against a single reference translation does not adversely affect the correlation of metrics when working with restricted-domain text.

Even if a metric correlates well with human judgment in one study on one corpus, this successful correlation may not carry over to another corpus. Good metric performance across text types or domains is important for the reusability of the metric. A metric that only works for text in a specific domain is useful, but less useful than one that works across many domains, because creating a new metric for every new evaluation or domain is undesirable.

Another important factor in the usefulness of an evaluation metric is that it correlates well even when working with small amounts of data, that is, few candidate sentences and reference translations. Turian et al. (2003) point out that "Any MT evaluation measure is less reliable on shorter translations", and show that increasing the amount of data improves the reliability of a metric. However, they add that "... reliability on shorter texts, as short as one sentence or even one phrase, is highly desirable because a reliable MT evaluation measure can greatly accelerate exploratory data analysis".[6]

Banerjee et al. (2005) highlight five attributes that a good automatic metric must possess: correlation, sensitivity, consistency, reliability and generality. A good metric must correlate highly with human judgment; it must be consistent, giving similar results for the same MT system on similar text; it must be sensitive to differences between MT systems; it must be reliable, in that MT systems that score similarly should be expected to perform similarly; and it must be general, that is, it should work across different text domains and in a wide range of scenarios and MT tasks.

The aim of this subsection is to give an overview of the state of the art in automatic metrics for evaluating machine translation.[7]

BLEU

BLEU was one of the first metrics to report a high correlation with human judgments of quality. The metric is currently one of the most popular in the field. The central idea behind the metric is that "the closer a machine translation is to a professional human translation, the better it is".[8] The metric calculates scores for individual segments, generally sentences, and then averages these scores over the whole corpus for a final score. It has been shown to correlate highly with human judgments of quality at the corpus level.[9]

BLEU uses a modified form of precision to compare a candidate translation against multiple reference translations. The metric modifies simple precision because machine translation systems have been known to generate more words than appear in a reference text. No other machine translation metric has yet been shown to significantly outperform BLEU with respect to correlation with human judgment across language pairs.[10]
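A minimal sketch of the clipped ("modified") n-gram precision at the heart of BLEU is shown below; it is illustrative only and omits the combination over n-gram orders and the brevity penalty described by Papineni et al. (2002):

```python
# Sketch of BLEU's modified (clipped) n-gram precision; not a full BLEU implementation.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n=1):
    cand_counts = Counter(ngrams(candidate, n))
    # Clip each candidate n-gram count by its maximum count in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

candidate = "the the the the".split()
references = ["the cat is on the mat".split()]
print(modified_precision(candidate, references, n=1))  # 0.5, not 1.0: counts are clipped
```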

NIST

The NIST metric is based on the BLEU metric, but with some alterations. Where BLEU simply calculates n-gram precision adding equal weight to each one, NIST also calculates how informative a particular n-gram is. That is to say, when a correct n-gram is found, the rarer that n-gram is, the more weight it is given.[11] For example, if the bigram "on the" correctly matches, it receives lower weight than the correct matching of bigram "interesting calculations," as this is less likely to occur. NIST also differs from BLEU in its calculation of the brevity penalty, insofar as small variations in translation length do not impact the overall score as much.
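The weighting idea can be sketched as follows; the reference text and counts are toy placeholders, and the formula follows the information-weight definition in Doddington (2002) rather than reproducing the full NIST score:

```python
# Sketch of the n-gram information weight used by NIST: an n-gram is weighted by
# how unexpected it is given its (n-1)-gram prefix in the reference data.
import math
from collections import Counter

def info_weight(ngram, ngram_counts, prefix_counts, total_words):
    """info(w1..wn) = log2(count(w1..w_{n-1}) / count(w1..wn))."""
    prefix = ngram[:-1]
    prefix_count = prefix_counts[prefix] if prefix else total_words
    return math.log2(prefix_count / ngram_counts[ngram])

reference = "on the table on the chair on the floor the cat sat near the big table".split()
unigram_counts = Counter((w,) for w in reference)
bigram_counts = Counter(tuple(reference[i:i + 2]) for i in range(len(reference) - 1))

# "on the" is fully predictable from "on" in this toy corpus, so it carries no information...
print(info_weight(("on", "the"), bigram_counts, unigram_counts, len(reference)))   # 0.0
# ...while "the big" is a rare continuation of the very common "the", so it gets more weight.
print(info_weight(("the", "big"), bigram_counts, unigram_counts, len(reference)))  # ~2.32
```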

Word error rate

The word error rate (WER) is a metric based on the Levenshtein distance: whereas the Levenshtein distance works at the character level, WER works at the word level. It was originally used for measuring the performance of speech recognition systems, but it is also used in the evaluation of machine translation. The metric is based on the number of words that differ between a piece of machine-translated text and a reference translation.
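A minimal sketch of the computation, assuming the common convention of normalising the word-level edit distance by the reference length, is:

```python
# Word error rate: Levenshtein distance over word tokens, divided by reference length.
def word_error_rate(hypothesis: str, reference: str) -> float:
    hyp, ref = hypothesis.split(), reference.split()
    # Standard dynamic-programming edit distance (substitutions, insertions, deletions).
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution or match
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on mat", "the cat sat on the mat"))  # 1 error / 6 words ≈ 0.167
```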

A related metric is the Position-independent word error rate (PER), which allows for the re-ordering of words and sequences of words between a translated text and a reference translation.

METEOR

The METEOR metric is designed to address some of the deficiencies inherent in the BLEU metric. The metric is based on the weighted harmonic mean of unigram precision and unigram recall. The metric was designed after research by Lavie (2004) into the significance of recall in evaluation metrics. Their research showed that metrics based on recall consistently achieved higher correlation than those based on precision alone, cf. BLEU and NIST.[12]
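The recall-weighted harmonic mean at the core of METEOR can be sketched as follows. This is a simplification that uses exact unigram overlap only and omits the alignment and fragmentation penalty of the full metric; the 9:1 weighting towards recall follows Banerjee and Lavie (2005):

```python
# Simplified sketch of METEOR's recall-weighted harmonic mean of unigram
# precision and recall. Real METEOR aligns unigrams one-to-one (with stemming
# and synonymy modules) and applies a fragmentation penalty; this sketch does not.
def meteor_fmean(candidate_tokens, reference_tokens):
    matches = len(set(candidate_tokens) & set(reference_tokens))  # exact-match unigram types only
    if matches == 0:
        return 0.0
    precision = matches / len(candidate_tokens)
    recall = matches / len(reference_tokens)
    # Harmonic mean weighted 9:1 towards recall.
    return 10 * precision * recall / (recall + 9 * precision)

print(meteor_fmean("the cat sat on the mat".split(), "the cat is on the mat".split()))
```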

METEOR also includes some features not found in other metrics, such as synonymy matching, where instead of matching only on the exact word form, the metric also matches on synonyms. For example, if the reference contains the word "good" and the translation renders it as "well", this counts as a match. The metric also includes a stemming module, which stems words and matches on the stemmed forms. The implementation of the metric is modular insofar as the algorithms that match words are implemented as modules, and new modules that implement different matching strategies may easily be added.

LEPOR

The LEPOR metric was proposed as a combination of several evaluation factors, including existing ones (precision, recall) and modified ones (a sentence-length penalty and an n-gram-based word-order penalty). It was evaluated on eight language pairs from ACL-WMT2011, covering English-to-other (Spanish, French, German and Czech) and the inverse, and showed higher system-level correlation with human judgments than several existing metrics such as BLEU, Meteor-1.3, TER, AMBER and MP4IBM1.[13] An enhanced version of the metric, hLEPOR, was introduced subsequently.[14] hLEPOR uses the harmonic mean to combine the sub-factors of the metric, together with a set of parameters that tune the weights of the sub-factors for different language pairs. In the ACL-WMT13 metrics shared task,[15] hLEPOR yielded the highest Pearson correlation with human judgment on the English-to-Russian language pair, as well as the highest average score across five language pairs (English to German, French, Spanish, Czech and Russian). The detailed results of the WMT13 metrics task are reported in the corresponding paper.[16]
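As an illustration of the weighted-harmonic-mean combination that hLEPOR uses (the sub-factor values and weights below are invented placeholders, not the factor definitions or tuned weights from Han et al., 2013a):

```python
# Illustrative sketch of combining sub-factor scores with a weighted harmonic
# mean, as hLEPOR does. The factor values and weights here are hypothetical;
# the actual sub-factors are a length penalty, an n-gram position-difference
# penalty and a harmonic precision-recall component, with weights tuned per
# language pair.
def weighted_harmonic_mean(factors, weights):
    assert len(factors) == len(weights) and all(f > 0 for f in factors)
    return sum(weights) / sum(w / f for w, f in zip(weights, factors))

length_penalty, position_penalty, harmonic_pr = 0.92, 0.85, 0.78  # toy factor scores
weights = (2.0, 1.0, 7.0)                                         # hypothetical weights
print(weighted_harmonic_mean((length_penalty, position_penalty, harmonic_pr), weights))
```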

Overviews on Human and Automatic Evaluation Methodologies

Several surveys of machine translation evaluation[17][18][19] describe in more detail the human evaluation methods that have been used and how they work, covering criteria such as intelligibility, fidelity, fluency, adequacy, comprehension and informativeness. For automatic evaluation, they also offer clear classifications, such as lexical-similarity methods and methods based on linguistic features, together with the subfields of these two aspects: lexical similarity covers edit distance, precision, recall and word order, while linguistic features are divided into syntactic and semantic features. A more recent overview of both manual and automatic translation evaluation[20] covers recently developed translation quality assessment (TQA) methodologies, such as the use of crowdsourced intelligence (Amazon Mechanical Turk), statistical significance testing, the revisiting of traditional criteria with newly designed strategies, and the MT quality estimation (QE) shared tasks from the annual workshop on machine translation (WMT),[21] along with corresponding models that do not rely on human reference translations.

See also

  • Comparison of machine translation applications
  • Machine translation software usability

Notes

  1. ^ Somers (2005)
  2. ^ Gaspari (2006)
  3. ^ a b White et al. (1994)
  4. ^ a b ALPAC (1966)
  5. ^ White (1995)
  6. ^ Turian et al. (2003)
  7. ^ While the metrics are described as for the evaluation of machine translation, in practice they may also be used to measure the quality of human translation. The same metrics have even been used for plagiarism detection, for details see Somers et al. (2006).
  8. ^ Papineni et al. (2002)
  9. ^ Papineni et al. (2002), Coughlin (2003)
  10. ^ Graham and Baldwin (2014)
  11. ^ Doddington (2002)
  12. ^ Lavie (2004)
  13. ^ Han (2012)
  14. ^ Han et al. (2013a)
  15. ^ ACL-WMT (2013)
  16. ^ Han et al. (2013b)
  17. ^ EuroMatrix. (2007).
  18. ^ Dorr et al. ()
  19. ^ Han (2016)
  20. ^ Han et al. (2021)
  21. ^ "WMT Conference - Home".

References

  • Banerjee, S. and Lavie, A. (2005) "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments" in Proceedings of Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the Association of Computational Linguistics (ACL-2005), Ann Arbor, Michigan, June 2005
  • Church, K. and Hovy, E. (1993) "Good Applications for Crummy Machine Translation". Machine Translation, 8 pp. 239–258
  • Coughlin, D. (2003) "Correlating Automated and Human Assessments of Machine Translation Quality" in MT Summit IX, New Orleans, USA pp. 23–27
  • Doddington, G. (2002) "Automatic evaluation of machine translation quality using n-gram cooccurrence statistics". Proceedings of the Human Language Technology Conference (HLT), San Diego, CA pp. 128–132
  • Gaspari, F. (2006) "Look Who's Translating. Impersonations, Chinese Whispers and Fun with Machine Translation on the Internet" in Proceedings of the 11th Annual Conference of the European Association of Machine Translation
  • Graham, Y. and T. Baldwin. (2014) "Testing for Significance of Increased Correlation with Human Judgment". Proceedings of EMNLP 2014, Doha, Qatar
  • Lavie, A., Sagae, K. and Jayaraman, S. (2004) "The Significance of Recall in Automatic Metrics for MT Evaluation" in Proceedings of AMTA 2004, Washington DC. September 2004
  • Papineni, K., Roukos, S., Ward, T., and Zhu, W. J. (2002). "BLEU: a method for automatic evaluation of machine translation" in ACL-2002: 40th Annual meeting of the Association for Computational Linguistics pp. 311–318
  • Somers, H. (2005) "Round-trip Translation: What Is It Good For?"
  • Somers, H., Gaspari, F. and Ana Niño (2006) "Detecting Inappropriate Use of Free Online Machine Translation by Language Students - A Special Case of Plagiarism Detection". Proceedings of the 11th Annual Conference of the European Association of Machine Translation, Oslo University (Norway) pp. 41–48
  • ALPAC (1966) "Languages and machines: computers in translation and linguistics". A report by the Automatic Language Processing Advisory Committee, Division of Behavioral Sciences, National Academy of Sciences, National Research Council. Washington, D.C.: National Academy of Sciences, National Research Council, 1966. (Publication 1416.)
  • Turian, J., Shen, L. and Melamed, I. D. (2003) "Evaluation of Machine Translation and its Evaluation". Proceedings of the MT Summit IX, New Orleans, USA, 2003 pp. 386–393
  • White, J., O'Connell, T. and O'Mara, F. (1994) "The ARPA MT Evaluation Methodologies: Evolution, Lessons, and Future Approaches". Proceedings of the 1st Conference of the Association for Machine Translation in the Americas. Columbia, MD pp. 193–205
  • White, J. (1995) "Approaches to Black Box MT Evaluation". Proceedings of MT Summit V
  • Han, A.L.F., Wong, D.F., and Chao, L.S. (2012) "LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors" in Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012): Posters, Mumbai, India. Open source tool pp. 441–450
  • Han, A.L.F., Wong, D.F., Chao, L.S., He, L., Lu, Y., Xing, J., and Zeng, X. (2013a) "Language-independent Model for Machine Translation Evaluation with Reinforced Factors" in Proceedings of the Machine Translation Summit XIV, Nice, France. International Association for Machine Translation. Open source tool
  • ACL-WMT. (2013) "ACL-WMT13 METRICS TASK"
  • Han, A.L.F., Wong, D.F., Chao, L.S., Lu, Y., He, L., Wang, Y., and Zhou, J. (2013b) "A Description of Tunable Machine Translation Evaluation Systems in WMT13 Metrics Task" in Proceedings of the Eighth Workshop on Statistical Machine Translation, ACL-WMT13, Sofia, Bulgaria. Association for Computational Linguistics. Online paper pp. 414–421
  • Han, Lifeng (2016) "Machine Translation Evaluation Resources and Methods: A Survey" in arXiv:1605.04515 [cs.CL], pp. 1–14, May 2016
  • EuroMatrix. 2007. 1.3: Survey of Machine Translation Evaluation. Public Distribution. Project funded by the European Community under the Sixth Framework Programme for Research and Technological Development.
  • Bonnie Dorr, Matt Snover, Nitin Madnani. Part 5: Machine Translation Evaluation. Editor: Bonnie Dorr. Book chapter.
  • Han, Lifeng, Jones, Gareth and Smeaton, Alan (2021) "Translation quality assessment: a brief survey on manual and automatic methods". In: MoTra21: Workshop on Modelling Translation: Translatology in the Digital Age, @NoDaLiDa 2021. 19 pages. Association for Computational Linguistics.

Further reading

  • Machine Translation Archive: Subject Index: Publications after 2000, archived 6 February 2010 at the Wayback Machine (see Evaluation subheading)
  • Machine Translation Archive: Subject Index: Publications prior to 2000, archived 21 June 2009 at the Wayback Machine (see Evaluation subheading)
  • Machine Translation Evaluation: A Survey : Publications up to 2015

Software for Automated Evaluation

  • Asia Online Language Studio - supports BLEU, TER, F-Measure, METEOR
  • BLEU
  • F-Measure
  • NIST
  • METEOR
  • TER
  • TERP
  • LEPOR
  • hLEPOR
  • KantanAnalytics - segment-level MT quality estimation
