
Paraphrasing (computational linguistics)

Paraphrase or paraphrasing in computational linguistics is the natural language processing task of detecting and generating paraphrases. Applications of paraphrasing are varied, including information retrieval, question answering, text summarization, and plagiarism detection.[1] Paraphrasing is also useful in the evaluation of machine translation,[2] as well as in semantic parsing[3] and the generation[4] of new samples to expand existing corpora.[5]

Paraphrase generation

Multiple sequence alignment

Barzilay and Lee[5] proposed a method to generate paraphrases using monolingual parallel corpora, namely news articles covering the same event on the same day. Training consists of using multiple-sequence alignment to generate sentence-level paraphrases from an unannotated corpus. This is done by

  • finding recurring patterns in each individual corpus, e.g. "X (injured/wounded) Y people, Z seriously" where X, Y, and Z are variables
  • finding pairings between such patterns that represent paraphrases, e.g. "X (injured/wounded) Y people, Z seriously" and "Y were (wounded/hurt) by X, among them Z were in serious condition"

This is achieved by first clustering similar sentences together using n-gram overlap. Recurring patterns are then found within clusters using multiple-sequence alignment. The position of argument words is determined by finding areas of high variability within each cluster, i.e., areas between the words shared by more than 50% of a cluster's sentences. Pairings between patterns are then found by comparing similar variable words between different corpora. Finally, new paraphrases can be generated by choosing a matching cluster for a source sentence and substituting the source sentence's arguments into any number of patterns in the cluster.
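
A minimal sketch of the first step, clustering sentences by n-gram overlap, is shown below; the bigram setting, the overlap threshold, and the greedy single-pass loop are illustrative assumptions rather than the authors' implementation.

```python
def ngrams(tokens, n=2):
    """Set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(a, b, n=2):
    """Jaccard overlap between the n-gram sets of two tokenized sentences."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga and gb else 0.0

def cluster_sentences(sentences, threshold=0.3, n=2):
    """Greedy single-pass clustering by n-gram overlap (illustrative only)."""
    clusters = []  # each cluster is a list of token lists
    for tokens in (s.lower().split() for s in sentences):
        for cluster in clusters:
            if any(overlap(tokens, member, n) >= threshold for member in cluster):
                cluster.append(tokens)
                break
        else:
            clusters.append([tokens])
    return clusters

corpus = [
    "a car bomb injured 12 people , 3 of them seriously",
    "a car bomb injured 20 people , 5 of them seriously",
    "the storm caused widespread flooding across the region",
]
for cluster in cluster_sentences(corpus):
    print([" ".join(tokens) for tokens in cluster])
```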

Phrase-based Machine Translation

Paraphrases can also be generated through the use of phrase-based translation, as proposed by Bannard and Callison-Burch.[6] The chief concept consists of aligning phrases in a pivot language to produce potential paraphrases in the original language. For example, the phrase "under control" in an English sentence is aligned with the phrase "unter Kontrolle" in its German counterpart. The phrase "unter Kontrolle" is then found in another German sentence where the aligned English phrase is "in check," a paraphrase of "under control."

The probability distribution can be modeled as $\Pr(e_2 \mid e_1)$, the probability that phrase $e_2$ is a paraphrase of $e_1$, which is equivalent to $\Pr(e_2 \mid f)\Pr(f \mid e_1)$ summed over all $f$, a potential phrase translation in the pivot language. Additionally, the sentence $S$ is added as a prior to add context to the paraphrase. Thus the optimal paraphrase $\hat{e}_2$ can be modeled as:

$$\hat{e}_2 = \operatorname*{arg\,max}_{e_2 \neq e_1} \Pr(e_2 \mid e_1, S) = \operatorname*{arg\,max}_{e_2 \neq e_1} \sum_{f} \Pr(e_2 \mid f, S)\,\Pr(f \mid e_1, S)$$

$\Pr(e_2 \mid f)$ and $\Pr(f \mid e_1)$ can be approximated by simply taking their frequencies. Adding $S$ as a prior is modeled by calculating the probability of forming the sentence $S$ when $e_1$ is substituted with $e_2$.
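
Setting aside the sentence-context prior, the pivot identity above can be illustrated with a toy phrase table in Python; the phrases and counts below are invented and merely stand in for the frequencies a word-aligned bilingual corpus would provide.

```python
from collections import defaultdict

# Toy aligned phrase pairs (English phrase, German pivot phrase); the repeated
# entries stand in for the frequency counts a real phrase table would provide.
alignments = [
    ("under control", "unter kontrolle"),
    ("under control", "unter kontrolle"),
    ("in check", "unter kontrolle"),
    ("under control", "unter der kontrolle"),
]

count_ef = defaultdict(int)  # joint counts of (english, pivot) phrase pairs
count_e = defaultdict(int)   # counts of English phrases
count_f = defaultdict(int)   # counts of pivot-language phrases
for e, f in alignments:
    count_ef[(e, f)] += 1
    count_e[e] += 1
    count_f[f] += 1

def p_f_given_e(f, e):
    return count_ef[(e, f)] / count_e[e]

def p_e_given_f(e, f):
    return count_ef[(e, f)] / count_f[f]

def paraphrase_prob(e2, e1):
    """Pr(e2 | e1) marginalised over all pivot phrases f aligned to e1."""
    return sum(p_e_given_f(e2, f) * p_f_given_e(f, e1)
               for f in count_f if count_ef[(e1, f)] > 0)

print(paraphrase_prob("in check", "under control"))  # ~0.22 on this toy table
```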

Long short-term memory

There has been success in using long short-term memory (LSTM) models to generate paraphrases.[7] In short, the model consists of an encoder and a decoder, both implemented as variations of a stacked residual LSTM. First, the encoding LSTM takes a one-hot encoding of the words in a sentence as input and produces a final hidden vector, which represents the input sentence. The decoding LSTM then takes the hidden vector as input and generates a new sentence, terminating in an end-of-sentence token. The encoder and decoder are trained to take a phrase and reproduce the one-hot distribution of a corresponding paraphrase by minimizing perplexity using simple stochastic gradient descent. New paraphrases are generated by feeding a new phrase to the encoder and passing the output to the decoder.
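
A rough PyTorch sketch of such an encoder is given below, assuming a stack of LSTM layers joined by residual connections; the layer count, hidden size, and the way the sentence vector is read off are assumptions for illustration, not the cited model's exact configuration.

```python
import torch
import torch.nn as nn

class StackedResidualLSTMEncoder(nn.Module):
    """A stack of single-layer LSTMs with residual connections between layers;
    the final hidden state is used as a fixed-size representation of the sentence."""

    def __init__(self, vocab_size, hidden_size=256, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.layers = nn.ModuleList(
            [nn.LSTM(hidden_size, hidden_size, batch_first=True)
             for _ in range(num_layers)]
        )

    def forward(self, token_ids):
        x = self.embed(token_ids)        # (batch, seq_len, hidden)
        h = None
        for lstm in self.layers:
            out, (h, _) = lstm(x)
            x = x + out                  # residual connection between stacked layers
        return h[-1]                     # (batch, hidden): sentence vector for the decoder

# A decoder LSTM would take this vector as its initial state and emit one token at a
# time until an end-of-sentence symbol; both halves are trained jointly on paraphrase pairs.
encoder = StackedResidualLSTMEncoder(vocab_size=10_000)
sentence = torch.randint(0, 10_000, (1, 7))  # a batch with one sentence of 7 token ids
print(encoder(sentence).shape)               # torch.Size([1, 256])
```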

Transformers

With the introduction of Transformer models, paraphrase generation approaches improved their ability to generate text by scaling neural network parameters and heavily parallelizing training through feed-forward layers.[8] These models are so fluent in generating text that human experts cannot reliably identify whether an example was human-authored or machine-generated.[9] Transformer-based paraphrase generation relies on autoencoding, autoregressive, or sequence-to-sequence (seq2seq) methods. Autoencoder models predict word replacement candidates with a one-hot distribution over the vocabulary, while autoregressive and seq2seq models generate new text based on the source, predicting one word at a time.[10][11] More advanced efforts also exist to make paraphrasing controllable according to predefined quality dimensions, such as semantic preservation or lexical diversity.[12] Many Transformer-based paraphrase generation methods rely on unsupervised learning to leverage large amounts of training data and scale their methods.[13][14]
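
As a minimal illustration of the sequence-to-sequence approach, a pretrained encoder-decoder checkpoint can be asked for several candidate paraphrases via beam search using the Hugging Face transformers library; the checkpoint name below is only a placeholder for any paraphrase-tuned model and is not prescribed by the methods cited above.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# The checkpoint name is a stand-in for any seq2seq model fine-tuned on paraphrase pairs.
model_name = "Vamsi/T5_Paraphrase_Paws"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

source = "paraphrase: The situation is under control."
inputs = tokenizer(source, return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,
    num_return_sequences=3,  # several candidate paraphrases from the beam
    max_new_tokens=32,
)
for ids in outputs:
    print(tokenizer.decode(ids, skip_special_tokens=True))
```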

Paraphrase recognition

Recursive Autoencoders

Paraphrase recognition has been attempted by Socher et al.[1] through the use of recursive autoencoders. The main concept is to produce a vector representation of a sentence and its components by recursively applying an autoencoder. Since paraphrases should have similar vector representations, the representations of two candidate sentences are compared and then fed as input into a neural network for classification.

Given a sentence $W$ with $m$ words, the autoencoder is designed to take two $n$-dimensional word embeddings as input and produce an $n$-dimensional vector as output. The same autoencoder is applied to every pair of words in $W$ to produce $\lfloor m/2 \rfloor$ vectors. The autoencoder is then applied recursively with the new vectors as inputs until a single vector is produced. Given an odd number of inputs, the first vector is forwarded as-is to the next level of recursion. The autoencoder is trained to reproduce every vector in the full recursion tree, including the initial word embeddings.
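
The recursive reduction from word vectors to a single sentence vector can be sketched as follows; the tanh combination with random weights is only a stand-in for the trained autoencoder's encode step.

```python
import numpy as np

def combine(left, right, W, b):
    """Stand-in for the trained autoencoder's encode step: two n-dim vectors in, one n-dim vector out."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

def recursive_encode(word_vectors, W, b):
    """Combine adjacent vectors level by level until one sentence vector remains,
    collecting every vector of the recursion tree along the way."""
    nodes = list(word_vectors)           # all vectors produced, incl. word embeddings
    level = list(word_vectors)
    while len(level) > 1:
        carry = []
        if len(level) % 2 == 1:          # odd count: forward the first vector as-is
            carry, level = [level[0]], level[1:]
        new = [combine(level[i], level[i + 1], W, b) for i in range(0, len(level), 2)]
        nodes.extend(new)
        level = carry + new
    return level[0], nodes

rng = np.random.default_rng(0)
n = 4                                    # embedding dimensionality
W, b = rng.standard_normal((n, 2 * n)), rng.standard_normal(n)
words = [rng.standard_normal(n) for _ in range(4)]   # a 4-word sentence
sentence_vector, tree = recursive_encode(words, W, b)
print(sentence_vector.shape, len(tree))  # (4,) 7 — matches the example below
```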

Given two sentences $W_1$ and $W_2$ of lengths 4 and 3 respectively, the autoencoders would produce 7 and 5 vector representations, including the initial word embeddings. The Euclidean distance is then taken between every combination of vectors in $W_1$ and $W_2$ to produce a similarity matrix $S \in \mathbb{R}^{7 \times 5}$. $S$ is then subjected to a dynamic min-pooling layer to produce a fixed-size $n_p \times n_p$ matrix. Since $S$ is not uniform in size among all potential sentence pairs, it is split into $n_p$ roughly even sections. The output is then normalized to have mean 0 and standard deviation 1 and is fed into a fully connected layer with a softmax output. The dynamic pooling to softmax model is trained using pairs of known paraphrases.
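
Dynamic min-pooling itself can be sketched with NumPy: the variable-size distance matrix is split into an $n_p \times n_p$ grid of roughly even sections, the minimum of each section is kept, and the result is normalized; the choice $n_p = 4$ is an assumption for illustration.

```python
import numpy as np

def dynamic_min_pool(similarity, n_p=4):
    """Split a variable-size similarity matrix into an n_p x n_p grid of roughly even
    sections, keep the minimum of each section, and normalize the result."""
    rows = np.array_split(np.arange(similarity.shape[0]), n_p)
    cols = np.array_split(np.arange(similarity.shape[1]), n_p)
    pooled = np.empty((n_p, n_p))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            pooled[i, j] = similarity[np.ix_(r, c)].min()
    return (pooled - pooled.mean()) / pooled.std()   # zero mean, unit std for the classifier

S = np.random.rand(7, 5)            # stand-in for the 7x5 distance matrix of the example above
print(dynamic_min_pool(S).shape)    # (4, 4), regardless of the input sentences' lengths
```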

Skip-thought vectors

Skip-thought vectors are an attempt to create a vector representation of the semantic meaning of a sentence, similar to the skip-gram model.[15] Skip-thought vectors are produced using a skip-thought model, which consists of three key components: an encoder and two decoders. Given a corpus of documents, the skip-thought model is trained to take a sentence as input and encode it into a skip-thought vector. The skip-thought vector is used as input for both decoders; one attempts to reproduce the previous sentence and the other the following sentence in its entirety. The encoder and decoders can be implemented with a recurrent neural network (RNN) or an LSTM.

Since paraphrases carry the same semantic meaning as one another, they should have similar skip-thought vectors. Thus a simple logistic regression can be trained to good performance with the absolute difference and component-wise product of two skip-thought vectors as input.
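
A minimal sketch of such a classifier, using scikit-learn's logistic regression on the absolute-difference and element-wise-product features, might look as follows; random vectors stand in here for a trained skip-thought encoder.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def encode(sentence, dim=2400):
    """Stand-in for a trained skip-thought encoder: deterministic random vectors."""
    rng = np.random.default_rng(abs(hash(sentence)) % (2 ** 32))
    return rng.standard_normal(dim)

def pair_features(s1, s2):
    v1, v2 = encode(s1), encode(s2)
    return np.concatenate([np.abs(v1 - v2), v1 * v2])  # |u - v| and u * v features

pairs = [
    ("he fixed the car", "he repaired the car", 1),   # paraphrase
    ("he fixed the car", "she read a book", 0),       # not a paraphrase
]
X = np.stack([pair_features(a, b) for a, b, _ in pairs])
y = [label for _, _, label in pairs]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X))   # with a real encoder, this generalises to unseen pairs
```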

Transformers

Similar to how Transformer models influenced paraphrase generation, their application to paraphrase identification has been highly successful. Models such as BERT can be adapted with a binary classification layer and trained end-to-end on identification tasks.[16][17] Transformers achieve strong results when transferring between domains and paraphrasing techniques compared to more traditional machine learning methods such as logistic regression. Other successful methods based on the Transformer architecture include adversarial learning and meta-learning.[18][19]
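
A sketch of this setup with the Hugging Face transformers library is shown below; the classification head is untrained here, so the printed probabilities are meaningless until the model is fine-tuned on a labelled paraphrase corpus.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A BERT-style encoder with a freshly initialised 2-way classification head; fine-tuning
# on a labelled paraphrase corpus (e.g. Quora Question Pairs) is assumed but not shown.
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

enc = tokenizer("He fixed the car.", "He repaired the car.",
                return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**enc).logits
print(logits.softmax(dim=-1))  # [P(not paraphrase), P(paraphrase)] once fine-tuned
```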

Evaluation

Multiple methods can be used to evaluate paraphrases. Since paraphrase recognition can be posed as a classification problem, most standard evaluation metrics such as accuracy, F1 score, or the area under an ROC curve apply relatively well. However, F1 scores are difficult to calculate because of the trouble of producing a complete list of paraphrases for a given phrase and because good paraphrases are dependent upon context. A metric designed to counter these problems is ParaMetric.[20] ParaMetric aims to calculate the precision and recall of an automatic paraphrase system by comparing the automatic alignment of paraphrases to a manual alignment of similar phrases. Since ParaMetric simply rates the quality of phrase alignment, it can also be used to rate paraphrase generation systems, provided they use phrase alignment as part of their generation process. A notable drawback of ParaMetric is the large, exhaustive set of manual alignments that must be created before a rating can be produced.

The evaluation of paraphrase generation has difficulties similar to those of the evaluation of machine translation. The quality of a paraphrase depends on its context, whether it is being used as a summary, and how it is generated, among other factors. Additionally, a good paraphrase is usually lexically dissimilar from its source phrase. The simplest way to evaluate paraphrase generation is through human judges, but this tends to be time-consuming. Automated evaluation is challenging because it is essentially a problem as difficult as paraphrase recognition. While originally used to evaluate machine translation, the bilingual evaluation understudy (BLEU) metric has also been used successfully to evaluate paraphrase generation models. However, paraphrases often have several lexically different but equally valid solutions, which hurts BLEU and similar evaluation metrics.[21]

Metrics specifically designed to evaluate paraphrase generation include paraphrase in n-gram change (PINC)[21] and the paraphrase evaluation metric (PEM),[22] along with the aforementioned ParaMetric. PINC is designed to be used together with BLEU and to cover its inadequacies. Since BLEU has difficulty measuring lexical dissimilarity, PINC measures the lack of n-gram overlap between a source sentence and a candidate paraphrase. It is essentially the Jaccard distance between the two sentences, excluding n-grams that appear in the source sentence in order to maintain some semantic equivalence. PEM, on the other hand, attempts to evaluate the "adequacy, fluency, and lexical dissimilarity" of paraphrases by returning a single-value heuristic calculated using n-gram overlap in a pivot language. A large drawback of PEM, however, is that it must be trained using large, in-domain parallel corpora as well as human judges;[21] in effect, it amounts to training a paraphrase recognition system in order to evaluate a paraphrase generation system.
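
Following this description, a minimal PINC implementation averages, over n-gram orders up to some maximum (4 is assumed here), the fraction of candidate n-grams that do not appear in the source sentence:

```python
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source, candidate, max_n=4):
    """Average, over n-gram orders 1..max_n, of the fraction of candidate n-grams
    that do not appear in the source sentence (higher = more lexically dissimilar)."""
    src, cand = source.lower().split(), candidate.lower().split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            continue
        shared = len(cand_ngrams & ngrams(src, n))
        scores.append(1 - shared / len(cand_ngrams))
    return sum(scores) / len(scores) if scores else 0.0

print(pinc("the situation is under control", "the situation is in check"))  # ~0.64
```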

The Quora Question Pairs dataset, which contains hundreds of thousands of duplicate questions, has become a common benchmark for the evaluation of paraphrase detectors.[23] The most consistently reliable detectors have all used the Transformer architecture and have relied on large amounts of pre-training on more general data before fine-tuning on the question pairs.

See also

  • Round-trip translation
  • Text simplification
  • Text normalization

References

  1. ^ a b Socher, Richard; Huang, Eric; Pennington, Jeffrey; Ng, Andrew; Manning, Christopher (2011), "Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection", Advances in Neural Information Processing Systems 24, archived from the original on 2018-01-06, retrieved 2017-12-29
  2. ^ Callison-Burch, Chris (October 25–27, 2008). Syntactic Constraints on Paraphrases Extracted from Parallel Corpora. EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing. Honolulu, Hawaii. pp. 196–205.
  3. ^ Berant, Jonathan, and Percy Liang. "Semantic parsing via paraphrasing." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2014.
  4. ^ Wahle, Jan Philip; Ruas, Terry; Kirstein, Frederic; Gipp, Bela (2022). "How Large Language Models are Transforming Machine-Paraphrase Plagiarism". Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Online and Abu Dhabi, United Arab Emirates. pp. 952–963. arXiv:2210.03568. doi:10.18653/v1/2022.emnlp-main.62.
  5. ^ a b Barzilay, Regina; Lee, Lillian (May–June 2003). Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment. Proceedings of HLT-NAACL 2003.
  6. ^ Bannard, Colin; Callison-Burch, Chris (2005). Paraphrasing Bilingual Parallel Corpora. Proceedings of the 43rd Annual Meeting of the ACL. Ann Arbor, Michigan. pp. 597–604.
  7. ^ Prakash, Aaditya; Hasan, Sadid A.; Lee, Kathy; Datla, Vivek; Qadir, Ashequl; Liu, Joey; Farri, Oladimeji (2016), Neural Paraphrase Generation with Stacked Residual LSTM Networks, arXiv:1610.03098, Bibcode:2016arXiv161003098P
  8. ^ Zhou, Jianing; Bhat, Suma (2021). "Paraphrase Generation: A Survey of the State of the Art". Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics. pp. 5075–5086. doi:10.18653/v1/2021.emnlp-main.414. S2CID 243865349.
  9. ^ Dou, Yao; Forbes, Maxwell; Koncel-Kedziorski, Rik; Smith, Noah; Choi, Yejin (2022). "Is GPT-3 Text Indistinguishable from Human Text? Scarecrow: A Framework for Scrutinizing Machine Text". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics: 7250–7274. arXiv:2107.01294. doi:10.18653/v1/2022.acl-long.501. S2CID 247315430.
  10. ^ Liu, Xianggen; Mou, Lili; Meng, Fandong; Zhou, Hao; Zhou, Jie; Song, Sen (2020). "Unsupervised Paraphrasing by Simulated Annealing". Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics: 302–312. arXiv:1909.03588. doi:10.18653/v1/2020.acl-main.28. S2CID 202537332.
  11. ^ Wahle, Jan Philip; Ruas, Terry; Meuschke, Norman; Gipp, Bela (2021). "Are Neural Language Models Good Plagiarists? A Benchmark for Neural Paraphrase Detection". 2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL). Champaign, IL, USA: IEEE. pp. 226–229. arXiv:2103.12450. doi:10.1109/JCDL52503.2021.00065. ISBN 978-1-6654-1770-9. S2CID 232320374.
  12. ^ Bandel, Elron; Aharonov, Ranit; Shmueli-Scheuer, Michal; Shnayderman, Ilya; Slonim, Noam; Ein-Dor, Liat (2022). "Quality Controlled Paraphrase Generation". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics: 596–609. arXiv:2203.10940. doi:10.18653/v1/2022.acl-long.45.
  13. ^ Lee, John Sie Yuen; Lim, Ho Hung; Carol Webster, Carol (2022). "Unsupervised Paraphrasability Prediction for Compound Nominalizations". Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Seattle, United States: Association for Computational Linguistics. pp. 3254–3263. doi:10.18653/v1/2022.naacl-main.237. S2CID 250390695.
  14. ^ Niu, Tong; Yavuz, Semih; Zhou, Yingbo; Keskar, Nitish Shirish; Wang, Huan; Xiong, Caiming (2021). "Unsupervised Paraphrasing with Pretrained Language Models". Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics. pp. 5136–5150. doi:10.18653/v1/2021.emnlp-main.417. S2CID 237497412.
  15. ^ Kiros, Ryan; Zhu, Yukun; Salakhutdinov, Ruslan; Zemel, Richard; Torralba, Antonio; Urtasun, Raquel; Fidler, Sanja (2015), Skip-Thought Vectors, arXiv:1506.06726, Bibcode:2015arXiv150606726K
  16. ^ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics: 4171–4186. doi:10.18653/v1/N19-1423. S2CID 52967399.
  17. ^ Wahle, Jan Philip; Ruas, Terry; Foltýnek, Tomáš; Meuschke, Norman; Gipp, Bela (2022), Smits, Malte (ed.), "Identifying Machine-Paraphrased Plagiarism", Information for a Better World: Shaping the Global Future, vol. 13192, Cham: Springer International Publishing, pp. 393–413, arXiv:2103.11909, doi:10.1007/978-3-030-96957-8_34, ISBN 978-3-030-96956-1, S2CID 232307572, retrieved 2022-10-06
  18. ^ Nighojkar, Animesh; Licato, John (2021). "Improving Paraphrase Detection with the Adversarial Paraphrasing Task". Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics. pp. 7106–7116. doi:10.18653/v1/2021.acl-long.552. S2CID 235436269.
  19. ^ Dopierre, Thomas; Gravier, Christophe; Logerais, Wilfried (2021). "ProtAugment: Intent Detection Meta-Learning through Unsupervised Diverse Paraphrasing". Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics. pp. 2454–2466. doi:10.18653/v1/2021.acl-long.191. S2CID 236460333.
  20. ^ Callison-Burch, Chris; Cohn, Trevor; Lapata, Mirella (2008). ParaMetric: An Automatic Evaluation Metric for Paraphrasing. Proceedings of the 22nd International Conference on Computational Linguistics. Manchester. pp. 97–104. doi:10.3115/1599081.1599094. S2CID 837398.
  21. ^ a b c Chen, David; Dolan, William (2011). Collecting Highly Parallel Data for Paraphrase Evaluation. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon. pp. 190–200.
  22. ^ Liu, Chang; Dahlmeier, Daniel; Ng, Hwee Tou (2010). PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts. Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. MIT, Massachusetts. pp. 923–932.
  23. ^ "Paraphrase Identification on Quora Question Pairs". Papers with Code.

External links

  • Microsoft Research Paraphrase Corpus - a dataset consisting of 5,800 pairs of sentences extracted from news articles, annotated to indicate whether each pair captures semantic equivalence
  • Paraphrase Database (PPDB) - A searchable database containing millions of paraphrases in 16 different languages
