Wikipedia

Automatic summarization

Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data.

Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand, visual content can be summarized using computer vision algorithms. Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most important video segments (key-shots), normally in a temporally ordered fashion.[5][6][7][8] Video summaries simply retain a carefully selected subset of the original video frames and, therefore, are not identical to the output of video synopsis algorithms, where new video frames are being synthesized based on the original video content.

Commercial products

In 2022 Google Docs released an automatic summarization feature.[9]

Approaches

There are two general approaches to automatic summarization: extraction and abstraction.

Extraction-based summarization

Here, content is extracted from the original data, but the extracted content is not modified in any way. Examples of extracted content include key-phrases that can be used to "tag" or index a text document, or key sentences (including headings) that collectively comprise an abstract, and representative images or video segments, as stated above. For text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail.[10] Other examples of extraction include key sequences of text selected for their clinical relevance (including patient/problem, intervention, and outcome).[11]

Abstractive-based summarization

Abstractive summarization methods generate new text that did not exist in the original text.[12] This has been applied mainly for text. Abstractive methods build an internal semantic representation of the original content (often called a language model), and then use this representation to create a summary that is closer to what a human might express. Abstraction may transform the extracted content by paraphrasing sections of the source document, to condense a text more strongly than extraction. Such transformation, however, is computationally much more challenging than extraction, involving both natural language processing and often a deep understanding of the domain of the original text in cases where the original document relates to a special field of knowledge. "Paraphrasing" is even more difficult to apply to images and videos, which is why most summarization systems are extractive.

Aided summarization

Approaches aimed at higher summarization quality rely on combined software and human effort. In Machine Aided Human Summarization, extractive techniques highlight candidate passages for inclusion (to which the human adds or removes text). In Human Aided Machine Summarization, a human post-processes software output, in the same way that one edits the output of automatic translation by Google Translate.

Applications and systems for summarization

There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.

An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while in other cases multiple source documents are used (for example, a cluster of articles on the same topic). The latter problem is called multi-document summarization. A related application is summarizing news articles. Imagine a system that automatically pulls together news articles on a given topic (from the web) and concisely represents the latest news as a summary.

Image collection summarization is another application example of automatic summarization. It consists in selecting a representative set of images from a larger set of images.[13] A summary in this context is useful to show the most representative images of results in an image collection exploration system. Video summarization is a related domain, where the system automatically creates a trailer of a long video. This also has applications in consumer or personal videos, where one might want to skip the boring or repetitive actions. Similarly, in surveillance videos, one would want to extract important and suspicious activity, while ignoring all the boring and redundant frames captured.

At a very high level, summarization algorithms try to find subsets of objects (like a set of sentences, or a set of images) which cover the information of the entire set. Such a subset is also called the core-set. These algorithms model notions like diversity, coverage, information and representativeness of the summary. Query-based summarization techniques additionally model the relevance of the summary to the query. Some techniques and algorithms which naturally model summarization problems are TextRank and PageRank, submodular set functions, determinantal point processes, maximal marginal relevance (MMR), etc.

Keyphrase extraction

The task is the following. You are given a piece of text, such as a journal article, and you must produce a list of keywords or keyphrases that capture the primary topics discussed in the text.[14] In the case of research articles, many authors provide manually assigned keywords, but most text lacks pre-existing keyphrases. For example, news articles rarely have keyphrases attached, but it would be useful to be able to assign them automatically for a number of applications discussed below. Consider the example text from a news article:

"The Army Corps of Engineers, rushing to meet President Bush's promise to protect New Orleans by the start of the 2006 hurricane season, installed defective flood-control pumps last year despite warnings from its own expert that the equipment would fail during a storm, according to documents obtained by The Associated Press".

A keyphrase extractor might select "Army Corps of Engineers", "President Bush", "New Orleans", and "defective flood-control pumps" as keyphrases. These are pulled directly from the text. In contrast, an abstractive keyphrase system would somehow internalize the content and generate keyphrases that do not appear in the text, but more closely resemble what a human might produce, such as "political negligence" or "inadequate protection from floods". Abstraction requires a deep understanding of the text, which makes it difficult for a computer system. Keyphrases have many applications. They can enable document browsing by providing a short summary, improve information retrieval (if documents have keyphrases assigned, a user could search by keyphrase to produce more reliable hits than a full-text search), and be employed in generating index entries for a large text corpus.

Depending on the literature and on how key terms, words, or phrases are defined, keyword extraction is a closely related topic.

Supervised learning approaches

Beginning with the work of Turney,[15] many researchers have approached keyphrase extraction as a supervised machine learning problem. Given a document, we construct an example for each unigram, bigram, and trigram found in the text (though other text units are also possible, as discussed below). We then compute various features describing each example (e.g., does the phrase begin with an upper-case letter?). We assume there are known keyphrases available for a set of training documents. Using the known keyphrases, we can assign positive or negative labels to the examples. Then we learn a classifier that can discriminate between positive and negative examples as a function of the features. Some classifiers make a binary classification for a test example, while others assign a probability of being a keyphrase. For instance, in the above text, we might learn a rule that says phrases with initial capital letters are likely to be keyphrases.

After training a learner, we can select keyphrases for test documents in the following manner. We apply the same example-generation strategy to the test documents, then run each example through the learner. We can determine the keyphrases by looking at binary classification decisions or probabilities returned from our learned model. If probabilities are given, a threshold is used to select the keyphrases.

Keyphrase extractors are generally evaluated using precision and recall. Precision measures how many of the proposed keyphrases are actually correct. Recall measures how many of the true keyphrases the system proposed. The two measures can be combined in an F-score, which is the harmonic mean of the two (F = 2PR/(P + R)). Matches between the proposed keyphrases and the known keyphrases can be checked after stemming or applying some other text normalization.
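The evaluation measures can be illustrated with a minimal Python sketch. The function names and the example phrase lists are hypothetical, and lowercasing is used here only as a simple stand-in for stemming or other normalization.

```python
def normalize(phrase: str) -> str:
    """Crude normalization (assumption): lowercase and collapse whitespace."""
    return " ".join(phrase.lower().split())


def precision_recall_f(proposed, gold):
    """Return (precision, recall, F-score) of proposed vs. known keyphrases."""
    proposed_set = {normalize(p) for p in proposed}
    gold_set = {normalize(g) for g in gold}
    correct = len(proposed_set & gold_set)
    precision = correct / len(proposed_set) if proposed_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall: F = 2PR / (P + R)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical system output and hypothetical known keyphrases.
proposed = ["Army Corps of Engineers", "New Orleans", "hurricane season", "last year"]
gold = ["army corps of engineers", "President Bush", "New Orleans"]
p, r, f = precision_recall_f(proposed, gold)
print(p, r, f)
```

Here two of the four proposed phrases match, so precision is 2/4, recall is 2/3, and the F-score is 4/7.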

Designing a supervised keyphrase extraction system involves deciding on several choices (some of these apply to unsupervised, too). The first choice is exactly how to generate examples. Turney and others have used all possible unigrams, bigrams, and trigrams without intervening punctuation and after removing stopwords. Hulth showed that you can get some improvement by selecting examples to be sequences of tokens that match certain patterns of part-of-speech tags. Ideally, the mechanism for generating examples produces all the known labeled keyphrases as candidates, though this is often not the case. For example, if we use only unigrams, bigrams, and trigrams, then we will never be able to extract a known keyphrase containing four words. Thus, recall may suffer. However, generating too many examples can also lead to low precision.
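As a rough illustration of n-gram candidate generation, the following Python sketch builds unigram, bigram, and trigram candidates. It rests on assumptions not taken from the cited work: the stopword list is a tiny illustrative one, tokenization is plain whitespace splitting, and stopwords are treated as boundaries that a candidate may not span (one possible reading of "after removing stopwords").

```python
import re

# Tiny illustrative stopword list (assumption, not from any cited system).
STOPWORDS = {"the", "of", "to", "by", "a", "an", "and", "in", "that", "its", "from"}


def candidate_ngrams(text, max_n=3):
    """Return all 1..max_n-grams that cross neither punctuation nor stopwords."""
    candidates = set()
    # Split at punctuation so that no candidate contains intervening punctuation.
    for chunk in re.split(r"[^\w\s'-]+", text.lower()):
        run = []
        # The trailing None is a sentinel that flushes the last run of tokens.
        for token in chunk.split() + [None]:
            if token is None or token in STOPWORDS:
                for n in range(1, max_n + 1):
                    for i in range(len(run) - n + 1):
                        candidates.add(" ".join(run[i:i + n]))
                run = []
            else:
                run.append(token)
    return candidates


sentence = ("The Army Corps of Engineers installed defective "
            "flood-control pumps, despite warnings.")
candidates = candidate_ngrams(sentence)
print(sorted(candidates))
```

With these settings "defective flood-control pumps" is a candidate, while "army corps of engineers" can never be produced, which illustrates the recall limitation described above.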

We also need to create features that describe the examples and are informative enough to allow a learning algorithm to discriminate keyphrases from non-keyphrases. Typically, features involve various term frequencies (how many times a phrase appears in the current text or in a larger corpus), the length of the example, relative position of the first occurrence, various boolean syntactic features (e.g., contains all caps), etc. The Turney paper used about 12 such features. Hulth uses a reduced set of features, which were found most successful in the KEA (Keyphrase Extraction Algorithm) work derived from Turney's seminal paper.

In the end, the system will need to return a list of keyphrases for a test document, so we need to have a way to limit the number. Ensemble methods (i.e., using votes from several classifiers) have been used to produce numeric scores that can be thresholded to provide a user-provided number of keyphrases. This is the technique used by Turney with C4.5 decision trees. Hulth used a single binary classifier so the learning algorithm implicitly determines the appropriate number.

Once examples and features are created, we need a way to learn to predict keyphrases. Virtually any supervised learning algorithm could be used, such as decision trees, Naive Bayes, and rule induction. In the case of Turney's GenEx algorithm, a genetic algorithm is used to learn parameters for a domain-specific keyphrase extraction algorithm. The extractor follows a series of heuristics to identify keyphrases. The genetic algorithm optimizes parameters for these heuristics with respect to performance on training documents with known key phrases.

Unsupervised approach: TextRank

Another keyphrase extraction algorithm is TextRank. While supervised methods have some nice properties, like being able to produce interpretable rules for what features characterize a keyphrase, they also require a large amount of training data. Many documents with known keyphrases are needed. Furthermore, training on a specific domain tends to customize the extraction process to that domain, so the resulting classifier is not necessarily portable, as some of Turney's results demonstrate. Unsupervised keyphrase extraction removes the need for training data. It approaches the problem from a different angle. Instead of trying to learn explicit features that characterize keyphrases, the TextRank algorithm[16] exploits the structure of the text itself to determine keyphrases that appear "central" to the text in the same way that PageRank selects important Web pages. Recall this is based on the notion of "prestige" or "recommendation" from social networks. In this way, TextRank does not rely on any previous training data at all, but rather can be run on any arbitrary piece of text, and it can produce output simply based on the text's intrinsic properties. Thus the algorithm is easily portable to new domains and languages.

TextRank is a general purpose graph-based ranking algorithm for NLP. Essentially, it runs PageRank on a graph specially designed for a particular NLP task. For keyphrase extraction, it builds a graph using some set of text units as vertices. Edges are based on some measure of semantic or lexical similarity between the text unit vertices. Unlike PageRank, the edges are typically undirected and can be weighted to reflect a degree of similarity. Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).
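The ranking step can be sketched in Python as a power iteration over a weighted, undirected graph. This is a minimal illustration rather than the authors' implementation: the function name, the damping value of 0.85, the convergence tolerance, and the toy graph are assumptions, and every vertex is assumed to have at least one neighbor.

```python
def pagerank(weights, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration on a weighted undirected graph.

    weights maps each node to {neighbor: weight}; it must be symmetric
    and every node must have at least one neighbor (assumption).
    """
    nodes = list(weights)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    out_sum = {v: sum(weights[v].values()) for v in nodes}
    for _ in range(max_iter):
        new_rank = {}
        for v in nodes:
            # Each neighbor u passes on a share of its rank proportional
            # to the weight of the edge (u, v).
            incoming = sum(rank[u] * weights[u][v] / out_sum[u] for u in weights[v])
            new_rank[v] = (1 - damping) / n + damping * incoming
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy graph (hypothetical): "learning" is linked to three modifiers.
graph = {
    "learning": {"machine": 1.0, "supervised": 1.0, "deep": 1.0},
    "machine": {"learning": 1.0},
    "supervised": {"learning": 1.0},
    "deep": {"learning": 1.0},
}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # the hub "learning" gets the highest score
```

The returned scores approximate the stationary distribution of the damped random walk and sum to one.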

The vertices should correspond to what we want to rank. Potentially, we could do something similar to the supervised methods and create a vertex for each unigram, bigram, trigram, etc. However, to keep the graph small, the authors decide to rank individual unigrams in a first step, and then include a second step that merges highly ranked adjacent unigrams to form multi-word phrases. This has a nice side effect of allowing us to produce keyphrases of arbitrary length. For example, if we rank unigrams and find that "advanced", "natural", "language", and "processing" all get high ranks, then we would look at the original text and see that these words appear consecutively and create a final keyphrase using all four together. Note that the unigrams placed in the graph can be filtered by part of speech. The authors found that adjectives and nouns were the best to include. Thus, some linguistic knowledge comes into play in this step.

Edges are created based on word co-occurrence in this application of TextRank. Two vertices are connected by an edge if the unigrams appear within a window of size N in the original text. N is typically around 2–10. Thus, "natural" and "language" might be linked in a text about NLP. "Natural" and "processing" would also be linked because they would both appear in the same string of N words. These edges build on the notion of "text cohesion" and the idea that words that appear near each other are likely related in a meaningful way and "recommend" each other to the reader.
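A co-occurrence graph of this kind can be built in a few lines of Python. In this sketch the function name is hypothetical, the input is assumed to be an already tokenized (and, if desired, part-of-speech filtered) list of words, and two words are linked when they fall within the same window of N consecutive tokens.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Link words that occur within `window` consecutive tokens of each other."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        # Positions i+1 .. i+window-1 share a window of size `window` with i.
        for j in range(i + 1, min(i + window, len(tokens))):
            other = tokens[j]
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return dict(graph)


tokens = ["natural", "language", "processing", "systems"]
wide = cooccurrence_graph(tokens, window=3)
narrow = cooccurrence_graph(tokens, window=2)
print(wide["natural"], narrow["natural"])
```

With a window of 3, "natural" is linked to both "language" and "processing"; with a window of 2 only adjacent words are linked.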

Since this method simply ranks the individual vertices, we need a way to threshold or produce a limited number of keyphrases. The technique chosen is to set a count T to be a user-specified fraction of the total number of vertices in the graph. Then the top T vertices/unigrams are selected based on their stationary probabilities. A post-processing step is then applied to merge adjacent instances of these T unigrams. As a result, potentially more or fewer than T final keyphrases will be produced, but the number should be roughly proportional to the length of the original text.
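The selection and merging step might look as follows in Python. The scores below are made-up stand-ins for stationary probabilities, and the function name and the fraction parameter are hypothetical.

```python
def merge_top_unigrams(tokens, scores, fraction=1 / 3):
    """Keep the top-T unigrams (T = fraction of the vocabulary) and merge
    runs of adjacent top unigrams in the text into multi-word keyphrases."""
    vocab = sorted(scores, key=scores.get, reverse=True)
    t = max(1, round(len(vocab) * fraction))
    top = set(vocab[:t])
    phrases, current = [], []
    # The trailing None is a sentinel that flushes the final run.
    for tok in tokens + [None]:
        if tok in top:
            current.append(tok)
        else:
            if current:
                phrase = " ".join(current)
                if phrase not in phrases:
                    phrases.append(phrase)
            current = []
    return phrases


tokens = ["advanced", "natural", "language", "processing",
          "is", "fun", "and", "language", "matters"]
# Hypothetical scores standing in for stationary probabilities.
scores = {"advanced": 0.2, "natural": 0.2, "language": 0.25, "processing": 0.15,
          "is": 0.05, "fun": 0.05, "and": 0.03, "matters": 0.07}
phrases = merge_top_unigrams(tokens, scores, fraction=0.5)
print(phrases)
```

In this example T is 4, yet only two keyphrases result ("advanced natural language processing" and "language"), showing how merging changes the final count.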

It is not initially clear why applying PageRank to a co-occurrence graph would produce useful keyphrases. One way to think about it is the following. A word that appears multiple times throughout a text may have many different co-occurring neighbors. For example, in a text about machine learning, the unigram "learning" might co-occur with "machine", "supervised", "unsupervised", and "semi-supervised" in four different sentences. Thus, the "learning" vertex would be a central "hub" that connects to these other modifying words. Running PageRank/TextRank on the graph is likely to rank "learning" highly. Similarly, if the text contains the phrase "supervised classification", then there would be an edge between "supervised" and "classification". If "classification" appears several other places and thus has many neighbors, its importance would contribute to the importance of "supervised". If "supervised" ends up with a high rank, it will be selected as one of the top T unigrams, along with "learning" and probably "classification". In the final post-processing step, we would then end up with keyphrases "supervised learning" and "supervised classification".

In short, the co-occurrence graph will contain densely connected regions for terms that appear often and in different contexts. A random walk on this graph will have a stationary distribution that assigns large probabilities to the terms in the centers of the clusters. This is similar to densely connected Web pages getting ranked highly by PageRank. This approach has also been used in document summarization, considered below.

Document summarization

Like keyphrase extraction, document summarization aims to identify the essence of a text. The only real difference is that now we are dealing with larger text units—whole sentences instead of words and phrases.

Supervised learning approaches

Supervised text summarization is very much like supervised keyphrase extraction. Basically, if you have a collection of documents and human-generated summaries for them, you can learn features of sentences that make them good candidates for inclusion in the summary. Features might include the position in the document (i.e., the first few sentences are probably important), the number of words in the sentence, etc. The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training document can be labeled as "in summary" or "not in summary". This is not typically how people create summaries, so simply using journal abstracts or existing summaries is usually not sufficient. The sentences in these summaries do not necessarily match up with sentences in the original text, so it would be difficult to assign labels to examples for training. Note, however, that these natural summaries can still be used for evaluation purposes, since ROUGE-1 evaluation only considers unigrams.

Maximum entropy-based summarization

During the DUC 2001 and 2002 evaluation workshops, TNO developed a sentence extraction system for multi-document summarization in the news domain. The system was based on a hybrid system using a Naive Bayes classifier and statistical language models for modeling salience. Although the system exhibited good results, the researchers wanted to explore the effectiveness of a maximum entropy (ME) classifier for the meeting summarization task, as ME is known to be robust against feature dependencies. Maximum entropy has also been applied successfully for summarization in the broadcast news domain.

Adaptive summarization

A promising approach is adaptive document/text summarization.[17] It involves first recognizing the text genre and then applying summarization algorithms optimized for this genre. Such software has been created.[18]

TextRank and LexRank

The unsupervised approach to summarization is also quite similar in spirit to unsupervised keyphrase extraction and gets around the issue of costly training data. Some unsupervised summarization approaches are based on finding a "centroid" sentence, which is the mean word vector of all the sentences in the document. Then the sentences can be ranked with regard to their similarity to this centroid sentence.
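A centroid-based ranking of this kind can be sketched in Python as follows. The sketch assumes raw bag-of-words counts, whitespace tokenization, and cosine similarity; real systems typically use weighted vectors, and the function name is hypothetical.

```python
import math
from collections import Counter


def centroid_rank(sentences):
    """Rank sentences by cosine similarity to the mean bag-of-words vector."""
    vectors = [Counter(s.lower().split()) for s in sentences]
    centroid = Counter()
    for vec in vectors:
        for word, count in vec.items():
            centroid[word] += count / len(vectors)

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a if w in b)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    scored = [(cosine(vec, centroid), i) for i, vec in enumerate(vectors)]
    # Highest similarity first; ties keep the original sentence order.
    return [sentences[i] for _, i in sorted(scored, key=lambda x: (-x[0], x[1]))]


docs = [
    "the pumps failed during the storm",
    "engineers installed the pumps before the storm",
    "bananas are yellow",
]
ranked = centroid_rank(docs)
print(ranked)
```

The off-topic sentence shares no words with the others and therefore ends up last.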

A more principled way to estimate sentence importance is using random walks and eigenvector centrality. LexRank[19] is an algorithm essentially identical to TextRank, and both use this approach for document summarization. The two methods were developed by different groups at the same time, and LexRank simply focused on summarization, but could just as easily be used for keyphrase extraction or any other NLP ranking task.

In both LexRank and TextRank, a graph is constructed by creating a vertex for each sentence in the document.

The edges between sentences are based on some form of semantic similarity or content overlap. While LexRank uses cosine similarity of TF-IDF vectors, TextRank uses a very similar measure based on the number of words two sentences have in common (normalized by the sentences' lengths). The LexRank paper explored using unweighted edges after applying a threshold to the cosine values, but also experimented with using edges with weights equal to the similarity score. TextRank uses continuous similarity scores as weights.
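The two edge-weighting schemes can be compared with a short Python sketch. The overlap measure follows the description above (shared words normalized by the logarithms of the sentence lengths). The TF-IDF variant is only an approximation: it computes document frequencies over the sentences of the input itself and uses a smoothed IDF, both of which are assumptions made here for a self-contained example.

```python
import math
from collections import Counter


def overlap_similarity(s1, s2):
    """Shared words divided by the sum of the log sentence lengths."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    if not w1 or not w2:
        return 0.0
    denom = math.log(len(w1)) + math.log(len(w2))
    if denom == 0:  # both sentences consist of a single word
        return 0.0
    return len(set(w1) & set(w2)) / denom


def tfidf_cosine(s1, s2, corpus):
    """Cosine similarity of TF-IDF vectors; smoothed IDF is an assumption."""
    n = len(corpus)
    df = Counter(w for s in corpus for w in set(s.lower().split()))

    def vec(s):
        tf = Counter(s.lower().split())
        return {w: c * (math.log((1 + n) / (1 + df[w])) + 1) for w, c in tf.items()}

    a, b = vec(s1), vec(s2)
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


corpus = ["the pumps failed", "the pumps held", "officials were warned"]
print(overlap_similarity(corpus[0], corpus[1]))
print(tfidf_cosine(corpus[0], corpus[1], corpus))
```

Either function can supply the edge weights of the sentence graph; thresholding the cosine value gives the unweighted variant.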

In both algorithms, the sentences are ranked by applying PageRank to the resulting graph. A summary is formed by combining the top ranking sentences, using a threshold or length cutoff to limit the size of the summary.

It is worth noting that TextRank was applied to summarization exactly as described here, while LexRank was used as part of a larger summarization system (MEAD) that combines the LexRank score (stationary probability) with other features like sentence position and length using a linear combination with either user-specified or automatically tuned weights. In this case, some training documents might be needed, though the TextRank results show the additional features are not absolutely necessary.

Unlike TextRank, LexRank has been applied to multi-document summarization.

Multi-document summarization

Multi-document summarization is an automatic procedure aimed at extraction of information from multiple texts written about the same topic. The resulting summary report allows individual users, such as professional information consumers, to quickly familiarize themselves with information contained in a large cluster of documents. In such a way, multi-document summarization systems complement news aggregators, performing the next step down the road of coping with information overload. Multi-document summarization may also be done in response to a question.[20][11]

Multi-document summarization creates information reports that are both concise and comprehensive. With different opinions being put together and outlined, every topic is described from multiple perspectives within a single document. While the goal of a brief summary is to simplify information search and cut the time by pointing to the most relevant source documents, a comprehensive multi-document summary should itself contain the required information, hence limiting the need for accessing original files to cases when refinement is required. Automatic summaries present information extracted from multiple sources algorithmically, without any editorial touch or subjective human intervention, thus making it completely unbiased.[dubious]

Diversity

Multi-document extractive summarization faces a problem of redundancy. Ideally, we want to extract sentences that are both "central" (i.e., contain the main ideas) and "diverse" (i.e., they differ from one another). For example, in a set of news articles about some event, each article is likely to have many similar sentences. To address this issue, LexRank applies a heuristic post-processing step that adds sentences in rank order, but discards sentences that are too similar to ones already in the summary. This method is called Cross-Sentence Information Subsumption (CSIS). These methods work based on the idea that sentences "recommend" other similar sentences to the reader. Thus, if one sentence is very similar to many others, it will likely be a sentence of great importance. Its importance also stems from the importance of the sentences "recommending" it. Thus, to get ranked highly and placed in a summary, a sentence must be similar to many sentences that are in turn also similar to many other sentences. This makes intuitive sense and allows the algorithms to be applied to an arbitrary new text. The methods are domain-independent and easily portable. One could imagine the features indicating important sentences in the news domain might vary considerably from the biomedical domain. However, the unsupervised "recommendation"-based approach applies to any domain.
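The post-processing heuristic described above (add sentences in rank order, skipping near-duplicates) can be sketched in Python. Jaccard word overlap and the threshold of 0.5 are illustrative assumptions, not the similarity measure of any particular system, and the function name is hypothetical.

```python
def select_diverse(ranked_sentences, max_sentences=3, max_similarity=0.5):
    """Add sentences in rank order, skipping those too similar to chosen ones."""

    def jaccard(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    summary = []
    for sentence in ranked_sentences:
        if all(jaccard(sentence, chosen) <= max_similarity for chosen in summary):
            summary.append(sentence)
        if len(summary) == max_sentences:
            break
    return summary


# Sentences assumed to be already sorted by centrality score.
ranked = [
    "the pumps failed during the storm",
    "the pumps failed during a storm",
    "officials were warned last year",
]
summary = select_diverse(ranked, max_sentences=2)
print(summary)
```

The second sentence is dropped as a near-duplicate of the first, so the summary keeps the first and third sentences.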

A related method is Maximal Marginal Relevance (MMR),[21] which re-ranks candidates by trading off their relevance against their similarity to items already selected. Another approach uses a general-purpose graph-based ranking algorithm like Page/Lex/TextRank that handles both "centrality" and "diversity" in a unified mathematical framework based on absorbing Markov chain random walks (a random walk where certain states end the walk). The algorithm is called GRASSHOPPER.[22] In addition to explicitly promoting diversity during the ranking process, GRASSHOPPER incorporates a prior ranking (based on sentence position in the case of summarization).
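The MMR criterion can be illustrated with a greedy Python sketch: at each step the candidate maximizing λ·sim(candidate, query) − (1 − λ)·max sim(candidate, selected) is added. Jaccard word overlap as the similarity, the value λ = 0.7, and the example sentences are assumptions chosen for brevity.

```python
def jaccard(a, b):
    """Word-overlap similarity (an illustrative stand-in for a real measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def mmr_select(candidates, query, k=2, lam=0.7):
    """Greedy Maximal Marginal Relevance selection of k candidates."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def score(c):
            # Penalize similarity to anything already in the summary.
            redundancy = max((jaccard(c, s) for s in selected), default=0.0)
            return lam * jaccard(c, query) - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = "flood pumps failed"
candidates = [
    "the flood pumps failed in the storm",
    "the flood pumps failed in a storm",
    "engineers ignored warnings about pumps",
]
picked = mmr_select(candidates, query, k=2, lam=0.7)
print(picked)
```

With λ = 0.7 the near-duplicate second sentence is passed over in favor of the third; with λ = 1.0 the criterion reduces to pure relevance ranking and the duplicate is selected.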

State-of-the-art results for multi-document summarization are obtained using mixtures of submodular functions. These methods have achieved state-of-the-art results on the document summarization corpora DUC-04 to DUC-07.[23] Similar results were achieved with the use of determinantal point processes (which are a special case of submodular functions) for DUC-04.[24]

A new method for multi-lingual multi-document summarization that avoids redundancy generates ideograms to represent the meaning of each sentence in each document, then evaluates similarity by comparing ideogram shape and position. It does not use word frequency, training or preprocessing. It uses two user-supplied parameters: equivalence (when are two sentences to be considered equivalent?) and relevance (how long is the desired summary?).

Submodular functions as generic tools for summarization

The idea of a submodular set function has recently emerged as a powerful modeling tool for various summarization problems. Submodular functions naturally model notions of coverage, information, representation and diversity. Moreover, several important combinatorial optimization problems occur as special instances of submodular optimization. For example, the set cover problem is a special case of submodular optimization, since the set cover function is submodular. The set cover function attempts to find a subset of objects which cover a given set of concepts. For example, in document summarization, one would like the summary to cover all important and relevant concepts in the document. This is an instance of set cover. Similarly, the facility location problem is a special case of submodular optimization; the facility location function also naturally models coverage and diversity. Another example of a submodular optimization problem is using a determinantal point process to model diversity. Similarly, the Maximum-Marginal-Relevance procedure can also be seen as an instance of submodular optimization. All these important models encouraging coverage, diversity and information are submodular. Moreover, submodular functions can be efficiently combined, and the resulting function is still submodular. Hence, one could combine one submodular function which models diversity with another which models coverage, and use human supervision to learn a suitable submodular model for the problem.

While submodular functions are fitting models for summarization, they also admit very efficient algorithms for optimization. For example, a simple greedy algorithm admits a constant factor guarantee.[25] Moreover, the greedy algorithm is extremely simple to implement and can scale to large datasets, which is very important for summarization problems.
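The greedy algorithm can be sketched in Python for a simple coverage objective. Treating each distinct word as a "concept" and using a fixed sentence budget are assumptions made for illustration. For monotone submodular objectives under such a cardinality constraint, the classical guarantee for greedy selection is a factor of 1 − 1/e of the optimum.

```python
def greedy_coverage(sentences, budget):
    """Greedily pick up to `budget` sentences that cover the most new words."""
    word_sets = [set(s.lower().split()) for s in sentences]
    covered, chosen = set(), []
    for _ in range(min(budget, len(sentences))):
        # Marginal gain = number of not-yet-covered words a sentence adds.
        # Ties are broken in favor of the earlier sentence via -i.
        gains = [(len(ws - covered), -i)
                 for i, ws in enumerate(word_sets) if i not in chosen]
        gain, neg_i = max(gains)
        if gain == 0:
            break  # nothing new can be covered
        chosen.append(-neg_i)
        covered |= word_sets[-neg_i]
    return [sentences[i] for i in chosen]


items = ["a b c", "c d", "a b", "e f g h"]
picked = greedy_coverage(items, budget=2)
print(picked)
```

The marginal gain of a sentence can only shrink as the covered set grows, which is the diminishing-returns property that makes the coverage function submodular.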

Submodular functions have achieved state-of-the-art results for almost all summarization problems. For example, work by Lin and Bilmes, 2012[26] shows that submodular functions achieve the best results to date on DUC-04, DUC-05, DUC-06 and DUC-07 systems for document summarization. Similarly, work by Lin and Bilmes, 2011,[27] shows that many existing systems for automatic summarization are instances of submodular functions. This was a breakthrough result establishing submodular functions as the right models for summarization problems.[citation needed]

Submodular functions have also been used for other summarization tasks. Tschiatschek et al., 2014[28] show that mixtures of submodular functions achieve state-of-the-art results for image collection summarization. Similarly, Bairi et al., 2015[29] show the utility of submodular functions for summarizing multi-document topic hierarchies. Submodular functions have also successfully been used for summarizing machine learning datasets.[30]

Applications

Specific applications of automatic summarization include:

  • The Reddit bot "autotldr",[31] created in 2011, summarizes news articles in the comment section of Reddit posts. It was found to be very useful by the Reddit community, which upvoted its summaries hundreds of thousands of times.[32] The name is a reference to TL;DR, Internet slang for "too long; didn't read".[33][34]
  • Adversarial stylometry may make use of summaries, if the detail lost is not major and the summary is sufficiently stylistically different to the input.[35]

Evaluation

The most common way to evaluate the informativeness of automatic summaries is to compare them with human-made model summaries.

Evaluation can be intrinsic or extrinsic,[36] and inter-textual or intra-textual.[37]

Intrinsic versus extrinsic

Intrinsic evaluation assesses the summaries directly, while extrinsic evaluation evaluates how the summarization system affects the completion of some other task. Intrinsic evaluations have assessed mainly the coherence and informativeness of summaries. Extrinsic evaluations, on the other hand, have tested the impact of summarization on tasks like relevance assessment, reading comprehension, etc.

Inter-textual versus intra-textual

Intra-textual evaluation assesses the output of a specific summarization system, while inter-textual evaluation focuses on contrastive analysis of outputs of several summarization systems.

Human judgement often varies greatly in what it considers a "good" summary, so creating an automatic evaluation process is particularly difficult. Manual evaluation can be used, but this is both time- and labor-intensive, as it requires humans to read not only the summaries but also the source documents. Other issues are those concerning coherence and coverage.

The most common way to evaluate summaries is ROUGE (Recall-Oriented Understudy for Gisting Evaluation). It is very common for summarization and translation systems in NIST's Document Understanding Conferences. ROUGE is a recall-based measure of how well a summary covers the content of human-generated summaries known as references. It calculates n-gram overlaps between automatically generated summaries and previously written human summaries. It is recall-based to encourage inclusion of all important topics in summaries. Recall can be computed with respect to unigram, bigram, trigram, or 4-gram matching. For example, ROUGE-1 is the fraction of unigrams that appear in both the reference summary and the automatic summary out of all unigrams in the reference summary. If there are multiple reference summaries, their scores are averaged. A high level of overlap should indicate a high degree of shared concepts between the two summaries.
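A minimal Python sketch of ROUGE-N recall as described above follows. It assumes lowercased whitespace tokenization and clipped n-gram counts, and averages over references; it is not a substitute for the official ROUGE toolkit, and the function name is hypothetical.

```python
from collections import Counter


def rouge_n_recall(candidate, references, n=1):
    """Average n-gram recall of a candidate summary against reference summaries."""

    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate)
    scores = []
    for ref in references:
        ref_counts = ngrams(ref)
        total = sum(ref_counts.values())
        # An n-gram is credited at most as often as it occurs in the candidate.
        overlap = sum(min(count, cand[gram]) for gram, count in ref_counts.items())
        scores.append(overlap / total if total else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


candidate = "the pumps failed during the storm"
references = ["the defective pumps failed"]
r1 = rouge_n_recall(candidate, references, n=1)
r2 = rouge_n_recall(candidate, references, n=2)
print(r1, r2)
```

Three of the four reference unigrams appear in the candidate, giving a ROUGE-1 recall of 0.75; only one of the three reference bigrams matches, giving a ROUGE-2 recall of 1/3.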

ROUGE cannot determine whether the result is coherent, that is, whether the sentences flow together sensibly. Higher-order n-gram ROUGE measures help to some degree.
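A small experiment can make this limitation concrete. The sketch below compares a fluent summary with a scrambled version of the same words against one reference. The helper `ngram_recall` and the sentences are hypothetical, and the recall is a simplified clipped n-gram recall, not the official ROUGE implementation.

```python
from collections import Counter


def ngram_recall(candidate_tokens, reference_tokens, n):
    """Clipped n-gram recall of the candidate against one reference."""
    def grams(tokens):
        # zip over n shifted copies of the token list yields its n-grams
        return Counter(zip(*(tokens[i:] for i in range(n))))

    cand, ref = grams(candidate_tokens), grams(reference_tokens)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    return sum(min(count, cand[gram]) for gram, count in ref.items()) / total


# Hypothetical example: same words, different order.
reference = "the storm flooded the city overnight".split()
fluent = "the storm flooded the city overnight".split()
scrambled = "city the overnight flooded storm the".split()

for name, summary in [("fluent", fluent), ("scrambled", scrambled)]:
    r1 = ngram_recall(summary, reference, 1)
    r2 = ngram_recall(summary, reference, 2)
    print(f"{name}: ROUGE-1 recall={r1:.2f}, ROUGE-2 recall={r2:.2f}")
```

Both summaries contain exactly the reference's unigrams, so unigram recall cannot tell them apart, while the scrambled version shares no bigram with the reference. Bigram overlap only captures local word order, so it helps with coherence only to a limited degree, as the text notes.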

Another unsolved problem is anaphor resolution. For image summarization, Tschiatschek et al. developed an analogous Visual-ROUGE score, which judges the performance of image summarization algorithms.[38]

Domain-specific versus domain-independent summarization

Domain-independent summarization techniques apply sets of general features to identify information-rich text segments. Recent research focuses on domain-specific summarization using knowledge specific to the text's domain, such as medical knowledge and ontologies for summarizing medical texts.[39]

Qualitative

The main drawback of the evaluation systems so far is that a reference summary (for some methods, more than one) is needed to compare automatic summaries with models. This is a hard and expensive task. Much effort has to be made to create corpora of texts and their corresponding summaries. Furthermore, some methods require manual annotation of the summaries (e.g. Summary Content Units, SCUs, in the Pyramid Method). Moreover, they all perform a quantitative evaluation with regard to different similarity metrics.

History

The first publication in the area, by Hans Peter Luhn, dates back to 1957[40] and started with a statistical technique. Research increased significantly in 2015. Term frequency–inverse document frequency had been used by 2016. Pattern-based summarization was the most powerful option for multi-document summarization found by 2016. In the following year it was surpassed by latent semantic analysis (LSA) combined with non-negative matrix factorization (NMF). Although they did not replace other approaches and are often combined with them, by 2019 machine learning methods dominated the extractive summarization of single documents, which was considered to be nearing maturity. By 2020, the field was still very active, and research was shifting towards abstractive summarization and real-time summarization.[41]

Recent approaches

Recently, the rise of transformer models, which have replaced more traditional RNNs (such as LSTMs), has provided flexibility in mapping text sequences to text sequences of a different type, which is well suited to automatic summarization. This includes models such as T5[42] and Pegasus.[43]

References

  1. ^ Torres-Moreno, Juan-Manuel (1 October 2014). Automatic Text Summarization. Wiley. pp. 320–. ISBN 978-1-848-21668-6.
  2. ^ Pan, Xingjia; Tang, Fan; Dong, Weiming; Ma, Chongyang; Meng, Yiping; Huang, Feiyue; Lee, Tong-Yee; Xu, Changsheng (2021-04-01). "Content-Based Visual Summarization for Image Collection". IEEE Transactions on Visualization and Computer Graphics. 27 (4): 2298–2312. doi:10.1109/tvcg.2019.2948611. ISSN 1077-2626. PMID 31647438. S2CID 204865221.
  3. ^ "WIPO PUBLISHES PATENT OF KT FOR "IMAGE SUMMARIZATION SYSTEM AND METHOD" (SOUTH KOREAN INVENTORS)". US Fed News Service. January 10, 2018. ProQuest 1986931333. Retrieved January 22, 2021.
  4. ^ Li Tan; Yangqiu Song; Shixia Liu; Lexing Xie (February 2012). "ImageHive: Interactive Content-Aware Image Summarization". IEEE Computer Graphics and Applications. 32 (1): 46–55. doi:10.1109/mcg.2011.89. ISSN 0272-1716. PMID 24808292. S2CID 7668289.
  5. ^ Sankar K. Pal; Alfredo Petrosino; Lucia Maddalena (25 January 2012). Handbook on Soft Computing for Video Surveillance. CRC Press. pp. 81–. ISBN 978-1-4398-5685-7.
  6. ^ Elhamifar, Ehsan; Sapiro, Guillermo; Vidal, Rene (2012). "See all by looking at a few: Sparse modeling for finding representative objects". 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE. pp. 1600–1607. doi:10.1109/CVPR.2012.6247852. ISBN 978-1-4673-1228-8. S2CID 5909301. Retrieved 4 December 2022.
  7. ^ Mademlis, Ioannis; Tefas, Anastasios; Nikolaidis, Nikos; Pitas, Ioannis (2016). "Multimodal stereoscopic movie summarization conforming to narrative characteristics" (PDF). IEEE Transactions on Image Processing. IEEE. 25 (12): 5828–5840. Bibcode:2016ITIP...25.5828M. doi:10.1109/TIP.2016.2615289. hdl:1983/2bcdd7a5-825f-4ac9-90ec-f2f538bfcb72. PMID 28113502. S2CID 18566122. Retrieved 4 December 2022.
  8. ^ Mademlis, Ioannis; Tefas, Anastasios; Pitas, Ioannis (2018). "A salient dictionary learning framework for activity video summarization via key-frame extraction". Information Sciences. Elsevier. 432: 319–331. doi:10.1016/j.ins.2017.12.020. Retrieved 4 December 2022.
  9. ^ "Auto-generated Summaries in Google Docs". Google AI Blog. 23 March 2022. Retrieved 2022-04-03.
  10. ^ Richard Sutz, Peter Weverka. How to skim text. https://www.dummies.com/education/language-arts/speed-reading/how-to-skim-text/ Accessed Dec 2019.
  11. ^ a b Afzal M, Alam F, Malik KM, Malik GM, Clinical Context-Aware Biomedical Text Summarization Using Deep Neural Network: Model Development and Validation, J Med Internet Res 2020;22(10):e19810, DOI: 10.2196/19810, PMID 33095174
  12. ^ Zhai, ChengXiang (2016). Text data management and analysis : a practical introduction to information retrieval and text mining. Sean Massung. [New York, NY]. p. 321. ISBN 978-1-970001-19-8. OCLC 957355971.
  13. ^ Jorge E. Camargo and Fabio A. González. A Multi-class Kernel Alignment Method for Image Collection Summarization. In Proceedings of the 14th Iberoamerican Conference on Pattern Recognition: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (CIARP '09), Eduardo Bayro-Corrochano and Jan-Olof Eklundh (Eds.). Springer-Verlag, Berlin, Heidelberg, 545-552. doi:10.1007/978-3-642-10268-4_64
  14. ^ Alrehamy, Hassan H; Walker, Coral (2018). "SemCluster: Unsupervised Automatic Keyphrase Extraction Using Affinity Propagation". Advances in Computational Intelligence Systems. Advances in Intelligent Systems and Computing. Vol. 650. pp. 222–235. doi:10.1007/978-3-319-66939-7_19. ISBN 978-3-319-66938-0.
  15. ^ Turney, Peter D (2002). "Learning Algorithms for Keyphrase Extraction". Information Retrieval. 2 (4): 303–336. arXiv:cs/0212020. Bibcode:2002cs.......12020T. doi:10.1023/A:1009976227802. S2CID 7007323.
  16. ^ Rada Mihalcea and Paul Tarau, 2004: TextRank: Bringing Order into Texts, Department of Computer Science University of North Texas (PDF). Archived from the original on 2012-06-17. Retrieved 2012-07-20.
  17. ^ Yatsko, V. A.; Starikov, M. S.; Butakov, A. V. (2010). "Automatic genre recognition and adaptive text summarization". Automatic Documentation and Mathematical Linguistics. 44 (3): 111–120. doi:10.3103/S0005105510030027. S2CID 1586931.
  18. ^ UNIS (Universal Summarizer)
  19. ^ Güneş Erkan and Dragomir R. Radev: LexRank: Graph-based Lexical Centrality as Salience in Text Summarization
  20. ^ "Versatile question answering systems: seeing in synthesis", International Journal of Intelligent Information Database Systems, 5(2), 119-142, 2011.
  21. ^ Carbonell, Jaime, and Jade Goldstein. "The use of MMR, diversity-based reranking for reordering documents and producing summaries." Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 1998.
  22. ^ Zhu, Xiaojin, et al. "Improving Diversity in Ranking using Absorbing Random Walks." HLT-NAACL. 2007.
  23. ^ Hui Lin, Jeff Bilmes. "Learning mixtures of submodular shells with application to document summarization".
  24. ^ Alex Kulesza and Ben Taskar, Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, December 2012.
  25. ^ Nemhauser, George L., Laurence A. Wolsey, and Marshall L. Fisher. "An analysis of approximations for maximizing submodular set functions—I." Mathematical Programming 14.1 (1978): 265-294.
  26. ^ Hui Lin, Jeff Bilmes. "Learning mixtures of submodular shells with application to document summarization", UAI, 2012
  27. ^ Hui Lin, Jeff Bilmes. "A Class of Submodular Functions for Document Summarization", The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), 2011
  28. ^ Sebastian Tschiatschek, Rishabh Iyer, Hoachen Wei and Jeff Bilmes, Learning Mixtures of Submodular Functions for Image Collection Summarization, In Advances of Neural Information Processing Systems (NIPS), Montreal, Canada, December - 2014.
  29. ^ Ramakrishna Bairi, Rishabh Iyer, Ganesh Ramakrishnan and Jeff Bilmes, Summarizing Multi-Document Topic Hierarchies using Submodular Mixtures, To Appear In the Annual Meeting of the Association for Computational Linguistics (ACL), Beijing, China, July - 2015
  30. ^ Kai Wei, Rishabh Iyer, and Jeff Bilmes, Submodularity in Data Subset Selection and Active Learning. Archived 2017-03-13 at the Wayback Machine. To Appear In Proc. International Conference on Machine Learning (ICML), Lille, France, June - 2015
  31. ^ "overview for autotldr". reddit. Retrieved 9 February 2017.
  32. ^ Squire, Megan (2016-08-29). Mastering Data Mining with Python – Find patterns hidden in your data. Packt Publishing Ltd. ISBN 9781785885914. Retrieved 9 February 2017.
  33. ^ "What Is 'TLDR'?". Lifewire. Retrieved 9 February 2017.
  34. ^ "What Does TL;DR Mean? AMA? TIL? Glossary Of Reddit Terms And Abbreviations". International Business Times. 29 March 2012. Retrieved 9 February 2017.
  35. ^ Potthast, Hagen & Stein 2016, p. 11-12.
  36. ^ Mani, I. Summarization evaluation: an overview
  37. ^ Yatsko, V. A.; Vishnyakov, T. N. (2007). "A method for evaluating modern systems of automatic text summarization". Automatic Documentation and Mathematical Linguistics. 41 (3): 93–103. doi:10.3103/S0005105507030041. S2CID 7853204.
  38. ^ Sebastian Tschiatschek, Rishabh Iyer, Hoachen Wei and Jeff Bilmes, Learning Mixtures of Submodular Functions for Image Collection Summarization, In Advances of Neural Information Processing Systems (NIPS), Montreal, Canada, December - 2014. (PDF)
  39. ^ Sarker, Abeed; Molla, Diego; Paris, Cecile (2013). "An Approach for Query-Focused Text Summarisation for Evidence Based Medicine". Artificial Intelligence in Medicine. Lecture Notes in Computer Science. Vol. 7885. pp. 295–304. doi:10.1007/978-3-642-38326-7_41. ISBN 978-3-642-38325-0.
  40. ^ Luhn, Hans Peter (1957). "A Statistical Approach to Mechanized Encoding and Searching of Literary Information" (PDF). IBM Journal of Research and Development. 1 (4): 309–317. doi:10.1147/rd.14.0309.
  41. ^ Widyassari, Adhika Pramita; Rustad, Supriadi; Shidik, Guruh Fajar; Noersasongko, Edi; Syukur, Abdul; Affandy, Affandy; Setiadi, De Rosal Ignatius Moses (2020-05-20). "Review of automatic text summarization techniques & methods". Journal of King Saud University - Computer and Information Sciences. 34 (4): 1029–1046. doi:10.1016/j.jksuci.2020.05.006. ISSN 1319-1578.
  42. ^ "Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer". Google AI Blog. 24 February 2020. Retrieved 2022-04-03.
  43. ^ Zhang, J., Zhao, Y., Saleh, M., & Liu, P. (2020, November). Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning (pp. 11328-11339). PMLR.

Works cited

  • Potthast, Martin; Hagen, Matthias; Stein, Benno (2016). Author Obfuscation: Attacking the State of the Art in Authorship Verification (PDF). Conference and Labs of the Evaluation Forum.

Further reading

  • Hercules, Dalianis (2003). Porting and evaluation of automatic summarization.
  • Roxana, Angheluta (2002). The Use of Topic Segmentation for Automatic Summarization.
  • Anne, Buist (2004). (PDF). Archived from the original (PDF) on 2021-01-23. Retrieved 2020-07-19.
  • Annie, Louis (2009). Performance Confidence Estimation for Automatic Summarization.
  • Elena, Lloret and Manuel, Palomar (2009). . Archived from the original on 2018-10-03. Retrieved 2018-10-03.
  • Andrew, Goldberg (2007). Automatic Summarization.
  • Alrehamy, Hassan (2018). "SemCluster: Unsupervised Automatic Keyphrase Extraction Using Affinity Propagation". Advances in Computational Intelligence Systems. Advances in Intelligent Systems and Computing. Vol. 650. pp. 222–235. doi:10.1007/978-3-319-66939-7_19. ISBN 978-3-319-66938-0.
  • Endres-Niggemeyer, Brigitte (1998). Summarizing Information. Springer. ISBN 978-3-540-63735-6.
  • Marcu, Daniel (2000). The Theory and Practice of Discourse Parsing and Summarization. MIT Press. ISBN 978-0-262-13372-2.
  • Mani, Inderjeet (2001). Automatic Summarization. ISBN 978-1-58811-060-2.
  • Huff, Jason (2010). AutoSummarize., Conceptual artwork using automatic summarization software in Microsoft Word 2008.
  • Lehmam, Abderrafih (2010). Essential summarizer: innovative automatic text summarization software in twenty languages - ACM Digital Library. Riao '10. pp. 216–217., Published in Proceeding RIAO'10 Adaptivity, Personalization and Fusion of Heterogeneous Information, CID Paris, France
  • Xiaojin, Zhu, Andrew Goldberg, Jurgen Van Gael, and David Andrzejewski (2007). Improving diversity in ranking using absorbing random walks (PDF). The GRASSHOPPER algorithm.
  • Miranda-Jiménez, Sabino, Gelbukh, Alexander, and Sidorov, Grigori (2013). "Summarizing Conceptual Graphs for Automatic Summarization Task". Conceptual Structures for STEM Research and Education. Lecture Notes in Computer Science. Vol. 7735. pp. 245–253. doi:10.1007/978-3-642-35786-2_18. ISBN 978-3-642-35785-5.

automatic, summarization, this, article, needs, additional, citations, verification, please, help, improve, this, article, adding, citations, reliable, sources, unsourced, material, challenged, removed, find, sources, news, newspapers, books, scholar, jstor, a. This article needs additional citations for verification Please help improve this article by adding citations to reliable sources Unsourced material may be challenged and removed Find sources Automatic summarization news newspapers books scholar JSTOR April 2022 Learn how and when to remove this template message Automatic summarization is the process of shortening a set of data computationally to create a subset a summary that represents the most important or relevant information within the original content Artificial intelligence algorithms are commonly developed and employed to achieve this specialized for different types of data Text summarization is usually implemented by natural language processing methods designed to locate the most informative sentences in a given document 1 On the other hand visual content can be summarized using computer vision algorithms Image summarization is the subject of ongoing research existing approaches typically attempt to display the most representative images from a given image collection or generate a video that only includes the most important content from the entire collection 2 3 4 Video summarization algorithms identify and extract from the original video content the most important frames key frames and or the most important video segments key shots normally in a temporally ordered fashion 5 6 7 8 Video summaries simply retain a carefully selected subset of the original video frames and therefore are not identical to the output of video synopsis algorithms where new video frames are being synthesized based on the original video content Contents 1 Commercial products 2 Approaches 2 1 Extraction based summarization 2 2 Abstractive based summarization 2 3 Aided 
summarization 3 Applications and systems for summarization 3 1 Keyphrase extraction 3 1 1 Supervised learning approaches 3 1 2 Unsupervised approach TextRank 3 2 Document summarization 3 2 1 Supervised learning approaches 3 2 2 Maximum entropy based summarization 3 2 3 Adaptive summarization 3 2 4 TextRank and LexRank 3 2 5 Multi document summarization 3 2 5 1 Diversity 3 3 Submodular functions as generic tools for summarization 3 4 Applications 4 Evaluation 4 1 Intrinsic versus extrinsic 4 2 Inter textual versus intra textual 4 3 Domain specific versus domain independent summarization 4 4 Qualitative 5 History 5 1 Recent approaches 6 See also 7 References 8 Works cited 9 Further readingCommercial products editIn 2022 Google Docs released an automatic summarization feature 9 Approaches editThere are two general approaches to automatic summarization extraction and abstraction Extraction based summarization edit Here content is extracted from the original data but the extracted content is not modified in any way Examples of extracted content include key phrases that can be used to tag or index a text document or key sentences including headings that collectively comprise an abstract and representative images or video segments as stated above For text extraction is analogous to the process of skimming where the summary if available headings and subheadings figures the first and last paragraphs of a section and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail 10 Other examples of extraction that include key sequences of text in terms of clinical relevance including patient problem intervention and outcome 11 Abstractive based summarization edit Abstractive summarization methods generate new text that did not exist in the original text 12 This has been applied mainly for text Abstractive methods build an internal semantic representation of the original content often called a language model and then 
use this representation to create a summary that is closer to what a human might express Abstraction may transform the extracted content by paraphrasing sections of the source document to condense a text more strongly than extraction Such transformation however is computationally much more challenging than extraction involving both natural language processing and often a deep understanding of the domain of the original text in cases where the original document relates to a special field of knowledge Paraphrasing is even more difficult to apply to images and videos which is why most summarization systems are extractive Aided summarization edit Approaches aimed at higher summarization quality rely on combined software and human effort In Machine Aided Human Summarization extractive techniques highlight candidate passages for inclusion to which the human adds or removes text In Human Aided Machine Summarization a human post processes software output in the same way that one edits the output of automatic translation by Google Translate Applications and systems for summarization editThere are broadly two types of extractive summarization tasks depending on what the summarization program focuses on The first is generic summarization which focuses on obtaining a generic summary or abstract of the collection whether documents or sets of images or videos news stories etc The second is query relevant summarization sometimes called query based summarization which summarizes objects specific to a query Summarization systems are able to create both query relevant text summaries and generic machine generated summaries depending on what the user needs An example of a summarization problem is document summarization which attempts to automatically produce an abstract from a given document Sometimes one might be interested in generating a summary from a single source document while others can use multiple source documents for example a cluster of articles on the same topic This 
problem is called multi document summarization A related application is summarizing news articles Imagine a system which automatically pulls together news articles on a given topic from the web and concisely represents the latest news as a summary Image collection summarization is another application example of automatic summarization It consists in selecting a representative set of images from a larger set of images 13 A summary in this context is useful to show the most representative images of results in an image collection exploration system Video summarization is a related domain where the system automatically creates a trailer of a long video This also has applications in consumer or personal videos where one might want to skip the boring or repetitive actions Similarly in surveillance videos one would want to extract important and suspicious activity while ignoring all the boring and redundant frames captured At a very high level summarization algorithms try to find subsets of objects like set of sentences or a set of images which cover information of the entire set This is also called the core set These algorithms model notions like diversity coverage information and representativeness of the summary Query based summarization techniques additionally model for relevance of the summary with the query Some techniques and algorithms which naturally model summarization problems are TextRank and PageRank Submodular set function Determinantal point process maximal marginal relevance MMR etc Keyphrase extraction edit The task is the following You are given a piece of text such as a journal article and you must produce a list of keywords or key phrase s that capture the primary topics discussed in the text 14 In the case of research articles many authors provide manually assigned keywords but most text lacks pre existing keyphrases For example news articles rarely have keyphrases attached but it would be useful to be able to automatically do so for a number of 
applications discussed below Consider the example text from a news article The Army Corps of Engineers rushing to meet President Bush s promise to protect New Orleans by the start of the 2006 hurricane season installed defective flood control pumps last year despite warnings from its own expert that the equipment would fail during a storm according to documents obtained by The Associated Press A keyphrase extractor might select Army Corps of Engineers President Bush New Orleans and defective flood control pumps as keyphrases These are pulled directly from the text In contrast an abstractive keyphrase system would somehow internalize the content and generate keyphrases that do not appear in the text but more closely resemble what a human might produce such as political negligence or inadequate protection from floods Abstraction requires a deep understanding of the text which makes it difficult for a computer system Keyphrases have many applications They can enable document browsing by providing a short summary improve information retrieval if documents have keyphrases assigned a user could search by keyphrase to produce more reliable hits than a full text search and be employed in generating index entries for a large text corpus Depending on the different literature and the definition of key terms words or phrases keyword extraction is a highly related theme Supervised learning approaches edit Beginning with the work of Turney 15 many researchers have approached keyphrase extraction as a supervised machine learning problem Given a document we construct an example for each unigram bigram and trigram found in the text though other text units are also possible as discussed below We then compute various features describing each example e g does the phrase begin with an upper case letter We assume there are known keyphrases available for a set of training documents Using the known keyphrases we can assign positive or negative labels to the examples Then we learn a 
classifier that can discriminate between positive and negative examples as a function of the features Some classifiers make a binary classification for a test example while others assign a probability of being a keyphrase For instance in the above text we might learn a rule that says phrases with initial capital letters are likely to be keyphrases After training a learner we can select keyphrases for test documents in the following manner We apply the same example generation strategy to the test documents then run each example through the learner We can determine the keyphrases by looking at binary classification decisions or probabilities returned from our learned model If probabilities are given a threshold is used to select the keyphrases Keyphrase extractors are generally evaluated using precision and recall Precision measures how many of the proposed keyphrases are actually correct Recall measures how many of the true keyphrases your system proposed The two measures can be combined in an F score which is the harmonic mean of the two F 2PR P R Matches between the proposed keyphrases and the known keyphrases can be checked after stemming or applying some other text normalization Designing a supervised keyphrase extraction system involves deciding on several choices some of these apply to unsupervised too The first choice is exactly how to generate examples Turney and others have used all possible unigrams bigrams and trigrams without intervening punctuation and after removing stopwords Hulth showed that you can get some improvement by selecting examples to be sequences of tokens that match certain patterns of part of speech tags Ideally the mechanism for generating examples produces all the known labeled keyphrases as candidates though this is often not the case For example if we use only unigrams bigrams and trigrams then we will never be able to extract a known keyphrase containing four words Thus recall may suffer However generating too many examples can also 
lead to low precision We also need to create features that describe the examples and are informative enough to allow a learning algorithm to discriminate keyphrases from non keyphrases Typically features involve various term frequencies how many times a phrase appears in the current text or in a larger corpus the length of the example relative position of the first occurrence various boolean syntactic features e g contains all caps etc The Turney paper used about 12 such features Hulth uses a reduced set of features which were found most successful in the KEA Keyphrase Extraction Algorithm work derived from Turney s seminal paper In the end the system will need to return a list of keyphrases for a test document so we need to have a way to limit the number Ensemble methods i e using votes from several classifiers have been used to produce numeric scores that can be thresholded to provide a user provided number of keyphrases This is the technique used by Turney with C4 5 decision trees Hulth used a single binary classifier so the learning algorithm implicitly determines the appropriate number Once examples and features are created we need a way to learn to predict keyphrases Virtually any supervised learning algorithm could be used such as decision trees Naive Bayes and rule induction In the case of Turney s GenEx algorithm a genetic algorithm is used to learn parameters for a domain specific keyphrase extraction algorithm The extractor follows a series of heuristics to identify keyphrases The genetic algorithm optimizes parameters for these heuristics with respect to performance on training documents with known key phrases Unsupervised approach TextRank edit Another keyphrase extraction algorithm is TextRank While supervised methods have some nice properties like being able to produce interpretable rules for what features characterize a keyphrase they also require a large amount of training data Many documents with known keyphrases are needed Furthermore training on 
a specific domain tends to customize the extraction process to that domain so the resulting classifier is not necessarily portable as some of Turney s results demonstrate Unsupervised keyphrase extraction removes the need for training data It approaches the problem from a different angle Instead of trying to learn explicit features that characterize keyphrases the TextRank algorithm 16 exploits the structure of the text itself to determine keyphrases that appear central to the text in the same way that PageRank selects important Web pages Recall this is based on the notion of prestige or recommendation from social networks In this way TextRank does not rely on any previous training data at all but rather can be run on any arbitrary piece of text and it can produce output simply based on the text s intrinsic properties Thus the algorithm is easily portable to new domains and languages TextRank is a general purpose graph based ranking algorithm for NLP Essentially it runs PageRank on a graph specially designed for a particular NLP task For keyphrase extraction it builds a graph using some set of text units as vertices Edges are based on some measure of semantic or lexical similarity between the text unit vertices Unlike PageRank the edges are typically undirected and can be weighted to reflect a degree of similarity Once the graph is constructed it is used to form a stochastic matrix combined with a damping factor as in the random surfer model and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 i e the stationary distribution of the random walk on the graph The vertices should correspond to what we want to rank Potentially we could do something similar to the supervised methods and create a vertex for each unigram bigram trigram etc However to keep the graph small the authors decide to rank individual unigrams in a first step and then include a second step that merges highly ranked adjacent unigrams to form multi word 
phrases This has a nice side effect of allowing us to produce keyphrases of arbitrary length For example if we rank unigrams and find that advanced natural language and processing all get high ranks then we would look at the original text and see that these words appear consecutively and create a final keyphrase using all four together Note that the unigrams placed in the graph can be filtered by part of speech The authors found that adjectives and nouns were the best to include Thus some linguistic knowledge comes into play in this step Edges are created based on word co occurrence in this application of TextRank Two vertices are connected by an edge if the unigrams appear within a window of size N in the original text N is typically around 2 10 Thus natural and language might be linked in a text about NLP Natural and processing would also be linked because they would both appear in the same string of N words These edges build on the notion of text cohesion and the idea that words that appear near each other are likely related in a meaningful way and recommend each other to the reader Since this method simply ranks the individual vertices we need a way to threshold or produce a limited number of keyphrases The technique chosen is to set a count T to be a user specified fraction of the total number of vertices in the graph Then the top T vertices unigrams are selected based on their stationary probabilities A post processing step is then applied to merge adjacent instances of these T unigrams As a result potentially more or less than T final keyphrases will be produced but the number should be roughly proportional to the length of the original text It is not initially clear why applying PageRank to a co occurrence graph would produce useful keyphrases One way to think about it is the following A word that appears multiple times throughout a text may have many different co occurring neighbors For example in a text about machine learning the unigram learning might co 
occur with machine supervised un supervised and semi supervised in four different sentences Thus the learning vertex would be a central hub that connects to these other modifying words Running PageRank TextRank on the graph is likely to rank learning highly Similarly if the text contains the phrase supervised classification then there would be an edge between supervised and classification If classification appears several other places and thus has many neighbors its importance would contribute to the importance of supervised If it ends up with a high rank it will be selected as one of the top T unigrams along with learning and probably classification In the final post processing step we would then end up with keyphrases supervised learning and supervised classification In short the co occurrence graph will contain densely connected regions for terms that appear often and in different contexts A random walk on this graph will have a stationary distribution that assigns large probabilities to the terms in the centers of the clusters This is similar to densely connected Web pages getting ranked highly by PageRank This approach has also been used in document summarization considered below Document summarization edit Like keyphrase extraction document summarization aims to identify the essence of a text The only real difference is that now we are dealing with larger text units whole sentences instead of words and phrases Supervised learning approaches edit Supervised text summarization is very much like supervised keyphrase extraction Basically if you have a collection of documents and human generated summaries for them you can learn features of sentences that make them good candidates for inclusion in the summary Features might include the position in the document i e the first few sentences are probably important the number of words in the sentence etc The main difficulty in supervised extractive summarization is that the known summaries must be manually created by 
extracting sentences so the sentences in an original training document can be labeled as "in summary" or "not in summary". This is not typically how people create summaries, so simply using journal abstracts or existing summaries is usually not sufficient. The sentences in these summaries do not necessarily match up with sentences in the original text, so it would be difficult to assign labels to examples for training. Note, however, that these natural summaries can still be used for evaluation purposes, since ROUGE-1 evaluation only considers unigrams.

Maximum entropy-based summarization

During the DUC 2001 and 2002 evaluation workshops, TNO developed a sentence extraction system for multi-document summarization in the news domain. The system was a hybrid that used a Naive Bayes classifier and statistical language models for modeling salience. Although the system exhibited good results, the researchers wanted to explore the effectiveness of a maximum entropy (ME) classifier for the meeting summarization task, as ME is known to be robust against feature dependencies. Maximum entropy has also been applied successfully for summarization in the broadcast news domain.

Adaptive summarization

A promising approach is adaptive document/text summarization.[17] It involves first recognizing the text genre and then applying summarization algorithms optimized for this genre. Such software has been created.[18]

TextRank and LexRank

The unsupervised approach to summarization is also quite similar in spirit to unsupervised keyphrase extraction and gets around the issue of costly training data. Some unsupervised summarization approaches are based on finding a "centroid" sentence, which is the mean word vector of all the sentences in the document. Then the sentences can be ranked with regard to their similarity to this centroid sentence.

A more principled way to estimate sentence importance is using random walks and eigenvector centrality. LexRank[19] is an algorithm essentially identical
to TextRank, and both use this approach for document summarization. The two methods were developed by different groups at the same time, and LexRank simply focused on summarization, but could just as easily be used for keyphrase extraction or any other NLP ranking task.

In both LexRank and TextRank, a graph is constructed by creating a vertex for each sentence in the document. The edges between sentences are based on some form of semantic similarity or content overlap. While LexRank uses cosine similarity of TF-IDF vectors, TextRank uses a very similar measure based on the number of words two sentences have in common (normalized by the sentences' lengths). The LexRank paper explored using unweighted edges after applying a threshold to the cosine values, but also experimented with using edges with weights equal to the similarity score. TextRank uses continuous similarity scores as weights.

In both algorithms, the sentences are ranked by applying PageRank to the resulting graph. A summary is formed by combining the top-ranking sentences, using a threshold or length cutoff to limit the size of the summary.

It is worth noting that TextRank was applied to summarization exactly as described here, while LexRank was used as part of a larger summarization system (MEAD) that combines the LexRank score (stationary probability) with other features like sentence position and length, using a linear combination with either user-specified or automatically tuned weights. In this case, some training documents might be needed, though the TextRank results show the additional features are not absolutely necessary.

Unlike TextRank, LexRank has been applied to multi-document summarization.

Multi-document summarization

Main article: Multi-document summarization

Multi-document summarization is an automatic procedure aimed at extraction of information from multiple texts written about the same topic. The resulting summary report allows individual users, such as professional information consumers, to quickly familiarize
themselves with information contained in a large cluster of documents. In such a way, multi-document summarization systems complement news aggregators, performing the next step down the road of coping with information overload. Multi-document summarization may also be done in response to a question.[20][11]

Multi-document summarization creates information reports that are both concise and comprehensive. With different opinions being put together and outlined, every topic is described from multiple perspectives within a single document. While the goal of a brief summary is to simplify information search and cut the time by pointing to the most relevant source documents, a comprehensive multi-document summary should itself contain the required information, hence limiting the need for accessing original files to cases when refinement is required. Automatic summaries present information extracted from multiple sources algorithmically, without any editorial touch or subjective human intervention; the claim that this makes them completely unbiased is disputed.

Diversity

Multi-document extractive summarization faces a problem of redundancy. Ideally, we want to extract sentences that are both "central" (i.e., contain the main ideas) and "diverse" (i.e., they differ from one another). For example, in a set of news articles about some event, each article is likely to have many similar sentences. To address this issue, LexRank applies a heuristic post-processing step that adds sentences in rank order, but discards sentences that are too similar to ones already in the summary. This method is called Cross-Sentence Information Subsumption (CSIS). These methods work based on the idea that sentences "recommend" other similar sentences to the reader. Thus, if one sentence is very similar to many others, it will likely be a sentence of great importance. Its importance also stems from the importance of the sentences "recommending" it. Thus, to get ranked highly and placed in a summary, a sentence must be similar to many sentences
that are in turn also similar to many other sentences. This makes intuitive sense and allows the algorithms to be applied to an arbitrary new text. The methods are domain-independent and easily portable. One could imagine the features indicating important sentences in the news domain might vary considerably from the biomedical domain. However, the unsupervised "recommendation"-based approach applies to any domain.

A related method is Maximal Marginal Relevance (MMR).[21] Another is a general-purpose graph-based ranking algorithm, like Page/Lex/TextRank, that handles both "centrality" and "diversity" in a unified mathematical framework based on absorbing Markov chain random walks (a random walk where certain states end the walk). The algorithm is called GRASSHOPPER.[22] In addition to explicitly promoting diversity during the ranking process, GRASSHOPPER incorporates a prior ranking (based on sentence position in the case of summarization).

Strong results for multi-document summarization have been reported using mixtures of submodular functions, which achieved state-of-the-art results at the time on the Document Summarization Corpora DUC 04–07.[23] Similar results were achieved with the use of determinantal point processes (which are a special case of submodular functions) for DUC-04.[24]

A new method for multi-lingual multi-document summarization that avoids redundancy generates ideograms to represent the meaning of each sentence in each document, then evaluates similarity by comparing ideogram shape and position. It does not use word frequency, training or preprocessing. It uses two user-supplied parameters: equivalence (when are two sentences to be considered equivalent?) and relevance (how long is the desired summary?).

Submodular functions as generic tools for summarization

The idea of a submodular set function has recently emerged as a powerful modeling tool for various summarization problems. Submodular functions naturally model notions of coverage, information, representation and diversity.
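
As a rough illustration of coverage as a submodular objective, the Python sketch below treats each sentence as a set of concepts and picks sentences one at a time by largest marginal gain in covered concepts. It is not taken from any of the cited systems; the sentence ids, the concept sets and the function names (`coverage`, `marginal_gain`, `greedy_summary`) are all hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List


def coverage(selected: Iterable[str], concepts: Dict[str, FrozenSet[str]]) -> int:
    """Number of distinct concepts covered by the selected sentences."""
    covered = set()
    for sentence_id in selected:
        covered |= concepts[sentence_id]
    return len(covered)


def marginal_gain(candidate: str, selected: Iterable[str],
                  concepts: Dict[str, FrozenSet[str]]) -> int:
    """Extra concepts covered when `candidate` is added to `selected`."""
    selected = list(selected)
    return coverage(selected + [candidate], concepts) - coverage(selected, concepts)


def greedy_summary(concepts: Dict[str, FrozenSet[str]], budget: int) -> List[str]:
    """Pick up to `budget` sentences, each time taking the largest marginal gain."""
    chosen: List[str] = []
    for _ in range(budget):
        best, best_gain = None, 0
        for sentence_id in concepts:
            if sentence_id in chosen:
                continue
            gain = marginal_gain(sentence_id, chosen, concepts)
            if gain > best_gain:  # strict: ties keep the earlier sentence
                best, best_gain = sentence_id, gain
        if best is None:  # nothing adds new concepts; stop early
            break
        chosen.append(best)
    return chosen


# Hypothetical toy input: sentence id -> concepts it mentions.
sentences = {
    "s1": frozenset({"storm", "flooding"}),
    "s2": frozenset({"storm", "flooding", "evacuation"}),
    "s3": frozenset({"evacuation"}),
    "s4": frozenset({"relief", "funding"}),
}

summary = greedy_summary(sentences, budget=2)
print(summary)
```

In this toy input, adding "s3" to an empty summary gains one concept, but adding it after "s2" gains nothing. This diminishing-returns behavior is what makes a coverage function submodular, and it is also why the greedy choice avoids redundant sentences.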
Moreover, several important combinatorial optimization problems occur as special instances of submodular optimization. For example, the set cover problem is a special case of submodular optimization, since the set cover function is submodular. The set cover function attempts to find a subset of objects which cover a given set of concepts. For example, in document summarization, one would like the summary to cover all important and relevant concepts in the document. This is an instance of set cover. Similarly, the facility location problem is a special case of submodular functions. The facility location function also naturally models coverage and diversity. Another example of a submodular optimization problem is using a determinantal point process to model diversity. Similarly, the Maximum-Marginal-Relevance procedure can also be seen as an instance of submodular optimization. All these important models, encouraging coverage, diversity and information, are submodular. Moreover, submodular functions can be efficiently combined (for example, as non-negative weighted sums), and the resulting function is still submodular. Hence, one could combine one submodular function which models diversity with another which models coverage, and use human supervision to learn the right model of a submodular function for the problem.

While submodular functions are a natural fit for summarization problems, they also admit very efficient algorithms for optimization. For example, for maximizing a monotone submodular function under a cardinality constraint, a simple greedy algorithm admits a constant-factor guarantee of 1 − 1/e.[25] Moreover, the greedy algorithm is extremely simple to implement and can scale to large datasets, which is very important for summarization problems.

Submodular functions have been reported to achieve state-of-the-art results on many summarization problems. For example, work by Lin and Bilmes, 2012,[26] shows that submodular functions achieved the best results at the time on the DUC-04, DUC-05, DUC-06 and DUC-07 document summarization tasks. Similarly, work by Lin and Bilmes, 2011,[27] shows that many existing systems for automatic summarization are instances of
submodular functions. This was a breakthrough result establishing submodular functions as the right models for summarization problems.[citation needed]

Submodular functions have also been used for other summarization tasks. Tschiatschek et al., 2014 show[28] that mixtures of submodular functions achieve state-of-the-art results for image collection summarization. Similarly, Bairi et al., 2015[29] show the utility of submodular functions for summarizing multi-document topic hierarchies. Submodular functions have also successfully been used for summarizing machine learning datasets.[30]

Applications

Specific applications of automatic summarization include:

- The Reddit bot "autotldr",[31] created in 2011, summarizes news articles in the comment section of reddit posts. It was found to be very useful by the reddit community, which upvoted its summaries hundreds of thousands of times.[32] The name is a reference to TL;DR, Internet slang for "too long; didn't read".[33][34]
- Adversarial stylometry may make use of summaries, if the detail lost is not major and the summary is sufficiently stylistically different to the input.[35]

Evaluation

The most common way to evaluate the informativeness of automatic summaries is to compare them with human-made model summaries.

Evaluation can be intrinsic or extrinsic,[36] and inter-textual or intra-textual.[37]

Intrinsic versus extrinsic

Intrinsic evaluation assesses the summaries directly, while extrinsic evaluation evaluates how the summarization system affects the completion of some other task. Intrinsic evaluations have assessed mainly the coherence and informativeness of summaries. Extrinsic evaluations, on the other hand, have tested the impact of summarization on tasks like relevance assessment, reading comprehension, etc.

Inter-textual versus intra-textual

Intra-textual evaluation assesses the output of a specific summarization system, while inter-textual evaluation focuses on contrastive
analysis of outputs of several summarization systems.

Human judgement often varies greatly in what it considers a "good" summary, so creating an automatic evaluation process is particularly difficult. Manual evaluation can be used, but this is both time and labor-intensive, as it requires humans to read not only the summaries but also the source documents. Other issues are those concerning coherence and coverage.

The most common way to evaluate summaries is ROUGE (Recall-Oriented Understudy for Gisting Evaluation). It is very common for summarization and translation systems in NIST's Document Understanding Conferences.[2] ROUGE is a recall-based measure of how well a summary covers the content of human-generated summaries known as references. It calculates n-gram overlaps between automatically generated summaries and previously written human summaries. It is recall-based to encourage inclusion of all important topics in summaries. Recall can be computed with respect to unigram, bigram, trigram, or 4-gram matching. For example, ROUGE-1 is the fraction of unigrams that appear in both the reference summary and the automatic summary out of all unigrams in the reference summary. If there are multiple reference summaries, their scores are averaged. A high level of overlap should indicate a high degree of shared concepts between the two summaries.

ROUGE cannot determine if the result is coherent, that is, if the sentences flow together sensibly. High-order n-gram ROUGE measures help to some degree.

Another unsolved problem is anaphor resolution. Similarly, for image summarization, Tschiatschek et al. developed a Visual-ROUGE score which judges the performance of algorithms for image summarization.[38]

Domain-specific versus domain-independent summarization

Domain-independent summarization techniques apply sets of general features to identify information-rich text segments. Recent research focuses on domain-specific summarization using knowledge specific to the text's domain, such as medical knowledge
and ontologies for summarizing medical texts.[39]

Qualitative

The main drawback of the evaluation systems so far is that we need a reference summary (for some methods, more than one) to compare automatic summaries with models. This is a hard and expensive task. Much effort has to be made to create corpora of texts and their corresponding summaries. Furthermore, some methods require manual annotation of the summaries (e.g. SCU in the Pyramid Method). Moreover, they all perform a quantitative evaluation with regard to different similarity metrics.

History

The first publication in the area dates back to 1957[40] (Hans Peter Luhn), starting with a statistical technique. Research increased significantly in 2015. Term frequency–inverse document frequency had been used by 2016. Pattern-based summarization was the most powerful option for multi-document summarization found by 2016. In the following year it was surpassed by latent semantic analysis (LSA) combined with non-negative matrix factorization (NMF). Although they did not replace other approaches and are often combined with them, by 2019 machine learning methods dominated the extractive summarization of single documents, which was considered to be nearing maturity. By 2020, the field was still very active and research was shifting towards abstractive summarization and real-time summarization.[41]

Recent approaches

Recently, the rise of transformer models replacing more traditional RNNs (LSTM) has provided flexibility in the mapping of text sequences to text sequences of a different type, which is well suited to automatic summarization. This includes models such as T5[42] and Pegasus.[43]

See also

- Sentence extraction
- Text mining
- Multi-document summarization

References

1. Torres-Moreno, Juan-Manuel (1 October 2014). Automatic Text Summarization. Wiley. pp. 320–. ISBN 978-1-848-21668-6.
2. Pan, Xingjia; Tang, Fan; Dong, Weiming; Ma, Chongyang; Meng, Yiping; Huang, Feiyue; Lee, Tong-Yee; Xu, Changsheng (2021-04-01). "Content-Based Visual Summarization for Image Collection". IEEE Transactions on Visualization and Computer Graphics. 27 (4): 2298–2312. doi:10.1109/tvcg.2019.2948611. ISSN 1077-2626. PMID 31647438. S2CID 204865221.
3. "WIPO PUBLISHES PATENT OF KT FOR "IMAGE SUMMARIZATION SYSTEM AND METHOD" (SOUTH KOREAN INVENTORS)". US Fed News Service. January 10, 2018. ProQuest 1986931333. Retrieved January 22, 2021.
4. Li Tan; Yangqiu Song; Shixia Liu; Lexing Xie (February 2012). "ImageHive: Interactive Content-Aware Image Summarization". IEEE Computer Graphics and Applications. 32 (1): 46–55. doi:10.1109/mcg.2011.89. ISSN 0272-1716. PMID 24808292. S2CID 7668289.
5. Sankar K. Pal; Alfredo Petrosino; Lucia Maddalena (25 January 2012). Handbook on Soft Computing for Video Surveillance. CRC Press. pp. 81–. ISBN 978-1-4398-5685-7.
6. Elhamifar, Ehsan; Sapiro, Guillermo; Vidal, Rene (2012). "See all by looking at a few: Sparse modeling for finding representative objects". 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE. pp. 1600–1607. doi:10.1109/CVPR.2012.6247852. ISBN 978-1-4673-1228-8. S2CID 5909301. Retrieved 4 December 2022.
7. Mademlis, Ioannis; Tefas, Anastasios; Nikolaidis, Nikos; Pitas, Ioannis (2016). "Multimodal stereoscopic movie summarization conforming to narrative characteristics" (PDF). IEEE Transactions on Image Processing. IEEE. 25 (12): 5828–5840. Bibcode:2016ITIP...25.5828M. doi:10.1109/TIP.2016.2615289. hdl:1983/2bcdd7a5-825f-4ac9-90ec-f2f538bfcb72. PMID 28113502. S2CID 18566122. Retrieved 4 December 2022.
8. Mademlis, Ioannis; Tefas, Anastasios; Pitas, Ioannis (2018). "A salient dictionary learning framework for activity video summarization via key-frame extraction". Information Sciences. Elsevier. 432: 319–331. doi:10.1016/j.ins.2017.12.020. Retrieved 4 December 2022.
9. "Auto-generated Summaries in Google Docs". Google AI Blog. 23 March 2022. Retrieved 2022-04-03.
10. Richard Sutz, Peter Weverka. How to skim text. https://www.dummies.com/education/language-arts/speed-reading/how-to-skim-text Accessed Dec 2019.
11. Afzal M, Alam F, Malik KM, Malik GM. Clinical Context-Aware Biomedical Text Summarization Using Deep Neural Network: Model Development and Validation. J Med Internet Res 2020;22(10):e19810. DOI: 10.2196/19810. PMID 33095174.
12. Zhai, ChengXiang (2016). Text data management and analysis: a practical introduction to information retrieval and text mining. Sean Massung. [New York, NY]. p. 321. ISBN 978-1-970001-19-8. OCLC 957355971.
13. Jorge E. Camargo and Fabio A. Gonzalez. A Multi-class Kernel Alignment Method for Image Collection Summarization. In Proceedings of the 14th Iberoamerican Conference on Pattern Recognition: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (CIARP '09), Eduardo Bayro-Corrochano and Jan-Olof Eklundh (Eds.). Springer-Verlag, Berlin, Heidelberg, 545–552. doi:10.1007/978-3-642-10268-4_64.
14. Alrehamy, Hassan H; Walker, Coral (2018). "SemCluster: Unsupervised Automatic Keyphrase Extraction Using Affinity Propagation". Advances in Computational Intelligence Systems. Advances in Intelligent Systems and Computing. Vol. 650. pp. 222–235. doi:10.1007/978-3-319-66939-7_19. ISBN 978-3-319-66938-0.
15. Turney, Peter D (2002). "Learning Algorithms for Keyphrase Extraction". Information Retrieval. 2 (4): 303–336. arXiv:cs/0212020. Bibcode:2002cs.......12020T. doi:10.1023/A:1009976227802. S2CID 7007323.
16. Rada Mihalcea and Paul Tarau, 2004: TextRank: Bringing Order into Texts, Department of Computer Science, University of North Texas. "Archived copy" (PDF). Archived from the original on 2012-06-17. Retrieved 2012-07-20.
17. Yatsko, V. A.; Starikov, M. S.; Butakov, A. V. (2010). "Automatic genre recognition and adaptive text summarization". Automatic Documentation and Mathematical Linguistics. 44 (3): 111–120. doi:10.3103/S0005105510030027. S2CID 1586931.
18. UNIS (Universal Summarizer).
19. Gunes Erkan and Dragomir R. Radev: LexRank: Graph-based Lexical Centrality as Salience in Text Summarization.
20. "Versatile question answering systems: seeing in synthesis". International Journal of Intelligent Information Database Systems, 5(2), 119–142, 2011.
21. Carbonell, Jaime, and Jade Goldstein. "The use of MMR, diversity-based reranking for reordering documents and producing summaries". Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 1998.
22. Zhu, Xiaojin, et al. "Improving Diversity in Ranking using Absorbing Random Walks". HLT-NAACL. 2007.
23. Hui Lin, Jeff Bilmes. "Learning mixtures of submodular shells with application to document summarization".
24. Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, December 2012.
25. Nemhauser, George L., Laurence A. Wolsey, and Marshall L. Fisher. "An analysis of approximations for maximizing submodular set functions I". Mathematical Programming 14.1 (1978): 265–294.
26. Hui Lin, Jeff Bilmes. "Learning mixtures of submodular shells with application to document summarization". UAI, 2012.
27. Hui Lin, Jeff Bilmes. "A Class of Submodular Functions for Document Summarization". The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), 2011.
28. Sebastian Tschiatschek, Rishabh Iyer, Hoachen Wei and Jeff Bilmes. Learning Mixtures of Submodular Functions for Image Collection Summarization. In Advances of Neural Information Processing Systems (NIPS), Montreal, Canada, December 2014.
29. Ramakrishna Bairi, Rishabh Iyer, Ganesh Ramakrishnan and Jeff Bilmes. Summarizing Multi-Document Topic Hierarchies using Submodular Mixtures. To appear in the Annual Meeting of the Association for Computational Linguistics (ACL), Beijing, China, July 2015.
30. Kai Wei, Rishabh Iyer, and Jeff Bilmes. Submodularity in Data Subset Selection and Active Learning. Archived 2017-03-13 at the Wayback Machine. To appear in Proc. International Conference on Machine Learning (ICML), Lille, France, June 2015.
31. "overview for autotldr". reddit. Retrieved 9 February 2017.
32. Squire, Megan (2016-08-29). Mastering Data Mining with Python – Find patterns hidden in your data. Packt Publishing Ltd. ISBN 9781785885914. Retrieved 9 February 2017.
33. "What Is 'TLDR'?". Lifewire. Retrieved 9 February 2017.
34. "What Does TL;DR Mean? AMA? TIL? Glossary Of Reddit Terms And Abbreviations". International Business Times. 29 March 2012. Retrieved 9 February 2017.
35. Potthast, Hagen & Stein 2016, p. 11–12.
36. Mani, I. Summarization evaluation: an overview.
37. Yatsko, V. A.; Vishnyakov, T. N. (2007). "A method for evaluating modern systems of automatic text summarization". Automatic Documentation and Mathematical Linguistics. 41 (3): 93–103. doi:10.3103/S0005105507030041. S2CID 7853204.
38. Sebastian Tschiatschek, Rishabh Iyer, Hoachen Wei and Jeff Bilmes. Learning Mixtures of Submodular Functions for Image Collection Summarization. In Advances of Neural Information Processing Systems (NIPS), Montreal, Canada, December 2014. (PDF)
39. Sarker, Abeed; Molla, Diego; Paris, Cecile (2013). "An Approach for Query-Focused Text Summarisation for Evidence Based Medicine". Artificial Intelligence in Medicine. Lecture Notes in Computer Science. Vol. 7885. pp. 295–304. doi:10.1007/978-3-642-38326-7_41. ISBN 978-3-642-38325-0.
40. Luhn, Hans Peter (1957). "A Statistical Approach to Mechanized Encoding and Searching of Literary Information" (PDF). IBM Journal of Research and Development. 1 (4): 309–317. doi:10.1147/rd.14.0309.
41. Widyassari, Adhika Pramita; Rustad, Supriadi; Shidik, Guruh Fajar; Noersasongko, Edi; Syukur, Abdul; Affandy, Affandy; Setiadi, De Rosal Ignatius Moses (2020-05-20). "Review of automatic text summarization techniques & methods". Journal of King Saud University - Computer and Information Sciences. 34 (4): 1029–1046. doi:10.1016/j.jksuci.2020.05.006. ISSN 1319-1578.
42. "Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer". Google AI Blog. 24 February 2020. Retrieved 2022-04-03.
43. Zhang, J., Zhao, Y., Saleh, M., & Liu, P. (2020, November). Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning (pp. 11328–11339). PMLR.

Works cited

- Potthast, Martin; Hagen, Matthias; Stein, Benno (2016). Author Obfuscation: Attacking the State of the Art in Authorship Verification (PDF). Conference and Labs of the Evaluation Forum.

Further reading

- Hercules Dalianis (2003). Porting and evaluation of automatic summarization.
- Roxana Angheluta (2002). The Use of Topic Segmentation for Automatic Summarization.
- Anne Buist (2004). Automatic Summarization of Meeting Data: A Feasibility Study (PDF). Archived from the original (PDF) on 2021-01-23. Retrieved 2020-07-19.
- Annie Louis (2009). Performance Confidence Estimation for Automatic Summarization.
- Elena Lloret and Manuel Palomar (2009). Challenging Issues of Automatic Summarization: Relevance Detection and Quality-based Evaluation. Archived from the original on 2018-10-03. Retrieved 2018-10-03.
- Andrew Goldberg (2007). Automatic Summarization.
- Alrehamy, Hassan (2018). "SemCluster: Unsupervised Automatic Keyphrase Extraction Using Affinity Propagation". Advances in Computational Intelligence Systems. Advances in Intelligent Systems and Computing. Vol. 650. pp. 222–235. doi:10.1007/978-3-319-66939-7_19. ISBN 978-3-319-66938-0.
- Endres-Niggemeyer, Brigitte (1998). Summarizing Information. Springer. ISBN 978-3-540-63735-6.
- Marcu, Daniel (2000). The Theory and Practice of Discourse Parsing and Summarization. MIT Press. ISBN 978-0-262-13372-2.
- Mani, Inderjeet (2001). Automatic Summarization. ISBN 978-1-58811-060-2.
- Huff, Jason (2010). AutoSummarize. Conceptual artwork using automatic summarization software in Microsoft Word 2008.
- Lehmam, Abderrafih (2010). Essential summarizer: innovative automatic text summarization software in twenty languages. ACM Digital Library. RIAO '10. pp. 216–217. Published in Proceeding RIAO'10 Adaptivity, Personalization and Fusion of Heterogeneous Information, CID Paris, France.
- Xiaojin Zhu, Andrew Goldberg, Jurgen Van Gael, and David Andrzejewski (2007). Improving diversity in ranking using absorbing random walks (PDF). The GRASSHOPPER algorithm.
- Miranda-Jimenez, Sabino; Gelbukh, Alexander; Sidorov, Grigori (2013). "Summarizing Conceptual Graphs for Automatic Summarization Task". Conceptual Structures for STEM Research and Education. Lecture Notes in Computer Science. Vol. 7735. pp. 245–253. doi:10.1007/978-3-642-35786-2_18. ISBN 978-3-642-35785-5.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Automatic_summarization&oldid=1194759715"
