Understanding Automatic Text Summarization, Part 1: Extractive Methods
Text summarization is commonly used by websites and applications to create news feeds and article summaries. It has become essential given our busy schedules: we prefer short summaries containing all the important points over reading a whole report and summarizing it ourselves. Consequently, several attempts have been made to automate the summarization process. In this article, we will discuss some of them and see how they work.
What is summarization?
Summarization is a technique for shortening long texts such that the summary retains all the important points of the original document.
There are mainly four types of summaries:
Approaches to Automatic Summarization
There are mainly two types of summarization:
Extraction-based Summarization: The extractive approach picks the most important phrases and sentences from the document and combines them to create the summary. In this case, every line and word of the summary actually belongs to the original document being summarized.
Abstraction-based Summarization: The abstractive approach involves summarization based on deep learning. It uses new phrases and terms, different from those in the actual document, while keeping the points the same, much like how we summarize ourselves. It is therefore much harder than the extractive approach.
It has been observed that extractive summaries sometimes work better than abstractive ones, probably because extractive methods do not require natural language generation or semantic representations.
Evaluation methods
There are two types of evaluation:
- Human Evaluation
- Automatic Evaluation
Human Evaluation: Scores are assigned by human experts based on how well the summary covers the points and answers the queries, along with other factors such as grammaticality and non-redundancy.
Automatic Evaluation
ROUGE: ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is a method that determines the quality of a summary by comparing it to reference summaries written by humans. To evaluate a model, we take a number of human-written references and the candidate summary generated by the machine. The intuition is that if a model creates a good summary, it must have overlapping portions in common with the human references. ROUGE was proposed by Chin-Yew Lin of the University of Southern California.
Common versions of ROUGE are:
ROUGE-n: It measures the overlap between the machine-generated output and the reference output based on n-grams. An n-gram is a contiguous sequence of n items from a given sample of text or speech, i.e., it is simply a sequence of words. Bigrams are sequences of two words, trigrams of three words, and so on. We normally use bigrams.
The score is computed as:

ROUGE-n = p / q

where p is "the number of common n-grams between candidate and reference summary", and q is "the number of n-grams extracted from the reference summary only". (Source)
ROUGE-L: It states that the longer the longest common subsequence shared by two texts, the more similar they are. It is therefore more flexible than n-gram matching. It assigns scores based on the length of the longest sequence common to the machine-generated candidate and the human reference.
ROUGE-SU: It brings in the concept of skip-bigrams and unigrams. Basically, it allows or considers a bigram even if there are some other words between the two words, i.e., the bigrams do not need to be consecutive.
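To make the skip-bigram idea concrete, here is a tiny sketch in Python (the function name and the max_skip parameter are my own, for illustration):

def skip_bigrams(tokens, max_skip=2):
    # ordered word pairs with at most max_skip words between them
    return [(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in range(i + 1, min(i + 2 + max_skip, len(tokens)))]

print(skip_bigrams("the cat sat".split()))
# [('the', 'cat'), ('the', 'sat'), ('cat', 'sat')]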
ROUGE-2 is the most popular variant and is given by:

ROUGE-2 = [ sum over reference documents S, sum over bigrams i in S of min(count(i, X), count(i, S)) ] / [ sum over reference documents S, sum over bigrams i in S of count(i, S) ]

where, for every bigram i, we calculate the minimum of the number of times it occurs in the generated document X and in the reference document S, for all the reference documents given, divided by the total number of times each bigram appears in all of the reference documents. It is based on the BLEU score.
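As a minimal sketch of the computation in Python, assuming lowercase whitespace tokenization and a single reference summary (the function names are illustrative, not from an official implementation):

from collections import Counter

def bigrams(tokens):
    # count consecutive word pairs
    return Counter(zip(tokens, tokens[1:]))

def rouge_2(candidate, reference):
    cand = bigrams(candidate.lower().split())
    ref = bigrams(reference.lower().split())
    # clipped overlap, divided by the total bigram count of the reference
    overlap = sum(min(cand[b], ref[b]) for b in ref)
    return overlap / sum(ref.values())

print(rouge_2("the cat sat on the mat", "the cat is on the mat"))  # 0.6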
Feature-Based Summarization: Developed by H. P. Luhn at IBM in 1958, the paper proposed that the importance of a sentence is a function of the high-frequency words in the document. More elaborately, the algorithm measures the frequency of words and phrases in the document and decides the importance of a sentence based on the words it contains and their frequencies. It states that a sentence is important if it contains high-frequency words, excluding common stopwords such as "a" and "the".
Extractive Summarization: "Extractive summarization techniques produce summaries by choosing a subset of the sentences in the original text." (Source)
Extractive summarizers first create an intermediate representation, whose main task is to highlight or extract the most important information of the text to be summarized. There are two main types of representations:
Topic representations: This focuses on representing the topics discussed in the texts. There are several kinds of approaches to obtain this representation; we will talk about two of them here. Others include Latent Semantic Analysis and Bayesian models. If you want to study the others as well, I encourage you to go through the references.
Frequency-Driven Approaches: In this approach, we assign weights to the words. If a word is related to the topic, we assign it 1, and otherwise 0. The weights may also be continuous, depending on the implementation. Two common techniques for topic representations are:
Word Probability: It simply uses the frequency of a word as an indicator of its importance. The probability of a word w is given by its frequency of occurrence, f(w), divided by the total number of words N in the input:

P(w) = f(w) / N
For sentence importance using word probabilities, the importance of a sentence is given by the average importance (probability) of the words it contains.
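A minimal sketch of this scoring scheme in Python (the tokenization and function names are my own, for illustration):

from collections import Counter

def score_sentences(sentences):
    # estimate P(w) = f(w) / N over the whole input
    words = [w for s in sentences for w in s.lower().split()]
    prob = {w: c / len(words) for w, c in Counter(words).items()}
    # sentence importance = average probability of its words
    return [sum(prob[w] for w in s.lower().split()) / len(s.split())
            for s in sentences]

print(score_sentences(["the cat sat on the mat", "dogs bark loudly"]))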
TF-IDF (Term Frequency-Inverse Document Frequency): This method was devised as an improvement over the word probability method. Here, the TF-IDF scheme is used to assign the weights. TF-IDF assigns low weights to words that occur very frequently across most documents, under the intuition that they are stopwords or words like "the". Otherwise, due to the term frequency, if a word appears in a particular document uniquely with a high frequency, it is given a high weight.
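A short sketch using scikit-learn's TfidfVectorizer, treating each sentence as a document and scoring a sentence by the sum of its TF-IDF weights (this scoring choice is one common option, not the only one):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "Dogs and cats are common pets.",
]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(sentences)   # sentence-by-term weight matrix
scores = np.asarray(tfidf.sum(axis=1)).ravel()
for score, s in sorted(zip(scores, sentences), reverse=True):
    print(f"{score:.3f}  {s}")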
Topic Word Approaches: This approach is similar to Luhn's approach. "The topic word technique is one of the common topic representation approaches which aims to identify words that describe the topic of the input document." (Source) This method calculates word frequencies and uses a frequency threshold to find the words that can potentially describe the topic. It then classifies the importance of a sentence as a function of the number of topic words it contains.
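A tiny illustration of the thresholding idea (the threshold value is arbitrary; real systems use likelihood-ratio tests or tuned cutoffs):

from collections import Counter

text = "data models learn from data and more data improves the models"
freq = Counter(text.split())
topic_words = {w for w, c in freq.items() if c >= 2 and w not in {"the", "and"}}
print(topic_words)  # {'data', 'models'}

def sentence_score(sentence, topic_words):
    # importance = number of topic words the sentence contains
    return sum(w in topic_words for w in sentence.lower().split())

print(sentence_score("New data improves the models", topic_words))  # 2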
Indicator Representations: This type of representation depends on the features of the sentences and ranks them on the basis of those features. Here, the importance of a sentence does not depend on the words it contains, as in topic representations, but directly on the sentence features. There are two methods for this type of representation. Let's look at them.
Graph-Based Methods: These are based on the PageRank algorithm. They represent text documents as connected graphs: the sentences are the nodes of the graph, and the edges between nodes measure the similarity between the two sentences they connect. We will talk about this in detail in the upcoming sections.
Machine-Learning Methods: The machine learning methods approach the summarization problem as a classification problem: the models try to classify sentences, based on their features, into summary or non-summary sentences. For training the models, we have a training set of documents and their corresponding human-written extractive reference summaries. Naive Bayes, Decision Trees, and SVMs are commonly used here.
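A rough sketch of this classification framing with scikit-learn (the features and labels below are invented for illustration; real systems use richer features such as sentence position, length, and topic-word counts):

import numpy as np
from sklearn.naive_bayes import GaussianNB

# hypothetical features per sentence: [relative position, word count, topic-word count]
X_train = np.array([
    [0.0, 12, 3],
    [0.5, 20, 1],
    [0.9, 8, 0],
    [0.1, 15, 2],
])
y_train = np.array([1, 0, 0, 1])  # 1 = summary sentence, 0 = non-summary

clf = GaussianNB().fit(X_train, y_train)
print(clf.predict([[0.05, 14, 2]]))  # likely classified as a summary sentence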
Scoring and Sentence Selection
Once we have the intermediate representations, we assign a score to each sentence to specify its importance. For topic representations, a sentence's score depends on the topic words it contains; for an indicator representation, the score depends on the features of the sentence. Finally, the top-scoring sentences are picked and used to generate the summary.
Graph-Based Methods
Graph-based methods were first introduced in a paper by Rada Mihalcea and Paul Tarau of the University of North Texas. The method is called the TextRank algorithm and is influenced by Google's PageRank algorithm. The algorithm primarily tries to find the importance of a vertex in a given graph.
Now, how does the algorithm work?
In this algorithm, each sentence is represented as a vertex. An edge joining two vertices (two sentences) denotes that the two sentences are similar: if the similarity of any two sentences is greater than a particular threshold, the nodes representing those sentences are joined by an edge.
When two vertices are joined, one vertex is effectively casting a vote for the other. The more votes a particular node (vertex, i.e., sentence) receives, the more important that node, and hence the sentence it represents. The votes are also weighted: each vote does not carry the same importance. The importance of a vote depends on the importance of the node casting it; the more important the voting node, the more important its vote. So, the number of votes cast for a sentence and the importance of those votes together determine the importance of the sentence. This is the same idea behind Google's PageRank algorithm and how it ranks webpages, except that there the nodes represent webpages.
If we have a paragraph, we decompose it into a set of sentences. Say we represent each sentence as a vertex v_i; we then obtain a set of vertices V. As discussed, an edge joins a vertex with another vertex of the same set, so the set of edges E can be represented as a subset of (V x V). For a directed graph, In(V_i) is the set of incoming edges to a node, Out(V_j) is the set of outgoing edges from a given node, and the importance score of a vertex is denoted S(V_j).
PageRank Algorithm
According to the Google PageRank algorithm:
S(V_i) = (1 - d) + d * sum over V_j in In(V_i) of [ S(V_j) / |Out(V_j)| ]

where S(V_i) is the score of the node under consideration, and S(V_j) ranges over all nodes that have outgoing edges to V_i. The score of each V_j is divided by the out-degree of V_j, which accounts for the probability that the user will choose that particular webpage.
Elaborately, suppose that standing at node A, a user can go to both B and C; then the chance of going to C is 1/2, i.e., 1/(out-degree of A). The factor d is called the damping factor. In the original PageRank algorithm, d incorporates randomness: 1 - d is the probability that the user jumps to a random webpage instead of following the connected ones. The factor is generally set to 0.85. The same scheme is used in the TextRank algorithm.
Now the question arises: how do we obtain the scores?
Let's walk through the PageRank algorithm first, then adapt it for TextRank. Suppose we have a graph with 4 vertices. First, we assign random scores to all the vertices, say [0.8, 0.9, 0.9, 0.9]. Then, probability scores are assigned to the edges.
These edge scores form the adjacency matrix of the graph. It can be observed that its entries are the probability values, i.e., 1/out-degree of the corresponding node. So the PageRank graph itself is effectively unweighted: the only weighting in the equation is this 1/out-degree term.
Now the whole update can be written in matrix form as:

S_new = (1 - d) + d * M * S_old

where M is the transition matrix built from the adjacency values above.
We can see that the old score vector is multiplied by this matrix to get the new score vector. We continue this until the L2 norm of the difference between the new and old score vectors becomes smaller than a given constant, typically 1e-8. This convergence property follows from linear algebra and the theory of eigenvalues and eigenvectors; we will skip the maths to keep it simple. Once convergence is achieved, we obtain the final importance scores from the score vector.
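A compact power-iteration sketch in NumPy, following the update and convergence test described above (the toy graph and tolerance are illustrative):

import numpy as np

def pagerank(M, d=0.85, tol=1e-8):
    # M[i, j] = 1 / out-degree of node j if there is an edge j -> i, else 0
    n = M.shape[0]
    scores = np.full(n, 1.0 / n)
    while True:
        new_scores = (1 - d) + d * (M @ scores)
        if np.linalg.norm(new_scores - scores) < tol:  # L2 convergence test
            return new_scores
        scores = new_scores

# toy graph: A -> B, A -> C, B -> C, C -> A
M = np.array([
    [0.0, 0.0, 1.0],   # edges into A
    [0.5, 0.0, 0.0],   # edges into B
    [0.5, 1.0, 0.0],   # edges into C
])
print(pagerank(M))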
For the TextRank algorithm, the equation and the graph are modified to use a weighted graph, because here simply dividing by the out-degree does not convey the full importance. As a result, the equation becomes:

WS(V_i) = (1 - d) + d * sum over V_j in In(V_i) of [ w_ji / (sum over V_k in Out(V_j) of w_jk) ] * WS(V_j)

where w_ji represents the weight factor on the edge from V_j to V_i. (Source)
The implementation of TextRank covers two different natural language processing tasks:
Keyword extraction task
Previously this was done using a frequency factor, which gave comparatively poor results. The TextRank paper introduced a fully unsupervised algorithm. According to the algorithm, the natural language text is tokenized and part-of-speech tagged, and single words are added to the word graph as nodes. If two words are similar, the corresponding nodes are connected by an edge. Similarity is measured using the co-occurrence of words: if two words occur within a window of N words, with N varying from 2 to 10, the two words are considered similar. The words with the most important incident edges are selected as the most important keywords.
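A small sketch of the co-occurrence graph with networkx (the window size is a parameter; POS filtering, which the paper uses to keep only certain word classes, is omitted for brevity):

import networkx as nx

def cooccurrence_graph(tokens, window=3):
    # connect words that appear within `window` tokens of each other
    g = nx.Graph()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[i] != tokens[j]:
                g.add_edge(tokens[i], tokens[j])
    return g

tokens = "compatibility of systems of linear constraints over natural numbers".split()
ranks = nx.pagerank(cooccurrence_graph(tokens))
print(sorted(ranks, key=ranks.get, reverse=True)[:3])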
Sentence extraction task
It works similarly to keyword extraction; the only difference is that in keyword extraction the nodes represent keywords, while here they represent entire sentences. To form the graph for sentence ranking, the algorithm creates a vertex for each sentence in the text and adds it to the graph. Since sentences are too large, co-occurrence measures cannot be applied. Instead, the paper uses a "similarity" between two sentences based on their content overlap; in simpler words, the similarity depends on the number of common word tokens present in the two sentences. The authors propose a very interesting "recommendation" insight here: an edge joining two similar sentences (vertices) is like a recommendation to the reader to read another line that is similar to the current line they are reading. The similarity therefore denotes shared content or interest between the two sentences. To prevent long sentences from being unfairly recommended, the importance is multiplied by a normalizing factor.
The similarity between two sentences is given by:
Similarity(S_i, S_j) = |{ w_k : w_k in S_i and w_k in S_j }| / ( log(|S_i|) + log(|S_j|) )

where, given two sentences S_i and S_j, each sentence is represented by the set of N_i words that appear in it; the numerator counts the words common to both sentences, and the log terms in the denominator normalize for sentence length. (Source)
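As a quick sketch of this overlap measure in Python (lowercase whitespace tokenization; assumes sentences longer than one word so the logs are positive):

import math

def textrank_similarity(s1, s2):
    # content overlap normalized by sentence lengths
    w1, w2 = s1.lower().split(), s2.lower().split()
    common = len(set(w1) & set(w2))
    return common / (math.log(len(w1)) + math.log(len(w2)))

print(textrank_similarity("the cat sat on the mat",
                          "the cat lay on the rug"))  # about 0.84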
The most important sentences are then obtained in the same way as for keyword extraction.
This is an overall view of how TextRank operates; please go through the original paper to explore more.
In practice, for summary extraction, we often use cosine similarity to decide the similarity between two sentences. Using this method, we may obtain several connected subgraphs that correspond to the important topics in the document; the connected components of the graph give the sentences important for the corresponding topics.
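Putting the pieces together, here is a minimal extractive summarizer sketch using TF-IDF vectors, cosine similarity, and networkx's PageRank (the sentence splitting and top_n choice are simplified for illustration):

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarize(sentences, top_n=2):
    # sentence graph weighted by cosine similarity of TF-IDF vectors
    tfidf = TfidfVectorizer().fit_transform(sentences)
    graph = nx.from_numpy_array(cosine_similarity(tfidf))
    scores = nx.pagerank(graph)  # TextRank importance scores
    top = sorted(range(len(sentences)), key=scores.get, reverse=True)[:top_n]
    return [sentences[i] for i in sorted(top)]  # keep original order

sents = [
    "Text summarization shortens long documents.",
    "Extractive methods select existing sentences from the text.",
    "The weather was pleasant yesterday.",
]
print(summarize(sents))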
The "pytextrank" library lets you apply the TextRank algorithm directly in Python.
import spacy
import pytextrank
# example text
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."
# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")
# add PyTextRank to the spaCy pipeline (pytextrank 2.x API; newer
# releases register the component with nlp.add_pipe("textrank") instead)
tr = pytextrank.TextRank()
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)
doc = nlp(text)
# examine the top-ranked phrases in the document
for p in doc._.phrases:
    print("{:.4f} {:5d} {}".format(p.rank, p.count, p.text))
    print(p.chunks)
Implementation of the pytextrank library, from Source.
For application details, please refer to the GitHub link.
Conclusion
In this article, we have seen basic extractive summarization approaches and the details of the TextRank algorithm. For abstractive methods, feel free to go through Part 2 of this article.
I hope this helps.
Translated from: https://towardsdatascience.com/understanding-automatic-text-summarization-1-extractive-methods-8eb512b21ecc
- 下一篇: 用户细分_基于购买历史的用户细分