當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

文本查重：difflib.SequenceMatcher

發布時間：2025/3/21 编程问答 32 豆豆

生活随笔收集整理的這篇文章主要介紹了文本查重：difflib.SequenceMatcher 小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

1. SequenceMatcher FlowChart

1.1 get_matching_blocks()?

1.2?find_longest_match()

1.3 ratio()

2. 例子說明

3. 項目需求函數更改

參考

SequenceMatcher in Python
difflib?

SequenceMatcher的基本思想是找到不包含“junk”元素的最長連續匹配子序列（LCS）。這不會產生最小的編輯序列，但是會產生對人“看起來正確”的匹配。?“junk”是不希望算法與之匹配的東西：例如普通文本文件中的空行，或者HTML文件中的“ <P>”行，等等。SequenceMatcher的輸出很友好，例如：

Input Strings:?my stackoverflow mysteries?AND?mystery

SequenceMatcher algorithm returns：to anyone, the natural match is?"myster"?as follows

my stackoverflow?mysteries
.................mystery..

However, the LCS will output?"mystery"

my?stackoverflow mysteries
my.st.....er......y.......

1. SequenceMatcher FlowChart

? ? ? ? ? ? ? ? ? ??

Given two input strings a and b,

ratio( )?returns the similarity score ( float in [0,1] ) between input strings. It sums the sizes of all matched sequences returned by function?get_matching_blocks?and calculates the ratio as:?ratio = 2.0*M / T?, where M = matches , T = total number of elements in both sequences
get_matching_blocks( )?return list of triples describing matching subsequences. The last triple is a dummy, (len(a), len(b), 0). It works by repeated application of find_longest_match( )
find_longest_match( )?returns a triple containing the longest matching block in a[aLow:aHigh] and b[bLow:bHigh]

? ? ? ? ? ? ? ? ? ? ? ???

1.1 get_matching_blocks()?

該函數可根據自己項目需求進行修改：

def get_matching_blocks(self):if self.matching_blocks is not None:return self.matching_blocksla, lb = len(self.a), len(self.b)queue = [(0, la, 0, lb)]matching_blocks = []while queue:alo, ahi, blo, bhi = queue.pop()i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)# a[alo:i] vs b[blo:j] unknown# a[i:i+k] same as b[j:j+k]# a[i+k:ahi] vs b[j+k:bhi] unknownif k: # if k is 0, there was no matching blockmatching_blocks.append(x)if alo < i and blo < j:queue.append((alo, i, blo, j))if i+k < ahi and j+k < bhi:queue.append((i+k, ahi, j+k, bhi))matching_blocks.sort()

新建一個隊列queue，用包含兩個輸入字符串上下限索引的四元組初始化。當隊列中存在四元組時，將其彈出并傳遞給find_longest_match()函數，該函數返回描述匹配子序列的三元組。三元組添加到matching_blocks列表中。

?三元組格式為：?(i, j, n)，即?a[i:i+n] == b[j:j+n]

匹配子序列左側和右側的序列片段將進一步添加到隊列queue中。重復該過程，直到隊列為空。

然后對matching_blocks列表進行排序，并作為輸出返回。

1.2?find_longest_match()

def find_longest_match(self, alo=0, ahi=None, blo=0, bhi=None):a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.bjunk.__contains__besti, bestj, bestsize = alo, blo, 0# find longest junk-free match# during an iteration of the loop, j2len[j] = length of longest# junk-free match ending with a[i-1] and b[j]j2len = {} # {位置：重復最長長度}nothing = []for i in range(alo, ahi):# look at all instances of a[i] in b; note that because# b2j has no junk keys, the loop is skipped if a[i] is junkj2lenget = j2len.getnewj2len = {} # newj2len 記錄到目前為止匹配的字符串的長度for j in b2j.get(a[i], nothing):# a[i] matches b[j]if j < blo: # 當前匹配子串[]之前的元素continueif j >= bhi: # 當前匹配子串之后的元素breakk = newj2len[j] = j2lenget(j-1, 0) + 1 # 當前位置j的前一個位置對應的重復最長長度，使其+1if k > bestsize: # 若當前最長長度 > bestsize，則將bestsize更新為當前最長長度，同時更新besti、bestjbesti, bestj, bestsize = i-k+1, j-k+1, kj2len = newj2len

輸入為包含字符串上下限索引的四元組，輸出為包含最長匹配塊的三元組。

首先，定義字典b2j，其中對于字符串b中的x，b2j [x]是x出現的索引（到b）的列表。

在外循環中按字符掃描第一個字符串a，我們使用b2j檢查字符串b中該字符的出現。如果存在匹配項，我們將更新另一個字典newj2len，這有助于確保到目前為止匹配的字符串的長度。因此，變量besti，bestj和bestsize進行了更新，其中考慮了迄今為止獲得的最長的匹配塊數據。

在所有最大匹配塊中，該算法返回最早在a中開始的那個，而在所有最大匹配中最早在a中開始的那個，它返回最早在b中開始的那個。

1.3 ratio()

def ratio(self):"""Return a measure of the sequences' similarity (float in [0,1]).Where T is the total number of elements in both sequences, andM is the number of matches, this is 2.0*M / T.Note that this is 1 if the sequences are identical, and 0 ifthey have nothing in common."""matches = sum(triple[-1] for triple in self.get_matching_blocks())return _calculate_ratio(matches, len(self.a) + len(self.b))

retio()函數計算序列a和b的相似度，ratio = 2.0*M / T，M為匹配的字符數，T為兩個序列的總字符數。相似度的計算可根據實際情況進行修改。

根據經驗，ratio（）值超過0.6表示序列是緊密匹配的。在這里，根據計算，我們獲得了0.8的相似性得分比，因此輸入序列對被視為相似。

2. 例子說明

Let’s describe the step by step procedure of the algorithm by implementing it using an input pair of strings.

3. 項目需求函數更改

get_matching_blocks()：

計算a在b中的抄襲率，即a序列在b序列中的重復度，b保持不動，遍歷a的子序列去跟b比較，即（alo, i, 0, lb）或（i+k, ahi, 0, lb）。
當重復字數k達到一定值時，視為抄襲，即k >= duplicateNumber。
相似度radio：ratio = M / len(a)，即重復率?= 重復字符數/a的長度

def get_matching_blocks(self, duplicateNumber=6):"""Return list of triples describing matching subsequences.參數：duplicateNumber: 重復多少字數才算抄襲的下限值"""if self.matching_blocks is not None:return self.matching_blocksla, lb = len(self.a), len(self.b)queue = [(0, la, 0, lb)]matching_blocks = []while queue:alo, ahi, blo, bhi = queue.pop()i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)# a[alo:i] vs b[blo:j] unknown# a[i:i+k] same as b[j:j+k]# a[i+k:ahi] vs b[j+k:bhi] unknownif k: # if k is 0, there was no matching block# 匹配的size長度， k > duplicateNumber， duplicateNumber值根據具體情況而定if k >= duplicateNumber:matching_blocks.append(x)if alo < i:queue.append((alo, i, 0, lb))if i+k < ahi:queue.append((i+k, ahi, 0, lb))matching_blocks.sort() def ratio(self, only_a=True):matches = sum(triple[-1] for triple in self.get_matching_blocks())return matches/len(self.a) if only_a else _calculate_ratio(matches, len(self.a) + len(self.b))

總結

以上是生活随笔為你收集整理的文本查重：difflib.SequenceMatcher的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇：人工智能到底是什么？人工智能如何改变社会
下一篇： python正则表达式——regex模块