Artificial Intelligence with Python, Chapter 10: NLP, Day 4: A Bag of Words

Published: 2023/12/20 · python · 豆豆

Extracting frequent terms with the bag-of-words model

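One of the main goals of text analysis is to convert text into numerical form so that machine learning can be applied to it. Consider a collection of documents containing millions of words: to analyze them, we need to extract the text and convert it into a numerical representation.

Machine learning algorithms need numerical data to work with so that they can analyze it and extract useful information. The bag-of-words model extracts a vocabulary from all the words in the documents and builds a model using a document-term matrix. This lets us represent every document as a bag of words. We only keep track of word counts; grammar and word order are ignored.

So what does a document-term matrix look like? It is a table that records how many times each word occurs in each document. A document can therefore be represented as a weighted combination of words. We can set thresholds to keep only the more meaningful words. In effect, we build a histogram of all the words in the document, and that histogram is the feature vector used for text classification.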

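Consider the following sentences:

Sentence 1: the children are playing in the hall
Sentence 2: The hall has a lot of space
Sentence 3: Lots of children like playing in an open space

Going through these three sentences, ignoring case and counting "Lots" as "lot", gives the following 14 unique words:

the, children, are, playing, in, hall, has, a, lot, of, space, like, an, open

We can build a histogram for each sentence from the number of times each word occurs in it. Each feature vector has 14 dimensions because there are 14 distinct words: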

Sentence 1: [2 1 1 1 1 1 0 0 0 0 0 0 0 0]

Sentence 2: [1 0 0 0 0 1 1 1 1 1 1 0 0 0]

Sentence 3: [0 1 0 1 1 0 0 0 1 1 1 1 1 1]

Now that we have extracted these feature vectors, we can use machine learning algorithms to analyze the data.
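To make the counting concrete, here is a minimal sketch in plain Python that rebuilds the three vectors above. The `normalize` helper and the `PLURALS` mapping are hypothetical: they exist only so that "The" matches "the" and "Lots" matches "lot", as the example assumes. A real pipeline would use a proper tokenizer and stemmer instead.

```python
from collections import Counter

sentences = [
    "the children are playing in the hall",
    "The hall has a lot of space",
    "Lots of children like playing in an open space",
]

# Hypothetical normalization: lowercase everything and map the one
# plural form in this example to its singular.
PLURALS = {"lots": "lot"}


def normalize(word):
    word = word.lower()
    return PLURALS.get(word, word)


tokenized = [[normalize(w) for w in s.split()] for s in sentences]

# Build the vocabulary in order of first appearance
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One count vector per sentence, with one entry per vocabulary word
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Because the vocabulary is built in order of first appearance, the columns line up with the word list given above.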

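How do we build a bag-of-words model in practice? Here we use CountVectorizer from scikit-learn together with the Brown corpus from NLTK. Create a new Python program and import the following packages:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import brown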

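Read text from the Brown corpus. We read the first 5,400 words here; you can use any number you like. The corpus must be available locally, for example after running nltk.download('brown').

# Read the data from the Brown corpus
input_data = ' '.join(brown.words()[:5400])

Define the number of words in each chunk:

# Number of words in each chunk
chunk_size = 800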

Define the chunking function:

# Split the input text into chunks of N words each
def chunker(input_data, N):
    input_words = input_data.split(' ')
    output = []
    cur_chunk = []
    count = 0
    for word in input_words:
        cur_chunk.append(word)
        count += 1
        if count == N:
            output.append(' '.join(cur_chunk))
            count, cur_chunk = 0, []

    # Keep the leftover words, if any, as a final shorter chunk
    if cur_chunk:
        output.append(' '.join(cur_chunk))

    return output
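As a quick check of what chunking produces, the sketch below splits a toy string using list slicing. The name `chunk_by_slicing` is hypothetical and the function is an alternative formulation, not the article's `chunker`; it assumes that a trailing empty chunk is not wanted when the word count is an exact multiple of N.

```python
def chunk_by_slicing(input_data, N):
    """Split a space-separated string into chunks of at most N words."""
    input_words = input_data.split(' ')
    # Step through the word list N words at a time; the last slice
    # is simply shorter when the words do not divide evenly.
    return [' '.join(input_words[start:start + N])
            for start in range(0, len(input_words), N)]


uneven = chunk_by_slicing("a b c d e f g", 3)
even = chunk_by_slicing("a b c d e f", 3)
print(uneven)
print(even)
```

With 5,400 words and a chunk size of 800, the same logic yields six full chunks and a seventh chunk holding the remaining 600 words.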

Split the input text into chunks:

text_chunks = chunker(input_data, chunk_size)

Convert the chunks into dictionary items:

# Convert to dict items
chunks = []
for count, chunk in enumerate(text_chunks):
    d = {'index': count, 'text': chunk}
    chunks.append(d)

Next, extract the document-term matrix, which holds the word counts for each chunk. We use CountVectorizer for this, with two parameters: min_df, the minimum document frequency, and max_df, the maximum document frequency. Document frequency here means the number of documents (chunks) in which a word occurs, not its total number of occurrences in the text.

max_df: a float in the range [0.0, 1.0] or an int; the default is 1.0. It acts as a threshold when the vocabulary is built: a term whose document frequency is strictly higher than max_df is left out of the vocabulary. A float is interpreted as a proportion of the documents in the corpus, an int as an absolute number of documents. The parameter is ignored if a vocabulary is passed explicitly.

min_df: the counterpart of max_df (default 1). A term whose document frequency is strictly lower than min_df is left out of the vocabulary.
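The effect of the two thresholds can be sketched in plain Python. This is an illustration of the filtering rule described above, not scikit-learn's implementation: `document_frequency` and `filter_by_df` are hypothetical helpers, and tokenization is a simple lowercase split on whitespace rather than CountVectorizer's own tokenizer.

```python
def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = {}
    for doc in docs:
        for term in set(doc.lower().split()):
            df[term] = df.get(term, 0) + 1
    return df


def filter_by_df(docs, min_df=1, max_df=1.0):
    """Return the sorted terms whose document frequency lies within the thresholds."""
    n_docs = len(docs)
    # A float is a proportion of the documents; an int is an absolute count.
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    df = document_frequency(docs)
    return sorted(term for term, count in df.items() if low <= count <= high)


docs = [
    "the children are playing in the hall",
    "the hall has a lot of space",
    "lots of children like playing in an open space",
]

common = filter_by_df(docs, min_df=2)            # terms found in at least 2 documents
rare = filter_by_df(docs, min_df=1, max_df=1)    # terms found in exactly 1 document
print(common)
print(rare)
```

Note that "the" occurs twice in the first sentence but still counts once toward its document frequency, which is the distinction between document frequency and raw term count.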

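# Extract the document term matrix
# Both thresholds are ints, so a word is kept only if it occurs in at
# least 7 and at most 20 chunks; with 7 chunks, that means every chunk
count_vectorizer = CountVectorizer(min_df=7, max_df=20)
document_term_matrix = count_vectorizer.fit_transform([chunk['text'] for chunk in chunks])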

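Extract the vocabulary and display it. The vocabulary is the list of distinct words that survived the filtering in the previous step.

# Extract the vocabulary and display it
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
vocabulary = np.array(count_vectorizer.get_feature_names_out())
print("\nVocabulary:\n", vocabulary)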

Generate display names for the chunks:

# Generate names for chunks
chunk_names = []
for i in range(len(text_chunks)):
    chunk_names.append('Chunk-' + str(i + 1))

Print the document-term matrix:

# Print the document term matrix
print("\nDocument term matrix:")
formatted_text = '{:>12}' * (len(chunk_names) + 1)
print('\n', formatted_text.format('Word', *chunk_names), '\n')
for word, item in zip(vocabulary, document_term_matrix.T):
    # 'item' is a sparse row; convert it to a dense array so zero counts keep their column
    output = [word] + [str(freq) for freq in item.toarray().ravel()]
    print(formatted_text.format(*output))
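The table layout relies on repeating the '{:>12}' format field once per column. The sketch below shows the same trick on a small hand-written table; the words and counts are made-up illustration values, not output from the Brown corpus.

```python
# Made-up example data: one row per word, one column per chunk
vocabulary = ['and', 'of', 'the']
chunk_names = ['Chunk-1', 'Chunk-2']
counts = [[3, 0], [5, 2], [9, 7]]

# One right-aligned, 12-character field for the word plus one per chunk
formatted_text = '{:>12}' * (len(chunk_names) + 1)

header = formatted_text.format('Word', *chunk_names)
rows = [formatted_text.format(word, *[str(c) for c in row])
        for word, row in zip(vocabulary, counts)]

print(header)
for row in rows:
    print(row)
```

Every row must supply exactly one value per column, including zeros. If a row is built only from the nonzero entries of a sparse matrix, it comes up short and `format` raises an IndexError.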

The complete code:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import brown


# Split the input text into chunks of N words each
def chunker(input_data, N):
    input_words = input_data.split(' ')
    output = []
    cur_chunk = []
    count = 0
    for word in input_words:
        cur_chunk.append(word)
        count += 1
        if count == N:
            output.append(' '.join(cur_chunk))
            count, cur_chunk = 0, []

    # Keep the leftover words, if any, as a final shorter chunk
    if cur_chunk:
        output.append(' '.join(cur_chunk))

    return output


# Read the data from the Brown corpus
input_data = ' '.join(brown.words()[:5400])

# Number of words in each chunk
chunk_size = 800

text_chunks = chunker(input_data, chunk_size)

# Convert to dict items
chunks = []
for count, chunk in enumerate(text_chunks):
    d = {'index': count, 'text': chunk}
    chunks.append(d)

# Extract the document term matrix
# Both thresholds are ints, so a word is kept only if it occurs in at
# least 7 and at most 20 chunks; with 7 chunks, that means every chunk
count_vectorizer = CountVectorizer(min_df=7, max_df=20)
document_term_matrix = count_vectorizer.fit_transform([chunk['text'] for chunk in chunks])

# Extract the vocabulary and display it
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
vocabulary = np.array(count_vectorizer.get_feature_names_out())
print("\nVocabulary:\n", vocabulary)

# Generate names for chunks
chunk_names = []
for i in range(len(text_chunks)):
    chunk_names.append('Chunk-' + str(i + 1))

# Print the document term matrix
print("\nDocument term matrix:")
formatted_text = '{:>12}' * (len(chunk_names) + 1)
print('\n', formatted_text.format('Word', *chunk_names), '\n')
for word, item in zip(vocabulary, document_term_matrix.T):
    # 'item' is a sparse row; convert it to a dense array so zero counts keep their column
    output = [word] + [str(freq) for freq in item.toarray().ravel()]
    print(formatted_text.format(*output))

The output shows the full document-term matrix, with the count of each word in each chunk.
