Building an Autocomplete System with N-grams
An n-gram language model computes the probability of a sentence — informally, how likely it is that the sentence is something a person would actually say. The "n" refers to how the sentence is split up: the words are grouped n at a time.
How do we compute the probability of a sentence? From conditional probability and the chain rule:
P(B|A) = P(A,B) / P(A)  ==>  P(A,B) = P(A) P(B|A)
so P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
If the sentence is long, this expression gets very long and the computation becomes complicated. To simplify it, we introduce the Markov assumption: the probability of any word depends only on a limited number of the words immediately before it.
In the example above, if a word's probability depends only on the single preceding word, then
P(A,B,C,D) ≈ P(A) P(B|A) P(C|B) P(D|C)
In general, if the probability of a word depends only on the preceding k words (the n in "n-gram"), then the probability of a sentence w_1 ... w_m is approximately
P(w_1, ..., w_m) ≈ ∏_{i=1..m} P(w_i | w_{i-k}, ..., w_{i-1})
The following is a translation; the original notebook is at
https://github.com/tsuirak/deeplearning.ai/blob/master/Natural%20Language%20Processing/Course%202%20-%20Probabilistic%20Models/Labs/Week%203/C2-W3-assginment-Auto%20Complete.ipynb
In this article, you will build an autocomplete system. Autocomplete systems are something you see every day:
- When you search on Google, you often get suggestions that help you complete your query.
- When you write an email, you get suggestions for finishing the sentence you are typing.
By the end of this assignment, you will have developed a prototype of such a system.
Outline
1. Load and process the data
1.1 Load the data
1.2 Preprocess the data
2. Develop an n-gram based language model
3. Perplexity
4. Build an autocomplete system
A key building block of an autocomplete system is a language model. A language model assigns a probability to a sequence of words; in other words, the most plausible word sequences receive the highest scores.
We would like "I have a pen" to receive a higher probability than "I am a pen", because the former is the more natural sentence in practice. You can use these probability estimates to build an autocomplete system. For example, if the user types:
"I eat scrambled", you can look for a word x that makes "I eat scrambled x" most probable. If x = "eggs", the completed sentence is "I eat scrambled eggs" (a tiny sketch of this idea appears below). Many kinds of language models have been developed; in this assignment we use N-grams, a simple but powerful approach.
- N-grams are also used in machine translation and speech recognition.
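Choosing the completion this way is just an argmax over the vocabulary. A minimal sketch, assuming some sentence-scoring function sentence_probability supplied by a language model (both names here are hypothetical, not from the notebook):

def complete_sentence(prefix_tokens, vocabulary, sentence_probability):
    """Return the word x that makes prefix + [x] most probable."""
    return max(vocabulary, key=lambda x: sentence_probability(prefix_tokens + [x]))

The rest of the assignment builds the n-gram machinery that provides exactly this kind of scoring.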
Here are the steps of this assignment:
1. Load and preprocess the data:
- Load the data and tokenize it.
- Split the sentences into a training set and a test set.
- Replace low-frequency words with the token <unk>.
2. Develop an n-gram based language model:
- Count the n-grams in a given data set.
- Estimate the conditional probability of the next word, using k-smoothing.
3. Evaluate the N-gram model by computing its perplexity score.
4. Use your model to suggest the next word of a sentence.
First, import the required libraries:
import math
import random
import numpy as np
import nltk
import pandas as pd
nltk.data.path.append('.')

Part 1: Load and preprocess the data
Part 1.1: Load the data
You will use Twitter data. Run the cell below to load it and look at the first few sentences.
Note that the data is one long string containing many tweets, with a newline character "\n" separating them.
with open("en_US.twitter.txt","r") as f:data = f.read()print("Data type:", type(data)) print("Number of letters:", len(data)) print("First 300 letters of the data") print("-------") display(data[0:300]) print("-------")print("Last 300 letters of the data") print("-------") display(data[-300:]) print("-------")輸出:
Data type: <class 'str'>
Number of letters: 3335477
First 300 letters of the data
-------
"How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.\nWhen you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.\nthey've decided its more fun if I don't.\nSo Tired D; Played Lazer Tag & Ran A "
-------
Last 300 letters of the data
-------
"ust had one a few weeks back....hopefully we will be back soon! wish you the best yo\nColombia is with an 'o'...“: We now ship to 4 countries in South America (fist pump). Please welcome Columbia to the Stunner Family”\n#GutsiestMovesYouCanMake Giving a cat a bath.\nCoffee after 5 was a TERRIBLE idea.\n"
-------
Part 1.2: Preprocess the data
Preprocess the data with the following steps:
- Split the data into sentences using "\n" as the delimiter.
- Split each sentence into tokens.
- Split the sentences into a training set and a test set.
- Replace low-frequency words with the token <unk>.
Note: in this exercise we drop the validation set.
- In a real application, we would hold out part of the data as a validation set and use it to tune the training.
- For simplicity, that step is skipped here.
Exercise 01
Split the data into sentences.
def split_to_sentences(data):
    """Split data by linebreak "\n"

    Args:
        data: str
    Returns:
        A list of sentences
    """
    sentences = data.split('\n')

    # Additional cleaning (this part is already implemented)
    # - Remove leading and trailing spaces from each sentence
    # - Drop sentences that are empty strings
    sentences = [s.strip() for s in sentences]
    sentences = [s for s in sentences if len(s) > 0]

    return sentences

Exercise 02
The next step is to tokenize the sentences (split each sentence into a list of words).
- Convert all words to lowercase, so that capitalized and lowercase forms (e.g. "Big" and "big") are treated as the same word.
- Append each sentence's list of tokens to one overall list (a sketch is shown below).
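The notebook's own solution is not reproduced in this post; a minimal sketch, assuming NLTK's word_tokenize (nltk is already imported above), might look like this:

def tokenize_sentences(sentences):
    """Lowercase each sentence and split it into tokens.

    Args:
        sentences: list of strings
    Returns:
        list of lists of tokens
    """
    tokenized_sentences = []
    for sentence in sentences:
        sentence = sentence.lower()                # "Big" and "big" become the same word
        tokenized = nltk.word_tokenize(sentence)   # split the sentence into word tokens
        tokenized_sentences.append(tokenized)
    return tokenized_sentences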
Exercise 03
Use the two functions defined above to get the tokenized data:
- Split the data into sentences
- Tokenize those sentences (a combined helper is sketched below)
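A sketch of the combined helper, assuming the name get_tokenized_data used by the code that follows:

def get_tokenized_data(data):
    """Split the raw string into sentences, then tokenize each sentence."""
    sentences = split_to_sentences(data)
    tokenized_sentences = tokenize_sentences(sentences)
    return tokenized_sentences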
Split the data into training and test sets.
tokenized_data = get_tokenized_data(data)
random.seed(87)
random.shuffle(tokenized_data)

train_size = int(len(tokenized_data) * 0.8)
train_data = tokenized_data[0:train_size]
test_data = tokenized_data[train_size:]

Exercise 04
Not every word in the data will be used for training — only the more frequent ones.
- You will only keep words that appear at least N times in the data set.
- So first you need to count how many times each word appears.
You will need two nested loops: one over the sentences, and one over the words within each sentence.
def count_words(tokenized_sentences):
    """Count the number of word appearances in the tokenized sentences

    Args:
        tokenized_sentences: List of lists of strings
    Returns:
        dict that maps word (str) to its frequency (int)
    """
    word_counts = {}

    # Loop through the sentences
    for sentence in tokenized_sentences:
        for token in sentence:
            # If the token is not in the dictionary yet, set the count to 1
            if token not in word_counts.keys():
                word_counts[token] = 1
            # If the token is already in the dictionary, increment the count by 1
            else:
                word_counts[token] += 1

    return word_counts

Handling 'Out of Vocabulary' words
If your autocomplete model encounters a word it never saw during training, it cannot suggest the next word: there are no counts for the current (unseen) word, so the model cannot make a prediction.
- Such a new word is called an 'unknown word', or an out of vocabulary (OOV) word.
- The percentage of unknown words in the test set is called the OOV rate.
To handle these unknown words during prediction, we represent them with a special token, "<unk>".
Modify the training data so that the model also trains on some 'unknown' words:
- Convert the low-frequency words in the data into "unknown" words.
- Create a list of the most frequent words in the training set, called the closed vocabulary.
- Convert every word that is not in the closed vocabulary into the token <unk>.
Exercise 05
Create a function that takes the tokenized sentences and a count threshold, 'count_threshold'.
- Any word that appears at least 'count_threshold' times is kept in the closed vocabulary (a sketch is given below).
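A sketch of this closed-vocabulary helper; the name and exact signature are assumptions, but the body follows directly from count_words above:

def get_words_with_nplus_frequency(tokenized_sentences, count_threshold):
    """Return the words that appear at least 'count_threshold' times."""
    closed_vocab = []
    word_counts = count_words(tokenized_sentences)
    for word, cnt in word_counts.items():
        if cnt >= count_threshold:
            closed_vocab.append(word)
    return closed_vocab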
Exercise 06
- Every word that is not in the closed vocabulary is treated as 'unknown'.
- Represent each 'unknown' word with the token "<unk>" (sketch below).
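A sketch of the replacement step, under the same caveat that the notebook's own solution may differ in details:

def replace_oov_words_by_unk(tokenized_sentences, vocabulary, unknown_token="<unk>"):
    """Replace every word that is not in the vocabulary with the token "<unk>"."""
    vocabulary = set(vocabulary)   # membership tests on a set are O(1)
    replaced_tokenized_sentences = []
    for sentence in tokenized_sentences:
        replaced_sentence = [token if token in vocabulary else unknown_token
                             for token in sentence]
        replaced_tokenized_sentences.append(replaced_sentence)
    return replaced_tokenized_sentences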
Exercise 07
We are now ready to preprocess the data by combining the functions implemented above; a sketch of the combined step follows.
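A sketch of the combined preprocessing, assuming the variable names train_data_processed and vocabulary that the later code uses (the threshold value 2 is just an example):

def preprocess_data(train_data, test_data, count_threshold):
    """Build the closed vocabulary from the training data and
    replace out-of-vocabulary words with "<unk>" in both splits."""
    vocabulary = get_words_with_nplus_frequency(train_data, count_threshold)
    train_data_replaced = replace_oov_words_by_unk(train_data, vocabulary)
    test_data_replaced = replace_oov_words_by_unk(test_data, vocabulary)
    return train_data_replaced, test_data_replaced, vocabulary

train_data_processed, test_data_processed, vocabulary = preprocess_data(train_data, test_data, count_threshold=2)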
Part 2: Develop an n-gram based language model
In this section, you will develop the n-gram language model.
- Assume the probability of the current word depends only on the previous n-gram.
- The previous n-gram is the sequence of the n preceding words.
The conditional probability of the word at position t in the sentence, given the preceding words w_{t-1}, w_{t-2}, ..., w_{t-n}, is:
P(w_t | w_{t-1}, ..., w_{t-n})
You can estimate this probability by counting how often these word sequences appear in the training data.
In this estimate, the numerator is the number of times the word at position t (w_t) appears right after the sequence w_{t-n}, ..., w_{t-1}; the denominator is the number of times the sequence w_{t-n}, ..., w_{t-1} appears.
If a count is zero (in either the numerator or the denominator), the estimate can be fixed by adding k-smoothing to the formula.
According to this estimate, the denominator requires counting sequences of n words, and the numerator requires counting sequences of n+1 words.
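In symbols, writing C(·) for a count over the training set, the estimate described above is
P(w_t | w_{t-n} ... w_{t-1}) ≈ C(w_{t-n} ... w_{t-1}, w_t) / C(w_{t-n} ... w_{t-1})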
Exercise 08
Below, you will write a function that counts n-grams for any value of n.
Before counting, prepend n "<s>" tokens to each sentence to mark the start of the sentence — for example, with n = 2 the sentence "I like food" becomes "<s> <s> I like food". Also append an "<e>" token to mark the end of the sentence.
Technical note: in this function, you will use a dictionary to store the counts.
- The dictionary keys are tuples of n words (not lists).
- The dictionary values are the counts.
- The reason for using a tuple rather than a list as the key is that a list is a mutable object in Python (it can be modified), whereas a tuple is immutable — once created it cannot be changed. (A sketch of the counting function follows.)
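Exercise 08's solution is not reproduced in this post. A sketch consistent with how count_n_grams is called later (n start tokens "<s>", one end token "<e>"), under the caveat that the notebook's own code may differ slightly:

def count_n_grams(data, n, start_token='<s>', end_token='<e>'):
    """Count all n-grams in the tokenized sentences.

    Args:
        data: list of lists of words
        n: number of words in each n-gram
    Returns:
        dict mapping a tuple of n words to its count
    """
    n_grams = {}
    for sentence in data:
        # prepend n start tokens and append one end token
        sentence = [start_token] * n + sentence + [end_token]
        sentence = tuple(sentence)   # tuples can be dictionary keys
        # slide a window of length n over the sentence
        for i in range(len(sentence) - n + 1):
            n_gram = sentence[i:i + n]
            n_grams[n_gram] = n_grams.get(n_gram, 0) + 1
    return n_grams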
Exercise 09
Next, estimate the probability of a word given the n preceding words.
In the estimate above, if an n-gram never occurred in the training data, the denominator is 0 and the formula breaks down. To handle such zero counts, we add k-smoothing:
add a constant k to the numerator and k|V| to the denominator, where |V| is the vocabulary size. Any n-gram with a zero count then gets probability 1/|V|.
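With k-smoothing, the estimate becomes
P(w_t | w_{t-n} ... w_{t-1}) ≈ ( C(w_{t-n} ... w_{t-1}, w_t) + k ) / ( C(w_{t-n} ... w_{t-1}) + k·|V| )
which is exactly what estimate_probability below computes.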
def estimate_probability(word, previous_n_gram,
                         n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    """Estimate the probability of a next word using the n-gram counts with k-smoothing

    Args:
        word: next word
        previous_n_gram: A sequence of words of length n
        n_gram_counts: Dictionary of counts of n-grams
        n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
        vocabulary_size: number of words in the vocabulary
        k: positive constant, smoothing parameter
    Returns:
        A probability
    """
    # Convert list to tuple to use it as a dictionary key
    previous_n_gram = tuple(previous_n_gram)

    # Set the denominator:
    # if the previous n-gram exists in the dictionary of n-gram counts,
    # get its count; otherwise set the count to zero
    previous_n_gram_count = n_gram_counts[previous_n_gram] if previous_n_gram in n_gram_counts else 0

    # Calculate the denominator using the count of the previous n-gram
    # and apply k-smoothing
    denominator = previous_n_gram_count + k * vocabulary_size

    # Define the (n+1)-gram as the previous n-gram plus the current word, as a tuple
    n_plus1_gram = previous_n_gram + (word,)

    # Set the count to the count in the dictionary, otherwise 0 if not in the dictionary
    n_plus1_gram_count = n_plus1_gram_counts[n_plus1_gram] if n_plus1_gram in n_plus1_gram_counts else 0

    # Define the numerator using the count of the n-gram plus the current word,
    # and apply smoothing
    numerator = n_plus1_gram_count + k

    # Calculate the probability as the numerator divided by the denominator
    probability = numerator / denominator

    return probability

Computing the probabilities of all words
The function below loops over every word in the vocabulary and computes its probability as the next word.
def estimate_probabilities(previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0):
    """Estimate the probabilities of next words using the n-gram counts with k-smoothing

    Args:
        previous_n_gram: A sequence of words of length n
        n_gram_counts: Dictionary of counts of n-grams
        n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
        vocabulary: List of words
        k: positive constant, smoothing parameter
    Returns:
        A dictionary mapping from next words to their probability.
    """
    # Convert list to tuple to use it as a dictionary key
    previous_n_gram = tuple(previous_n_gram)

    # Add <e> and <unk> to the vocabulary;
    # <s> is not needed since it should not appear as the next word
    vocabulary = vocabulary + ['<e>', '<unk>']
    vocabulary_size = len(vocabulary)

    probabilities = {}
    for word in vocabulary:
        probability = estimate_probability(word, previous_n_gram,
                                           n_gram_counts, n_plus1_gram_counts,
                                           vocabulary_size, k=k)
        probabilities[word] = probability

    return probabilities

Count and probability matrices
Using the functions defined above, we can now build the count matrix and the probability matrix.
def make_count_matrix(n_plus1_gram_counts, vocabulary):
    # Add <e> and <unk> to the vocabulary;
    # <s> is omitted since it should not appear as the next word
    vocabulary = vocabulary + ["<e>", "<unk>"]

    # Obtain the unique n-grams
    n_grams = []
    for n_plus1_gram in n_plus1_gram_counts.keys():
        n_gram = n_plus1_gram[0:-1]
        n_grams.append(n_gram)
    n_grams = list(set(n_grams))

    # Mapping from n-gram to row
    row_index = {n_gram: i for i, n_gram in enumerate(n_grams)}
    # Mapping from next word to column
    col_index = {word: j for j, word in enumerate(vocabulary)}

    nrow = len(n_grams)
    ncol = len(vocabulary)
    count_matrix = np.zeros((nrow, ncol))
    for n_plus1_gram, count in n_plus1_gram_counts.items():
        n_gram = n_plus1_gram[0:-1]
        word = n_plus1_gram[-1]
        if word not in vocabulary:
            continue
        i = row_index[n_gram]
        j = col_index[word]
        count_matrix[i, j] = count

    count_matrix = pd.DataFrame(count_matrix, index=n_grams, columns=vocabulary)
    return count_matrix
sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
print('\ntrigram counts')
trigram_counts = count_n_grams(sentences, 3)
display(make_count_matrix(trigram_counts, unique_words))

which displays the trigram count matrix.
Compute the probability matrix:
def make_probability_matrix(n_plus1_gram_counts, vocabulary, k):
    count_matrix = make_count_matrix(n_plus1_gram_counts, vocabulary)
    count_matrix += k
    prob_matrix = count_matrix.div(count_matrix.sum(axis=1), axis=0)
    return prob_matrix

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
print("trigram probabilities")
trigram_counts = count_n_grams(sentences, 3)
display(make_probability_matrix(trigram_counts, unique_words, k=1))

which displays the trigram probability matrix.
Part 3: Perplexity
Perplexity measures how good a model is: the lower the perplexity, the better the model.
In this section, you will compute perplexity on the test set.
The perplexity of a sentence W = w_1 ... w_N under an n-gram model is
PP(W) = ( ∏_{t=n+1..N} 1 / P(w_t | w_{t-n} ... w_{t-1}) )^(1/N)
where:
- N is the length of the sentence (the number of tokens, after adding the start and end tokens)
- n is the number of words in the n-gram (e.g. 2 for a bigram)
- in the formula, word positions are numbered starting at 1, not 0
In code, array indexing starts at 0, so the loop below runs t over range(n, N), i.e. from n to N-1.
The higher the estimated probabilities, the lower the perplexity; for example, a model that assigns every word the uniform probability 1/|V| has perplexity |V|.
The more information the n-grams give us about the sentence, the lower the perplexity.
Exercise 10
def calculate_perplexity(sentence, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    """Calculate perplexity for a single tokenized sentence

    Args:
        sentence: List of strings
        n_gram_counts: Dictionary of counts of n-grams
        n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
        vocabulary_size: number of unique words in the vocabulary
        k: Positive smoothing constant
    Returns:
        Perplexity score
    """
    # length of previous words
    n = len(list(n_gram_counts.keys())[0])

    # prepend <s> and append <e>
    sentence = ["<s>"] * n + sentence + ["<e>"]

    # Cast the sentence from a list to a tuple
    sentence = tuple(sentence)

    # length of sentence (after adding <s> and <e> tokens)
    N = len(sentence)

    # The variable product_pi will hold the product
    # that is calculated inside the Nth root
    product_pi = 1.0

    # Index t ranges from n to N - 1
    for t in range(n, N):
        # get the n-gram preceding the word at position t
        n_gram = sentence[t - n:t]

        # get the word at position t
        word = sentence[t]

        # Estimate the probability of the word given the n-gram,
        # using the n-gram counts, (n+1)-gram counts,
        # vocabulary size, and smoothing constant
        probability = estimate_probability(word, n_gram,
                                           n_gram_counts, n_plus1_gram_counts,
                                           vocabulary_size, k=k)

        # 'product_pi' is a cumulative product
        # of the (1/P) factors calculated in the loop
        product_pi *= 1 / probability

    # Take the Nth root of the product
    perplexity = product_pi ** (1 / float(N))

    return perplexity

Part 4: Build an autocomplete system
In this section, you will use the functions developed earlier to build an autocomplete system.
The function below also takes an optional start_with argument, which specifies the first few letters of the next word.
def suggest_a_word(previous_tokens, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0, start_with=None):
    """Get a suggestion for the next word

    Args:
        previous_tokens: The sentence you input where each token is a word. Must have length > n
        n_gram_counts: Dictionary of counts of n-grams
        n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
        vocabulary: List of words
        k: positive constant, smoothing parameter
        start_with: If not None, specifies the first few letters of the next word
    Returns:
        A tuple of
        - string of the most likely next word
        - corresponding probability
    """
    # length of previous words
    n = len(list(n_gram_counts.keys())[0])

    # From the words that the user already typed,
    # get the most recent n words as the previous n-gram
    previous_n_gram = previous_tokens[-n:]

    # Estimate the probability that each word in the vocabulary is the next word,
    # given the previous n-gram, the dictionaries of n-gram and (n+1)-gram counts,
    # and the smoothing constant
    probabilities = estimate_probabilities(previous_n_gram,
                                           n_gram_counts, n_plus1_gram_counts,
                                           vocabulary, k=k)

    # Initialize the suggested word to None;
    # this will be set to the word with the highest probability
    suggestion = None

    # Initialize the highest word probability to 0;
    # this will be set to the highest probability of all candidate words
    max_prob = 0

    # For each word and its probability in the probabilities dictionary:
    for word, prob in probabilities.items():
        # If the optional start_with string is set
        if start_with is not None:
            # Skip this word if it does not start with the letters in 'start_with'
            if not word.startswith(start_with):
                continue

        # Check if this word's probability is greater than the current maximum
        if prob > max_prob:
            # If so, save this word as the best suggestion (so far)
            suggestion = word
            # Save the new maximum probability
            max_prob = prob

    return suggestion, max_prob

A quick test:
sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)

previous_tokens = ["i", "like"]
tmp_suggest1 = suggest_a_word(previous_tokens, unigram_counts, bigram_counts, unique_words, k=1.0)
print(f"The previous words are 'i like',\n\tand the suggested word is `{tmp_suggest1[0]}` with a probability of {tmp_suggest1[1]:.4f}")
print()

# test your code when setting start_with
tmp_starts_with = 'c'
tmp_suggest2 = suggest_a_word(previous_tokens, unigram_counts, bigram_counts, unique_words, k=1.0, start_with=tmp_starts_with)
print(f"The previous words are 'i like', the suggestion must start with `{tmp_starts_with}`\n\tand the suggested word is `{tmp_suggest2[0]}` with a probability of {tmp_suggest2[1]:.4f}")

Output:
The previous words are 'i like',
and the suggested word is `a` with a probability of 0.2727
The previous words are 'i like', the suggestion must start with `c`
and the suggested word is `cat` with a probability of 0.0909
Getting multiple suggestions
def get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0, start_with=None):
    model_counts = len(n_gram_counts_list)
    suggestions = []
    for i in range(model_counts - 1):
        n_gram_counts = n_gram_counts_list[i]
        n_plus1_gram_counts = n_gram_counts_list[i + 1]

        suggestion = suggest_a_word(previous_tokens, n_gram_counts,
                                    n_plus1_gram_counts, vocabulary,
                                    k=k, start_with=start_with)
        suggestions.append(suggestion)
    return suggestions

Getting multiple suggestions with n-grams of varying length
Congratulations! You have developed all the building blocks needed for your own autocomplete system.
Let's see the model work with n-grams of varying length (unigrams, bigrams, trigrams, 4-grams, 5-grams).
n_gram_counts_list = []
for n in range(1, 6):
    print("Computing n-gram counts with n =", n, "...")
    n_model_counts = count_n_grams(train_data_processed, n)
    n_gram_counts_list.append(n_model_counts)

previous_tokens = ["i", "am", "to"]
tmp_suggest4 = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0)

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest4)

Output:
The previous words are ['i', 'am', 'to'], the suggestions are:
[('be', 0.027665685098338604), ('have', 0.00013487086115044844), ('have', 0.00013490725126475548), ('i', 6.746272684341901e-05)]
Similarly, for previous_tokens = ["hey", "how", "are", "you"], the output is:
The previous words are ['hey', 'how', 'are', 'you'], the suggestions are:
[("'re", 0.023973994311255586), ('?', 0.002888465830762161), ('?', 0.0016134453781512605), ('<e>', 0.00013491635186184566)]