當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

MIT自然语言处理第三讲：概率语言模型（第四、五、六部分）

發布時間：2025/3/21 编程问答 11 豆豆

生活随笔收集整理的這篇文章主要介紹了 MIT自然语言处理第三讲：概率语言模型（第四、五、六部分）小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

MIT自然語言處理第三講：概率語言模型（第四部分）

自然語言處理：概率語言模型

Natural Language Processing: Probabilistic Language Modeling
作者：Regina Barzilay（MIT,EECS Department, November 15, 2004)
譯者：我愛自然語言處理（www.52nlp.cn ，2009年1月20日）

四、平滑算法
a) 最大似然估計（Maximum Likelihood Estimate）
　i. MLE使訓練數據盡可能的“大”（MLE makes training data as probable as possible）：
　　　P_{ML}(w_{i}/{w_{i-1},w_{i-2}}) = {Count(w_{i-2},w_{i-1},w_{i})}/{Count(w_{i-2},w_{i-1})}
　　1. 對于詞匯規模為N的語料庫，我們將要在模型中得到N^{3}的參數（For vocabulary of size N, we will have N3 parameters in the model）
　　2. 對于N=1000，我們將要估計1000^{3}=10^{9}個參數（For N =1, 000, we have to estimate1, 000^{3}=10^{9}?parameters）
　　3. 問題（Problem）: 如何處理未登錄詞（how to deal with unseen words）?
　ii. 數據稀疏問題（Sparsity）
　　1. 未知事件的總計概率構成了測試數據的很大一部分（The aggregate probability of unseen events constitutes a large fraction of the test data）
　　2. Brown et al (1992): 考慮一個3.5億詞的英語語料庫，14%的三元詞是未知的（considered a 350 million word corpus of English, 14% of trigrams are unseen）
　iii. 注：關于MLE的簡要補充
　　1. 最大似然估計是一種統計方法，它用來求一個樣本集的相關概率密度函數的參數。這個方法最早是遺傳學家以及統計學家羅納德?費雪爵士在1912年至1922年間開始使用的。
　　2. “似然”是對likelihood 的一種較為貼近文言文的翻譯，“似然”用現代的中文來說即“可能性”。故而，若稱之為“最大可能性估計”則更加通俗易懂。
　　3.MLE選擇的參數使訓練語料具有最高的概率，它沒有浪費任何概率在訓練語料中沒有出現的事件中
　　4.但是MLE概率模型通常不適合做NLP的統計語言模型，會出現0概率，這是不允許的。
b) 如何估計未知元素的概率（How to estimate probability of unseen elements）?
　i. 打折（Discounting）
　　1. Laplace加1平滑（Laplace）
　　2. Good-Turing打折法（Good-Turing）
　ii. 線性插值法（Linear Interpolation）
　iii. Katz回退（Katz Back-Off）
c) 加一(Laplace)平滑（Add-One (Laplace) Smoothing）
　i. 最簡單的打折方法（Simplest discounting technique）:
　　　{P(w_{i}/w_{i-1})} = {C(w_{i-1},w_{i})+1}/{C(w_{i-1})+V}
　　這里Ｖ是詞匯表的數目——語料庫的“型”（where |ν| is a vocabulary size）
　　注：MIT課件這里似乎有誤，我已修改
　ii. 貝葉斯估計假設事件發生前是一個均勻分布（Bayesian estimator assuming a uniform unit prior on events）
　iii. 問題（Problem）: 對于未知事件占去的概率太多了（Too much probability mass to unseen events）
　iv. 例子（Example）：
　　假設V=10000(詞型)，S=1000000(詞例)（Assume |ν| =10, 000, and S=1, 000, 000）：
　　　P_{MLE}(ball/{kike~a}) = {{Count(kike~a~ball)}/{Count(kick~a)}} = 9/10 = 0.9
　　　P_{+1}(ball/{kike~a}) = {{Count(kike~a~ball)+1}/{Count(kick~a)+V}} = {9+1}/{10+10000} = 9*10^{-4}
　v. Laplace的缺點（Weaknesses of Laplace）
　　1. 對于稀疏分布，Laplace法則賦予未知事件太多的概率空間（For Sparse distribution, Laplace’s Law gives too much of the probability space to unseen events）
　　2. 在預測二元語法的實際概率時與其他平滑方法相比顯得非常差（Worst at predicting the actual probabilities of bigrams than other methods）
　　3. 使用加epsilon平滑更合理一些（More reasonable to use add-epsilonsmoothing (Lidstone’s Law)）

未完待續:第五部分

附：課程及課件pdf下載MIT英文網頁地址：
　　　http://people.csail.mit.edu/regina/6881/

注：本文遵照麻省理工學院開放式課程創作共享規范翻譯發布，轉載請注明出處“我愛自然語言處理”：www.52nlp.cn

from：http://www.52nlp.cn/mit-nlp-third-lesson-probabilistic-language-modeling-fourth-part/

MIT自然語言處理第三講：概率語言模型（第五部分）

自然語言處理：概率語言模型
Natural Language Processing: Probabilistic Language Modeling
作者：Regina Barzilay（MIT,EECS Department, November 15, 2004)
譯者：我愛自然語言處理（www.52nlp.cn ，2009年2月10日）

五、 Good-Turing打折法（Good-Turing Discounting）
a) 你在將來看到一個新詞的可能性有多大？用所看到的事件去估計未知事件的概率（How likely are you to see a new word type in the future? Use things you’ve seen once to estimate the probability of unseen things）
　i.?n_r——頻率為r的元素（n元語法）計數并且r>0（number of elements with r frequency and r>0）
　ii.?n_0——總詞匯規模減去觀察到的詞匯規模，既出現次數為0的n元語法（size of the total lexicon minus the size of observed lexicon）
　iii. 對于頻率為r的元素，修正計數為（Modified count for elements with frequency r）：
　　　　　　　　r^* = (r+1)*{n_{r+1}/n_r}
b) 關于Good-Turing打折法的補充說明：
　i. Good(1953)首先描述了Good-Turing算法，而這種算法的原創思想則來自Turing 。
　ii. Good-Turing平滑的基本思想是：用觀察較高的N元語法數的方法來重新估計概率量的大小，并把它指派給那些具有零計數或較低計數的N元語法。
c) 直觀的Good-Turing打折法（Good-Turing Discounting: Intuition）
　i. 目的（Goal）: 估計訓練數據中計數為r的單詞在同樣規模測試集中的出現頻率（estimate how often word with r counts in training data occurs in test set of equal size）。
　ii. 我們使用刪除估計（We use deleted estimation）：
　　1. 每次刪除一個單詞（delete one word at a time）
　　2. 如果單詞“test”在所有的數據集中出現了r+1次（if “test” word occurs r +1 times in complete data set）：
　　——它在訓練集中出現了r 次（it occurs r times in “training” set）
　　——對計數為r的單詞加1（add one count to words with r counts）
　iii. r-count單詞“桶”中的總的計數為（total count placed to bucket for r-count words is）:
　　　　　　　　　n_{r+1}*(r +1)
　iv. 平均計數為：
　　　　　　(avg-count of r count words) = {n_{r+1}*(r+1)}/n_r?
d) Good-Turing打折法續（Good-Turing Discounting (cont.)）：
　i. 在Good-Turing中，分配給所有未知事件的總的概率等于n_1/N, 其中N是訓練集的規模。它與分配給獨立事件的相對頻率公式相似。
　ii. In Good-Turing, the total probability assigned to all the unobserved events is equal ton_1/N , where N is the size of the training set. It is the same as a relative frequency formula would assign to singleton events.
e) 舉例（Example: Good-Turing）
Training sample of 22,000,000 (Church&Gale’1991))
r 　　　N_r　　　　　　　heldout　　r^*
0 　　74,671,100,000　0.00027　0.00027
1 　　2,018,046　　　　0.448　　0.446
2 　　449,721　　　　　1.25　　　1.26
3 　　188,933　　　　　2.24　　　2.24
4 　　105,668　　　　　3.23　　　3.24
5 　　68,379　　　　　 4.21　　　4.22
6 　　48,190　　　　　 5.23　　　5.19
f) 補充說明：
　i. 根據Zipf定律,對于小的r,?N_r比較大;對于大的r,N_r小,對于出現次數最多的n元組,r*=0!
　ii. 所以對于出現次數很多的n元組, GT估計不準,而MLE估計比較準,因此可以直接采用MLE. GT估計一般適用于出現次數為k(k<10)的n元組　iii. 如果這樣,考慮”劫富濟貧”,這里的”富”就變成了”中產”階級!呵呵,真正的富翁沾光了!（雖然富翁損一點也沒什么）連打折法也不敢欺富人！這就是“為富不仁”，“一毛不拔”的來歷。未完待續：第六部分

附：課程及課件pdf下載MIT英文網頁地址：
　　　http://people.csail.mit.edu/regina/6881/

注：本文遵照麻省理工學院開放式課程創作共享規范翻譯發布，轉載請注明出處“我愛自然語言處理”：www.52nlp.cn

from：http://www.52nlp.cn/mit-nlp-third-lesson-probabilistic-language-modeling-fifth-part/

MIT自然語言處理第三講：概率語言模型（第六部分）

六、插值及回退
a) The Bias-Variance Trade-Off
　i. 未平滑的三元模型估計(Unsmoothed trigram estimate)：　　　　　　
　　P_ML({w_i}/{w_{i-2},w_{i-1}})={Count(w_{i-2}w_{i-1}w_{i})}/{Count(w_{i-2},w_{i-1})}
　ii. 未平滑的二元模型估計(Unsmoothed bigram estimate）：
　　　P_ML({w_i}/{w_{i-1}})={Count(w_{i-1}w_{i})}/{Count(w_{i-1})}
　iii. 未平滑的一元模型估計(Unsmoothed unigram estimate)：
　　　P_ML({w_i})={Count(w_{i})}/sum{j}{}{Count(w_{j})}
　iv. 這些不同的估計中哪個和“真實”的P({w_i}/{w_{i-2},w_{i-1}})概率最接近（How close are these different estimates to the “true” probability?P({w_i}/{w_{i-2},w_{i-1}}))?
b) 插值（Interpolation）
　i. 一種解決三元模型數據稀疏問題的方法是在模型中混合使用受數據稀疏影響較小的二元模型和一元模型（One way of solving the sparseness in a trigram model is to mix that model with bigram and unigram models that suffer less from data sparseness）。
　ii. 權值可以使用期望最大化算法（EM）或其它數值優化技術設置（The weights can be set using the Expectation-Maximization Algorithm or another numerical optimization technique）
　iii. 線性插值（Linear Interpolation)
　　hat{P}({w_i}/{w_{i-2},w_{i-1}})={lambda_1}*P_ML({w_i}/{w_{i-2},w_{i-1}})
　　+{lambda_2}*P_ML({w_i}/w_{i-1})+{lambda_3}*P_ML({w_i})
　　這里{lambda_1}+{lambda_2}+{lambda_3}=1并且{lambda_i}>=0?對于所有的 i
　iv. 參數估計（Parameter Estimation）
　　1. 取出訓練集的一部分作為“驗證”數據（Hold out part of training set as “validation” data）
　　2. 定義Count_2(w_1,w_2,w_3)作為驗證集中三元集?w_1,w_2,w_3?的出現次數（Define?Count_2(w_1,w_2,w_3)?to be the number of times the trigram?w_1,w_2,w_3?is seen in validation set）
　　3. 選擇{lambda_i}去最大化(Choose?{lambda_i}?to maximize):
L({lambda_1},{lambda_2},{lambda_3})=sum{(w_1,w_2,w_3)in{upsilon}}{}{Count_2(w_1,w_2,w_3)}log{hat{P}}({w_3}/{w_2,w_1})
　　這里{lambda_1}+{lambda_2}+{lambda_3}=1并且{lambda_i}>=0?對于所有的 i
　　注：關于參數估計的其他內容，由于公式太多，這里略，請參考原始課件
c)Kats回退模型-兩元（Katz Back-Off Models (Bigrams)）：
　i. 定義兩個集合（Define two sets）：
　　A(w_{i-1})=delim{lbrace}{w:Count(w_{i-1},w)>0}{rbrace}
　　
　　B(w_{i-1})=delim{lbrace}{w:Count(w_{i-1},w)=0}{rbrace}
　ii. 一種兩元模型（A bigram model）：
P_K({w_i}/w_{i-1})=delim{lbrace}{matrix{2}{2}{{{Count^{*}(w_{i-1},w)}/{Count(w_{i-1})}>0} {if{w_i}{in}{A(w_{i-1})}} {alpha(w_{i-1}){{P_ML(w_{i})}/sum{w{in}B(w_{i-1})}{}{P_ML(w)}} } {if{w_i}{in}{B(w_{i-1})}} }}{}
{alpha(w_{i-1})=1-sum{w{in}A(w_{i-1})}{}{{Count^{*}(w_{i-1},w)}/{Count(w_{i-1})}}}
　iii. Count^*定義（Count^*definitions）
　　1. Kats對于Count(x)<5使用Good-Turing方法,對于Count(x)>=5令Count^*(x)=Count(x)(Katz uses Good-Turing method for Count(x)< 5, and?Count^*(x)=Count(x)for Count(x)>=5)
　　2. “Kneser-Ney”方法（“Kneser-Ney” method）：
　　　Count^*(x)=Count(x)-D,其中?D={n_1}/{n_1+n_2}
　　　n_1是頻率為1的元素個數（n_1?is a number of elements with frequency 1)
　　　n_2是頻率為2的元素個數（n_2?is a number of elements with frequency 2)

七、綜述
a) N元模型的弱點（Weaknesses of n-gram Models）
　i. 有何想法（Any ideas）?
　　短距離（Short-range）
　　中距離（Mid-range）
　　長距離（Long-range）
b) 更精確的模型（More Refined Models）
　i. 基于類的模型（Class-based models）
　ii. 結構化模型（Structural models）
　iii. 主題和長距離模型（Topical and long-range models）
c) 總結（Summary）
　i. 從一個詞表開始（Start with a vocabulary）
　ii. 選擇一種模型（Select type of model）
　iii. 參數估計（Estimate Parameters）
d) 工具包參考：
　i. CMU-Cambridge language modeling toolkit:
　　http://mi.eng.cam.ac.uk/~prc14/toolkit.html
　ii.SRILM – The SRI Language Modeling Toolkit:
　　http://www.speech.sri.com/projects/srilm/

第三講結束！
第四講：標注

附：課程及課件pdf下載MIT英文網頁地址：
　　　http://people.csail.mit.edu/regina/6881/

注：本文遵照麻省理工學院開放式課程創作共享規范翻譯發布，轉載請注明出處“我愛自然語言處理”：www.52nlp.cn

from：http://www.52nlp.cn/mit-nlp-third-lesson-probabilistic-language-modeling-sixth-part/

總結

以上是生活随笔為你收集整理的MIT自然语言处理第三讲：概率语言模型（第四、五、六部分）的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇： MIT自然语言处理第三讲：概率语言模型（
下一篇： MIT自然语言处理第四讲：标注