SVD Decomposition, Latent Semantic Analysis, and Python Code
Source (Chinese translation): http://www.cnblogs.com/appler/archive/2012/02/02/2335886.html
Original English article: http://www.puffinwarellc.com/index.php/news-and-articles/articles/33.html
Introduction to Latent Semantic Analysis (LSA)
Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI) literally means analyzing documents to find the underlying meaning or concepts of those documents. If each word only meant one concept, and each concept was only described by one word, then LSA would be easy since there is a simple mapping from words to concepts.
Unfortunately, this problem is difficult because English has different words that mean the same thing (synonyms), words with multiple meanings, and all sorts of ambiguities that obscure the concepts to the point where even people can have a hard time understanding.
For example, the word bank when used together with mortgage, loans, and rates probably means a financial institution. However, the word bank when used together with lures, casting, and fish probably means a stream or river bank.
How Latent Semantic Analysis Works
Latent Semantic Analysis arose from the problem of how to find relevant documents from search words. The fundamental difficulty arises when we compare words to find relevant documents, because what we really want to do is compare the meanings or concepts behind the words. LSA attempts to solve this problem by mapping both words and documents into a "concept" space and doing the comparison in this space.
(Translator's note: in other words, LSA is a dimensionality-reduction technique.)
Since authors have a wide choice of words available when they write, the concepts can be obscured due to different word choices from different authors. This essentially random choice of words introduces noise into the word-concept relationship. Latent Semantic Analysis filters out some of this noise and also attempts to find the smallest set of concepts that spans all the documents.
In order to make this difficult problem solvable, LSA introduces some dramatic simplifications.
1. Documents are represented as "bags of words", where the order of the words in a document is not important, only how many times each word appears in a document.
2. Concepts are represented as patterns of words that usually appear together in documents. For example "leash", "treat", and "obey" might usually appear in documents about dog training.
3. Words are assumed to have only one meaning. This is clearly not the case (banks could be river banks or financial banks) but it makes the problem tractable.
To see a small example of LSA, take a look at the next section.
A Small Example
As a small example, I searched for books using the word “investing” at Amazon.com and took the top 10 book titles that appeared. One of these titles was dropped because it had only one index word in common with the other titles. An index word is any word that:
- appears in 2 or more titles, and
- is not a very common word such as "and", "the", and so on (known as stop words). These words are not included because they do not contribute much (if any) meaning.
In this example we have removed the following stop words: “and”, “edition”, “for”, “in”, “little”, “of”, “the”, “to”.
Here are the 9 remaining titles. The index words (words that appear in 2 or more titles and are not stop words) are: book, dads, dummies, estate, guide, investing, market, real, rich, stock, and value.
1. The Neatest Little Guide to Stock Market Investing
2. Investing For Dummies, 4th Edition
3. The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns
4. The Little Book of Value Investing
5. Value Investing: From Graham to Buffett and Beyond
6. Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!
7. Investing in Real Estate, 5th Edition
8. Stock Investing For Dummies
9. Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss
Once Latent Semantic Analysis has been run on this example, we can plot the index words and titles on an XY graph and identify clusters of titles. The 9 titles are plotted with blue circles and the 11 index words are plotted with red squares. Not only can we spot clusters of titles, but since index words can be plotted along with titles, we can label the clusters. For example, the blue cluster, containing titles T7 and T9, is about real estate. The green cluster, with titles T2, T4, T5, and T8, is about value investing, and finally the red cluster, with titles T1 and T3, is about the stock market. The T6 title is an outlier, off on its own.
In the next few sections, we'll go through all steps needed to run Latent Semantic Analysis on this example.
Part 1 - Creating the Count Matrix
The first step in Latent Semantic Analysis is to create the word by title (or document) matrix. In this matrix, each index word is a row and each title is a column. Each cell contains the number of times that word occurs in that title. For example, the word "book" appears one time in title T3 and one time in title T4, whereas "investing" appears one time in every title. In general, the matrices built during LSA tend to be very large, but also very sparse (most cells contain 0). That is because each title or document usually contains only a small number of all the possible words. This sparseness can be taken advantage of in both memory and time by more sophisticated LSA implementations.
In the following matrix, we have left out the 0's to reduce clutter.
| Index Words | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 |
|-------------|----|----|----|----|----|----|----|----|----|
| book        |    |    | 1  | 1  |    |    |    |    |    |
| dads        |    |    |    |    |    | 1  |    |    | 1  |
| dummies     |    | 1  |    |    |    |    |    | 1  |    |
| estate      |    |    |    |    |    |    | 1  |    | 1  |
| guide       | 1  |    |    |    |    | 1  |    |    |    |
| investing   | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  |
| market      | 1  |    | 1  |    |    |    |    |    |    |
| real        |    |    |    |    |    |    | 1  |    | 1  |
| rich        |    |    |    |    |    | 2  |    |    | 1  |
| stock       | 1  |    | 1  |    |    |    |    | 1  |    |
| value       |    |    |    | 1  | 1  |    |    |    |    |
Python - Getting Started
Throughout this article, we'll give Python code that implements all the steps necessary for doing Latent Semantic Analysis. We'll go through the code section by section and explain everything. The Python code used in this article can be downloaded from the original article and then run in Python. You need to have already installed the Python NumPy and SciPy libraries.
Python - Import Functions
First we need to import a few functions from Python libraries to handle some of the math we need to do. NumPy is the Python numerical library, and we'll import zeros, a function that creates a matrix of zeros that we use when building our words by titles matrix. From the linear algebra part of the scientific package (scipy.linalg) we import the svd function that actually does the singular value decomposition, which is the heart of LSA.
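The original listing is not preserved in this copy; a minimal sketch matching the description above would be:

```python
# zeros() builds the empty word-by-title count matrix;
# svd() performs the singular value decomposition at the heart of LSA.
from numpy import zeros
from scipy.linalg import svd
```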
Python - Define Data
Next, we define the data that we are using. Titles holds the9 book titles that we have gathered, stopwords holds the 8 common words that weare going to ignore when we count the words in each title, and ignorechars hasall the punctuation characters that we will remove from words. We use Python'striple quoted strings, so there are actually only 4 punctuation symbols we areremoving: comma (,), colon (:), apostrophe ('), and exclamation point (!).
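The data definitions can be sketched as follows; the titles are the 9 listed earlier, and the stop words and punctuation are those named in the text:

```python
# The 9 book titles (T1 .. T9) gathered from the Amazon search.
titles = [
    "The Neatest Little Guide to Stock Market Investing",
    "Investing For Dummies, 4th Edition",
    "The Little Book of Common Sense Investing: The Only Way to "
    "Guarantee Your Fair Share of Stock Market Returns",
    "The Little Book of Value Investing",
    "Value Investing: From Graham to Buffett and Beyond",
    "Rich Dad's Guide to Investing: What the Rich Invest in, "
    "That the Poor and the Middle Class Do Not!",
    "Investing in Real Estate, 5th Edition",
    "Stock Investing For Dummies",
    "Rich Dad's Advisors: The ABC's of Real Estate Investing: "
    "The Secrets of Finding Hidden Profits Most Investors Miss",
]
# The 8 stop words to ignore, and the 4 punctuation characters to strip.
stopwords = ["and", "edition", "for", "in", "little", "of", "the", "to"]
ignorechars = """,:'!"""
```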
Python - Define LSA Class
The LSA class has methods for initialization, parsing documents, building the matrix of word counts, and calculating. The first method is the __init__ method, which is called whenever an instance of the LSA class is created. It stores the stopwords and ignorechars so they can be used later, and then initializes the word dictionary and the document count variables.
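A sketch of the class skeleton and its __init__ method as just described:

```python
class LSA(object):
    def __init__(self, stopwords, ignorechars):
        self.stopwords = stopwords      # words to skip when counting
        self.ignorechars = ignorechars  # punctuation stripped from each word
        self.wdict = {}                 # word -> list of document numbers
        self.dcount = 0                 # documents parsed so far
```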
Python - Parse Documents
The parse method takes a document, splits it into words, removes the ignored characters and turns everything into lowercase so the words can be compared to the stop words. If the word is a stop word, it is ignored and we move on to the next word. If it is not a stop word, we put the word in the dictionary, and also append the current document number to keep track of which documents the word appears in.
The documents that each word appears in are kept in a list associated with that word in the dictionary. For example, since the word book appears in titles 3 and 4, we would have self.wdict['book'] = [3, 4] after all titles are parsed.
After processing all words from the current document, we increase the document count in preparation for the next document to be parsed.
(Translator's note: this amounts to building an inverted index over the words.)
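A Python 3 sketch of the parse method (the original article targeted Python 2, so details such as the translate call are an adaptation), shown together with __init__ so it runs on its own:

```python
class LSA(object):
    def __init__(self, stopwords, ignorechars):
        self.stopwords = stopwords
        self.ignorechars = ignorechars
        self.wdict = {}   # word -> list of document numbers it appears in
        self.dcount = 0   # current document number

    def parse(self, doc):
        for w in doc.split():
            # Strip punctuation and lowercase so words match the stop list.
            w = w.translate(str.maketrans("", "", self.ignorechars)).lower()
            if w in self.stopwords:
                continue  # stop words are ignored entirely
            if w in self.wdict:
                self.wdict[w].append(self.dcount)
            else:
                self.wdict[w] = [self.dcount]
        self.dcount += 1  # ready for the next document
```

Note that with zero-based document numbers, "book" (which occurs in titles T3 and T4) ends up as self.wdict['book'] == [2, 3].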
Python - Build the Count Matrix
Once all documents are parsed, all the words (dictionary keys) that are in more than 1 document are extracted and sorted, and a matrix is built with the number of rows equal to the number of words (keys), and the number of columns equal to the document count. Finally, for each word (key) and document pair the corresponding matrix cell is incremented.
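A sketch of build(), again shown with enough of the class to run on its own; numpy's zeros supplies the empty matrix:

```python
from numpy import zeros

class LSA(object):
    def __init__(self, stopwords, ignorechars):
        self.stopwords = stopwords
        self.ignorechars = ignorechars
        self.wdict = {}
        self.dcount = 0

    def parse(self, doc):
        for w in doc.split():
            w = w.translate(str.maketrans("", "", self.ignorechars)).lower()
            if w not in self.stopwords:
                self.wdict.setdefault(w, []).append(self.dcount)
        self.dcount += 1

    def build(self):
        # Keep only words that occur in more than one document, sorted,
        # then fill a (words x documents) matrix of occurrence counts.
        self.keys = sorted(k for k in self.wdict if len(self.wdict[k]) > 1)
        self.A = zeros([len(self.keys), self.dcount])
        for i, k in enumerate(self.keys):
            for d in self.wdict[k]:
                self.A[i, d] += 1
```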
Python - Print the Count Matrix
The printA() method is very simple: it just prints out the matrix that we have built so it can be checked.
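The corresponding method is a one-liner; a sketch, assuming the count matrix lives in self.A as built above:

```python
class LSA(object):
    # ... __init__, parse, and build as in the previous sections ...

    def printA(self):
        # Dump the word-by-title count matrix for inspection.
        print(self.A)
```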
Python - Test the LSA Class
After defining the LSA class, it's time to try it out on our 9 book titles. First we create an instance of LSA, called mylsa, and pass it the stopwords and ignorechars that we defined. During creation, the __init__ method is called, which stores the stopwords and ignorechars and initializes the word dictionary and document count.
Next, we call the parse method on each title. This method extracts the words in each title, strips out punctuation characters, converts each word to lower case, throws out stop words, and stores remaining words in a dictionary along with what title number they came from.
Finally we call the build() method to create the matrix of word by title counts. This extracts all the words we have seen so far, throws out words that occur in less than 2 titles, sorts them, builds a zero matrix of the right size, and then increments the proper cell whenever a word appears in a title.
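Putting the pieces together, the driver described above looks roughly like this; the class and data are reconstructed sketches based on the descriptions in this article, not the original listing:

```python
from numpy import zeros

titles = [
    "The Neatest Little Guide to Stock Market Investing",
    "Investing For Dummies, 4th Edition",
    "The Little Book of Common Sense Investing: The Only Way to "
    "Guarantee Your Fair Share of Stock Market Returns",
    "The Little Book of Value Investing",
    "Value Investing: From Graham to Buffett and Beyond",
    "Rich Dad's Guide to Investing: What the Rich Invest in, "
    "That the Poor and the Middle Class Do Not!",
    "Investing in Real Estate, 5th Edition",
    "Stock Investing For Dummies",
    "Rich Dad's Advisors: The ABC's of Real Estate Investing: "
    "The Secrets of Finding Hidden Profits Most Investors Miss",
]
stopwords = ["and", "edition", "for", "in", "little", "of", "the", "to"]
ignorechars = ",:'!"

class LSA(object):
    def __init__(self, stopwords, ignorechars):
        self.stopwords = stopwords
        self.ignorechars = ignorechars
        self.wdict = {}   # word -> documents it appears in
        self.dcount = 0   # documents parsed so far

    def parse(self, doc):
        for w in doc.split():
            w = w.translate(str.maketrans("", "", self.ignorechars)).lower()
            if w not in self.stopwords:
                self.wdict.setdefault(w, []).append(self.dcount)
        self.dcount += 1

    def build(self):
        self.keys = sorted(k for k in self.wdict if len(self.wdict[k]) > 1)
        self.A = zeros([len(self.keys), self.dcount])
        for i, k in enumerate(self.keys):
            for d in self.wdict[k]:
                self.A[i, d] += 1

    def printA(self):
        print(self.A)

mylsa = LSA(stopwords, ignorechars)
for t in titles:
    mylsa.parse(t)
mylsa.build()
mylsa.printA()
```

Run as-is, the 11 surviving keys are the index words listed earlier (book, dads, dummies, estate, guide, investing, market, real, rich, stock, value) and printA() prints an 11x9 matrix matching the table above.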
Here is the raw output produced by printA(). As you can see, it's the same as the matrix that we showed earlier.