CS224N Assignment 1: Exploring Word Vectors
```python
# All Import Statements Defined Here
# Note: Do not add to this list.
# ----------------

import sys
# Check the Python version: this assignment requires Python 3.5+.
assert sys.version_info[0] == 3
assert sys.version_info[1] >= 5

from gensim.models import KeyedVectors
from gensim.test.utils import datapath
import pprint
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]

import nltk
nltk.download('reuters')
from nltk.corpus import reuters  # the Reuters corpus shipped with NLTK

import numpy as np
import random
import scipy as sp  # scientific computing library
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA

START_TOKEN = '<START>'
END_TOKEN = '<END>'

np.random.seed(0)
random.seed(0)
# ----------------
```

```
[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\羅松\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!
```

Word Vectors
Word Vectors are often used as a fundamental component for downstream NLP tasks, e.g. question answering, text generation, translation, etc., so it is important to build some intuitions as to their strengths and weaknesses. Here, you will explore two types of word vectors: those derived from co-occurrence matrices, and those derived via GloVe.
Note on Terminology: The terms “word vectors” and “word embeddings” are often used interchangeably. The term “embedding” refers to the fact that we are encoding aspects of a word’s meaning in a lower dimensional space. As Wikipedia states, “conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension”.
Part 1: Count-Based Word Vectors
Most word vector models start from the following idea:
- You shall know a word by the company it keeps (Firth, J. R. 1957:11)
Many word vector implementations are driven by the idea that similar words, i.e., (near) synonyms, will be used in similar contexts. As a result, similar words will often be spoken or written along with a shared subset of words, i.e., contexts. By examining these contexts, we can try to develop embeddings for our words. With this intuition in mind, many “old school” approaches to constructing word vectors relied on word counts. Here we elaborate upon one of those strategies, co-occurrence matrices (for more information, see here or here).
Co-Occurrence
A co-occurrence matrix counts how often things co-occur in some environment. Given some word w_i occurring in the document, we consider the context window surrounding w_i. Supposing our fixed window size is n, then this is the n preceding and n subsequent words in that document, i.e. words w_{i-n} … w_{i-1} and w_{i+1} … w_{i+n}. We build a co-occurrence matrix M, which is a symmetric word-by-word matrix in which M_{ij} is the number of times w_j appears inside w_i's window among all documents.
Example: Co-Occurrence with Fixed Window of n=1:
Document 1: “all that glitters is not gold”
Document 2: “all is well that ends well”
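To make the counting concrete, here is a minimal sketch (plain Python, not the assignment's solution code) that tallies window co-occurrences for these two documents with n=1; the `counts` dictionary maps a `(center, context)` pair to how often the context word falls inside the center word's window:

```python
from collections import Counter

docs = [["all", "that", "glitters", "is", "not", "gold"],
        ["all", "is", "well", "that", "ends", "well"]]

# For every word, count which words fall inside its +/-1 window.
counts = Counter()
for doc in docs:
    for i, w in enumerate(doc):
        for j in (i - 1, i + 1):        # window of size n=1
            if 0 <= j < len(doc):       # skip positions past the edges
                counts[(w, doc[j])] += 1

print(counts[("all", "that")])   # 1: "that" is next to "all" once
print(counts[("that", "well")])  # 1: and the counts are symmetric
```

Note the symmetry: `counts[(a, b)] == counts[(b, a)]`, which is exactly why the co-occurrence matrix M is symmetric.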
A key idea here is that a word's meaning is closely tied to the words that appear near it. We therefore fix a window (typically of size 5–10); with a window size of 2, for example, the words co-occurring with "rests" would be "life", "he", "in", and "peace". We then use these co-occurrence relationships to generate word vectors.
For example, suppose our corpus consists of the following three documents:
I like deep learning.
I like NLP.
I enjoy flying.
As an example, set the window size to 1, so we only look at the words immediately adjacent to each word. This yields a symmetric matrix, the co-occurrence matrix. Since "I" and "like" appear as neighbors twice in our corpus, the cell where the "I" row meets the "like" column holds the value 2. In this way we realize the goal of turning words into vectors: every row (or column) of the co-occurrence matrix is a vector representation of the corresponding word.
Although the co-occurrence matrix goes some way toward accounting for the relative positions of words, it still suffers from the curse of dimensionality: each word's vector is as long as the vocabulary. A natural remedy is a standard dimensionality-reduction technique such as SVD or PCA, though these of course bring trade-offs of their own.
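To make the reduction step concrete, here is a minimal sketch that shrinks a toy count matrix with NumPy's full SVD and keeps the top k components (the matrix values are invented for illustration; the assignment itself will use scikit-learn's TruncatedSVD instead):

```python
import numpy as np

# Toy symmetric 4-word co-occurrence counts (values invented for illustration).
M = np.array([[0., 2., 1., 0.],
              [2., 0., 0., 1.],
              [1., 0., 0., 3.],
              [0., 1., 3., 0.]])

# Full SVD: M = U @ diag(s) @ Vt, with singular values sorted in descending order.
U, s, Vt = np.linalg.svd(M)

# Keep the top k singular values; U_k * s_k gives the k-dimensional embeddings.
k = 2
M_reduced = U[:, :k] * s[:k]

print(M_reduced.shape)  # (4, 2): each of the 4 words is now a 2-d vector
```

For a large vocabulary, computing the full SVD like this is wasteful; a truncated SVD computes only the top k components directly, which is why the assignment points you at `sklearn.decomposition.TruncatedSVD`.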
Plotting Co-Occurrence Word Embeddings
Here, we will be using the Reuters (business and financial news) corpus. If you haven’t run the import cell at the top of this page, please run it now (click it and press SHIFT-RETURN). The corpus consists of 10,788 news documents totaling 1.3 million words. These documents span 90 categories and are split into train and test. For more details, please see https://www.nltk.org/book/ch02.html. We provide a read_corpus function below that pulls out only articles from the “crude” (i.e. news articles about oil, gas, etc.) category. The function also adds START and END tokens to each of the documents, and lowercases words. You do not have to perform any other kind of pre-processing.
```python
def read_corpus(category='crude'):
    """Read files from the specified Reuters category.
    Params:
        category (string): category name
    Return:
        list of lists, with words from each of the processed files
    """
    files = reuters.fileids(category)
    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN]
            for f in files]
```

Let’s have a look at what these documents are like….
```python
reuters_corpus = read_corpus()
pprint.pprint(reuters_corpus[:3], compact=True, width=100)
```

```
[['<START>', 'japan', 'to', 'revise', 'long', '-', 'term', 'energy', 'demand', 'downwards', 'the',
  'ministry', 'of', 'international', 'trade', 'and', 'industry', '(', 'miti', ')', 'will', 'revise',
  'its', 'long', '-', 'term', 'energy', 'supply', '/', 'demand', 'outlook', 'by', 'august', 'to',
  'meet', 'a', 'forecast', 'downtrend', 'in', 'japanese', 'energy', 'demand', ',', 'ministry',
  'officials', 'said', '.', ...],
 ['<START>', 'energy', '/', 'u', '.', 's', '.', 'petrochemical', 'industry', 'cheap', 'oil',
  'feedstocks', ',', 'the', 'weakened', 'u', '.', 's', '.', 'dollar', 'and', 'a', 'plant',
  'utilization', 'rate', 'approaching', '90', 'pct', 'will', 'propel', 'the', 'streamlined', 'u',
  '.', 's', '.', 'petrochemical', 'industry', 'to', 'record', 'profits', 'this', 'year', ',', ...],
 ['<START>', 'turkey', 'calls', 'for', 'dialogue', 'to', 'solve', 'dispute', 'turkey', 'said',
  'today', 'its', 'disputes', 'with', 'greece', ',', 'including', 'rights', 'on', 'the',
  'continental', 'shelf', 'in', 'the', 'aegean', 'sea', ',', 'should', 'be', 'solved', 'through',
  'negotiations', '.', ...]]
```

Question 1.1: Implement distinct_words
Write a method to work out the distinct words (word types) that occur in the corpus. You can do this with for loops, but it’s more efficient to do it with Python list comprehensions. In particular, this may be useful to flatten a list of lists. If you’re not familiar with Python list comprehensions in general, here’s more information.
Your returned corpus_words should be sorted. You can use python’s sorted function for this.
You may find it useful to use Python sets to remove duplicate words.
This question asks you to collect the distinct words from a list of word lists; it is mainly a warm-up with Python. A set is a good fit here (sets are hash-based, so lookups are fast): on each iteration, simply take the union of the running set with the new document's word list.
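A minimal sketch of such a `distinct_words` function (set-based deduplication, then sorting; one possible implementation, not necessarily the official solution):

```python
def distinct_words(corpus):
    """Determine a sorted list of distinct words for the corpus.
    Params:
        corpus (list of list of strings): corpus of documents
    Return:
        corpus_words (list of strings): sorted list of distinct words in the corpus
        num_corpus_words (int): number of distinct words
    """
    # A set comprehension flattens the list of lists and deduplicates in one pass.
    corpus_words = sorted({w for doc in corpus for w in doc})
    return corpus_words, len(corpus_words)

words, n = distinct_words([["a", "b"], ["b", "c"]])
print(words, n)  # ['a', 'b', 'c'] 3
```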
Question 1.2: Implement compute_co_occurrence_matrix
Write a method that constructs a co-occurrence matrix for a certain window-size n (with a default of 4), considering words n before and n after the word in the center of the window. Here, we start to use numpy (np) to represent vectors, matrices, and tensors. If you’re not familiar with NumPy, there’s a NumPy tutorial in the second half of this cs231n Python NumPy tutorial.
The second question is to build the co-occurrence matrix. Start from the two things the function must return:
- The first, word2ind, is a dictionary keyed by word, whose value is that word's index in the sorted list returned in Question 1.1. With it we can locate a matrix position in O(1) time. See the code below for how to create the all-zeros matrix.
- The second, M, is the matrix itself, recording which words appear to the left and right of each word. We only need a single pass over each document: at each position, count every word within ±window_size of the current word.
Recall from lecture why this window is sound: the window is the word's context, so two words that occur in similar windows (i.e. similar contexts) tend to have similar meanings, and words with similar meanings end up clustered together.
```python
def compute_co_occurrence_matrix(corpus, window_size=4):
    """Compute co-occurrence matrix for the given corpus and window_size (default of 4).

    Note: Each word in a document should be at the center of a window. Words near edges
    will have a smaller number of co-occurring words.
    For example, if we take the document "<START> All that glitters is not gold <END>"
    with window size of 4, "All" will co-occur with "<START>", "that", "glitters", "is",
    and "not".

    Params:
        corpus (list of list of strings): corpus of documents
        window_size (int): size of context window
    Return:
        M (a symmetric numpy matrix of shape (number of unique words in the corpus,
            number of unique words in the corpus)): co-occurrence matrix of word counts.
            The ordering of the words in the rows/columns should be the same as the
            ordering of the words given by the distinct_words function.
        word2ind (dict): dictionary that maps word to index (i.e. row/column number)
            for matrix M.
    """
    words, num_words = distinct_words(corpus)
    word2ind = {word: i for i, word in enumerate(words)}
    M = np.zeros((num_words, num_words))
    for sentence in corpus:
        # enumerate() pairs each element with its index, which we need for window bounds.
        for i, word in enumerate(sentence):
            for j in range(i - window_size, i + window_size + 1):
                if j < 0 or j >= len(sentence):
                    continue  # window runs off the edge of the document
                if j != i:
                    M[word2ind[word], word2ind[sentence[j]]] += 1
    return M, word2ind
```

```python
# ---------------------
# Run this sanity check
# Note that this is not an exhaustive check for correctness.
# ---------------------

# Define toy corpus and get student's co-occurrence matrix
test_corpus = ["{} All that glitters isn't gold {}".format(START_TOKEN, END_TOKEN).split(" "),
               "{} All's well that ends well {}".format(START_TOKEN, END_TOKEN).split(" ")]
M_test, word2ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)

# Correct M and word2ind
M_test_ans = np.array(
    [[0., 0., 0., 0., 0., 0., 1., 0., 0., 1.],
     [0., 0., 1., 1., 0., 0., 0., 0., 0., 0.],
     [0., 1., 0., 0., 0., 0., 0., 0., 1., 0.],
     [0., 1., 0., 0., 0., 0., 0., 0., 0., 1.],
     [0., 0., 0., 0., 0., 0., 0., 0., 1., 1.],
     [0., 0., 0., 0., 0., 0., 0., 1., 1., 0.],
     [1., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
     [0., 0., 0., 0., 0., 1., 1., 0., 0., 0.],
     [0., 0., 1., 0., 1., 1., 0., 0., 0., 1.],
     [1., 0., 0., 1., 1., 0., 0., 0., 1., 0.]]
)
ans_test_corpus_words = sorted([START_TOKEN, "All", "ends", "that", "gold", "All's",
                                "glitters", "isn't", "well", END_TOKEN])
word2ind_ans = dict(zip(ans_test_corpus_words, range(len(ans_test_corpus_words))))

# Test correct word2ind
assert (word2ind_ans == word2ind_test), \
    "Your word2ind is incorrect:\nCorrect: {}\nYours: {}".format(word2ind_ans, word2ind_test)

# Test correct M shape
assert (M_test.shape == M_test_ans.shape), \
    "M matrix has incorrect shape.\nCorrect: {}\nYours: {}".format(M_test.shape, M_test_ans.shape)

# Test correct M values
for w1 in word2ind_ans.keys():
    idx1 = word2ind_ans[w1]
    for w2 in word2ind_ans.keys():
        idx2 = word2ind_ans[w2]
        student = M_test[idx1, idx2]
        correct = M_test_ans[idx1, idx2]
        if student != correct:
            print("Correct M:")
            print(M_test_ans)
            print("Your M: ")
            print(M_test)
            raise AssertionError(
                "Incorrect count at index ({}, {})=({}, {}) in matrix M. "
                "Yours has {} but should have {}.".format(idx1, idx2, w1, w2, student, correct))

# Print Success
print("-" * 80)
print("Passed All Tests!")
print("-" * 80)
```

```
--------------------------------------------------------------------------------
Passed All Tests!
--------------------------------------------------------------------------------
```

Question 1.3: Implement reduce_to_k_dim [code] (1 point)
Construct a method that performs dimensionality reduction on the matrix to produce k-dimensional embeddings. Use SVD to take the top k components and produce a new matrix of k-dimensional embeddings.
Note: All of numpy, scipy, and scikit-learn (sklearn) provide some implementation of SVD, but only scipy and sklearn provide an implementation of Truncated SVD, and only sklearn provides an efficient randomized algorithm for calculating large-scale Truncated SVD. So please use sklearn.decomposition.TruncatedSVD.
This question is just importing the right tool and performing SVD: M = U Σ V^T, where U and V are unitary matrices and Σ is diagonal with rank(Σ) = rank(M). Keeping the k largest singular values yields the reduced matrix U_k Σ_k.
The original 10 × 10 matrix is thus reduced to a 10 × 2 matrix.
```python
def reduce_to_k_dim(M, k=2):
    """Reduce a co-occurrence count matrix of dimensionality
    (num_corpus_words, num_corpus_words) to a matrix of dimensionality
    (num_corpus_words, k) using the following SVD function from Scikit-Learn:
        - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

    Params:
        M (numpy matrix of shape (number of unique words in the corpus,
            number of unique words in the corpus)): co-occurrence matrix of word counts
        k (int): embedding size of each word after dimension reduction
    Return:
        M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of
            k-dimensional word embeddings.
            In terms of the SVD from math class, this actually returns U * S.
    """
    n_iters = 10  # Use this parameter in your call to `TruncatedSVD`
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)
    print("Done.")
    return M_reduced
```

```python
# ---------------------
# Run this sanity check
# Note that this is not an exhaustive check for correctness.
# In fact we only check that your M_reduced has the right dimensions.
# ---------------------

# Define toy corpus and run student code
test_corpus = ["{} All that glitters isn't gold {}".format(START_TOKEN, END_TOKEN).split(" "),
               "{} All's well that ends well {}".format(START_TOKEN, END_TOKEN).split(" ")]
M_test, word2ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)
M_test_reduced = reduce_to_k_dim(M_test, k=2)

# Test proper dimensions
assert (M_test_reduced.shape[0] == 10), \
    "M_reduced has {} rows; should have {}".format(M_test_reduced.shape[0], 10)
assert (M_test_reduced.shape[1] == 2), \
    "M_reduced has {} columns; should have {}".format(M_test_reduced.shape[1], 2)

# Print Success
print("-" * 80)
print("Passed All Tests!")
print("-" * 80)
```

```
Running Truncated SVD over 10 words...
Done.
--------------------------------------------------------------------------------
Passed All Tests!
--------------------------------------------------------------------------------
```

Question 1.4: Implement plot_embeddings [code] (1 point)
Here you will write a function to plot a set of 2D vectors in 2D space. For graphs, we will use Matplotlib (plt).
For this example, you may find it useful to adapt this code. In the future, a good way to make a plot is to look at the Matplotlib gallery, find a plot that looks somewhat like what you want, and adapt the code they give.
```python
def plot_embeddings(M_reduced, word2ind, words):
    """Plot in a scatterplot the embeddings of the words specified in the list "words".

    NOTE: do not plot all the words listed in M_reduced / word2ind.
    Include a label next to each point.

    Params:
        M_reduced (numpy matrix of shape (number of unique words in the corpus, 2)):
            matrix of 2-dimensional word embeddings
        word2ind (dict): dictionary that maps word to indices for matrix M
        words (list of strings): words whose embeddings we want to visualize
    """
    fig = plt.figure()
    plt.style.use("seaborn-whitegrid")
    for word in words:
        point = M_reduced[word2ind[word]]
        plt.scatter(point[0], point[1], marker="*")
        plt.annotate(word, xy=(point[0], point[1]), xytext=(point[0], point[1] + 0.1))
```

```python
# ---------------------
# Run this sanity check
# Note that this is not an exhaustive check for correctness.
# The plot produced should look like the "test solution plot" depicted below.
# ---------------------

print("-" * 80)
print("Outputted Plot:")

M_reduced_plot_test = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1], [0, 0]])
word2ind_plot_test = {'test1': 0, 'test2': 1, 'test3': 2, 'test4': 3, 'test5': 4}
words = ['test1', 'test2', 'test3', 'test4', 'test5']
plot_embeddings(M_reduced_plot_test, word2ind_plot_test, words)

print("-" * 80)
```

```
--------------------------------------------------------------------------------
Outputted Plot:
--------------------------------------------------------------------------------
```

Question 1.5: Co-Occurrence Plot Analysis [written] (3 points)
Now we will put together all the parts you have written! We will compute the co-occurrence matrix with fixed window of 4 (the default window size), over the Reuters “crude” (oil) corpus. Then we will use TruncatedSVD to compute 2-dimensional embeddings of each word. TruncatedSVD returns U*S, so we need to normalize the returned vectors, so that all the vectors will appear around the unit circle (therefore closeness is directional closeness). Note: The line of code below that does the normalizing uses the NumPy concept of broadcasting. If you don’t know about broadcasting, check out Computation on Arrays: Broadcasting by Jake VanderPlas.
Run the below cell to produce the plot. It’ll probably take a few seconds to run. What clusters together in 2-dimensional embedding space? What doesn’t cluster together that you might think should have? Note: “bpd” stands for “barrels per day” and is a commonly used abbreviation in crude oil topic articles.
```python
# -----------------------------
# Run This Cell to Produce Your Plot
# -----------------------------
reuters_corpus = read_corpus()
M_co_occurrence, word2ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)
M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)

# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis]  # broadcasting

words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output',
         'petroleum', 'iraq']
plot_embeddings(M_normalized, word2ind_co_occurrence, words)
```

```
Running Truncated SVD over 8185 words...
Done.
```

Part 2: Prediction-Based Word Vectors (15 points)
As discussed in class, prediction-based word vectors such as word2vec and GloVe (which also takes advantage of count statistics) have more recently demonstrated better performance. Here, we shall explore the embeddings produced by GloVe. Please revisit the class notes and lecture slides for more details on the word2vec and GloVe algorithms. If you’re feeling adventurous, challenge yourself and try reading GloVe’s original paper.
Then run the following cells to load the GloVe vectors into memory. Note: If this is your first time running these cells, i.e. downloading the embedding model, it will take a couple of minutes. If you’ve run these cells before, rerunning them will load the model without redownloading it, which takes about 1 to 2 minutes.
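Once loaded (e.g. via gensim's KeyedVectors), queries such as "which words are most similar to this one?" reduce to cosine similarity between vectors. A minimal NumPy sketch of that idea follows; the three-word vocabulary and its 3-dimensional vectors are invented for illustration and are not real GloVe entries:

```python
import numpy as np

# Invented mini "embedding table"; real GloVe vectors have 50-300 dimensions.
vocab = {"oil":       np.array([0.9, 0.1, 0.0]),
         "petroleum": np.array([0.8, 0.2, 0.1]),
         "banana":    np.array([0.0, 0.9, 0.4])}

def cosine(u, v):
    """Cosine similarity: dot product of the unit-normalized vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, k=1):
    """Rank the other vocabulary words by cosine similarity to `word`."""
    sims = [(other, cosine(vocab[word], vec))
            for other, vec in vocab.items() if other != word]
    return sorted(sims, key=lambda t: -t[1])[:k]

print(most_similar("oil"))  # 'petroleum' ranks above 'banana'
```

This is, conceptually, what the gensim model's similarity queries compute over the full GloVe vocabulary.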