[Kaggle Micro-Course] Natural Language Processing - 3. Word Vectors
Table of Contents
- 1. Word Embeddings
- 2. Classification Models
- 3. Document Similarity
- Exercise:
- 1. Training a Model on Document Vectors
- 2. Text Similarity
Learned from https://www.kaggle.com/learn/natural-language-processing
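1. Word Embeddings
Reference post: 05. Sequence Models W2. Natural Language Processing and Word Embeddings https://michael.blog.csdn.net/article/details/108886394
Similar words have similar vector representations, and subtracting one vector from another lets you form analogies.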
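To make the analogy idea concrete, here is a minimal sketch. The 3-dimensional vectors are hypothetical values picked by hand for illustration, not real embeddings, and `analogy` is a helper name introduced only for this example.

```python
import math

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration only.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "tea":   [0.5, 0.5, 0.5],
}

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)  # queen
```

With real embeddings the nearest word is not guaranteed to be the expected one; the toy values here were chosen so that the arithmetic works out.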
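- Load the model
- Extract the word vectors
- Combine the word vectors into a document vector; the simplest approach is to average the vectors of all the words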
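The averaging step in the last bullet can be sketched in plain Python. The `word_vectors` table and the `document_vector` helper are hypothetical names used only here; with spaCy, the word vectors would come from the loaded model instead.

```python
# Hypothetical word vectors; in the post they would come from the loaded model.
word_vectors = {
    "free": [1.0, 0.0, 2.0],
    "tea":  [0.0, 2.0, 4.0],
    "now":  [2.0, 1.0, 0.0],
}

def document_vector(tokens, vectors):
    """Average the vectors of the known tokens, component by component."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has a vector")
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]

doc_vec = document_vector(["free", "tea", "now"], word_vectors)
print(doc_vec)  # [1.0, 1.0, 2.0]
```

Averaging discards word order, so "tea is free" and "free is tea" get the same document vector. That trade-off is what makes this the simplest approach.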
2. Classification Models
Once you have document vectors, you can build models with sklearn, XGBoost, and so on.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(doc_vectors, spam.label, test_size=0.1, random_state=1)
- An SVM example
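As a minimal sketch of the SVM step, assuming scikit-learn's `LinearSVC` and two synthetic clusters standing in for `doc_vectors` and `spam.label` (the real data is not reproduced here, and the hyperparameters are illustrative choices):

```python
import random

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

random.seed(0)

# Synthetic stand-in for doc_vectors / spam.label: two noisy 5-dimensional clusters.
X, y = [], []
for label, center in ((0, -1.0), (1, 1.0)):
    for _ in range(100):
        X.append([center + random.gauss(0, 0.5) for _ in range(5)])
        y.append(label)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=1
)

# Linear support vector classifier; dual=False suits n_samples > n_features.
svc = LinearSVC(random_state=1, dual=False, max_iter=10000)
svc.fit(X_train, y_train)

accuracy = svc.score(X_test, y_test)
print(f"Model test accuracy: {accuracy * 100:.3f}%")
```

The clusters are deliberately well separated, so the accuracy on this toy data says nothing about performance on real document vectors.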
3. Document Similarity
Cosine similarity: $\cos \theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$
import numpy as np

def cosine_similarity(a, b):
    # a.dot(a) and b.dot(b) are the squared norms of a and b
    return a.dot(b) / np.sqrt(a.dot(a) * b.dot(b))

# nlp is the spaCy model loaded earlier
a = nlp("REPLY NOW FOR FREE TEA").vector
b = nlp("According to legend, Emperor Shen Nung discovered tea when leaves from a wild tree blew into his pot of boiling water.").vector
cosine_similarity(a, b)
Output:
0.7030031
Exercise:
Try word vectors out on the sentiment analysis model you built for the restaurant. Then, given some example text, find the most similar review in the dataset.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import spacy

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.nlp.ex3 import *
print("\nSetup complete")
- Load the model and data
- To save time, load the precomputed document vectors for all the reviews
1. Training a Model on Document Vectors
- SVM
Output:
Model test accuracy: 93.847%
- KNN
Output:
Model test accuracy: 86.998%
2. Text Similarity
- Centering the Vectors
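When calculating similarities, people sometimes compute the mean vector of all documents and subtract it from each document's vector. Why do you think this helps with similarity metrics?
Sometimes your documents are already fairly similar. For example, this dataset consists entirely of business reviews, so the documents resemble one another far more than they would resemble news articles, technical manuals, or recipes. You end up with all similarities between 0.8 and 1 and no anti-similar documents (similarity < 0). When the vectors are centered, you are comparing documents within your dataset rather than against all possible documents.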
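The effect of centering can be seen on a small example. This is a sketch with four hypothetical document vectors that all point in nearly the same direction; the numbers are made up to mimic a dataset of similar documents.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

# Hypothetical document vectors: the first two are one "topic", the last two another.
docs = [
    [10.0, 10.0, 2.0],
    [10.0, 10.0, 3.0],
    [10.0, 12.0, 0.0],
    [10.0, 13.0, 0.0],
]

# Mean vector over all documents, then subtract it from every document.
mean = [sum(col) / len(docs) for col in zip(*docs)]
centered = [[x - m for x, m in zip(d, mean)] for d in docs]

# Similarity of the first document to each of the others, before and after.
raw_sims = [cosine(docs[0], d) for d in docs[1:]]
centered_sims = [cosine(centered[0], c) for c in centered[1:]]

print([round(s, 3) for s in raw_sims])
print([round(s, 3) for s in centered_sims])
```

Before centering, every similarity sits just below 1, so the two topics are hard to tell apart. After centering, the same-topic document stays strongly positive while the other two become negative.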
- Find the most similar review
Output:
After purchasing my final christmas gifts at the Urban Tea Merchant in Vancouver, I was surprised to hear about Teopia at the new outdoor mall at Don Mills and Lawrence when I went back home to Toronto for Christmas. Across from the outdoor skating rink and perfect to sit by the ledge to people watch, the location was prime for tea connesieurs... or people who are just freezing cold in need of a drinK! Like any gourmet tea shop, there were large tins of tea leaves on the walls, and although the tea menu seemed interesting enough, you can get any specialty tea as your drink. We didn't know what to get... so the lady suggested the Goji Berries... it smelled so succulent and juicy... instantly SOLD! I got it into a tea latte and watched the tea steep while the milk was steamed, and surprisingly, with the click of a button, all the water from the tea can be instantly drained into the cup (see photo).. very fascinating! The tea was aromatic and tasty, not over powering. The price was also very reasonable and I recommend everyone to get a taste of this place :)
- Review 1
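- The review most similar to Review 1
- Looking at similar reviews
If you look at the other similar reviews, you will see a lot of coffee shops. Why do you think coffee reviews are similar to an example review that mentions only tea?
Coffee shop reviews will also be similar to our tea house review because coffee and tea are semantically similar. Most cafes serve both coffee and tea, so the two words often appear together.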
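The "find the most similar review" step can be sketched as follows: center both the dataset and the query with the dataset mean, then take the index with the highest cosine similarity. The `review_vectors` values and the `most_similar` helper are hypothetical stand-ins for the precomputed review vectors, not the exercise's actual code.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

def most_similar(query, doc_vectors):
    """Return (index, similarity) of the document closest to query after centering."""
    mean = [sum(col) / len(doc_vectors) for col in zip(*doc_vectors)]
    q = [x - m for x, m in zip(query, mean)]
    sims = [
        cosine(q, [x - m for x, m in zip(d, mean)])
        for d in doc_vectors
    ]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]

# Hypothetical stand-ins for the precomputed review vectors.
review_vectors = [
    [10.0, 10.0, 2.0],
    [10.0, 12.0, 0.0],
    [10.0, 13.0, 0.0],
    [10.0, 10.0, 3.0],
]
query_vector = [10.0, 10.0, 3.2]  # hypothetical vector of the example text

best_index, best_sim = most_similar(query_vector, review_vectors)
print(best_index, round(best_sim, 3))
```

Sorting the similarities instead of taking only the maximum would give the ranked list of neighbours in which the coffee shop reviews show up.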
I finished the course and earned the certificate of completion. Onward!
My CSDN blog: https://michael.blog.csdn.net/