[Kaggle Micro-Course] Natural Language Processing - 1. Intro to NLP

Published: 2024/1/18

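Contents

- 1. NLP with the spaCy library
- 2. Tokenizing
- 3. Text preprocessing
- 4. Pattern matching
- Exercise: menu satisfaction survey
  - 1 Find menu items in a review
  - 2 Match against all the reviews
  - 3 The worst-reviewed item
  - 4 How often each item is reviewed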

learn from https://www.kaggle.com/learn/natural-language-processing

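1. NLP with the spaCy library

spaCy: https://spacy.io/usage

spaCy needs a language-specific model, which you load with spacy.load().

Open cmd as administrator and run python -m spacy download en to download the English model en. On spaCy 3.x the en shortcut no longer exists; download and load en_core_web_sm instead.

    import spacy

    nlp = spacy.load('en')  # on spaCy 3.x: spacy.load('en_core_web_sm')

With the model loaded, you can process text:

    doc = nlp("Tea is healthy and calming, don't you think?")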

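2. Tokenizing

Tokenizing returns a document object that contains tokens. A token is a unit of text in the document, such as an individual word or punctuation mark.

spaCy splits contractions such as "don't" into two tokens, "do" and "n't". You can see the tokens by iterating over the document.

    for token in doc:
        print(token)

Output:

    Tea
    is
    healthy
    and
    calming
    ,
    do
    n't
    you
    think
    ?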
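To make the contraction split concrete, here is a rough regex-based sketch that reproduces the token list above for this one sentence. It is not how spaCy tokenizes (spaCy applies its own language-specific rules); the pattern and the name TOKEN_PATTERN are assumptions made for illustration.

```python
import re

# Toy pattern (an assumption, not spaCy's rules). Alternatives are tried
# left to right at each position:
#   n't         -> the contraction suffix as its own token
#   \w+(?=n't)  -> the word stem in front of that suffix ("do" in "don't")
#   \w+         -> any other word
#   [^\w\s]     -> a single punctuation character
TOKEN_PATTERN = re.compile(r"n't|\w+(?=n't)|\w+|[^\w\s]")

text = "Tea is healthy and calming, don't you think?"
tokens = TOKEN_PATTERN.findall(text)
print(tokens)
```

The sketch only handles the "n't" suffix; a real tokenizer also has to deal with other contractions, abbreviations, URLs, and so on.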

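3. Text preprocessing

A few types of preprocessing can improve how we model with words.

The first is lemmatizing. The lemma of a word is its base form. For example, "walk" is the lemma of "walking", so lemmatizing the word "walking" converts it to "walk".
Removing stop words is also common. Stop words are words that occur frequently in the language and carry little information. English stop words include "the", "is", "and", "but", and "not".

token.lemma_ returns the lemma of the word.
token.is_stop returns the boolean True if the token is a stop word (and False otherwise).

    print("Token \t\tLemma \t\tStopword")
    print("-" * 40)
    for token in doc:
        print(f"{str(token)}\t\t{token.lemma_}\t\t{token.is_stop}")

In the sentence above, the important words are tea, healthy, and calming. Removing stop words may help a predictive model focus on the relevant words.

Lemmatizing likewise helps by combining multiple forms of the same word into one base form ("calming", "calms", and "calmed" all become "calm").

However, lemmatizing and removing stop words can also make a model perform worse, so treat this preprocessing as part of your hyperparameter optimization.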
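One way to treat preprocessing as a hyperparameter is to generate every on/off combination of the two steps and evaluate a model on each. The sketch below only builds the four token variants; STOP_WORDS and LEMMAS are tiny hand-written stand-ins (assumptions) for what token.is_stop and token.lemma_ would provide, and preprocess is a hypothetical helper.

```python
from itertools import product

# Toy lookup tables (assumptions); with spaCy you would use
# token.is_stop and token.lemma_ instead.
STOP_WORDS = {"is", "and", "do", "n't", "you"}
LEMMAS = {"calming": "calm", "is": "be", "n't": "not"}

def preprocess(tokens, lemmatize, remove_stop):
    """Lowercase tokens, drop punctuation, and apply the optional steps."""
    result = []
    for tok in tokens:
        word = tok.lower()
        if not any(ch.isalnum() for ch in word):
            continue  # pure punctuation
        if remove_stop and word in STOP_WORDS:
            continue
        result.append(LEMMAS.get(word, word) if lemmatize else word)
    return result

tokens = ["Tea", "is", "healthy", "and", "calming", ",",
          "do", "n't", "you", "think", "?"]

# One entry per (lemmatize, remove_stop) setting: four variants in total.
variants = {
    (lemmatize, remove_stop): preprocess(tokens, lemmatize, remove_stop)
    for lemmatize, remove_stop in product([False, True], repeat=2)
}
for setting, words in variants.items():
    print(setting, words)
```

In a real experiment each variant would feed the same model and validation split, and the setting with the best validation score would be kept.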

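4. Pattern matching

Another common NLP task is matching words or phrases within chunks of text or whole documents.
You can do pattern matching with regular expressions, but spaCy's matching capabilities tend to be easier to use.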
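For comparison, this is roughly what the regular-expression route looks like for a short list of phrases. The terms and the sample sentence are made up for illustration. Note that a regex reports character offsets, whereas spaCy's matchers report token offsets.

```python
import re

# Hypothetical phrases to search for.
terms = ["green tea", "black tea"]

# One alternation, escaped and bounded by word boundaries, case-insensitive.
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
    flags=re.IGNORECASE,
)

text = "I prefer Green Tea in the morning and black tea at night."
found = [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]
for start, end, phrase in found:
    print(start, end, phrase)
```

This works for a handful of terms, but escaping, word boundaries, and tokenization details quickly become the caller's problem as the list grows.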

To match individual tokens, you create a Matcher. When you want to match a list of terms, PhraseMatcher is easier to use and more efficient.

For example, to find where different smartphone models show up in some text, you can create patterns for the model names of interest.

First, create the PhraseMatcher:

    from spacy.matcher import PhraseMatcher

    matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

The matcher is built on the vocabulary of the English model loaded earlier, and attr='LOWER' makes it compare lowercased text, so matching is case-insensitive.

Next, create the list of terms to match:

    terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
    patterns = [nlp(text) for text in terms]
    print(patterns)
    # Output: [Galaxy Note, iPhone 11, iPhone XS, Google Pixel]
    matcher.add("match1", patterns)  # see help(matcher.add)

    text_doc = nlp("Glowing review overall, and some really interesting side-by-side "
                   "photography tests pitting the iPhone 11 Pro against the "
                   "Galaxy Note 10 Plus and last year’s iPhone XS and Google Pixel 3.")
    for i, text in enumerate(text_doc):
        print(i, text)
    matches = matcher(text_doc)
    print(matches)

Output:

    0 Glowing
    1 review
    2 overall
    3 ,
    4 and
    5 some
    6 really
    7 interesting
    8 side
    9 -
    10 by
    11 -
    12 side
    13 photography
    14 tests
    15 pitting
    16 the
    17 iPhone
    18 11
    19 Pro
    20 against
    21 the
    22 Galaxy
    23 Note
    24 10
    25 Plus
    26 and
    27 last
    28 year
    29 ’s
    30 iPhone
    31 XS
    32 and
    33 Google
    34 Pixel
    35 3
    36 .
    [(12981744483764759145, 17, 19),  # iPhone 11
     (12981744483764759145, 22, 24),  # Galaxy Note
     (12981744483764759145, 30, 32),  # iPhone XS
     (12981744483764759145, 33, 35)]  # Google Pixel

Each match is a tuple (match id, start token index, end token index). The end index is exclusive, so (17, 19) covers tokens 17 and 18.

    match_id, start, end = matches[3]
    print(nlp.vocab.strings[match_id], text_doc[start:end])

Output:

    match1 Google Pixel
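To see what the (start, end) offsets refer to, the following pure-Python sketch mimics case-insensitive phrase matching over the token list printed above. It is an illustration only, not spaCy's implementation: find_phrases is a hypothetical helper, and it assumes every term can be tokenized by splitting on whitespace.

```python
def find_phrases(tokens, phrases):
    """Return (start, end) token offsets of each phrase, ignoring case."""
    lowered = [tok.lower() for tok in tokens]
    targets = [tuple(phrase.lower().split()) for phrase in phrases]
    matches = []
    for start in range(len(lowered)):
        for target in targets:
            end = start + len(target)  # end is exclusive, as in spaCy
            if tuple(lowered[start:end]) == target:
                matches.append((start, end))
    return matches

# The 37 tokens from the output above, separated by spaces.
tokens = (
    "Glowing review overall , and some really interesting side - by - side "
    "photography tests pitting the iPhone 11 Pro against the Galaxy Note 10 Plus "
    "and last year ’s iPhone XS and Google Pixel 3 ."
).split()

terms = ["Galaxy Note", "iPhone 11", "iPhone XS", "Google Pixel"]
matches = find_phrases(tokens, terms)
for start, end in matches:
    print(start, end, " ".join(tokens[start:end]))
```

The offsets agree with the spaCy output because both index tokens rather than characters.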

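Exercise: menu satisfaction survey

You are a consultant for DeFalco's Italian Restaurant. The owner asked you to identify whether there are any foods on the menu that diners find disappointing.

The owner suggested using reviews from Yelp to determine which dishes people like and dislike. You pulled the data from Yelp. Before starting the analysis, run the code cell below for a quick look at the data you have to work with.

    import pandas as pd

    data = pd.read_json('../input/nlp-course/restaurant.json')
    data.head()

The owner also gave you this list of menu items and common alternate spellings:

    menu = ["Cheese Steak", "Cheesesteak", "Steak and Cheese", "Italian Combo", "Tiramisu", "Cannoli",
            "Chicken Salad", "Chicken Spinach Salad", "Meatball", "Pizza", "Pizzas", "Spaghetti",
            "Bruchetta", "Eggplant", "Italian Beef", "Purista", "Pasta", "Calzones", "Calzone",
            "Italian Sausage", "Chicken Cutlet", "Chicken Parm", "Chicken Parmesan", "Gnocchi",
            "Chicken Pesto", "Turkey Sandwich", "Turkey Breast", "Ziti", "Portobello", "Reuben",
            "Mozzarella Caprese", "Corned Beef", "Garlic Bread", "Pastrami", "Roast Beef",
            "Tuna Salad", "Lasagna", "Artichoke Salad", "Fettuccini Alfredo", "Chicken Parmigiana",
            "Grilled Veggie", "Grilled Veggies", "Grilled Vegetable", "Mac and Cheese", "Macaroni",
            "Prosciutto", "Salami"]

Given the Yelp data and the list of menu items, do you have any ideas for how to find which menu items have disappointed diners?

You could group the reviews by the menu items they mention and then calculate the average rating for each item. Items that are mentioned in reviews with low scores stand out, so the restaurant can fix the recipe or remove those foods from the menu.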

1 Find menu items in a review

    import spacy
    from spacy.matcher import PhraseMatcher

    index_of_review_to_test_on = 14
    text_to_test_on = data.text.iloc[index_of_review_to_test_on]

    # Load a blank English pipeline (tokenizer only)
    nlp = spacy.blank('en')

    # Create the tokenized version of text_to_test_on
    review_doc = nlp(text_to_test_on)

    # Create the PhraseMatcher object. The vocabulary is the first argument.
    # Use attr='LOWER' so that matching ignores capitalization.
    matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

    # Create a list of tokenized docs, one for each item in the menu
    menu_tokens_list = [nlp(item) for item in menu]

    # Add the item patterns to the matcher.
    # See https://spacy.io/api/phrasematcher#add for help with this step.
    matcher.add("MENU",  # just a name for the set of rules we're matching to
                menu_tokens_list)

    # Find matches in the review_doc
    matches = matcher(review_doc)

    for i, text in enumerate(review_doc):
        print(i, text)
    for match in matches:
        print(f"Token number {match[1]}: {review_doc[match[1]:match[2]]}")

This finds the positions in the review where menu items appear. Token 46 prints as blank, which suggests it is a whitespace token.

    0 The
    1 Il
    2 Purista
    3 sandwich
    4 has
    5 become
    6 a
    7 staple
    8 of
    9 my
    10 life
    11 .
    12 Mozzarella
    13 ,
    14 basil
    15 ,
    16 prosciutto
    17 ,
    18 roasted
    19 red
    20 peppers
    21 and
    22 balsamic
    23 vinaigrette
    24 blend
    25 into
    26 a
    27 front
    28 runner
    29 for
    30 the
    31 best
    32 sandwich
    33 in
    34 the
    35 valley
    36 .
    37 Goes
    38 great
    39 with
    40 sparkling
    41 water
    42 or
    43 a
    44 beer
    45 .
    46
    47 DeFalco
    48 's
    49 also
    50 has
    51 other
    52 Italian
    53 fare
    54 such
    55 as
    56 a
    57 delicious
    58 meatball
    59 sub
    60 and
    61 classic
    62 pastas
    63 .
    Token number 2: Purista
    Token number 16: prosciutto
    Token number 58: meatball

2 Match against all the reviews

Build a dictionary whose keys are the menu items found in the reviews and whose values are lists of star ratings. Each review's rating is appended to the list of every item it mentions.

    from collections import defaultdict

    # item_ratings is a dictionary of lists. If a key doesn't exist in item_ratings,
    # the key is added with an empty list as the value.
    item_ratings = defaultdict(list)

    for idx, review in data.iterrows():
        doc = nlp(review.text)
        # Using the matcher from the previous exercise
        matches = matcher(doc)

        # Create a set of the items found in the review text, lowercased so that
        # differently capitalized mentions collapse into one key
        found_items = set([doc[m[1]:m[2]].text.lower() for m in matches])

        # Update item_ratings with the rating for each item in found_items
        for item in found_items:
            item_ratings[item].append(review.stars)
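The reason for collecting matches into a set first is that a review mentioning the same dish several times should contribute its rating only once. A small sketch with made-up reviews shows the effect; the reviews list and its fields are hypothetical stand-ins for the matcher output.

```python
from collections import defaultdict

# Hypothetical data: star rating plus the matched item strings per review.
reviews = [
    {"stars": 5, "items": ["pizza", "Pizza", "cannoli"]},  # pizza mentioned twice
    {"stars": 2, "items": ["pizza"]},
]

item_ratings = defaultdict(list)
for review in reviews:
    # Lowercase, then deduplicate within this one review.
    found_items = {item.lower() for item in review["items"]}
    for item in found_items:
        item_ratings[item].append(review["stars"])

print(dict(item_ratings))
```

Without the set, the first review would add two 5-star ratings for pizza and skew its average upward.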

3 The worst-reviewed item

    # Calculate the mean rating for each menu item as a dictionary
    mean_ratings = {name: sum(scores) / len(scores) for name, scores in item_ratings.items()}

    # Find the item with the lowest mean rating and store its name in worst_item
    worst_item = min(mean_ratings, key=mean_ratings.get)

    # Print the worst item along with its average rating
    print(worst_item)
    print(mean_ratings[worst_item])

Output:

    chicken cutlet
    3.4

4 How often each item is reviewed

How many reviews mention each item:

    counts = {item: len(ratings) for item, ratings in item_ratings.items()}

    item_counts = sorted(counts, key=counts.get, reverse=True)
    for item in item_counts:
        print(f"{item:>25}{counts[item]:>5}")

Output:

                        pizza  265
                        pasta  206
                     meatball  128
                  cheesesteak   97
                 cheese steak   76
                      cannoli   72
                      calzone   72
                     eggplant   69
                      purista   63
                      lasagna   59
              italian sausage   53
                   prosciutto   50
                 chicken parm   50
                 garlic bread   39
                      gnocchi   37
                    spaghetti   36
                     calzones   35
                       pizzas   32
                       salami   28
                chicken pesto   27
                 italian beef   25
                     tiramisu   21
                italian combo   21
                         ziti   21
             chicken parmesan   19
           chicken parmigiana   17
                   portobello   14
               mac and cheese   11
               chicken cutlet   10
             steak and cheese    9
                     pastrami    9
                   roast beef    7
           fettuccini alfredo    6
               grilled veggie    6
                   tuna salad    5
              turkey sandwich    5
              artichoke salad    5
                     macaroni    5
                chicken salad    5
                       reuben    4
        chicken spinach salad    2
                  corned beef    2
                turkey breast    1

Print the ten lowest-rated and the ten highest-rated items by mean rating:

    sorted_ratings = sorted(mean_ratings, key=mean_ratings.get)

    print("Worst rated menu items:")
    for item in sorted_ratings[:10]:
        print(f"{item:20} Ave rating: {mean_ratings[item]:.2f} \tcount: {counts[item]}")

    print("\n\nBest rated menu items:")
    for item in sorted_ratings[-10:]:
        print(f"{item:20} Ave rating: {mean_ratings[item]:.2f} \tcount: {counts[item]}")

Output:

    Worst rated menu items:
    chicken cutlet       Ave rating: 3.40    count: 10
    turkey sandwich      Ave rating: 3.80    count: 5
    spaghetti            Ave rating: 3.89    count: 36
    italian beef         Ave rating: 3.92    count: 25
    tuna salad           Ave rating: 4.00    count: 5
    macaroni             Ave rating: 4.00    count: 5
    italian combo        Ave rating: 4.05    count: 21
    garlic bread         Ave rating: 4.13    count: 39
    roast beef           Ave rating: 4.14    count: 7
    eggplant             Ave rating: 4.16    count: 69


    Best rated menu items:
    chicken pesto        Ave rating: 4.56    count: 27
    chicken salad        Ave rating: 4.60    count: 5
    purista              Ave rating: 4.67    count: 63
    prosciutto           Ave rating: 4.68    count: 50
    reuben               Ave rating: 4.75    count: 4
    steak and cheese     Ave rating: 4.89    count: 9
    artichoke salad      Ave rating: 5.00    count: 5
    fettuccini alfredo   Ave rating: 5.00    count: 6
    turkey breast        Ave rating: 5.00    count: 1
    corned beef          Ave rating: 5.00    count: 2

The less data you have for a particular item, the less you can trust that its average rating reflects customers' "true" sentiment.
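One common way to act on this, which is not part of the course material, is a damped (Bayesian-style) average that pulls items with few reviews toward an overall prior. The sketch below uses three (mean, count) pairs from the output above; PRIOR_MEAN and PRIOR_WEIGHT are assumed values chosen only for illustration, and damped_mean is a hypothetical helper.

```python
def damped_mean(mean, count, prior_mean, prior_weight):
    """Blend an item's mean with a prior, weighted by its review count."""
    return (mean * count + prior_mean * prior_weight) / (count + prior_weight)

# (mean rating, review count) taken from the printed results.
observed = {
    "chicken cutlet": (3.40, 10),
    "turkey breast": (5.00, 1),
    "purista": (4.67, 63),
}

PRIOR_MEAN = 4.3    # assumed overall average rating
PRIOR_WEIGHT = 10   # assumed strength of the prior, in "virtual reviews"

adjusted = {
    item: damped_mean(mean, count, PRIOR_MEAN, PRIOR_WEIGHT)
    for item, (mean, count) in observed.items()
}
for item, value in adjusted.items():
    print(f"{item:20} {value:.2f}")
```

With these assumed settings, turkey breast (a single 5-star review) no longer outranks purista (63 reviews), while chicken cutlet still sits well below both.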

I would take dishes off the menu that have a low rating and more than 20 reviews.
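That rule can be written as a simple filter. The sketch below applies it to the ten worst-rated rows from the output above. The text does not say what counts as a "low" rating, so MAX_RATING = 4.0 is an assumed cutoff.

```python
# (item, mean rating, review count) from the "Worst rated menu items" output.
worst_rated = [
    ("chicken cutlet", 3.40, 10),
    ("turkey sandwich", 3.80, 5),
    ("spaghetti", 3.89, 36),
    ("italian beef", 3.92, 25),
    ("tuna salad", 4.00, 5),
    ("macaroni", 4.00, 5),
    ("italian combo", 4.05, 21),
    ("garlic bread", 4.13, 39),
    ("roast beef", 4.14, 7),
    ("eggplant", 4.16, 69),
]

MIN_REVIEWS = 20   # from the text: more than 20 reviews
MAX_RATING = 4.0   # assumed cutoff for "low rating"

candidates = [
    name
    for name, mean, count in worst_rated
    if count > MIN_REVIEWS and mean < MAX_RATING
]
print(candidates)
```

Under these thresholds chicken cutlet, the lowest-rated item, is not flagged because it has only 10 reviews; whether to act on it anyway is a judgment call.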

My CSDN blog: https://michael.blog.csdn.net/ (WeChat official account: Michael阿明)
