Using AI (the open-source Magpie library) to classify and tag Chinese text

Published: 2023/11/27

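Artificial intelligence is genuinely hot right now, and all kinds of traditional businesses are trying to use it to cut labor costs and raise productivity. Since it is such a popular tool, let us first get a rough idea of what it is: artificial intelligence refers to using speech recognition, semantic understanding, image recognition, visual processing, machine learning, big-data analysis, and related techniques to let machines respond intelligently and automatically, as a way of simulating human behavior. Magpie, introduced here, is one concrete implementation in the semantic understanding and machine learning part of that field.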

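Background

Recently, for work reasons, I needed to infer user behavior from the text of customer chat conversations and attach the corresponding tags. My task was to have a machine classify and tag the text automatically, but in our business a single piece of text can carry several different tags at once. How can that be done? On GitHub I found Magpie and saw that it matched my needs very well. After some tinkering, this article is the result.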

Magpie

Magpie is an open-source text classification library built on Keras, a high-level neural network library, with TensorFlow as the default backend. It is written in Python and supports only English text out of the box, so for my use case I added Chinese support on top of it. The relevant links are:

Magpie repository: https://github.com/inspirehep/magpie

Keras documentation (Chinese): https://keras-cn.readthedocs.io/en/latest/

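Implementation

The introduction above covers what we need to build, why, and with which tools. Now let us see what Magpie can actually do.

1. Download the source package from the Magpie repository and open the project in the PyCharm IDE. You will see directories such as "data", "magpie", and "save". The "data" directory holds the source data for training, "magpie" holds the source code, and "save" holds the model files produced by training.

[Screenshot: project directory structure]

2. Add the following third-party libraries to the project: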

    'nltk~=3.2',
    'numpy~=1.12',
    'scipy~=0.18',
    'gensim~=0.13',
    'scikit-learn~=0.18',
    'keras~=2.0',
    'h5py~=2.6',
    'jieba~=0.39',
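Before going further, it can help to confirm which of these libraries are already installed. The following is only a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); the helper name `installed_versions` is made up for this illustration and is not part of Magpie.

```python
from importlib import metadata

# Package names taken from the dependency list above.
REQUIRED = ["nltk", "numpy", "scipy", "gensim", "scikit-learn", "keras", "h5py", "jieba"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print("{}: {}".format(name, version or "NOT INSTALLED"))
```

The sketch only reports what is installed; it does not check the versions against the `~=` constraints listed above.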

3. With a basic understanding of the project, we can prepare the source data. Assume three labels, "軍事" (military), "旅游" (travel), and "政治" (politics), and prepare a number of source documents for each. In theory, the more training data the better; I was lazy and prepared only 50 documents per label. Take 70% of them as training data and 30% as test data, and place the training data in the "data" directory following Magpie's conventions.
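The split and file layout described in this step could be scripted roughly as follows. This is a sketch under assumptions rather than the script used for this article: it assumes Magpie's convention of one `.txt` file per document with a same-named `.lab` file listing one label per line, and the function name `write_dataset`, the directory names, and the `gbk` default encoding (chosen to match the `Document` class shown in the next step) are all choices made for this illustration.

```python
import io
import os
import random


def write_dataset(samples, train_dir, test_dir, train_ratio=0.7, seed=0, encoding="gbk"):
    """Split (text, labels) samples and write them as <n>.txt / <n>.lab pairs.

    Returns the number of training and test documents written.
    """
    samples = list(samples)
    random.Random(seed).shuffle(samples)  # fixed seed keeps the split reproducible
    n_train = int(round(len(samples) * train_ratio))
    splits = ((train_dir, samples[:n_train]), (test_dir, samples[n_train:]))

    doc_id = 0
    for directory, subset in splits:
        os.makedirs(directory, exist_ok=True)
        for text, labels in subset:
            base = os.path.join(directory, str(doc_id))
            with io.open(base + ".txt", "w", encoding=encoding) as f:
                f.write(text)
            # Assumed Magpie convention: one label per line in the .lab file.
            with io.open(base + ".lab", "w", encoding=encoding) as f:
                f.write("\n".join(labels))
            doc_id += 1
    return n_train, len(samples) - n_train
```

Because each sample carries a list of labels, a document that belongs to several categories simply gets several lines in its `.lab` file, for example `write_dataset([("...", ["政治", "軍事"])], "data/train", "data/test")` with your own texts in place of the placeholder.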

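4. With the data ready, we need to modify the source code so that it supports Chinese. The main problem with Chinese is word segmentation, for which we use the jieba library. Open the "Document" class under the "magpie\base" directory and add the segmentation code to it: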

    from __future__ import print_function, unicode_literals

    import io
    import os
    import re
    import string

    import jieba
    import nltk
    from nltk.tokenize import WordPunctTokenizer, sent_tokenize, word_tokenize

    nltk.download('punkt', quiet=True)  # make sure it's downloaded before using


    class Document(object):
        """ Class representing a document that the keywords are extracted from """

        def __init__(self, doc_id, filepath, text=None):
            self.doc_id = doc_id

            if text:
                text = self.clean_text(text)
                text = self.seg_text(text)
                self.text = text
                self.filename = None
                self.filepath = None
            else:  # is a path to a file
                if not os.path.exists(filepath):
                    raise ValueError("The file " + filepath + " doesn't exist")

                self.filepath = filepath
                self.filename = os.path.basename(filepath)
                # The data files are read as GBK here; change the encoding
                # if your files are stored differently (e.g. UTF-8).
                with io.open(filepath, 'r', encoding='gbk') as f:
                    text_context = f.read()
                text_context = self.clean_text(text_context)
                self.text = self.seg_text(text_context)
                print(self.text)  # debug output: the segmented text
            self.wordset = self.compute_wordset()

        # Segment the text with jieba, drop stop words, and return the
        # remaining words joined by single spaces
        def seg_text(self, text):
            with io.open('data/stopwords.txt', 'r', encoding='utf8') as f:
                stop = set(line.strip() for line in f)
            words = [w for w in jieba.cut(text.strip()) if w not in stop]
            # The space separator matters: the NLTK tokenizers below split on it
            return ' '.join(words).strip()

        # Clean the text: remove punctuation, digits, special symbols and whitespace
        def clean_text(self, content):
            text = re.sub(r'[+——！，；／·。？、~@#￥%……&*“”《》：（）［］【】〔〕]+', '', content)
            # re.escape keeps characters such as ] and ^ literal inside the class
            text = re.sub('[▲' + re.escape(string.punctuation) + ']+', '', text)
            text = re.sub(r'\d+', '', text)
            text = re.sub(r'\s+', '', text)
            return text

        def __str__(self):
            return self.text

        def compute_wordset(self):
            tokens = WordPunctTokenizer().tokenize(self.text)
            lowercase = [t.lower() for t in tokens]
            return set(lowercase) - {',', '.', '!', ';', ':', '-', '', None}

        def get_all_words(self):
            """ Return all words tokenized, in lowercase and without punctuation """
            return [w.lower() for w in word_tokenize(self.text)
                    if w not in string.punctuation]

        def read_sentences(self):
            lines = self.text.split('\n')
            raw = [sentence for inner_list in lines
                   for sentence in sent_tokenize(inner_list)]
            return [[w.lower() for w in word_tokenize(s) if w not in string.punctuation]
                    for s in raw]
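To see what the cleaning and joining steps in `Document` do without installing jieba, here is a small standalone sketch using only the standard library. The token list and stop words are hypothetical stand-ins for what `jieba.cut` and `data/stopwords.txt` would supply, and the ASCII punctuation class is built with `re.escape` so that characters such as `]` and `^` stay literal inside the brackets.

```python
import re
import string

CJK_PUNCT = r'[+——！，；／·。？、~@#￥%……&*“”《》：（）［］【】〔〕]+'
ASCII_PUNCT = '[▲' + re.escape(string.punctuation) + ']+'


def clean_text(content):
    """Strip punctuation, digits and whitespace, mirroring Document.clean_text."""
    text = re.sub(CJK_PUNCT, '', content)
    text = re.sub(ASCII_PUNCT, '', text)
    text = re.sub(r'\d+', '', text)
    return re.sub(r'\s+', '', text)


def join_tokens(tokens, stopwords):
    """Drop stop words and join the remaining tokens with single spaces."""
    return ' '.join(t for t in tokens if t not in stopwords)


cleaned = clean_text('2018年，我去了 [北京] 旅游！')
print(cleaned)  # 年我去了北京旅游

# Hypothetical segmentation of the cleaned text and a hypothetical stop-word set.
tokens = ['年', '我', '去', '了', '北京', '旅游']
joined = join_tokens(tokens, stopwords={'我', '了'})
print(joined)  # 年 去 北京 旅游
```

The spaces in the final string are what let the English-oriented NLTK tokenizers in the rest of the class treat each Chinese word as a separate token.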

5. With the changes above, the classifier supports Chinese reasonably well, and we can move on to training. Training is done by running the "train.py" script, which needs a few changes first:

    from magpie import Magpie

    magpie = Magpie()
    magpie.train_word2vec('data/hep-categories', vec_dim=3)  # train a word2vec model
    magpie.fit_scaler('data/hep-categories')  # fit the scaler
    magpie.init_word_vectors('data/hep-categories', vec_dim=3)  # initialize the word vectors
    labels = ['軍事', '旅游', '政治']  # all categories
    # train for 20 epochs, holding out 20% of the data for testing
    magpie.train('data/hep-categories', labels, test_ratio=0.2, epochs=20)

    # save the trained model files
    magpie.save_word2vec_model('save/embeddings/here', overwrite=True)
    magpie.save_scaler('save/scaler/here', overwrite=True)
    magpie.save_model('save/model/here.h5')
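As far as I can tell from Magpie's source, each output score is matched to a label by its position in the list, so the label list used when loading the model later has to be in the same order as the one used for training. One way to keep the two from drifting apart is to store the list next to the model. The sketch below is an optional addition, not part of Magpie; the helper names and the `save/labels.json` path are assumptions made for this illustration.

```python
import io
import json
import os


def save_labels(labels, path):
    """Write the label list to a JSON file, preserving its order."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with io.open(path, "w", encoding="utf-8") as f:
        json.dump(list(labels), f, ensure_ascii=False)


def load_labels(path):
    """Read the label list back in the order it was saved."""
    with io.open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

In the training script this would be called as `save_labels(labels, 'save/labels.json')` after training, and the prediction script would pass `labels=load_labels('save/labels.json')` when constructing `Magpie`.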

6. Run "train.py" to train on the data.

[Screenshot: training output]

7. Once the model has been trained, we can try it out. Testing is done by running the "test.py" script, which also needs a few changes first:

    from magpie import Magpie

    magpie = Magpie(
        keras_model='save/model/here.h5',
        word2vec_model='save/embeddings/here',
        scaler='save/scaler/here',
        labels=['軍事', '旅游', '政治']  # same order as in train.py
    )

    # a single piece of test text
    text = '特朗普在聯合國大會發表演講談到這屆美國政府成績時，稱他已經取得了美國歷史上幾乎最大的成就。隨后大會現場傳出了嘲笑聲，特朗普立即回應道：“這是真的。”'
    mag1 = magpie.predict_from_text(text)
    print(mag1)

    # test data can also be read from a txt file:
    # mag2 = magpie.predict_from_file('data/hep-categories/1002413.txt')
    # print(mag2)
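`predict_from_text` returns a list of (label, score) pairs, so tagging a text with several labels comes down to choosing which scores count. A minimal sketch of one possible rule follows: keep every label at or above a threshold, and fall back to the single best label if none qualifies. The function name `select_tags`, the 0.5 threshold, and the example scores are all assumptions for illustration, not values from this experiment.

```python
def select_tags(predictions, threshold=0.5, fallback_top=1):
    """Return labels scoring at or above threshold, best first.

    If no label reaches the threshold, return the top `fallback_top` labels.
    """
    ranked = sorted(predictions, key=lambda pair: pair[1], reverse=True)
    tags = [label for label, score in ranked if score >= threshold]
    return tags or [label for label, _ in ranked[:fallback_top]]


# Hypothetical scores, shaped like the output of predict_from_text.
example = [('政治', 0.91), ('軍事', 0.56), ('旅游', 0.03)]
print(select_tags(example))  # ['政治', '軍事']
```

A suitable threshold depends on the data; it is worth checking it against the held-out test documents rather than taking 0.5 as given.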

8. Run "test.py" to test the model.

[Screenshot: prediction output]

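Summary

1. Text classification applies in many scenarios, such as spam filtering, user behavior analysis, and article categorization. After this simple demonstration, perhaps you have already thought of a few more!

2. This article used the open-source Magpie library for model training and testing, with TensorFlow as the backend, combined with jieba for Chinese word segmentation.

3. Magpie itself does not support Chinese text. Adding jieba segmentation gave the classifier reasonably good support for it.

4. The prediction scores in this article depend to some extent on the amount of training data; in theory, the more training data the better.

5. A saying to share: however much intelligence you want from artificial intelligence, that is how much human labor you have to put in.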

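Notice

This article is the author's original work. When reposting, please credit the source and keep the link to the original. If the article helped you, please recommend or follow. Thank you for your support!

Reposted from: https://www.cnblogs.com/Miidy/p/9844170.html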
