Installing fastText: a brief introduction to fastText
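Things to watch out for:
1. The package runs on Linux and macOS only.
2. The label prefix has two underscores on each side (__label__). Two, not one!
Environment: Python 2.7, Linux.
I have to eat my words: at the moment the official package only works on Linux and macOS. Sorry for misleading everyone earlier.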
Testing fastText, Facebook's open-source neural-network model for text classification.
Installing the fasttext Python package:
pip install fasttext
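Step 1: get the text to classify. I used the Tsinghua University news corpus directly; the download link is in the third post of this text series.
Output data format: sample text + sample label
Note: this step is optional. You can start from step 2, which provides the text already in the processed format. I wrote this step down mainly as a record of how the raw text was processed.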
import jieba
import os

basedir = "/home/li/corpus/news/"  # my corpus location; change it to match your own folder
dir_list = ['affairs','constellation','economic','edu','ent','fashion','game','home','house','lottery','science','sports','stock']

# Generate the fastText training and test datasets
ftrain = open("news_fasttext_train.txt", "w")
ftest = open("news_fasttext_test.txt", "w")

for e in dir_list:
    indir = basedir + e + '/'
    files = os.listdir(indir)
    count = 0
    for fileName in files:
        count += 1
        filepath = indir + fileName
        with open(filepath, 'r') as fr:
            text = fr.read()
        # Round trip through unicode; fails early if the file is not valid UTF-8
        text = text.decode("utf-8").encode("utf-8")
        # Tabs and newlines are replaced so each sample stays on one line
        seg_text = jieba.cut(text.replace("\t", " ").replace("\n", " "))
        outline = " ".join(seg_text)
        outline = outline.encode("utf-8") + "\t__label__" + e + "\n"
        # print outline
        # break
        if count < 10000:
            # the first files of each class go to the training set
            ftrain.write(outline)
            ftrain.flush()
            continue
        elif count < 20000:
            # the next 10000 files, if the class has them, go to the test set
            ftest.write(outline)
            ftest.flush()
            continue
        else:
            break
ftrain.close()
ftest.close()
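The line format written by the script (segmented text, a tab, then the label with its two-underscore prefix) can be sketched as a pair of helpers. This is only an illustration, written for Python 3 rather than the Python 2.7 used in this post; the names to_fasttext_line and parse_fasttext_line are hypothetical and are not part of jieba or fasttext.

```python
LABEL_PREFIX = "__label__"  # two underscores on each side


def to_fasttext_line(tokens, label, prefix=LABEL_PREFIX):
    """Join segmented tokens and append the prefixed label after a tab."""
    # Tabs and newlines inside tokens would break the one-sample-per-line format.
    text = " ".join(t.replace("\t", " ").replace("\n", " ") for t in tokens)
    return text + "\t" + prefix + label + "\n"


def parse_fasttext_line(line, prefix=LABEL_PREFIX):
    """Split a line back into (text, label), removing the label prefix."""
    text, tagged = line.rstrip("\n").split("\t")
    if not tagged.startswith(prefix):
        raise ValueError("label does not start with %r" % prefix)
    return text, tagged[len(prefix):]


line = to_fasttext_line(["fastText", "is", "fast"], "science")
print(repr(line))
print(parse_fasttext_line(line))
```

A label written with a single underscore on each side is rejected by parse_fasttext_line, which is the mistake the note at the top of this post warns about.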
Step 2: classify with fastText, using the fasttext Python package.
Prepared data (Baidu Netdisk download):
news_fasttext_train.txt
news_fasttext_test.txt
# -*- coding: utf-8 -*-
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import fasttext
# Train the model
classifier = fasttext.supervised("news_fasttext_train.txt", "news_fasttext.model", label_prefix="__label__")
# Or load a model that has already been trained
#classifier = fasttext.load_model('news_fasttext.model.bin', label_prefix='__label__')
Test the model:
result = classifier.test("news_fasttext_test.txt")
print result.precision
print result.recall
Output:
0.92240420242
0.92240420242
fastText seems to report precision and recall only over the whole test set, so to get per-class results you have to write the code yourself.
# -*- coding: utf-8 -*-
"""
Created on Wed Oct 18 14:17:27 2017
@author: xiaoguangli
"""
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import fasttext

classifier = fasttext.load_model('news_fasttext.model.bin', label_prefix='__label__')
labels_right = []
texts = []
with open("news_fasttext_test.txt") as fr:
    for line in fr:
        line = line.decode("utf-8").rstrip()
        # each line is "text<TAB>__label__name"; keep only the bare class name
        labels_right.append(line.split("\t")[1].replace("__label__", ""))
        texts.append(line.split("\t")[0])
        # print labels
        # print texts
        # break
labels_predict = [e[0] for e in classifier.predict(texts)]  # predict returns a 2-D result: one list of labels per text
print labels_predict
text_labels = list(set(labels_right))
text_predict_labels = list(set(labels_predict))
print text_predict_labels
print text_labels
A = dict.fromkeys(text_labels, 0)  # correctly predicted samples per class
B = dict.fromkeys(text_labels, 0)  # samples per class in the test set
C = dict.fromkeys(text_predict_labels, 0)  # samples per class in the predictions
for i in range(0, len(labels_right)):
    B[labels_right[i]] += 1
    C[labels_predict[i]] += 1
    if labels_right[i] == labels_predict[i]:
        A[labels_right[i]] += 1
print A
print B
print C
# Compute precision, recall and F-score for each class
for key in B:
    try:
        r = float(A[key]) / float(B[key])  # recall: correct / actual
        p = float(A[key]) / float(C[key])  # precision: correct / predicted
        f = p * r * 2 / (p + r)
        print "%s:\t p:%f\t r:%f\t f:%f" % (key, p, r, f)
    except (KeyError, ZeroDivisionError):
        # the class was never predicted, or was never predicted correctly
        print "error:", key, "right:", A.get(key, 0), "real:", B.get(key, 0), "predict:", C.get(key, 0)
Classes in the experimental data (predicted labels, test-set labels, then A, B and C):
[u'affairs', u'fashion', u'lottery', u'house', u'science', u'sports', u'game', u'economic', u'ent', u'edu', u'home', u'constellation', u'stock']
['affairs', 'fashion', 'house', 'sports', 'game', 'economic', 'ent', 'edu', 'home', 'stock', 'science']
{'science': 8415, 'affairs': 8257, 'fashion': 3173, 'house': 9491, 'sports': 9739, 'game': 9506, 'economic': 9235, 'ent': 9665, 'edu': 9491, 'home': 9315, 'stock': 9015}
{'science': 10000, 'affairs': 10000, 'fashion': 3369, 'house': 10000, 'sports': 10000, 'game': 10000, 'economic': 10000, 'ent': 10000, 'edu': 10000, 'home': 10000, 'stock': 10000}
{u'affairs': 8562, u'fashion': 3585, u'lottery': 96, u'science': 9088, u'edu': 10068, u'sports': 10099, u'game': 10151, u'economic': 10131, u'ent': 10798, u'house': 10000, u'home': 10103, u'constellation': 432, u'stock': 10256}
Experimental results
science: p:0.925946 r:0.841500 f:0.881706
affairs: p:0.964377 r:0.825700 f:0.889667
fashion: p:0.885077 r:0.941822 f:0.912568
house: p:0.949100 r:0.949100 f:0.949100
sports: p:0.964353 r:0.973900 f:0.969103
game: p:0.936459 r:0.950600 f:0.943477
economic: p:0.911559 r:0.923500 f:0.917490
ent: p:0.895073 r:0.966500 f:0.929416
edu: p:0.942690 r:0.949100 f:0.945884
home: p:0.922003 r:0.931500 f:0.926727
stock: p:0.878998 r:0.901500 f:0.890107
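The per-class figures can be re-derived from the three count dictionaries printed earlier. The sketch below is written for Python 3 (unlike the Python 2.7 script in this post) and uses a hypothetical helper, per_class_scores, with precision taken as correct / predicted and recall as correct / actual. The counts for science and fashion are copied from the printed dictionaries A, B and C.

```python
def per_class_scores(correct, actual, predicted):
    """Return {label: (precision, recall, f1)} from per-class count dicts."""
    scores = {}
    for label in actual:
        tp = correct.get(label, 0)
        n_pred = predicted.get(label, 0)
        n_true = actual[label]
        # Guard every division so a class that was never predicted gives 0.0
        p = tp / n_pred if n_pred else 0.0
        r = tp / n_true if n_true else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        scores[label] = (p, r, f)
    return scores


# Counts for two classes, copied from the dictionaries printed above.
A = {"science": 8415, "fashion": 3173}   # correctly predicted
B = {"science": 10000, "fashion": 3369}  # present in the test set
C = {"science": 9088, "fashion": 3585}   # predicted by the model

scores = per_class_scores(A, B, C)
for label, (p, r, f) in sorted(scores.items()):
    print("%s: p:%f r:%f f:%f" % (label, p, r, f))
```

Returning 0.0 instead of raising keeps the loop going for classes such as lottery that the model predicts but the test set does not contain; whether that is the right convention depends on how the scores are used.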
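The results show that fastText classifies quite well: without any tuning of fastText's parameters, the F-scores all fall between roughly 0.88 and 0.97. At prediction time, however, an extra class, constellation, somehow appeared. Could it be... Still looking for the cause.
Correction (2016/11/7): set B shows that the test set contains no lottery or constellation samples. During data preparation the first 10,000 articles of each class went to the training set, so nothing was left of these two classes for the test set. In step 1, the sizes of the training and test sets can therefore be chosen according to how many articles lottery and constellation actually have. Or, more bluntly, since these two classes do not meet the size requirement, they can simply be dropped.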
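One way to size the split per class, as the correction suggests, is to take a fraction of each class rather than a fixed first 10,000. This is a minimal Python 3 sketch under stated assumptions: split_train_test, train_fraction and max_per_split are hypothetical names, and the class sizes 3000 and 25000 are made up for the demonstration, since the post does not give the real sizes.

```python
def split_train_test(items, train_fraction=0.5, max_per_split=10000):
    """Split one class's items so that small classes still reach both sets."""
    # Cap both sides at max_per_split, like the 10000-per-class limit in step 1.
    n_train = min(int(len(items) * train_fraction), max_per_split)
    n_test = min(len(items) - n_train, max_per_split)
    return items[:n_train], items[n_train:n_train + n_test]


# Made-up class sizes: one class below 10000 files and one above 20000.
small = ["doc%d" % i for i in range(3000)]
large = ["doc%d" % i for i in range(25000)]

for name, docs in [("small", small), ("large", large)]:
    train, test = split_train_test(docs)
    print(name, len(train), len(test))
```

With these defaults a large class still contributes 10,000 samples to each set, while a small class is divided in half instead of going entirely to the training set.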