Using fastText for Text Classification (Java)
Text classification, also called automatic text classification, is the process by which a computer maps a piece of text carrying information to one or more predefined categories or topic classes; the algorithmic model that performs this process is called a classifier. Haha, I borrowed that sentence from a more experienced author's article; the original post covers the history of text classification and a number of classification algorithms in detail, and is worth reading if you are interested.
This post focuses on using FastText to do text classification. If you want to understand the underlying principles, I recommend the linked write-up, where the author explains them in great detail; my respects to the author.
Dependency
```xml
<dependency>
    <groupId>com.github.sszuev</groupId>
    <artifactId>fasttext</artifactId>
    <version>1.0.0</version>
</dependency>
```

If you just want the jar, search for fastText on the Maven Repository site, which lists all the published versions.
Training the model
The code below trains the model and saves it to the specified path. A single training run produces the model, which can then be used to make predictions.
```java
try {
    Main.train(new String[]{
            "supervised",
            "-input", "./parameter/category_train.txt",  // labeled training data
            "-output", "./parameter/commodity_model",     // path and name of the trained model
            "-dim", "64",                                  // word vector dimension
            "-lr", "0.5",                                  // learning rate
            "-wordNgrams", "2",                            // max length of word n-gram
            "-minCount", "1",                              // minimal word frequency
            "-bucket", "10000000"                          // number of buckets
    });
} catch (Exception e) {
    e.printStackTrace();
}
```

So what does the labeled training data look like? Below is the test/training set taken from the official site. Every label starts with `__label__`; the prefix is configurable, but this is the default. The sample data is English; if you are working with Chinese text, consider word segmentation and stop-word filtering first (a rough sketch follows the sample data).
```
__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
__label__storage-method __label__equipment __label__bread What's the purpose of a bread box?
__label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home?
__label__chocolate American equivalent for British chocolate terms
__label__baking __label__oven __label__convection Fan bake vs bake
__label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise Regulation and balancing of readymade packed mayonnaise and other sauces
__label__tea What kind of tea do you boil for 45minutes?
__label__baking __label__baking-powder __label__baking-soda __label__leavening How long can batter sit before chemical leaveners lose their power?
__label__food-safety __label__soup Can I RE-freeze chicken soup after it has thawed?
```
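If your corpus is Chinese, each training line still has to be the label prefix followed by space-separated tokens, which means segmenting the text and optionally dropping stop words before writing the file. The sketch below only illustrates that preparation step: the `segmentAndFilter` tokenizer, the stop-word list, and the sample sentences are placeholders of mine, and in practice you would plug in a real segmenter (HanLP, jieba, and so on).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class TrainingDataBuilder {

    // Tiny stop-word list, for illustration only.
    private static final Set<String> STOP_WORDS = Set.of("的", "了", "和");

    // Placeholder segmenter: splits on whitespace and drops stop words.
    // Replace with a real Chinese segmenter for actual use.
    private static List<String> segmentAndFilter(String text) {
        return Arrays.stream(text.split("\\s+"))
                .filter(token -> !token.isEmpty() && !STOP_WORDS.contains(token))
                .collect(Collectors.toList());
    }

    // Builds one fastText training line, e.g. "__label__sauce 土豆 淀粉 ...".
    private static String toTrainingLine(String label, String rawText) {
        return "__label__" + label + " " + String.join(" ", segmentAndFilter(rawText));
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = List.of(
                toTrainingLine("sauce", "土豆 淀粉 对 奶酪 酱 的 影响"),
                toTrainingLine("tea", "这种 茶 需要 煮 45 分钟"));
        // Same file that is later passed to the -input argument.
        Files.write(Paths.get("./parameter/category_train.txt"), lines, StandardCharsets.UTF_8);
    }
}
```

If you train with a custom -label prefix (see the parameter list below), the prefix written into the training file has to match it.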
There are actually many more parameters that can be set when training the model; the full list is below. There are a lot of them, so if you want to understand what every one of them means, it is worth spending some time on the official documentation.

```
The following arguments are mandatory:
  -input           training file path
  -output          output file path

The following arguments are optional:
  -lr              learning rate [0.05]
  -lrUpdateRate    change the rate of updates for the learning rate [100]
  -dim             size of word vectors [100]             // this sets the word vector dimension
  -ws              size of the context window [5]
  -epoch           number of epochs [5]
  -minCount        minimal number of word occurences [1]  // minimal word frequency
  -neg             number of negatives sampled [5]
  -wordNgrams      max length of word ngram [1]
  -loss            loss function {ns, hs, softmax} [ns]
  -bucket          number of buckets [2000000]
  -minn            min length of char ngram [3]
  -maxn            max length of char ngram [6]
  -thread          number of threads [12]
  -t               sampling threshold [0.0001]
  -label           labels prefix [__label__]              // here you can use a label prefix of your own
```

The trained model is written to commodity_model.bin, and a word vector file commodity_model.vec is generated alongside it; its contents look like this.
```
37899 100
</s> -0.55874 -0.10935 0.17115 0.2388 0.2805 0.11883 0.10674 -0.13214 0.32224 -0.54488 ...
新款 -0.36792 -0.052885 0.2175 0.25894 0.14599 0.67911 0.0028615 0.13432 0.26988 0.44109 ...
款 0.10719 -0.047097 0.035313 -0.11604 -0.11128 0.098889 0.07338 -0.062592 -0.36591 0.33696 ...
斤 -0.18143 -0.40266 0.036 0.009181 0.12026 -0.29335 -0.15063 -0.078739 -0.1841 0.038445 ...
```

The first line gives the vocabulary size (37899) and the vector dimension (100); every following line is a word followed by its vector (the vectors above are truncated for readability). A small sketch for parsing this file follows.
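Because the .vec file is plain text, it can be inspected without the fastText library at all. The following is a minimal sketch under the assumption that the file sits at the path used earlier; the class name and the map-based storage are mine.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class VecFileReader {

    public static void main(String[] args) throws IOException {
        Map<String, float[]> vectors = new HashMap<>();

        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("./parameter/commodity_model.vec"), StandardCharsets.UTF_8)) {

            // Header line: "<vocabulary size> <vector dimension>", e.g. "37899 100".
            String[] header = reader.readLine().trim().split("\\s+");
            int vocabSize = Integer.parseInt(header[0]);
            int dim = Integer.parseInt(header[1]);
            System.out.println("words=" + vocabSize + ", dim=" + dim);

            // Every other line: a word followed by `dim` float values.
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length != dim + 1) {
                    continue; // skip malformed lines
                }
                float[] vector = new float[dim];
                for (int i = 0; i < dim; i++) {
                    vector[i] = Float.parseFloat(parts[i + 1]);
                }
                vectors.put(parts[0], vector);
            }
        }
        System.out.println("loaded vectors: " + vectors.size());
    }
}
```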
Prediction

With the model ready, it is time to predict. I took 25,253 records that had already been manually labeled and used them to verify the model's accuracy.
```java
// Load the model
FastText fastText = FastText.load("./parameter/commodity_model.bin");
// Predict
Map<String, Float> map = fastText.predictLine("some input text", 1);
```

The second argument, 1, means only one class label is returned; pass whatever number of labels you want. The returned map holds the labels and their scores. A sketch of the full accuracy check follows, and the results of my run are described after that.
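The original evaluation code is not shown here, so the following is only a rough sketch of how such a check could look. The test file path and its format (gold label, tab, text) are assumptions of mine, and so is the import package (check it against the com.github.sszuev:fasttext jar); FastText.load and predictLine are the calls used above.

```java
// NOTE: the package name below is an assumption; verify it against the fasttext jar.
import cc.fasttext.FastText;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;

public class AccuracyCheck {

    public static void main(String[] args) throws IOException {
        FastText fastText = FastText.load("./parameter/commodity_model.bin");

        long total = 0;
        long correct = 0;

        // Hypothetical test file: one record per line, "goldLabel<TAB>text".
        // The gold label must be written in the same form (e.g. with the __label__
        // prefix) as the keys returned by predictLine.
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("./parameter/category_test.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length != 2) {
                    continue;
                }
                // Ask for the single most likely label, as in the snippet above.
                Map<String, Float> prediction = fastText.predictLine(parts[1], 1);
                total++;
                if (prediction.containsKey(parts[0])) {
                    correct++;
                }
            }
        }
        System.out.printf("accuracy: %.1f%% (%d / %d)%n",
                100.0 * correct / total, correct, total);
    }
}
```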
The run reported an accuracy of 86.0%. Going through the records where the prediction disagreed with the manual label, in at least 700 to 800 of them the predicted label was actually the better fit, and in another 500 to 600 both the prediction and the manual label were acceptable, so the real accuracy should be above 90%. That basically meets the bar for going into production, and the model is usable.