生活随笔
NLP: Keyword Search
Keyword Search
"""功能:實現關鍵詞搜索可以嘗試修改/調試/升級的部分是:文本預處理步驟: 你可以使用很多不同的方法來使得文本數據變得更加清潔自制的特征: 相處更多的特征值表達方法(關鍵詞全段重合數量,重合比率,等等)更好的回歸模型: 根據之前的課講的Ensemble方法,把分類器提升到極致版本1.0日期:10.10.2019
"""import numpy
as np
import pandas
as pd
from sklearn
.ensemble
import RandomForestRegressor
, BaggingRegressor
from nltk
.stem
.snowball
import SnowballStemmer
from sklearn
.model_selection
import cross_val_score
import matplotlib
.pyplot
as pltdf_train
= pd
.read_csv
('C:/Users/Administrator/Desktop/七月在線課程下載/word2vec/input/train.csv',encoding
="ISO-8859-1")
df_test
= pd
.read_csv
('C:/Users/Administrator/Desktop/七月在線課程下載/word2vec/input/test.csv',encoding
="ISO-8859-1")
df_desc
= pd
.read_csv
('C:/Users/Administrator/Desktop/七月在線課程下載/word2vec/input/product_descriptions.csv')
df_all
= pd
.concat
((df_train
, df_test
), axis
=0, ignore_index
=True)
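The concat step is worth pausing on: stacking test under train means every feature is computed once, consistently, for both sets. A minimal sketch with made-up toy frames (the rows below are illustrative, not from the Home Depot data):

```python
import pandas as pd

# Tiny stand-ins for the real train/test files (hypothetical rows).
df_train = pd.DataFrame({"id": [1, 2], "product_uid": [100, 101],
                         "search_term": ["angle bracket", "deck over"],
                         "relevance": [3.0, 2.5]})
df_test = pd.DataFrame({"id": [3], "product_uid": [100],
                        "search_term": ["metal bracket"]})

# axis=0 stacks rows; ignore_index=True relabels them 0..N-1.
# Columns missing from one frame (here 'relevance') become NaN.
df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)
print(df_all.index.tolist())              # [0, 1, 2]
print(int(df_all['relevance'].isna().sum()))  # 1
```

The fresh 0..N-1 labels are what make the later `.loc`-based re-split into train and test rows work.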
```python
df_all = pd.merge(df_all, df_desc, how='left', on='product_uid')
```
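`how='left'` matters here: every query row survives the merge even when a product has no description. A small sketch with hypothetical tables:

```python
import pandas as pd

df_all = pd.DataFrame({"product_uid": [100, 101, 102],
                       "search_term": ["a", "b", "c"]})
df_desc = pd.DataFrame({"product_uid": [100, 102],
                        "product_description": ["steel bracket", "deck paint"]})

# A left join keeps all rows of df_all; products without a description
# get NaN (an inner join would silently drop product 101).
merged = pd.merge(df_all, df_desc, how='left', on='product_uid')
print(len(merged))                                  # 3
print(int(merged['product_description'].isna().sum()))  # 1
```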
```python
stemmer = SnowballStemmer('english')

def str_stemmer(s):
    """
    :param s: input string
    :return: the string with every word stemmed
    """
    return " ".join([stemmer.stem(word) for word in s.lower().split()])
```
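`str_stemmer` lowercases, splits on whitespace, stems each token, and rejoins, so "Brackets" and "bracket" compare equal later. The same tokenize-stem-rejoin pattern can be sketched without NLTK using a crude suffix-stripping stand-in (the rules below are illustrative only, not Snowball's actual algorithm):

```python
def toy_stem(word):
    # Crude suffix stripping; a stand-in for SnowballStemmer('english').stem.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def str_stemmer(s):
    # Lowercase, split on whitespace, stem each token, rejoin with spaces.
    return " ".join(toy_stem(word) for word in s.lower().split())

print(str_stemmer("Angle Brackets"))  # "angle bracket"
```

Because query, title, and description all pass through the same function, surface variations in word form stop mattering for the overlap features computed next.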
```python
df_all['search_term'] = df_all['search_term'].map(lambda x: str_stemmer(x))
df_all['product_title'] = df_all['product_title'].map(lambda x: str_stemmer(x))
df_all['product_description'] = df_all['product_description'].map(lambda x: str_stemmer(x))
```
```python
# Feature: number of words in the search query.
df_all['len_of_query'] = df_all['search_term'].map(lambda x: len(x.split())).astype(np.int64)
```
```python
def str_common_word(str1, str2):
    # Count how many words of str1 occur (as substrings) in str2.
    return sum(int(str2.find(word) >= 0) for word in str1.split())
```
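One subtlety: `str.find` is a substring test, not a whole-word test, so short query words also match inside longer words. That is one of the "hand-crafted features" the docstring invites you to refine. The behavior, on made-up strings:

```python
def str_common_word(str1, str2):
    # Count how many whitespace-separated words of str1 appear anywhere in str2.
    return sum(int(str2.find(word) >= 0) for word in str1.split())

# "angle" and "bracket" both occur in the title; "metal" does not.
print(str_common_word("metal angle bracket", "steel angle bracket 4 in"))  # 2

# Substring matching means partial hits also count:
print(str_common_word("cat", "category of paints"))  # 1
```

A whole-word variant (e.g. `word in str2.split()`) would avoid the partial-match counts, at the cost of missing hyphenated or concatenated forms.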
```python
df_all['commons_in_title'] = df_all.apply(
    lambda x: str_common_word(x['search_term'], x['product_title']), axis=1)
df_all['commons_in_desc'] = df_all.apply(
    lambda x: str_common_word(x['search_term'], x['product_description']), axis=1)

df_all = df_all.drop(['search_term', 'product_title', 'product_description'], axis=1)
```
```python
# Split back into train and test; after ignore_index=True the labels
# 0..n_train-1 are the train rows and the rest are test rows.
df_train = df_all.loc[df_train.index]
df_test = df_all.loc[df_train.shape[0]:]
test_ids = df_test['id']

y_train = df_train['relevance'].values
X_train = df_train.drop(['id', 'relevance'], axis=1).values
X_test = df_test.drop(['id', 'relevance'], axis=1).values
```
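The re-split leans on a pandas detail: `.loc` slices by label and *includes* the end label, unlike positional `.iloc` slicing. Since `ignore_index=True` made the labels coincide with positions 0..N-1, `df_all.loc[n_train:]` starts exactly at the first test row. A toy illustration (the five-row frame is made up):

```python
import pandas as pd

df_all = pd.DataFrame({"id": [1, 2, 3, 4, 5]})  # labels 0..4, as after ignore_index
n_train = 3

train_part = df_all.loc[df_all.index[:n_train]]  # labels 0, 1, 2
test_part = df_all.loc[n_train:]                 # labels 3, 4 (label 3 included)

print(train_part.index.tolist())  # [0, 1, 2]
print(test_part.index.tolist())   # [3, 4]

# .loc includes the end label; .iloc stops before the end position.
print(len(df_all.loc[0:2]), len(df_all.iloc[0:2]))  # 3 2
```

If the index were not the default 0..N-1 range, this label-based split would select the wrong rows, which is why the `ignore_index=True` in the earlier concat is load-bearing.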
```python
# 5-fold cross-validation over max_depth to pick the tree depth.
params = [2, 6, 7, 9]
test_scores = []
for param in params:
    regressor = RandomForestRegressor(n_estimators=30, max_depth=param)
    test_score = np.sqrt(-cross_val_score(regressor, X_train, y_train,
                                          cv=5, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))

plt.plot(params, test_scores)
plt.title("Param vs CV Error")
```
```python
rf = RandomForestRegressor(n_estimators=30, max_depth=6)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

pd.DataFrame({"id": test_ids, "relevance": y_pred}).to_csv('submission.csv', index=False)
```