Sarcasm Text Detection Based on Logistic Regression
Lab report for "Python Programming, Fall 2021"
Challenge 5: Logistic Regression for Sarcasm Text Detection
I. Experiment Objectives and Requirements
II. Experiment Environment (tools, configuration, etc.)
III. Experiment Content (plan, steps, design approach, etc.)
Experiment plan: This experiment extracts features from the text, fits a logistic regression model on the extracted features, and finally evaluates the prediction accuracy of the trained model.
Experiment steps:
Design approach:
First, understand the logistic regression model, the bag-of-words model, and what feature vectorization does and how it is done;
Then drop every row whose comment field is missing;
Next, use train_test_split to split train_df['comment'] into training and validation sets at the default ratio, and extract features with TfidfVectorizer;
After that, build a logistic regression model with LogisticRegression, train it, and evaluate the result with accuracy_score;
Finally, improve the model by training on the two features comment and subreddit together (a minimal sketch of this overall flow appears after this list).
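To make the flow concrete before the full source code in Section V, here is a minimal self-contained sketch on made-up toy comments; the toy data and variable names below are illustrative assumptions, not part of the original experiment:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy comments; the real experiment uses train-balanced-sarcasm.csv
comments = ["oh great, another monday",
            "i love this song",
            "sure, because that always works",
            "thanks for the help",
            "wow, what a brilliant idea",
            "this tutorial was useful"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = sarcastic, 0 = normal

X_train, X_valid, y_train, y_valid = train_test_split(
    comments, labels, test_size=0.5, random_state=17, stratify=labels)

# Vectorize the text with tf-idf, then classify with logistic regression
pipe = Pipeline([('tf_idf', TfidfVectorizer()),
                 ('logit', LogisticRegression())])
pipe.fit(X_train, y_train)
print(accuracy_score(y_valid, pipe.predict(X_valid)))  # meaningless on toy data
```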
IV. Experiment Results
Figure 1: Header and first 10 rows of the sarcasm dataset
Figure 2: Information about the data columns
Figure 3: Class counts after grouping the data by label
Figure 4: Histogram of comment lengths after smoothing with the log1p function
Figure 5: Word cloud of sarcastic comments
Figure 6: Word cloud of non-sarcastic comments
Figure 7: Subreddits ranked by number of sarcastic comments
Figure 8: Training results after parameter tuning
Figure 9: Confusion matrix of the true and predicted labels on the validation data, drawn with plot_confusion_matrix
Figure 10: Feature weights of the model
Figure 11: Prediction results using the combined comment and subreddit features
V. Appendix: Source Code
Download the data
```
!wget -nc "http://labfile.oss.aliyuncs.com/courses/1283/train-balanced-sarcasm.csv.zip"
!unzip -o "train-balanced-sarcasm.csv.zip"
```
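If wget and unzip are not available in the environment, a standard-library alternative (an assumed convenience, not part of the original report) would be:

```python
# Download and extract the dataset with the Python standard library only.
import urllib.request
import zipfile

url = "http://labfile.oss.aliyuncs.com/courses/1283/train-balanced-sarcasm.csv.zip"
urllib.request.urlretrieve(url, "train-balanced-sarcasm.csv.zip")
with zipfile.ZipFile("train-balanced-sarcasm.csv.zip") as zf:
    zf.extractall()  # produces train-balanced-sarcasm.csv in the working directory
```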
Load the data into a pandas DataFrame

```python
import pandas as pd
from sklearn.model_selection import train_test_split

train_df = pd.read_csv('train-balanced-sarcasm.csv')
# Inspect the columns; some rows have no comment at all, and those are invalid data
train_df.info()
# Also preview the first 10 rows of the table
train_df.head(10)
# Drop the rows whose comment is missing
train_df.dropna(subset=['comment'], inplace=True)
train_df['label'].value_counts()
"""
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    train_data, train_target, test_size=0.4, random_state=0, stratify=train_target)
train_target: the targets to be split alongside the samples
test_size: proportion of the test split (an absolute count if an integer);
           defaults to 0.25, i.e. a 75%/25% train/test split
random_state: the random seed
"""
train_texts, valid_texts, y_train, y_valid = train_test_split(
    train_df['comment'], train_df['label'], random_state=17)
```
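As a quick illustration of the dropna call above (the mini-table is invented for this note), only rows whose comment is missing are removed; other columns may still contain NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table: row 1 lacks a comment, row 0 lacks a subreddit
toy = pd.DataFrame({'comment': ['ok', np.nan, 'fine'],
                    'subreddit': [np.nan, 'r/a', 'r/b']})
print(toy.dropna(subset=['comment']))  # keeps rows 0 and 2; row 0's NaN subreddit survives
```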
Visualize the data by the length of each comment

```python
import numpy as np
from matplotlib import pyplot as plt

train_df.loc[train_df['label'] == 1, 'comment'].str.len().apply(np.log1p).hist(label='sarcastic', alpha=.5)
train_df.loc[train_df['label'] == 0, 'comment'].str.len().apply(np.log1p).hist(label='normal', alpha=.5)
plt.legend()
```
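The log1p transform is used because comment lengths are heavy-tailed; log1p(x) = log(1 + x) compresses the long tail while remaining defined at length 0. A tiny check:

```python
import numpy as np

# log1p maps 0 -> 0 and grows slowly: each 10x jump in length adds ~2.3
lengths = np.array([0, 9, 99, 999])
print(np.log1p(lengths))  # [0.  2.30258509  4.60517019  6.90775528]
```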
Use word clouds to show the characteristic words more intuitively

```python
!pip install wordcloud  # install the required module
from wordcloud import WordCloud, STOPWORDS

wordcloud = WordCloud(background_color='black', stopwords=STOPWORDS,
                      max_words=200, max_font_size=100,
                      random_state=17, width=800, height=400)
# plt.figure(figsize=(width, height))
plt.figure(figsize=(32, 24))
wordcloud.generate(str(train_df.loc[train_df['label'] == 1, 'comment']))
plt.imshow(wordcloud)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Extract text features with tf-idf
tf_idf = TfidfVectorizer(ngram_range=(1, 2), max_features=50000, min_df=2)
# Build the logistic regression model
logit = LogisticRegression(C=1, n_jobs=4, solver='lbfgs',
                           random_state=17, verbose=1)
# Wrap the two steps in an sklearn Pipeline
tfidf_logit_pipeline = Pipeline([('tf_idf', tf_idf), ('logit', logit)])

tfidf_logit_pipeline.fit(train_texts, y_train)
valid_pred = tfidf_logit_pipeline.predict(valid_texts)

# Count, mean and sum of the label per subreddit, sorted by sarcastic-comment count
sub_df = train_df.groupby('subreddit')['label'].agg([np.size, np.mean, np.sum])
sub_df.sort_values(by='sum', ascending=False).head(10)
```
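A side note on ngram_range=(1, 2) in the vectorizer above: the vocabulary then contains both unigrams and bigrams. A toy demonstration (the mini-corpus is invented here; get_feature_names_out requires scikit-learn >= 1.0):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

mini = ["yeah right", "right on"]
vec = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
vec.fit(mini)
print(vec.get_feature_names_out())
# ['on' 'right' 'right on' 'yeah' 'yeah right']
```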
Build the confusion matrix
```python
def plot_confusion_matrix(actual, predicted, classes, normalize=False,
                          title='Confusion matrix', figsize=(7, 7),
                          cmap=plt.cm.Blues, path_to_save_fig=None):
    """This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`."""
    import itertools
    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(actual, predicted).T
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.figure(figsize=figsize)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('Predicted label')
    plt.xlabel('True label')
    if path_to_save_fig:
        plt.savefig(path_to_save_fig, dpi=300, bbox_inches='tight')
```

Plot the confusion matrix
```python
plot_confusion_matrix(y_valid, valid_pred,
                      tfidf_logit_pipeline.named_steps['logit'].classes_,
                      figsize=(8, 8))
```
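The appendix does not include the snippet behind Figure 10 (feature weights); a plausible sketch, assuming the fitted pipeline above and scikit-learn >= 1.0, could look like this:

```python
import numpy as np

# Pair each tf-idf feature name with its logistic regression coefficient
feature_names = tfidf_logit_pipeline.named_steps['tf_idf'].get_feature_names_out()
coefs = tfidf_logit_pipeline.named_steps['logit'].coef_[0]
top = np.argsort(coefs)[-10:]  # the ten features pushing hardest toward "sarcastic"
for idx in reversed(top):
    print(f"{feature_names[idx]}: {coefs[idx]:.3f}")
```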
Model improvement

Next, we want to improve the model further, so we add a subreddit feature and split it the same way. Note that this split must use the same random_state, so that the subreddit rows stay aligned with the comment split above; a toy demonstration of this alignment follows.
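Why the same random_state guarantees alignment: train_test_split shuffles with a seeded generator, so two equal-length columns split with identical seeds end up with identical row indices. A small check (the mini-frame is invented for this note):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'comment': list('abcd'), 'subreddit': list('wxyz')})
c_tr, c_va = train_test_split(df['comment'], random_state=17)
s_tr, s_va = train_test_split(df['subreddit'], random_state=17)
print((c_tr.index == s_tr.index).all())  # True: both splits select the same rows
```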
```python
subreddits = train_df['subreddit']
train_subreddits, valid_subreddits = train_test_split(subreddits, random_state=17)
tf_idf_texts = tf_idf  # reuse the comment vectorizer defined above
tf_idf_subreddits = TfidfVectorizer(ngram_range=(1, 1))
```

comment feature vectors
```python
X_train_texts = tf_idf_texts.fit_transform(train_texts)
X_valid_texts = tf_idf_texts.transform(valid_texts)
```

subreddit feature vectors
```python
X_train_subreddits = tf_idf_subreddits.fit_transform(train_subreddits)
X_valid_subreddits = tf_idf_subreddits.transform(valid_subreddits)
```

Training results
```python
from scipy.sparse import hstack

# Concatenate the comment and subreddit feature matrices column-wise
X_train = hstack([X_train_texts, X_train_subreddits])
X_valid = hstack([X_valid_texts, X_valid_subreddits])

logit.fit(X_train, y_train)
valid_pred = logit.predict(X_valid)
```
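What hstack does above, in miniature (the matrices are invented for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

a = csr_matrix(np.array([[1, 0], [0, 2]]))  # e.g. comment features (2 rows)
b = csr_matrix(np.array([[3], [0]]))        # e.g. subreddit features (same 2 rows)
print(hstack([a, b]).toarray())  # [[1 0 3]
                                 #  [0 2 0]]
```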
Compute the prediction accuracy of the result

```python
from sklearn.metrics import accuracy_score

accuracy_score(y_valid, valid_pred)
```

Summary