Datawhale: Introduction to NLP from Scratch, News Text Classification, Task02

Published: 2023/12/20

Task01 analyzed the competition problem. This task moves on to data loading and data analysis, both done with the Pandas library.

1 Data Loading

Given the format of the competition data, train_set.csv can be loaded with read_csv:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    # Read the full dataset
    train_df = pd.read_csv('./data/data45216/train_set.csv', sep='\t')
    train_df.shape

    # Read only part of the data
    train_df = pd.read_csv('./data/data45216/train_set.csv', sep='\t', nrows=100)
    train_df.shape

Parameters: sep is the column delimiter, here '\t' (tab); nrows=100 reads only the first 100 rows.
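To see what sep and nrows do without the competition file, here is a minimal sketch that reads a small in-memory string. The three rows in raw are made up for illustration and are not taken from train_set.csv.

```python
import io

import pandas as pd

# Toy tab-separated data standing in for train_set.csv (values are invented).
raw = "label\ttext\n2\t57 44 66 3750\n11\t4464 486 900 25\n3\t7346 4068 648\n"

# sep="\t" splits columns on tabs; nrows=2 stops after two data rows.
sample_df = pd.read_csv(io.StringIO(raw), sep="\t", nrows=2)

print(sample_df.shape)
print(list(sample_df.columns))
```

Only two of the three data rows are loaded, which is why reading with nrows is a cheap way to inspect a large file before loading all of it.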

Pandas can also read data in other formats, such as SQL, Excel, table, HTML, and JSON.

2 Data Analysis

2.1 Computing the length of each news text

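In the competition data, the characters in each line are separated by spaces, so the length of each text can be obtained by counting its space-separated tokens directly.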

    train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split(' ')))
    print(train_df['text_len'].describe())

The output shows that the mean text length is about 907, the shortest text has length 2, and the longest has length 57921.
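The length computation can be checked on a tiny example. The sketch below uses a made-up three-row frame (toy_df is a name chosen for this illustration) and the vectorized .str accessor as an alternative to apply with a lambda.

```python
import pandas as pd

# Invented texts in the same space-separated format as the competition data.
toy_df = pd.DataFrame({"text": ["57 44 66 3750", "4464 486", "7346 4068 648 900 25"]})

# Split on spaces and take the length of each resulting list.
toy_df["text_len"] = toy_df["text"].str.split(" ").str.len()

# describe() reports count, mean, std, min, quartiles, and max.
stats = toy_df["text_len"].describe()
print(stats[["mean", "min", "max"]])
```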

Plot a histogram of the text lengths:

    _ = plt.hist(train_df['text_len'], bins=50)
    plt.xlabel('Text char count')
    plt.title('Histogram of char count')

Output: [figure: histogram of text length, x-axis "Text char count"]

2.2 Inspecting the class distribution

Plot a bar chart to see how many news items fall into each class.

    train_df['label'].value_counts().plot(kind='bar')
    plt.title('News class count')
    plt.xlabel('category')

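The output shows that most news items fall into classes 0, 1, and 2, and that class 13 has the fewest. The class labels are: {'Technology': 0, 'Stocks': 1, 'Sports': 2, 'Entertainment': 3, 'Politics': 4, 'Society': 5, 'Education': 6, 'Finance': 7, 'Home': 8, 'Games': 9, 'Real Estate': 10, 'Fashion': 11, 'Lottery': 12, 'Horoscope': 13}.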
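Besides the bar chart, the imbalance can be read off as numbers. This is a small sketch on an invented label column; the ratio between the largest and smallest class is one simple summary, not a metric defined by the competition.

```python
import pandas as pd

# Invented labels: class 0 dominates, classes 2 and 13 are rare.
labels = pd.Series([0, 0, 0, 1, 1, 2, 13], name="label")

counts = labels.value_counts()               # absolute count per class
share = labels.value_counts(normalize=True)  # fraction of all samples per class

# Ratio between the most and least frequent class.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(share.round(3))
print(imbalance_ratio)
```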

2.3 Character distribution

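Next, count how often each character occurs: join all texts together, split the result into characters, and count each one. The counts show that 3750, 900, and 648 occur very frequently, so they are presumably punctuation marks.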

    from collections import Counter

    # Join all texts into one long string
    all_lines = ' '.join(list(train_df['text']))
    print(len(all_lines))
    # Count the occurrences of each character
    word_count = Counter(all_lines.split(" "))
    # Sort by count, descending
    word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)
    print(len(word_count))
    print(word_count[0])
    print(word_count[-1])

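Using a lambda function, first deduplicate the characters within each text of train_df['text'], then join and count. Each character is now counted at most once per text, so the result is the number of texts in which it appears: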

    train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(list(set(x.split(' ')))))
    all_lines = ' '.join(list(train_df['text_unique']))
    word_count = Counter(all_lines.split(' '))
    word_count = sorted(word_count.items(), key=lambda d: int(d[1]), reverse=True)
    print(len(word_count))
    print(word_count[0])
    print(word_count[-1])
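The difference between the two counts above is easier to see on a toy corpus. In this sketch, docs, term_freq, and doc_freq are names invented for the illustration: the first counter counts every occurrence, the second counts each character once per text.

```python
from collections import Counter

# Three invented texts in the competition's space-separated format.
docs = ["3750 57 57 900", "57 3750 3750", "648 3750"]

# Total occurrences of each character across all texts.
term_freq = Counter(tok for doc in docs for tok in doc.split(" "))

# Number of texts containing each character (set() removes repeats within a text).
doc_freq = Counter(tok for doc in docs for tok in set(doc.split(" ")))

print(term_freq["3750"], doc_freq["3750"])
print(doc_freq["3750"] / len(docs))  # fraction of texts containing "3750"
```

A character that appears in nearly every text, as "3750" does here, is a plausible punctuation candidate, which is the reasoning behind the guess about 3750, 900, and 648.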

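Conclusions:

1. Each news item has a little over 900 characters on average, and a few are much longer, so truncation may be needed;

2. The class distribution is imbalanced, which can hurt model accuracy.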
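As a rough illustration of the truncation mentioned in the first conclusion, the sketch below keeps only the first max_len tokens of a text. The function name truncate and the cutoff value are hypothetical; how to choose the cutoff (for example from a length quantile) is left open here.

```python
def truncate(text, max_len):
    """Keep at most the first max_len space-separated tokens of text."""
    tokens = text.split(" ")
    return " ".join(tokens[:max_len])


# Texts shorter than the cutoff are returned unchanged.
print(truncate("57 44 66 3750 12", 3))
print(truncate("57 44", 3))
```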

3 Exercises

(1) Assuming that characters 3750, 900, and 648 are sentence punctuation, how many sentences does each news article contain on average?

Method 1: using a for loop

    flaglist1 = []
    flaglist2 = []
    flaglist3 = []
    for i in range(train_df['text'].shape[0]):
        # Split once, then count each assumed punctuation mark
        tokens = train_df['text'].loc[i].split(' ')
        flag1, flag2, flag3 = tokens.count('3750'), tokens.count('900'), tokens.count('648')
        flaglist1.append(flag1)
        flaglist2.append(flag2)
        flaglist3.append(flag3)
    # Punctuation marks per text = sum of the three counts
    flaglist = list(map(lambda x: x[0] + x[1] + x[2], zip(flaglist1, flaglist2, flaglist3)))
    train_df['flag_freq'] = flaglist
    train_df['flag_freq'].mean()

Method 2: using Counter

    from collections import Counter

    train_df['text_freq'] = train_df['text'].apply(lambda x: ' '.join(list(x.split(' '))))
    print(len(train_df['text']))

    strlist1 = []
    strlist2 = []
    strlist3 = []
    for i in range(train_df['text_freq'].shape[0]):
        all_lines = train_df['text_freq'].loc[i]
        # Count the occurrences of each character in this text
        word_count = Counter(all_lines.split(' '))
        # print(word_count['3750'], word_count['900'], word_count['648'])
        strlist1.append(word_count['3750'])
        strlist2.append(word_count['900'])
        strlist3.append(word_count['648'])
    flaglist = list(map(lambda x: x[0] + x[1] + x[2], zip(strlist1, strlist2, strlist3)))
    train_df['flag_freq'] = flaglist
    train_df['flag_freq'].mean()
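Both methods above count punctuation marks per text and average the result. The same idea can be sketched more compactly with a helper function and apply. PUNCT, count_sentences, and toy_df are names invented here, and, like the methods above, the sketch treats the number of punctuation marks as the number of sentences, which undercounts a text whose last sentence has no closing mark.

```python
import pandas as pd

# Characters assumed (per the exercise statement) to be sentence punctuation.
PUNCT = {"3750", "900", "648"}


def count_sentences(text):
    """Count tokens that are assumed punctuation marks."""
    return sum(tok in PUNCT for tok in text.split(" "))


# Invented texts: 3, 1, and 0 punctuation marks respectively.
toy_df = pd.DataFrame({"text": ["57 3750 44 900 66 648", "12 13 3750", "5 6 7"]})
toy_df["sent_count"] = toy_df["text"].apply(count_sentences)

mean_sentences = toy_df["sent_count"].mean()
print(toy_df["sent_count"].tolist(), mean_sentences)
```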

(2) Find the most frequent character in each news class

Method 1: grouping with groupby

    from collections import Counter

    groupdata = train_df.groupby('label')
    print(groupdata.size())

    # Most frequent character in each news class
    max_freq = []
    for i in range(len(groupdata.size())):
        df = groupdata.get_group(i)['text'].apply(lambda x: ' '.join(list(x.split(' '))))
        all_lines = ' '.join(list(df))
        word_count = Counter(all_lines.split(' '))
        # Drop the assumed punctuation marks
        del word_count['3750']
        del word_count['900']
        del word_count['648']
        word_count = sorted(word_count.items(), key=lambda d: int(d[1]), reverse=True)
        # With punctuation removed, index 0 holds the most frequent character
        print(word_count[0][0])
        max_freq.append(word_count[0][0])

Method 2: using Pandas categorical data

    from collections import Counter

    # Bin the integer labels into 14 string-valued categories
    train_df['new_label'] = pd.cut(train_df['label'],
                                   [-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
                                   labels=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13'])
    train_df.set_index('new_label').sort_index(ascending=False).head()

    max_freq = []
    for i in range(14):
        df = train_df[train_df['new_label'] == str(i)]['text'].apply(lambda x: ' '.join(list(x.split(' '))))
        all_lines = ' '.join(list(df))
        word_count = Counter(all_lines.split(' '))
        # Drop the assumed punctuation marks
        del word_count['3750']
        del word_count['900']
        del word_count['648']
        word_count = sorted(word_count.items(), key=lambda d: int(d[1]), reverse=True)
        # With punctuation removed, index 0 holds the most frequent character
        print(word_count[0][0])
        max_freq.append(word_count[0][0])
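The per-class loop can also be written as a single groupby followed by apply. The sketch below is one possible variant on invented data, not the original solution; top_token, PUNCT, and toy_df are hypothetical names, and Counter.most_common(1) returns the single most frequent remaining character.

```python
from collections import Counter

import pandas as pd

# Characters assumed to be punctuation, skipped when counting.
PUNCT = {"3750", "900", "648"}


def top_token(texts):
    """Return the most frequent non-punctuation character in a group of texts."""
    counter = Counter(
        tok for text in texts for tok in text.split(" ") if tok not in PUNCT
    )
    return counter.most_common(1)[0][0]


# Invented data: two texts in class 0, one text in class 1.
toy_df = pd.DataFrame({
    "label": [0, 0, 1],
    "text": ["57 57 3750 44", "57 44 900", "12 12 12 3750 3750 3750 3750"],
})

# One result per class label.
top_by_label = toy_df.groupby("label")["text"].apply(top_token)
print(top_by_label.to_dict())
```

In class 1 the punctuation character "3750" is the most frequent token overall, so filtering it out before counting is what makes "12" the reported result.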

Question to consider: how can the class imbalance problem be addressed?
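One common option, offered here only as a possible starting point, is to weight each class inversely to its frequency so that rare classes count more in the loss. The sketch below computes such weights as n_samples / (n_classes * class_count) on invented labels; whether this actually helps on the competition data is not something this post establishes.

```python
import pandas as pd

# Invented labels: class 0 has four samples, class 1 has two.
labels = pd.Series([0, 0, 0, 0, 1, 1], name="label")

counts = labels.value_counts()

# Inverse-frequency weights: rarer classes receive larger weights.
class_weights = len(labels) / (counts.size * counts)

print(class_weights.to_dict())
```

Other directions include oversampling rare classes or undersampling frequent ones; each changes the effective training distribution and should be checked on a validation set.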
