Character-level Convolutional Networks for Text Classification
Overall structure of the paper

Historical significance of this paper:
1. It constructed several large-scale text-classification datasets, pushing text-classification research forward.
2. It proposed the CharTextCNN method; because it uses only character-level information, it can be applied to many different languages.

I. Abstract (an empirical exploration of character-level convolutional networks for text classification; the models achieve strong results)
The abstract lays out the three main contributions: first, an empirical study of the effectiveness of character-level convolutional networks; second, the construction of several large-scale text-classification datasets; third, comparisons against baseline models.

II. Introduction (character-level features can effectively extract information from raw signals such as images and speech; character-level information is also commonly used in natural-language tasks, and this paper explores it for text classification)
The introduction develops the background from two angles: convolutional neural networks and character-level features. It first reviews text classification and the effectiveness of CNNs, argues for using a character-level CNN for text classification, surveys applications of text classification in NLP, and finally discusses prior uses of character-level information.

III. Character-level Convolutional Networks (the character-level CNN model and a data-augmentation method):
This section presents the convolution formula (reproduced below) and the character quantization, which uses an alphabet of 70 characters. A figure in the paper shows the network structure: 6 convolutional layers followed by 3 fully connected layers. The section closes with the model's hyper-parameters.
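The 1-D convolution underlying the model, in (approximately) the paper's notation: for a discrete input function g(x) on [1, l] and a discrete kernel f(x) on [1, k], the convolution h(y) with stride d is

    h(y) = sum_{x=1}^{k} f(x) · g(y·d − x + c),    with offset constant c = k − d + 1.

The network stacks such convolutions (and max-pooling, defined analogously with max in place of the sum) over the one-hot character inputs.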
The data-augmentation part uses synonym replacement to enlarge the training data: the synonyms of each word are ranked by semantic similarity, and replacements are then sampled according to a probability distribution (a sketch follows below).
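A minimal sketch of this augmentation, assuming the geometric sampling described in the paper (the probability of choosing r decays as p**r, with p = q = 0.5). The `thesaurus` dict here is hypothetical: it is assumed to map a word to a non-empty list of synonyms ordered by semantic closeness.

import random

def geometric(p):
    # sample r = 0, 1, 2, ... with P(r) proportional to p**r
    r = 0
    while random.random() < p:
        r += 1
    return r

def augment(words, thesaurus, p=0.5, q=0.5):
    # words: list of tokens; thesaurus: word -> synonyms ordered by closeness
    replaceable = [i for i, w in enumerate(words) if w in thesaurus]
    r = min(geometric(p), len(replaceable))       # how many words to replace
    for i in random.sample(replaceable, r):       # which words to replace
        synonyms = thesaurus[words[i]]
        s = min(geometric(q), len(synonyms) - 1)  # which synonym; nearer ones are likelier
        words[i] = synonyms[s]
    return words

# purely illustrative thesaurus:
print(augment("the movie was very good".split(),
              {"good": ["fine", "decent", "great"], "movie": ["film", "picture"]}))

Sampling the synonym index geometrically means closer synonyms are chosen more often, which keeps the meaning of the augmented sentence near the original.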
Strengths and weaknesses of the CharTextCNN model:
Weaknesses:
1. Character-level sequences are very long, which makes the model ill-suited to classifying long texts.
2. Because only character-level information is used, the model has access to relatively little semantic information.
3. Results are comparatively poor on small corpora.
Strengths:
1. The architecture is simple and works very well on large corpora (6 convolutional layers plus 3 fully connected layers).
2. It can be applied to any language and requires no word segmentation.
3. It performs well on noisy text, since out-of-vocabulary (OOV) problems essentially never arise.

IV. Comparison Models:
This section introduces the baseline classifiers, including traditional bag-of-words models and deep-learning models.

V. Datasets and Results:
The text-classification datasets and the experimental results.
This section reports the size of each dataset and then describes each dataset in turn (the statistics are reproduced below).
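For reference, the eight large-scale datasets constructed in the paper, as reported in the paper's Table 2:

Dataset                  Classes    Train samples    Test samples
AG's News                      4          120,000           7,600
Sogou News                     5          450,000          60,000
DBPedia                       14          560,000          70,000
Yelp Review Polarity           2          560,000          38,000
Yelp Review Full               5          650,000          50,000
Yahoo! Answers                10        1,400,000          60,000
Amazon Review Full             5        3,000,000         650,000
Amazon Review Polarity         2        3,600,000         400,000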
VI. Discussion:
Discusses the experimental results and some parameter settings.
The discussion centers on comparisons between models, in particular how the results vary with dataset size and with whether the dataset's topics are semantic or syntactic in nature.

VII. Conclusion and Outlook
Summary of the paper and an outlook on future work.

Key points of the paper:
1. Convolutional neural networks can effectively extract salient features.
2. Character-level features are effective for natural language processing.
3. The CharTextCNN model.

Novel contributions:
1. A new text-classification model, CharTextCNN.
2. Several new large-scale text-classification datasets.
3. State-of-the-art or highly competitive results on multiple text-classification datasets.

Takeaways:
1. CNN-based text classification needs no knowledge of the syntactic or semantic structure of a language.
2. The experiments show that no single machine-learning model performs best across all datasets.
3. The paper analyzes, from an experimental standpoint, the applicability of character-level convolutional networks to text classification.

VIII. Code implementation
""" 數據預處理 """# encoding = 'utf-8'import os import torch import json import csvf = open("./data/AG/train.csv") datas = csv.reader(f,delimiter=',',quotechar='"') datas = list(datas)label,data,lowercase = [],[],Truefor row in datas:label.append(int(row[0])-1)text = " ".join(row[1:])if lowercase:text = text.lower()data.append(text)print(label[0:5]) print(data[0:5])[2, 2, 2, 2, 2] ["wall st. bears claw back into the black (reuters) reuters - short-sellers, wall street's dwindling\\band of ultra-cynics, are seeing green again.", 'carlyle looks toward commercial aerospace (reuters) reuters - private investment firm carlyle group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.', "oil and economy cloud stocks' outlook (reuters) reuters - soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.", 'iraq halts oil exports from main southern pipeline (reuters) reuters - authorities have halted oil export\\flows from the main pipeline in southern iraq after\\intelligence showed a rebel militia could strike\\infrastructure, an oil official said on saturday.', 'oil prices soar to all-time record, posing new menace to us economy (afp) afp - tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the us presidential elections.']with open("./data/alphabet.json") as f:alphabet = "".join(json.load(f))def char2Index(char):return alphabet.find(char)l0 = 1014 def ontHotEncode(idx):X = torch.zeros(l0,len(alphabet))for index_char,char in enumerate(data[idx]):if char2Index(char)!=-1:X[index_char][char2Index(char)] = 1.0return X模型理論細節:
""" 模型代碼 """import torch import torch.nn as nn import numpy as npclass CharTextCNN(nn.Module):def __init__(self,config):super(CharTextCNN,self).__init__()in_features = [config.char_num] + config.features[0:-1]out_features = config.featureskernel_sizes = config.kernel_sizesself.convs = []self.conv1 = nn.Sequential(nn.Conv1d(in_features[0], out_features[0], kernel_size=kernel_sizes[0], stride=1), # 一維卷積nn.BatchNorm1d(out_features[0]), # bn層nn.ReLU(), # relu激活函數層nn.MaxPool1d(kernel_size=3, stride=3) #一維池化層) # 卷積+bn+relu+pooling模塊self.conv2 = nn.Sequential(nn.Conv1d(in_features[1], out_features[1], kernel_size=kernel_sizes[1], stride=1),nn.BatchNorm1d(out_features[1]),nn.ReLU(),nn.MaxPool1d(kernel_size=3, stride=3))self.conv3 = nn.Sequential(nn.Conv1d(in_features[2], out_features[2], kernel_size=kernel_sizes[2], stride=1),nn.BatchNorm1d(out_features[2]),nn.ReLU())self.conv4 = nn.Sequential(nn.Conv1d(in_features[3], out_features[3], kernel_size=kernel_sizes[3], stride=1),nn.BatchNorm1d(out_features[3]),nn.ReLU())self.conv5 = nn.Sequential(nn.Conv1d(in_features[4], out_features[4], kernel_size=kernel_sizes[4], stride=1),nn.BatchNorm1d(out_features[4]),nn.ReLU())self.conv6 = nn.Sequential(nn.Conv1d(in_features[5], out_features[5], kernel_size=kernel_sizes[5], stride=1),nn.BatchNorm1d(out_features[5]),nn.ReLU(),nn.MaxPool1d(kernel_size=3, stride=3))self.fc1 = nn.Sequential(nn.Linear(8704, 1024), # 全連接層 #((l0-96)/27)*256nn.ReLU(),nn.Dropout(p=config.dropout) # dropout層) # 全連接+relu+dropout模塊self.fc2 = nn.Sequential(nn.Linear(1024, 1024),nn.ReLU(),nn.Dropout(p=config.dropout))self.fc3 = nn.Linear(1024, config.num_classes)def forward(self, x):x = self.conv1(x)x = self.conv2(x)x = self.conv3(x)x = self.conv4(x)x = self.conv5(x)x = self.conv6(x)x = x.view(x.size(0), -1) x = self.fc1(x)x = self.fc2(x)x = self.fc3(x)return xclass config:def __init__(self):self.char_num = 70 # 字符的個數self.features = [256,256,256,256,256,256] # 每一層特征個數self.kernel_sizes = [7,7,3,3,3,3] # 每一層的卷積核尺寸self.dropout = 0.5 # dropout大小self.num_classes = 4 # 數據的類別個數config = config() chartextcnn = CharTextCNN(config) test = torch.zeros([64,70,1014]) out = chartextcnn(test)from torchsummary import summarysummary(chartextcnn, input_size=(70,1014))----------------------------------------------------------------Layer (type) Output Shape Param # ================================================================Conv1d-1 [-1, 256, 1008] 125,696BatchNorm1d-2 [-1, 256, 1008] 512ReLU-3 [-1, 256, 1008] 0MaxPool1d-4 [-1, 256, 336] 0Conv1d-5 [-1, 256, 330] 459,008BatchNorm1d-6 [-1, 256, 330] 512ReLU-7 [-1, 256, 330] 0MaxPool1d-8 [-1, 256, 110] 0Conv1d-9 [-1, 256, 108] 196,864BatchNorm1d-10 [-1, 256, 108] 512ReLU-11 [-1, 256, 108] 0Conv1d-12 [-1, 256, 106] 196,864BatchNorm1d-13 [-1, 256, 106] 512ReLU-14 [-1, 256, 106] 0Conv1d-15 [-1, 256, 104] 196,864BatchNorm1d-16 [-1, 256, 104] 512ReLU-17 [-1, 256, 104] 0Conv1d-18 [-1, 256, 102] 196,864BatchNorm1d-19 [-1, 256, 102] 512ReLU-20 [-1, 256, 102] 0MaxPool1d-21 [-1, 256, 34] 0Linear-22 [-1, 1024] 8,913,920ReLU-23 [-1, 1024] 0Dropout-24 [-1, 1024] 0Linear-25 [-1, 1024] 1,049,600ReLU-26 [-1, 1024] 0Dropout-27 [-1, 1024] 0Linear-28 [-1, 4] 4,100 ================================================================ Total params: 11,342,852 Trainable params: 11,342,852 Non-trainable params: 0 ---------------------------------------------------------------- Input size (MB): 0.27 Forward/backward pass size (MB): 11.29 Params size (MB): 43.27 Estimated Total Size (MB): 54.83 
""" Training code """

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim
from model import CharTextCNN
from data import AG_Data
from tqdm import tqdm
import numpy as np
import config as argumentparser

config = argumentparser.ArgumentParser()   # read the hyper-parameter settings
config.features = list(map(int, config.features.split(",")))          # split features on "," and cast to int
config.kernel_sizes = list(map(int, config.kernel_sizes.split(",")))  # split kernel_sizes on "," and cast to int

# load the training set
training_set = AG_Data(data_path="/AG/train.csv", l0=config.l0)
training_iter = torch.utils.data.DataLoader(dataset=training_set,
                                            batch_size=config.batch_size,
                                            shuffle=True,
                                            num_workers=0)

# load the test set
test_set = AG_Data(data_path="/AG/test.csv", l0=config.l0)
test_iter = torch.utils.data.DataLoader(dataset=test_set,
                                        batch_size=config.batch_size,
                                        shuffle=False,
                                        num_workers=0)

model = CharTextCNN(config)                                          # initialize the model
criterion = nn.CrossEntropyLoss()                                    # build the loss
optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)  # build the optimizer
loss = -1

def get_test_result(data_iter, data_set):
    # evaluate the model on a held-out set and return the accuracy
    model.eval()
    true_sample_num = 0
    for data, label in data_iter:
        if config.cuda and torch.cuda.is_available():
            data = data.cuda()
            label = label.cuda()
        else:
            data = torch.autograd.Variable(data).float()
        out = model(data)
        true_sample_num += np.sum((torch.argmax(out, 1) == label).cpu().numpy())  # correct predictions in this batch
    acc = true_sample_num / data_set.__len__()
    return acc

for epoch in range(config.epoch):
    model.train()
    process_bar = tqdm(training_iter)
    for data, label in process_bar:
        if config.cuda and torch.cuda.is_available():
            data = data.cuda()    # move the batch to the GPU when one is used
            label = label.cuda()
        else:
            data = torch.autograd.Variable(data).float()
            label = torch.autograd.Variable(label).squeeze()
        out = model(data)
        loss_now = criterion(out, autograd.Variable(label.long()))
        if loss == -1:
            loss = loss_now.data.item()
        else:
            loss = 0.95 * loss + 0.05 * loss_now.data.item()   # exponential smoothing of the loss
        process_bar.set_postfix(loss=loss_now.data.item())     # show the current loss on the progress bar
        process_bar.update()
        optimizer.zero_grad()   # clear accumulated gradients
        loss_now.backward()     # back-propagate
        optimizer.step()        # update the parameters
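The training script imports a custom config module (import config as argumentparser) that is not shown in this post. Below is a minimal sketch of what it might look like, assuming it wraps argparse; every flag name and default is an assumption inferred from how config is used above.

# config.py -- hypothetical sketch; the real file is not shown in the post
import argparse

def ArgumentParser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--char_num", type=int, default=70)                          # alphabet size
    parser.add_argument("--features", type=str, default="256,256,256,256,256,256")   # conv feature maps, comma-separated
    parser.add_argument("--kernel_sizes", type=str, default="7,7,3,3,3,3")           # conv kernel sizes, comma-separated
    parser.add_argument("--l0", type=int, default=1014)                              # fixed input length in characters
    parser.add_argument("--dropout", type=float, default=0.5)
    parser.add_argument("--num_classes", type=int, default=4)                        # AG's News has 4 classes
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--learning_rate", type=float, default=1e-3)
    parser.add_argument("--epoch", type=int, default=10)
    parser.add_argument("--cuda", action="store_true")                               # use the GPU if available
    return parser.parse_args()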