Convolutional Neural Networks for Sentence Classification
Overall Structure of the Paper
1. Abstract

Uses a convolutional neural network for sentence-level text classification, achieving strong results on multiple datasets.
2. Introduction (Background)

Proposes an effective classification model built on pre-trained word vectors and a convolutional neural network.

Context that motivated this work:
1. The rise of deep learning (around 2012)
2. Pre-trained word vector methods
3. Convolutional neural network methods

Historical significance of the paper:
1. Opened the era of deep-learning-based text classification
2. Advanced the adoption of convolutional neural networks in NLP
3. Model

TextCNN model architecture and regularization
(Figure: TextCNN model architecture — embedding, convolution, max-over-time pooling, fully connected layer + softmax)
As the figure shows, the model first applies convolution (each filter is convolved with the input), then max-over-time pooling, and finally a fully connected layer followed by softmax.

Note: the dimension of the concatenated vector fed to the fully connected layer equals the total number of filters, and the number of channels equals the number of filters sharing the same size.
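To make that note concrete, here is a minimal shape-check sketch. The filter sizes (3, 4, 5) and 100 filters per size follow the paper's setup; the batch size and sentence length are arbitrary illustrative values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, embed_size, seq_len = 4, 300, 50
x = torch.randn(batch_size, embed_size, seq_len)   # an embedded sentence, channels-first

filter_sizes, filter_num = [3, 4, 5], 100
convs = [nn.Conv1d(embed_size, filter_num, k) for k in filter_sizes]

# conv -> ReLU -> max-over-time pooling for each filter size, then concatenate
pooled = [F.relu(conv(x)).max(dim=2).values for conv in convs]  # each: batch_size x filter_num
features = torch.cat(pooled, dim=1)

print(features.shape)  # torch.Size([4, 300]): filter_num * len(filter_sizes) = 300
```

The concatenated feature vector has dimension `filter_num * len(filter_sizes)`, which is exactly the input dimension of the fully connected classification layer.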
To prevent overfitting, the paper uses two techniques:

1. Dropout: during forward propagation, each neuron is dropped (set to zero) with probability p, which improves the model's generalization.

2. L2 regularization: penalizes large parameter values so that learning does not overfit the training data, again improving generalization.
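Both techniques map directly onto standard PyTorch primitives; a minimal sketch (the layer sizes and hyperparameter values here are illustrative only):

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

fc = nn.Linear(300, 2)          # the final classification layer
dropout = nn.Dropout(p=0.5)     # each unit is zeroed with probability p while training

features = torch.randn(4, 300)  # stand-in for the pooled, concatenated feature vector
logits = fc(dropout(features))  # dropout is applied just before the fully connected layer

# L2 regularization: either add l2 * ||w||^2 to the loss by hand (as the training
# code below does), or equivalently pass weight_decay to the optimizer
optimizer = optim.Adam(fc.parameters(), lr=1e-3, weight_decay=1e-4)
```

Strictly, Kim's paper enforces an l2-norm *constraint* (rescaling the weight vector whenever its norm exceeds s = 3) rather than adding a penalty term; plain L2 weight decay as above is the more common substitute.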
4. Dataset and Experiment Setup

Dataset descriptions, hyperparameter settings, and experimental results.
Comparisons against baseline models show that TextCNN performs well.
5. Results and Discussion

Further experiments: discussion of the number of channels and of how the word vectors are used.
6. Conclusion

Summary of the full paper.

Key points:
1. Pre-trained word vectors: word2vec, GloVe
2. Convolutional network architecture: one-dimensional convolution, pooling
3. Hyperparameter choices: filter sizes, word-vector variants

Innovations:
1. Proposes TextCNN, a CNN-based text classification model
2. Proposes multiple ways of using word vectors (static, non-static, multichannel)
3. Achieves state-of-the-art results on four text classification tasks
4. Provides extensive hyperparameter experiments and analysis

Takeaways:
1. Fine-tuning on top of pre-trained word vectors gives very good results, suggesting the pre-trained vectors capture general-purpose features
2. On top of pre-trained word vectors, a simple model can outperform more complex ones
3. For words absent from the pre-trained vocabulary, fine-tuning lets the model learn more meaningful representations
7. Code Implementation
```python
# ************* Data preprocessing *************
# encoding = 'utf-8'
from torch.utils import data
import os
import random
import numpy as np
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors

# Load the pre-trained word2vec model
wvmodel = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)

# Read the experimental data (MR dataset) used in the paper
pos_data = open("./data/MR/rt-polarity.pos", errors='ignore').readlines()
neg_data = open("./data/MR/rt-polarity.neg", errors='ignore').readlines()
datas = pos_data + neg_data
datas = [data.split() for data in datas]
labels = [1] * len(pos_data) + [0] * len(neg_data)

# Build the vocabulary and pad every sentence to the maximum sentence length
max_sentence_length = max([len(sentence) for sentence in datas])
word2id = {'<pad>': 0}
for i, data in enumerate(datas):
    for j, word in enumerate(data):
        if word2id.get(word) is None:
            word2id[word] = len(word2id)
        datas[i][j] = word2id[word]
    datas[i] = datas[i] + [0] * (max_sentence_length - len(datas[i]))

# Compute the mean and std of the available pre-trained vectors
tmp = []
for word, index in word2id.items():
    try:
        tmp.append(wvmodel.get_vector(word))
    except KeyError:
        pass
mean = np.mean(np.array(tmp))
std = np.std(np.array(tmp))
print(mean, std)

# If a word is in the pre-trained model, use its vector; otherwise use a random
# vector drawn from N(mean, std) computed above
vocab_size = len(word2id)
embed_size = 300
embedding_weights = np.random.normal(mean, std, [vocab_size, embed_size])
for word, index in word2id.items():
    try:
        embedding_weights[index, :] = wvmodel.get_vector(word)
    except KeyError:
        pass

# Shuffle the data
c = list(zip(datas, labels))
random.seed(1)
random.shuffle(c)
datas[:], labels[:] = zip(*c)

# Split into training, validation, and test sets (10-fold cross-validation, fold k)
k = 0
train_datas = datas[:int(k * len(datas) / 10)] + datas[int((k + 1) * len(datas) / 10):]
train_labels = labels[:int(k * len(datas) / 10)] + labels[int((k + 1) * len(datas) / 10):]

valid_datas = np.array(train_datas[int(0.9 * len(train_datas)):])
valid_labels = np.array(train_labels[int(0.9 * len(train_labels)):])

train_datas = np.array(train_datas[:int(0.9 * len(train_datas))])
train_labels = np.array(train_labels[:int(0.9 * len(train_labels))])

test_datas = np.array(datas[int(k * len(datas) / 10):int((k + 1) * len(datas) / 10)])
test_labels = np.array(labels[int(k * len(datas) / 10):int((k + 1) * len(datas) / 10)])
```
```python
# *********** Model definition ***********
# encoding = 'utf-8'
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchsummary import summary

class BasicModule(nn.Module):
    def __init__(self):
        super(BasicModule, self).__init__()
        self.model_name = str(type(self))

    def load(self, path):
        self.load_state_dict(torch.load(path))

    def save(self, path):
        torch.save(self.state_dict(), path)

    def forward(self):
        pass

class TextCNN(BasicModule):
    def __init__(self, config):
        super(TextCNN, self).__init__()
        # Embedding layer
        if config.embedding_pretrained is not None:
            self.embedding = nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)
        else:
            self.embedding = nn.Embedding(config.n_vocab, config.embed_size)
        # Convolution layers, one per filter size
        self.conv1d_1 = nn.Conv1d(config.embed_size, config.filter_num, config.filters[0])
        self.conv1d_2 = nn.Conv1d(config.embed_size, config.filter_num, config.filters[1])
        self.conv1d_3 = nn.Conv1d(config.embed_size, config.filter_num, config.filters[2])
        # Max-over-time pooling layers
        self.max_pool_1 = nn.MaxPool1d(config.sentence_max_size - config.filters[0] + 1)
        self.max_pool_2 = nn.MaxPool1d(config.sentence_max_size - config.filters[1] + 1)
        self.max_pool_3 = nn.MaxPool1d(config.sentence_max_size - config.filters[2] + 1)
        # Dropout layer
        self.dropout = nn.Dropout(config.dropout)
        # Classification layer
        self.fc = nn.Linear(config.filter_num * len(config.filters), config.label_num)

    def forward(self, x):
        x = x.long()
        out = self.embedding(x)                 # batch_size x sentence_length x embed_size
        out = out.transpose(1, 2).contiguous()  # batch_size x embed_size x sentence_length
        x1 = F.relu(self.conv1d_1(out))
        x2 = F.relu(self.conv1d_2(out))
        x3 = F.relu(self.conv1d_3(out))
        x1 = self.max_pool_1(x1).squeeze()
        x2 = self.max_pool_2(x2).squeeze()
        x3 = self.max_pool_3(x3).squeeze()
        out = torch.cat([x1, x2, x3], 1)
        out = self.dropout(out)
        out = self.fc(out)
        return out

class config:
    def __init__(self):
        self.embedding_pretrained = None  # pre-trained embedding matrix (None = train from scratch)
        self.n_vocab = 100                # vocabulary size
        self.embed_size = 300             # word-vector dimension
        self.cuda = False                 # whether to use the GPU
        self.filter_num = 100             # number of filters per filter size
        self.filters = [3, 4, 5]          # filter sizes
        self.label_num = 2                # number of labels
        self.dropout = 0.5                # dropout probability
        self.sentence_max_size = 50       # maximum sentence length

configs = config()
textcnn = TextCNN(configs)
summary(textcnn, input_size=(50,))
```
```python
# *********** Training ***********
from pytorchtools import EarlyStopping
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim
from model import TextCNN
from data import MR_Dataset
import numpy as np
import config as argumentparser

config = argumentparser.ArgumentParser()
config.filters = list(map(int, config.filters.split(",")))
i = 0  # cross-validation fold index
early_stopping = EarlyStopping(patience=10, verbose=True, cv_index=i)

training_set = MR_Dataset(state="train", k=i)
config.n_vocab = training_set.n_vocab
training_iter = torch.utils.data.DataLoader(dataset=training_set, batch_size=config.batch_size,
                                            shuffle=True, num_workers=2)
if config.use_pretrained_embed:
    config.embedding_pretrained = torch.from_numpy(training_set.weight).float()

valid_set = MR_Dataset(state="valid", k=i)
valid_iter = torch.utils.data.DataLoader(dataset=valid_set, batch_size=config.batch_size,
                                         shuffle=False, num_workers=2)
test_set = MR_Dataset(state="test", k=i)
test_iter = torch.utils.data.DataLoader(dataset=test_set, batch_size=config.batch_size,
                                        shuffle=False, num_workers=2)

model = TextCNN(config)
if config.cuda and torch.cuda.is_available():
    model.cuda()
    config.embedding_pretrained.cuda()

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)
count = 0
loss_sum = 0

def get_test_result(data_iter, data_set):
    model.eval()
    data_loss = 0
    true_sample_num = 0
    for data, label in data_iter:
        if config.cuda and torch.cuda.is_available():
            data = data.cuda()
            label = label.cuda()
        else:
            data = torch.autograd.Variable(data).long()
        out = model(data)
        loss = criterion(out, autograd.Variable(label.long()))
        data_loss += loss.data.item()
        true_sample_num += np.sum((torch.argmax(out, 1) == label).cpu().numpy())
    acc = true_sample_num / data_set.__len__()
    return data_loss, acc

for epoch in range(config.epoch):
    # Training
    model.train()
    for data, label in training_iter:
        if config.cuda and torch.cuda.is_available():
            data = data.cuda()
            label = label.cuda()
        else:
            data = torch.autograd.Variable(data).long()
        label = torch.autograd.Variable(label).squeeze()
        out = model(data)
        # L2 penalty on the fully connected weights: l2_alpha * w^2
        l2_loss = config.l2 * torch.sum(torch.pow(list(model.parameters())[1], 2))
        loss = criterion(out, autograd.Variable(label.long())) + l2_loss
        loss_sum += loss.data.item()
        count += 1
        if count % 100 == 0:
            print("epoch", epoch, end=' ')
            print("The loss is: %.5f" % (loss_sum / 100))
            loss_sum = 0
            count = 0
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # End of epoch: evaluate on the validation set
    valid_loss, valid_acc = get_test_result(valid_iter, valid_set)
    early_stopping(valid_loss, model)
    print("The valid acc is: %.5f" % valid_acc)
    if early_stopping.early_stop:
        print("Early stopping")
        break

# Training finished: evaluate on the test set with the best checkpoint
model.load_state_dict(torch.load('./checkpoints/checkpoint%d.pt' % i))
test_loss, test_acc = get_test_result(test_iter, test_set)
print("The test acc is: %.5f" % test_acc)
```

```python
""" EarlyStopping """
import numpy as np
import torch

class EarlyStopping:
    """Early stops the training if validation loss doesn't improve after a given patience."""
    def __init__(self, patience=7, verbose=False, delta=0, cv_index=0):
        """
        Args:
            patience (int): How long to wait after last time validation loss improved. Default: 7
            verbose (bool): If True, prints a message for each validation loss improvement. Default: False
            delta (float): Minimum change in the monitored quantity to qualify as an improvement. Default: 0
        """
        self.patience = patience
        self.verbose = verbose
        self.counter = 0
        self.best_score = None
        self.early_stop = False
        self.val_loss_min = np.Inf
        self.delta = delta
        self.cv_index = cv_index

    def __call__(self, val_loss, model):
        score = -val_loss
        if self.best_score is None:
            self.best_score = score
            self.save_checkpoint(val_loss, model)
        elif score < self.best_score + self.delta:
            self.counter += 1
            print('EarlyStopping counter: %d out of %d' % (self.counter, self.patience))
            if self.counter >= self.patience:
                self.early_stop = True
        else:
            self.best_score = score
            self.save_checkpoint(val_loss, model)
            self.counter = 0

    def save_checkpoint(self, val_loss, model):
        '''Saves model when validation loss decreases.'''
        if self.verbose:
            print('Validation loss decreased (%.5f --> %.5f). Saving model ...' % (self.val_loss_min, val_loss))
        torch.save(model.state_dict(), './checkpoints/checkpoint%d.pt' % self.cv_index)
        self.val_loss_min = val_loss
```

The full code is available at: https://github.com/wangtao666666/NLP/tree/master/TextCNN