Simple Sentiment Analysis of Movie Reviews with an LSTM
Today I tried using an LSTM to do simple sentiment analysis on movie reviews.
The npy files used in the code:
The dataset used in the code is IMDB; cloud-drive link:
First, load the pre-built word-vector model:
import numpy as np
# There are two tables: one maps IDs to words, the other maps IDs to word vectors
wordsList = np.load('./npy/wordsList.npy')
wordsList = wordsList.tolist()
wordsList = [word.decode('UTF-8') for word in wordsList]
wordVectors = np.load('./npy/wordVectors.npy')
You can print the length and the dimensions to check:
# Print the length and the shape
print(len(wordsList))
print(wordVectors.shape)
400000
(400000, 50)
Here is a simple example first.
We find the word's index, then use that index to look up the word's 50-dimensional vector:
nameIndex = wordsList.index('name')
wordVectors[nameIndex]
array([ 0.20957 , 0.75197 , -0.48559 , 0.1302 , 0.60071 , 0.43273 ,
-0.95424 , -0.19335 , -0.66756 , -0.25893 , 0.66367 , 1.0509 ,
0.10627 , -0.75438 , 0.45617 , 0.37878 , -0.40237 , 0.1821 ,
-0.028768, 0.24349 , -0.35723 , -0.55817 , 0.14103 , 0.58807 ,
0.076804, -1.972 , -1.4459 , 0.081884, -0.29207 , -0.65623 ,
2.718 , -0.96886 , -0.33354 , -0.19526 , 0.33918 , -0.24307 ,
0.29058 , -0.37178 , -0.38133 , -0.20901 , 0.48504 , 0.20702 ,
-0.5754 , -0.32403 , -0.19267 , -0.043298, -0.57702 , -0.4727 ,
0.42171 , -0.14112 ], dtype=float32)
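With this lookup we can get a vector for every word, so a sentence simply adds one more dimension on top: a sentence of 20 words becomes a (20, 50) matrix of word vectors. This is only a small example. Real training also needs a fixed sentence length (truncate longer sentences, pad shorter ones) and has to take batch_size into account.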
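As a minimal sketch of the "truncate long, pad short" idea, the snippet below turns a tokenised sentence into a fixed-size matrix. The tiny vocabulary, the random vectors and the helper name `sentence_to_matrix` are assumptions made for illustration and are not part of the original code; padding with zero vectors is likewise just one possible choice.

```python
import numpy as np

# Hypothetical toy vocabulary and random 50-d vectors, standing in for wordsList / wordVectors.
toy_words = ["the", "movie", "was", "great", "bad"]
rng = np.random.default_rng(0)
toy_vectors = rng.standard_normal((len(toy_words), 50)).astype(np.float32)
word_to_id = {w: i for i, w in enumerate(toy_words)}

def sentence_to_matrix(tokens, max_len):
    """Look up each token's vector, truncate to max_len, leave the rest as zero padding."""
    out = np.zeros((max_len, toy_vectors.shape[1]), dtype=np.float32)
    for pos, tok in enumerate(tokens[:max_len]):
        out[pos] = toy_vectors[word_to_id[tok]]
    return out

short = sentence_to_matrix(["the", "movie", "was", "great"], max_len=6)  # padded
long_ = sentence_to_matrix(["the", "movie", "was", "bad"] * 3, max_len=6)  # truncated
print(short.shape, long_.shape)
```

Both results have shape (6, 50), so sentences of any length can be stacked into one batch.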
Now load the training dataset (12,500 positive and 12,500 negative reviews, 25,000 in total):
from os import listdir
from os.path import isfile, join
# Set the dataset location; the files have to be read one by one
positiveFiles = ['lmdb/train/pos/' + f for f in listdir('lmdb/train/pos/') if isfile(join('lmdb/train/pos/', f))]
negativeFiles = ['lmdb/train/neg/' + f for f in listdir('lmdb/train/neg/') if isfile(join('lmdb/train/neg/', f))]
numWords = []
# Count the words in the positive and the negative reviews
for pf in positiveFiles:
    with open(pf, "r", encoding='utf-8') as f:
        line = f.readline()
        counter = len(line.split())
        numWords.append(counter)
# print('Positive reviews loaded')
for pf in negativeFiles:
    with open(pf, "r", encoding='utf-8') as f:
        line = f.readline()
        counter = len(line.split())
        numWords.append(counter)
# print('Negative reviews loaded')
numFiles = len(numWords)
print('Total number of sequences', numFiles)
print('Total number of words', sum(numWords))
print('Average number of words per review', sum(numWords)/len(numWords))
Total number of sequences 25000
Total number of words 5844680
Average number of words per review 233.7872
We need to decide on a maximum sequence length, so first plot the distribution:
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(numWords, 50)
plt.xlabel('Sequence Length')
plt.ylabel('Frequency')
plt.axis([0, 1200, 0, 8000])
plt.show()
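[Figure: histogram of review lengths; x-axis Sequence Length, y-axis Frequency]
The histogram shows roughly that most reviews are around 200 words long, so the maximum length can be set to 250.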
maxSeqLength = 250
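Reading the cutoff off a histogram is approximate. One way to check a candidate value is to compute the fraction of reviews that fit within it. The sketch below assumes a hypothetical helper `coverage` and a made-up list of lengths standing in for `numWords`.

```python
def coverage(lengths, max_len):
    """Fraction of sequences whose length is at most max_len."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

# Hypothetical word counts, standing in for numWords.
toy_lengths = [120, 180, 200, 230, 240, 260, 400, 900]
for candidate in (200, 250, 500):
    print(candidate, coverage(toy_lengths, candidate))
```

On the real data, calling `coverage(numWords, 250)` would show how many reviews are kept whole and how many get truncated.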
Next, the text has to be converted into an index matrix. First, a regular expression does some simple cleaning:
import re
strip_special_chars = re.compile("[^A-Za-z0-9 ]+")
# Replace HTML line breaks with spaces, then drop everything except letters, digits and spaces
def cleanSentences(string):
    string = string.lower().replace("<br />", " ")
    return re.sub(strip_special_chars, "", string.lower())
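To see what this cleaning step does, here is a small self-contained sketch. It assumes the reviews contain HTML line breaks written as `<br />`, as the raw IMDB files do; the function name `clean_text` and the sample sentence are made up for the example.

```python
import re

# Keep only letters, digits and spaces.
non_alnum = re.compile("[^A-Za-z0-9 ]+")

def clean_text(text):
    """Lower-case, replace '<br />' with a space, then strip all other special characters."""
    text = text.lower().replace("<br />", " ")
    return non_alnum.sub("", text)

sample = "This movie was GREAT!<br />Loved it."
print(clean_text(sample))  # this movie was great loved it
```

Replacing the tag before stripping matters: otherwise the angle brackets and slash would be removed and a stray "br" would be left behind, glued to the neighbouring words.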
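Next, every word in all 25,000 reviews is mapped to its ID (word->ID), giving a 25000×250 matrix. This takes a long time to compute, so we load the precomputed index-matrix file directly. The word->ID mapping code is shown (commented out) below: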
# ids = np.zeros((numFiles, maxSeqLength), dtype='int32')
# fileCounter = 0
# for pf in positiveFiles:
#     with open(pf, "r", encoding='utf-8') as f:
#         indexCounter = 0
#         line = f.readline()
#         cleanedLine = cleanSentences(line)
#         split = cleanedLine.split()
#         for word in split:
#             try:
#                 ids[fileCounter][indexCounter] = wordsList.index(word)
#             except ValueError:
#                 # Unknown word: use the last row of the 400000-row table
#                 ids[fileCounter][indexCounter] = 399999
#             indexCounter = indexCounter + 1
#             if indexCounter >= maxSeqLength:
#                 break
#         fileCounter = fileCounter + 1
# for nf in negativeFiles:
#     with open(nf, "r", encoding='utf-8') as f:
#         indexCounter = 0
#         line = f.readline()
#         cleanedLine = cleanSentences(line)
#         split = cleanedLine.split()
#         for word in split:
#             try:
#                 ids[fileCounter][indexCounter] = wordsList.index(word)
#             except ValueError:
#                 # Unknown word: use the last row of the 400000-row table
#                 ids[fileCounter][indexCounter] = 399999
#             indexCounter = indexCounter + 1
#             if indexCounter >= maxSeqLength:
#                 break
#         fileCounter = fileCounter + 1
# np.save('idsMatrix', ids)
ids = np.load('npy/idsMatrix.npy')
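Part of the reason the mapping is slow is that `list.index` scans the 400,000-word list for every single word. A dictionary built once gives constant-time lookups. The sketch below is an illustration, not the author's code: `build_index`, `encode`, the toy vocabulary and the choice of the last row as the unknown-word ID are all assumptions.

```python
def build_index(words):
    """Map each word to its first position in the list."""
    index = {}
    for i, w in enumerate(words):
        index.setdefault(w, i)
    return index

toy_words = ["the", "movie", "was", "great"]  # hypothetical stand-in for wordsList
UNK_ID = len(toy_words) - 1                    # assumption: the last row doubles as "unknown word"
word_to_id = build_index(toy_words)

def encode(tokens, max_len):
    """Convert tokens to IDs, truncate to max_len, pad the remainder with 0."""
    ids_row = [word_to_id.get(t, UNK_ID) for t in tokens[:max_len]]
    return ids_row + [0] * (max_len - len(ids_row))

print(encode(["the", "movie", "rocked"], 5))  # [0, 1, 3, 0, 0]
```

Padding with 0 matches a matrix that starts out as `np.zeros`, where positions past the end of a review are simply never written.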
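Now build the model using a TensorFlow graph (the code uses the TensorFlow 1.x API). First define some hyperparameters: the batch size, the number of LSTM units, the number of classes and the number of training iterations.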
batchSize = 24
lstmUnits = 64
numClasses = 2
numDimensions = 50
iterations = 50000
The input should have shape batchSize × 250 (maximum sequence length) × 50 (word-vector dimension).
The output should have shape batchSize × 2 (number of classes).
import tensorflow as tf
tf.reset_default_graph()
labels = tf.placeholder(tf.float32, [batchSize, numClasses])
input_data = tf.placeholder(tf.int32, [batchSize, maxSeqLength])  # intermediate result only: word IDs, not yet word vectors
data = tf.Variable(tf.zeros([batchSize, maxSeqLength, numDimensions]), dtype=tf.float32)
data = tf.nn.embedding_lookup(wordVectors, input_data)
You can print data to check its shape:
print(data)
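For a single embedding table, the lookup amounts to picking one row of the table per ID. The NumPy sketch below reproduces the shape change with fancy indexing; the sizes are toy values chosen for the example, not the tutorial's.

```python
import numpy as np

batch_size, max_len, dims, vocab = 4, 6, 50, 10  # toy sizes for illustration
vectors = np.arange(vocab * dims, dtype=np.float32).reshape(vocab, dims)
id_batch = np.random.default_rng(0).integers(0, vocab, size=(batch_size, max_len))

# Indexing with an integer array selects one row of `vectors` per ID,
# turning a (batch, length) ID matrix into a (batch, length, dims) tensor.
embedded = vectors[id_batch]
print(embedded.shape)  # (4, 6, 50)
```

With the tutorial's sizes the same operation would give (24, 250, 50), which is the input shape stated above.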
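Then build the model: create the cell with BasicLSTMCell (tf.contrib.rnn.BasicLSTMCell in the code below), wrap it with dropout to reduce overfitting, and finally pass it to tf.nn.dynamic_rnn to unroll the whole network.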
lstmCell = tf.contrib.rnn.BasicLSTMCell(lstmUnits)
lstmCell = tf.contrib.rnn.DropoutWrapper(cell=lstmCell, output_keep_prob=0.75)
value, _ = tf.nn.dynamic_rnn(lstmCell, data, dtype=tf.float32)
# Print the output tensor to check it
print(value)
# Initialise the weights and bias of the output layer
weight = tf.Variable(tf.truncated_normal([lstmUnits, numClasses]))
bias = tf.Variable(tf.constant(0.1, shape=[numClasses]))
value = tf.transpose(value, [1, 0, 2])
# Take the output of the last time step
last = tf.gather(value, int(value.get_shape()[0])-1)
prediction = (tf.matmul(last, weight) + bias)
print(prediction)
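The transpose-then-gather step can be hard to picture. The NumPy sketch below, using toy sizes and random data in place of the real LSTM output, shows that it selects the last time step of every sequence and that a single matrix multiply then yields one score per class.

```python
import numpy as np

batch_size, max_len, units, classes = 3, 5, 4, 2  # toy sizes for illustration
rng = np.random.default_rng(0)
value = rng.standard_normal((batch_size, max_len, units))  # stands in for the dynamic_rnn output

# Same two steps as above: put the time axis first, then pick the last time step.
time_major = np.transpose(value, (1, 0, 2))
last = time_major[time_major.shape[0] - 1]

# Equivalent shortcut on the batch-major tensor.
same = np.array_equal(last, value[:, -1, :])

weight = rng.standard_normal((units, classes))
bias = np.full(classes, 0.1)
logits = last @ weight + bias
print(same, last.shape, logits.shape)  # True (3, 4) (3, 2)
```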
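Then define the correct-prediction op and the accuracy metric. A prediction counts as correct when the position of the largest output value matches the position of the 1 in the one-hot label.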
correctPred = tf.equal(tf.argmax(prediction, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))
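The same computation in NumPy, on four made-up predictions, may make the argmax comparison clearer. The numbers are invented for the example.

```python
import numpy as np

# Made-up scores for 4 samples and 2 classes, with one-hot labels.
logits = np.array([[2.0, -1.0], [0.3, 0.9], [1.5, 1.4], [-0.2, 0.1]])
labels = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])

# A sample is correct when the largest score sits at the same position as the 1 in its label.
correct = np.argmax(logits, axis=1) == np.argmax(labels, axis=1)
accuracy = correct.astype(np.float32).mean()
print(correct.tolist(), accuracy)  # [True, True, False, True] 0.75
```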
Finally, use softmax cross-entropy as the loss. The optimizer is Adam with its default learning rate:
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=labels))
optimizer = tf.train.AdamOptimizer().minimize(loss)
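As a rough illustration of what the loss measures, here is softmax cross-entropy written out in NumPy. It is a sketch of the standard formula and not TensorFlow's implementation; the function name and the inputs are made up.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Per-sample cross-entropy between one-hot labels and softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

logits = np.array([[0.0, 0.0], [5.0, -5.0]])
labels = np.array([[1.0, 0.0], [1.0, 0.0]])
losses = softmax_cross_entropy(logits, labels)
print(losses, losses.mean())
```

The first sample is undecided between the two classes, so its loss is ln 2 (about 0.693); the second is confidently right, so its loss is close to zero.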
Training needs two helper functions:
from random import randint
# Build batches; the row ranges in the dataset decide what is training data and what is test data.
# Each training batch is half positive and half negative, and gets the matching labels.
def getTrainBatch():
    labels = []
    arr = np.zeros([batchSize, maxSeqLength])
    for i in range(batchSize):
        if (i % 2 == 0):
            num = randint(1, 11499)
            labels.append([1,0])
        else:
            num = randint(13499, 24999)
            labels.append([0,1])
        arr[i] = ids[num-1:num]
    return arr, labels
def getTestBatch():
    labels = []
    arr = np.zeros([batchSize, maxSeqLength])
    for i in range(batchSize):
        num = randint(11499, 13499)
        # Row num-1 is used, and rows 0..12499 hold the positive reviews
        if (num <= 12500):
            labels.append([1,0])
        else:
            labels.append([0,1])
        arr[i] = ids[num-1:num]
    return arr, labels
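The balanced-batch idea can be tried out on toy data. In the sketch below, the 20-row matrix, the assumption that its first 10 rows are positive and the last 10 negative, and the helper name `balanced_batch` are all made up for the example.

```python
import numpy as np
from random import Random

rng = Random(0)
# 20 toy "reviews" of length 4; assume rows 0-9 are positive and rows 10-19 are negative.
toy_ids = np.arange(20 * 4).reshape(20, 4)

def balanced_batch(batch_size):
    """Alternate positive and negative rows so each class fills half of the batch."""
    arr = np.zeros((batch_size, toy_ids.shape[1]), dtype=np.int32)
    labels = []
    for i in range(batch_size):
        if i % 2 == 0:
            row = rng.randint(0, 9)    # positive half
            labels.append([1, 0])
        else:
            row = rng.randint(10, 19)  # negative half
            labels.append([0, 1])
        arr[i] = toy_ids[row]
    return arr, labels

arr, labels = balanced_batch(6)
print(arr.shape, labels)
```

Alternating by position keeps every batch exactly half positive and half negative, whichever rows are drawn.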
Training
sess = tf.InteractiveSession()
saver = tf.train.Saver()
sess.run(tf.global_variables_initializer())
for i in range(iterations):
    # Get the next batch from the helper function
    nextBatch, nextBatchLabels = getTrainBatch()
    sess.run(optimizer, {input_data: nextBatch, labels: nextBatchLabels})
    # Print the current loss and accuracy every 100 iterations
    if (i % 100 == 0 and i != 0):
        loss_ = sess.run(loss, {input_data: nextBatch, labels: nextBatchLabels})
        accuracy_ = sess.run(accuracy, {input_data: nextBatch, labels: nextBatchLabels})
        print("iteration {}/{}...".format(i+1, iterations),
              "loss {}...".format(loss_),
              "accuracy {}...".format(accuracy_))
    # Save the current model every 10,000 iterations
    if (i % 10000 == 0 and i != 0):
        save_path = saver.save(sess, "models/pretrained_lstm.ckpt", global_step=i)
        print("saved to %s" % save_path)
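That is roughly all there is to training. My laptop is too weak and a full run takes too long, which is why I print every 100 iterations. If you have the hardware, run the whole thing.
Below is the code for evaluating on the test set: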
sess = tf.InteractiveSession()
saver = tf.train.Saver()
saver.restore(sess, tf.train.latest_checkpoint('models'))
Then load test batches and evaluate:
test_iterations = 10
for i in range(test_iterations):
    nextBatch, nextBatchLabels = getTestBatch()
    print("Accuracy for this batch:", (sess.run(accuracy, {input_data: nextBatch, labels: nextBatchLabels})) * 100)