Deep Learning Notes: Sentence Pair Matching with Traditional Machine Learning Algorithms (LR, SVM, GBDT, RandomForest)
Sentence pair matching is a very common class of NLP problems. Given two sentences S1 and S2, the task is to decide whether the pair stands in some target relation. Formally, the problem can be stated as learning a mapping

    F(S1, S2) -> y

where the input is a sentence pair, and the output y, produced by the mapping function, is one of the labels in the task's label set.

A typical example is the paraphrase task: decide whether two sentences are semantically equivalent, so its label set is the binary set {equivalent, not equivalent}. Many other tasks also fall under sentence pair matching, for example similar-question matching and answer selection in question answering systems.

In the previous post I described an unsupervised sentence matching method based on Doc2vec and Word2vec; here I tackle the same task with traditional machine learning algorithms. In the machine learning setting, the mapping F is fitted by training a classification model: once the model is trained, previously unseen pairs can be fed into it and it predicts the label directly.
On the classification algorithms:
Common classification models include logistic regression (LR), naive Bayes, SVM, GBDT and random forests (RandomForest). This post uses logistic regression (LR), SVM, GBDT and RandomForest.
Since Sklearn (scikit-learn) bundles the common machine learning algorithms for classification, regression, clustering and so on, this post uses Sklearn, version 0.17.1.
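All four of these classifiers share the same fit/predict interface in scikit-learn, so they can be swapped behind a single loop. A minimal sketch (modern scikit-learn API; the synthetic data below is a stand-in for the real doc2vec features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic placeholder data: 200 samples, 10 features, binary labels.
rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# The four models used in this post, all behind the same interface.
classifiers = {
    'LR': LogisticRegression(),
    'SVM': SVC(),
    'GBDT': GradientBoostingClassifier(),
    'RandomForest': RandomForestClassifier(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print('%s: %.4f' % (name, acc))
```

The uniform interface is what makes the four nearly identical main() functions below possible; only the classifier construction line differs between them.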
On feature selection:
I have been working with doc2vec and Word2vec recently, and the comparison in the previous post showed that sentence vectors obtained with Doc2vec beat sentence vectors obtained by averaging Word2vec vectors. So here each sentence is represented by its doc2vec vector, 100 dimensions, and the 100-dimensional doc2vec vector of each sentence is fed directly into the classification algorithms as features.
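Concretely, a pair's feature vector is the concatenation of the two questions' doc2vec vectors, 200 dimensions in total. A small sketch with NumPy (the `embeddings_index` dict here is a hypothetical stand-in for the real id-to-vector lookup loaded later):

```python
import numpy as np

# Hypothetical lookup: question id -> 100-dim doc2vec vector.
embeddings_index = {1: np.ones(100), 2: np.zeros(100), 3: np.full(100, 0.5)}

def pair_features(qid1, qid2, embeddings_index):
    """Concatenate the two questions' doc2vec vectors into one feature row per pair."""
    v1 = np.vstack([embeddings_index[q] for q in qid1])
    v2 = np.vstack([embeddings_index[q] for q in qid2])
    return np.hstack((v1, v2))  # shape: (n_pairs, 200)

X = pair_features([1, 3], [2, 2], embeddings_index)
print(X.shape)  # (2, 200)
```

This is exactly what the `np.hstack((vectors1, vectors2))` line in the training code below computes.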
On the dataset:
The dataset is the Question Pairs semantic-equivalence dataset released by Quora, the same one as in the previous post. It contains over 400,000 labeled question pairs; if the two questions are semantically equivalent the label is 1, otherwise 0. Counting them up, there are over 530,000 questions in total. Each row holds the fields id, qid1, qid2, question1, question2 and is_duplicate.

After collecting all the questions, a doc2vec vector is trained for each question and used as the classifier's input features.
The corpus is shuffled randomly, 10,000 pairs are split off as the validation set, and the rest is used for training.
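Note that the training code below shuffles with `np.random.shuffle` without fixing a seed, so the split (and hence the accuracy) changes from run to run. A sketch of the same shuffle-and-split step with a fixed seed, which makes results reproducible:

```python
import numpy as np

def shuffle_split(vectors, labels, n_val=10000, seed=42):
    """Shuffle pairs and labels together, then hold out the last n_val pairs for validation."""
    rng = np.random.RandomState(seed)  # fixed seed -> the same split every run
    indices = rng.permutation(len(vectors))
    vectors, labels = vectors[indices], labels[indices]
    return (vectors[:-n_val], labels[:-n_val],
            vectors[-n_val:], labels[-n_val:])

# Tiny example: 20 pairs, hold out 5 for validation.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)
X_tr, y_tr, X_val, y_val = shuffle_split(X, y, n_val=5)
print(X_tr.shape, X_val.shape)  # (15, 2) (5, 2)
```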
The training code follows. The data loading and doc2vec lookup code is shared by all four models, so it comes first:
# coding:utf-8
import datetime
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

def load_data(datapath):
    """Load the Quora question-pair TSV: question ids, question texts, duplicate labels."""
    data_train = pd.read_csv(datapath, sep='\t', encoding='utf-8')
    print(data_train.shape)
    qid1, qid2, question1, question2, labels = [], [], [], [], []
    for idx in range(data_train.id.shape[0]):
        qid1.append(data_train.qid1[idx])
        qid2.append(data_train.qid2[idx])
        question1.append(data_train.question1[idx])
        question2.append(data_train.question2[idx])
        labels.append(data_train.is_duplicate[idx])
    return qid1, qid2, question1, question2, labels

def load_doc2vec(word2vecpath):
    """Load pre-trained doc2vec vectors, one 'id<TAB>v1 v2 ...' line per question."""
    embeddings_index = {}
    with open(word2vecpath) as f:
        for line in f:
            values = line.split('\t')
            coefs = np.asarray(values[1].split(), dtype='float32')
            embeddings_index[int(values[0]) + 1] = coefs  # shift ids to match the dataset's qids
    print('Total %s doc vectors.' % len(embeddings_index))
    return embeddings_index

def sentence_represention(qid, embeddings_index):
    """Look up the 100-dim doc2vec vector for each question id."""
    vectors = np.zeros((len(qid), 100))
    for i in range(len(qid)):
        vectors[i] = embeddings_index.get(qid[i])
    return vectors
Replace the dataset path and the doc2vec path in main() with your own and the code can be used directly.
1. Logistic regression (LR):
def main():
    start = datetime.datetime.now()
    datapath = 'D:/dataset/quora/quora_duplicate_questions_Chinese_seg.tsv'
    doc2vecpath = "D:/dataset/quora/vector2/quora_duplicate_question_doc2vec_100.vector"
    qid1, qid2, question1, question2, labels = load_data(datapath)
    embeddings_index = load_doc2vec(word2vecpath=doc2vecpath)
    vectors1 = sentence_represention(qid1, embeddings_index)
    vectors2 = sentence_represention(qid2, embeddings_index)
    vectors = np.hstack((vectors1, vectors2))  # 200-dim feature per pair
    labels = np.array(labels)
    VALIDATION_SPLIT = 10000
    indices = np.arange(vectors.shape[0])
    np.random.shuffle(indices)
    vectors = vectors[indices]
    labels = labels[indices]
    train_vectors = vectors[:-VALIDATION_SPLIT]
    train_labels = labels[:-VALIDATION_SPLIT]
    test_vectors = vectors[-VALIDATION_SPLIT:]
    test_labels = labels[-VALIDATION_SPLIT:]
    lr = LogisticRegression()
    print('***********************training************************')
    lr.fit(train_vectors, train_labels)
    print('***********************predict*************************')
    prediction = lr.predict(test_vectors)
    accuracy = metrics.accuracy_score(test_labels, prediction)
    print(accuracy)
    end = datetime.datetime.now()
    print(end - start)

if __name__ == '__main__':
    main()
2. SVM:
def main():
    start = datetime.datetime.now()
    datapath = 'D:/dataset/quora/quora_duplicate_questions_Chinese_seg.tsv'
    doc2vecpath = "D:/dataset/quora/vector2/quora_duplicate_question_doc2vec_100.vector"
    qid1, qid2, question1, question2, labels = load_data(datapath)
    embeddings_index = load_doc2vec(word2vecpath=doc2vecpath)
    vectors1 = sentence_represention(qid1, embeddings_index)
    vectors2 = sentence_represention(qid2, embeddings_index)
    vectors = np.hstack((vectors1, vectors2))  # 200-dim feature per pair
    labels = np.array(labels)
    VALIDATION_SPLIT = 10000
    indices = np.arange(vectors.shape[0])
    np.random.shuffle(indices)
    vectors = vectors[indices]
    labels = labels[indices]
    train_vectors = vectors[:-VALIDATION_SPLIT]
    train_labels = labels[:-VALIDATION_SPLIT]
    test_vectors = vectors[-VALIDATION_SPLIT:]
    test_labels = labels[-VALIDATION_SPLIT:]
    svm = SVC()
    print('***********************training************************')
    svm.fit(train_vectors, train_labels)
    print('***********************predict*************************')
    prediction = svm.predict(test_vectors)
    accuracy = metrics.accuracy_score(test_labels, prediction)
    print(accuracy)
    end = datetime.datetime.now()
    print(end - start)

if __name__ == '__main__':
    main()
3. GBDT:
def main():
    start = datetime.datetime.now()
    datapath = 'D:/dataset/quora/quora_duplicate_questions_Chinese_seg.tsv'
    doc2vecpath = "D:/dataset/quora/vector2/quora_duplicate_question_doc2vec_100.vector"
    qid1, qid2, question1, question2, labels = load_data(datapath)
    embeddings_index = load_doc2vec(word2vecpath=doc2vecpath)
    vectors1 = sentence_represention(qid1, embeddings_index)
    vectors2 = sentence_represention(qid2, embeddings_index)
    vectors = np.hstack((vectors1, vectors2))  # 200-dim feature per pair
    labels = np.array(labels)
    VALIDATION_SPLIT = 10000
    indices = np.arange(vectors.shape[0])
    np.random.shuffle(indices)
    vectors = vectors[indices]
    labels = labels[indices]
    train_vectors = vectors[:-VALIDATION_SPLIT]
    train_labels = labels[:-VALIDATION_SPLIT]
    test_vectors = vectors[-VALIDATION_SPLIT:]
    test_labels = labels[-VALIDATION_SPLIT:]
    # These are the sklearn 0.17 defaults, spelled out explicitly.
    gbdt = GradientBoostingClassifier(loss='deviance', learning_rate=0.1,
                                      n_estimators=100, max_depth=3,
                                      min_samples_split=2, min_samples_leaf=1,
                                      subsample=1.0, random_state=None)
    print('***********************training************************')
    gbdt.fit(train_vectors, train_labels)
    print('***********************predict*************************')
    prediction = gbdt.predict(test_vectors)
    accuracy = metrics.accuracy_score(test_labels, prediction)
    print(accuracy)
    end = datetime.datetime.now()
    print(end - start)

if __name__ == '__main__':
    main()
4. Random forest (RandomForest):
def main():
    start = datetime.datetime.now()
    datapath = 'D:/dataset/quora/quora_duplicate_questions_Chinese_seg.tsv'
    doc2vecpath = "D:/dataset/quora/vector2/quora_duplicate_question_doc2vec_100.vector"
    qid1, qid2, question1, question2, labels = load_data(datapath)
    embeddings_index = load_doc2vec(word2vecpath=doc2vecpath)
    vectors1 = sentence_represention(qid1, embeddings_index)
    vectors2 = sentence_represention(qid2, embeddings_index)
    vectors = np.hstack((vectors1, vectors2))  # 200-dim feature per pair
    labels = np.array(labels)
    VALIDATION_SPLIT = 10000
    indices = np.arange(vectors.shape[0])
    np.random.shuffle(indices)
    vectors = vectors[indices]
    labels = labels[indices]
    train_vectors = vectors[:-VALIDATION_SPLIT]
    train_labels = labels[:-VALIDATION_SPLIT]
    test_vectors = vectors[-VALIDATION_SPLIT:]
    test_labels = labels[-VALIDATION_SPLIT:]
    randomforest = RandomForestClassifier()
    print('***********************training************************')
    randomforest.fit(train_vectors, train_labels)
    print('***********************predict*************************')
    prediction = randomforest.predict(test_vectors)
    accuracy = metrics.accuracy_score(test_labels, prediction)
    print(accuracy)
    end = datetime.datetime.now()
    print(end - start)

if __name__ == '__main__':
    main()
The final results:

LR: 68.56%
SVM: 69.77%
GBDT: 71.4%
RandomForest: 78.36% (best of several runs)

In terms of accuracy, the random forest does best; in terms of running time, the SVM takes longest.
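Since the RandomForest number above is the best of several runs on a single random split, k-fold cross-validation gives a more stable estimate. A sketch with the modern scikit-learn API (in the 0.17 version used in this post, `cross_val_score` lives in `sklearn.cross_validation` instead); the synthetic data is a stand-in for the real features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic placeholder data.
rng = np.random.RandomState(0)
X = rng.rand(300, 10)
y = (X[:, 0] > 0.5).astype(int)

# 5-fold cross-validation: mean +/- std is more trustworthy than one split.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print('mean accuracy: %.4f (+/- %.4f)' % (scores.mean(), scores.std()))
```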
Future work:
There is still plenty of room to go deeper on feature selection and on tuning the classifiers' parameters. I believe that mining more useful features and tuning the model parameters would yield even better results.
The full code is on my GitHub.
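For the parameter-tuning direction, grid search over a small hyperparameter grid is the usual starting point. A sketch with the modern scikit-learn API (`GridSearchCV` was in `sklearn.grid_search` in version 0.17); both the data and the grid values are illustrative placeholders, not tuned values:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic placeholder data.
rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = (X[:, 0] > 0.5).astype(int)

# Hypothetical grid; real values would be chosen against the actual features.
param_grid = {'n_estimators': [50, 100], 'max_depth': [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```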