Building a Chatbot with TensorFlow
Contents
0. Preface
1. Training corpus
2. Data preprocessing
3. Words to vectors
4. Training
5. Chatbot - verifying the results
0. Preface
I'm not a machine-learning specialist by training. Three months ago I started catching up on neural networks, convolution, and a whole pile of basic neural-network concepts. Honestly, it is a bit complicated, but once those basic mathematical ideas made sense, reading the TensorFlow API and the Python code, stumbling along, I found I could actually follow it and get a rough sense of what is going on behind it.
0x1: Model categories
1. Retrieval-based models vs. generative models
Retrieval-Based Models have a predefined "repository" containing many responses, plus some heuristic rules that, given the input question and its context, pick out a suitable response. These heuristics can be simple rule-based expression matching or an ensemble of fairly complex machine-learning classifiers. A retrieval-based model never produces new text; it can only select a reasonably suitable response from the predefined repository.
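As a rough sketch of this approach (the rules and canned answers below are invented for the example; a real system would use far richer heuristics or a trained classifier):

import re

# a tiny predefined "repository" of (rule, response) pairs
repository = [
    (re.compile(r"你好|hello|hi"), "你好,请问有什么可以帮你?"),
    (re.compile(r"多少钱|价格"), "请告诉我具体商品,我帮你查一下价格。"),
]
fallback = "抱歉,我没有听懂。"

def respond(question):
    # pick the first predefined response whose rule matches; nothing new is ever generated
    for pattern, answer in repository:
        if pattern.search(question):
            return answer
    return fallback

print(respond("这个多少钱"))  # -> the canned price response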
Generative Models do not rely on a predefined response set; they produce a new response. Classic generative models are based on machine-translation techniques, except that instead of translating one language into another, they "translate" a question into a response.
2. Long conversation vs. short conversation models
A short conversation (Short Conversation) is a single-turn, one-question-one-answer exchange: when the machine receives a question from the user, it returns a suitable answer. A long conversation (Long Conversation), by contrast, is a multi-turn back-and-forth, for example two friends chatting about some topic and exchanging opinions. In that scenario both parties (one of which may be the chatbot) need to remember what has already been said, which is one of the differences from the short-conversation setting. Customer-service bots today are usually long-conversation models.
3. Open-domain vs. closed-domain models
In the open-domain (Open Domain) setting, the user can say anything; the query need not carry a specific goal or intent. Conversation on social networks such as Twitter and Reddit is a typical open-domain scenario. Because the number of possible topics is unlimited and a certain amount of common-sense knowledge is needed as the basis for chatting, building such a chatbot is relatively hard.
The closed-domain (Closed Domain) setting, also called goal-driven, is one where the system tries to solve problems in a specific field, so the space of possible questions and answers is relatively limited. Technical support systems and shopping assistants are examples of closed-domain models. We don't require these systems to be able to chat about politics, only to solve our problems as effectively as possible. Users can still ask off-topic questions, but the system is equally free to give an off-topic reply ;)
Relevant Link:
http://naturali.io/deeplearning/chatbot/introduction/2016/04/28/chatbot-part1.html
http://blog.topspeedsnail.com/archives/10735/comment-page-1#comment-1161
http://blog.csdn.net/malefactor/article/details/51901115
1. Training corpus
wget https://raw.githubusercontent.com/rustch3n/dgk_lost_conv/master/dgk_shooter_min.conv.zip

Unzip it:

unzip dgk_shooter_min.conv.zip

Relevant Link:
https://github.com/rustch3n/dgk_lost_conv
2. Data preprocessing
Generally, the base corpus we get will be something like movie dialogue lines or the Ubuntu Dialog Corpus, and in essence we have to complete the following major steps:
1. Tokenization
2. Stemming (for English words)
3. Lemmatization of English word forms (e.g. mapping plurals back to the singular)
4. In addition, proper nouns such as person names, place names, organization names, URLs and system paths can all be replaced by a uniform type identifier

In this corpus format, M marks an utterance and E marks a separator: on an M line we append the utterance to the current conversation fragment; on an E line the dialogue has been interrupted or the speakers have changed, so we push the whole temporary fragment into the overall conversation set convs, one conversation at a time. Think of it as a "cut!" on a film set.
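A made-up fragment in the format just described (E separates conversations, M prefixes an utterance whose characters are joined with '/') would look roughly like this:

E
M 你/今/年/几/岁/了
M 我/十/八/岁/了
E
M 吃/饭/了/吗
M 还/没/有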
convs = []  # conversation set
with open(conv_path, encoding="utf8") as f:
    one_conv = []  # a complete conversation
    for line in f:
        line = line.strip('\n').replace('/', '')
        if line == '':
            continue
        if line[0] == 'E':
            if one_conv:
                convs.append(one_conv)
            one_conv = []
        elif line[0] == 'M':
            one_conv.append(line.split(' ')[1])

Because the scenario is a chatbot, and movie/TV dialogue also alternates one line per speaker, two special cases have to be ignored here: a fragment that contains only a question or only an answer, and a fragment where the numbers of questions and answers differ, i.e. the last speaker asked something and never got a reply.
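One possible way to handle those two cases when turning convs into question/answer pairs (a sketch of my own; the post does not show the original pairing code):

asks, answers = [], []
for conv in convs:
    if len(conv) < 2:        # only a question or only an answer: skip the whole fragment
        continue
    if len(conv) % 2 != 0:   # the last question never got an answer: drop the trailing line
        conv = conv[:-1]
    for i in range(0, len(conv), 2):
        asks.append(conv[i])         # even positions act as questions
        answers.append(conv[i + 1])  # the following line is treated as the answer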
3. Words to vectors
We know that one reason image recognition and speech recognition were the first fields where deep learning achieved major success is that the raw input data in those two fields already carries strong sample correlation; for example, the distribution of pixel weights is basically consistent across different images of the same class of object. This is essentially the same mechanism the human brain uses to recognize objects of the same kind, what we usually call the ability to "infer the general from one example": the more words we have learned, the more likely we are to master, or even invent, new ways of combining them and to write elegant prose.
But in NLP and semantic understanding, the input data, the dialogues or what we call the corpus, usually lacks this kind of strong correlation. For this we need to introduce a conceptual model: word vectors (word2vec) or phrase/sequence vectors (seq2seq). Put simply, the vocabulary of the corpus is mapped into a vector space, and the arrangement of the vectors is determined by grammar and semantic context, for example "中国->人" (the probability that 人 immediately follows 中国 is very high), or "你今年几岁了->我 ** 岁了".
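As a toy illustration of that "context decides the arrangement" idea (plain bigram counting over a made-up corpus, not the word2vec training algorithm itself):

from collections import Counter, defaultdict

# made-up corpus fragments, already split into tokens
corpus = [["中国", "人"], ["中国", "人", "民"], ["你", "今年", "几", "岁"]]

# count which token follows which
bigrams = defaultdict(Counter)
for sentence in corpus:
    for a, b in zip(sentence, sentence[1:]):
        bigrams[a][b] += 1

follower, count = bigrams["中国"].most_common(1)[0]
print(follower, count / sum(bigrams["中国"].values()))  # -> 人 1.0: here 人 always follows 中国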
0x1: Tokenization and word encoding
Split each dialogue file in the training set into individual characters to form a vocabulary (word table).
def gen_vocabulary_file(input_file, output_file):
    vocabulary = {}
    with open(input_file) as f:
        counter = 0
        for line in f:
            counter += 1
            tokens = [word for word in line.strip()]
            for word in tokens:
                if word in vocabulary:
                    vocabulary[word] += 1
                else:
                    vocabulary[word] = 1
        vocabulary_list = START_VOCABULART + sorted(vocabulary, key=vocabulary.get, reverse=True)
        # keep only the 10000 most frequent characters
        if len(vocabulary_list) > 10000:
            vocabulary_list = vocabulary_list[:10000]
        print(input_file + " phrase table size:", len(vocabulary_list))
        with open(output_file, "w") as ff:
            for word in vocabulary_list:
                ff.write(word + "\n")
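START_VOCABULART is used above but not defined in the post; presumably it holds the four special markers that later get ids 0-3 (PAD/GO/EOS/UNK). An assumed definition and two calls that would produce the vocabulary files read back in section 5:

# assumed definition; the actual marker strings in the original code may differ
START_VOCABULART = ['__PAD__', '__GO__', '__EOS__', '__UNK__']

# hypothetical input file names; the output names match the files loaded in section 5
gen_vocabulary_file('train_ask', 'train_ask_vocabulary.vec')
gen_vocabulary_file('train_answer', 'train_answer_vocabulary.vec')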
After tokenization we need to encode the words as numbers, which makes the later vector-space processing convenient. The core idea is this:
The dialogues in our training corpus are strongly related to one another, so the words in a vocabulary built from this related dialogue set also carry logical relations. Therefore, as long as we encode the words following the native order of that table, the resulting [word, id] mapping is a vocabulary that carries vector-space relations.
def convert_conversation_to_vector(input_file, vocabulary_file, output_file):
    tmp_vocab = []
    with open(vocabulary_file, "r") as f:
        tmp_vocab.extend(f.readlines())
    tmp_vocab = [line.strip() for line in tmp_vocab]
    vocab = dict([(x, y) for (y, x) in enumerate(tmp_vocab)])
    for item in vocab:
        print(item.encode('utf-8'))

So the vocabulary obtained from the training corpus can serve as the basis for vectorizing both the dialogue training set and the dialogue test set; our goal is to map the questions and the answers of the dialogues (training and test sets alike) into the vector space.
土 968

The character 土 sits at position 968 in the training-set vocabulary, so we give it the code 968.

0x2: Converting dialogues to vectors
The original author trimmed the vocabulary, keeping only the top 5000 words; after thinking it over, though, the real source of the problem is that the training corpus is not rich enough to cover every conversational scenario.

This step yields a vector-space set of ask/answer sentence sequences; for the training set, we establish a mapping between each ask and its answer.
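A sketch of what that conversion might look like: each character is looked up in the vocabulary (falling back to the UNK id) and the id sequences are written to the *.vec files consumed by the training script in the next section. The helper names here are my own, not the original author's.

UNK_ID = 3  # matches the id convention used in the training script below

def sentence_to_ids(sentence, vocab):
    # map every character to its position in the vocabulary file, UNK if unseen
    return [vocab.get(ch, UNK_ID) for ch in sentence.strip()]

def write_vec_file(sentences, vocab, output_file):
    with open(output_file, "w") as f:
        for sentence in sentences:
            f.write(" ".join(str(i) for i in sentence_to_ids(sentence, vocab)) + "\n")

# e.g. write_vec_file(asks, vocab, 'train_ask.vec')
#      write_vec_file(answers, vocab, 'train_answer.vec')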
4. Training
0x1: Sequence-to-sequence basics
A basic sequence-to-sequence model, as introduced in Cho et al., 2014, consists of two recurrent neural networks (RNNs): an encoder that processes the input and a decoder that generates the output. This basic architecture is depicted below.
Each box in the picture above represents a cell of the RNN, most commonly a GRU cell or an LSTM cell. Encoder and decoder can share weights or, as is more common, use a different set of parameters. Multi-layer cells have been successfully used in sequence-to-sequence models too.
In the basic model depicted above, every input has to be encoded into a fixed-size state vector, as that is the only thing passed to the decoder. To allow the decoder more direct access to the input, an attention mechanism was introduced in Bahdanau et al., 2014; suffice it to say that it allows the decoder to peek into the input at every decoding step. A multi-layer sequence-to-sequence network with LSTM cells and an attention mechanism in the decoder looks like this.
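To make the "fixed-size state vector" idea concrete, here is a tiny, untrained numpy sketch of the encoder/decoder split: the encoder folds the whole input into one state vector, and the decoder emits one token at a time from that state. Every name and dimension below is invented for illustration; this is not the TensorFlow seq2seq implementation used later.

import numpy as np

# toy dimensions, purely illustrative
hidden, vocab = 8, 16
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab, hidden))       # embedding lookup table
W_enc = rng.normal(size=(hidden, hidden))  # encoder recurrence weights
W_dec = rng.normal(size=(hidden, hidden))  # decoder recurrence weights
W_out = rng.normal(size=(hidden, vocab))   # projection from state to vocabulary logits

def encode(token_ids):
    # compress the whole input sequence into a single fixed-size state vector
    state = np.zeros(hidden)
    for t in token_ids:
        state = np.tanh(E[t] + W_enc @ state)
    return state

def decode(state, go_id=1, eos_id=2, max_len=5):
    # emit one token at a time, feeding each prediction back in as the next input
    outputs, token = [], go_id
    for _ in range(max_len):
        state = np.tanh(E[token] + W_dec @ state)
        token = int(np.argmax(state @ W_out))
        if token == eos_id:
            break
        outputs.append(token)
    return outputs

print(decode(encode([5, 7, 3])))  # weights are random, so the output is meaningless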
0x2: Training process
Feed the ask/answer training set into the neural network and use the ask/answer test vector mapping set to give feedback on the fit; with a three-layer network, let TensorFlow adjust the weight parameters automatically and obtain an ask -> answer model.
# -*- coding: utf-8 -*-
import tensorflow as tf  # 0.12
from tensorflow.models.rnn.translate import seq2seq_model
import os
import numpy as np
import math

PAD_ID = 0
GO_ID = 1
EOS_ID = 2
UNK_ID = 3

# ask/answer conversation vector files
train_ask_vec_file = 'train_ask.vec'
train_answer_vec_file = 'train_answer.vec'
test_ask_vec_file = 'test_ask.vec'
test_answer_vec_file = 'test_answer.vec'

# word table 6000
vocabulary_ask_size = 6000
vocabulary_answer_size = 6000

buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
layer_size = 256
num_layers = 3
batch_size = 64

# read the *encode.vec and *decode.vec data into memory
def read_data(source_path, target_path, max_size=None):
    data_set = [[] for _ in buckets]
    with tf.gfile.GFile(source_path, mode="r") as source_file:
        with tf.gfile.GFile(target_path, mode="r") as target_file:
            source, target = source_file.readline(), target_file.readline()
            counter = 0
            while source and target and (not max_size or counter < max_size):
                counter += 1
                source_ids = [int(x) for x in source.split()]
                target_ids = [int(x) for x in target.split()]
                target_ids.append(EOS_ID)
                for bucket_id, (source_size, target_size) in enumerate(buckets):
                    if len(source_ids) < source_size and len(target_ids) < target_size:
                        data_set[bucket_id].append([source_ids, target_ids])
                        break
                source, target = source_file.readline(), target_file.readline()
    return data_set

if __name__ == '__main__':
    model = seq2seq_model.Seq2SeqModel(source_vocab_size=vocabulary_ask_size,
                                       target_vocab_size=vocabulary_answer_size,
                                       buckets=buckets, size=layer_size, num_layers=num_layers,
                                       max_gradient_norm=5.0, batch_size=batch_size,
                                       learning_rate=0.5, learning_rate_decay_factor=0.97,
                                       forward_only=False)

    config = tf.ConfigProto()
    config.gpu_options.allocator_type = 'BFC'  # avoid running out of GPU memory

    with tf.Session(config=config) as sess:
        # restore the previous training run if a checkpoint exists
        ckpt = tf.train.get_checkpoint_state('.')
        if ckpt != None:
            print(ckpt.model_checkpoint_path)
            model.saver.restore(sess, ckpt.model_checkpoint_path)
        else:
            sess.run(tf.global_variables_initializer())

        train_set = read_data(train_ask_vec_file, train_answer_vec_file)
        test_set = read_data(test_ask_vec_file, test_answer_vec_file)

        train_bucket_sizes = [len(train_set[b]) for b in range(len(buckets))]
        train_total_size = float(sum(train_bucket_sizes))
        train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size
                               for i in range(len(train_bucket_sizes))]

        loss = 0.0
        total_step = 0
        previous_losses = []
        # keep training; save the model every 500 steps
        while True:
            random_number_01 = np.random.random_sample()
            bucket_id = min([i for i in range(len(train_buckets_scale))
                             if train_buckets_scale[i] > random_number_01])

            encoder_inputs, decoder_inputs, target_weights = model.get_batch(train_set, bucket_id)
            _, step_loss, _ = model.step(sess, encoder_inputs, decoder_inputs, target_weights,
                                         bucket_id, False)

            loss += step_loss / 500
            total_step += 1

            print(total_step)
            if total_step % 500 == 0:
                print(model.global_step.eval(), model.learning_rate.eval(), loss)

                # if the model hasn't improved recently, decrease the learning rate
                if len(previous_losses) > 2 and loss > max(previous_losses[-3:]):
                    sess.run(model.learning_rate_decay_op)
                previous_losses.append(loss)

                # save the model
                checkpoint_path = "chatbot_seq2seq.ckpt"
                model.saver.save(sess, checkpoint_path, global_step=model.global_step)
                loss = 0.0

                # evaluate the model on the test dataset
                for bucket_id in range(len(buckets)):
                    if len(test_set[bucket_id]) == 0:
                        continue
                    encoder_inputs, decoder_inputs, target_weights = model.get_batch(test_set, bucket_id)
                    _, eval_loss, _ = model.step(sess, encoder_inputs, decoder_inputs, target_weights,
                                                 bucket_id, True)
                    eval_ppx = math.exp(eval_loss) if eval_loss < 300 else float('inf')
                    print(bucket_id, eval_ppx)

Relevant Link:
https://www.tensorflow.org/tutorials/seq2seq
http://suriyadeepan.github.io/2016-06-28-easy-seq2seq/
5. Chatbot - verifying the results
# -*- coding: utf-8 -*-
import tensorflow as tf  # 0.12
from tensorflow.models.rnn.translate import seq2seq_model
import os
import sys
import locale
import numpy as np

PAD_ID = 0
GO_ID = 1
EOS_ID = 2
UNK_ID = 3

train_ask_vocabulary_file = "train_ask_vocabulary.vec"
train_answer_vocabulary_file = "train_answer_vocabulary.vec"

def read_vocabulary(input_file):
    tmp_vocab = []
    with open(input_file, "r") as f:
        tmp_vocab.extend(f.readlines())
    tmp_vocab = [line.strip() for line in tmp_vocab]
    vocab = dict([(x, y) for (y, x) in enumerate(tmp_vocab)])
    return vocab, tmp_vocab

if __name__ == '__main__':
    vocab_en, _ = read_vocabulary(train_ask_vocabulary_file)
    _, vocab_de = read_vocabulary(train_answer_vocabulary_file)

    # word table 6000
    vocabulary_ask_size = 6000
    vocabulary_answer_size = 6000

    buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
    layer_size = 256
    num_layers = 3
    batch_size = 1

    model = seq2seq_model.Seq2SeqModel(source_vocab_size=vocabulary_ask_size,
                                       target_vocab_size=vocabulary_answer_size,
                                       buckets=buckets, size=layer_size, num_layers=num_layers,
                                       max_gradient_norm=5.0, batch_size=batch_size,
                                       learning_rate=0.5, learning_rate_decay_factor=0.99,
                                       forward_only=True)
    model.batch_size = 1

    with tf.Session() as sess:
        # restore the last trained checkpoint
        ckpt = tf.train.get_checkpoint_state('.')
        if ckpt != None:
            print(ckpt.model_checkpoint_path)
            model.saver.restore(sess, ckpt.model_checkpoint_path)
        else:
            print("model not found")

        while True:
            input_string = raw_input('me > ').decode(sys.stdin.encoding or locale.getpreferredencoding(True)).strip()
            # quit on demand
            if input_string == 'quit':
                exit()

            # convert the user's input to a vector
            input_string_vec = []
            for words in input_string.strip():
                input_string_vec.append(vocab_en.get(words, UNK_ID))
            bucket_id = min([b for b in range(len(buckets)) if buckets[b][0] > len(input_string_vec)])

            encoder_inputs, decoder_inputs, target_weights = model.get_batch(
                {bucket_id: [(input_string_vec, [])]}, bucket_id)
            _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs, target_weights,
                                             bucket_id, True)
            outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
            if EOS_ID in outputs:
                outputs = outputs[:outputs.index(EOS_ID)]

            response = "".join([tf.compat.as_str(vocab_de[output]) for output in outputs])
            print('AI > ' + response)

Neural networks really do depend on the amount of training. In my experiments, only after running past 20000 steps on a GPU did the model's quality start to show, and the conversations begin to look like normal human-machine dialogue.