tf13: A Simple Chatbot
Many e-commerce companies now use chatbots as customer-service agents, and the tech giants have each launched their own chat assistants: Apple's Siri, Google Now, Amazon Alexa, Microsoft's XiaoIce, and so on. A while ago a video compared Google Now and Siri to see which was smarter; Google Now seemed to come out ahead.
This post builds a simple chatbot with TensorFlow, trained on a Chinese dialogue dataset (the choice of training data determines the kind of conversations it can hold). The model is an RNN-based seq2seq model.
Related posts:
- Building an intelligent chatbot with deep learning
- A wild idea: building a chat corpus from American TV-series subtitles
- A Chinese dialogue corpus
- https://www.tensorflow.org/versions/r0.12/tutorials/seq2seq/index.html
- https://github.com/tflearn/tflearn/blob/master/examples/nlp/lstm_generator_shakespeare.py
Code: Here.
Dataset
I use a ready-made movie/TV dialogue dataset; many thanks to the author for sharing it.
Download the dataset:
```shell
wget https://raw.githubusercontent.com/rustch3n/dgk_lost_conv/master/dgk_shooter_min.conv.zip
unzip dgk_shooter_min.conv.zip
```
Data preprocessing:
```python
import os
import random

conv_path = 'dgk_shooter_min.conv'

if not os.path.exists(conv_path):
    print('dataset not found')
    exit()

# Parse the conversations: 'E' separates conversations, 'M' lines hold
# one utterance with '/' between characters.
convs = []
with open(conv_path, encoding="utf8") as f:
    one_conv = []  # utterances of the current conversation
    for line in f:
        line = line.strip('\n').replace('/', '')
        if line == '':
            continue
        if line[0] == 'E':
            if one_conv:
                convs.append(one_conv)
            one_conv = []
        elif line[0] == 'M':
            one_conv.append(line.split(' ')[1])
    if one_conv:  # don't lose the last conversation in the file
        convs.append(one_conv)

# Split utterances into question/answer pairs: even positions ask,
# odd positions answer; drop a trailing unanswered utterance.
ask = []
response = []
for conv in convs:
    if len(conv) == 1:
        continue
    if len(conv) % 2 != 0:
        conv = conv[:-1]
    for i in range(len(conv)):
        if i % 2 == 0:
            ask.append(conv[i])
        else:
            response.append(conv[i])

# Write the pairs out, holding back a random test set.
def convert_seq2seq_files(questions, answers, TESTSET_SIZE=8000):
    train_enc = open('train.enc', 'w')
    train_dec = open('train.dec', 'w')
    test_enc = open('test.enc', 'w')
    test_dec = open('test.dec', 'w')

    test_index = random.sample([i for i in range(len(questions))], TESTSET_SIZE)

    for i in range(len(questions)):
        if i in test_index:
            test_enc.write(questions[i] + '\n')
            test_dec.write(answers[i] + '\n')
        else:
            train_enc.write(questions[i] + '\n')
            train_dec.write(answers[i] + '\n')
        if i % 1000 == 0:
            print(len(questions), 'progress:', i)

    train_enc.close()
    train_dec.close()
    test_enc.close()
    test_dec.close()

convert_seq2seq_files(ask, response)
```
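To make the parsing logic above concrete, here is the same loop run on a tiny made-up sample in the dgk_shooter_min.conv format (the sentences are invented for illustration):

```python
# Minimal illustration of the parsing above: 'E' separates conversations,
# 'M' lines hold one utterance with '/' between characters (sample is made up).
sample = """E
M 你/好
M 你/好/啊
E
M 再/见
"""

convs, one_conv = [], []
for line in sample.splitlines():
    line = line.strip('\n').replace('/', '')
    if line == '':
        continue
    if line[0] == 'E':
        if one_conv:
            convs.append(one_conv)
        one_conv = []
    elif line[0] == 'M':
        one_conv.append(line.split(' ')[1])
if one_conv:  # flush the last conversation
    convs.append(one_conv)

print(convs)  # [['你好', '你好啊'], ['再见']]
```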
Build the vocabularies, then convert each dialogue line into a vector of ids (see exercises 1 and 7):
```python
train_encode_file = 'train.enc'
train_decode_file = 'train.dec'
test_encode_file = 'test.enc'
test_decode_file = 'test.dec'

print('building vocabularies...')

# Special tokens: padding, decoder start, end of sentence, unknown.
PAD = "__PAD__"
GO = "__GO__"
EOS = "__EOS__"
UNK = "__UNK__"
START_VOCABULART = [PAD, GO, EOS, UNK]
PAD_ID = 0
GO_ID = 1
EOS_ID = 2
UNK_ID = 3

vocabulary_size = 5000

# Count character frequencies and keep the most common ones.
def gen_vocabulary_file(input_file, output_file):
    vocabulary = {}
    with open(input_file) as f:
        counter = 0
        for line in f:
            counter += 1
            tokens = [word for word in line.strip()]
            for word in tokens:
                if word in vocabulary:
                    vocabulary[word] += 1
                else:
                    vocabulary[word] = 1
        vocabulary_list = START_VOCABULART + sorted(vocabulary, key=vocabulary.get, reverse=True)
        if len(vocabulary_list) > vocabulary_size:
            vocabulary_list = vocabulary_list[:vocabulary_size]
        print(input_file + " vocabulary size:", len(vocabulary_list))
        with open(output_file, "w") as ff:
            for word in vocabulary_list:
                ff.write(word + "\n")

gen_vocabulary_file(train_encode_file, "train_encode_vocabulary")
gen_vocabulary_file(train_decode_file, "train_decode_vocabulary")

train_encode_vocabulary_file = 'train_encode_vocabulary'
train_decode_vocabulary_file = 'train_decode_vocabulary'

print('converting dialogues to id vectors...')

# Map every character of every line to its vocabulary id (UNK_ID if absent).
def convert_to_vector(input_file, vocabulary_file, output_file):
    tmp_vocab = []
    with open(vocabulary_file, "r") as f:
        tmp_vocab.extend(f.readlines())
    tmp_vocab = [line.strip() for line in tmp_vocab]
    vocab = dict([(x, y) for (y, x) in enumerate(tmp_vocab)])

    output_f = open(output_file, 'w')
    with open(input_file, 'r') as f:
        for line in f:
            line_vec = []
            for words in line.strip():
                line_vec.append(vocab.get(words, UNK_ID))
            output_f.write(" ".join([str(num) for num in line_vec]) + "\n")
    output_f.close()

convert_to_vector(train_encode_file, train_encode_vocabulary_file, 'train_encode.vec')
convert_to_vector(train_decode_file, train_decode_vocabulary_file, 'train_decode.vec')
# The test set is converted with the *training* vocabularies.
convert_to_vector(test_encode_file, train_encode_vocabulary_file, 'test_encode.vec')
convert_to_vector(test_decode_file, train_decode_vocabulary_file, 'test_decode.vec')
```
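A quick toy run of the lookup above shows what one line of a `.vec` file encodes (the mini vocabulary and sentence are made up for illustration):

```python
# Toy version of convert_to_vector's per-line lookup: characters map to
# vocabulary indices, out-of-vocabulary characters map to UNK_ID.
UNK_ID = 3
tmp_vocab = ["__PAD__", "__GO__", "__EOS__", "__UNK__", "你", "好"]
vocab = dict([(x, y) for (y, x) in enumerate(tmp_vocab)])

line = "你好吗"
line_vec = [vocab.get(ch, UNK_ID) for ch in line]
print(" ".join(str(n) for n in line_vec))  # 4 5 3 — 吗 is out of vocabulary
```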
The generated train_encode.vec and train_decode.vec are used for training; the matching vocabularies are train_encode_vocabulary and train_decode_vocabulary.
Training
Training takes a long time even on this small dataset; with hundreds of gigabytes of data it could easily take ten days to half a month.
Model used: seq2seq_model.py.
Code:
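The training script sorts sentence pairs into length buckets so that similarly sized pairs can be padded and batched together; pairs longer than the largest bucket are dropped. A minimal sketch of that assignment rule, using the same bucket list as the script (the helper name `pick_bucket` is mine, for illustration):

```python
# (encoder length, decoder length) buckets, as in the training script.
buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]

def pick_bucket(source_ids, target_ids):
    """Return the first bucket big enough for the pair, or None if too long."""
    for bucket_id, (source_size, target_size) in enumerate(buckets):
        if len(source_ids) < source_size and len(target_ids) < target_size:
            return bucket_id
    return None

print(pick_bucket([1, 2, 3], [4, 5]))    # 0
print(pick_bucket([1] * 8, [2] * 12))    # 1
print(pick_bucket([1] * 50, [2] * 60))   # None — pair is dropped
```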
```python
import tensorflow as tf
from tensorflow.models.rnn.translate import seq2seq_model  # TF r0.12 tutorial model
import os
import numpy as np
import math

PAD_ID = 0
GO_ID = 1
EOS_ID = 2
UNK_ID = 3

train_encode_vec = 'train_encode.vec'
train_decode_vec = 'train_decode.vec'
test_encode_vec = 'test_encode.vec'
test_decode_vec = 'test_decode.vec'

vocabulary_encode_size = 5000
vocabulary_decode_size = 5000

# (encoder length, decoder length) buckets; longer pairs are dropped.
buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
layer_size = 256  # LSTM units per layer
num_layers = 3
batch_size = 64

# Read the id files and sort each (question, answer) pair into a bucket.
def read_data(source_path, target_path, max_size=None):
    data_set = [[] for _ in buckets]
    with tf.gfile.GFile(source_path, mode="r") as source_file:
        with tf.gfile.GFile(target_path, mode="r") as target_file:
            source, target = source_file.readline(), target_file.readline()
            counter = 0
            while source and target and (not max_size or counter < max_size):
                counter += 1
                source_ids = [int(x) for x in source.split()]
                target_ids = [int(x) for x in target.split()]
                target_ids.append(EOS_ID)
                for bucket_id, (source_size, target_size) in enumerate(buckets):
                    if len(source_ids) < source_size and len(target_ids) < target_size:
                        data_set[bucket_id].append([source_ids, target_ids])
                        break
                source, target = source_file.readline(), target_file.readline()
    return data_set

model = seq2seq_model.Seq2SeqModel(source_vocab_size=vocabulary_encode_size, target_vocab_size=vocabulary_decode_size,
                                   buckets=buckets, size=layer_size, num_layers=num_layers, max_gradient_norm=5.0,
                                   batch_size=batch_size, learning_rate=0.5, learning_rate_decay_factor=0.97, forward_only=False)

config = tf.ConfigProto()
config.gpu_options.allocator_type = 'BFC'  # best-fit-with-coalescing GPU allocator

with tf.Session(config=config) as sess:
    # Resume from a checkpoint if one exists in the current directory.
    ckpt = tf.train.get_checkpoint_state('.')
    if ckpt is not None:
        print(ckpt.model_checkpoint_path)
        model.saver.restore(sess, ckpt.model_checkpoint_path)
    else:
        sess.run(tf.global_variables_initializer())

    train_set = read_data(train_encode_vec, train_decode_vec)
    test_set = read_data(test_encode_vec, test_decode_vec)

    train_bucket_sizes = [len(train_set[b]) for b in range(len(buckets))]
    train_total_size = float(sum(train_bucket_sizes))
    # Cumulative distribution used to sample buckets in proportion to their size.
    train_buckets_scale = [sum(train_bucket_sizes[:i + 1]) / train_total_size for i in range(len(train_bucket_sizes))]

    loss = 0.0
    total_step = 0
    previous_losses = []
    while True:
        random_number_01 = np.random.random_sample()
        bucket_id = min([i for i in range(len(train_buckets_scale)) if train_buckets_scale[i] > random_number_01])

        encoder_inputs, decoder_inputs, target_weights = model.get_batch(train_set, bucket_id)
        _, step_loss, _ = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket_id, False)

        loss += step_loss / 500  # average over the 500-step reporting window
        total_step += 1

        print(total_step)
        if total_step % 500 == 0:
            print(model.global_step.eval(), model.learning_rate.eval(), loss)

            # Decay the learning rate when the loss stops improving.
            if len(previous_losses) > 2 and loss > max(previous_losses[-3:]):
                sess.run(model.learning_rate_decay_op)
            previous_losses.append(loss)

            checkpoint_path = "chatbot_seq2seq.ckpt"
            model.saver.save(sess, checkpoint_path, global_step=model.global_step)
            loss = 0.0

            # Report test-set perplexity for each bucket.
            for bucket_id in range(len(buckets)):
                if len(test_set[bucket_id]) == 0:
                    continue
                encoder_inputs, decoder_inputs, target_weights = model.get_batch(test_set, bucket_id)
                _, eval_loss, _ = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket_id, True)
                eval_ppx = math.exp(eval_loss) if eval_loss < 300 else float('inf')
                print(bucket_id, eval_ppx)
```
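The perplexity reported by the evaluation loop is simply the exponential of the per-bucket cross-entropy loss, guarded against overflow. Isolated for clarity (the function name is mine):

```python
import math

def perplexity(eval_loss):
    # Mirrors the script's guard: math.exp overflows for very large losses.
    return math.exp(eval_loss) if eval_loss < 300 else float('inf')

print(perplexity(2.0))     # ≈ 7.39
print(perplexity(1000.0))  # inf
```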
Chatbot
Using the trained model:
```python
import tensorflow as tf
from tensorflow.models.rnn.translate import seq2seq_model
import os
import numpy as np

PAD_ID = 0
GO_ID = 1
EOS_ID = 2
UNK_ID = 3

train_encode_vocabulary = 'train_encode_vocabulary'
train_decode_vocabulary = 'train_decode_vocabulary'

# Load a vocabulary both as char -> id (to encode input)
# and as id -> char (to decode output).
def read_vocabulary(input_file):
    tmp_vocab = []
    with open(input_file, "r") as f:
        tmp_vocab.extend(f.readlines())
    tmp_vocab = [line.strip() for line in tmp_vocab]
    vocab = dict([(x, y) for (y, x) in enumerate(tmp_vocab)])
    return vocab, tmp_vocab

vocab_en, _ = read_vocabulary(train_encode_vocabulary)
_, vocab_de = read_vocabulary(train_decode_vocabulary)

vocabulary_encode_size = 5000
vocabulary_decode_size = 5000

buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]
layer_size = 256
num_layers = 3
batch_size = 1  # one sentence at a time

model = seq2seq_model.Seq2SeqModel(source_vocab_size=vocabulary_encode_size, target_vocab_size=vocabulary_decode_size,
                                   buckets=buckets, size=layer_size, num_layers=num_layers, max_gradient_norm=5.0,
                                   batch_size=batch_size, learning_rate=0.5, learning_rate_decay_factor=0.99, forward_only=True)
model.batch_size = 1

with tf.Session() as sess:
    ckpt = tf.train.get_checkpoint_state('.')
    if ckpt is not None:
        print(ckpt.model_checkpoint_path)
        model.saver.restore(sess, ckpt.model_checkpoint_path)
    else:
        print("model not found")

    while True:
        input_string = input('me > ')
        if input_string == 'quit':
            exit()

        # Encode the input character by character.
        input_string_vec = []
        for words in input_string.strip():
            input_string_vec.append(vocab_en.get(words, UNK_ID))
        # Smallest bucket that fits the input (input longer than the
        # largest bucket would raise a ValueError here).
        bucket_id = min([b for b in range(len(buckets)) if buckets[b][0] > len(input_string_vec)])
        encoder_inputs, decoder_inputs, target_weights = model.get_batch({bucket_id: [(input_string_vec, [])]}, bucket_id)
        _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket_id, True)
        # Greedy decoding: take the argmax at each step, truncate at EOS.
        outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
        if EOS_ID in outputs:
            outputs = outputs[:outputs.index(EOS_ID)]

        response = "".join([tf.compat.as_str(vocab_de[output]) for output in outputs])
        print('AI > ' + response)
```
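The reply above comes from greedy decoding: argmax over the logits at every step, truncated at the first EOS. The same two lines, run on made-up logits for a vocabulary of 5:

```python
import numpy as np

EOS_ID = 2
# Three decoding steps, batch of 1, vocabulary of 5 (logits are made up).
output_logits = [
    np.array([[0.1, 0.2, 0.0, 0.9, 0.0]]),  # argmax -> 3
    np.array([[0.0, 0.8, 0.1, 0.0, 0.2]]),  # argmax -> 1
    np.array([[0.0, 0.1, 0.9, 0.0, 0.0]]),  # argmax -> 2 (EOS)
]
outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
if EOS_ID in outputs:
    outputs = outputs[:outputs.index(EOS_ID)]
print(outputs)  # [3, 1]
```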
Test:
Er... pretty bad.
The implementation above does not use any linguistic features (word segmentation, grammar, and so on); it simply brute-forces the bot's "intelligence" out of the data.
Follow-up exercises: Chinese speech recognition, text-to-speech.
Original article: http://blog.csdn.net/u014365862/article/details/53869660