Deep Learning in Practice: NER with BiLSTM or Dilated Convolutions
Table of Contents
- Before You Start:
- What are dilated convolutions?
- What is NER?
- Why can CNNs be used for text processing?
- Overall Architecture
- input data
- embedding layer
- dilated convolution layer or BiLSTM
- BiLSTM
- dilated convolution layer
- projection layer
- dilated convolution classification
- BiLSTM classification
- loss layer
- Labeling the Data and Preprocessing
- Raw data
- Labeling the data
- Set up jieba and build the tagging dictionary
- Tag the data, splitting it into three sets (train, validation, test) along the way; the final output is IOB-format labels
- Convert IOB-format labels to IOBES format
Before You Start:
What are dilated convolutions?
In a plain CNN, each point on the new feature map summarizes a filter window on the old feature map (after pooling). Pooling is downsampling: the feature map keeps shrinking, i.e. the resolution keeps decaying. In other words, we sacrifice resolution to gain receptive field: standing higher lets you see farther, but you can no longer make out the details. Pooling is what causes the resolution decay.
To avoid losing detail, remove the pooling; but removing pooling shrinks the receptive field (compared with the pooled version). This is where dilation comes in. The dilation is the spacing inside the filter window: a 3×3 filter window normally sees 9 adjacent points arranged in a square; with dilation = 2 it still sees 9 points, but now there is one skipped point between each pair of taps, so the window spans 5×5.
Dilated convolutions are generally used to enlarge the receptive field.
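To make the receptive-field arithmetic concrete, here is a minimal plain-Python sketch (the function names are mine, not from any library): the effective span of a size-k kernel with dilation d is k + (k-1)(d-1), and stacking stride-1 convolutions adds the spans up.

```python
def effective_kernel_size(k, d):
    """Effective span of a size-k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)

def stacked_receptive_field(kernel_sizes, dilations):
    """Receptive field after stacking stride-1 conv layers."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += effective_kernel_size(k, d) - 1
    return rf

print(effective_kernel_size(3, 1))                    # 3: an ordinary 3x3 window
print(effective_kernel_size(3, 2))                    # 5: same 9 taps, spread over 5x5
print(stacked_receptive_field([3, 3, 3], [1, 1, 2]))  # 9: one dilation-(1,1,2) block
```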
What is NER?
Named entity recognition: picking out the proper nouns (named entities) in a piece of text. Compare with images: image recognition identifies the objects in a picture; here we identify the entities in a passage.
Why can CNNs be used for text processing?
From the receptive-field point of view: processing text means looking at context. The filter window is a 1×N window, where N is the number of characters in view, and each character's feature_dim plays the role of the filter window's channels.
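As an illustration of this view (all shapes here are hypothetical, chosen to match the 100+20 feature dimension used later):

```python
import tensorflow as tf

# A sentence of 20 characters, each with a 120-dim feature vector, treated as
# a height-1 "image" with 120 channels; the filter window is 1x3, i.e. it
# reads 3 characters at a time.
inputs = tf.placeholder(tf.float32, [None, 1, 20, 120])   # [batch, 1, num_steps, feature_dim]
filters = tf.get_variable("w_demo", [1, 3, 120, 100])     # [1, N, in_channels, num_filters]
conv = tf.nn.conv2d(inputs, filters, strides=[1, 1, 1, 1], padding="SAME")
# conv: [batch, 1, 20, 100] -- each position summarizes a 3-character window
```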
Overall Architecture
input data
This is the labeled text. Concretely, each batch has four components: the segmented original text, the int ids of the characters, the word-segmentation feature of each character (0, 1, 2, 3), and the corresponding labels.
`_, chars, segs, tags = batch` — `chars` and `segs` are used for training; `tags` are used to compute the loss.
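For illustration, one sample might look like this (values made up; the seg feature uses 0 = single-character word, 1 = begin, 2 = middle, 3 = end, following the example in the embedding code below, and the DIS tag name is only illustrative):

```python
sentence = ["高", "血", "糖", "和", "高", "血", "压"]   # the raw characters
chars = [3, 22, 23, 24, 3, 22, 25]                      # vocabulary ids
segs = [1, 2, 3, 0, 1, 2, 3]                            # word-segmentation feature
tags = ["B-DIS", "I-DIS", "E-DIS", "O",
        "B-DIS", "I-DIS", "E-DIS"]                      # IOBES labels (hypothetical tag set)
```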
embedding layer
chars and segs each pass through their own embedding layer, and the two feature sets are concatenated (100 + 20) to form the final feature map.
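A rough shape check for this layer (the 100 and 20 follow the text; batch and num_steps are arbitrary):

```python
# char_inputs: [batch, num_steps] --char_lookup--> [batch, num_steps, 100]
# seg_inputs:  [batch, num_steps] --seg_lookup-->  [batch, num_steps, 20]
# tf.concat(..., axis=-1)  ->  embed: [batch, num_steps, 120]
```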
```python
def embedding_layer(self, char_inputs, seg_inputs, config, name=None):
    """
    :param char_inputs: one-hot encoding of sentence
    :param seg_inputs: segmentation feature
    :param config: whether to use the segmentation feature
    :return: [1, num_steps, embedding size]
    """
    # Example: 高:3 血:22 糖:23 和:24 高:3 血:22 壓:25
    #   char_inputs = [3, 22, 23, 24, 3, 22, 25]
    # 高血糖 / 和 / 高血壓 segment as 高血糖=[1,2,3] 和=[0] 高血壓=[1,2,3]
    #   seg_inputs = [1, 2, 3, 0, 1, 2, 3]
    embedding = []
    self.char_inputs_test = char_inputs
    self.seg_inputs_test = seg_inputs
    with tf.variable_scope("char_embedding" if not name else name), tf.device('/cpu:0'):
        self.char_lookup = tf.get_variable(
            name="char_embedding",
            shape=[self.num_chars, self.char_dim],
            initializer=self.initializer)
        # e.g. the input char '常' has id 8 in the vocabulary;
        # self.char_lookup is a [2677, 100] matrix indexed by char id
        embedding.append(tf.nn.embedding_lookup(self.char_lookup, char_inputs))
        if config["seg_dim"]:
            with tf.variable_scope("seg_embedding"), tf.device('/cpu:0'):
                self.seg_lookup = tf.get_variable(
                    name="seg_embedding",
                    shape=[self.num_segs, self.seg_dim],   # [4, 20]
                    initializer=self.initializer)
                embedding.append(tf.nn.embedding_lookup(self.seg_lookup, seg_inputs))
        embed = tf.concat(embedding, axis=-1)
    self.embed_test = embed
    self.embedding_test = embedding
    return embed
```
dilated convolution layer or BiLSTM
BiLSTM
```python
def biLSTM_layer(self, model_inputs, lstm_dim, lengths, name=None):
    """
    :param lstm_inputs: [batch_size, num_steps, emb_size]
    :return: [batch_size, num_steps, 2*lstm_dim]
    """
    with tf.variable_scope("char_BiLSTM" if not name else name):
        lstm_cell = {}
        for direction in ["forward", "backward"]:
            with tf.variable_scope(direction):
                lstm_cell[direction] = tf.contrib.rnn.CoupledInputForgetGateLSTMCell(
                    lstm_dim,
                    use_peepholes=True,
                    initializer=self.initializer,
                    state_is_tuple=True)
        outputs, final_states = tf.nn.bidirectional_dynamic_rnn(
            lstm_cell["forward"],
            lstm_cell["backward"],
            model_inputs,
            dtype=tf.float32,
            sequence_length=lengths)
    return tf.concat(outputs, axis=2)
```
dilated convolution layer
Shape of the input: [the x-th sentence (batch index), 1, how many words the window sees, each character's feature_dim]
shape of input = [batch, in_height, in_width, in_channels]
Shape of the filter window: [1, how many words the window sees, each character's feature_dim, number of filters]
shape of filter = [filter_height, filter_width, in_channels, out_channels]
First apply one ordinary convolution, then repeat self.repeat_times blocks; each block applies 3 dilated convolutions with dilation = 1, 1, 2, so the receptive field keeps expanding.
Note that dilation = 1 is just an ordinary convolution, but it still enlarges the receptive field: recall that stacked convolutions grow the field of view even without pooling.
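Putting this together, a minimal TF 1.x sketch of such an iterated dilated CNN block might look like the following (parameter names, defaults, and the output concatenation are my reading of the description above, not necessarily the exact code of the original repo):

```python
import tensorflow as tf

def idcnn_block(model_inputs, filter_width=3, num_filter=100,
                repeat_times=4, dilations=(1, 1, 2)):
    # model_inputs: [batch, num_steps, emb_dim]; add a height-1 axis so the
    # sequence is treated as an image of shape [batch, 1, num_steps, emb_dim]
    inputs = tf.expand_dims(model_inputs, 1)
    emb_dim = inputs.get_shape()[-1].value
    # one ordinary convolution first
    w0 = tf.get_variable("idcnn_w0", [1, filter_width, emb_dim, num_filter])
    layer = tf.nn.conv2d(inputs, w0, strides=[1, 1, 1, 1], padding="SAME")
    outputs = []
    for i in range(repeat_times):
        for j, rate in enumerate(dilations):
            # weights are shared across the repeat_times iterations
            with tf.variable_scope("atrous-%d" % j, reuse=(i > 0)):
                w = tf.get_variable("w", [1, filter_width, num_filter, num_filter])
                b = tf.get_variable("b", [num_filter])
                conv = tf.nn.atrous_conv2d(layer, w, rate=rate, padding="SAME")
                layer = tf.nn.relu(tf.nn.bias_add(conv, b))
        outputs.append(layer)
    # concatenate the output of every repetition and drop the height-1 axis
    return tf.squeeze(tf.concat(outputs, axis=-1), [1])
```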
projection layer
The convolutions act as the feature extractor; a fully connected (FC) layer is appended on top for classification.
dilated convolution classification
```python
# Project layer for idcnn by crownpku
# Delete the hidden layer, and change bias initializer
def project_layer_idcnn(self, idcnn_outputs, name=None):
    """
    :param lstm_outputs: [batch_size, num_steps, emb_size]
    :return: [batch_size, num_steps, num_tags]
    """
    with tf.variable_scope("project" if not name else name):
        # project to score of tags
        with tf.variable_scope("logits"):
            W = tf.get_variable("W", shape=[self.cnn_output_width, self.num_tags],
                                dtype=tf.float32, initializer=self.initializer)
            b = tf.get_variable("b", initializer=tf.constant(0.001, shape=[self.num_tags]))
            pred = tf.nn.xw_plus_b(idcnn_outputs, W, b)
        return tf.reshape(pred, [-1, self.num_steps, self.num_tags])
```
BiLSTM classification
```python
def project_layer_bilstm(self, lstm_outputs, name=None):
    """
    hidden layer between lstm layer and logits
    :param lstm_outputs: [batch_size, num_steps, emb_size]
    :return: [batch_size, num_steps, num_tags]
    """
    with tf.variable_scope("project" if not name else name):
        with tf.variable_scope("hidden"):
            W = tf.get_variable("W", shape=[self.lstm_dim * 2, self.lstm_dim],
                                dtype=tf.float32, initializer=self.initializer)
            b = tf.get_variable("b", shape=[self.lstm_dim], dtype=tf.float32,
                                initializer=tf.zeros_initializer())
            output = tf.reshape(lstm_outputs, shape=[-1, self.lstm_dim * 2])
            hidden = tf.tanh(tf.nn.xw_plus_b(output, W, b))
        # project to score of tags
        with tf.variable_scope("logits"):
            W = tf.get_variable("W", shape=[self.lstm_dim, self.num_tags],
                                dtype=tf.float32, initializer=self.initializer)
            b = tf.get_variable("b", shape=[self.num_tags], dtype=tf.float32,
                                initializer=tf.zeros_initializer())
            pred = tf.nn.xw_plus_b(hidden, W, b)
        return tf.reshape(pred, [-1, self.num_steps, self.num_tags])
```
loss layer
For sequence labeling in NLP, a conditional random field (CRF) is typically used on top.
```python
def loss_layer(self, project_logits, lengths, name=None):
    """
    calculate crf loss
    :param project_logits: [1, num_steps, num_tags]
    :return: scalar loss
    """
    with tf.variable_scope("crf_loss" if not name else name):
        small = -1000.0
        # pad logits for crf loss
        start_logits = tf.concat(
            [small * tf.ones(shape=[self.batch_size, 1, self.num_tags]),
             tf.zeros(shape=[self.batch_size, 1, 1])], axis=-1)
        pad_logits = tf.cast(small * tf.ones([self.batch_size, self.num_steps, 1]), tf.float32)
        logits = tf.concat([project_logits, pad_logits], axis=-1)
        logits = tf.concat([start_logits, logits], axis=1)
        targets = tf.concat(
            [tf.cast(self.num_tags * tf.ones([self.batch_size, 1]), tf.int32),
             self.targets], axis=-1)
        self.trans = tf.get_variable(
            "transitions",
            shape=[self.num_tags + 1, self.num_tags + 1],
            initializer=self.initializer)
        # crf_log_likelihood computes the log-likelihood of tag sequences in a CRF.
        #   inputs: a [batch_size, max_seq_len, num_tags] tensor, usually the
        #           (reshaped) BiLSTM output, fed to the CRF layer.
        #   tag_indices: a [batch_size, max_seq_len] matrix of gold labels.
        #   sequence_lengths: a [batch_size] vector with the length of each sequence.
        #   transition_params: a [num_tags, num_tags] transition matrix.
        # Returns:
        #   log_likelihood: a scalar log-likelihood.
        #   transition_params: the [num_tags, num_tags] transition matrix.
        log_likelihood, self.trans = crf_log_likelihood(
            inputs=logits,
            tag_indices=targets,
            transition_params=self.trans,
            sequence_lengths=lengths + 1)
        return tf.reduce_mean(-log_likelihood)
```
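At inference time, the learned transition matrix self.trans is used for Viterbi decoding instead of computing a loss. A minimal sketch (the decode helper and its arguments are my naming; it mirrors the start-tag padding above, and tf.contrib.crf.viterbi_decode operates on NumPy arrays):

```python
import numpy as np
from tensorflow.contrib.crf import viterbi_decode

def decode(logits, lengths, trans, num_tags):
    """logits: [batch, num_steps, num_tags] scores from the projection layer;
    trans: the [num_tags + 1, num_tags + 1] transition matrix."""
    small = -1000.0
    start = np.asarray([[small] * num_tags + [0]])      # the extra start tag
    paths = []
    for score, length in zip(logits, lengths):
        score = score[:length]                          # drop padded time steps
        pad = small * np.ones([length, 1])
        score = np.concatenate([score, pad], axis=1)    # pad column, as in the loss
        score = np.concatenate([start, score], axis=0)  # prepend the start row
        path, _ = viterbi_decode(score, trans)
        paths.append(path[1:])                          # strip the start tag
    return paths
```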
Labeling the Data and Preprocessing
Raw data
The data was crawled from the web and looks like the following:
患者精神狀況好,無發熱,訴右髖部疼痛,飲食差,二便正常,查體:神清,各項生命體征平穩,心肺腹查體未見異常。右髖部壓痛,右下肢皮牽引固定好,無松動,右足背動脈搏動好,足趾感覺運動正常。
(An excerpt from a Chinese clinical progress note: patient in good spirits, no fever, complains of right hip pain, poor appetite, normal bowel and bladder function; on examination, vital signs stable, heart/lung/abdomen unremarkable, right hip tender, right-leg skin traction secure, dorsalis pedis pulse good, toe sensation and movement normal.)
Labeling the data
Set up jieba and build the tagging dictionary
```python
#%% for jieba
dics = csv.reader(open("DICT_NOW.csv", 'r', encoding='utf8'))
#%% get word and class
for row in dics:
    # add each medical term and its label to the jieba dictionary
    if len(row) == 2:
        jieba.add_word(row[0].strip(), tag=row[1].strip())  # add_word keeps the added word from being cut apart
        jieba.suggest_freq(row[0].strip())                  # tune a word's frequency so that it can (or cannot) be segmented out
```
Tag the data, splitting it into three sets (train, validation, test) along the way; the final output is IOB-format labels
```python
for file in os.listdir(c_root):
    if "txtoriginal.txt" in file:
        fp = open(c_root + file, 'r', encoding='utf8')
        for line in fp:
            split_num += 1
            words = pseg.cut(line)       # jieba POS segmentation; custom-dict tags act as the "POS"
            for key, value in words:     # key: the word, value: its tag
                if not (value.strip() and key.strip()):
                    continue
                # split roughly 2/15 dev, 2/15 test, 11/15 train
                if split_num % 15 < 2:
                    out = dev
                elif split_num % 15 < 4:
                    out = test
                else:
                    out = train
                if value.strip() not in biaoji:       # biaoji: the set of entity labels
                    value = 'O'
                    for achar in key.strip():
                        if achar.strip() in fuhao:    # fuhao: sentence-ending punctuation
                            # a blank line after punctuation marks a sentence boundary
                            out.write(achar + " " + value + "\n" + "\n")
                        elif achar.strip():
                            out.write(achar + " " + value + "\n")
                else:
                    # first character gets B-, the rest get I- (IOB format)
                    for k, char in enumerate(key.strip()):
                        prefix = 'B-' if k == 0 else 'I-'
                        out.write(char + ' ' + prefix + value.strip() + '\n')
```
Convert IOB-format labels to IOBES format
```python
# Use selected tagging scheme (IOB / IOBES)
# I: inside, O: outside, B: begin | E: end, S: single
# call sites (the function itself lives with the data-loading code):
update_tag_scheme(train_sentences, FLAGS.tag_schema)
update_tag_scheme(test_sentences, FLAGS.tag_schema)
update_tag_scheme(dev_sentences, FLAGS.tag_schema)

def update_tag_scheme(sentences, tag_scheme):
    """
    Check and update sentences tagging scheme to IOB2.
    Only IOB1 and IOB2 schemes are accepted.
    """
    for i, s in enumerate(sentences):
        tags = [w[-1] for w in s]
        # check that tags are given in the IOB format
        if not iob2(tags):
            s_str = '\n'.join(' '.join(w) for w in s)
            raise Exception('Sentences should be given in IOB format! ' +
                            'Please check sentence %i:\n%s' % (i, s_str))
        if tag_scheme == 'iob':
            # if the format was IOB1, we convert to IOB2
            for word, new_tag in zip(s, tags):
                word[-1] = new_tag
        elif tag_scheme == 'iobes':
            new_tags = iob_iobes(tags)
            for word, new_tag in zip(s, new_tags):
                word[-1] = new_tag
        else:
            raise Exception('Unknown tagging scheme!')
```
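The iob_iobes helper called above is not shown in the post; a standard implementation looks like this (a sketch of the usual conversion: a B with no following I becomes S, and the last I of an entity becomes E):

```python
def iob_iobes(tags):
    """Convert an IOB2 tag sequence to IOBES."""
    new_tags = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            new_tags.append(tag)
        elif tag.split('-')[0] == 'B':
            if i + 1 < len(tags) and tags[i + 1].split('-')[0] == 'I':
                new_tags.append(tag)                    # multi-char entity: keep B-
            else:
                new_tags.append(tag.replace('B-', 'S-'))  # single-char entity
        elif tag.split('-')[0] == 'I':
            if i + 1 < len(tags) and tags[i + 1].split('-')[0] == 'I':
                new_tags.append(tag)                    # still inside the entity
            else:
                new_tags.append(tag.replace('I-', 'E-'))  # last char of the entity
        else:
            raise Exception('Invalid IOB format!')
    return new_tags
```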