Day03『NLP打卡营』Practice Lesson 3: Extracting Express Waybill Information with a Pretrained Model
Day03 詞法分析作業(yè)輔導
本教程旨在輔導同學如何完成 AI Studio課程——『NLP打卡營』實踐課3:使用預(yù)訓練模型實現(xiàn)快遞單信息抽取
課后作業(yè)。
1. Switching the pretrained model
Look up the Transformer pretrained models that PaddleNLP supports in the PaddleNLP Transformer API documentation. Pick one of them, such as bert-base-chinese; then you only need to change

```python
from paddlenlp.transformers import ErnieTokenizer, ErnieForTokenClassification

model = ErnieForTokenClassification.from_pretrained(
    "ernie-1.0", num_classes=len(label_vocab))
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
```

to

```python
from paddlenlp.transformers import BertTokenizer, BertForTokenClassification

model = BertForTokenClassification.from_pretrained(
    "bert-base-chinese", num_classes=len(label_vocab))
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
```

That is all it takes to switch the pretrained model from ernie-1.0 to bert-base-chinese.
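If you plan to experiment with several checkpoints, a convenient variation (a sketch, assuming a PaddleNLP version that ships the Auto classes, roughly 2.2 or later) is to let the library resolve the architecture from the model name, so switching models only means editing one string:

```python
# Sketch: resolve tokenizer/model classes from the checkpoint name.
# Assumes a PaddleNLP version that provides the Auto* classes (>= 2.2).
from paddlenlp.transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "bert-base-chinese"  # swap in any supported checkpoint here
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_classes=len(label_vocab))
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
```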
2. Switching the dataset
PaddleNLP bundles a series of sequence labeling datasets that can be downloaded and loaded with a single API call. Here we pick the MSRA_NER dataset. Change

```python
def load_dataset(datafiles):
    def read(data_path):
        with open(data_path, 'r', encoding='utf-8') as fp:
            next(fp)  # Skip header
            for line in fp.readlines():
                words, labels = line.strip('\n').split('\t')
                words = words.split('\002')
                labels = labels.split('\002')
                yield words, labels

    if isinstance(datafiles, str):
        return MapDataset(list(read(datafiles)))
    elif isinstance(datafiles, (list, tuple)):
        return [MapDataset(list(read(datafile))) for datafile in datafiles]

# Create dataset, tokenizer and dataloader.
train_ds, dev_ds, test_ds = load_dataset(datafiles=(
    './data/train.txt', './data/dev.txt', './data/test.txt'))
```

to

```python
from paddlenlp.datasets import load_dataset

# MSRA_NER has no dev split, so we load the test split a second time as dev_ds.
train_ds, dev_ds, test_ds = load_dataset(
    'msra_ner', splits=('train', 'test', 'test'), lazy=False)

# Note: delete the line `label_vocab = load_dict('./data/tag.dic')`.
label_vocab = {label: label_id for label_id, label in enumerate(train_ds.label_list)}
```
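Before touching the preprocessing, it helps to print one example and see the schema of the new dataset; a quick sanity-check sketch (the tokens/labels field names are the ones convert_example below consumes):

```python
# Sketch: inspect the freshly loaded MSRA_NER splits.
print(train_ds.label_list)     # the BIO tag set, e.g. PER/ORG/LOC tags plus 'O'
example = train_ds[0]
print(example['tokens'][:10])  # a list of characters
print(example['labels'][:10])  # integer label ids aligned with the tokens
```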
為了適配該數(shù)據(jù)集,我們還需要修改數(shù)據(jù)預(yù)處理代碼,修改utils.py中的convert_example函數(shù)為:
```python
def convert_example(example, tokenizer, label_vocab, max_seq_len=128):
    labels = example['labels']
    example = example['tokens']
    no_entity_id = label_vocab['O']
    tokenized_input = tokenizer(
        example,
        return_length=True,
        is_split_into_words=True,
        max_seq_len=max_seq_len)
    # -2 for [CLS] and [SEP]
    if len(tokenized_input['input_ids']) - 2 < len(labels):
        labels = labels[:len(tokenized_input['input_ids']) - 2]
    tokenized_input['labels'] = [no_entity_id] + labels + [no_entity_id]
    tokenized_input['labels'] += [no_entity_id] * (
        len(tokenized_input['input_ids']) - len(tokenized_input['labels']))
    return (tokenized_input['input_ids'], tokenized_input['token_type_ids'],
            tokenized_input['seq_len'], tokenized_input['labels'])
```
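As in the original notebook, the adapted convert_example is then bound to the tokenizer and label vocabulary and applied to every split; a minimal sketch of that wiring (max_seq_len=128 simply repeats the function's default for clarity):

```python
# Sketch: bind convert_example to its resources and apply it to each split.
from functools import partial

trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    label_vocab=label_vocab,
    max_seq_len=128)

train_ds.map(trans_func)
dev_ds.map(trans_func)
test_ds.map(trans_func)
```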
不同于快遞單數(shù)據(jù)集,MSRA_NER數(shù)據(jù)集的標注采用的是’BIO’在前的標注方式,因此還需要修改utils.py中的parse_decodes函數(shù)為:
```python
def parse_decodes(ds, decodes, lens, label_vocab):
    decodes = [x for batch in decodes for x in batch]
    lens = [x for batch in lens for x in batch]
    id_label = dict(zip(label_vocab.values(), label_vocab.keys()))

    outputs = []
    for idx, end in enumerate(lens):
        sent = ds.data[idx]['tokens'][:end]
        tags = [id_label[x] for x in decodes[idx][1:end]]
        sent_out = []
        tags_out = []
        words = ""
        for s, t in zip(sent, tags):
            # A 'B-' tag or 'O' closes the current entity and starts a new one.
            if t.startswith('B-') or t == 'O':
                if len(words):
                    sent_out.append(words)
                if t.startswith('B-'):
                    tags_out.append(t.split('-')[1])
                else:
                    tags_out.append(t)
                words = s
            else:
                words += s
        if len(sent_out) < len(tags_out):
            sent_out.append(words)
        outputs.append(''.join([str((s, t)) for s, t in zip(sent_out, tags_out)]))
    return outputs
```
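For reference, here is a sketch of how parse_decodes is typically driven, mirroring the predict helper from the course notebook; the unpacking order of each batch follows the return order of convert_example above:

```python
# Sketch: run the model over a data loader and decode the predictions.
import paddle

def predict(model, data_loader, ds, label_vocab):
    pred_list, len_list = [], []
    model.eval()
    for input_ids, token_type_ids, seq_lens, labels in data_loader:
        logits = model(input_ids, token_type_ids)  # [batch, seq_len, num_classes]
        pred = paddle.argmax(logits, axis=-1)      # best tag id per token
        pred_list.append(pred.numpy())
        len_list.append(seq_lens.numpy())
    return parse_decodes(ds, pred_list, len_list, label_vocab)
```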