Day03『NLP打卡营』Practice Lesson 3: Extracting Express Waybill Information with a Pretrained Model
Day03 詞法分析作業(yè)輔導
本教程旨在輔導同學如何完成 AI Studio課程——『NLP打卡營』實踐課3:使用預(yù)訓練模型實現(xiàn)快遞單信息抽取
課后作業(yè)。
1. Switching the pretrained model
Look up the Transformer pretrained models that PaddleNLP supports in the PaddleNLP Transformer API documentation. Pick one of them, such as bert-base-chinese; then you only need to change

```python
from paddlenlp.transformers import ErnieTokenizer, ErnieForTokenClassification

model = ErnieForTokenClassification.from_pretrained(
    "ernie-1.0", num_classes=len(label_vocab))
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
```

to

```python
from paddlenlp.transformers import BertTokenizer, BertForTokenClassification

model = BertForTokenClassification.from_pretrained(
    "bert-base-chinese", num_classes=len(label_vocab))
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
```

That is all it takes to switch the pretrained model from ernie-1.0 to bert-base-chinese.
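If you plan to experiment with several checkpoints, a convenient variation (a sketch, assuming a PaddleNLP version that ships the Auto classes, roughly 2.2 or later) is to let the library resolve the architecture from the model name, so switching models only means editing one string:

```python
# Sketch: resolve tokenizer/model classes from the checkpoint name.
# Assumes a PaddleNLP version that provides the Auto* classes (>= 2.2).
from paddlenlp.transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "bert-base-chinese"  # swap in any supported checkpoint here
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_classes=len(label_vocab))
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
```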
2. Switching the dataset
PaddleNLP bundles a series of sequence labeling datasets that can be downloaded and loaded with a single API call. Here we pick the MSRA_NER dataset. Change

```python
def load_dataset(datafiles):
    def read(data_path):
        with open(data_path, 'r', encoding='utf-8') as fp:
            next(fp)  # Skip header
            for line in fp.readlines():
                words, labels = line.strip('\n').split('\t')
                words = words.split('\002')
                labels = labels.split('\002')
                yield words, labels

    if isinstance(datafiles, str):
        return MapDataset(list(read(datafiles)))
    elif isinstance(datafiles, (list, tuple)):
        return [MapDataset(list(read(datafile))) for datafile in datafiles]

# Create dataset, tokenizer and dataloader.
train_ds, dev_ds, test_ds = load_dataset(datafiles=(
    './data/train.txt', './data/dev.txt', './data/test.txt'))
```

to

```python
from paddlenlp.datasets import load_dataset

# MSRA_NER has no dev split, so we load the test split a second time as dev_ds.
train_ds, dev_ds, test_ds = load_dataset(
    'msra_ner', splits=('train', 'test', 'test'), lazy=False)

# Note: delete the line `label_vocab = load_dict('./data/tag.dic')`.
label_vocab = {label: label_id for label_id, label in enumerate(train_ds.label_list)}
```
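Before touching the preprocessing, it helps to print one example and see the schema of the new dataset; a quick sanity-check sketch (the tokens/labels field names are the ones convert_example below consumes):

```python
# Sketch: inspect the freshly loaded MSRA_NER splits.
print(train_ds.label_list)     # the BIO tag set, e.g. PER/ORG/LOC tags plus 'O'
example = train_ds[0]
print(example['tokens'][:10])  # a list of characters
print(example['labels'][:10])  # integer label ids aligned with the tokens
```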
為了適配該數(shù)據(jù)集,我們還需要修改數(shù)據(jù)預(yù)處理代碼,修改utils.py中的convert_example函數(shù)為:
```python
def convert_example(example, tokenizer, label_vocab, max_seq_len=128):
    labels = example['labels']
    example = example['tokens']
    no_entity_id = label_vocab['O']
    tokenized_input = tokenizer(
        example,
        return_length=True,
        is_split_into_words=True,
        max_seq_len=max_seq_len)
    # -2 for [CLS] and [SEP]
    if len(tokenized_input['input_ids']) - 2 < len(labels):
        labels = labels[:len(tokenized_input['input_ids']) - 2]
    tokenized_input['labels'] = [no_entity_id] + labels + [no_entity_id]
    tokenized_input['labels'] += [no_entity_id] * (
        len(tokenized_input['input_ids']) - len(tokenized_input['labels']))
    return (tokenized_input['input_ids'], tokenized_input['token_type_ids'],
            tokenized_input['seq_len'], tokenized_input['labels'])
```
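As in the original notebook, the adapted convert_example is then bound to the tokenizer and label vocabulary and applied to every split; a minimal sketch of that wiring (max_seq_len=128 simply repeats the function's default for clarity):

```python
# Sketch: bind convert_example to its resources and apply it to each split.
from functools import partial

trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    label_vocab=label_vocab,
    max_seq_len=128)

train_ds.map(trans_func)
dev_ds.map(trans_func)
test_ds.map(trans_func)
```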
不同于快遞單數(shù)據(jù)集,MSRA_NER數(shù)據(jù)集的標注采用的是’BIO’在前的標注方式,因此還需要修改utils.py中的parse_decodes函數(shù)為:
```python
def parse_decodes(ds, decodes, lens, label_vocab):
    decodes = [x for batch in decodes for x in batch]
    lens = [x for batch in lens for x in batch]
    id_label = dict(zip(label_vocab.values(), label_vocab.keys()))

    outputs = []
    for idx, end in enumerate(lens):
        sent = ds.data[idx]['tokens'][:end]
        tags = [id_label[x] for x in decodes[idx][1:end]]
        sent_out = []
        tags_out = []
        words = ""
        for s, t in zip(sent, tags):
            # A 'B-' tag or 'O' closes the current entity and starts a new one.
            if t.startswith('B-') or t == 'O':
                if len(words):
                    sent_out.append(words)
                if t.startswith('B-'):
                    tags_out.append(t.split('-')[1])
                else:
                    tags_out.append(t)
                words = s
            else:
                words += s
        if len(sent_out) < len(tags_out):
            sent_out.append(words)
        outputs.append(''.join([str((s, t)) for s, t in zip(sent_out, tags_out)]))
    return outputs
```
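For reference, here is a sketch of how parse_decodes is typically driven, mirroring the predict helper from the course notebook; the unpacking order of each batch follows the return order of convert_example above:

```python
# Sketch: run the model over a data loader and decode the predictions.
import paddle

def predict(model, data_loader, ds, label_vocab):
    pred_list, len_list = [], []
    model.eval()
    for input_ids, token_type_ids, seq_lens, labels in data_loader:
        logits = model(input_ids, token_type_ids)  # [batch, seq_len, num_classes]
        pred = paddle.argmax(logits, axis=-1)      # best tag id per token
        pred_list.append(pred.numpy())
        len_list.append(seq_lens.numpy())
    return parse_decodes(ds, pred_list, len_list, label_vocab)
```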