CLUENER Fine-Grained Named Entity Recognition Baseline: BiLSTM-CRF
Contents
- Data Categories
- Label Definitions & Annotation Rules
- Download Links
- Data Distribution
- Data Fields
- Data Source
- Baseline: BiLSTM-CRF
- Running the Code
- References
Named Entity Recognition (NER) is a subtask of information extraction whose goal is to locate named entities in text and classify them into predefined categories such as persons, organizations, and locations. It is a key step in structured information extraction and a foundational technology for applications such as information extraction, question answering, and syntactic parsing.
Publicly available, high-quality, fine-grained Chinese NER datasets are still scarce. The CLUE team built one by selecting a subset of THUCNEWS, the text classification dataset open-sourced by Tsinghua University, annotating it with fine-grained named entities, and cleaning the result.
Project page: https://github.com/CLUEbenchmark/CLUENER2020
For details, see the paper: CLUENER2020: Fine-grained Named Entity Recognition Dataset and Benchmark for Chinese.
A brief introduction to the dataset follows.
Data Categories
The data covers 10 label categories: address, book, company, game, government, movie, name, organization, position, and scene.

Label Definitions & Annotation Rules
- address: patterns like "X Province, X City, X District, X Street, No. X", as well as roads, streets, villages, etc. (labeled even when they appear alone). Addresses are labeled as completely and as fine-grained as possible.
- book: novels, magazines, workbooks, textbooks, study guides, atlases, cookbooks — the kinds of books sold in a bookstore, including e-books.
- company: "X Company", "X Group", "X Bank" (except the central bank / People's Bank of China, which count as government bodies); e.g. New Oriental; also includes Xinhuanet, China Military Online, and the like.
- game: common games; note that some games are adapted from novels or TV series, so judge from the specific context whether the mention really refers to the game.
- government: both central and local administrative organs. Central organs include the State Council and its constituent departments (ministries, commissions, the People's Bank of China, and the National Audit Office), agencies directly under the State Council (customs, taxation, industry and commerce, the environmental protection administration, etc.), and the military.
- movie: films, including documentaries screened in cinemas; when a film is adapted from a book, use the context to decide whether the mention is the film title or the book title.
- name: personal names, including characters in fiction (Song Jiang, Wu Song, Guo Jing), their nicknames in fiction (Timely Rain, Flowery Monk), and aliases of famous people when the alias maps to a specific person.
- organization: basketball teams, football teams, orchestras, clubs, and so on; also fictional sects such as Shaolin Temple, the Beggars' Sect, the Iron Palm Clan, Wudang, and Emei.
- position: historical titles (xunfu/provincial governor, zhizhou/prefect, guoshi/state preceptor) and modern ones (general manager, reporter, CEO, artist, collector, etc.).
- scene: common tourist attractions such as Changsha Park, Shenzhen Zoo, aquariums, botanical gardens, the Yellow River, the Yangtze River, etc.
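For sequence labeling, these 10 categories are expanded into per-character tags. Below is a minimal sketch of building such a tag set; it is illustrative only, and the exact label2id in the baseline repository may differ (for example, it may add padding or start/stop tags):

```python
# Illustrative BIOS tag set built from the 10 CLUENER categories.
# B- marks an entity's first character, I- a continuation, S- a
# single-character entity, and O any character outside an entity.
categories = ["address", "book", "company", "game", "government",
              "movie", "name", "organization", "position", "scene"]
tags = ["O"] + [f"{prefix}-{c}" for c in categories for prefix in ("B", "I", "S")]
label2id = {tag: i for i, tag in enumerate(tags)}
print(len(label2id))  # 31 tags under this BIOS scheme
```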
Download Links

The dataset can be downloaded from the CLUENER2020 repository linked above.
Data Distribution
Training set: 10,748 examples; dev set: 1,343 examples. Per-label counts are below (note: every entity in an example is annotated, so an example containing two address entities contributes two to the address count):

| Label | Train | Dev |
|---|---|---|
| address | 2829 | 364 |
| book | 1131 | 152 |
| company | 2897 | 366 |
| game | 2325 | 287 |
| government | 1797 | 244 |
| movie | 1109 | 150 |
| name | 3661 | 451 |
| organization | 3075 | 344 |
| position | 3052 | 425 |
| scene | 1462 | 199 |
Data Fields

Taking train.json as an example, each line has two fields, text and label: text is the sentence itself, and label lists every entity in the sentence that belongs to one of the 10 categories. For example:

```json
{"text": "北京勘察設計協會副會長兼秘書長周蔭如", "label": {"organization": {"北京勘察設計協會": [[0, 7]]}, "name": {"周蔭如": [[15, 17]]}, "position": {"副會長": [[8, 10]], "秘書長": [[12, 14]]}}}
```

Here organization, name, and position are entity categories. `"organization": {"北京勘察設計協會": [[0, 7]]}` means that in the text, "北京勘察設計協會" is an organization entity with start_index 0 and end_index 7 (indices are 0-based and inclusive). Similarly, "周蔭如" is a name entity at [15, 17], while "副會長" ([8, 10]) and "秘書長" ([12, 14]) are both position entities.
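A minimal sketch of reading one such line and verifying that the offsets are inclusive, 0-based character indices (plain Python, no repository code assumed):

```python
import json

line = ('{"text": "北京勘察設計協會副會長兼秘書長周蔭如", '
        '"label": {"organization": {"北京勘察設計協會": [[0, 7]]}, '
        '"name": {"周蔭如": [[15, 17]]}, '
        '"position": {"副會長": [[8, 10]], "秘書長": [[12, 14]]}}}')
sample = json.loads(line)
text = sample["text"]
for category, entities in sample["label"].items():
    for entity, spans in entities.items():
        for start, end in spans:
            # end_index is inclusive, so slice with end + 1
            assert text[start:end + 1] == entity
            print(category, entity, (start, end))
```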
Data Source

The data builds on THUCTC, the text classification dataset open-sourced by Tsinghua University: a subset was selected and annotated with fine-grained named entities. The raw text originally comes from Sina News RSS.
Baseline: BiLSTM-CRF

Environment requirements:
- PyTorch 1.12
- Python 3.7
The main model code:
```python
from torch.nn import LayerNorm
import torch.nn as nn
from crf import CRF


class SpatialDropout(nn.Dropout2d):
    def __init__(self, p=0.6):
        super(SpatialDropout, self).__init__(p=p)

    def forward(self, x):
        x = x.unsqueeze(2)                          # (N, T, 1, K)
        x = x.permute(0, 3, 2, 1)                   # (N, K, 1, T)
        x = super(SpatialDropout, self).forward(x)  # (N, K, 1, T), some features are masked
        x = x.permute(0, 3, 2, 1)                   # (N, T, 1, K)
        x = x.squeeze(2)                            # (N, T, K)
        return x


class NERModel(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_size, label2id, device, drop_p=0.1):
        super(NERModel, self).__init__()
        self.embedding_size = embedding_size
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.bilstm = nn.LSTM(input_size=embedding_size, hidden_size=hidden_size,
                              batch_first=True, num_layers=2, dropout=drop_p,
                              bidirectional=True)
        self.dropout = SpatialDropout(drop_p)
        self.layer_norm = LayerNorm(hidden_size * 2)
        # project the BiLSTM output to per-tag emission scores for the CRF
        self.classifier = nn.Linear(hidden_size * 2, len(label2id))
        self.crf = CRF(tagset_size=len(label2id), tag_dictionary=label2id, device=device)

    def forward(self, inputs_ids, input_mask):
        embs = self.embedding(inputs_ids)
        embs = self.dropout(embs)
        embs = embs * input_mask.float().unsqueeze(2)
        sequence_output, _ = self.bilstm(embs)
        sequence_output = self.layer_norm(sequence_output)
        features = self.classifier(sequence_output)
        return features

    def forward_loss(self, input_ids, input_mask, input_lens, input_tags=None):
        features = self.forward(input_ids, input_mask)
        if input_tags is not None:
            return features, self.crf.calculate_loss(features, tag_list=input_tags, lengths=input_lens)
        else:
            return features
```
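One detail worth highlighting: SpatialDropout does not mask individual positions the way ordinary dropout would. By routing the (N, T, K) tensor through nn.Dropout2d, it drops entire embedding channels for the whole sequence at once, which tends to regularize sequence models better. A quick demonstration using the class defined above:

```python
import torch

torch.manual_seed(0)
x = torch.ones(1, 4, 6)        # (batch, seq_len, embedding_dim)
drop = SpatialDropout(p=0.5)
drop.train()                   # dropout is only active in training mode
y = drop(x)
print(y[0])
# Each embedding dimension is either zeroed for all 4 timesteps or kept
# (and rescaled by 1/(1-p) = 2.0): whole columns survive or vanish together.
```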
The training and evaluation code (together with prediction and the entry point):

```python
import json
import torch
import argparse
import torch.nn as nn
from torch import optim
import config
from model import NERModel
from dataset_loader import DatasetLoader
from progressbar import ProgressBar
from ner_metrics import SeqEntityScore
from data_processor import CluenerProcessor
from lr_scheduler import ReduceLROnPlateau
from utils_ner import get_entities
from common import (init_logger, logger, json_to_text, load_model,
                    AverageMeter, seed_everything)


def train(args, model, processor):
    train_dataset = load_and_cache_examples(args, processor, data_type='train')
    train_loader = DatasetLoader(data=train_dataset, batch_size=args.batch_size,
                                 shuffle=False, seed=args.seed, sort=True,
                                 vocab=processor.vocab, label2id=args.label2id)
    parameters = [p for p in model.parameters() if p.requires_grad]
    optimizer = optim.Adam(parameters, lr=args.learning_rate)
    scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=3,
                                  verbose=1, epsilon=1e-4, cooldown=0, min_lr=0, eps=1e-8)
    best_f1 = 0
    for epoch in range(1, 1 + args.epochs):
        print(f"Epoch {epoch}/{args.epochs}")
        pbar = ProgressBar(n_total=len(train_loader), desc='Training')
        train_loss = AverageMeter()
        model.train()
        assert model.training
        for step, batch in enumerate(train_loader):
            input_ids, input_mask, input_tags, input_lens = batch
            input_ids = input_ids.to(args.device)
            input_mask = input_mask.to(args.device)
            input_tags = input_tags.to(args.device)
            features, loss = model.forward_loss(input_ids, input_mask, input_lens, input_tags)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), args.grad_norm)
            optimizer.step()
            optimizer.zero_grad()
            pbar(step=step, info={'loss': loss.item()})
            train_loss.update(loss.item(), n=1)
        print(" ")
        train_log = {'loss': train_loss.avg}
        if 'cuda' in str(args.device):
            torch.cuda.empty_cache()
        eval_log, class_info = evaluate(args, model, processor)
        logs = dict(train_log, **eval_log)
        show_info = f'\nEpoch: {epoch} - ' + "-".join([f' {key}: {value:.4f} ' for key, value in logs.items()])
        logger.info(show_info)
        scheduler.epoch_step(logs['eval_f1'], epoch)
        if logs['eval_f1'] > best_f1:
            logger.info(f"\nEpoch {epoch}: eval_f1 improved from {best_f1} to {logs['eval_f1']}")
            logger.info("save model to disk.")
            best_f1 = logs['eval_f1']
            if isinstance(model, nn.DataParallel):
                model_stat_dict = model.module.state_dict()
            else:
                model_stat_dict = model.state_dict()
            state = {'epoch': epoch, 'arch': args.arch, 'state_dict': model_stat_dict}
            model_path = args.output_dir / 'best-model.bin'
            torch.save(state, str(model_path))
            print("Eval Entity Score: ")
            for key, value in class_info.items():
                info = f"Subject: {key} - Acc: {value['acc']} - Recall: {value['recall']} - F1: {value['f1']}"
                logger.info(info)


def evaluate(args, model, processor):
    eval_dataset = load_and_cache_examples(args, processor, data_type='dev')
    eval_dataloader = DatasetLoader(data=eval_dataset, batch_size=args.batch_size,
                                    shuffle=False, seed=args.seed, sort=False,
                                    vocab=processor.vocab, label2id=args.label2id)
    pbar = ProgressBar(n_total=len(eval_dataloader), desc="Evaluating")
    metric = SeqEntityScore(args.id2label, markup=args.markup)
    eval_loss = AverageMeter()
    model.eval()
    with torch.no_grad():
        for step, batch in enumerate(eval_dataloader):
            input_ids, input_mask, input_tags, input_lens = batch
            input_ids = input_ids.to(args.device)
            input_mask = input_mask.to(args.device)
            input_tags = input_tags.to(args.device)
            features, loss = model.forward_loss(input_ids, input_mask, input_lens, input_tags)
            eval_loss.update(val=loss.item(), n=input_ids.size(0))
            tags, _ = model.crf._obtain_labels(features, args.id2label, input_lens)
            input_tags = input_tags.cpu().numpy()
            target = [input_[:len_] for input_, len_ in zip(input_tags, input_lens)]
            metric.update(pred_paths=tags, label_paths=target)
            pbar(step=step)
    print(" ")
    eval_info, class_info = metric.result()
    eval_info = {f'eval_{key}': value for key, value in eval_info.items()}
    result = {'eval_loss': eval_loss.avg}
    result = dict(result, **eval_info)
    return result, class_info


def predict(args, model, processor):
    model_path = args.output_dir / 'best-model.bin'
    model = load_model(model, model_path=str(model_path))
    test_data = []
    with open(str(args.data_dir / "test.json"), 'r') as f:
        idx = 0
        for line in f:
            json_d = {}
            line = json.loads(line.strip())
            text = line['text']
            words = list(text)
            labels = ['O'] * len(words)
            json_d['id'] = idx
            json_d['context'] = " ".join(words)
            json_d['tag'] = " ".join(labels)
            json_d['raw_context'] = "".join(words)
            idx += 1
            test_data.append(json_d)
    pbar = ProgressBar(n_total=len(test_data))
    results = []
    for step, line in enumerate(test_data):
        token_a = line['context'].split(" ")
        input_ids = [processor.vocab.to_index(w) for w in token_a]
        input_mask = [1] * len(token_a)
        input_lens = [len(token_a)]
        model.eval()
        with torch.no_grad():
            input_ids = torch.tensor([input_ids], dtype=torch.long)
            input_mask = torch.tensor([input_mask], dtype=torch.long)
            input_lens = torch.tensor([input_lens], dtype=torch.long)
            input_ids = input_ids.to(args.device)
            input_mask = input_mask.to(args.device)
            features = model.forward_loss(input_ids, input_mask, input_lens, input_tags=None)
            tags, _ = model.crf._obtain_labels(features, args.id2label, input_lens)
        label_entities = get_entities(tags[0], args.id2label)
        json_d = {}
        json_d['id'] = step
        json_d['tag_seq'] = " ".join(tags[0])
        json_d['entities'] = label_entities
        results.append(json_d)
        pbar(step=step)
    print(" ")
    output_predic_file = str(args.output_dir / "test_prediction.json")
    output_submit_file = str(args.output_dir / "test_submit.json")
    with open(output_predic_file, "w") as writer:
        for record in results:
            writer.write(json.dumps(record) + '\n')
    test_text = []
    with open(str(args.data_dir / 'test.json'), 'r') as fr:
        for line in fr:
            test_text.append(json.loads(line))
    test_submit = []
    for x, y in zip(test_text, results):
        json_d = {}
        json_d['id'] = x['id']
        json_d['label'] = {}
        entities = y['entities']
        words = list(x['text'])
        if len(entities) != 0:
            for subject in entities:
                tag = subject[0]
                start = subject[1]
                end = subject[2]
                word = "".join(words[start:end + 1])
                if tag in json_d['label']:
                    if word in json_d['label'][tag]:
                        json_d['label'][tag][word].append([start, end])
                    else:
                        json_d['label'][tag][word] = [[start, end]]
                else:
                    json_d['label'][tag] = {}
                    json_d['label'][tag][word] = [[start, end]]
        test_submit.append(json_d)
    json_to_text(output_submit_file, test_submit)


def load_and_cache_examples(args, processor, data_type='train'):
    # Load data features from cache or dataset file
    cached_examples_file = args.data_dir / 'cached_crf-{}_{}_{}'.format(
        data_type, args.arch, str(args.task_name))
    if cached_examples_file.exists():
        logger.info("Loading features from cached file %s", cached_examples_file)
        examples = torch.load(cached_examples_file)
    else:
        logger.info("Creating features from dataset file at %s", args.data_dir)
        if data_type == 'train':
            examples = processor.get_train_examples()
        elif data_type == 'dev':
            examples = processor.get_dev_examples()
        logger.info("Saving features into cached file %s", cached_examples_file)
        torch.save(examples, str(cached_examples_file))
    return examples


def main():
    parser = argparse.ArgumentParser()
    # Required parameters
    parser.add_argument("--do_train", default=False, action='store_true')
    parser.add_argument('--do_eval', default=False, action='store_true')
    parser.add_argument("--do_predict", default=False, action='store_true')
    parser.add_argument('--markup', default='bios', type=str, choices=['bios', 'bio'])
    parser.add_argument("--arch", default='bilstm_crf', type=str)
    parser.add_argument('--learning_rate', default=0.001, type=float)
    parser.add_argument('--seed', default=1234, type=int)
    parser.add_argument('--gpu', default='0', type=str)
    parser.add_argument('--epochs', default=50, type=int)
    parser.add_argument('--batch_size', default=32, type=int)
    parser.add_argument('--embedding_size', default=128, type=int)
    parser.add_argument('--hidden_size', default=384, type=int)
    parser.add_argument("--grad_norm", default=5.0, type=float, help="Max gradient norm.")
    parser.add_argument("--task_name", type=str, default='ner')
    args = parser.parse_args()
    args.data_dir = config.data_dir
    if not config.output_dir.exists():
        config.output_dir.mkdir()
    args.output_dir = config.output_dir / '{}'.format(args.arch)
    if not args.output_dir.exists():
        args.output_dir.mkdir()
    init_logger(log_file=str(args.output_dir / '{}-{}.log'.format(args.arch, args.task_name)))
    seed_everything(args.seed)
    if args.gpu != '':
        args.device = torch.device(f"cuda:{args.gpu}")
    else:
        args.device = torch.device("cpu")
    args.id2label = {i: label for i, label in enumerate(config.label2id)}
    args.label2id = config.label2id
    processor = CluenerProcessor(data_dir=config.data_dir)
    processor.get_vocab()
    model = NERModel(vocab_size=len(processor.vocab), embedding_size=args.embedding_size,
                     hidden_size=args.hidden_size, device=args.device, label2id=args.label2id)
    model.to(args.device)
    if args.do_train:
        train(args, model, processor)
    if args.do_eval:
        model_path = args.output_dir / 'best-model.bin'
        model = load_model(model, model_path=str(model_path))
        evaluate(args, model, processor)
    if args.do_predict:
        predict(args, model, processor)


if __name__ == "__main__":
    main()
```
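The crf module imported above comes from the baseline repository; its `_obtain_labels` runs Viterbi decoding over the BiLSTM's emission scores plus a learned tag-transition matrix. As a rough illustration of what that decoding does (a generic sketch, not the repository's actual implementation):

```python
import torch


def viterbi_decode(emissions: torch.Tensor, transitions: torch.Tensor) -> list:
    """Generic Viterbi sketch: emissions (seq_len, num_tags) are per-character
    tag scores; transitions[i, j] scores moving from tag i to tag j."""
    seq_len, _ = emissions.shape
    score = emissions[0]      # best score of any path ending in each tag at step 0
    history = []
    for t in range(1, seq_len):
        # candidate[i, j]: best path ending in tag i, then moving to tag j at step t
        candidate = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, best_prev = candidate.max(dim=0)
        history.append(best_prev)
    best_tag = int(score.argmax())
    path = [best_tag]
    for best_prev in reversed(history):   # backtrack to recover the full path
        best_tag = int(best_prev[best_tag])
        path.append(best_tag)
    path.reverse()
    return path
```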
Running the Code

Run the following command to train the model:
```bash
python run_lstm_crf.py --do_train
```

After about four hours of training, the run completed. After 50 epochs, eval_f1 reached 0.7234823215476984.
Per-category evaluation results (a quick consistency check on these numbers follows the list):
- name - Acc: 0.7734 - Recall: 0.7634 - F1: 0.7684
- address - Acc: 0.542 - Recall: 0.5013 - F1: 0.5209
- movie - Acc: 0.7447 - Recall: 0.6954 - F1: 0.7192
- position - Acc: 0.787 - Recall: 0.7252 - F1: 0.7548
- organization - Acc: 0.8058 - Recall: 0.7575 - F1: 0.7809
- company - Acc: 0.7688 - Recall: 0.7302 - F1: 0.749
- scene - Acc: 0.6568 - Recall: 0.5311 - F1: 0.5873
- government - Acc: 0.7378 - Recall: 0.7976 - F1: 0.7665
- book - Acc: 0.7984 - Recall: 0.6688 - F1: 0.7279
- game - Acc: 0.7814 - Recall: 0.8237 - F1: 0.802
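The Acc column here appears to be entity-level precision (the repository's SeqEntityScore reports precision under that name), and each F1 is the harmonic mean of that precision and the recall. The listed figures check out:

```python
# Per-class F1 recomputed from the Acc (precision) and Recall columns above.
scores = {"name": (0.7734, 0.7634), "address": (0.542, 0.5013), "game": (0.7814, 0.8237)}
for label, (p, r) in scores.items():
    print(label, round(2 * p * r / (p + r), 4))
# name 0.7684, address 0.5209, game 0.802 -- matching the list above
```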
Taking the first six samples of the test set (ids 0-5) as an example:
```json
{"id": 0, "text": "四川敦煌學”。近年來,丹棱縣等地一些不知名的石窟迎來了海內外的游客,他們隨身攜帶著胡文和的著作。"}
{"id": 1, "text": "尼日利亞海軍發言人當天在阿布賈向尼日利亞通訊社證實了這一消息。"}
{"id": 2, "text": "銷售冠軍:輻射3-Bethesda"}
{"id": 3, "text": "所以大多數人都是從巴厘島南部開始環島之旅。"}
{"id": 4, "text": "備受矚目的動作及冒險類大作《迷失》在其英文版上市之初就受到了全球玩家的大力追捧。"}
{"id": 5, "text": "filippagowski:14歲時我感覺自己像梵高"}
```
The entities extracted by the model:
```json
{"id": 0, "label": {"address": {"四川敦煌": [[0, 3]], "丹棱縣": [[11, 13]]}, "name": {"胡文和": [[41, 43]]}}}
{"id": 1, "label": {"government": {"尼日利亞海軍": [[0, 5]]}, "position": {"發言人": [[6, 8]]}, "organization": {"阿布賈": [[12, 14]]}, "company": {"尼日利亞通訊社": [[16, 22]]}}}
{"id": 2, "label": {}}
{"id": 3, "label": {"scene": {"巴厘島": [[9, 11]]}}}
{"id": 4, "label": {"game": {"《迷失》": [[13, 16]]}}}
{"id": 5, "label": {"name": {"filippagowski": [[0, 12]], "梵高": [[24, 25]]}}}
```
And the full per-character tag sequence for each sentence:
```json
{"id": 0, "tag_seq": "B-address I-address I-address I-address O O O O O O O B-address I-address I-address O O O O O O O O O O O O O O O O O O O O O O O O O O O B-name I-name I-name O O O O", "entities": [["address", 0, 3], ["address", 11, 13], ["name", 41, 43]]}
{"id": 1, "tag_seq": "B-government I-government I-government I-government I-government I-government B-position I-position I-position O O O B-organization I-organization I-organization O B-company I-company I-company I-company I-company I-company I-company O O O O O O O O", "entities": [["government", 0, 5], ["position", 6, 8], ["organization", 12, 14], ["company", 16, 22]]}
{"id": 2, "tag_seq": "O O O O O O O O O O O O O O O O O", "entities": []}
{"id": 3, "tag_seq": "O O O O O O O O O B-scene I-scene I-scene O O O O O O O O O", "entities": [["scene", 9, 11]]}
{"id": 4, "tag_seq": "O O O O O O O O O O O O O B-game I-game I-game I-game O O O O O O O O O O O O O O O O O O O O O O O", "entities": [["game", 13, 16]]}
{"id": 5, "tag_seq": "B-name I-name I-name I-name I-name I-name I-name I-name I-name I-name I-name I-name I-name O O O O O O O O O O O B-name I-name", "entities": [["name", 0, 12], ["name", 24, 25]]}
```
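Mapping a tag sequence like these back to entity triples is mechanical. Below is a simplified sketch of that decoding step (the repository's utils_ner.get_entities does this, additionally handling the S- single-character tags of the BIOS scheme):

```python
def get_entities_bio(tag_seq):
    """Collapse a BIO tag sequence into [type, start, end] triples (end inclusive)."""
    entities, current = [], None
    for i, tag in enumerate(tag_seq):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = [tag[2:], i, i]
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[2] = i
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities


tags = (["B-address", "I-address", "I-address", "I-address"] + ["O"] * 7
        + ["B-address", "I-address", "I-address"])
print(get_entities_bio(tags))  # [['address', 0, 3], ['address', 11, 13]]
```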
As these examples show, the recognition quality is reasonably good!
References
- CLUENER2020: 中文细粒度命名实体识别数据集来了 (CLUENER2020: a fine-grained Chinese NER dataset, announcement article)
- https://github.com/CLUEbenchmark/CLUENER2020