GPTNERMED

About

GPTNERMED is a novel open synthesized dataset and neural named-entity-recognition (NER) model for German texts in medical natural language processing (NLP).

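Key features:

Supported labels: Medikation, Dosis, Diagnose
Open silver-standard German medical dataset: 245107 tokens with annotations for Dosis (#7547), Medikation (#9868) and Diagnose (#5996)
Synthesized dataset based on GPT NeoX
Transfer learning for NER parsing using GBERT-large, GottBERT-base or German-MedBERT
Open, public access to the models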

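Online demo: A demo page is available: Demo. Alternatively, use the HuggingFace links given below.

See our published paper at https://doi.org/10.1016/j.jbi.2023.104478.

Our pre-print is available at https://arxiv.org/pdf/2208.14493.pdf.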

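Models

The pretrained models can be retrieved from the following URLs:

GBERT-based: model link
GottBERT-based: model link
German-MedBERT-based: model link

The models are also available on the HuggingFace platform:

GBERT-based: HuggingFace link
GottBERT-based: HuggingFace link
German-MedBERT-based: HuggingFace link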

HuggingFace dataset: The dataset is also available as a HuggingFace dataset.
You can load the dataset as follows:

# You need to install datasets first, using: pip install datasets
from datasets import load_dataset
dataset = load_dataset("jfrei/GPTNERMED")

Scores

Note: The metric scores are evaluated by character-wise classification.
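As a rough illustration of what character-wise evaluation can look like, the following sketch expands entity spans into one label per character and derives precision, recall and F1 for a single label. The helper names (`char_labels`, `char_prf`), the example sentence and the spans are hypothetical; the exact procedure used to produce the published scores is not described here and may differ.

```python
def char_labels(text_length, spans):
    """Expand (start, end, label) spans into one label per character ("O" = no entity)."""
    labels = ["O"] * text_length
    for start, end, label in spans:
        for i in range(start, end):
            labels[i] = label
    return labels


def char_prf(gold, pred, label):
    """Character-wise precision, recall and F1 for one label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical example: the prediction finds "100" but misses " mg".
text = "Patient bekommt Aspirin 100 mg pro Tag."
gold = char_labels(len(text), [(16, 23, "Medikation"), (24, 30, "Dosis")])
pred = char_labels(len(text), [(16, 23, "Medikation"), (24, 27, "Dosis")])

for label in ("Medikation", "Dosis"):
    p, r, f = char_prf(gold, pred, label)
    print(f"{label}: Pr={p:.3f} Re={r:.3f} F1={f:.3f}")
```

In this toy case the partially matched Dosis span yields a precision of 1.0 and a recall of 0.5, whereas a strict span-level match would count it as a complete miss.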

Scores on the out-of-distribution dataset (provided in OoD-dataset_GoldStandard.jsonl):

Model	Metric	Drug = Medikation
GBERT-large	Pr	0.707
	Re	0.979
	F1	0.821
GottBERT-base	Pr	0.800
	Re	0.899
	F1	0.847
German-MedBERT	Pr	0.727
	Re	0.818
	F1	0.770
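
The F1 values in the out-of-distribution table agree with the harmonic mean of the listed precision and recall values. A minimal check, assuming the standard F1 definition (the function name `f1_score` and the dictionary are only illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the out-of-distribution table.
ood_scores = {
    "GBERT-large": (0.707, 0.979),
    "GottBERT-base": (0.800, 0.899),
    "German-MedBERT": (0.727, 0.818),
}

for model, (prec, rec) in ood_scores.items():
    print(f"{model}: F1 = {f1_score(prec, rec):.3f}")
```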

Scores on the test set:

Model	Metric	Medikation	Diagnose	Dosis	Total
GBERT-large	Pr	0.870	0.870	0.883	0.918
	Re	0.936	0.895	0.921	0.919
	F1	0.949	0.882	0.901	0.918
GottBERT-base	Pr	0.979	0.896	0.887	0.936
	Re	0.910	0.844	0.907	0.886
	F1	0.943	0.870	0.897	0.910
German-MedBERT	Pr	0.980	0.910	0.829	0.932
	Re	0.905	0.730	0.890	0.842
	F1	0.941	0.810	0.858	0.883

Setup and usage

The models are based on spaCy. The sample code is written in Python.

model_link="https://myweb.rz.uni-augsburg.de/~freijoha/GPTNERMED/GPTNERMED_gbert.zip"

# [Optional] Create env
python3 -m venv env
source ./env/bin/activate

# Install dependencies
python3 -m pip install -r requirements.txt

# Download & extract model
wget -O model.zip "$model_link"
unzip model.zip -d "model"

# Run script
python3 GPTNERMED.py
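
Since the models are spaCy pipelines, an extracted model can presumably also be used directly from Python. The sketch below is not the content of GPTNERMED.py; the helper names `load_pipeline` and `extract_entities` are hypothetical, and the directory passed to `load_pipeline` is an assumption, because the folder layout inside the archive is not described here.

```python
from pathlib import Path


def load_pipeline(model_dir):
    """Load a spaCy pipeline from a local directory (the path is an assumption)."""
    import spacy  # imported lazily so extract_entities has no hard dependency

    return spacy.load(Path(model_dir))


def extract_entities(nlp, text):
    """Run the pipeline on text and return (text, label, start_char, end_char) tuples."""
    doc = nlp(text)
    return [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]
```

With the archive extracted as above, a call such as `extract_entities(load_pipeline("model"), "Patient bekommt Aspirin 100 mg pro Tag.")` would list the recognized entities with their labels and character offsets. If the archive unpacks into a subdirectory, the path needs to point at the folder that contains the pipeline's config.cfg.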

Citation

Cite our work using the BibTeX entry below, or use the citation tools provided with the paper.

@article{FREI2023104478,
title = {Annotated dataset creation through large language models for non-english medical NLP},
journal = {Journal of Biomedical Informatics},
volume = {145},
pages = {104478},
year = {2023},
issn = {1532-0464},
doi = {https://doi.org/10.1016/j.jbi.2023.104478},
url = {https://www.sciencedirect.com/science/article/pii/S1532046423001995},
author = {Johann Frei and Frank Kramer},
keywords = {Natural language processing, Information extraction, Named entity recognition, Data augmentation, Knowledge distillation, Medication detection},
abstract = {Obtaining text datasets with semantic annotations is an effortful process, yet crucial for supervised training in natural language processing (NLP). In general, developing and applying new NLP pipelines in domain-specific contexts for tasks often requires custom-designed datasets to address NLP tasks in a supervised machine learning fashion. When operating in non-English languages for medical data processing, this exposes several minor and major, interconnected problems such as the lack of task-matching datasets as well as task-specific pre-trained models. In our work, we suggest to leverage pre-trained large language models for training data acquisition in order to retrieve sufficiently large datasets for training smaller and more efficient models for use-case-specific tasks. To demonstrate the effectiveness of your approach, we create a custom dataset that we use to train a medical NER model for German texts, GPTNERMED, yet our method remains language-independent in principle. Our obtained dataset as well as our pre-trained models are publicly available at https://github.com/frankkramer-lab/GPTNERMED.}
}
