GPTNERMED

About

GPTNERMED is a novel open synthesized dataset and neural named-entity-recognition (NER) model for German texts in medical natural language processing (NLP).

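Key features:

Supported labels: Medikation, Dosis, Diagnose
Open silver-standard German medical dataset: 245107 tokens with annotations for Dosis (#7547), Medikation (#9868) and Diagnose (#5996)
Synthesized dataset based on GPT NeoX
Transfer learning for NER parsing using gbert-large, GottBERT-base or German-MedBERT
Open, public access to the models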

Online demo: A demo page is available: Demo, or use the Hugging Face links given below.

See the published paper at https://doi.org/10.1016/j.jbi.2023.104478.

The preprint is available at https://arxiv.org/pdf/2208.14493.pdf.

[Figure: NER demonstration]

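Models

The pretrained models can be retrieved from the following URLs:

GBERT-based: model link
GottBERT-based: model link
German-MedBERT-based: model link

The models are also available on the Hugging Face platform:

GBERT-based: Hugging Face link
GottBERT-based: Hugging Face link
German-MedBERT-based: Hugging Face link

Hugging Face dataset: The dataset is also available as a Hugging Face dataset.
You can load the dataset as follows: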

# You need to install datasets first, using: pip install datasets
from datasets import load_dataset
dataset = load_dataset("jfrei/GPTNERMED")

Scores

Note: The metric scores are evaluated by character-wise classification.
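To illustrate what character-wise evaluation can look like, here is a minimal sketch. It assumes that gold and predicted entities are given as (start, end, label) character spans and that every character is scored individually. The span format, the function names and the example sentence are illustrative assumptions and are not taken from the project's evaluation code.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label), with end exclusive.
Span = Tuple[int, int, str]


def char_labels(text_len: int, spans: List[Span]) -> List[str]:
    """Assign a label to every character; "O" marks characters outside any span."""
    labels = ["O"] * text_len
    for start, end, label in spans:
        for i in range(max(start, 0), min(end, text_len)):
            labels[i] = label
    return labels


def char_prf(text_len: int, gold: List[Span], pred: List[Span], label: str):
    """Character-wise precision, recall and F1 for one label."""
    g = char_labels(text_len, gold)
    p = char_labels(text_len, pred)
    tp = sum(1 for a, b in zip(g, p) if a == label and b == label)
    fp = sum(1 for a, b in zip(g, p) if a != label and b == label)
    fn = sum(1 for a, b in zip(g, p) if a == label and b != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up example sentence: the prediction finds "100" but misses " mg".
text = "Der Patient bekam Aspirin 100 mg."
gold = [(18, 25, "Medikation"), (26, 32, "Dosis")]
pred = [(18, 25, "Medikation"), (26, 29, "Dosis")]

for name in ("Medikation", "Dosis"):
    pr, rec, f1 = char_prf(len(text), gold, pred, name)
    print(f"{name}: Pr={pr:.3f} Re={rec:.3f} F1={f1:.3f}")
```

In this scheme a partially matched entity still earns credit for the characters it covers, whereas a strict entity-level evaluation would count it as a miss.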

Out-of-distribution dataset (provided in OoD-dataset_GoldStandard.jsonl):

Model	Metric	Drug=Medikation
gbert-large	Pr	0.707
	Re	0.979
	F1	0.821
GottBERT-base	Pr	0.800
	Re	0.899
	F1	0.847
German-MedBERT	Pr	0.727
	Re	0.818
	F1	0.770
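The F1 rows follow from the Pr (precision) and Re (recall) rows as their harmonic mean. The short sketch below recomputes F1 for the out-of-distribution scores listed above; the helper name `f1_score` is chosen here for illustration only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (Pr, Re) pairs copied from the out-of-distribution table.
ood_scores = {
    "gbert-large": (0.707, 0.979),
    "GottBERT-base": (0.800, 0.899),
    "German-MedBERT": (0.727, 0.818),
}

for model, (pr, rec) in ood_scores.items():
    print(f"{model}: F1 = {f1_score(pr, rec):.3f}")
```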

Test set:

Model	Metric	Medikation	Diagnose	Dosis	Total
gbert-large	Pr	0.870	0.870	0.883	0.918
	Re	0.936	0.895	0.921	0.919
	F1	0.949	0.882	0.901	0.918
GottBERT-base	Pr	0.979	0.896	0.887	0.936
	Re	0.910	0.844	0.907	0.886
	F1	0.943	0.870	0.897	0.910
German-MedBERT	Pr	0.980	0.910	0.829	0.932
	Re	0.905	0.730	0.890	0.842
	F1	0.941	0.810	0.858	0.883

Setup and Usage

The models are based on spaCy. The sample code is written in Python.

model_link="https://myweb.rz.uni-augsburg.de/~freijoha/GPTNERMED/GPTNERMED_gbert.zip"

# [Optional] Create env
python3 -m venv env
source ./env/bin/activate

# Install dependencies
python3 -m pip install -r requirements.txt

# Download & extract model
wget -O model.zip "$model_link"
unzip model.zip -d "model"

# Run script
python3 GPTNERMED.py

Citation

Cite our work with BibTeX as written below, or use the citation tools of the paper.

@article{FREI2023104478,
title = {Annotated dataset creation through large language models for non-english medical NLP},
journal = {Journal of Biomedical Informatics},
volume = {145},
pages = {104478},
year = {2023},
issn = {1532-0464},
doi = {https://doi.org/10.1016/j.jbi.2023.104478},
url = {https://www.sciencedirect.com/science/article/pii/S1532046423001995},
author = {Johann Frei and Frank Kramer},
keywords = {Natural language processing, Information extraction, Named entity recognition, Data augmentation, Knowledge distillation, Medication detection},
abstract = {Obtaining text datasets with semantic annotations is an effortful process, yet crucial for supervised training in natural language processing (NLP). In general, developing and applying new NLP pipelines in domain-specific contexts for tasks often requires custom-designed datasets to address NLP tasks in a supervised machine learning fashion. When operating in non-English languages for medical data processing, this exposes several minor and major, interconnected problems such as the lack of task-matching datasets as well as task-specific pre-trained models. In our work, we suggest to leverage pre-trained large language models for training data acquisition in order to retrieve sufficiently large datasets for training smaller and more efficient models for use-case-specific tasks. To demonstrate the effectiveness of your approach, we create a custom dataset that we use to train a medical NER model for German texts, GPTNERMED, yet our method remains language-independent in principle. Our obtained dataset as well as our pre-trained models are publicly available at https://github.com/frankkramer-lab/GPTNERMED.}
}
