scispaCy

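This repository contains custom pipes and models related to using spaCy for scientific documents.

In particular, there is a custom tokenizer that adds tokenization rules on top of spaCy's rule-based tokenizer, a POS tagger and syntactic parser trained on biomedical data, and an entity span detection model. Separately, there are also NER models for more specific tasks.

Just looking to test out the models on your data? Check out the demo (note: the demo runs an older version of scispaCy and may produce different results than the latest version).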

Installation

Installing scispaCy requires two steps: installing the library and installing the models. To install the library, run:

pip install scispacy

To install a model (see the full selection of available models below), run a command like the following:

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_sm-0.5.4.tar.gz

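Note: We strongly recommend installing scispaCy in an isolated Python environment (such as virtualenv or conda). If you need help with this, see the "Setting up a virtual environment" section below. Additionally, scispaCy uses modern features of Python and as such is only available for Python 3.6 or greater.

Installation note: nmslib

Over the years, installing nmslib has become quite difficult. There are a number of GitHub issues about this on scispaCy and on the nmslib repo itself. This matrix is an attempt to help users install nmslib in whatever environment they have. I don't have access to every type of environment, so if you are able to test something out, please open an issue or pull request!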

	Windows 11	Windows Subsystem for Linux	Mac M1	Mac M2	Mac M3	Mac M4	Intel Mac
Python 3.8	✅	✅		❓	❓	？	❓
Python 3.9	？	✅		？	❓	？	❓
Python 3.10	？	✅	❓	❓	❓	？	✅
Python 3.11	？	？	❓	❓	❓	？
Python 3.12	？	??	❓	❓	❓	？	❓

✅ = works normally with a pip install of scispacy

= does not work normally with a pip install of scispacy

？ = can be installed with mamba install nmslib

= CFLAGS="-mavx -DWARN(a)=(a)" pip install nmslib

？ = can be installed with pip install nmslib-metabrainz

？ = can be installed with conda install -c conda-forge nmslib

❓ = unconfirmed

Other methods mentioned in GitHub issues, though it is unconfirmed which versions they work for:

CFLAGS="-mavx -DWARN(a)=(a)" pip install nmslib
pip install --no-binary :all: nmslib
pip install "nmslib @ git+https://github.com/nmslib/nmslib.git/#subdirectory=python_bindings"
pip install --upgrade pybind11 + pip install --verbose 'nmslib @ git+https://github.com/nmslib/nmslib.git#egg=nmslib&subdirectory=python_bindings'
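
Before trying one of these workarounds, it can help to check whether nmslib is already importable in the active environment. The following is a minimal sketch that uses only the Python standard library; the helper name `is_installed` is hypothetical and is not part of scispaCy or nmslib.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be located in the current environment."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Report on the two packages discussed in this section.
    for name in ("nmslib", "scispacy"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

If nmslib is reported as missing, pick the install command from the matrix that matches your platform and Python version.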

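Setting up a virtual environment

Mamba can be used to set up a virtual environment with the version of Python required for scispaCy. If you already have a Python environment you want to use, you can skip this section and go straight to the pip installation steps above.

Follow the installation instructions for Mamba.
Create a Conda environment called "scispacy" with Python 3.10 (any version >= 3.6 should work):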
```
mamba create -n scispacy python=3.10
```
Activate the Mamba environment. You will need to activate the Conda environment in each terminal in which you want to use scispaCy.
```
mamba activate scispacy
```

Now you can install scispacy and one of the models using the steps above.

Once you have completed the above steps and downloaded one of the models below, you can load a scispaCy model as you would any other spaCy model. For example:

import spacy
nlp = spacy.load("en_core_sci_sm")
doc = nlp("Alterations in the hypocretin receptor 2 and preprohypocretin genes produce narcolepsy in some animals.")

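Note on upgrading

If you are upgrading scispacy, you will need to download the models again to get the model versions compatible with the version of scispacy that you have. The link to the model that you download should contain the version number of scispacy that you have.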
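
As a rough illustration of keeping the model link in step with the library version, the sketch below builds a download URL from a model name and a version string. The URL pattern is an assumption inferred from the single example link in the Installation section; other releases may not follow it. The helper names `model_url` and `url_matches_version` are hypothetical.

```python
# Assumption: release URLs follow the pattern of the v0.5.4 example link.
BASE_URL = "https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases"


def model_url(model_name: str, version: str) -> str:
    """Build a download URL for `model_name` at a given scispacy version."""
    return f"{BASE_URL}/v{version}/{model_name}-{version}.tar.gz"


def url_matches_version(url: str, version: str) -> bool:
    """Check that a model URL carries the expected scispacy version number."""
    return f"/v{version}/" in url and url.endswith(f"-{version}.tar.gz")


if __name__ == "__main__":
    url = model_url("en_core_sci_sm", "0.5.4")
    print(url)
    print(url_matches_version(url, "0.5.4"))
```

The resulting URL can be passed to pip install in the same way as the example command above.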

Available models

To install a model, click on the link below to download the model, and then run

pip install </path/to/download>

Alternatively, you can install directly from the URL by right-clicking on the link, selecting "Copy Link Address" and running

pip install CMD-V (to paste the copied URL)

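Model	Description	Install URL
en_core_sci_sm	A full spaCy pipeline for biomedical data with a ~100k vocabulary.	Download
en_core_sci_md	A full spaCy pipeline for biomedical data with a ~360k vocabulary and 50k word vectors.	Download
en_core_sci_lg	A full spaCy pipeline for biomedical data with a ~785k vocabulary and 600k word vectors.	Download
en_core_sci_scibert	A full spaCy pipeline for biomedical data with a ~785k vocabulary and `allenai/scibert-base` as the transformer model. You may want to use a GPU with this model.	Download
en_ner_craft_md	A spaCy NER model trained on the CRAFT corpus.	Download
en_ner_jnlpba_md	A spaCy NER model trained on the JNLPBA corpus.	Download
en_ner_bc5cdr_md	A spaCy NER model trained on the BC5CDR corpus.	Download
en_ner_bionlp13cg_md	A spaCy NER model trained on the BIONLP13CG corpus.	Download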

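Additional pipeline components

AbbreviationDetector

The AbbreviationDetector is a spaCy component which implements the abbreviation detection algorithm in "A simple algorithm for identifying abbreviation definitions in biomedical text" (Schwartz & Hearst, 2003).

You can access the list of abbreviations via the doc._.abbreviations attribute, and for a given abbreviation you can access its long form (which is a spacy.tokens.Span) using span._.long_form.

Example usage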

import spacy

from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_core_sci_sm")

# Add the abbreviation pipe to the spacy pipeline.
nlp.add_pipe("abbreviation_detector")

doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an \
           inherited motor neuron disease caused by the expansion \
           of a polyglutamine tract within the androgen receptor (AR). \
           SBMA can be caused by this easily.")

# Print each abbreviation with its token span and its long form.
print("Abbreviation", "\t", "Span", "\t", "Definition")
for abrv in doc._.abbreviations:
    print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

>>> Abbreviation    Span        Definition
>>> SBMA            (33, 34)    Spinal and bulbar muscular atrophy
>>> SBMA            (6, 7)      Spinal and bulbar muscular atrophy
>>> AR              (29, 30)    androgen receptor

Note: if you want to be able to serialize your doc objects, load the abbreviation detector with make_serializable=True, e.g. nlp.add_pipe("abbreviation_detector", config={"make_serializable": True})
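
To give a feel for the kind of matching the Schwartz & Hearst algorithm relies on, here is a simplified sketch of its core step: walking the short form from right to left and looking for each character in the candidate text, with the first character required to start a word. This is an illustrative port of that one idea in plain Python, not scispaCy's implementation; it omits candidate selection and the validity checks, and the function name `find_best_long_form` is an assumption.

```python
from typing import Optional


def find_best_long_form(short_form: str, candidate: str) -> Optional[str]:
    """Return the shortest suffix of `candidate` that spells out `short_form`.

    Characters of the short form are matched right to left. The first
    character of the short form must match at the start of a word.
    Returns None when no match is possible.
    """
    l_index = len(candidate) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        ch = short_form[s_index].lower()
        if not ch.isalnum():
            continue  # Skip punctuation inside the short form.
        # Move left until the character matches; for the first character of
        # the short form, also require that it begins a word.
        while (l_index >= 0 and candidate[l_index].lower() != ch) or (
            s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    # Start the long form at the beginning of the word that was matched last.
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


if __name__ == "__main__":
    print(find_best_long_form("AR", "within the androgen receptor"))
```

In the pipeline component, the equivalent result is exposed as a Span through span._.long_form rather than as a plain string.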

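EntityLinker

The EntityLinker is a spaCy component which performs linking to a knowledge base. The linker simply performs a string overlap-based search (char-3grams) on named entities, comparing them with the concepts in a knowledge base using an approximate nearest neighbours search.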
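
The idea of character 3-gram overlap can be sketched in a few lines of plain Python. The example below is a deliberately simplified stand-in: it scores strings with Jaccard overlap of their character 3-grams and ranks candidates by brute force, whereas the linker described above uses an approximate nearest neighbours search. The function names and the tiny candidate list are assumptions made for illustration.

```python
from typing import List, Set, Tuple


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of lowercase character n-grams, padded with spaces."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def overlap_score(a: str, b: str) -> float:
    """Jaccard overlap between the character 3-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def rank_candidates(mention: str, names: List[str], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k names most similar to the mention, best first."""
    scored = [(name, overlap_score(mention, name)) for name in names]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    names = ["Skeletal muscle atrophy", "Androgen receptor", "Muscular atrophy"]
    for name, score in rank_candidates("muscular atrophy", names):
        print(f"{score:.2f}  {name}")
```

Because the comparison is purely string based, two names for the same concept that share few characters will score poorly, which is one reason to tune the linker parameters for your use case.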

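Currently (v2.5.0), there are 5 supported linkers:

umls: Links to the Unified Medical Language System, levels 0, 1, 2 and 9. This has ~3M concepts.
mesh: Links to the Medical Subject Headings. This contains a smaller set of higher quality entities, which are used for indexing in PubMed. MeSH contains ~30k entities. NOTE: The MeSH KB is derived directly from MeSH itself, and as such uses different unique identifiers than the other KBs.
rxnorm: Links to the RxNorm ontology. RxNorm contains ~100k concepts focused on normalized names for clinical drugs. It comprises several other drug vocabularies commonly used in pharmacy management and drug interaction, including First Databank, Micromedex, and the Gold Standard Drug Database.
go: Links to the Gene Ontology. The Gene Ontology contains ~67k concepts focused on the functions of genes.
hpo: Links to the Human Phenotype Ontology. The Human Phenotype Ontology contains 16k concepts focused on phenotypic abnormalities encountered in human disease.

You may want to experiment with some of the parameters below to adapt the linker to your use case (higher precision, higher recall, etc.).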

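resolve_abbreviations : bool, optional (default = False) Whether to resolve abbreviations identified in the Doc before performing linking. This parameter has no effect if there is no AbbreviationDetector in the spaCy pipeline.
k : int, optional (default = 30) The number of nearest neighbours to look up from the candidate generator per mention.
threshold : float, optional (default = 0.7) The threshold that a mention candidate must reach to be added to the mention in the Doc as a mention candidate.
no_definition_threshold : float, optional (default = 0.95) The threshold that an entity candidate must reach to be added to the mention in the Doc as a mention candidate if the entity candidate does not have a definition.
filter_for_definitions : bool, optional (default = True) Whether to filter the entities that can be returned to only include those with definitions in the knowledge base.
max_entities_per_mention : int, optional (default = 5) The maximum number of entities which will be returned for a given mention, regardless of how many nearest neighbours are found.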
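
The sketch below shows one plausible way these thresholds could interact when deciding which candidates to keep for a mention. It is a reading of the parameter descriptions above written in plain Python, not scispaCy's actual code; the `Candidate` type, the `select_entities` function, and the concept ids are hypothetical placeholders.

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    concept_id: str       # Placeholder id, not a real KB identifier.
    score: float          # Similarity between the mention and the concept.
    has_definition: bool  # Whether the KB entry has a definition.


def select_entities(
    candidates: List[Candidate],
    threshold: float = 0.7,
    no_definition_threshold: float = 0.95,
    filter_for_definitions: bool = True,
    max_entities_per_mention: int = 5,
) -> List[Tuple[str, float]]:
    """Filter and rank candidates for a single mention."""
    kept = []
    for cand in candidates:
        # Candidates without a definition must clear the stricter threshold.
        if (
            filter_for_definitions
            and not cand.has_definition
            and cand.score < no_definition_threshold
        ):
            continue
        if cand.score > threshold:
            kept.append((cand.concept_id, cand.score))
    # Best score first, then truncate to the per-mention limit.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_entities_per_mention]


if __name__ == "__main__":
    example = [
        Candidate("ID-1", 0.90, True),
        Candidate("ID-2", 0.90, False),
        Candidate("ID-3", 0.97, False),
        Candidate("ID-4", 0.50, True),
    ]
    print(select_entities(example))
```

Under this reading, raising threshold trades recall for precision, while setting filter_for_definitions to False lets through candidates that lack a definition as long as they clear threshold.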

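This class sets the ._.kb_ents attribute on spaCy Spans. The attribute is a List[Tuple[str, float]] of KB concept_ids and their associated scores, containing at most max_entities_per_mention entities.

You can look up more information for a given id using the kb attribute of this class: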

 print(linker.kb.cui_to_entity[concept_id])

Example usage

import spacy
import scispacy

from scispacy.linking import EntityLinker

nlp = spacy.load("en_core_sci_sm")

# This line takes a while, because we have to download ~1GB of data
# and load a large JSON file (the knowledge base). Be patient!
# Thankfully it should be faster after the first time you use it, because
# the downloads are cached.
# NOTE: The resolve_abbreviations parameter is optional, and requires that
# the AbbreviationDetector pipe has already been added to the pipeline. Adding
# the AbbreviationDetector pipe and setting resolve_abbreviations to True means
# that linking will only be performed on the long form of abbreviations.
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})

doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an \
           inherited motor neuron disease caused by the expansion \
           of a polyglutamine tract within the androgen receptor (AR). \
           SBMA can be caused by this easily.")

# Let's look at a random entity!
entity = doc.ents[1]

print("Name: ", entity)
>>> Name: bulbar muscular atrophy

# Each entity is linked to UMLS with a score
# (currently just char-3gram matching).
linker = nlp.get_pipe("scispacy_linker")
for umls_ent in entity._.kb_ents:
    print(linker.kb.cui_to_entity[umls_ent[0]])


>>> CUI: C1839259, Name: Bulbo-Spinal Atrophy, X-Linked
>>> Definition: An X-linked recessive form of spinal muscular atrophy. It is due to a mutation of the
                gene encoding the ANDROGEN RECEPTOR.
>>> TUI(s): T047
>>> Aliases (abbreviated, total: 50):
         Bulbo-Spinal Atrophy, X-Linked, Bulbo-Spinal Atrophy, X-Linked, ....

>>> CUI: C0541794, Name: Skeletal muscle atrophy
>>> Definition: A process, occurring in skeletal muscle, that is characterized by a decrease in protein content,
                fiber diameter, force production and fatigue resistance in response to ...
>>> TUI(s): T046
>>> Aliases: (total: 9):
         Skeletal muscle atrophy, ATROPHY SKELETAL MUSCLE, skeletal muscle atrophy, ....

>>> CUI: C1447749, Name: AR protein, human
>>> Definition: Androgen receptor (919 aa, ~99 kDa) is encoded by the human AR gene.
                This protein plays a role in the modulation of steroid-dependent gene transcription.
>>> TUI(s): T116, T192
>>> Aliases (abbreviated, total: 16):
         AR protein, human, Androgen Receptor, Dihydrotestosterone Receptor, AR, DHTR, NR3C4, ...

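Hearst Patterns (v0.3.0 and up)

This component implements Automatic Acquisition of Hyponyms from Large Text Corpora using the spaCy Matcher component.

Passing extended=True to the HyponymDetector will use the extended set of Hearst patterns, which include higher recall but lower precision hyponymy relations (e.g. X compared to Y, X similar to Y, etc.).

This component produces a doc-level attribute on the spaCy doc: doc._.hearst_patterns, which is a list of tuples of extracted hyponym pairs. The tuples contain:

The relation rule used to extract the hyponym (type: str)
The more general concept (type: spacy.Span)
The more specific concept (type: spacy.Span)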
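
Once extracted, tuples of this shape are easy to regroup by their general concept. The sketch below is a small illustration in plain Python, with strings standing in for spaCy Spans so that it runs without a model; the helper name `group_hyponyms` and the second sample tuple are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_hyponyms(patterns: Iterable[Tuple[str, object, object]]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each general concept to its (specific concept, rule) pairs."""
    grouped = defaultdict(list)
    for rule, general, specific in patterns:
        # str() also works on spaCy Spans, returning their text.
        grouped[str(general)].append((str(specific), rule))
    return dict(grouped)


if __name__ == "__main__":
    # Plain strings stand in for the Span objects the component produces.
    patterns = [
        ("such_as", "Keystone plant species", "fig trees"),
        ("such_as", "Keystone plant species", "oak trees"),  # made-up sample
    ]
    print(group_hyponyms(patterns))
```

With a real pipeline, the same function could be called on doc._.hearst_patterns directly.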

Usage:

import spacy
from scispacy.hyponym_detector import HyponymDetector

nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("hyponym_detector", last=True, config={"extended": False})

doc = nlp("Keystone plant species such as fig trees are good for the soil.")

print(doc._.hearst_patterns)
>>> [('such_as', Keystone plant species, fig trees)]

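Citing

If you use scispaCy in your research, please cite ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. Additionally, please indicate which version and model of scispaCy you used so that your research can be reproduced.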

 @inproceedings{neumann-etal-2019-scispacy,
    title = "{S}cispa{C}y: {F}ast and {R}obust {M}odels for {B}iomedical {N}atural {L}anguage {P}rocessing",
    author = "Neumann, Mark  and
      King, Daniel  and
      Beltagy, Iz  and
      Ammar, Waleed",
    booktitle = "Proceedings of the 18th BioNLP Workshop and Shared Task",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-5034",
    doi = "10.18653/v1/W19-5034",
    pages = "319--327",
    eprint = {arXiv:1902.07669},
    abstract = "Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust, practical, publicly available models. This paper describes scispaCy, a new Python library and models for practical biomedical/scientific text processing, which heavily leverages the spaCy library. We detail the performance of two packages of models released in scispaCy and demonstrate their robustness on several tasks and datasets. Models and code are available at https://allenai.github.io/scispacy/.",
}
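
To report the version and model alongside a citation, the installed package versions can be read from package metadata. The following is a minimal sketch that assumes Python 3.8 or later (for importlib.metadata) and assumes the model was installed with pip under its model name; the helper names are hypothetical.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_report(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Collect installed versions for several packages at once."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    # Adjust the model name to the one actually used in your experiments.
    for name, version in version_report(["scispacy", "spacy", "en_core_sci_sm"]).items():
        print(f"{name}: {version or 'not installed'}")
```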

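scispaCy is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.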
