`SciBERT`

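SciBERT is a BERT model trained on scientific text.

SciBERT is trained on papers from the semanticscholar.org corpus. The corpus contains 1.14M papers and 3.1B tokens. We use the full text of the papers in training, not just the abstracts.
SciBERT has its own vocabulary (scivocab) that is built to best match the training corpus. We trained cased and uncased versions. We also include models trained on the original BERT vocabulary (basevocab) for comparison.
SciBERT achieves state-of-the-art performance on a wide range of scientific-domain NLP tasks. The details of the evaluation are in the paper. Evaluation code and data are included in this repo.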
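To see how much scivocab and basevocab differ, one can compare their vocabulary files. The following is a minimal sketch, assuming both files use the standard BERT layout (one token per line, with the token id equal to the line index). The function names `load_vocab` and `vocab_overlap` are hypothetical helpers and are not part of this repo.

```python
from pathlib import Path


def load_vocab(path):
    """Read a BERT-style vocab file: one token per line, id = line index."""
    with Path(path).open(encoding="utf-8") as handle:
        return {line.rstrip("\n"): index for index, line in enumerate(handle)}


def vocab_overlap(vocab_a, vocab_b):
    """Return the fraction of tokens in vocab_a that also appear in vocab_b."""
    if not vocab_a:
        return 0.0
    shared = vocab_a.keys() & vocab_b.keys()
    return len(shared) / len(vocab_a)
```

With two downloaded vocabulary files at hand, `vocab_overlap(load_vocab(scivocab_path), load_vocab(basevocab_path))` gives the share of scivocab tokens that basevocab also contains; both path variables here are placeholders for wherever the files were saved.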

Downloading Trained Models

Update! SciBERT models are now installable directly within the Hugging Face framework under the allenai org:

from transformers import AutoModel, AutoTokenizer

# Uncased model with the SciBERT vocabulary
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

# Cased model with the SciBERT vocabulary
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_cased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_cased')
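Once a tokenizer and model are loaded, a common next step is to turn sentences into fixed-size vectors. The sketch below is one possible way to do that, assuming `torch` and `transformers` are installed. Mean pooling over token vectors is a choice made for this illustration, not something the SciBERT authors prescribe, and the function name `embed` is hypothetical.

```python
MODEL_NAME = "allenai/scibert_scivocab_uncased"  # model name from the snippet above


def embed(sentences, model_name=MODEL_NAME):
    """Return one mean-pooled vector per sentence as a tensor of shape [n, hidden]."""
    # Imported lazily so that defining this function does not need the heavy deps.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for deterministic outputs

    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Average the token vectors of each sentence, ignoring padding positions.
    hidden = outputs.last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    summed = (hidden * mask).sum(dim=1)
    return summed / mask.sum(dim=1).clamp(min=1)
```

Calling `embed(["The enzyme catalyzes the reaction."])` downloads the model on first use and returns a tensor with one row per input sentence.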

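We release both TensorFlow and PyTorch versions of the trained models. The TensorFlow version is compatible with code that works with the model from Google Research. The PyTorch version is created using the Hugging Face library, and this repo shows how to use it in AllenNLP. All combinations of scivocab and basevocab, cased and uncased models are available below. Our evaluation shows that scivocab-uncased usually gives the best results.

TensorFlow Models

scibert-scivocab-uncased (recommended)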
scibert-scivocab-cased
scibert-basevocab-uncased
scibert-basevocab-cased

PyTorch AllenNLP Models

scibert-scivocab-uncased (recommended)
scibert-scivocab-cased
scibert-basevocab-uncased
scibert-basevocab-cased

PyTorch Hugging Face Models

scibert-scivocab-uncased (recommended)
scibert-scivocab-cased
scibert-basevocab-uncased
scibert-basevocab-cased

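Using SciBERT in your own model

SciBERT models include all the files needed to plug them into your own model, and they are in the same format as BERT. If you are using TensorFlow, refer to Google's BERT repo. If you are using PyTorch, refer to Hugging Face's repo, where detailed instructions on using BERT models are provided.

Training new models using AllenNLP

To run experiments on different tasks and reproduce the results in our paper, you first need to set up a Python 3.6 environment: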

pip install -r requirements.txt

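This installs dependencies such as AllenNLP.

Use the scibert/scripts/train_allennlp_local.sh script as an example of how to run an experiment (you will need to modify variables such as TASK and DATASET).

We include a broad set of scientific NLP datasets under the data/ directory, covering the following tasks. Each task has a sub-directory of available datasets.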

├── ner
│   ├── JNLPBA
│   ├── NCBI-disease
│   ├── bc5cdr
│   └── sciie
├── parsing
│   └── genia
├── pico
│   └── ebmnlp
└── text_classification
    ├── chemprot
    ├── citation_intent
    ├── mag
    ├── rct-20k
    ├── sci-cite
    └── sciie-relation-extraction
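The layout above pairs each task with its datasets, which are the values the TASK and DATASET variables expect. As a minimal sketch, assuming the repo is checked out locally, the pairs can be listed from the directory tree. The function name `list_datasets` is hypothetical and not part of the repo.

```python
from pathlib import Path


def list_datasets(data_dir="data"):
    """Map each task directory to the sorted names of its dataset sub-directories."""
    root = Path(data_dir)
    return {
        task.name: sorted(d.name for d in task.iterdir() if d.is_dir())
        for task in sorted(root.iterdir())
        if task.is_dir()
    }
```

Run from the repository root, `list_datasets()` returns a dictionary keyed by task name, so a check such as `"bc5cdr" in list_datasets()["ner"]` can confirm a task and dataset pair exists before it is written into the training script.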

For example, to run the model on the Named Entity Recognition (NER) task with the BC5CDR dataset (BioCreative V CDR), modify the scibert/scripts/train_allennlp_local.sh script as follows:

DATASET='bc5cdr'
TASK='ner'
...

Decompress the PyTorch model that you downloaded using:
tar -xvf scibert_scivocab_uncased.tar
The results will be in the scibert_scivocab_uncased directory, which contains two files: a vocabulary file (vocab.txt) and a weights file (weights.tar.gz). Copy the files to your desired location, then set the correct paths for BERT_WEIGHTS and BERT_VOCAB in the script:

export BERT_VOCAB=path-to/scibert_scivocab_uncased.vocab
export BERT_WEIGHTS=path-to/scibert_scivocab_uncased.tar.gz
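The decompression step can also be done from Python, with a check that both expected files are present before their paths go into the script. This is a minimal sketch, assuming the archive unpacks into a directory named after the archive (as described above). The function name `extract_model` is hypothetical.

```python
import tarfile
from pathlib import Path

EXPECTED_FILES = {"BERT_VOCAB": "vocab.txt", "BERT_WEIGHTS": "weights.tar.gz"}


def extract_model(archive_path, dest_dir="."):
    """Unpack the archive (like `tar -xvf`) and return the vocab and weights paths."""
    archive_path = Path(archive_path)
    dest_dir = Path(dest_dir)
    # Only extract archives from a source you trust.
    with tarfile.open(archive_path) as archive:
        archive.extractall(dest_dir)

    # Assumption: the archive contains a directory named after the archive itself.
    model_dir = dest_dir / archive_path.stem
    paths = {name: model_dir / filename for name, filename in EXPECTED_FILES.items()}
    missing = [str(p) for p in paths.values() if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"missing expected files: {missing}")
    return {name: str(path) for name, path in paths.items()}
```

The returned dictionary uses the same names as the two exported variables, so its values can be copied into the script once the files are in their final location.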

Finally, run the script:

./scibert/scripts/train_allennlp_local.sh [serialization-directory]

where [serialization-directory] is the path to an output directory in which the model files will be stored.
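When launching several experiments, it can help to wrap the command so that each run writes to a fresh output directory. The following is a rough sketch under that assumption. The helper names `build_command` and `run_training` are hypothetical, and the check for an empty directory is a precaution added for this illustration, not a documented requirement of the script.

```python
import subprocess
from pathlib import Path

SCRIPT = "./scibert/scripts/train_allennlp_local.sh"  # path from the command above


def build_command(serialization_dir, script=SCRIPT):
    """Return the command list, refusing output directories that already hold files."""
    out = Path(serialization_dir)
    if out.is_dir() and any(out.iterdir()):
        raise FileExistsError(f"{out} already contains files; pick a fresh directory")
    return [script, str(out)]


def run_training(serialization_dir):
    """Run the training script and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(serialization_dir), check=True)
```

For example, `run_training("output/ner_bc5cdr")` would start a run from the repository root; the directory name is only an illustration.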

Citing

If you use SciBERT in your research, please cite SciBERT: Pretrained Language Model for Scientific Text.

@inproceedings{Beltagy2019SciBERT,
  title={SciBERT: Pretrained Language Model for Scientific Text},
  author={Iz Beltagy and Kyle Lo and Arman Cohan},
  year={2019},
  booktitle={EMNLP},
  eprint={arXiv:1903.10676}
}

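SciBERT is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.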

