CKIP Transformers
This project provides traditional Chinese transformers models (including ALBERT, BERT, GPT2) and NLP tools (including word segmentation, part-of-speech tagging, named entity recognition).
- Git: https://github.com/ckiplab/ckip-transformers
- PyPI: https://pypi.org/project/ckip-transformers
- Documentation: https://ckip-transformers.readthedocs.io
- Demo: https://ckip.iis.sinica.edu.tw/service/transformers
Contributors
- Mu Yang at CKIP (Author & Maintainer).
- Wei-Yun Ma at CKIP (Maintainer).
Related Packages
- CkipTagger: An alternative Chinese NLP library using BiLSTM models.
- CKIP CoreNLP Toolkit: A Chinese NLP library with more NLP tasks and utilities.
Models
You may also use our pretrained models directly with the HuggingFace transformers library: https://huggingface.co/ckiplab/.
- Language Models
  - ALBERT Tiny: ckiplab/albert-tiny-chinese
  - ALBERT Base: ckiplab/albert-base-chinese
  - BERT Tiny: ckiplab/bert-tiny-chinese
  - BERT Base: ckiplab/bert-base-chinese
  - GPT2 Tiny: ckiplab/gpt2-tiny-chinese
  - GPT2 Base: ckiplab/gpt2-base-chinese
- NLP Task Models
  - ALBERT Tiny — Word Segmentation: ckiplab/albert-tiny-chinese-ws
  - ALBERT Tiny — Part-of-Speech Tagging: ckiplab/albert-tiny-chinese-pos
  - ALBERT Tiny — Named-Entity Recognition: ckiplab/albert-tiny-chinese-ner
  - ALBERT Base — Word Segmentation: ckiplab/albert-base-chinese-ws
  - ALBERT Base — Part-of-Speech Tagging: ckiplab/albert-base-chinese-pos
  - ALBERT Base — Named-Entity Recognition: ckiplab/albert-base-chinese-ner
  - BERT Tiny — Word Segmentation: ckiplab/bert-tiny-chinese-ws
  - BERT Tiny — Part-of-Speech Tagging: ckiplab/bert-tiny-chinese-pos
  - BERT Tiny — Named-Entity Recognition: ckiplab/bert-tiny-chinese-ner
  - BERT Base — Word Segmentation: ckiplab/bert-base-chinese-ws
  - BERT Base — Part-of-Speech Tagging: ckiplab/bert-base-chinese-pos
  - BERT Base — Named-Entity Recognition: ckiplab/bert-base-chinese-ner
Model Usage
You may use our models directly through HuggingFace's transformers library.
pip install -U transformers
Please use BertTokenizerFast as the tokenizer, and replace ckiplab/albert-tiny-chinese and ckiplab/albert-tiny-chinese-ws in the following example with any model you need.
from transformers import (
    BertTokenizerFast,
    AutoModelForMaskedLM,
    AutoModelForCausalLM,
    AutoModelForTokenClassification,
)

# masked language model (ALBERT, BERT)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForMaskedLM.from_pretrained('ckiplab/albert-tiny-chinese')  # or other models above

# causal language model (GPT2)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForCausalLM.from_pretrained('ckiplab/gpt2-base-chinese')  # or other models above

# nlp task model
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForTokenClassification.from_pretrained('ckiplab/albert-tiny-chinese-ws')  # or other models above
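For reference, here is a minimal inference sketch (not part of the original snippet above) showing how a loaded token-classification model can be run by hand. It assumes the standard transformers API and that the model's config carries an id2label mapping; the exact tag set (e.g. B/I word-boundary tags for word segmentation) depends on the model.

import torch
from transformers import BertTokenizerFast, AutoModelForTokenClassification

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForTokenClassification.from_pretrained('ckiplab/albert-tiny-chinese-ws')

# Tokenize one sentence and predict a tag for every (sub)token,
# including the special [CLS]/[SEP] tokens.
inputs = tokenizer("傅達仁今將執行安樂死", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred_ids = logits.argmax(dim=-1)[0].tolist()

# Map label ids back to tag names through the model config.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
tags = [model.config.id2label[i] for i in pred_ids]
print(list(zip(tokens, tags)))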
Model Fine-Tuning
To fine-tune our models on your own datasets, please refer to the following examples from HuggingFace's transformers.
- https://github.com/huggingface/transformers/tree/master/examples
- https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling
- https://github.com/huggingface/transformers/tree/master/examples/pytorch/token-classification
Remember to set --tokenizer_name bert-base-chinese in order to use the Chinese tokenizer.
# or other models above
python run_mlm.py \
    --model_name_or_path ckiplab/albert-tiny-chinese \
    --tokenizer_name bert-base-chinese \
    ...

# or other models above
python run_ner.py \
    --model_name_or_path ckiplab/albert-tiny-chinese-ws \
    --tokenizer_name bert-base-chinese \
    ...
Model Performance
The following is a performance comparison between our models and other models.
The results are tested on a traditional Chinese corpus.
| Model | #Parameters | Perplexity† | WS (F1)‡ | POS (Acc)‡ | NER (F1)‡ |
|---|---|---|---|---|---|
| ckiplab/albert-tiny-chinese | 4M | 4.80 | 96.66% | 94.48% | 71.17% |
| ckiplab/albert-base-chinese | 11M | 2.65 | 97.33% | 95.30% | 79.47% |
| ckiplab/bert-tiny-chinese | 12M | 8.07 | 96.98% | 95.11% | 74.21% |
| ckiplab/bert-base-chinese | 102M | 1.88 | 97.60% | 95.67% | 81.18% |
| ckiplab/gpt2-tiny-chinese | 4M | 16.94 | -- | -- | -- |
| ckiplab/gpt2-base-chinese | 102M | 8.36 | -- | -- | -- |
| | | | | | |
| voidful/albert_chinese_tiny | 4M | 74.93 | -- | -- | -- |
| voidful/albert_chinese_base | 11M | 22.34 | -- | -- | -- |
| bert-base-chinese | 102M | 2.53 | -- | -- | -- |
† Perplexity; the smaller the better.
‡ WS: word segmentation; POS: part-of-speech tagging; NER: named-entity recognition; the larger the better.
Training Corpus
The language models are trained on the ZhWiki and CNA datasets; the WS and POS tasks are trained on the ASBC dataset; the NER tasks are trained on the OntoNotes dataset.
- ZhWiki: https://dumps.wikimedia.org/zhwiki/
Chinese Wikipedia text (20200801 dump), converted to Traditional Chinese using OpenCC.
- CNA: https://catalog.ldc.upenn.edu/LDC2011T13
Chinese Gigaword Fifth Edition — CNA (Central News Agency) part.
- ASBC: http://asbc.iis.sinica.edu.tw
Academia Sinica Balanced Corpus of Modern Chinese release 4.0.
- OntoNotes: https://catalog.ldc.upenn.edu/LDC2013T19
OntoNotes release 5.0, Chinese part, converted to Traditional Chinese using OpenCC.
Here is a summary of each corpus.
| Dataset | #Documents | #Lines | #Characters | Line Type |
|---|---|---|---|---|
| CNA | 2,559,520 | 13,532,445 | 1,219,029,974 | Paragraph |
| ZhWiki | 1,106,783 | 5,918,975 | 495,446,829 | Paragraph |
| ASBC | 19,247 | 1,395,949 | 17,572,374 | Clause |
| OntoNotes | 1,911 | 48,067 | 1,568,491 | Sentence |
Here is the dataset split used for language models.
| CNA+ZhWiki | #Documents | #Lines | #Characters |
|---|---|---|---|
| Train | 3,606,303 | 18,986,238 | 4,347,517,682 |
| Dev | 30,000 | 148,077 | 32,888,978 |
| Test | 30,000 | 151,241 | 35,216,818 |
Here is the dataset split used for word segmentation and part-of-speech tagging models.
| ASBC | #Documents | #Lines | #Words | #Characters |
|---|---|---|---|---|
| Train | 15,247 | 1,183,260 | 9,480,899 | 14,724,250 |
| Dev | 2,000 | 52,677 | 448,964 | 741,323 |
| Test | 2,000 | 160,012 | 1,315,129 | 2,106,799 |
Here is the dataset split used for the named-entity recognition models.
| OntoNotes | #Documents | #Lines | #Characters | #Named-Entities |
|---|---|---|---|---|
| Train | 1,511 | 43,362 | 1,367,658 | 68,947 |
| Dev | 200 | 2,304 | 93,535 | 7,186 |
| Test | 200 | 2,401 | 107,298 | 6,977 |
NLP Tools
The package also provides the following NLP tools.
- (WS) Word Segmentation
- (POS) Part-of-Speech Tagging
- (NER) Named Entity Recognition
Installation
pip install -U ckip-transformers
Requirements:
- Python 3.6+
- PyTorch 1.5+
- HuggingFace Transformers 3.5+
NLP Tools Usage
See https://ckip-transformers.readthedocs.io for API details.
The complete script for this example is available at https://github.com/ckiplab/ckip-transformers/blob/master/example/example.py.
1. Import module
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker
2. Load models
We provide several pretrained models for the NLP tools.
# Initialize drivers
ws_driver = CkipWordSegmenter(model="bert-base")
pos_driver = CkipPosTagger(model="bert-base")
ner_driver = CkipNerChunker(model="bert-base")
You may also load your own checkpoints with our drivers.
# Initialize drivers with custom checkpoints
ws_driver = CkipWordSegmenter(model_name="path_to_your_model")
pos_driver = CkipPosTagger(model_name="path_to_your_model")
ner_driver = CkipNerChunker(model_name="path_to_your_model")
To use a GPU, specify the device ID when initializing the drivers. Set it to -1 (the default) to disable the GPU.
# Use CPU
ws_driver = CkipWordSegmenter(device=-1)

# Use GPU:0
ws_driver = CkipWordSegmenter(device=0)
3. Run pipeline
The input for word segmentation and named-entity recognition must be a list of sentences.
The input for part-of-speech tagging must be a list of lists of words (i.e. the output of word segmentation).
# Input text
text = [
    "傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。",
    "美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會,預料她將會很順利通過參議院支持,成為該國有史以來第一位的華裔女性內閣成員。",
    "空白 也是可以的~",
]

# Run pipeline
ws = ws_driver(text)
pos = pos_driver(ws)
ner = ner_driver(text)
While running the model, the POS driver automatically splits each sentence on the characters ',,。::;;!!??' (the output sentences are concatenated back afterwards). You may set delim_set to any characters you want.
You may set use_delim=False to disable this feature, or set use_delim=True in the WS and NER drivers to enable it.
# Enable sentence segmentation
ws = ws_driver(text, use_delim=True)
ner = ner_driver(text, use_delim=True)

# Disable sentence segmentation
pos = pos_driver(ws, use_delim=False)

# Use new line characters and tabs for sentence segmentation
pos = pos_driver(ws, delim_set='\n\t')
You may specify batch_size and max_length to better utilize your machine's resources.
# Set the batch size and maximum sentence length
ws = ws_driver(text, batch_size=256, max_length=128)
4. Show results
# Pack word segmentation and part-of-speech results
def pack_ws_pos_sentence(sentence_ws, sentence_pos):
    assert len(sentence_ws) == len(sentence_pos)
    res = []
    for word_ws, word_pos in zip(sentence_ws, sentence_pos):
        res.append(f"{word_ws}({word_pos})")
    return "\u3000".join(res)

# Show results
for sentence, sentence_ws, sentence_pos, sentence_ner in zip(text, ws, pos, ner):
    print(sentence)
    print(pack_ws_pos_sentence(sentence_ws, sentence_pos))
    for entity in sentence_ner:
        print(entity)
    print()

傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。
傅達仁(Nb) 今(Nd) 將(D) 執行(VC) 安樂死(Na) ,(COMMACATEGORY) 卻(D) 突然(D) 爆出(VJ) 自己(Nh) 20(Neu) 年(Nd) 前(Ng) 遭(P) 緯來(Nb) 體育台(Na) 封殺(VC) ,(COMMACATEGORY) 他(Nh) 不(D) 懂(VK) 自己(Nh) 哪裡(Ncd) 得罪到(VC) 電視台(Nc) 。(PERIODCATEGORY)
NerToken(word='傅達仁', ner='PERSON', idx=(0, 3))
NerToken(word='20年', ner='DATE', idx=(18, 21))
NerToken(word='緯來體育台', ner='ORG', idx=(23, 28))

美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會,預料她將會很順利通過參議院支持,成為該國有史以來第一位的華裔女性內閣成員。
美國(Nc) 參議院(Nc) 針對(P) 今天(Nd) 總統(Na) 布什(Nb) 所(D) 提名(VC) 的(DE) 勞工部長(Na) 趙小蘭(Nb) 展開(VC) 認可(VC) 聽證會(Na) ,(COMMACATEGORY) 預料(VE) 她(Nh) 將(D) 會(D) 很(Dfa) 順利(VH) 通過(VC) 參議院(Nc) 支持(VC) ,(COMMACATEGORY) 成為(VG) 該(Nes) 國(Nc) 有史以來(D) 第一(Neu) 位(Nf) 的(DE) 華裔(Na) 女性(Na) 內閣(Na) 成員(Na) 。(PERIODCATEGORY)
NerToken(word='美國參議院', ner='ORG', idx=(0, 5))
NerToken(word='今天', ner='LOC', idx=(7, 9))
NerToken(word='布什', ner='PERSON', idx=(11, 13))
NerToken(word='勞工部長', ner='ORG', idx=(17, 21))
NerToken(word='趙小蘭', ner='PERSON', idx=(21, 24))
NerToken(word='認可聽證會', ner='EVENT', idx=(26, 31))
NerToken(word='參議院', ner='ORG', idx=(42, 45))
NerToken(word='第一', ner='ORDINAL', idx=(56, 58))
NerToken(word='華裔', ner='NORP', idx=(60, 62))

空白 也是可以的~
空白(VH) (WHITESPACE) 也(D) 是(SHI) 可以(VH) 的(T) ~(FW)
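Each NerToken shown above carries the entity text (word), its type (ner), and its character span in the input sentence (idx). As a small follow-up sketch (not part of the original example), the fields can be read directly:

# Print the entities of the first sentence with their character spans
for entity in ner[0]:
    start, end = entity.idx
    print(f"{entity.word} [{entity.ner}] covers characters {start} to {end}")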
NLP Tools Performance
The following is a performance comparison between our tools and other tools.
CKIP Transformers vs. Monpa & Jieba
| Tool | WS (F1) | POS (Acc) | WS+POS (F1) | NER (F1) |
|---|---|---|---|---|
| CKIP BERT Base | 97.60% | 95.67% | 94.19% | 81.18% |
| CKIP ALBERT Base | 97.33% | 95.30% | 93.52% | 79.47% |
| CKIP BERT Tiny | 96.98% | 95.08% | 93.13% | 74.20% |
| CKIP ALBERT Tiny | 96.66% | 94.48% | 92.25% | 71.17% |
| | | | | |
| Monpa† | 92.58% | -- | 83.88% | -- |
| Jieba | 81.18% | -- | -- | -- |
† Monpa provides only 3 types of NER tags.
CKIP Transformers vs. CkipTagger
The following results are tested on a different dataset.†
| Tool | WS (F1) | POS (Acc) | WS+POS (F1) | NER (F1) |
|---|---|---|---|---|
| CKIP BERT Base | 97.84% | 96.46% | 94.91% | 79.20% |
| CkipTagger | 97.33% | 97.20% | 94.75% | 77.87% |
† Here we retrained and tested our BERT model on the same datasets as CkipTagger.
License
Copyright (c) 2023 CKIP Lab under the GPL-3.0 License.