This is a model library customized and modified from Meelfy's pytorch_pretrained_BERT library.
The project was written mainly for the convenience of my own experiments, so it will not be updated frequently.
Install:
pip install torchKbert
For typical usage examples, please refer to the official examples directory.
If you want to use hierarchical decomposition position encoding so that BERT can process long text, just pass the parameter is_hierarchical=True when calling the model. For example:
from torchKbert.modeling import BertModel  # import path assumed to mirror pytorch_pretrained_BERT
model = BertModel(config)
# is_hierarchical=True enables hierarchical decomposition position encoding
encoder_outputs, _ = model(input_ids, token_ids, input_mask, is_hierarchical=True)
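For intuition, hierarchical decomposition reuses the 512 trained position embeddings u_1, ..., u_512 to cover longer sequences: position n = 512*i + j is encoded as alpha*u_i + (1-alpha)*u_j (Su Jianlin's formulation, typically alpha = 0.4). Below is a minimal PyTorch sketch of that idea; it is only an illustration under those assumptions, not necessarily this library's internal implementation:

import torch

def hierarchical_position_embeddings(position_embeddings, seq_len, alpha=0.4):
    # position_embeddings: the (512, hidden_size) matrix trained with the original BERT
    base = (position_embeddings - alpha * position_embeddings[:1]) / (1 - alpha)
    position_ids = torch.arange(seq_len)
    embeddings_x = base[position_ids // base.size(0)]  # high-order part of the position index
    embeddings_y = base[position_ids % base.size(0)]   # low-order part of the position index
    return alpha * embeddings_x + (1 - alpha) * embeddings_y  # shape (seq_len, hidden_size)

For positions below 512 this reduces to the original embedding u_n, so short inputs behave exactly as before.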
If you want to use the word-granularity Chinese WoBERT, just pass a new parameter when constructing the BertTokenizer object:
import jieba
from torchKbert.tokenization import BertTokenizer

tokenizer = BertTokenizer(
    vocab_file=vocab_path,
    pre_tokenizer=lambda s: jieba.cut(s, HMM=False))
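With a pre_tokenizer set, an ordinary call (text here stands for whatever string you are tokenizing) returns word-level tokens:

tokens = tokenizer.tokenize(text)  # word-level tokenization by default when pre_tokenizer is set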
If pre_tokenizer is not passed, it defaults to None. When it is set, tokenization is done at word level by default; if you want to switch back to character-level units, just pass the new parameter pre_tokenize=False when calling tokenize:
tokenizer.tokenize(text, pre_tokenize=False)
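For intuition, the word-granularity behaviour works roughly as follows: the pre_tokenizer first splits the text into words, words found in the WoBERT vocabulary are kept as single tokens, and everything else falls back to the usual finer-grained tokenization. The sketch below only illustrates that WoBERT-style idea; the helper names and the exact fallback are assumptions, not this library's internal code:

import jieba

def word_granularity_tokenize(text, vocab, fallback_tokenize, pre_tokenize=True):
    # Illustrative WoBERT-style sketch (assumption), not torchKbert's actual implementation.
    if not pre_tokenize:
        return fallback_tokenize(text)              # plain character/WordPiece tokenization
    tokens = []
    for word in jieba.cut(text, HMM=False):         # pre-split the text into words
        if word in vocab:
            tokens.append(word)                     # keep whole words that are in the vocab
        else:
            tokens.extend(fallback_tokenize(word))  # fall back for out-of-vocabulary words
    return tokens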
I had been using Meelfy's pytorch_pretrained_BERT for a while, and it is very convenient for loading pretrained models and fine-tuning. Later, for my own needs, I wanted to rewrite a version that supports hierarchical decomposition position encoding.
Sushen's bert4keras already implements this feature, but since I am used to PyTorch and have not used Keras for a long time, I decided to rewrite one in PyTorch myself.