NLP for humans: a fast and easy-to-use natural language processing (NLP) toolkit that satisfies your imagination of what NLP can do.
→ English version

Basic Introduction • Installation • Getting Started Guide • Detailed Tutorial • Honors • How to Contribute Code • Citation • Acknowledgements
Fancy-NLP is a text knowledge mining toolkit built by Tencent's product advertising strategy team for constructing product profiles. It supports a variety of common NLP tasks such as entity extraction, text classification, and text similarity matching. Compared with frameworks commonly used in industry, it helps users get functionality working quickly: advanced users can deeply customize models, while ordinary users can use pre-trained models to put the same functionality into practice right away. In our current product advertising business, we use this toolkit to quickly mine the characteristics of massive amounts of product data, supporting modules such as advertising product recommendation.
The goal of the project is to provide a set of easy-to-use NLP tools that are oriented directly at usage scenarios and meet users' needs for NLP tasks: users do not have to deal with complex preprocessing or other intermediate steps, and can run multiple NLP tasks on raw natural language text directly. What you think is what you get!
What is Fancy? For many current NLP tasks, such as named entity recognition (NER), text classification, and text similarity matching (sentence pair matching, SPM), most tools are designed primarily for model training and evaluation. When ordinary users want to apply these models to actual business scenarios, they often have to perform complex preprocessing and deployment configuration, and the workflow rarely matches what they expect. Therefore, to satisfy your imagination, Fancy-NLP lets you handle every step of an NLP task with one click and apply models efficiently to real-world scenarios.
Fancy-NLP currently supports Python 3 and has been fully tested under Python 3.6. The current version fully depends on TensorFlow 2.x. If you are concerned about package compatibility, we recommend using virtualenv to create a virtual environment in which to use this toolkit.
Fancy-NLP supports one-click installation using pip:

```
pip install fancy-nlp
```

In the Getting Started Guide, we will use pre-trained models to take you through a quick tour of Fancy-NLP's basic functionality.
Note: We will continue to optimize entity recognition models for multiple scenarios (different annotated data) for users to use directly. If you have relevant datasets, you are also welcome to give us feedback in an issue.
The current version of Fancy-NLP loads, by default, an NER model trained on a subset of the MSRA NER data. It can identify organization (ORG), location (LOC), and person (PER) entities in Chinese text. The default model is provided for users to try out directly; if you want to use your own customized model instead, refer to the detailed tutorial below to build your own entity extraction system.
```python
>>> from fancy_nlp.applications import NER
>>> ner_app = NER()
```

When you run the above code for the first time, the pre-trained NER model will be downloaded from the cloud.
```python
>>> ner_app.analyze('同济大学位于上海市杨浦区,校长为陈杰')
{'text': '同济大学位于上海市杨浦区,校长为陈杰',
 'entities': [
    {'name': '同济大学',
     'type': 'ORG',
     'score': 1.0,
     'beginOffset': 0,
     'endOffset': 4},
    {'name': '上海市',
     'type': 'LOC',
     'score': 1.0,
     'beginOffset': 6,
     'endOffset': 9},
    {'name': '杨浦区',
     'type': 'LOC',
     'score': 1.0,
     'beginOffset': 9,
     'endOffset': 12},
    {'name': '陈杰',
     'type': 'PER',
     'score': 1.0,
     'beginOffset': 16,
     'endOffset': 18}]}

>>> ner_app.restrict_analyze('同济大学位于上海市杨浦区,校长为陈杰')
{'text': '同济大学位于上海市杨浦区,校长为陈杰',
 'entities': [
    {'name': '同济大学',
     'type': 'ORG',
     'score': 1.0,
     'beginOffset': 0,
     'endOffset': 4},
    {'name': '杨浦区',
     'type': 'LOC',
     'score': 1.0,
     'beginOffset': 9,
     'endOffset': 12},
    {'name': '陈杰',
     'type': 'PER',
     'score': 1.0,
     'beginOffset': 16,
     'endOffset': 18}]}

>>> ner_app.predict('同济大学位于上海市杨浦区,校长为陈杰')
['B-ORG',
 'I-ORG',
 'I-ORG',
 'I-ORG',
 'O',
 'O',
 'B-LOC',
 'I-LOC',
 'I-LOC',
 'B-LOC',
 'I-LOC',
 'I-LOC',
 'O',
 'O',
 'O',
 'O',
 'B-PER',
 'I-PER']
```
Fancy-NLP also loads by default a text classification model trained on a publicly available Chinese news title classification dataset, which predicts the news category for a given news title.
```python
>>> from fancy_nlp.applications import TextClassification
>>> text_classification_app = TextClassification()
```

When you run the above code for the first time, the pre-trained text classification model will be downloaded from the cloud.

```python
>>> text_classification_app.predict('苹果iOS占移动互联网流量份额逾65% 位居第一')
'科技'

>>> text_classification_app.analyze('苹果iOS占移动互联网流量份额逾65% 位居第一')
('科技', 0.9996544)
```

Fancy-NLP also loads by default a text similarity matching model trained on the publicly available WeBank customer service question matching dataset, which predicts whether a given pair of texts expresses the same intent.
```python
>>> from fancy_nlp.applications import SPM
>>> spm_app = SPM()
```

When you run the above code for the first time, the pre-trained text similarity matching model will be downloaded from the cloud.

```python
>>> spm_app.predict(('未满足微众银行审批是什么意思', '为什么我未满足微众银行审批'))
'1'
```

In the prediction result, 1 means the two texts express the same intent (similar texts), and 0 means they do not.

```python
>>> spm_app.analyze(('未满足微众银行审批是什么意思', '为什么我未满足微众银行审批'))
('1', array([1.6599501e-09, 1.0000000e+00], dtype=float32))
```

In the analyze result, the array contains the model's scores for the two labels (0 and 1). In the detailed tutorial, you can learn how to use your own datasets to build custom models for your own scenarios with Fancy-NLP, and get a more comprehensive understanding of Fancy-NLP's interfaces.
To fully experience the following tutorial, you need to download the dataset and BERT model that we use (extraction code: 86geqk7e6rb7). Move the downloaded data to the same directory level as the examples directory; the final directory structure is as follows:
```
.
├── datasets
│   ├── ner
│   │   └── msra
│   │       ├── test_data
│   │       └── train_data
│   ├── spm
│   │   └── webank
│   │       ├── BQ_dev.txt
│   │       ├── BQ_test.txt
│   │       └── BQ_train.txt
│   └── text_classification
│       └── toutiao
│           ├── toutiao_cat_data.txt
│           └── toutiao_label_dict.txt
├── examples
│   ├── bert_combination.py
│   ├── bert_fine_tuning.py
│   ├── bert_single.py
│   ├── ner_example.py
│   ├── spm_example.py
│   └── text_classification_example.py
└── pretrained_embeddings
    └── chinese_L-12_H-768_A-12
        ├── bert_config.json
        ├── bert_model.ckpt.data-00000-of-00001
        ├── bert_model.ckpt.index
        ├── bert_model.ckpt.meta
        └── vocab.txt
```
You can then run the example programs directly, for example: python examples/ner_example.py.
We again use the MSRA NER subset mentioned above as an example to show how to train your own entity recognition model on an existing dataset. For the complete version of the following code snippets, see examples/ner_example.py.
In Fancy-NLP, the entity recognition application supports the standard NER dataset format: each character and its corresponding tag are separated by a tab (\t), and sentences are separated by blank lines. The tagging scheme can be any common standard such as BIO or BIOES.
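For illustration, a small hypothetical fragment of such a file (using part of the example sentence from the getting-started guide, BIO tags, tab-separated) could look like this:

```
同	B-ORG
济	I-ORG
大	I-ORG
学	I-ORG
位	O
于	O
上	B-LOC
海	I-LOC
市	I-LOC

陈	B-PER
杰	I-PER
```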
Using the interface provided by Fancy-NLP, we can directly load the dataset and process it into the format required by the model.
```python
from fancy_nlp.applications import NER
from fancy_nlp.utils import load_ner_data_and_labels

ner_app = NER(use_pretrained=False)

train_data, train_labels = load_ner_data_and_labels('datasets/ner/msra/train_data')
valid_data, valid_labels = load_ner_data_and_labels('datasets/ner/msra/test_data')
```

load_ner_data_and_labels handles loading of NER datasets; you simply pass the file path of the data to load (training set, validation set, or test set). Here the test set is used as the validation set; in real tasks you should keep independent validation and test sets to obtain meaningful evaluation results.
After obtaining valid data, the NER application can start training the model directly.
```python
checkpoint_dir = 'pretrained_models'
model_name = 'msra_ner_bilstm_cnn_crf'

ner_app.fit(train_data, train_labels, valid_data, valid_labels,
            ner_model_type='bilstm_cnn',
            char_embed_trainable=True,
            callback_list=['modelcheckpoint', 'earlystopping', 'swa'],
            checkpoint_dir=checkpoint_dir,
            model_name=model_name,
            load_swa_model=True)
```

The fit interface of the NER application takes the training and validation samples prepared above. The remaining parameters mean the following:
- ner_model_type: the model type to use; this example uses the bilstm_cnn model;
- char_embed_trainable: whether the character embedding layer can be fine-tuned; set to True here, meaning fine-tuning is allowed;
- callback_list: the names of the callbacks to use. The callbacks in this example are:
  - modelcheckpoint: save the trained model after each epoch;
  - earlystopping: stop training if the model's performance does not improve for n consecutive epochs (default n=5);
  - swa: Stochastic Weight Averaging, a common model ensembling strategy that can effectively improve model performance (see the sketch after the evaluation code below, and the original paper for details);
- checkpoint_dir: the directory in which to save model files;
- model_name: the file name of the saved model;
- load_swa_model: whether to load the SWA model weights after training; set to True here, meaning the SWA model is used.

```python
test_data, test_labels = load_ner_data_and_labels('datasets/ner/msra/test_data')
ner_app.score(test_data, test_labels)
```

Here load_ner_data_and_labels is used again to process the test set. After obtaining data in a valid format, use the score interface of the NER application to obtain the model's score on the test set.
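The swa callback mentioned above keeps an equal running average of the model weights collected over training epochs and uses that average as the final model. Below is a minimal numpy sketch of the averaging idea only; it is an illustration of the concept, not Fancy-NLP's internal callback implementation:

```python
import numpy as np

def update_swa_weights(swa_weights, new_weights, n_averaged):
    """Fold one more set of weights into the running SWA average."""
    return [(swa * n_averaged + new) / (n_averaged + 1)
            for swa, new in zip(swa_weights, new_weights)]

# Toy example: the same 2x2 weight matrix after three training epochs.
epoch_weights = [np.full((2, 2), value) for value in (1.0, 2.0, 3.0)]

swa = [np.zeros((2, 2))]
for n_averaged, weights in enumerate(epoch_weights):
    swa = update_swa_weights(swa, [weights], n_averaged)

print(swa[0])  # every entry is 2.0, the average over the three epochs
```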
After training, all model-related files needed by the task should be saved, so that models trained with Fancy-NLP can be used in other external applications.
```python
import os

ner_app.save(
    preprocessor_file=os.path.join(checkpoint_dir, f'{model_name}_preprocessor.pkl'),
    json_file=os.path.join(checkpoint_dir, f'{model_name}.json'))
```

The save interface of the NER application persists the model's structure file (json) and preprocessing results (pickle); the weight file (hdf5) has already been saved during training by the modelcheckpoint callback (and the SWA weights by the swa callback).
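With the settings used above, the pretrained_models directory would then typically contain files along the following lines; the exact weight file names depend on the callbacks, so this listing is illustrative only:

```
pretrained_models/
├── msra_ner_bilstm_cnn_crf.json                # model structure, written by save
├── msra_ner_bilstm_cnn_crf_preprocessor.pkl    # preprocessor, written by save
├── msra_ner_bilstm_cnn_crf.hdf5                # weights saved by the modelcheckpoint callback
└── msra_ner_bilstm_cnn_crf_swa.hdf5            # SWA weights, loaded in the next step
```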
To use the saved model in another application, load it back with the load interface:

```python
ner_app.load(
    preprocessor_file=os.path.join(checkpoint_dir, f'{model_name}_preprocessor.pkl'),
    json_file=os.path.join(checkpoint_dir, f'{model_name}.json'),
    weights_file=os.path.join(checkpoint_dir, f'{model_name}_swa.hdf5'))
```

At this point, ner_app is able to predict on new samples, and you can use the prediction functions described in the getting-started guide, such as analyze and restrict_analyze.
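For example, after loading, ner_app can be used in the same way as the pre-trained application shown earlier:

```python
# Returns the same kind of entity dict as in the getting-started example above;
# the actual entities and scores depend on the model you trained.
ner_app.analyze('同济大学位于上海市杨浦区,校长为陈杰')
```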
We again use the Chinese news title classification dataset mentioned above as an example to show how to train your own text classification model on an existing dataset. For the complete version of the following code snippets, see examples/text_classification_example.py.
In Fancy-NLP, the text classification application supports datasets in which the columns of the original text are separated by a fixed delimiter. The file may contain extra columns that are irrelevant to the classification task; you only need to make sure that the label column and the input text column each sit at a fixed position.
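For instance, a hypothetical line of such a file, using _!_ as the delimiter with the label code in column 1 and the title text in column 3 (matching the configuration used below; the ID, label code, and keyword columns here are invented for illustration), could look like:

```
10001_!_109_!_news_tech_!_苹果iOS占移动互联网流量份额逾65% 位居第一_!_苹果,iOS
```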
In addition, for the classification labels, a mapping file between label IDs and label names is also required. It consists of two columns: the first column is the original label name used in the dataset, usually an encoded ID; the second column is the human-readable name corresponding to that original label. This mapping is used to output readable label names directly when the model makes predictions.
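A hypothetical fragment of such a mapping file (the actual contents of toutiao_label_dict.txt may differ) could look like:

```
109	科技
104	财经
```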
Using the interface provided by Fancy-NLP, we can directly load the dataset and process it into the format required by the model.
```python
from fancy_nlp.applications import TextClassification
from fancy_nlp.utils import load_text_classification_data_and_labels

text_classification_app = TextClassification(use_pretrained=False)

data_file = 'datasets/text_classification/toutiao/toutiao_cat_data.txt'
train_data, train_labels, valid_data, valid_labels, test_data, test_labels = \
    load_text_classification_data_and_labels(data_file,
                                              label_index=1,
                                              text_index=3,
                                              delimiter='_!_',
                                              split_mode=2,
                                              split_size=0.3)
```

load_text_classification_data_and_labels handles loading of text classification datasets; you simply pass the file path of the data to load. Here the complete data file is loaded and then split into training, validation, and test sets. Besides the data file, the remaining parameters mean:
- label_index: the position of the classification label in the data file (positions are numbered from 0);
- text_index: the position of the text to be classified in the data file;
- delimiter: the separator between columns of the data file;
- split_mode: how the original data is divided;
- split_size: the split ratio. When split_mode=1, a split_size fraction of the original data is split off as the validation set; when split_mode=2, a split_size fraction is split off and then divided evenly between the validation set and the test set. For example, with split_mode=2 and split_size=0.3 as above, 70% of the data becomes the training set and the remaining 30% is split into a 15% validation set and a 15% test set.

After obtaining valid data, the text classification application can start training the model directly.
```python
dict_file = 'datasets/text_classification/toutiao/toutiao_label_dict.txt'
model_name = 'toutiao_text_classification_cnn'
checkpoint_dir = 'pretrained_models'

text_classification_app.fit(
    train_data, train_labels, valid_data, valid_labels,
    text_classification_model_type='cnn',
    char_embed_trainable=True,
    callback_list=['modelcheckpoint', 'earlystopping', 'swa'],
    checkpoint_dir=checkpoint_dir,
    model_name=model_name,
    label_dict_file=dict_file,
    max_len=60,
    load_swa_model=True)
```

The fit interface of the text classification application takes the training and validation samples prepared above. The remaining parameters mean the following:
- text_classification_model_type: the model type to use; this example uses the cnn model;
- char_embed_trainable: whether the character embedding layer can be fine-tuned; set to True here, meaning fine-tuning is allowed;
- callback_list: the names of the callbacks to use. The callbacks in this example are:
  - modelcheckpoint: save the trained model after each epoch;
  - earlystopping: stop training if the model's performance does not improve for n consecutive epochs (default n=5);
  - swa: Stochastic Weight Averaging, a common model ensembling strategy that can effectively improve model performance (see the sketch in the entity recognition tutorial above, and the original paper for details);
- checkpoint_dir: the directory in which to save model files;
- model_name: the file name of the saved model;
- label_dict_file: the label dictionary file, consisting of two columns: the first column is the original label name used in the dataset, usually an encoded ID, and the second column is the corresponding human-readable name;
- max_len: the maximum length of input text to keep; text beyond that length is truncated;
- load_swa_model: whether to load the SWA model weights after training; set to True here, meaning the SWA model is used.

```python
text_classification_app.score(test_data, test_labels)
```

Here you can use the score interface of the text classification application directly to obtain the model's score on the test set.
After training, all model-related files needed by the task should be saved, so that models trained with Fancy-NLP can be used in other external applications.
```python
import os

text_classification_app.save(
    preprocessor_file=os.path.join(checkpoint_dir, f'{model_name}_preprocessor.pkl'),
    json_file=os.path.join(checkpoint_dir, f'{model_name}.json'))
```

The save interface of the text classification application persists the model's structure file (json) and preprocessing results (pickle); the weight file (hdf5) has already been saved during training by the modelcheckpoint callback.
To use the saved model in another application, load it back with the load interface:

```python
text_classification_app.load(
    preprocessor_file=os.path.join(checkpoint_dir, f'{model_name}_preprocessor.pkl'),
    json_file=os.path.join(checkpoint_dir, f'{model_name}.json'),
    weights_file=os.path.join(checkpoint_dir, f'{model_name}_swa.hdf5'))
```

At this point, text_classification_app is able to predict on new samples, and you can use the prediction functions described in the getting-started guide, such as predict and analyze.
We again use the WeBank customer service question matching dataset mentioned above as an example to show how to train your own text similarity matching model on an existing dataset. For the complete version of the following code snippets, see examples/spm_example.py.
In Fancy-NLP, the text similarity matching application supports original-text datasets whose columns are separated by a tab (\t). Each line consists of three columns: the first and second columns are a text pair, and the third column is the sample label, where 1 means the two texts are semantically similar and 0 means they are not.
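For example, a line built from the sentence pair used in the getting-started guide (tab-separated, with label 1 marking similar intent) would look like:

```
未满足微众银行审批是什么意思	为什么我未满足微众银行审批	1
```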
Using the interface provided by Fancy-NLP, we can directly load the dataset and process it into the format required by the model.
```python
from fancy_nlp.applications import SPM
from fancy_nlp.utils import load_spm_data_and_labels

spm_app = SPM(use_pretrained=False)

train_file = 'datasets/spm/webank/BQ_train.txt'
valid_file = 'datasets/spm/webank/BQ_dev.txt'

train_data, train_labels = load_spm_data_and_labels(train_file)
valid_data, valid_labels = load_spm_data_and_labels(valid_file)
```

load_spm_data_and_labels handles loading of text similarity matching datasets; you simply pass the file path of the data to load (training set, validation set, or test set).
After obtaining valid data, the text similarity matching application can start training the model directly.
```python
model_name = 'spm_siamese_cnn'
checkpoint_dir = 'pretrained_models'

spm_app.fit(train_data, train_labels, valid_data, valid_labels,
            spm_model_type='siamese_cnn',
            word_embed_trainable=True,
            callback_list=['modelcheckpoint', 'earlystopping', 'swa'],
            checkpoint_dir=checkpoint_dir,
            model_name=model_name,
            max_len=60,
            load_swa_model=True)
```

The fit interface of the text similarity matching application takes the training and validation samples prepared above. The remaining parameters mean the following:
- spm_model_type: the model type to use; this example uses the siamese_cnn model;
- word_embed_trainable: whether the word embedding layer can be fine-tuned; set to True here, meaning fine-tuning is allowed;
- callback_list: the names of the callbacks to use. The callbacks in this example are:
  - modelcheckpoint: save the trained model after each epoch;
  - earlystopping: stop training if the model's performance does not improve for n consecutive epochs (default n=5);
  - swa: Stochastic Weight Averaging, a common model ensembling strategy that can effectively improve model performance (see the sketch in the entity recognition tutorial above, and the original paper for details);
- checkpoint_dir: the directory in which to save model files;
- model_name: the file name of the saved model;
- max_len: the maximum length of input text to keep; text beyond that length is truncated;
- load_swa_model: whether to load the SWA model weights after training; set to True here, meaning the SWA model is used.

```python
test_file = 'datasets/spm/webank/BQ_test.txt'
test_data, test_labels = load_spm_data_and_labels(test_file)
spm_app.score(test_data, test_labels)
```

Here you can use the score interface of the text similarity matching application directly to obtain the model's score on the test set.
After training, all model-related files needed by the task should be saved, so that models trained with Fancy-NLP can be used in other external applications.
```python
import os

spm_app.save(
    preprocessor_file=os.path.join(checkpoint_dir, f'{model_name}_preprocessor.pkl'),
    json_file=os.path.join(checkpoint_dir, f'{model_name}.json'))
```

The save interface of the text similarity matching application persists the model's structure file (json) and preprocessing results (pickle); the weight file (hdf5) has already been saved during training by the modelcheckpoint callback.
To use the saved model in another application, load it back with the load interface:

```python
spm_app.load(
    preprocessor_file=os.path.join(checkpoint_dir, f'{model_name}_preprocessor.pkl'),
    json_file=os.path.join(checkpoint_dir, f'{model_name}.json'),
    weights_file=os.path.join(checkpoint_dir, f'{model_name}_swa.hdf5'))
```

At this point, spm_app is able to predict on new samples, and you can use the prediction functions described in the getting-started guide, such as predict and analyze.
Fancy-NLP provides several ways of using the BERT model, illustrated by the three examples below.
To use BERT in Fancy-NLP, you only need to download a pre-trained BERT-style model (such as the Chinese BERT model provided by Google, the ERNIE model provided by Baidu (extraction code: iq74), or the BERT-wwm model provided by Harbin Institute of Technology), and then pass the paths of the BERT model's vocabulary file, configuration file, and checkpoint file to the fit method of the relevant application. The following examples use the entity recognition application. For the complete example code, see examples/bert_fine_tuning.py, examples/bert_single.py, and examples/bert_combination.py.
Note that the BERT model can only be used with character vectors, not with word vectors.
The first example fine-tunes the BERT model on its own:

```python
import tensorflow as tf
from fancy_nlp.applications import NER
from fancy_nlp.utils import load_ner_data_and_labels

ner_app = NER(use_pretrained=False)

train_data, train_labels = load_ner_data_and_labels('datasets/ner/msra/train_data')
valid_data, valid_labels = load_ner_data_and_labels('datasets/ner/msra/test_data')

ner_app.fit(train_data, train_labels, valid_data, valid_labels,
            ner_model_type='bert',
            use_char=False,
            use_word=False,
            use_bert=True,
            bert_vocab_file='pretrained_embeddings/chinese_L-12_H-768_A-12/vocab.txt',
            bert_config_file='pretrained_embeddings/chinese_L-12_H-768_A-12/bert_config.json',
            bert_checkpoint_file='pretrained_embeddings/chinese_L-12_H-768_A-12/bert_model.ckpt',
            bert_trainable=True,
            optimizer=tf.keras.optimizers.Adam(1e-5),
            callback_list=['modelcheckpoint', 'earlystopping', 'swa'],
            checkpoint_dir='pretrained_models',
            model_name='msra_ner_bert_crf',
            load_swa_model=True)
```

In the above code snippet, note the following:
- ner_model_type: set the model type to bert;
- use_char: do not use character-level vectors;
- use_word: do not use word-level vectors as auxiliary input;
- use_bert: when fine-tuning the BERT model, use only the BERT input;
- bert_vocab_file, bert_config_file, bert_checkpoint_file: the paths of the BERT model's related files;
- bert_trainable: set the BERT model parameters to be trainable, i.e. fine-tuning;
- optimizer: the optimizer for the BERT model; when fine-tuning BERT, the learning rate should be reduced to a small value.

The second example uses BERT vectors as the feature input to a non-BERT model:
```python
from fancy_nlp.applications import NER
from fancy_nlp.utils import load_ner_data_and_labels

ner_app = NER(use_pretrained=False)

train_data, train_labels = load_ner_data_and_labels('datasets/ner/msra/train_data')
valid_data, valid_labels = load_ner_data_and_labels('datasets/ner/msra/test_data')

ner_app.fit(train_data, train_labels, valid_data, valid_labels,
            ner_model_type='bilstm_cnn',
            use_char=False,
            use_word=False,
            use_bert=True,
            bert_vocab_file='pretrained_embeddings/chinese_L-12_H-768_A-12/vocab.txt',
            bert_config_file='pretrained_embeddings/chinese_L-12_H-768_A-12/bert_config.json',
            bert_checkpoint_file='pretrained_embeddings/chinese_L-12_H-768_A-12/bert_model.ckpt',
            bert_trainable=False,
            optimizer='adam',
            callback_list=['modelcheckpoint', 'earlystopping', 'swa'],
            checkpoint_dir='pretrained_models',
            model_name='msra_ner_bilstm_cnn_bert_crf',
            load_swa_model=True)
```

In the above code snippet, note the following:
- ner_model_type: set the model type to bilstm_cnn; a non-BERT model must be used here;
- use_char: do not use character-level vectors;
- use_word: do not use word-level vectors as auxiliary input;
- use_bert: use only BERT vectors as feature input;
- bert_vocab_file, bert_config_file, bert_checkpoint_file: the paths of the BERT model's related files;
- bert_trainable: set the BERT model parameters to be non-trainable; setting this to True is also possible here;
- optimizer: the optimizer; if the BERT model is trainable, it is recommended to reduce the optimizer's learning rate.

The third example combines character vectors and BERT vectors as feature input:
```python
import tensorflow as tf
from fancy_nlp.applications import NER
from fancy_nlp.utils import load_ner_data_and_labels

ner_app = NER(use_pretrained=False)

train_data, train_labels = load_ner_data_and_labels('datasets/ner/msra/train_data')
valid_data, valid_labels = load_ner_data_and_labels('datasets/ner/msra/test_data')

ner_app.fit(train_data, train_labels, valid_data, valid_labels,
            ner_model_type='bilstm_cnn',
            use_char=True,
            use_word=False,
            use_bert=True,
            bert_vocab_file='pretrained_embeddings/chinese_L-12_H-768_A-12/vocab.txt',
            bert_config_file='pretrained_embeddings/chinese_L-12_H-768_A-12/bert_config.json',
            bert_checkpoint_file='pretrained_embeddings/chinese_L-12_H-768_A-12/bert_model.ckpt',
            bert_trainable=True,
            optimizer=tf.keras.optimizers.Adam(1e-5),
            callback_list=['modelcheckpoint', 'earlystopping', 'swa'],
            checkpoint_dir='pretrained_models',
            model_name='msra_ner_bilstm_cnn_char_bert_crf',
            load_swa_model=True)
```

In the above code snippet, note the following:
- ner_model_type: set the model type to bilstm_cnn; a non-BERT model must be used here;
- use_char: use character-level vectors;
- use_word: do not use word-level vectors as auxiliary input;
- use_bert: use BERT vectors, combining character vectors and BERT vectors as feature input;
- bert_vocab_file, bert_config_file, bert_checkpoint_file: the paths of the BERT model's related files;
- bert_trainable: set the BERT model parameters to be trainable; setting this to False is also possible here;
- optimizer: the optimizer; if the BERT model is trainable, it is recommended to reduce the optimizer's learning rate.

Developers interested in improving the Fancy-NLP code should follow the contribution guidelines when submitting pull requests.
If you use Fancy-NLP in your research, you can add the following entry to your citation list:
```
@misc{tencent2019fancynlp,
  title={Fancy-NLP},
  author={Li Yang and Shiyao Xu and Shijia E},
  howpublished={\url{https://github.com/boat-group/fancy-nlp}},
  year={2019}
}
```
This project is inspired by many excellent open source projects, especially Keras. As Keras's slogan, Deep learning for humans, says, we hope Fancy-NLP is NLP for humans, especially for the Chinese NLP field.