
XMNLP: An Out-of-the-Box Chinese Natural Language Processing Toolkit
Install the latest version of xmnlp:

```bash
pip install -U xmnlp
```

Users in mainland China can speed up the download with a mirror index-url:

```bash
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -U xmnlp
```

After installing the package, you also need to download the model weights before xmnlp can be used.
Please download the model matching your xmnlp version. If you are unsure which version is installed, run `python -c 'import xmnlp; print(xmnlp.__version__)'` to check.
| Model name | Applicable versions | Download |
|---|---|---|
| xmnlp-onnx-models-v5.zip | v0.5.0, v0.5.1, v0.5.2, v0.5.3 | Feishu [IGHI] / Baidu Netdisk [l9id] |
| xmnlp-onnx-models-v4.zip | v0.4.0 | Feishu [DKLa] / Baidu Netdisk [j1qi] |
| xmnlp-onnx-models-v3.zip | v0.3.2, v0.3.3 | Feishu [o4bA] / Baidu Netdisk [9g7e] |
After downloading the model, you need to set the model path before xmnlp can run. Two configuration methods are provided.

Method 1: Configure an environment variable (recommended)

After decompressing the downloaded model, set an environment variable pointing to the model directory. On Linux, for example:

```bash
export XMNLP_MODEL=/path/to/xmnlp-models
```

Method 2: Set the path via a function

Set the model path before calling any xmnlp interface, as follows:

```python
import xmnlp
xmnlp.set_model('/path/to/xmnlp-models')
```

* The `/path/to/` above is only a placeholder; replace it with the real directory of the model when configuring.
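The two methods can also be combined: prefer the environment variable when it is set, and fall back to an explicit default otherwise. A minimal sketch (the `resolve_model_dir` helper is hypothetical, not part of the xmnlp API; the default path is a placeholder):

```python
import os

def resolve_model_dir(default='/path/to/xmnlp-models'):
    # Prefer the XMNLP_MODEL environment variable (method 1); otherwise
    # fall back to an explicit path to pass to xmnlp.set_model (method 2).
    return os.environ.get('XMNLP_MODEL', default)

# xmnlp.set_model(resolve_model_dir())  # call this before using any xmnlp interface
```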
Chinese word segmentation (default): based on reverse maximum matching, with RoBERTa + CRF used for new-word recognition.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = """xmnlp 是一款开箱即用的轻量级中文自然语言处理工具?。"""
>>> print(xmnlp.seg(text))
['xmnlp', '是', '一款', '开箱', '即用', '的', '轻量级', '中文', '自然语言', '处理', '工具', '?', '。']
```
Word segmentation based on reverse maximum matching only; it does not include new-word recognition and is faster.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = """xmnlp 是一款开箱即用的轻量级中文自然语言处理工具?。"""
>>> print(xmnlp.fast_seg(text))
['xmnlp', '是', '一款', '开箱', '即', '用', '的', '轻量级', '中文', '自然语言', '处理', '工具', '?', '。']
```
Based on a RoBERTa + CRF model; slower. The deep interfaces currently support only simplified Chinese, not traditional Chinese.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = """xmnlp 是一款开箱即用的轻量级中文自然语言处理工具?。"""
>>> print(xmnlp.deep_seg(text))
['xmnlp', '是', '一款', '开箱', '即用', '的', '轻', '量级', '中文', '自然', '语言', '处理', '工具', '?', '。']
```
Part-of-speech tagging.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = """xmnlp 是一款开箱即用的轻量级中文自然语言处理工具?。"""
>>> print(xmnlp.tag(text))
[('xmnlp', 'eng'), ('是', 'v'), ('一款', 'm'), ('开箱', 'n'), ('即用', 'v'), ('的', 'u'), ('轻量级', 'b'), ('中文', 'nz'), ('自然语言', 'l'), ('处理', 'v'), ('工具', 'n'), ('?', 'x'), ('。', 'x')]
```
Based on reverse maximum matching only; does not include new-word recognition and is faster.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = """xmnlp 是一款开箱即用的轻量级中文自然语言处理工具?。"""
>>> print(xmnlp.fast_tag(text))
[('xmnlp', 'eng'), ('是', 'v'), ('一款', 'm'), ('开箱', 'n'), ('即', 'v'), ('用', 'p'), ('的', 'uj'), ('轻量级', 'b'), ('中文', 'nz'), ('自然语言', 'l'), ('处理', 'v'), ('工具', 'n'), ('?', 'x'), ('。', 'x')]
```
Based on a RoBERTa + CRF model; slower. The deep interfaces currently support only simplified Chinese, not traditional Chinese.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = """xmnlp 是一款开箱即用的轻量级中文自然语言处理工具?。"""
>>> print(xmnlp.deep_tag(text))
[('xmnlp', 'x'), ('是', 'v'), ('一款', 'm'), ('开箱', 'v'), ('即用', 'v'), ('的', 'u'), ('轻', 'nz'), ('量级', 'b'), ('中文', 'nz'), ('自然', 'n'), ('语言', 'n'), ('处理', 'v'), ('工具', 'n'), ('?', 'w'), ('。', 'w')]
```

User-defined dictionaries are supported; the dictionary format is:
```
word1 pos1
word2 pos2
```

The jieba dictionary format is also compatible:

```
word1 freq1 pos1
word2 freq2 pos2
```

Note: the separator between fields in the lines above is a space.
Example of usage:

```python
from xmnlp.lexical.tokenization import Tokenization

# define the tokenizer;
# detect_new_word controls whether new words are recognized
# (default True; setting it to False is faster)
tokenizer = Tokenization(user_dict_path, detect_new_word=True)
# word segmentation
tokenizer.seg(texts)
# part-of-speech tagging
tokenizer.tag(texts)
```
Named entity recognition; the supported entity types are:
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = "现任美国总统是拜登。"
>>> print(xmnlp.ner(text))
[('美国', 'LOCATION', 2, 4), ('总统', 'JOB', 4, 6), ('拜登', 'PERSON', 7, 9)]
```
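Each tuple returned by `ner` is (entity, type, start, end), so post-processing is straightforward. A small sketch grouping entities by type (pure Python over the sample output above, without calling xmnlp; `group_entities` is a hypothetical helper, not part of the xmnlp API):

```python
from collections import defaultdict

def group_entities(ents):
    # ents: list of (text, label, start, end) tuples as returned by xmnlp.ner
    grouped = defaultdict(list)
    for text, label, start, end in ents:
        grouped[label].append((text, start, end))
    return dict(grouped)

ents = [('美国', 'LOCATION', 2, 4), ('总统', 'JOB', 4, 6), ('拜登', 'PERSON', 7, 9)]
print(group_entities(ents))
# {'LOCATION': [('美国', 2, 4)], 'JOB': [('总统', 4, 6)], 'PERSON': [('拜登', 7, 9)]}
```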
Extract keywords from text, based on the TextRank algorithm.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = """自然语言处理: 是人工智能和语言学领域的分支学科。
... 在这此领域中探讨如何处理及运用自然语言;自然语言认知则是指让电脑“懂”人类的
... 语言。
... 自然语言生成系统把计算机数据转化为自然语言。自然语言理解系统把自然语言转化
... 为计算机程序更易于处理的形式。"""
>>> print(xmnlp.keyword(text))
[('自然语言', 2.3000579596585897), ('语言', 1.4734141257937314), ('计算机', 1.3747500999598312), ('转化', 1.2687686226652466), ('系统', 1.1171384775870152), ('领域', 1.0970728069617324), ('人类', 1.0192131829490039), ('生成', 1.0075197087342542), ('认知', 0.9327188339671753), ('指', 0.9218423928455112)]
```
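Since `keyword` returns (word, weight) pairs ordered by weight, thresholding is a one-liner. A sketch over a shortened version of the sample output (`top_keywords` is a hypothetical helper, not part of xmnlp):

```python
def top_keywords(pairs, min_weight=1.0):
    # pairs: (word, weight) tuples as returned by xmnlp.keyword
    return [word for word, weight in pairs if weight >= min_weight]

pairs = [('自然语言', 2.30), ('语言', 1.47), ('计算机', 1.37), ('认知', 0.93)]
print(top_keywords(pairs))  # ['自然语言', '语言', '计算机']
```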
Extract key sentences from text, based on the TextRank algorithm.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = """自然语言处理: 是人工智能和语言学领域的分支学科。
... 在这此领域中探讨如何处理及运用自然语言;自然语言认知则是指让电脑“懂”人类的
... 语言。
... 自然语言生成系统把计算机数据转化为自然语言。自然语言理解系统把自然语言转化
... 为计算机程序更易于处理的形式。"""
>>> print(xmnlp.keyphrase(text, k=2))
['自然语言理解系统把自然语言转化为计算机程序更易于处理的形式', '自然语言生成系统把计算机数据转化为自然语言']
```
Sentiment recognition, trained on an e-commerce review corpus; best suited to sentiment recognition in e-commerce scenarios.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = "这本书真不错,下次还要买"
>>> print(xmnlp.sentiment(text))
(0.02727833203971386, 0.9727216958999634)
```
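The returned pair sums to 1; judging from the positive sample above, the first value appears to be the negative probability and the second the positive one (an assumption worth verifying against your xmnlp version). A sketch for turning the pair into a label (`sentiment_label` is a hypothetical helper):

```python
def sentiment_label(scores, threshold=0.5):
    # scores: (negative_prob, positive_prob) pair, in the order suggested
    # by the xmnlp.sentiment sample output (assumption)
    negative, positive = scores
    return 'positive' if positive >= threshold else 'negative'

print(sentiment_label((0.0273, 0.9727)))  # positive
```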
Convert text to pinyin.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = "自然语言处理"
>>> print(xmnlp.pinyin(text))
['Zi', 'ran', 'yu', 'yan', 'chu', 'li']
```
Extract the radicals of the characters in a text.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = "自然语言处理"
>>> print(xmnlp.radical(text))
['自', '灬', '讠', '言', '夂', '王']
```
Text error correction
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = "不能适应体育专业选拔人材的要求"
>>> print(xmnlp.checker(text))
{(11, '材'): [('才', 1.58528071641922), ('材', 1.0009655653266236), ('裁', 1.0000178480604518), ('员', 0.35814568400382996), ('士', 0.011077565141022205)]}
```

SentenceVector initialization function
The following are the three member functions of SentenceVector
Example of usage:

```python
import numpy as np
from xmnlp.sv import SentenceVector

query = '我想买手机'
docs = [
    '我想买苹果手机',
    '我喜欢吃苹果',
]

sv = SentenceVector(genre='通用')
for doc in docs:
    print('doc:', doc)
    print('similarity:', sv.similarity(query, doc))
print('most similar doc:', sv.most_similar(query, docs))
print('query representation shape:', sv.transform(query).shape)
```

Output:

```
doc: 我想买苹果手机
similarity: 0.68668646
doc: 我喜欢吃苹果
similarity: 0.3020076
most similar doc: [('我想买苹果手机', 16.255546509314417)]
query representation shape: (312,)
```
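Because `transform` returns a fixed-size embedding (shape (312,) here), other similarity measures can be computed directly on the vectors. A hedged sketch of cosine similarity in pure Python (note that `similarity`/`most_similar` above may use a different metric internally):

```python
import math

def cosine_similarity(a, b):
    # a, b: 1-D embedding vectors, e.g. from SentenceVector.transform
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```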
The new version no longer provides dedicated parallel-processing interfaces; instead, use xmnlp.utils.parallel_handler to build one. Its signature is:

```python
xmnlp.utils.parallel_handler(callback: Callable, texts: List[str], n_jobs: int = 2, **kwargs) -> Generator[List[Any], None, None]
```

Example of usage:
```python
from functools import partial

import xmnlp
from xmnlp.utils import parallel_handler

seg_parallel = partial(parallel_handler, xmnlp.seg)
print(seg_parallel(texts))
```
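`parallel_handler` yields results in batches, one per worker chunk. To illustrate how texts might be split across `n_jobs` workers, here is a hypothetical chunking helper (for illustration only; the real work distribution is internal to xmnlp):

```python
def chunk_texts(texts, n_jobs=2):
    # Split texts into at most n_jobs roughly equal batches, mirroring
    # how work might be distributed across parallel workers.
    k, m = divmod(len(texts), n_jobs)
    batches, start = [], 0
    for i in range(n_jobs):
        end = start + k + (1 if i < m else 0)
        if end > start:
            batches.append(texts[start:end])
        start = end
    return batches

print(chunk_texts(['a', 'b', 'c', 'd', 'e'], 2))  # [['a', 'b', 'c'], ['d', 'e']]
```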
Contributions from more friends are welcome; let's build a simple and easy-to-use Chinese NLP toolkit together.
```bibtex
@misc{xmnlp,
  title={XMNLP: A Lightweight Chinese Natural Language Processing Toolkit},
  author={Xianming Li},
  year={2018},
  publisher={GitHub},
  howpublished={\url{https://github.com/SeanLee97/xmnlp}},
}
```
I work on NLP research and applications; my interests include information extraction, sentiment classification, and more.

For other NLP implementation needs, please contact [email protected] (this is a paid service; bugs related to xmnlp can be reported directly).

Search for and follow the official account xmnlp-ai, then select "Communication Group" in the menu to join the group.
The data used in this project come mainly from:
Apache 2.0
Most models are built based on LangML