
XMNLP: An Out-of-the-Box Chinese Natural Language Processing Toolkit
Install the latest version of xmnlp:

```bash
pip install -U xmnlp
```

Users in mainland China can speed up the download with a mirror index-url:

```bash
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -U xmnlp
```

After installing the package, you also need to download the model weights before xmnlp can be used.
Please download the model matching your xmnlp version. If you are unsure which version is installed, run `python -c 'import xmnlp; print(xmnlp.__version__)'` to check.
| Model name | Applicable versions | Download |
|---|---|---|
| xmnlp-onnx-models-v5.zip | v0.5.0, v0.5.1, v0.5.2, v0.5.3 | Feishu [IGHI] / Baidu Netdisk [l9id] |
| xmnlp-onnx-models-v4.zip | v0.4.0 | Feishu [DKLa] / Baidu Netdisk [j1qi] |
| xmnlp-onnx-models-v3.zip | v0.3.2, v0.3.3 | Feishu [o4bA] / Baidu Netdisk [9g7e] |
After downloading the model, you need to set the model path before xmnlp can run. Two configuration methods are provided.

Method 1: Configure an environment variable (recommended)

After decompressing the downloaded model, set an environment variable pointing to the model directory. On Linux, for example:

```bash
export XMNLP_MODEL=/path/to/xmnlp-models
```

Method 2: Set the path via a function

Set the model path before calling any xmnlp interface, as follows:

```python
import xmnlp
xmnlp.set_model('/path/to/xmnlp-models')
```

* The `/path/to/` above is only a placeholder; replace it with the real directory of the model when configuring.
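The two methods can also be combined: prefer the environment variable when it is set, and fall back to an explicit default otherwise. A minimal sketch (the `resolve_model_dir` helper is hypothetical, not part of the xmnlp API; the default path is a placeholder):

```python
import os

def resolve_model_dir(default='/path/to/xmnlp-models'):
    # Prefer the XMNLP_MODEL environment variable (method 1); otherwise
    # fall back to an explicit path to pass to xmnlp.set_model (method 2).
    return os.environ.get('XMNLP_MODEL', default)

# xmnlp.set_model(resolve_model_dir())  # call this before using any xmnlp interface
```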
Chinese word segmentation (default): based on reverse maximum matching, with RoBERTa + CRF used for new-word recognition.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = """xmnlp 是一款开箱即用的轻量级中文自然语言处理工具?。"""
>>> print(xmnlp.seg(text))
['xmnlp', '是', '一款', '开箱', '即用', '的', '轻量级', '中文', '自然语言', '处理', '工具', '?', '。']
```
Word segmentation based on reverse maximum matching only; it does not include new-word recognition and is faster.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = """xmnlp 是一款开箱即用的轻量级中文自然语言处理工具?。"""
>>> print(xmnlp.fast_seg(text))
['xmnlp', '是', '一款', '开箱', '即', '用', '的', '轻量级', '中文', '自然语言', '处理', '工具', '?', '。']
```
Based on a RoBERTa + CRF model; slower. The deep interfaces currently support only simplified Chinese, not traditional Chinese.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = """xmnlp 是一款开箱即用的轻量级中文自然语言处理工具?。"""
>>> print(xmnlp.deep_seg(text))
['xmnlp', '是', '一款', '开箱', '即用', '的', '轻', '量级', '中文', '自然', '语言', '处理', '工具', '?', '。']
```
Part-of-speech tagging.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = """xmnlp 是一款开箱即用的轻量级中文自然语言处理工具?。"""
>>> print(xmnlp.tag(text))
[('xmnlp', 'eng'), ('是', 'v'), ('一款', 'm'), ('开箱', 'n'), ('即用', 'v'), ('的', 'u'), ('轻量级', 'b'), ('中文', 'nz'), ('自然语言', 'l'), ('处理', 'v'), ('工具', 'n'), ('?', 'x'), ('。', 'x')]
```
Based on reverse maximum matching only; does not include new-word recognition and is faster.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = """xmnlp 是一款开箱即用的轻量级中文自然语言处理工具?。"""
>>> print(xmnlp.fast_tag(text))
[('xmnlp', 'eng'), ('是', 'v'), ('一款', 'm'), ('开箱', 'n'), ('即', 'v'), ('用', 'p'), ('的', 'uj'), ('轻量级', 'b'), ('中文', 'nz'), ('自然语言', 'l'), ('处理', 'v'), ('工具', 'n'), ('?', 'x'), ('。', 'x')]
```
Based on a RoBERTa + CRF model; slower. The deep interfaces currently support only simplified Chinese, not traditional Chinese.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = """xmnlp 是一款开箱即用的轻量级中文自然语言处理工具?。"""
>>> print(xmnlp.deep_tag(text))
[('xmnlp', 'x'), ('是', 'v'), ('一款', 'm'), ('开箱', 'v'), ('即用', 'v'), ('的', 'u'), ('轻', 'nz'), ('量级', 'b'), ('中文', 'nz'), ('自然', 'n'), ('语言', 'n'), ('处理', 'v'), ('工具', 'n'), ('?', 'w'), ('。', 'w')]
```

User-defined dictionaries are supported; the dictionary format is:
```
word1 pos1
word2 pos2
```

The jieba dictionary format is also compatible:

```
word1 freq1 pos1
word2 freq2 pos2
```

Note: the separator between fields in the lines above is a space.
Example of usage:

```python
from xmnlp.lexical.tokenization import Tokenization

# define the tokenizer;
# detect_new_word controls whether new words are recognized
# (default True; setting it to False is faster)
tokenizer = Tokenization(user_dict_path, detect_new_word=True)
# word segmentation
tokenizer.seg(texts)
# part-of-speech tagging
tokenizer.tag(texts)
```
Named entity recognition; the supported entity types are:
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = "现任美国总统是拜登。"
>>> print(xmnlp.ner(text))
[('美国', 'LOCATION', 2, 4), ('总统', 'JOB', 4, 6), ('拜登', 'PERSON', 7, 9)]
```
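Each tuple returned by `ner` is (entity, type, start, end), so post-processing is straightforward. A small sketch grouping entities by type (pure Python over the sample output above, without calling xmnlp; `group_entities` is a hypothetical helper, not part of the xmnlp API):

```python
from collections import defaultdict

def group_entities(ents):
    # ents: list of (text, label, start, end) tuples as returned by xmnlp.ner
    grouped = defaultdict(list)
    for text, label, start, end in ents:
        grouped[label].append((text, start, end))
    return dict(grouped)

ents = [('美国', 'LOCATION', 2, 4), ('总统', 'JOB', 4, 6), ('拜登', 'PERSON', 7, 9)]
print(group_entities(ents))
# {'LOCATION': [('美国', 2, 4)], 'JOB': [('总统', 4, 6)], 'PERSON': [('拜登', 7, 9)]}
```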
Extract keywords from text, based on the TextRank algorithm.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = """自然语言处理: 是人工智能和语言学领域的分支学科。
... 在这此领域中探讨如何处理及运用自然语言;自然语言认知则是指让电脑“懂”人类的
... 语言。
... 自然语言生成系统把计算机数据转化为自然语言。自然语言理解系统把自然语言转化
... 为计算机程序更易于处理的形式。"""
>>> print(xmnlp.keyword(text))
[('自然语言', 2.3000579596585897), ('语言', 1.4734141257937314), ('计算机', 1.3747500999598312), ('转化', 1.2687686226652466), ('系统', 1.1171384775870152), ('领域', 1.0970728069617324), ('人类', 1.0192131829490039), ('生成', 1.0075197087342542), ('认知', 0.9327188339671753), ('指', 0.9218423928455112)]
```
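Since `keyword` returns (word, weight) pairs ordered by weight, thresholding is a one-liner. A sketch over a shortened version of the sample output (`top_keywords` is a hypothetical helper, not part of xmnlp):

```python
def top_keywords(pairs, min_weight=1.0):
    # pairs: (word, weight) tuples as returned by xmnlp.keyword
    return [word for word, weight in pairs if weight >= min_weight]

pairs = [('自然语言', 2.30), ('语言', 1.47), ('计算机', 1.37), ('认知', 0.93)]
print(top_keywords(pairs))  # ['自然语言', '语言', '计算机']
```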
Extract key sentences from text, based on the TextRank algorithm.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = """自然语言处理: 是人工智能和语言学领域的分支学科。
... 在这此领域中探讨如何处理及运用自然语言;自然语言认知则是指让电脑“懂”人类的
... 语言。
... 自然语言生成系统把计算机数据转化为自然语言。自然语言理解系统把自然语言转化
... 为计算机程序更易于处理的形式。"""
>>> print(xmnlp.keyphrase(text, k=2))
['自然语言理解系统把自然语言转化为计算机程序更易于处理的形式', '自然语言生成系统把计算机数据转化为自然语言']
```
Sentiment recognition, trained on an e-commerce review corpus; best suited to sentiment recognition in e-commerce scenarios.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = "这本书真不错,下次还要买"
>>> print(xmnlp.sentiment(text))
(0.02727833203971386, 0.9727216958999634)
```
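The returned pair sums to 1; judging from the positive sample above, the first value appears to be the negative probability and the second the positive one (an assumption worth verifying against your xmnlp version). A sketch for turning the pair into a label (`sentiment_label` is a hypothetical helper):

```python
def sentiment_label(scores, threshold=0.5):
    # scores: (negative_prob, positive_prob) pair, in the order suggested
    # by the xmnlp.sentiment sample output (assumption)
    negative, positive = scores
    return 'positive' if positive >= threshold else 'negative'

print(sentiment_label((0.0273, 0.9727)))  # positive
```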
Convert text to pinyin.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = "自然语言处理"
>>> print(xmnlp.pinyin(text))
['Zi', 'ran', 'yu', 'yan', 'chu', 'li']
```
Extract the radicals of the characters in a text.
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = "自然语言处理"
>>> print(xmnlp.radical(text))
['自', '灬', '讠', '言', '夂', '王']
```
Text error correction
Parameters:

Returns:

Example:

```python
>>> import xmnlp
>>> text = "不能适应体育专业选拔人材的要求"
>>> print(xmnlp.checker(text))
{(11, '材'): [('才', 1.58528071641922), ('材', 1.0009655653266236), ('裁', 1.0000178480604518), ('员', 0.35814568400382996), ('士', 0.011077565141022205)]}
```

SentenceVector initialization function
The following are the three member functions of SentenceVector
Example of usage:

```python
import numpy as np
from xmnlp.sv import SentenceVector

query = '我想买手机'
docs = [
    '我想买苹果手机',
    '我喜欢吃苹果',
]

sv = SentenceVector(genre='通用')
for doc in docs:
    print('doc:', doc)
    print('similarity:', sv.similarity(query, doc))
print('most similar doc:', sv.most_similar(query, docs))
print('query representation shape:', sv.transform(query).shape)
```

Output:

```
doc: 我想买苹果手机
similarity: 0.68668646
doc: 我喜欢吃苹果
similarity: 0.3020076
most similar doc: [('我想买苹果手机', 16.255546509314417)]
query representation shape: (312,)
```
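Because `transform` returns a fixed-size embedding (shape (312,) here), other similarity measures can be computed directly on the vectors. A hedged sketch of cosine similarity in pure Python (note that `similarity`/`most_similar` above may use a different metric internally):

```python
import math

def cosine_similarity(a, b):
    # a, b: 1-D embedding vectors, e.g. from SentenceVector.transform
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```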
The new version no longer provides dedicated parallel-processing interfaces; instead, use xmnlp.utils.parallel_handler to build one. Its signature is:

```python
xmnlp.utils.parallel_handler(callback: Callable, texts: List[str], n_jobs: int = 2, **kwargs) -> Generator[List[Any], None, None]
```

Example of usage:
```python
from functools import partial

import xmnlp
from xmnlp.utils import parallel_handler

seg_parallel = partial(parallel_handler, xmnlp.seg)
print(seg_parallel(texts))
```
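`parallel_handler` yields results in batches, one per worker chunk. To illustrate how texts might be split across `n_jobs` workers, here is a hypothetical chunking helper (for illustration only; the real work distribution is internal to xmnlp):

```python
def chunk_texts(texts, n_jobs=2):
    # Split texts into at most n_jobs roughly equal batches, mirroring
    # how work might be distributed across parallel workers.
    k, m = divmod(len(texts), n_jobs)
    batches, start = [], 0
    for i in range(n_jobs):
        end = start + k + (1 if i < m else 0)
        if end > start:
            batches.append(texts[start:end])
        start = end
    return batches

print(chunk_texts(['a', 'b', 'c', 'd', 'e'], 2))  # [['a', 'b', 'c'], ['d', 'e']]
```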
Contributions from more friends are welcome; let's build a simple and easy-to-use Chinese NLP toolkit together.
```bibtex
@misc{xmnlp,
  title={XMNLP: A Lightweight Chinese Natural Language Processing Toolkit},
  author={Xianming Li},
  year={2018},
  publisher={GitHub},
  howpublished={\url{https://github.com/SeanLee97/xmnlp}},
}
```
I work on NLP research and applications; my interests include information extraction, sentiment classification, and more.

For other NLP implementation needs, please contact [email protected] (this is a paid service; bugs related to xmnlp can be reported directly).

Search for and follow the official account xmnlp-ai, then select "Communication Group" in the menu to join the group.
The data used in this project come mainly from:
Apache 2.0
Most models are built based on LangML