JioNLP
1.0.0

$ pip install jionlp

JioNLP is a toolkit for NLP developers, providing accurate, efficient, and ready-to-use preprocessing and parsing functions for NLP tasks. Scroll down this page to see the details of each function, and press Ctrl+F to search. The JioNLP Online Edition lets you quickly try out some of the features. Follow the WeChat official account of the same name, JioNLP, for the latest AI news and data resources.
Download the test data files norm_score.json and max_score.json (extraction password: jmbo), place the *.json files under the test directory, and run:

$ git clone https://github.com/dongrixinyu/JioNLP
$ cd JioNLP/test/
$ python test_mellm.py
>>> import jionlp as jio
>>> llm_test = jio.llm_test_dataset_loader(version='1.1')  # load version 1.1 of the LLM evaluation dataset
>>> print(llm_test[15])
>>> llm_test = jio.llm_test_dataset_loader(field='math')  # load only the math subset
>>> print(llm_test[5])
Install from source:

$ git clone https://github.com/dongrixinyu/JioNLP
$ cd ./JioNLP
$ pip install .

Or install directly via pip:

$ pip install jionlp
>>> import jionlp as jio
>>> print(jio.__version__)  # check the jionlp version
>>> dir(jio)  # list all available functions
>>> print(jio.extract_parentheses.__doc__)  # view the documentation of a specific function
| Feature | Function | Description | Star rating |
|---|---|---|---|
| Find help | help | If you are not sure which functions JioNLP offers, type a few keywords at the command-line prompt to search for them | |
| License plate number parsing | parse_motor_vehicle_licence_plate | Given a license plate number, parse the information it encodes | |
| Time semantic parsing | parse_time | Given a time text, parse its time semantics (time stamp, duration, etc.) | |
| Key phrase extraction | extract_keyphrase | Given a text, extract its key phrases | |
| Text summary extraction | extract_summary | Given a text, extract its summary | |
| Stop word filtering | remove_stopwords | Given the word list of a segmented text, remove the stop words from it | |
| Sentence splitting | split_sentence | Split text into sentences by punctuation | |
| Address parsing | parse_location | Given a string containing a Chinese domestic address, identify the province, city, county, township, street, village, and other components | |
| Phone number location and operator parsing | phone_location, cell_phone_location, landline_phone_location | Given a phone number (mobile or landline) string, identify its province, city, and operator | |
| News place name recognition | recognize_location | Given a news text, identify the domestic provinces, cities, and counties as well as the foreign countries and cities it mentions | |
| Lunar and solar calendar conversion | lunar2solar, solar2lunar | Convert a date between the Chinese lunar calendar and the Gregorian calendar | |
| ID card number parsing | parse_id_card | Given a Chinese ID card number, identify the corresponding province, city, county, date of birth, gender, checksum digit, and other information | |
| Idiom solitaire | idiom_solitaire | Chain idioms so that the first character of each idiom matches the last character (by pronunciation) of the previous one | |
| Pornographic data filtering | - | - | |
| Reactionary data filtering | - | - | |
| Traditional to Simplified Chinese | tra2sim | Convert Traditional Chinese to Simplified Chinese, supporting both character-by-character and maximum-matching modes | |
| Simplified to Traditional Chinese | sim2tra | Convert Simplified Chinese to Traditional Chinese, supporting both character-by-character and maximum-matching modes | |
| Chinese characters to pinyin | pinyin | Return the pinyin of Chinese text, including initials, finals, and tones | |
| Chinese character structure lookup | char_radical | Return the structural information of Chinese characters, including the radical ("河" → 氵), glyph structure ("河" → left-right), four-corner code ("河" → 31120), character decomposition ("河" → 水 可), and Wubi code ("河" → ISKG) | |
| Numeric amount to Chinese characters | money_num2char | Given a numeric amount, return its uppercase Chinese-character (financial) form | |
| New word discovery | new_word_discovery | Given a corpus text file, find character sequences with a high probability of being words | |
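
A quick REPL sketch of a few of the functions above (call names follow the table; exact return shapes may vary, so check each function's __doc__):

>>> import jionlp as jio
>>> jio.parse_time('2021年3月5日上午十点')  # parse the semantics of a time expression
>>> jio.extract_keyphrase('北京市发布年度经济发展报告……')  # extract key phrases from a text
>>> jio.split_sentence('第一句话。第二句话！')  # split text into sentences
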
| Feature | Function | Description | Star rating |
|---|---|---|---|
| Back translation | BackTranslation | Given a text, augment the data by round-tripping it through the machine translation APIs of major cloud platforms | |
| Nearby character position swap | swap_char_position | Randomly swap the positions of nearby characters to augment the data | |
| Homophone replacement | homophone_substitution | Replace words with homophones (same pronunciation) to augment the data | |
| Random character addition and deletion | random_add_delete | Randomly add or delete a character in the text without affecting its semantics | |
| NER entity replacement | replace_entity | Randomly replace an entity in the text according to an entity dictionary, without affecting semantics; widely used for sequence labeling and text classification | |
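
A minimal sketch of the character-level augmentations above; each function is assumed to take a text string and return augmented variants (check each __doc__ for the exact parameters):

>>> import jionlp as jio
>>> jio.homophone_substitution('这是一段测试文本')  # replace words with homophones
>>> jio.random_add_delete('这是一段测试文本')  # randomly add or delete a character
>>> jio.swap_char_position('这是一段测试文本')  # swap positions of nearby characters
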
| Feature | Function | Description | Star rating |
|---|---|---|---|
| Clean text | clean_text | Remove abnormal characters, redundant characters, HTML tags, bracketed content, URLs, e-mail addresses, and phone numbers from text, and convert full-width alphanumerics to half-width | |
| Extract e-mail | extract_email | Extract the e-mail addresses in the text, returning their positions and domain names | |
| Currency amount parsing | extract_money | Parse currency amount strings | |
| Extract WeChat IDs | extract_wechat_id | Extract WeChat IDs and return their positions | |
| Extract phone numbers | extract_phone_number | Extract the phone numbers (including mobile and landline numbers), returning their domain, type, and position | |
| Extract Chinese ID card numbers | extract_id_card | Extract ID card numbers; use together with jio.parse_id_card to obtain detailed information (province, city, date of birth, gender, checksum digit) | |
| Extract QQ numbers | extract_qq | Extract QQ numbers, with both strict and loose rule sets | |
| Extract URLs | extract_url | Extract URL hyperlinks | |
| Extract IP addresses | extract_ip_address | Extract IP addresses | |
| Extract the content in brackets | extract_parentheses | Extract the content inside brackets, including {} “” [] 【】 () （） <> | |
| Extract license plate numbers | extract_motor_vehicle_licence_plate | Extract mainland China license plate numbers | |
| Delete e-mail | remove_email | Delete the e-mail addresses in the text | |
| Delete URLs | remove_url | Delete the URLs in the text | |
| Delete phone numbers | remove_phone_number | Delete the phone numbers in the text | |
| Delete IP addresses | remove_ip_address | Delete the IP addresses in the text | |
| Delete ID card numbers | remove_id_card | Delete the ID card information in the text | |
| Delete QQ numbers | remove_qq | Delete the QQ numbers in the text | |
| Delete HTML tags | remove_html_tag | Delete the HTML tags remaining in the text | |
| Delete the content in brackets | remove_parentheses | Delete the content inside brackets, including {} “” [] 【】 () （） <> | |
| Delete abnormal characters | remove_exception_char | Delete abnormal characters in the text, mainly keeping Chinese characters, common punctuation, unit symbols, and alphanumerics | |
| Delete redundant characters | remove_redundant_char | Delete redundantly repeated characters in the text | |
| Normalize e-mail | replace_email | Replace the e-mail addresses in the text with <email> | |
| Normalize URLs | replace_url | Replace the URLs in the text with <url> | |
| Normalize phone numbers | replace_phone_number | Replace the phone numbers in the text with <tel> | |
| Normalize IP addresses | replace_ip_address | Replace the IP addresses in the text with <ip> | |
| Normalize ID card numbers | replace_id_card | Replace the ID card information in the text with <id> | |
| Normalize QQ numbers | replace_qq | Replace the QQ numbers in the text with <qq> | |
| Check whether text contains Chinese characters | check_any_chinese_char | Return True if the text contains at least one Chinese character | |
| Check whether text is all Chinese characters | check_all_chinese_char | Return True if every character in the text is a Chinese character | |
| Check whether text contains Arabic numerals | check_any_arabic_num | Return True if the text contains at least one Arabic numeral | |
| Check whether text is all Arabic numerals | check_all_arabic_num | Return True if every character in the text is an Arabic numeral | |
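
A hedged sketch chaining several of the cleaning and extraction helpers above (function names are from the table; exact return shapes may differ, so check each __doc__):

>>> import jionlp as jio
>>> text = '联系方式：example@test.com，详情见 http://example.com'
>>> jio.clean_text(text)  # strip URLs, e-mails, HTML tags, etc.
>>> jio.extract_email(text)  # e-mail addresses with their positions
>>> jio.replace_url(text)  # normalize URLs to <url>
>>> jio.check_any_chinese_char(text)  # True: the text contains Chinese characters
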
| Feature | Function | Description | Star rating |
|---|---|---|---|
| Read a file line by line (iterator) | read_file_by_iter | Read a file line by line as an iterator, saving memory; supports specifying the number of lines and skipping empty lines | |
| Read a file line by line | read_file_by_line | Read a file line by line; supports specifying the number of lines and skipping empty lines | |
| Write a list to a file line by line | write_file_by_line | Write the elements of a list to a file, one per line | |
| Timing tool | TimeIt | Measure the time spent by a code segment | |
| Logging tool | set_logger | Adjust the log output of the toolkit | |
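
A short usage sketch; the context-manager style for TimeIt is an assumption suggested by its name (verify with print(jio.TimeIt.__doc__)):

>>> import jionlp as jio
>>> lines = jio.read_file_by_line('data.txt')  # read a file into a list of lines
>>> jio.write_file_by_line(lines, 'data_copy.txt')  # write the list back, one element per line
>>> with jio.TimeIt('loop'):  # time a code segment (assumed context-manager usage)
...     total = sum(range(10 ** 6))
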
| Feature | Function | Description | Star rating |
|---|---|---|---|
| Large language model (LLM) evaluation dataset | llm_test_dataset_loader | Load the LLM evaluation dataset | |
| Byte-level BPE | bpe.byte_level_bpe | Byte-level BPE algorithm | |
| Stop word dictionary | stopwords_loader | Comprehensive stop word dictionary merging Baidu, jieba, iFlytek, and other sources | |
| Idiom dictionary | chinese_idiom_loader | Load the Chinese idiom (chengyu) dictionary | |
| Xiehouyu dictionary | xiehouyu_loader | Load the xiehouyu (two-part allegorical sayings) dictionary | |
| Chinese place name dictionary | china_location_loader | Load the three-level dictionary of China's provinces, cities, and counties | |
| Chinese administrative division change dictionary | china_location_change_loader | Load records of renaming and adjustment of county-level and above administrative divisions in China since 2018 | |
| World place name dictionary | world_location_loader | Load the dictionary of world continents, countries, and cities | |
| Xinhua character dictionary | chinese_char_dictionary_loader | Load the Xinhua character dictionary | |
| Xinhua word dictionary | chinese_word_dictionary_loader | Load the Xinhua word dictionary | |
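
The loaders above return plain Python objects that can be inspected directly, for example:

>>> import jionlp as jio
>>> stopwords = jio.stopwords_loader()  # merged stop word list
>>> idiom_dict = jio.chinese_idiom_loader()  # idiom dictionary
>>> location_dict = jio.china_location_loader()  # province/city/county hierarchy
>>> print(len(stopwords))
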
| Feature | Function | Description | Star rating |
|---|---|---|---|
| Extract currency amount entities | extract_money | Extract currency amounts from text | |
| Extract time entities | extract_time | Extract time entities from text | |
| Dictionary-based NER | LexiconNER | Extract entities by forward maximum matching against a specified entity dictionary | |
| Entity to tag | entity2tag | Convert JSON-format entities to the tag sequence used by models | |
| Tag to entity | tag2entity | Convert the tag sequence produced by a model to JSON-format entities | |
| Character token to word token | char2word | Convert character-level tokens to word-level tokens | |
| Word token to character token | word2char | Convert word-level tokens to character-level tokens | |
| Compare labeled and predicted entities | entity_compare | Compare manually annotated entities with the entities predicted by a model, showing the differences | |
| NER model prediction acceleration | TokenSplitSentence, TokenBreakLongSentence, TokenBatchBucket | Methods for parallel acceleration of NER model prediction | |
| Split dataset | analyze_dataset | Split an NER annotation corpus into training, validation, and test sets, and report the entity type distribution of each subset | |
| Entity collection | collect_dataset_entities | Collect the entities in an annotated corpus into a dictionary | |
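
A hedged sketch of dictionary-based NER; the jio.ner namespace and the callable-recognizer style are assumptions suggested by the table (verify with dir(jio.ner)):

>>> import jionlp as jio
>>> ner = jio.ner.LexiconNER({'Person': ['张三', '李四']})  # assumed constructor: type -> word list
>>> ner('张三和李四在北京见面。')  # forward maximum matching over the text
>>> jio.ner.extract_time('我们明天上午十点开会。')  # extract time entities
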
| Feature | Function | Description | Star rating |
|---|---|---|---|
| Naive Bayes category vocabulary analysis | analyze_freq_words | Perform naive Bayes word frequency analysis on an annotated text classification corpus, returning the words with high conditional probability for each class | |
| Split dataset | analyze_dataset | Split an annotated text classification corpus into training, validation, and test sets, and report the class distribution of each subset | |
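
A sketch of the naive Bayes frequency analysis, assuming it lives under jio.text_classification and takes parallel lists of segmented texts and labels (verify with dir(jio)):

>>> import jionlp as jio
>>> dataset_x = [['美味', '的', '食物'], ['糟糕', '的', '服务']]  # segmented texts
>>> dataset_y = ['positive', 'negative']  # class labels
>>> jio.text_classification.analyze_freq_words(dataset_x, dataset_y)  # high conditional-probability words per class
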
| Feature | Function | Description | Star rating |
|---|---|---|---|
| Dictionary-based sentiment analysis | LexiconSentiment | Compute a sentiment value in the range 0 to 1 for a text, based on a manually constructed sentiment dictionary | |
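
A hypothetical usage sketch; the constructor and call style below are assumptions, so check print(jio.sentiment.LexiconSentiment.__doc__) for the actual interface:

>>> import jionlp as jio
>>> lexicon_sentiment = jio.sentiment.LexiconSentiment()  # assumed no-argument constructor
>>> lexicon_sentiment('这家餐厅的菜味道很好。')  # assumed callable; returns a sentiment value in [0, 1]
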
| Feature | Function | Description | Star rating |
|---|---|---|---|
| Word to tag | cws.word2tag | Convert a JSON-format word segmentation sequence to the tag sequence used by models | |
| Tag to word | cws.tag2word | Convert the tag sequence produced by a model to a JSON-format word segmentation sequence | |
| F1 score statistics | cws.f1 | Compute the F1 score of model-predicted word segmentation tags against the gold labels | |
| Word segmentation data correction with a standard dictionary | cws.CWSDCWithStandardWords | Correct and repair word segmentation annotation data using a standard dictionary | |
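
A sketch of the tag conversion round trip; the exact argument and return shapes below are assumptions (check jio.cws.word2tag.__doc__):

>>> import jionlp as jio
>>> words = ['今天', '天气', '很', '好']
>>> tags = jio.cws.word2tag(words)  # word list -> character-level tag sequence (assumed return shape)
>>> jio.cws.tag2word(''.join(words), tags)  # tag sequence back to a word list (assumed signature)
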
Chengyu Cui, JioNLP, (2020), GitHub repository, https://github.com/dongrixinyu/JioNLP

