JioNLP
1.0.0

$ pip install jionlp

JioNLP is a toolkit for NLP developers, providing accurate, efficient, and ready-to-use preprocessing and parsing functions for NLP tasks. Scroll down this page to see the details of each function, and press Ctrl+F to search. The JioNLP Online Edition lets you quickly try out some of the features. Follow the WeChat official account of the same name, JioNLP, for the latest AI news and data resources.
Download the test data files norm_score.json and max_score.json (extraction password: jmbo), place the *.json files under the test directory, and run:

$ git clone https://github.com/dongrixinyu/JioNLP
$ cd JioNLP/test/
$ python test_mellm.py
>>> import jionlp as jio
>>> llm_test = jio.llm_test_dataset_loader(version='1.1')  # load version 1.1 of the LLM evaluation dataset
>>> print(llm_test[15])
>>> llm_test = jio.llm_test_dataset_loader(field='math')  # load only the math subset
>>> print(llm_test[5])
Install from source:

$ git clone https://github.com/dongrixinyu/JioNLP
$ cd ./JioNLP
$ pip install .

Or install directly via pip:

$ pip install jionlp
>>> import jionlp as jio
>>> print(jio.__version__)  # check the jionlp version
>>> dir(jio)  # list all available functions
>>> print(jio.extract_parentheses.__doc__)  # view the documentation of a specific function
| Feature | Function | Description | Star rating |
|---|---|---|---|
| Find help | help | If you are not sure which functions JioNLP offers, type a few keywords at the command-line prompt to search for them | |
| License plate number parsing | parse_motor_vehicle_licence_plate | Given a license plate number, parse the information it encodes | |
| Time semantic parsing | parse_time | Given a time text, parse its time semantics (time stamp, duration, etc.) | |
| Key phrase extraction | extract_keyphrase | Given a text, extract its key phrases | |
| Text summary extraction | extract_summary | Given a text, extract its summary | |
| Stop word filtering | remove_stopwords | Given the word list of a segmented text, remove the stop words from it | |
| Sentence splitting | split_sentence | Split text into sentences by punctuation | |
| Address parsing | parse_location | Given a string containing a Chinese domestic address, identify the province, city, county, township, street, village, and other components | |
| Phone number location and operator parsing | phone_location, cell_phone_location, landline_phone_location | Given a phone number (mobile or landline) string, identify its province, city, and operator | |
| News place name recognition | recognize_location | Given a news text, identify the domestic provinces, cities, and counties as well as the foreign countries and cities it mentions | |
| Lunar and solar calendar conversion | lunar2solar, solar2lunar | Convert a date between the Chinese lunar calendar and the Gregorian calendar | |
| ID card number parsing | parse_id_card | Given a Chinese ID card number, identify the corresponding province, city, county, date of birth, gender, checksum digit, and other information | |
| Idiom solitaire | idiom_solitaire | Chain idioms so that the first character of each idiom matches the last character (by pronunciation) of the previous one | |
| Pornographic data filtering | - | - | |
| Reactionary data filtering | - | - | |
| Traditional to Simplified Chinese | tra2sim | Convert Traditional Chinese to Simplified Chinese, supporting both character-by-character and maximum-matching modes | |
| Simplified to Traditional Chinese | sim2tra | Convert Simplified Chinese to Traditional Chinese, supporting both character-by-character and maximum-matching modes | |
| Chinese characters to pinyin | pinyin | Return the pinyin of Chinese text, including initials, finals, and tones | |
| Chinese character structure lookup | char_radical | Return the structural information of Chinese characters, including the radical ("河" → 氵), glyph structure ("河" → left-right), four-corner code ("河" → 31120), character decomposition ("河" → 水 可), and Wubi code ("河" → ISKG) | |
| Numeric amount to Chinese characters | money_num2char | Given a numeric amount, return its uppercase Chinese-character (financial) form | |
| New word discovery | new_word_discovery | Given a corpus text file, find character sequences with a high probability of being words | |
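
A quick REPL sketch of a few of the functions above (call names follow the table; exact return shapes may vary, so check each function's __doc__):

>>> import jionlp as jio
>>> jio.parse_time('2021年3月5日上午十点')  # parse the semantics of a time expression
>>> jio.extract_keyphrase('北京市发布年度经济发展报告……')  # extract key phrases from a text
>>> jio.split_sentence('第一句话。第二句话！')  # split text into sentences
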
| Feature | Function | Description | Star rating |
|---|---|---|---|
| Back translation | BackTranslation | Given a text, augment the data by round-tripping it through the machine translation APIs of major cloud platforms | |
| Nearby character position swap | swap_char_position | Randomly swap the positions of nearby characters to augment the data | |
| Homophone replacement | homophone_substitution | Replace words with homophones (same pronunciation) to augment the data | |
| Random character addition and deletion | random_add_delete | Randomly add or delete a character in the text without affecting its semantics | |
| NER entity replacement | replace_entity | Randomly replace an entity in the text according to an entity dictionary, without affecting semantics; widely used for sequence labeling and text classification | |
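
A minimal sketch of the character-level augmentations above; each function is assumed to take a text string and return augmented variants (check each __doc__ for the exact parameters):

>>> import jionlp as jio
>>> jio.homophone_substitution('这是一段测试文本')  # replace words with homophones
>>> jio.random_add_delete('这是一段测试文本')  # randomly add or delete a character
>>> jio.swap_char_position('这是一段测试文本')  # swap positions of nearby characters
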
| Feature | Function | Description | Star rating |
|---|---|---|---|
| Clean text | clean_text | Remove abnormal characters, redundant characters, HTML tags, bracketed content, URLs, e-mail addresses, and phone numbers from text, and convert full-width alphanumerics to half-width | |
| Extract e-mail | extract_email | Extract the e-mail addresses in the text, returning their positions and domain names | |
| Currency amount parsing | extract_money | Parse currency amount strings | |
| Extract WeChat IDs | extract_wechat_id | Extract WeChat IDs and return their positions | |
| Extract phone numbers | extract_phone_number | Extract the phone numbers (including mobile and landline numbers), returning their domain, type, and position | |
| Extract Chinese ID card numbers | extract_id_card | Extract ID card numbers; use together with jio.parse_id_card to obtain detailed information (province, city, date of birth, gender, checksum digit) | |
| Extract QQ numbers | extract_qq | Extract QQ numbers, with both strict and loose rule sets | |
| Extract URLs | extract_url | Extract URL hyperlinks | |
| Extract IP addresses | extract_ip_address | Extract IP addresses | |
| Extract the content in brackets | extract_parentheses | Extract the content inside brackets, including {} “” [] 【】 () （） <> | |
| Extract license plate numbers | extract_motor_vehicle_licence_plate | Extract mainland China license plate numbers | |
| Delete e-mail | remove_email | Delete the e-mail addresses in the text | |
| Delete URLs | remove_url | Delete the URLs in the text | |
| Delete phone numbers | remove_phone_number | Delete the phone numbers in the text | |
| Delete IP addresses | remove_ip_address | Delete the IP addresses in the text | |
| Delete ID card numbers | remove_id_card | Delete the ID card information in the text | |
| Delete QQ numbers | remove_qq | Delete the QQ numbers in the text | |
| Delete HTML tags | remove_html_tag | Delete the HTML tags remaining in the text | |
| Delete the content in brackets | remove_parentheses | Delete the content inside brackets, including {} “” [] 【】 () （） <> | |
| Delete abnormal characters | remove_exception_char | Delete abnormal characters in the text, mainly keeping Chinese characters, common punctuation, unit symbols, and alphanumerics | |
| Delete redundant characters | remove_redundant_char | Delete redundantly repeated characters in the text | |
| Normalize e-mail | replace_email | Replace the e-mail addresses in the text with <email> | |
| Normalize URLs | replace_url | Replace the URLs in the text with <url> | |
| Normalize phone numbers | replace_phone_number | Replace the phone numbers in the text with <tel> | |
| Normalize IP addresses | replace_ip_address | Replace the IP addresses in the text with <ip> | |
| Normalize ID card numbers | replace_id_card | Replace the ID card information in the text with <id> | |
| Normalize QQ numbers | replace_qq | Replace the QQ numbers in the text with <qq> | |
| Check whether text contains Chinese characters | check_any_chinese_char | Return True if the text contains at least one Chinese character | |
| Check whether text is all Chinese characters | check_all_chinese_char | Return True if every character in the text is a Chinese character | |
| Check whether text contains Arabic numerals | check_any_arabic_num | Return True if the text contains at least one Arabic numeral | |
| Check whether text is all Arabic numerals | check_all_arabic_num | Return True if every character in the text is an Arabic numeral | |
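
A hedged sketch chaining several of the cleaning and extraction helpers above (function names are from the table; exact return shapes may differ, so check each __doc__):

>>> import jionlp as jio
>>> text = '联系方式：example@test.com，详情见 http://example.com'
>>> jio.clean_text(text)  # strip URLs, e-mails, HTML tags, etc.
>>> jio.extract_email(text)  # e-mail addresses with their positions
>>> jio.replace_url(text)  # normalize URLs to <url>
>>> jio.check_any_chinese_char(text)  # True: the text contains Chinese characters
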
| Feature | Function | Description | Star rating |
|---|---|---|---|
| Read a file line by line (iterator) | read_file_by_iter | Read a file line by line as an iterator, saving memory; supports specifying the number of lines and skipping empty lines | |
| Read a file line by line | read_file_by_line | Read a file line by line; supports specifying the number of lines and skipping empty lines | |
| Write a list to a file line by line | write_file_by_line | Write the elements of a list to a file, one per line | |
| Timing tool | TimeIt | Measure the time spent by a code segment | |
| Logging tool | set_logger | Adjust the log output of the toolkit | |
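
A short usage sketch; the context-manager style for TimeIt is an assumption suggested by its name (verify with print(jio.TimeIt.__doc__)):

>>> import jionlp as jio
>>> lines = jio.read_file_by_line('data.txt')  # read a file into a list of lines
>>> jio.write_file_by_line(lines, 'data_copy.txt')  # write the list back, one element per line
>>> with jio.TimeIt('loop'):  # time a code segment (assumed context-manager usage)
...     total = sum(range(10 ** 6))
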
| Feature | Function | Description | Star rating |
|---|---|---|---|
| Large language model (LLM) evaluation dataset | llm_test_dataset_loader | Load the LLM evaluation dataset | |
| Byte-level BPE | bpe.byte_level_bpe | Byte-level BPE algorithm | |
| Stop word dictionary | stopwords_loader | Comprehensive stop word dictionary merging Baidu, jieba, iFlytek, and other sources | |
| Idiom dictionary | chinese_idiom_loader | Load the Chinese idiom (chengyu) dictionary | |
| Xiehouyu dictionary | xiehouyu_loader | Load the xiehouyu (two-part allegorical sayings) dictionary | |
| Chinese place name dictionary | china_location_loader | Load the three-level dictionary of China's provinces, cities, and counties | |
| Chinese administrative division change dictionary | china_location_change_loader | Load records of renaming and adjustment of county-level and above administrative divisions in China since 2018 | |
| World place name dictionary | world_location_loader | Load the dictionary of world continents, countries, and cities | |
| Xinhua character dictionary | chinese_char_dictionary_loader | Load the Xinhua character dictionary | |
| Xinhua word dictionary | chinese_word_dictionary_loader | Load the Xinhua word dictionary | |
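
The loaders above return plain Python objects that can be inspected directly, for example:

>>> import jionlp as jio
>>> stopwords = jio.stopwords_loader()  # merged stop word list
>>> idiom_dict = jio.chinese_idiom_loader()  # idiom dictionary
>>> location_dict = jio.china_location_loader()  # province/city/county hierarchy
>>> print(len(stopwords))
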
| Feature | Function | Description | Star rating |
|---|---|---|---|
| Extract currency amount entities | extract_money | Extract currency amounts from text | |
| Extract time entities | extract_time | Extract time entities from text | |
| Dictionary-based NER | LexiconNER | Extract entities by forward maximum matching against a specified entity dictionary | |
| Entity to tag | entity2tag | Convert JSON-format entities to the tag sequence used by models | |
| Tag to entity | tag2entity | Convert the tag sequence produced by a model to JSON-format entities | |
| Character token to word token | char2word | Convert character-level tokens to word-level tokens | |
| Word token to character token | word2char | Convert word-level tokens to character-level tokens | |
| Compare labeled and predicted entities | entity_compare | Compare manually annotated entities with the entities predicted by a model, showing the differences | |
| NER model prediction acceleration | TokenSplitSentence, TokenBreakLongSentence, TokenBatchBucket | Methods for parallel acceleration of NER model prediction | |
| Split dataset | analyze_dataset | Split an NER annotation corpus into training, validation, and test sets, and report the entity type distribution of each subset | |
| Entity collection | collect_dataset_entities | Collect the entities in an annotated corpus into a dictionary | |
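
A hedged sketch of dictionary-based NER; the jio.ner namespace and the callable-recognizer style are assumptions suggested by the table (verify with dir(jio.ner)):

>>> import jionlp as jio
>>> ner = jio.ner.LexiconNER({'Person': ['张三', '李四']})  # assumed constructor: type -> word list
>>> ner('张三和李四在北京见面。')  # forward maximum matching over the text
>>> jio.ner.extract_time('我们明天上午十点开会。')  # extract time entities
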
| Feature | Function | Description | Star rating |
|---|---|---|---|
| Naive Bayes category vocabulary analysis | analyze_freq_words | Perform naive Bayes word frequency analysis on an annotated text classification corpus, returning the words with high conditional probability for each class | |
| Split dataset | analyze_dataset | Split an annotated text classification corpus into training, validation, and test sets, and report the class distribution of each subset | |
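
A sketch of the naive Bayes frequency analysis, assuming it lives under jio.text_classification and takes parallel lists of segmented texts and labels (verify with dir(jio)):

>>> import jionlp as jio
>>> dataset_x = [['美味', '的', '食物'], ['糟糕', '的', '服务']]  # segmented texts
>>> dataset_y = ['positive', 'negative']  # class labels
>>> jio.text_classification.analyze_freq_words(dataset_x, dataset_y)  # high conditional-probability words per class
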
| Feature | Function | Description | Star rating |
|---|---|---|---|
| Dictionary-based sentiment analysis | LexiconSentiment | Compute a sentiment value in the range 0 to 1 for a text, based on a manually constructed sentiment dictionary | |
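
A hypothetical usage sketch; the constructor and call style below are assumptions, so check print(jio.sentiment.LexiconSentiment.__doc__) for the actual interface:

>>> import jionlp as jio
>>> lexicon_sentiment = jio.sentiment.LexiconSentiment()  # assumed no-argument constructor
>>> lexicon_sentiment('这家餐厅的菜味道很好。')  # assumed callable; returns a sentiment value in [0, 1]
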
| Feature | Function | Description | Star rating |
|---|---|---|---|
| Word to tag | cws.word2tag | Convert a JSON-format word segmentation sequence to the tag sequence used by models | |
| Tag to word | cws.tag2word | Convert the tag sequence produced by a model to a JSON-format word segmentation sequence | |
| F1 score statistics | cws.f1 | Compute the F1 score of model-predicted word segmentation tags against the gold labels | |
| Word segmentation data correction with a standard dictionary | cws.CWSDCWithStandardWords | Correct and repair word segmentation annotation data using a standard dictionary | |
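
A sketch of the tag conversion round trip; the exact argument and return shapes below are assumptions (check jio.cws.word2tag.__doc__):

>>> import jionlp as jio
>>> words = ['今天', '天气', '很', '好']
>>> tags = jio.cws.word2tag(words)  # word list -> character-level tag sequence (assumed return shape)
>>> jio.cws.tag2word(''.join(words), tags)  # tag sequence back to a word list (assumed signature)
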
Chengyu Cui, JioNLP, (2020), GitHub repository, https://github.com/dongrixinyu/JioNLP

