THUOCL
Table of contents
- Introduction to the dictionary
- Thesaurus format and word frequency statistics corpus
- List of the dictionary
- Open Source Protocol
- Authors
Introduction to the dictionary
THUOCL (THU Open Chinese Lexicon) is a high-quality Chinese lexicon compiled and released by the Natural Language Processing and Computational Social Science Lab at Tsinghua University. Its word lists are drawn from social tags, trending search terms, input method vocabularies, and similar sources on mainstream websites. THUOCL has the following characteristics:
- Each entry includes a DF (document frequency) value, so users can filter words to suit their own needs.
- The word lists have gone through multiple rounds of manual screening to ensure that the included words are accurate.
- Openly updated: existing word lists will continue to be updated, and lists for more categories will be released. Professionals are welcome to join and help build an open lexicon. If you are interested, please write to [email protected].
This lexicon can be used for automatic Chinese word segmentation to improve segmentation quality. We recommend using it together with the THULAC toolkit developed by this group to improve Chinese word segmentation in specific domains.
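To illustrate how a domain word list can influence segmentation, here is a minimal sketch of forward maximum matching, a simple dictionary-based method. It is not THULAC's algorithm and is not part of THUOCL; the function name `fmm_segment` and the tiny word set are assumptions made for illustration only.

```python
from typing import List, Set


def fmm_segment(text: str, words: Set[str]) -> List[str]:
    """Segment text by forward maximum matching against a word set."""
    # Longest lexicon entry bounds the window size (at least 1).
    max_len = max([1] + [len(w) for w in words])
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon hit;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in words:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical two-word domain lexicon built from the IT entry examples.
domain_words = {"文件备份", "虚拟地址"}
print(fmm_segment("文件备份和虚拟地址", domain_words))
# → ['文件备份', '和', '虚拟地址']
```

Without the domain words, the same input would fall back to single characters, which is the kind of gap a domain lexicon is meant to close.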
Thesaurus format and word frequency statistics corpus
Each line of the lexicon consists of two parts, the word and its DF value (the number of documents in which the word appears), separated by a tab character.
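The format described above can be read with a few lines of Python. This is a minimal sketch, assuming UTF-8 text and one `word<TAB>DF` pair per line; the function name `parse_lexicon`, the file name in the comment, and the DF values in the sample are invented for illustration and do not come from the actual word lists.

```python
from typing import Dict, Iterable


def parse_lexicon(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'word<TAB>DF' lines into a {word: DF} mapping."""
    lexicon: Dict[str, int] = {}
    for raw in lines:
        word, sep, df = raw.rstrip("\r\n").partition("\t")
        word, df = word.strip(), df.strip()
        if not sep or not word or not df.isdecimal():
            continue  # skip blank or malformed lines
        lexicon[word] = int(df)
    return lexicon


# Sample lines with made-up DF values.
sample = ["文件备份\t1234\n", "虚拟地址\t567\n", "\n"]
lexicon = parse_lexicon(sample)

# Keep only words that appear in at least 1000 documents.
frequent = {w: df for w, df in lexicon.items() if df >= 1000}
print(frequent)

# With a downloaded list (hypothetical file name):
# with open("THUOCL_IT.txt", encoding="utf-8") as f:
#     lexicon = parse_lexicon(f)
```

Filtering on a DF threshold, as in the last step, is one way to use the DF value to tailor a list to a particular application.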
Word frequency statistics corpus:
- CSDN Blog. Period: 2014.07-2016.07. Number of documents: 3,785,976
- Sina News. Period: 2008.01-2016.11. Number of documents: 8,421,097
- Sogou Corpus. Number of documents: 729,008,561
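Because the three corpora differ greatly in size, raw DF values from different word lists are not directly comparable. The sketch below normalizes a DF value by the document count of its corpus and derives a smoothed inverse document frequency. The function names and the smoothing choice are assumptions for illustration, not something THUOCL prescribes.

```python
import math

# Document counts quoted in the corpus list above.
CORPUS_SIZES = {
    "CSDN Blog": 3_785_976,
    "Sina News": 8_421_097,
    "Sogou Corpus": 729_008_561,
}


def document_ratio(df: int, corpus: str) -> float:
    """Fraction of the corpus documents that contain the word."""
    return df / CORPUS_SIZES[corpus]


def idf(df: int, corpus: str) -> float:
    """Smoothed inverse document frequency: log(N / (1 + df))."""
    return math.log(CORPUS_SIZES[corpus] / (1 + df))


# The DF value 1000 is a made-up example, not taken from a word list.
print(document_ratio(1000, "CSDN Blog"))
print(idf(1000, "CSDN Blog"))
```

A rarer word (lower DF) gets a higher IDF, which is the usual weighting when these values feed into keyword extraction or retrieval.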
List of the dictionary
IT
- Introduction: This lexicon contains a large number of IT terms.
- Entry examples: file backup, virtual address, C++ programming, transaction scheduling, strongly connected component contraction.
- Number of entries: 16,000
- Word frequency statistics corpus: CSDN blog
- Updated: 2016-12-24
- Contributors: Ma Yunshan, Han Shiyi, Zhang Yuhui
- Download link: Click here to download
Finance
- Introduction: This lexicon contains a large number of financial terms.
- Entry examples: year, adjustment plan, comprehensive acquisition, price difference, shrinkage.
- Number of entries: 3830
- Word frequency statistics corpus: Sina News
- Updated: 2016-12-24
- Contributors: Han Shiyi, Zhang Yuhui, Ma Yunshan
- Download link: Click here to download
Idiom
- Introduction: This lexicon contains a large number of idioms.
- Entry examples: feigning profundity, reasonable and well-founded, inexhaustible, a humble person's words carry little weight, adapt to local conditions, thirsting for talent.
- Number of entries: 8519
- Word frequency statistics corpus: Sina News
- Updated: 2016-12-24
- Contributors: Han Shiyi, Zhang Yuhui, Ma Yunshan
- Download link: Click here to download
Place name
- Introduction: This lexicon contains a large number of place names.
- Entry examples: Zhejiang, Shanghai, Australia, Mount Everest, Xiangtan County, Dajia Town.
- Number of entries: 44,805
- Word frequency statistics corpus: Sogou Corpus
- Updated: 2017-06-01
- Contributors: Han Shiyi, Zhang Yuhui, Ma Yunshan
- Download link: Click here to download
Historical celebrities
- Introduction: This lexicon contains a large number of names of historical figures.
- Entry examples: Lu You, Xun Yu, Zhuge Liang, Sun Quan, Chamberlain.
- Number of entries: 13658
- Word frequency statistics corpus: Sina News
- Updated: 2016-12-24
- Contributors: Han Shiyi, Zhang Yuhui, Ma Yunshan
- Download link: Click here to download
Poetry
- Introduction: This lexicon contains a large number of famous lines of poetry.
- Entry examples: climb one more storey; still holding the pipa, half hiding her face; the road ahead is long and far; let the winds blow from east, west, south, or north.
- Number of entries: 13703
- Word frequency statistics corpus: Sina News
- Updated: 2017-01-20
- Contributors: Zhang Yuhui, Han Shiyi, Ma Yunshan
- Download link: Click here to download
Medicine
- Introduction: This lexicon contains a large number of medical terms.
- Entry examples: patient, congestion, rash, Cordyceps sinensis.
- Number of entries: 18749
- Word frequency statistics corpus: Sina News
- Updated: 2017-01-20
- Contributors: Zhang Yuhui, Han Shiyi, Ma Yunshan
- Download link: Click here to download
Diet
- Introduction: This lexicon covers most food and drink vocabulary.
- Entry examples: potatoes, hot pot, pasta, fruit, monkey head mushroom.
- Number of entries: 8974
- Word frequency statistics corpus: Sogou Corpus
- Updated: 2017-04-20
- Contributors: Wang Mengyuan, Wu Jiaoyu, Huang Weijie, Lin Yongtian
- Download link: Click here to download
Law
- Introduction: This lexicon covers most legal vocabulary.
- Entry examples: copyright, relevant departments, limited liability companies, judges of the Land Tribunal, Japanese manor system.
- Number of entries: 9896
- Word frequency statistics corpus: Sogou Corpus
- Updated: 2017-04-28
- Contributors: Wang Mengyuan, Wu Jiaoyu, Huang Weijie, Lin Yongtian
- Download link: Click here to download
Car
- Introduction: This lexicon covers most automotive vocabulary.
- Entry examples: sedan, auto show, Dongfeng Honda, front windshield, Sichuan Toyota.
- Number of entries: 1752
- Word frequency statistics corpus: Sogou Corpus
- Updated: 2017-05-15
- Contributors: Wang Mengyuan, Wu Jiaoyu, Huang Weijie, Lin Yongtian
- Download link: Click here to download
Animal
- Introduction: This lexicon covers most animal vocabulary.
- Entry examples: carrier pigeons, sika deer, street pigeons, square vines, spotted forest pigeons.
- Number of entries: 17287
- Word frequency statistics corpus: Sogou Corpus
- Updated: 2017-06-01
- Contributors: Wang Mengyuan, Wu Jiaoyu, Huang Weijie, Lin Yongtian
- Download link: Click here to download
Open Source Protocol
- THUOCL is freely available to universities, research institutes, enterprises, institutions, and individuals in China and abroad, and may be used for both research and commercial purposes.
- Comments and suggestions on this lexicon are welcome. Please send them by email to [email protected].
- If you publish a paper or obtain research results based on THUOCL, please state that you used "Tsinghua Open Chinese Lexicon (THUOCL)" when publishing the paper or reporting the results, and cite it in the following format:
Chinese: 韩世依, 张钰晖, 马云山, 涂存超, 郭志芃, 刘知远, 孙茂松. THUOCL:清华大学开放中文词库. 2016.
English: Shiyi Han, Yuhui Zhang, Yunshan Ma, Cunchao Tu, Zhipeng Guo, Zhiyuan Liu, Maosong Sun. THUOCL: Tsinghua Open Chinese Lexicon. 2016.
Authors
Contributors: Shiyi Han (Han Shiyi, undergraduate student at Beijing University of Aeronautics and Astronautics), Yuhui Zhang (Zhang Yuhui, undergraduate student at Tsinghua University), Yunshan Ma (Ma Yunshan), Cunchao Tu (Tu Cunchao, doctoral student at Tsinghua University), Zhipeng Guo (Guo Zhipeng, undergraduate student at Tsinghua University).
Advisors: Zhiyuan Liu (Liu Zhiyuan, assistant professor at Tsinghua University), Maosong Sun (Sun Maosong, professor at Tsinghua University).