Release announcement: a RIME Chinese grammar model and dictionaries built from a 32 GB corpus
Wanxiang Grammar Model, Wanxiang Atomic Dictionary
Project Introduction
- Using a large and diverse Chinese corpus, we built a Chinese grammar model with wide coverage and a compact, efficient dictionary. The grammar model and dictionaries in this release draw on community Q&A, blog interactions, official-account articles, encyclopedia entries, news reports, song lyrics, poetry, idioms, tongue twisters, hotel and takeaway reviews, legal documents, regional descriptions, and literary works. The corpus totals 32 GB, is balanced across these sources, and has been carefully cleaned. The project aims to provide a solid foundation for RIME: accurate pronunciation annotation, accurate word-frequency statistics, a well-chosen word segmentation dictionary, and an input model with a high hit rate, within the limits of existing conditions.
- The single-character pinyin dictionary maintained in the project covers the CJK basic block through Extension G, plus the Kangxi Radicals block. Its readings are based on the Han dictionary, with additional pronunciations maintained by hand, so it may be more comprehensive than other single-character dictionaries.
- All RIME dictionaries in the project were screened with AI assistance and proofread by hand to keep only high-quality phrases. Every entry carries full pinyin with tone marks, and every frequency is keyed on two fields: the text and its pinyin. For example, the single characters in a phrase such as "Where is there" are counted under the reading each one has in that context, rather than all being merged under the pinyin na. A single-character frequency is therefore the count of a character together with the pinyin it carries in phrases and sentences, so each reading of a polyphonic character has its own frequency. Because the corpus is so large, many single-character counts reach the order of 1 billion. The frequencies have been log-normalized, which shortens the numbers, makes them easier to maintain, and reduces file size. To migrate the dictionaries to your own schema, see: Migrate the vocabulary.
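The per-reading counting and log normalization described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the function names, the input format (a phrase paired with one pinyin per character), and the `log10(1 + count)` formula are all assumptions made for the example.

```python
import math
from collections import Counter

def count_char_readings(annotated_phrases):
    """Count (character, pinyin) pairs from phrases annotated with one pinyin per character.

    The input format is a hypothetical one chosen for this sketch.
    """
    counts = Counter()
    for phrase, readings in annotated_phrases:
        if len(phrase) != len(readings):
            raise ValueError(f"pinyin count does not match characters in {phrase!r}")
        for char, pinyin in zip(phrase, readings):
            counts[(char, pinyin)] += 1
    return counts

def log_normalize(counts):
    """Compress raw counts with log10(1 + n); the exact formula is an assumption."""
    return {key: round(math.log10(1 + n), 4) for key, n in counts.items()}

corpus = [
    ("银行", ["yín", "háng"]),   # 行 read as háng
    ("行走", ["xíng", "zǒu"]),   # 行 read as xíng
    ("行人", ["xíng", "rén"]),
]
counts = count_char_readings(corpus)
# The two readings of 行 are kept apart instead of being merged.
print(counts[("行", "háng")], counts[("行", "xíng")])  # prints: 1 2
print(log_normalize(counts)[("行", "xíng")])
```

With this kind of scaling, a raw count near 1 billion becomes a value near 9, which is what keeps the stored numbers short.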
Model download | Model configuration instructions | Usage details and build tutorial
- Model file naming: v is the version number, n is the model level, and m is the file size in units of 100 MB.
| File size | Level 2 model | Level 3 model |
|---|---|---|
| 100 MB | v1n2m1 | v1n3m1 |
| 200 MB | v1n2m2 | v1n3m2 |
| 300 MB | v1n2m3 | v1n3m3 |
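As an illustration of the naming scheme, the sketch below splits a model name into its three fields. The helper `parse_model_name` and the `ModelName` type are hypothetical names introduced for this example; only the v/n/m convention comes from the description above.

```python
import re
from typing import NamedTuple

class ModelName(NamedTuple):
    version: int   # the number after "v"
    level: int     # the number after "n" (model level)
    size_mb: int   # the number after "m", times 100 MB

# Matches the v<version>n<level>m<size> part anywhere in a file name.
_PATTERN = re.compile(r"v(\d+)n(\d+)m(\d+)")

def parse_model_name(name: str) -> ModelName:
    """Extract version, level, and approximate size from a model name."""
    match = _PATTERN.search(name)
    if match is None:
        raise ValueError(f"not a recognised model name: {name!r}")
    version, level, size_units = (int(group) for group in match.groups())
    return ModelName(version, level, size_units * 100)

print(parse_model_name("v1n3m2"))
# prints: ModelName(version=1, level=3, size_mb=200)
```

Because the pattern is searched rather than anchored, the same helper also accepts a full name such as `amz-v2n3m1-zh-hans`.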
Sample projects:
Wanxiang Pinyin Enhanced Edition: multi-dimensional direct auxiliary codes combined with any pinyin scheme | Wanxiang Pinyin Basic Edition: full pinyin and double pinyin with indirect auxiliary codes
- Dictionary files and what they contain:

| Dictionary type | File name | Description |
|---|---|---|
| Large character table | large.dict | All readings for characters in the CJK basic block; 43,324 characters, not counting multiple readings |
| Basic dictionary | base.dict | Phrases of 2 to 3 characters |
| Extended dictionary | ext.dict | Commonly used phrases |
| Full character table | full.dict | All CJK characters, for complete Chinese character coverage |
To use the model, put the following snippet in your schema file, download a model into the Rime user directory, change `language: amz-v2n3m1-zh-hans` to the name of the file you downloaded (without the suffix), and redeploy.
    __include: octagram   # enable the grammar model
    # grammar model
    octagram:
      __patch:
        grammar:
          language: amz-v2n3m1-zh-hans
          collocation_max_length: 5
          collocation_min_length: 2
        translator/contextual_suggestions: true
        translator/max_homophones: 7
        translator/max_homographs: 7
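The "file name without the suffix" rule can be sketched in a few lines. The helper names below are hypothetical, and the `.gram` suffix in the usage line is an assumption about how the downloaded model file is named.

```python
from pathlib import Path

def grammar_language(model_file: str) -> str:
    """Return the value for `language`: the model file name without its suffix."""
    return Path(model_file).stem

def render_language_line(model_file: str) -> str:
    """Build the `language:` line to paste under `grammar:` in the schema."""
    return f"language: {grammar_language(model_file)}"

# Assumed file name; substitute the model you actually downloaded.
print(render_language_line("amz-v2n3m1-zh-hans.gram"))
# prints: language: amz-v2n3m1-zh-hans
```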