JPreprocess parses Japanese sentences and generates full-context labels.
It is a Rust rewrite of the preprocessing part of OpenJTalk, that is, everything other than HTS Engine.
The jpreprocess crate is the main interface. It wraps Lindera, jpreprocess-njd, jpreprocess-jpcommon, and other crates. The words in the analysis result are stored in the data structures defined by jpreprocess-core.
example:
use jpreprocess::*;

// path points to the dictionary to load.
let config = JPreprocessConfig {
    dictionary: SystemDictionaryConfig::File(path),
    user_dictionary: None,
};
let jpreprocess = JPreprocess::from_config(config)?;

let jpcommon_label = jpreprocess
    .extract_fullcontext("日本語文を解析し、音声合成エンジンに渡せる形式に変換します.")?;
assert_eq!(
    jpcommon_label[2].to_string(),
    concat!(
        "sil^n-i+h=o",
        "/A:-3+1+7",
        "/B:xx-xx_xx",
        "/C:02_xx+xx",
        "/D:02+xx_xx",
        "/E:xx_xx!xx_xx-xx",
        "/F:7_4#0_xx@1_3|1_12",
        "/G:4_4%0_xx_1",
        "/H:xx_xx",
        "/I:3-12@1+2&1-8|1+41",
        "/J:5_29",
        "/K:2+8-41"
    )
);

It includes data structures for pronunciation, words, parts of speech, and JPCommon, along with related functions and the structures that represent errors. pos is short for "part of speech".
It loads the word dictionary generated by jpreprocess-dictionary-builder into memory so that words can be looked up.
The dictionary format is detected automatically at load time.
The source dictionary uses the same CSV format as MeCab, but it must be compiled into a dedicated dictionary in advance so that Lindera can analyze text at high speed.
The builder is based on Lindera's lindera-ipadic-builder. In addition, jpreprocess-dictionary-builder can parse the strings ahead of time and generate a dictionary (a JPreprocess dictionary) that JPreprocess can process directly.
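As a rough illustration of the MeCab-style CSV layout mentioned above, the sketch below splits one row into its four leading columns (surface form, left context ID, right context ID, cost) and the remaining feature columns. The names DictRow and parse_row are hypothetical and are not part of jpreprocess-dictionary-builder. The naive comma split also assumes that no field is quoted.

```rust
/// One row of a MeCab-style dictionary CSV (hypothetical helper type).
#[derive(Debug, PartialEq)]
struct DictRow<'a> {
    surface: &'a str,
    left_id: u32,
    right_id: u32,
    cost: i32,
    /// Dictionary-specific columns such as part of speech and reading.
    features: Vec<&'a str>,
}

/// Parses a single CSV line; returns None when a leading column is
/// missing or is not a number. Quoted fields are NOT handled.
fn parse_row(line: &str) -> Option<DictRow<'_>> {
    let mut cols = line.trim_end().split(',');
    let surface = cols.next()?;
    let left_id = cols.next()?.parse().ok()?;
    let right_id = cols.next()?.parse().ok()?;
    let cost = cols.next()?.parse().ok()?;
    Some(DictRow {
        surface,
        left_id,
        right_id,
        cost,
        features: cols.collect(),
    })
}
```

For a made-up row such as "東京,1,2,3000,名詞,固有名詞", parse_row yields the surface 東京, the IDs 1 and 2, the cost 3000, and two feature columns. The actual builder does considerably more than this, since it compiles the rows into a binary dictionary.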
It generates a dictionary for JPreprocess from the dictionary shipped with OpenJTalk. It is used by the naist-jdic feature of the jpreprocess crate.
Note that if you enable the naist-jdic feature and include this crate, the build will take several minutes.
It defines the NJDNode and NJD structures from OpenJTalk and performs the NJD conversion steps.
Specifically, it converts the reading of numbers (for example, "10,120" to "Ichiman Hyakuniju") and estimates the accent position.
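To make the number-reading step more concrete, here is a minimal sketch that turns an integer into a romanized Japanese reading. It is not the algorithm used by jpreprocess-njd, which works on NJD nodes rather than on plain integers. The function names are hypothetical, long vowels are spelled out (juu, kyuu), and the range is limited to values below 10,000,000 as a simplifying assumption.

```rust
const DIGITS: [&str; 10] = [
    "", "ichi", "ni", "san", "yon", "go", "roku", "nana", "hachi", "kyuu",
];

/// Appends the reading of `digit` at a decimal place named `unit`,
/// using `irregular` for digits whose reading changes (e.g. 300 = sanbyaku).
fn push_place(out: &mut String, digit: u32, unit: &str, irregular: &[(u32, &str)]) {
    if digit == 0 {
        return;
    }
    if let Some((_, reading)) = irregular.iter().find(|(d, _)| *d == digit) {
        out.push_str(reading);
    } else {
        out.push_str(DIGITS[digit as usize]);
        out.push_str(unit);
    }
}

/// Reads a value in 0..10_000; zero yields an empty string.
fn read_below_10000(n: u32) -> String {
    let mut out = String::new();
    push_place(&mut out, n / 1000, "sen", &[(1, "sen"), (3, "sanzen"), (8, "hassen")]);
    push_place(
        &mut out,
        n / 100 % 10,
        "hyaku",
        &[(1, "hyaku"), (3, "sanbyaku"), (6, "roppyaku"), (8, "happyaku")],
    );
    push_place(&mut out, n / 10 % 10, "juu", &[(1, "juu")]);
    out.push_str(DIGITS[(n % 10) as usize]);
    out
}

/// Reads `n` by splitting it into a "man" (10,000) group and the rest.
/// Returns None for values this sketch does not cover.
fn read_number(n: u32) -> Option<String> {
    if n >= 10_000_000 {
        return None;
    }
    if n == 0 {
        return Some("zero".to_string());
    }
    let mut parts = Vec::new();
    if n / 10_000 > 0 {
        parts.push(format!("{}man", read_below_10000(n / 10_000)));
    }
    if n % 10_000 > 0 {
        parts.push(read_below_10000(n % 10_000));
    }
    Some(parts.join(" "))
}
```

With this sketch, read_number(10_120) gives "ichiman hyakunijuu", which corresponds to the "Ichiman Hyakuniju" reading mentioned above.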
It defines the JPCommonLabel structure from OpenJTalk, and converts NJD to JPCommon and then JPCommon to full-context labels.
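A full-context label is a single string in the HTS label style: a phoneme context of the form p1^p2-p3+p4=p5, followed by fields introduced by /A: through /K:. The sketch below shows one way to pull pieces out of such a string with the standard library only. The helpers current_phoneme and field are hypothetical and are not part of jpreprocess-jpcommon, which exposes structured types instead.

```rust
/// Returns the current phoneme (p3) of a label shaped like
/// `p1^p2-p3+p4=p5/A:...`.
fn current_phoneme(label: &str) -> Option<&str> {
    // Later fields also contain '-' and '+', so keep only the part
    // before "/A:" when looking for the delimiters.
    let (phonemes, _) = label.split_once("/A:")?;
    let (_, after_dash) = phonemes.split_once('-')?;
    let (current, _) = after_dash.split_once('+')?;
    Some(current)
}

/// Returns the raw text of the field introduced by `/<name>:`,
/// for example field(label, 'A') for the "/A:" field.
fn field(label: &str, name: char) -> Option<&str> {
    let marker = format!("/{}:", name);
    let start = label.find(&marker)? + marker.len();
    let rest = &label[start..];
    // A field runs until the next '/' or the end of the label.
    Some(rest.split('/').next().unwrap_or(rest))
}
```

Applied to the label from the example near the top of this document, current_phoneme returns "i" and field(label, 'K') returns "2+8-41".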
It implements a mutable window used in the jpreprocess-njd conversion process.
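The idea of a mutable window can be sketched with plain slices: slide over a sequence and hand out mutable references to several neighbouring elements at once, so a rule can inspect and modify a node together with its neighbours. The function for_each_triple_mut below is a hypothetical illustration built on slice patterns. It is not the API of the actual crate.

```rust
/// Calls `f` on every run of three consecutive elements, giving
/// mutable access to all three at the same time.
fn for_each_triple_mut<T>(items: &mut [T], mut f: impl FnMut(&mut T, &mut T, &mut T)) {
    // With fewer than three elements the range is empty.
    for start in 0..items.len().saturating_sub(2) {
        // The slice pattern borrows three distinct elements mutably,
        // which indexing with items[i], items[i + 1] could not do.
        if let [a, b, c, ..] = &mut items[start..] {
            f(a, b, c);
        }
    }
}
```

Because each window is created after the previous one has been dropped, a change made in one step is visible to the next step. For example, running the closure |a, b, _c| *b += *a over [1, 2, 3, 4] produces [1, 3, 6, 4].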
This software includes source code from:
Although this repository has a CODEOWNERS file, this does not necessarily mean that the developers listed in it hold the copyright for all files in this repository. Copyrights are listed in the NOTICE or LICENSE files, and the CODEOWNERS file is used only for code review.
BSD-3-Clause