jpreprocess

v0.10.0

jpreprocess parses Japanese sentences and generates full context labels.

It is a Rust rewrite of the preprocessing part of OpenJTalk, that is, everything other than HTS Engine.

Goals and policies

- Rather than porting the OpenJTalk structure as-is, the code is designed to be as easy to read and write as possible.
- It has its own dictionary format, which reduces the size of dictionary files, and it also supports the traditional dictionary in which all information is stored as strings.
  - Neither is compatible with the MeCab dictionary itself, but you can generate a dictionary from the same CSV files you would use to build a MeCab dictionary.
- Apart from some OpenJTalk behaviors that appear to be bugs, you get exactly the same output (full context labels) as OpenJTalk.
  - For example, the readings of "special auxiliary verbs" and of ambiguous numbers separated into groups of 2, 2, and 3 digits differ from OpenJTalk.
  - New features are not ruled out, but we intend to keep a way to obtain the same output as OpenJTalk through options, versions, features, and so on.
- This repository does not handle HTS Engine.
  - It supports the creation of full context labels; anything beyond that is outside the scope of this repository.
  - A project to rewrite HTS Engine in Rust can be found at jpreprocess/jbonsai.

Crates

jpreprocess

The main interface. It wraps Lindera, jpreprocess-njd, jpreprocess-jpcommon, and others. The words in the analysis result are stored in the data structures defined by jpreprocess-core.

Example:

use jpreprocess::*;

// `path` points to the dictionary to load (it is not defined in this snippet).
let config = JPreprocessConfig {
    dictionary: SystemDictionaryConfig::File(path),
    user_dictionary: None,
};
let jpreprocess = JPreprocess::from_config(config)?;

// Analyze a sentence and extract its full context labels.
let jpcommon_label = jpreprocess
    .extract_fullcontext("日本語文を解析し、音声合成エンジンに渡せる形式に変換します．")?;
// Check the label at index 2 against the expected string.
assert_eq!(
    jpcommon_label[2].to_string(),
    concat!(
        "sil^n-i+h=o",
        "/A:-3+1+7",
        "/B:xx-xx_xx",
        "/C:02_xx+xx",
        "/D:02+xx_xx",
        "/E:xx_xx!xx_xx-xx",
        "/F:7_4#0_xx@1_3|1_12",
        "/G:4_4%0_xx_1",
        "/H:xx_xx",
        "/I:3-12@1+2&1-8|1+41",
        "/J:5_29",
        "/K:2+8-41"
    )
);
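
To make the shape of the label string in the example easier to see, here is a minimal sketch that splits such a string into its leading phoneme context (`p1^p2-p3+p4=p5`) and its `/A:` to `/K:` fields. It works purely on the string form; `PhonemeContext` and `split_label` are hypothetical names for illustration and are not part of the jpreprocess API.

```rust
/// The five phonemes at the head of a full context label (`p1^p2-p3+p4=p5`).
/// Hypothetical helper type; not defined by jpreprocess.
#[derive(Debug, PartialEq)]
pub struct PhonemeContext<'a> {
    pub prev2: &'a str,
    pub prev1: &'a str,
    pub current: &'a str,
    pub next1: &'a str,
    pub next2: &'a str,
}

/// Splits a label string into its phoneme context and its lettered fields.
/// Returns `None` if the string does not have the expected shape.
pub fn split_label(label: &str) -> Option<(PhonemeContext<'_>, Vec<(char, &str)>)> {
    let mut parts = label.split('/');

    // The part before the first '/' holds the five phonemes.
    let head = parts.next()?;
    let (prev2, rest) = head.split_once('^')?;
    let (prev1, rest) = rest.split_once('-')?;
    let (current, rest) = rest.split_once('+')?;
    let (next1, next2) = rest.split_once('=')?;

    // Every remaining part looks like "A:-3+1+7": one letter, ':', then the value.
    let mut fields = Vec::new();
    for part in parts {
        let (key, value) = part.split_once(':')?;
        let mut chars = key.chars();
        let letter = chars.next()?;
        if chars.next().is_some() {
            return None;
        }
        fields.push((letter, value));
    }

    let context = PhonemeContext { prev2, prev1, current, next1, next2 };
    Some((context, fields))
}
```

Applied to the expected string from the example, the current phoneme is `i` and the fields run from `A` to `K`.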

jpreprocess-core

It contains data structures for pronunciation, words, parts of speech, and JPCommon, along with related functions and the structures that represent errors. `pos` is an abbreviation of "part of speech".

jpreprocess-dictionary

Loads the word dictionary generated by jpreprocess-dictionary-builder into memory so that words can be looked up.

The dictionary format is detected automatically when the dictionary is loaded.

jpreprocess-dictionary-builder

The source dictionary is in the same CSV format as MeCab's, but a dedicated dictionary has to be generated in advance so that Lindera can analyze text at high speed.

The builder is based on Lindera's lindera-ipadic-builder. In addition, jpreprocess-dictionary-builder can parse the strings in advance and generate a dictionary (the JPreprocess dictionary) that JPreprocess can process directly.

jpreprocess-naist-jdic

Generates a dictionary for JPreprocess from the dictionary shipped with OpenJTalk. It is used by the naist-jdic feature of the jpreprocess crate.

Note that enabling the naist-jdic feature, which pulls in this crate, makes the build take several minutes.

jpreprocess-njd

It defines the NJDNode and NJD structures from OpenJTalk and performs the NJD conversion steps.

Specifically, it converts the readings of numbers (for example, "10,120" becomes "Ichiman Hyakuniju") and estimates accent positions.
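
As a rough illustration of what converting the reading of a number involves, the following sketch turns an integer into a romanized Japanese reading. It is a simplified stand-in written for this README, not the rule set used by jpreprocess-njd or OpenJTalk: the function names are hypothetical, only values below 10,000,000 are handled, and the output is unspaced lowercase romaji.

```rust
// Readings of the digits 1 to 9; index 0 is empty because a zero digit is silent.
const DIGITS: [&str; 10] = [
    "", "ichi", "ni", "san", "yon", "go", "roku", "nana", "hachi", "kyuu",
];

/// Reads a value in 0..=9999, returning an empty string for 0.
fn read_below_man(n: u32) -> String {
    let mut out = String::new();
    // Thousands: 1000 is plain "sen"; 3000 and 8000 have irregular forms.
    match n / 1000 {
        0 => {}
        1 => out.push_str("sen"),
        3 => out.push_str("sanzen"),
        8 => out.push_str("hassen"),
        d => {
            out.push_str(DIGITS[d as usize]);
            out.push_str("sen");
        }
    }
    // Hundreds: 300, 600 and 800 have irregular forms.
    match n / 100 % 10 {
        0 => {}
        1 => out.push_str("hyaku"),
        3 => out.push_str("sanbyaku"),
        6 => out.push_str("roppyaku"),
        8 => out.push_str("happyaku"),
        d => {
            out.push_str(DIGITS[d as usize]);
            out.push_str("hyaku");
        }
    }
    // Tens: 10 is plain "juu".
    match n / 10 % 10 {
        0 => {}
        1 => out.push_str("juu"),
        d => {
            out.push_str(DIGITS[d as usize]);
            out.push_str("juu");
        }
    }
    out.push_str(DIGITS[(n % 10) as usize]);
    out
}

/// Reads `n` in romaji. Values of 10,000,000 and above are out of scope here.
pub fn read_number(n: u32) -> Option<String> {
    if n == 0 {
        return Some("zero".to_string());
    }
    if n >= 10_000_000 {
        return None;
    }
    let mut out = String::new();
    // Japanese groups digits by 10,000 ("man"), not by 1,000.
    let man = n / 10_000;
    if man > 0 {
        out.push_str(&read_below_man(man));
        out.push_str("man");
    }
    out.push_str(&read_below_man(n % 10_000));
    Some(out)
}

/// Accepts comma-separated input such as "10,120".
pub fn read_number_str(s: &str) -> Option<String> {
    s.replace(',', "").parse::<u32>().ok().and_then(read_number)
}
```

With this sketch, `read_number_str("10,120")` yields `ichimanhyakunijuu`, which corresponds to the reading quoted above.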

jpreprocess-jpcommon

It defines the JPCommonLabel structure from OpenJTalk and performs the conversions from NJD to JPCommon and from JPCommon to full context labels.

jpreprocess-window

Implements a mutable window used in the jpreprocess-njd conversion process.
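
The general idea of a mutable window can be sketched with the standard library alone: visit each element of a slice while also holding mutable references to its neighbours. The function below is a hypothetical illustration of that idea and is not the jpreprocess-window API; it uses `split_at_mut` and `split_first_mut` so that the borrow checker accepts three simultaneous mutable references.

```rust
/// Calls `f` once for every position in `items`, passing mutable access to the
/// previous element (if any), the current element, and the next element (if any).
/// Hypothetical helper for illustration only.
pub fn for_each_window<T>(
    items: &mut [T],
    mut f: impl FnMut(Option<&mut T>, &mut T, Option<&mut T>),
) {
    for i in 0..items.len() {
        // Split the slice into non-overlapping parts so each can be borrowed mutably.
        let (before, rest) = items.split_at_mut(i);
        let (current, after) = rest
            .split_first_mut()
            .expect("i is always a valid index inside the loop");
        f(before.last_mut(), current, after.first_mut());
    }
}
```

Because the window is rebuilt at every step, a change made to the current element is visible as the previous element on the next step, which is what a pass that rewrites neighbouring nodes in place relies on.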

Copyrights

This software includes source code from:

OpenJTalk. Copyright (c) 2008-2016 Nagoya Institute of Technology Department of Computer Science
Yada: Yet Another Double-Array.

Although this repository has a CODEOWNERS file, the developers listed there do not necessarily hold the copyright for all files in this repository. Copyrights are listed in the NOTICE or LICENSE files; the CODEOWNERS file exists only for code review.