v1.1.6: routine maintenance

hanlp-lucene-plugin

HanLP Chinese word segmentation plugin for Lucene

Built on HanLP; supports Lucene (7.x) based systems, including Solr (7.x).

Maven

    <dependency>
      <groupId>com.hankcs.nlp</groupId>
      <artifactId>hanlp-lucene-plugin</artifactId>
      <version>1.1.7</version>
    </dependency>

Solr quick start

Place hanlp-portable.jar and hanlp-lucene-plugin.jar in ${webapp}/WEB-INF/lib. (Alternatively, package the source with mvn package and copy target/hanlp-lucene-plugin-x.x.x.jar to ${webapp}/WEB-INF/lib.)
Edit the Solr core's configuration file ${core}/conf/schema.xml:

    <fieldType name="text_cn" class="solr.TextField">
        <analyzer type="index">
            <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
        </analyzer>
        <analyzer type="query">
            <!-- Never enable index mode in the query analyzer -->
            <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false"/>
        </analyzer>
    </fieldType>
    <!-- Every field in your application that needs segmentation must be given type text_cn -->
    <field name="my_field1" type="text_cn" indexed="true" stored="true"/>
    <field name="my_field2" type="text_cn" indexed="true" stored="true"/>

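If your application has other fields, such as location or summary, each of them must also be given type="text_cn". Otherwise those fields will still use Solr's default tokenizer.
Also, never enable index mode in the query analyzer; doing so affects PhraseQuery. Index mode only needs to be enabled once, at index time.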
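
One plausible way to picture the PhraseQuery problem: a phrase query needs its terms to appear in the index at consecutive positions, in order. The toy sketch below models only that rule, using plain lists. The token lists are hand-written assumptions that imitate the shape of index-mode output (a long word followed by its sub-words); they are not real HanLP output, and `PhraseMatchSketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of exact phrase matching; it does not use Lucene or HanLP.
class PhraseMatchSketch {
    // A phrase matches when its terms occur in the indexed token list
    // at consecutive positions and in the same order.
    static boolean phraseMatches(List<String> indexedTokens, List<String> queryTokens) {
        return Collections.indexOfSubList(indexedTokens, queryTokens) >= 0;
    }

    public static void main(String[] unused) {
        // ASSUMED index-time tokens for "中华人民共和国很辽阔" with index mode on.
        List<String> indexed = Arrays.asList(
                "中华人民共和国", "中华", "华人", "人民", "人民共和国", "共和", "共和国", "很", "辽阔");

        // ASSUMED query tokens for "人民共和国" with index mode off: a single term.
        List<String> queryIndexModeOff = Arrays.asList("人民共和国");

        // ASSUMED query tokens for the same text with index mode on: the word plus sub-words.
        List<String> queryIndexModeOn = Arrays.asList("人民共和国", "人民", "共和", "共和国");

        System.out.println(phraseMatches(indexed, queryIndexModeOff));
        System.out.println(phraseMatches(indexed, queryIndexModeOn));
    }
}
```

With these assumed lists the first check prints true and the second prints false: the extra sub-words produced on the query side do not line up with the positions stored in the index.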

Advanced configuration

Currently, this plugin supports the following configuration options in schema.xml:

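Option	Description	Default
algorithm	Segmentation algorithm	viterbi
enableIndexMode	Use index mode (never turn this on in the query analyzer)	true
enableCustomDictionary	Whether to enable the user dictionary	true
customDictionaryPath	User dictionary path (an absolute path, or a relative path the program can read; separate multiple dictionaries with spaces)	null
enableCustomDictionaryForcing	Give the user dictionary high priority	false
stopWordDictionaryPath	Stop-word dictionary path	null
enableNumberQuantifierRecognize	Whether to enable recognition of numerals and quantifiers	true
enableNameRecognize	Enable person name recognition	true
enableTranslatedNameRecognize	Whether to enable recognition of transliterated person names	false
enableJapaneseNameRecognize	Whether to enable recognition of Japanese person names	false
enableOrganizationRecognize	Enable organization name recognition	false
enablePlaceRecognize	Enable place name recognition	false
enableNormalization	Whether to normalize characters (Traditional -> Simplified, full-width -> half-width, upper case -> lower case)	false
enableTraditionalChineseMode	Enable precise Traditional Chinese segmentation	false
enableDebug	Enable debug mode	false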
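
To make the defaults in the table concrete, here is a minimal sketch of how tokenizer attributes from schema.xml could be resolved against them. `HanLPOptionsSketch` and its methods are hypothetical names used for illustration and are not the plugin's actual code; only the option names and default values are taken from the table, and the dictionary paths in `main` are made up.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: resolves schema.xml attributes against the documented defaults.
// This is NOT the plugin's implementation, only an illustration of the table.
class HanLPOptionsSketch {
    private final Map<String, String> args;

    HanLPOptionsSketch(Map<String, String> args) {
        this.args = new HashMap<>(args);
    }

    // Returns the trimmed attribute value, or the given default when the attribute is absent.
    String get(String name, String defaultValue) {
        String value = args.get(name);
        return value == null ? defaultValue : value.trim();
    }

    // Boolean attributes are written as "true" / "false" in schema.xml.
    boolean getBoolean(String name, boolean defaultValue) {
        return Boolean.parseBoolean(get(name, String.valueOf(defaultValue)));
    }

    // customDictionaryPath may hold several paths separated by whitespace; none by default.
    List<String> customDictionaryPaths() {
        String raw = get("customDictionaryPath", "");
        if (raw.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(raw.split("\\s+"));
    }

    public static void main(String[] unused) {
        Map<String, String> attrs = new HashMap<>();
        attrs.put("enableIndexMode", "false");
        attrs.put("customDictionaryPath", "dict/a.txt dict/b.txt"); // made-up paths
        HanLPOptionsSketch options = new HanLPOptionsSketch(attrs);

        System.out.println(options.get("algorithm", "viterbi"));            // not set: default applies
        System.out.println(options.getBoolean("enableIndexMode", true));    // explicitly set to false
        System.out.println(options.getBoolean("enablePlaceRecognize", false));
        System.out.println(options.customDictionaryPaths());
    }
}
```

Any attribute left out of the tokenizer element falls back to its documented default, so a query analyzer only needs to set enableIndexMode="false" explicitly.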

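More advanced settings are configured mainly through hanlp.properties on the classpath. For further related configuration, see the HanLP natural language processing package documentation, for example:

User dictionary
Part-of-speech tagging
Simplified/Traditional Chinese conversion
...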
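
Because these settings depend on hanlp.properties being visible on the classpath, a quick check can help when a setting seems to be ignored. The sketch below uses only the Java standard library; `ClasspathPropertiesSketch` is a hypothetical helper and not part of HanLP or this plugin. It only reports whether the file is found by the current class loader, which under Solr may differ from the class loader the plugin runs in.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical troubleshooting helper; not part of HanLP or the plugin.
class ClasspathPropertiesSketch {
    // Loads the named resource from the classpath as UTF-8 properties.
    // Returns null when the resource cannot be found.
    static Properties load(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClasspathPropertiesSketch.class.getClassLoader();
        }
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in == null) {
                return null;
            }
            Properties properties = new Properties();
            properties.load(new InputStreamReader(in, StandardCharsets.UTF_8));
            return properties;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] unused) {
        Properties properties = load("hanlp.properties");
        if (properties == null) {
            System.out.println("hanlp.properties was not found on the classpath");
        } else {
            // Print only the keys that are actually present in the file.
            System.out.println("hanlp.properties keys: " + properties.stringPropertyNames());
        }
    }
}
```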

Stop words and synonyms

We recommend using the filter implementations that ship with Lucene or Solr; this plugin does not interfere with them. An example configuration:

    <!-- text_cn field type: uses the HanLP tokenizer with index mode enabled. Stop words are filtered
         by Solr's built-in stop-word filter using "stopwords.txt" (empty by default).
         Solr's built-in synonym dictionary is also supported at query time. -->
    <fieldType name="text_cn" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <!-- Uncomment to enable the synonym dictionary at index time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <!-- Every field in your application that needs segmentation must be given type text_cn -->
    <field name="my_field1" type="text_cn" indexed="true" stored="true"/>
    <field name="my_field2" type="text_cn" indexed="true" stored="true"/>
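
The query analyzer in this configuration applies its filters in a fixed order: stop words are dropped first, synonyms are expanded next, and lowercasing runs last. The toy sketch below mimics that order on plain strings with the standard library only. It does not use the Solr filter classes; `FilterChainSketch`, the stop-word set, and the synonym map are made up for illustration, and unlike the real synonym filter it does not track token positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy stand-in for the stop-word, synonym and lowercase filters; not Solr code.
class FilterChainSketch {
    // Step 1: drop stop words, comparing case-insensitively (like ignoreCase="true").
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Step 2: replace a token by its whole synonym group when one exists (like expand="true").
    static List<String> expandSynonyms(List<String> tokens, Map<String, List<String>> synonyms) {
        List<String> expanded = new ArrayList<>();
        for (String token : tokens) {
            List<String> group = synonyms.get(token.toLowerCase(Locale.ROOT));
            if (group == null) {
                expanded.add(token);
            } else {
                expanded.addAll(group);
            }
        }
        return expanded;
    }

    // Step 3: lowercase every remaining token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> lowered = new ArrayList<>();
        for (String token : tokens) {
            lowered.add(token.toLowerCase(Locale.ROOT));
        }
        return lowered;
    }

    // Runs the three steps in the same order as the query analyzer.
    static List<String> analyzeQuery(List<String> tokens, Set<String> stopWords,
                                     Map<String, List<String>> synonyms) {
        return lowerCase(expandSynonyms(removeStopWords(tokens, stopWords), synonyms));
    }

    public static void main(String[] unused) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("的"));           // made-up stop list
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("分词", Arrays.asList("分词", "切词"));                   // made-up synonym group
        List<String> tokens = Arrays.asList("的", "HanLP", "分词");            // pretend tokenizer output
        System.out.println(analyzeQuery(tokens, stopWords, synonyms));
    }
}
```

Running `main` prints `[hanlp, 分词, 切词]`: the stop word is removed before synonym lookup, so it can never trigger an expansion.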

Calling from code

When rewriting queries, you can use the part of speech and other attributes from the HanLPAnalyzer segmentation results, for example:

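    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
    import com.hankcs.lucene.HanLPAnalyzer;

    String text = "中华人民共和国很辽阔";
    // Print each character with its index, to make the offsets below easier to read
    for (int i = 0; i < text.length(); ++i)
    {
        System.out.print(text.charAt(i) + "" + i + " ");
    }
    System.out.println();
    Analyzer analyzer = new HanLPAnalyzer();
    TokenStream tokenStream = analyzer.tokenStream("field", text);
    tokenStream.reset();
    while (tokenStream.incrementToken())
    {
        CharTermAttribute attribute = tokenStream.getAttribute(CharTermAttribute.class);
        // Start and end offsets
        OffsetAttribute offsetAtt = tokenStream.getAttribute(OffsetAttribute.class);
        // Position increment
        PositionIncrementAttribute positionAttr = tokenStream.getAttribute(PositionIncrementAttribute.class);
        // Part of speech
        TypeAttribute typeAttr = tokenStream.getAttribute(TypeAttribute.class);
        System.out.printf("[%d:%d %d] %s/%s\n", offsetAtt.startOffset(), offsetAtt.endOffset(), positionAttr.getPositionIncrement(), attribute, typeAttr.type());
    }
    // Finish and release the stream, as the TokenStream contract requires
    tokenStream.end();
    tokenStream.close();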

In other scenarios, HanLPTokenizer can be given a custom segmenter (for example one with named entity recognition enabled, Traditional Chinese segmentation, CRF segmentation, and so on), for example:

    // Second argument null: no stop-word set; third argument false: no Porter stemming
    Tokenizer tokenizer = new HanLPTokenizer(HanLP.newSegment()
                                                  .enableJapaneseNameRecognize(true)
                                                  .enableIndexMode(true), null, false);
    tokenizer.setReader(new StringReader("林志玲亮相网友:确定不是波多野结衣？"));