# hanlp-lucene-plugin
v1.1.6: routine maintenance release.

Based on HanLP, this plugin supports any system built on Lucene (7.x), including Solr (7.x).
Maven:

```xml
<dependency>
    <groupId>com.hankcs.nlp</groupId>
    <artifactId>hanlp-lucene-plugin</artifactId>
    <version>1.1.7</version>
</dependency>
```

Alternatively, place the jar under `${webapp}/WEB-INF/lib`. (Or package the source with `mvn package` and copy `target/hanlp-lucene-plugin-x.x.x.jar` to `${webapp}/WEB-INF/lib`.)

Then edit `${core}/conf/schema.xml`:

```xml
<fieldType name="text_cn" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
    </analyzer>
    <analyzer type="query">
        <!-- Remember: never enable index mode in the query analyzer -->
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false"/>
    </analyzer>
</fieldType>
<!-- Every field that needs Chinese word segmentation must set its type to text_cn -->
<field name="my_field1" type="text_cn" indexed="true" stored="true"/>
<field name="my_field2" type="text_cn" indexed="true" stored="true"/>
```

The plugin currently supports the following schema.xml-based configuration options:
| Option | Function | Default |
|---|---|---|
| algorithm | segmentation algorithm | viterbi |
| enableIndexMode | index mode (never enable this in the query analyzer) | true |
| enableCustomDictionary | enable the user dictionary | true |
| customDictionaryPath | user dictionary path (absolute, or relative and readable by the program; separate multiple dictionaries with spaces) | null |
| enableCustomDictionaryForcing | give the user dictionary high priority | false |
| stopWordDictionaryPath | stop-word dictionary path | null |
| enableNumberQuantifierRecognize | recognize numerals and quantifiers | true |
| enableNameRecognize | recognize Chinese person names | true |
| enableTranslatedNameRecognize | recognize transliterated person names | false |
| enableJapaneseNameRecognize | recognize Japanese person names | false |
| enableOrganizationRecognize | recognize organization names | false |
| enablePlaceRecognize | recognize place names | false |
| enableNormalization | normalize characters (traditional → simplified, full width → half width, upper case → lower case) | false |
| enableTraditionalChineseMode | accurate Traditional Chinese segmentation | false |
| enableDebug | debug mode | false |
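
These options are passed as attributes on the tokenizer element, as `enableIndexMode` is above. For instance, a sketch of an index-time analyzer that loads a user dictionary and a stop-word list (the file names here are placeholders, not files shipped with the plugin):

```xml
<analyzer type="index">
    <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory"
               enableIndexMode="true"
               enableCustomDictionary="true"
               customDictionaryPath="my_dictionary.txt"
               stopWordDictionaryPath="my_stopwords.txt"/>
</analyzer>
```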
More advanced settings are configured mainly through hanlp.properties on the class path; please read the HanLP documentation to learn more about them.
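
Many of the same settings can also be overridden programmatically before the first segmentation runs. A minimal sketch, assuming the HanLP 1.x portable API (the dictionary path below is a placeholder):

```java
import com.hankcs.hanlp.HanLP;

public class HanLPConfigDemo
{
    public static void main(String[] args)
    {
        // Mirror hanlp.properties in code; must run before HanLP loads its dictionaries.
        HanLP.Config.ShowTermNature = true; // keep part-of-speech tags on terms
        HanLP.Config.CustomDictionaryPath = new String[]{"my_dictionary.txt"}; // placeholder path
        System.out.println(HanLP.segment("商品和服务"));
    }
}
```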
For stop words and synonyms, we recommend using the filters that ship with Lucene/Solr; this plugin does not reimplement them. An example configuration:
```xml
<!-- The text_cn field type uses the HanLP tokenizer with index mode enabled.
     Stop words are removed by Solr's built-in stop filter using "stopwords.txt" (empty by default).
     At query time, Solr's built-in synonym dictionary is supported as well. -->
<fieldType name="text_cn" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <!-- Uncomment to enable a synonym dictionary at index time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
<!-- Every field that needs Chinese word segmentation must set its type to text_cn -->
<field name="my_field1" type="text_cn" indexed="true" stored="true"/>
<field name="my_field2" type="text_cn" indexed="true" stored="true"/>
```

When rewriting queries, you can make use of the part-of-speech tags and other attributes in HanLPAnalyzer's output, for example:
```java
import com.hankcs.lucene.HanLPAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

String text = "中华人民共和国很辽阔";
// Print each character with its index, as a reference for the offsets below
for (int i = 0; i < text.length(); ++i)
{
    System.out.print(text.charAt(i) + "" + i + " ");
}
System.out.println();
Analyzer analyzer = new HanLPAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("field", text);
tokenStream.reset();
while (tokenStream.incrementToken())
{
    CharTermAttribute attribute = tokenStream.getAttribute(CharTermAttribute.class);
    // offsets
    OffsetAttribute offsetAtt = tokenStream.getAttribute(OffsetAttribute.class);
    // position increment
    PositionIncrementAttribute positionAttr = tokenStream.getAttribute(PositionIncrementAttribute.class);
    // part of speech
    TypeAttribute typeAttr = tokenStream.getAttribute(TypeAttribute.class);
    System.out.printf("[%d:%d %d] %s/%s\n", offsetAtt.startOffset(), offsetAtt.endOffset(), positionAttr.getPositionIncrement(), attribute, typeAttr.type());
}
```

In other scenarios, a HanLPTokenizer can be constructed from a custom segmenter (for example one with named-entity recognition enabled, a Traditional Chinese segmenter, or a CRF segmenter), for example:
```java
import com.hankcs.hanlp.HanLP;
import com.hankcs.lucene.HanLPTokenizer;
import java.io.StringReader;

HanLPTokenizer tokenizer = new HanLPTokenizer(HanLP.newSegment()
        .enableJapaneseNameRecognize(true)
        .enableIndexMode(true), null, false);
tokenizer.setReader(new StringReader("林志玲亮相网友:确定不是波多野结衣?"));
```

As with any Lucene Tokenizer, call reset() before consuming tokens and close() when done.

License: Apache License Version 2.0