# hanlp-lucene-plugin
v1.1.6: routine maintenance release.

Based on HanLP, this plugin supports any system built on Lucene (7.x), including Solr (7.x).
Maven:

```xml
<dependency>
    <groupId>com.hankcs.nlp</groupId>
    <artifactId>hanlp-lucene-plugin</artifactId>
    <version>1.1.7</version>
</dependency>
```

Alternatively, place the jar under `${webapp}/WEB-INF/lib`. (Or package the source with `mvn package` and copy `target/hanlp-lucene-plugin-x.x.x.jar` to `${webapp}/WEB-INF/lib`.)

Then edit `${core}/conf/schema.xml`:

```xml
<fieldType name="text_cn" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
    </analyzer>
    <analyzer type="query">
        <!-- Remember: never enable index mode in the query analyzer -->
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false"/>
    </analyzer>
</fieldType>
<!-- Every field that needs Chinese word segmentation must set its type to text_cn -->
<field name="my_field1" type="text_cn" indexed="true" stored="true"/>
<field name="my_field2" type="text_cn" indexed="true" stored="true"/>
```

The plugin currently supports the following schema.xml-based configuration options:
| Option | Function | Default |
|---|---|---|
| algorithm | segmentation algorithm | viterbi |
| enableIndexMode | index mode (never enable this in the query analyzer) | true |
| enableCustomDictionary | enable the user dictionary | true |
| customDictionaryPath | user dictionary path (absolute, or relative and readable by the program; separate multiple dictionaries with spaces) | null |
| enableCustomDictionaryForcing | give the user dictionary high priority | false |
| stopWordDictionaryPath | stop-word dictionary path | null |
| enableNumberQuantifierRecognize | recognize numerals and quantifiers | true |
| enableNameRecognize | recognize Chinese person names | true |
| enableTranslatedNameRecognize | recognize transliterated person names | false |
| enableJapaneseNameRecognize | recognize Japanese person names | false |
| enableOrganizationRecognize | recognize organization names | false |
| enablePlaceRecognize | recognize place names | false |
| enableNormalization | normalize characters (traditional → simplified, full width → half width, upper case → lower case) | false |
| enableTraditionalChineseMode | accurate Traditional Chinese segmentation | false |
| enableDebug | debug mode | false |
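
These options are passed as attributes on the tokenizer element, as `enableIndexMode` is above. For instance, a sketch of an index-time analyzer that loads a user dictionary and a stop-word list (the file names here are placeholders, not files shipped with the plugin):

```xml
<analyzer type="index">
    <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory"
               enableIndexMode="true"
               enableCustomDictionary="true"
               customDictionaryPath="my_dictionary.txt"
               stopWordDictionaryPath="my_stopwords.txt"/>
</analyzer>
```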
More advanced settings are configured mainly through hanlp.properties on the class path; please read the HanLP documentation to learn more about them.
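
Many of the same settings can also be overridden programmatically before the first segmentation runs. A minimal sketch, assuming the HanLP 1.x portable API (the dictionary path below is a placeholder):

```java
import com.hankcs.hanlp.HanLP;

public class HanLPConfigDemo
{
    public static void main(String[] args)
    {
        // Mirror hanlp.properties in code; must run before HanLP loads its dictionaries.
        HanLP.Config.ShowTermNature = true; // keep part-of-speech tags on terms
        HanLP.Config.CustomDictionaryPath = new String[]{"my_dictionary.txt"}; // placeholder path
        System.out.println(HanLP.segment("商品和服务"));
    }
}
```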
For stop words and synonyms, we recommend using the filters that ship with Lucene/Solr; this plugin does not reimplement them. An example configuration:
```xml
<!-- The text_cn field type uses the HanLP tokenizer with index mode enabled.
     Stop words are removed by Solr's built-in stop filter using "stopwords.txt" (empty by default).
     At query time, Solr's built-in synonym dictionary is supported as well. -->
<fieldType name="text_cn" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <!-- Uncomment to enable a synonym dictionary at index time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
<!-- Every field that needs Chinese word segmentation must set its type to text_cn -->
<field name="my_field1" type="text_cn" indexed="true" stored="true"/>
<field name="my_field2" type="text_cn" indexed="true" stored="true"/>
```

When rewriting queries, you can make use of the part-of-speech tags and other attributes in HanLPAnalyzer's output, for example:
```java
import com.hankcs.lucene.HanLPAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

String text = "中华人民共和国很辽阔";
// Print each character with its index, as a reference for the offsets below
for (int i = 0; i < text.length(); ++i)
{
    System.out.print(text.charAt(i) + "" + i + " ");
}
System.out.println();
Analyzer analyzer = new HanLPAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("field", text);
tokenStream.reset();
while (tokenStream.incrementToken())
{
    CharTermAttribute attribute = tokenStream.getAttribute(CharTermAttribute.class);
    // offsets
    OffsetAttribute offsetAtt = tokenStream.getAttribute(OffsetAttribute.class);
    // position increment
    PositionIncrementAttribute positionAttr = tokenStream.getAttribute(PositionIncrementAttribute.class);
    // part of speech
    TypeAttribute typeAttr = tokenStream.getAttribute(TypeAttribute.class);
    System.out.printf("[%d:%d %d] %s/%s\n", offsetAtt.startOffset(), offsetAtt.endOffset(), positionAttr.getPositionIncrement(), attribute, typeAttr.type());
}
```

In other scenarios, a HanLPTokenizer can be constructed from a custom segmenter (for example one with named-entity recognition enabled, a Traditional Chinese segmenter, or a CRF segmenter), for example:
```java
import com.hankcs.hanlp.HanLP;
import com.hankcs.lucene.HanLPTokenizer;
import java.io.StringReader;

HanLPTokenizer tokenizer = new HanLPTokenizer(HanLP.newSegment()
        .enableJapaneseNameRecognize(true)
        .enableIndexMode(true), null, false);
tokenizer.setReader(new StringReader("林志玲亮相网友:确定不是波多野结衣?"));
```

As with any Lucene Tokenizer, call reset() before consuming tokens and close() when done.

License: Apache License Version 2.0