v1.1.6 routine maintenance

Hanlp-Lucene-Plugin

HanLP Chinese word segmentation plugin for Lucene

Built on HanLP. Supports any Lucene (7.x) based system, including Solr (7.x).

Maven

    <dependency>
      <groupId>com.hankcs.nlp</groupId>
      <artifactId>hanlp-lucene-plugin</artifactId>
      <version>1.1.7</version>
    </dependency>

Solr quick start

Put hanlp-portable.jar and hanlp-lucene-plugin.jar into ${webapp}/WEB-INF/lib (or build from source with mvn package and copy target/hanlp-lucene-plugin-x.x.x.jar into ${webapp}/WEB-INF/lib).
Edit the Solr core configuration file ${core}/conf/schema.xml:

    <fieldType name="text_cn" class="solr.TextField">
        <analyzer type="index">
            <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
        </analyzer>
        <analyzer type="query">
            <!-- Remember: do not enable index mode in the query analyzer -->
            <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false"/>
        </analyzer>
    </fieldType>
    <!-- Every field in your application that needs word segmentation must specify type text_cn -->
    <field name="my_field1" type="text_cn" indexed="true" stored="true"/>
    <field name="my_field2" type="text_cn" indexed="true" stored="true"/>

If your application has other fields, such as location or summary, each of them must also be given type="text_cn". Otherwise those fields will still use Solr's default tokenizer.
Also, do not enable indexMode in the query analyzer, because it affects PhraseQuery. indexMode only needs to be enabled once, in the index analyzer.

Advanced configuration

The plugin currently supports the following configuration options in schema.xml:

Option	Description	Default
algorithm	Segmentation algorithm	viterbi
enableIndexMode	Use index mode (do not enable it in the query analyzer)	true
enableCustomDictionary	Whether to enable the user dictionary	true
customDictionaryPath	User dictionary path (an absolute path, or a relative path the program can read; separate multiple dictionaries with spaces)	null
enableCustomDictionaryForcing	Give the user dictionary high priority	false
stopWordDictionaryPath	Stop word dictionary path	null
enableNumberQuantifierRecognize	Whether to enable recognition of numerals and quantifiers	true
enableNameRecognize	Enable personal name recognition	true
enableTranslatedNameRecognize	Whether to enable recognition of transliterated personal names	false
enableJapaneseNameRecognize	Whether to enable Japanese personal name recognition	false
enableOrganizationRecognize	Enable organization name recognition	false
enablePlaceRecognize	Enable place name recognition	false
enableNormalization	Whether to normalize characters (Traditional -> Simplified, full-width -> half-width, upper case -> lower case)	false
enableTraditionalChineseMode	Enable precise Traditional Chinese segmentation	false
enableDebug	Enable debug mode	false
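
As a rough illustration of the space-separated format that customDictionaryPath uses, the sketch below splits such a value into individual paths. `DictionaryPaths` is a hypothetical helper written for this example, not a class from the plugin, and both file names are made up. The plugin's own parsing may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the plugin): splits a space-separated path list. */
final class DictionaryPaths {
    static List<String> split(String value) {
        List<String> paths = new ArrayList<>();
        if (value == null) {
            return paths; // option not set (the table lists null as the default)
        }
        for (String part : value.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                paths.add(part);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        // Both paths are invented for the example: one absolute, one relative.
        System.out.println(split("/data/dict/custom.txt dict/brands.txt"));
    }
}
```

Because spaces act as the separator, a dictionary path that itself contains a space presumably cannot be expressed in this option.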

More advanced configuration is done mainly through hanlp.properties on the classpath. See the documentation of the HanLP natural language processing package for the related settings, such as:

User dictionary
Part-of-speech tagging
Simplified/Traditional Chinese conversion
...
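
Since hanlp.properties has to be visible on the classpath, a quick check like the one below can help when settings seem to be ignored. This is a minimal sketch using only the JDK. `ClasspathConfigCheck` is a hypothetical name, and the sketch only lists whatever keys it finds without assuming any particular HanLP setting.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

/** Hypothetical diagnostic (not part of HanLP or the plugin). */
final class ClasspathConfigCheck {
    /** Loads a properties file from the classpath, or returns null if it is not visible. */
    static Properties load(String resourceName) {
        ClassLoader loader = ClasspathConfigCheck.class.getClassLoader();
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in == null) {
                return null; // resource not found on the classpath
            }
            Properties props = new Properties();
            props.load(new InputStreamReader(in, StandardCharsets.UTF_8));
            return props;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Properties props = load("hanlp.properties");
        if (props == null) {
            System.out.println("hanlp.properties is not on the classpath");
        } else {
            System.out.println("hanlp.properties found, keys: " + props.stringPropertyNames());
        }
    }
}
```

Run it with the same classpath as the application. Inside Solr the effective classpath is the web application's, so a result from a standalone run is only a hint.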

Stop words and synonyms

We recommend using the filters that ship with Lucene or Solr for this. The plugin does not try to take over that job. An example configuration:

    <!-- text_cn field type: uses the HanLP tokenizer with index mode enabled. Stop words are filtered with Solr's
         built-in stop word filter using "stopwords.txt" (empty by default).
         Solr's built-in synonym dictionary is also supported at search time. -->
    <fieldType name="text_cn" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <!-- Uncomment to enable a synonym dictionary at index time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <!-- Every field in your application that needs word segmentation must specify type text_cn -->
    <field name="my_field1" type="text_cn" indexed="true" stored="true"/>
    <field name="my_field2" type="text_cn" indexed="true" stored="true"/>
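
To make the effect of the filter chain concrete, here is a toy model in plain Java of what a case-insensitive stop word filter followed by a lowercase filter does to a token list. It is only a conceptual sketch: it is not how Lucene or Solr implement these filters, it leaves synonyms out, and the tokens and stop words are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model of "stop filter (ignoreCase) then lowercase filter"; not Lucene code. */
final class FilterChainSketch {
    static List<String> apply(List<String> tokens, Set<String> stopWords) {
        // Lowercase the stop words once so the comparison ignores case.
        Set<String> lowered = new HashSet<>();
        for (String word : stopWords) {
            lowered.add(word.toLowerCase(Locale.ROOT));
        }
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (lowered.contains(lower)) {
                continue; // dropped by the stop word step
            }
            out.add(lower); // kept, in lowercase form
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("HanLP", "的", "The", "分词");
        Set<String> stopWords = new HashSet<>(Arrays.asList("的", "the"));
        System.out.println(apply(tokens, stopWords)); // prints [hanlp, 分词]
    }
}
```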

Usage

When rewriting queries, you can use the part-of-speech tags and other attributes in the HanLPAnalyzer segmentation results, for example:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
    import com.hankcs.lucene.HanLPAnalyzer;

    String text = "中华人民共和国很辽阔";
    // Print each character followed by its index, so the offsets below are easy to read
    for (int i = 0; i < text.length(); ++i)
    {
        System.out.print(text.charAt(i) + "" + i + " ");
    }
    System.out.println();
    Analyzer analyzer = new HanLPAnalyzer();
    TokenStream tokenStream = analyzer.tokenStream("field", text);
    tokenStream.reset();
    while (tokenStream.incrementToken())
    {
        CharTermAttribute attribute = tokenStream.getAttribute(CharTermAttribute.class);
        // Offsets
        OffsetAttribute offsetAtt = tokenStream.getAttribute(OffsetAttribute.class);
        // Position increment
        PositionIncrementAttribute positionAttr = tokenStream.getAttribute(PositionIncrementAttribute.class);
        // Part of speech
        TypeAttribute typeAttr = tokenStream.getAttribute(TypeAttribute.class);
        System.out.printf("[%d:%d %d] %s/%s\n", offsetAtt.startOffset(), offsetAtt.endOffset(), positionAttr.getPositionIncrement(), attribute, typeAttr.type());
    }
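
The start and end offsets printed above are character positions in the input text, with the end exclusive, so a token's surface form can be recovered with substring. The sketch below shows this with the same sentence. The three spans are assumed for illustration and are not actual analyzer output. `OffsetDemo` is a hypothetical name.

```java
/** Illustrates how [start:end] offsets map back to the input text; no HanLP needed. */
final class OffsetDemo {
    static String slice(String text, int startOffset, int endOffset) {
        // endOffset is exclusive, matching String.substring
        return text.substring(startOffset, endOffset);
    }

    public static void main(String[] args) {
        String text = "中华人民共和国很辽阔";
        // Assumed spans, chosen by hand for the example (not real segmentation results).
        int[][] spans = {{0, 7}, {7, 8}, {8, 10}};
        for (int[] span : spans) {
            System.out.printf("[%d:%d] %s%n", span[0], span[1], slice(text, span[0], span[1]));
        }
    }
}
```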

In other scenarios, HanLPTokenizer can be constructed with a custom segmenter (for example one with named entity recognition enabled, a Traditional Chinese segmenter, or a CRF segmenter), for example:

    HanLPTokenizer tokenizer = new HanLPTokenizer(HanLP.newSegment()
                                                       .enableJapaneseNameRecognize(true)
                                                       .enableIndexMode(true), null, false);
    tokenizer.setReader(new StringReader("林志玲亮相网友:确定不是波多野结衣？"));

License

Apache License Version 2.0

More information

Version: v1.1.6 routine maintenance
Type: Other source code
Updated: 2025-04-19
Size: 32.23KB
Source: Github
