wink-bm25-text-search

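Fast Full Text Search based on BM25

wink-bm25-text-search, based on BM25 (a Probabilistic Relevance algorithm for document retrieval), is a full text search package for developing apps in either Node.js or browser environments. It builds an in-memory search index from input JSON documents, which is optimized for size and speed.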

Explore the wink BM25 text search example to dig deeper. Its code is available in the showcase-bm25-text-search repo, along with a detailed blog post.

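It is easy to add semantic flavor to the search by:

Assigning different numerical weights to the fields. A negative field weight pulls down the score of a document whenever a match with that field occurs.
Using the rich text processing features of wink-nlp, such as negation detection, stemming, lemmatization, stop word detection and named entity detection, to perform intelligent searches.
Defining different text preparation tasks separately for the fields and the query text.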

Installation

Use npm to install:

npm install wink-bm25-text-search --save

Example

// Load wink-bm25-text-search
var bm25 = require( 'wink-bm25-text-search' );
// Create search engine's instance
var engine = bm25();
// Load sample data (load any other JSON data instead of sample)
var docs = require( 'wink-bm25-text-search/sample-data/demo-data-for-wink-bm25.json' );
// Load wink nlp and its model
const winkNLP = require( 'wink-nlp' );
// Use web model
const model = require( 'wink-eng-lite-web-model' );
const nlp = winkNLP( model );
const its = nlp.its;

const prepTask = function ( text ) {
  const tokens = [];
  nlp.readDoc( text )
      .tokens()
      // Use only words ignoring punctuations etc and from them remove stop words
      .filter( ( t ) => ( t.out( its.type ) === 'word' && !t.out( its.stopWordFlag ) ) )
      // Handle negation and extract stem of the word
      .each( ( t ) => tokens.push( ( t.out( its.negationFlag ) ) ? '!' + t.out( its.stem ) : t.out( its.stem ) ) );

  return tokens;
};

// Contains search query.
var query;

// Step I: Define config
// Only field weights are required in this example.
engine.defineConfig( { fldWeights: { title: 1, body: 2 } } );
// Step II: Define PrepTasks pipe.
// Set up 'default' preparatory tasks i.e. for everything else
engine.definePrepTasks( [ prepTask ] );

// Step III: Add Docs
// Add documents now...
docs.forEach( function ( doc, i ) {
  // Note, 'i' becomes the unique id for 'doc'
  engine.addDoc( doc, i );
} );

// Step IV: Consolidate
// Consolidate before searching
engine.consolidate();

// All set, start searching!
query = 'not studied law';
// `results` is an array of [ doc-id, score ], sorted by score
var results = engine.search( query );
// Print number of results.
console.log( '%d entries found.', results.length );
// -> 1 entries found.
// results[ 0 ][ 0 ] i.e. the top result is:
console.log( docs[ results[ 0 ][ 0 ] ].body );
// -> George Walker Bush (born July 6, 1946) is an...
// -> ... He never studied Law...

// Whereas if you search for `law` then multiple entries will be
// found except the above entry!

Note:
wink-nlp requires Node.js version 16 or 18.
wink-nlp-utils remains available to support legacy code. For wink-nlp-utils examples, refer to wink-bm25-text-search version 3.0.1.

API

`defineConfig( config )`

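Defines the configuration from the config object. This object defines the following 3 properties:

fldWeights (mandatory) is an object in which each key is a field name of the document and each value is a numerical weight, i.e. the importance of that field.
bm25Params (optional) is also an object; it defines up to 3 keys: k1, b and k. Their default values are 1.2, 0.75 and 1 respectively. Note: k1 controls TF saturation, b controls the degree of normalization, and k manages IDF.
ovFldNames (optional) is an array containing the names of the fields whose original values must be retained. This is useful for reducing the search space with a filter in the search() API call.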
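To make the roles of k1, b and k more concrete, here is a minimal sketch of a textbook BM25 term score. The function names and the exact form of the IDF expression are assumptions made for illustration; the package's internal formula may differ in detail.

```javascript
// Textbook BM25 pieces (illustrative; not taken from the package source).

// IDF: rarer terms score higher. `k` is added inside the log, which is
// how it "manages" IDF in this sketch (assumed form).
function idf(totalDocs, docsWithTerm, k = 1) {
  return Math.log(((totalDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5)) + k);
}

// TF part: `k1` caps how much repeated occurrences can add (saturation),
// `b` sets how strongly long documents are penalised (normalization).
function tfPart(tf, docLength, avgDocLength, k1 = 1.2, b = 0.75) {
  const lengthNorm = 1 - b + b * (docLength / avgDocLength);
  return (tf * (k1 + 1)) / (tf + k1 * lengthNorm);
}

// Score contributed by one term in one document.
function termScore(tf, docLength, avgDocLength, totalDocs, docsWithTerm, params = {}) {
  const { k1 = 1.2, b = 0.75, k = 1 } = params;
  return idf(totalDocs, docsWithTerm, k) * tfPart(tf, docLength, avgDocLength, k1, b);
}

// The same parameter names go into the config object, for example:
const config = {
  fldWeights: { title: 1, body: 2 },
  bm25Params: { k1: 1.2, b: 0.75, k: 1 }
};

// Made-up corpus statistics, only to show a call.
console.log(termScore(3, 120, 100, 1000, 10, config.bm25Params));
```

In this sketch, raising k1 lets repeated terms keep adding to the score for longer, while setting b to 0 switches length normalization off entirely.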

`definePrepTasks( tasks [, field ] )`

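Defines the text preparation tasks that transform raw incoming text into the array of tokens required during addDoc() and search() operations. It returns the count of tasks.

tasks should be an array of functions. The first function in this array must accept a string as input, and the last function must return an array of tokens as JavaScript strings. Each function must accept one input argument and return a single value.

The second argument, field, is optional. It names the document field for which the tasks are being defined; in the absence of this argument, the tasks become the default for everything else. The configuration must be defined via defineConfig() prior to this call.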
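The contract on tasks can be illustrated without any NLP library. The sketch below uses a hypothetical stop word list and a hypothetical helper, runPipe, which only mimics feeding each task's output into the next one; how the engine runs the tasks internally is an assumption here.

```javascript
// Hypothetical stop word list, kept tiny for the example.
const STOP_WORDS = new Set(['a', 'an', 'and', 'in', 'of', 'the']);

// First task: must accept a string. Lower-cases and splits into word tokens.
const tokenize = (text) => text.toLowerCase().match(/[a-z0-9]+/g) || [];

// Last task: must return an array of string tokens.
const removeStopWords = (tokens) => tokens.filter((t) => !STOP_WORDS.has(t));

const defaultTasks = [tokenize, removeStopWords];
// Titles are short, so this sketch keeps their stop words.
const titleTasks = [tokenize];

// Helper for trying a pipe outside the engine:
// each task receives the previous task's return value.
const runPipe = (tasks, text) => tasks.reduce((value, task) => task(value), text);

console.log(runPipe(defaultTasks, 'The Law of the Land')); // -> [ 'law', 'land' ]

// With an engine whose config is already defined (not created here):
// engine.definePrepTasks(defaultTasks);         // default for everything else
// engine.definePrepTasks(titleTasks, 'title');  // only for the `title` field
```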

`addDoc( doc, uniqueId )`

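Adds the doc with the given uniqueId to the BM25 model. defineConfig() and definePrepTasks() must be called before adding documents. It accepts structured JSON documents as input for creating the model. The following is an example document structure from the sample data included in this package:

{
  title: 'Barack Obama',
  body: 'Barack Hussein Obama II born August 4, 1961 is an American politician...',
  tags: 'democratic nobel peace prize columbia michelle...'
}

The sample data was created from Wikipedia articles, such as the one on Barack Obama.

It has an alias, learn( doc, uniqueId ), to maintain API-level uniformity across various wink packages such as wink-naive-bayes-text-classifier.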
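Before calling addDoc(), it can help to check that each document carries the weighted fields. The following is a small pre-flight sketch; missingFields is a hypothetical helper, and the rule that every field named in fldWeights should be a string is an assumption of this example rather than documented behavior.

```javascript
// Returns the weighted field names that are absent or not strings in `doc`.
function missingFields(doc, fldWeights) {
  return Object.keys(fldWeights).filter((field) => typeof doc[field] !== 'string');
}

const fldWeights = { title: 1, body: 2 };
const docs = [
  { title: 'Barack Obama', body: 'Barack Hussein Obama II born August 4, 1961...' },
  { title: 'Untitled draft' } // no body
];

docs.forEach((doc, i) => {
  const missing = missingFields(doc, fldWeights);
  if (missing.length > 0) {
    console.log('Skipping doc %d, missing: %s', i, missing.join(', '));
    return;
  }
  // engine.addDoc(doc, i); // `engine` is assumed to be configured already
});
```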

`consolidate( fp )`

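Consolidates the BM25 model for all the added documents. fp defines the precision at which term frequency values are stored. The default value is 4, which is good enough for most situations. Consolidation is a prerequisite for search(), and documents cannot be added after consolidation.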

`search( text [, limit, filter, params ] )`

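Searches for the text and returns up to limit results. filter should be a function that returns true or false based on params; think of it as the callback of JavaScript Array's filter function. It receives two arguments: (a) an object containing the field name/value pairs for the fields defined via ovFldNames in defineConfig(), and (b) the params.

The last three arguments, limit, filter and params, are optional. The default value of limit is 10.

The result is an array of [ uniqueId, relevanceScore ], sorted on the relevanceScore.

Like addDoc(), it also has an alias, predict(), to maintain API-level uniformity across various wink packages such as wink-naive-bayes-text-classifier.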
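A filter and a small result formatter might look like the sketch below. The helper names hasTag and describe are hypothetical, and the sketch assumes that tags is listed in ovFldNames and holds a space-separated string, and that each uniqueId is the document's index in a docs array.

```javascript
// Filter: receives the retained field values (those named in ovFldNames)
// and the `params` passed to search(); returns true to keep the document.
const hasTag = (fieldValues, params) =>
  fieldValues.tags.split(' ').includes(params.tag);

// Turns [ uniqueId, relevanceScore ] pairs into readable rows.
const describe = (results, docs) =>
  results.map(([id, score]) => ({ id, score, title: docs[id].title }));

// With a consolidated engine (not created here):
// const results = engine.search('studied law', 5, hasTag, { tag: 'democratic' });
// console.log(describe(results, docs));

// Stand-alone run of the two helpers with made-up values:
console.log(hasTag({ tags: 'democratic nobel peace' }, { tag: 'nobel' })); // -> true
console.log(describe([[1, 2.5]], [{ title: 'A' }, { title: 'B' }]));
```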

`exportJSON()`

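The BM25 model can be exported as JSON text, which may be saved in a file. It is a good idea to export the JSON prior to consolidation and to use that export whenever more documents need to be added; JSON exported after consolidation is only good for search operations.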

`importJSON( json )`

An existing JSON BM25 model can be imported for search. It is essential to call definePrepTasks() before attempting to search.
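Saving and restoring a model could be wrapped as in the sketch below. saveModel and loadModel are hypothetical helpers, the file name is made up, and engine stands for any object exposing exportJSON(), importJSON() and definePrepTasks().

```javascript
const fs = require('fs');

// Writes the exported JSON text to disk.
function saveModel(engine, filePath) {
  fs.writeFileSync(filePath, engine.exportJSON(), 'utf8');
}

// Reads a saved model into `engine`, then defines the prep tasks again,
// because they must be in place before searching.
function loadModel(engine, filePath, prepTasks) {
  engine.importJSON(fs.readFileSync(filePath, 'utf8'));
  engine.definePrepTasks(prepTasks);
  return engine;
}

// Typical use (engines and prepTask not created here):
// saveModel(engine, 'bm25-model.json');
// loadModel(freshEngine, 'bm25-model.json', [prepTask]);
```

Keeping two files, one exported before consolidation for later additions and one exported after it for search, follows the advice given for exportJSON().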

`reset()`

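It completely resets the BM25 model by re-initializing all the variables, except the preparatory tasks.

The following accessor methods are also available:

getDocs() returns the term frequencies and length of each document.
getTokens() returns the token: index mapping.
getIDF() returns the IDF for each token. Tokens are referenced by their numerical index, which is obtained via getTokens().
getConfig() returns the BM25F configuration as set up by defineConfig().
getTotalCorpusLength() returns the total number of tokens across all added documents.
getTotalDocs() returns the total number of documents added.

Note: these accessors expose some of the internal data structures, which must not be modified. They are meant exclusively for read-only purposes. Any intentional or unintentional modification may result in serious malfunction of the package.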
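Since IDF values are keyed by a token's numerical index, reading them usually means joining two accessor outputs. The sketch below builds a new token-to-IDF table and leaves the originals untouched; idfByToken is a hypothetical helper, and the exact shapes of the accessor outputs (a token-to-index object, and an IDF collection indexable by that number) are assumptions.

```javascript
// Joins the two accessor outputs into a fresh { token: idf } object.
function idfByToken(tokens, idf) {
  const table = {};
  for (const [token, index] of Object.entries(tokens)) {
    table[token] = idf[index];
  }
  return table;
}

// With a consolidated engine (not created here):
// console.log(idfByToken(engine.getTokens(), engine.getIDF()));

// Stand-alone run with made-up values:
console.log(idfByToken({ law: 0, studi: 1 }, [2.1, 0.7]));
// -> { law: 2.1, studi: 0.7 }
```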

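Need Help?

If you spot a bug that has not yet been reported, raise a new issue, or consider fixing it and sending a pull request.

About winkJS

WinkJS is a family of open source packages for Natural Language Processing, Statistical Analysis and Machine Learning. The code is thoroughly documented for easy human comprehension and has a test coverage of ~100%, for building production grade solutions.

Copyright & License

wink-bm25-text-search is copyright 2017-22 GRAYPE Systems Private Limited.

It is licensed under the terms of the MIT License.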

Additional information

Version: 1.0.0
Type: Other source code
Updated: 2025-05-26
Size: 125.14KB
Source: GitHub
