Dice.com's Solr plugins for performing personalized search and recommendations (via the Relevancy Feedback handler) and conceptual/semantic search (via the Unsupervised Feedback handler).
Pre-built jar files can be found in the ./target folder. The project contains a Maven pom.xml file, which can also be used to build it from source.
If you need a specific version of Solr, please create a GitHub issue and I will see what I can do. To compile it manually for a particular version, update the versions of the Solr and Lucene libraries in the pom.xml file, use Maven to pull in those dependencies, and then compile the plugin with Maven, fixing any compilation errors.
See the official Solr guide for registering plugins with Solr. This essentially involves placing the JAR file into one of the folders that Solr checks for classes and JAR files on core reload.
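As a minimal sketch (the directory path here is an assumption; point it at wherever you copied the jar), a <lib> directive in solrconfig.xml makes the jar visible to the core:

```xml
<!-- solrconfig.xml: load the plugin jar(s) on core (re)load.
     The dir below is illustrative; adjust it to your installation. -->
<lib dir="/opt/solr/dice-solr-plugins" regex=".*\.jar" />
```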
An example request handler configuration for solrconfig.xml is shown below, with comments outlining the main parameters:
<requestHandler name="/rf" class="org.dice.solrenhancements.relevancyfeedback.RelevancyFeedbackHandler">
<lst name="defaults">
<str name="omitHeader">true</str>
<str name="wt">json</str>
<str name="indent">true</str>
<!-- Regular query configuration - query parser used when parsing the rf query-->
<str name="defType">lucene</str>
<!-- fields returned -->
<str name="fl">jobTitle,skill,company</str>
<!-- fields to match on-->
<str name="rf.fl">skillFromSkill,extractTitles</str>
<!-- field weights. Note that term weights are normalized so that each field is weighted exactly in this ratio
as different fields can get different numbers of matching terms-->
<str name="rf.qf">skillFromSkill^3 extractTitles^4.5</str>
<int name="rows">10</int>
<!-- How many terms to extract per field (max) -->
<int name="rf.maxflqt">10</int>
<bool name="rf.boost">true</bool>
<!-- normalize the weights for terms in each field (custom to dice rf, not present in solr MLT) -->
<bool name="rf.normflboosts">true</bool>
<!-- Take the raw term frequencies (false) or the log of the term frequencies (true) -->
<bool name="rf.logtf">true</bool>
<!-- Minimum should match settings for the rf query - determines what proportion of the terms have to match -->
<!-- See Solr edismax mm parameter for specifics -->
<str name="rf.mm">25%</str>
<!-- Returns the top k terms (see regular solr MLT handler) -->
<str name="rf.interestingTerms">details</str>
<!-- Turns the rf query into a boost query using a multiplicative boost, allowing arbitrary boost functions to be applied -->
<str name="rf.boostfn"></str>
<!-- q parameter - If you want to execute one query, and use the rf query to boost the results (e.g. for personalizing search),
pass the user's query into this parameter. Can take regular query syntax e.g. rf.q={!edismax df=title qf=.... v=$qq}&qq=Java
The regular q parameter is reserved for the rf query (see above)
-->
<str name="q"></str>
<!-- Query parser to use for the q query, if passed -->
<str name="defType"></str>
<!-- rf.q parameter - Note that the regular q parameter is used only for personalized search scenarios, where you have a main query
and you want to use the generated rf query to boost the main query's documents. Typically the rf.q query is a query that identifies
one or more documents, e.g. rf.q=(id:686867 id:98928980 id:999923). But it can be any query. Note that it will process ALL documents
matched by the rf.q query (as it's intended mainly for looking up docs by id), so use cautiously.
-->
<str name="rf.q"></str>
<!-- query parser to use for the rf.q query -->
<str name="rf.defType"></str>
<!-- Settings for personalized search - use the regular parameter names for the query parser defined by defType parameter -->
<str name="df">title</str>
<str name="qf"> company_text^0.01 title^12 skill^4 description^0.3</str>
<str name="pf2">company_text^0.01 title^12 skill^4 description^0.6</str>
<!-- Content based recommendations settings (post a document to the endpoint in a POST request). The stream.body and stream.head are form parameters
You can send them in a GET request, but a POST handles larger data. If you have really large documents, you will need to change the buffer settings
so that the request doesn't blow the buffer limits in Solr or your web server.
-->
<!-- Fields used for processing documents posted to the stream.body and stream.head parameters in a POST call -->
<str name="stream.head.fl">title,title_syn</str>
<str name="stream.body.fl">extractSkills,extractTitles</str>
<!-- pass a url in this parameter for Solr to download the webpage, and process the Html using the fields configured in the stream.qf parameters -->
<str name="stream.url"></str>
<!-- Note that we have two different content stream fields to pass over in the POST request. This allows different analyzers to be applied to each.
For instance, we pass the job title into the stream.head field and parse out job titles, while we pass the job description to the stream.body parameter
to parse out skills -->
<!-- Pass the document body in this parameter as a form parameter. Analysed using the stream.body.fl fields-->
<str name="stream.body"></str>
<!-- Pass the second document field in this parameter. Analysed using the stream.head.fl fields-->
<str name="stream.head"></str>
<!-- Specifies a separate set of field weights to apply when processing a document posted to the request handler via the
stream.body and stream.head parameters -->
<str name="stream.qf">extractSkills^4.5 extractTitles^2.25 title^3.0 title_syn^3.0</str>
</lst>
</requestHandler>
http://localhost:8983/solr/jobs/rf?q=id:11f407d319d6cc707437fad874a097c0+id:a2fd2f2e34667d61fadcdcabfd359cf4&rows=10&df=title&fl=title,skills,geocode,city,state&wt=json
{
"match":{
"numFound":2,
"start":0,
"docs":[
{
"id":"a2fd2f2e34667d61fadcdcabfd359cf4",
"title":"Console AAA Sports Video Game Programmer.",
"skills":["Sports Game Experience a plus.",
"2-10 years plus Console AAA Video Game Programming Experience"],
"geocode":"38.124447,-122.55051",
"city":"Novato",
"state":"CA"
},
{
"id":"11f407d319d6cc707437fad874a097c0",
"title":"Game Engineer - Creative and Flexible Work Environment!",
"skills":["3D Math",
"Unity3d",
"C#",
"3D Math - game programming",
"game programming",
"C++",
"Java"],
"geocode":"33.97331,-118.243614",
"city":"Los Angeles",
"state":"CA"
}
]
},
"response":{
"numFound":5333,
"start":0,
"docs":[
{
"title":"Software Design Engineer 3 (Game Developer)",
"skills":["C#",
"C++",
"Unity"],
"geocode":"47.683647,-122.12183",
"city":"Redmond",
"state":"WA"
},
{
"title":"Game Server Engineer - MMO Mobile Gaming Start-Up!",
"skills":["AWS",
"Node.JS",
"pubnub",
"Websockets",
"pubnub - Node.JS",
"Vagrant",
"Linux",
"Git",
"MongoDB",
"Jenkins",
"Docker"],
"geocode":"37.777115,-122.41733",
"city":"San Francisco",
"state":"CA"
},...
]
}
}
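A request like the one above can be assembled with Python's standard library. This is an illustrative sketch; the core name (jobs) and the field names come from the example URL and may differ in your schema:

```python
from urllib.parse import urlencode

def build_rf_url(core_url, doc_ids, rows=10, fl=None):
    """Build a Relevancy Feedback request that looks up the source
    documents by id (the typical 'more like these' usage)."""
    params = {
        # the q parameter holds the rf query: one or more id lookups
        "q": " ".join(f"id:{d}" for d in doc_ids),
        "rows": rows,
        "df": "title",
        "wt": "json",
    }
    if fl:
        params["fl"] = ",".join(fl)
    return f"{core_url}/rf?{urlencode(params)}"

url = build_rf_url(
    "http://localhost:8983/solr/jobs",
    ["11f407d319d6cc707437fad874a097c0", "a2fd2f2e34667d61fadcdcabfd359cf4"],
    fl=["title", "skills", "geocode", "city", "state"],
)
```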
An example request handler configuration for the Unsupervised Feedback handler in solrconfig.xml is shown below, with comments outlining the main parameters:
<requestHandler name="/ufselect" class="org.dice.solrenhancements.unsupervisedfeedback.UnsupervisedFeedbackHandler">
<lst name="defaults">
<str name="omitHeader">true</str>
<str name="wt">json</str>
<str name="indent">true</str>
<!-- Regular query configuration -->
<str name="defType">edismax</str>
<str name="df">title</str>
<str name="qf">title^1.5 skills^1.25 description^1.1</str>
<str name="pf2">title^3.0 skills^2.5 description^1.5</str>
<str name="mm">1</str>
<str name="q.op">OR</str>
<str name="fl">jobTitle,skills,company</str>
<int name="rows">30</int>
<!-- Unsupervised Feedback (Blind Feedback) query configuration-->
<str name="uf.fl">skillsFromskills,titleFromJobTitle</str>
<!-- How many docs to extract the top terms from -->
<str name="uf.maxdocs">50</str>
<!-- How many terms to extract per field (max) -->
<int name="uf.maxflqt">10</int>
<bool name="uf.boost">true</bool>
<!-- Relative per-field boosts on the extracted terms (similar to the edismax qf parameter) -->
<!-- NOTE: with uf.normflboosts=true, all terms are normalized so that the total importance of each
field on the query is the same, then these relative boosts are applied per field-->
<str name="uf.qf">skillsFromskills^4.5 titleFromJobTitle^6.0</str>
<!-- Returns the top k terms (see regular solr MLT handler) -->
<str name="uf.interestingTerms">details</str>
<!-- unit-length norm all term boosts within a field (recommended) - see talk for details -->
<bool name="uf.normflboosts">true</bool>
<!-- Use raw term counts or log term counts? -->
<bool name="uf.logtf">false</bool>
</lst>
</requestHandler>
http://localhost:8983/solr/dicejobscp/ufselect?q=machine+learning+engineer&start=0&rows=10&uf.logtf=false&fl=title,skills,geocode,city,state&fq={!geofilt+sfield=geocode+d=48+pt=39.6955,-105.0841}&wt=json
{
"match":
{
"numFound":7729,
"start":0,
"docs":[
{
"title":"NLP/Machine Learning Engineer",
"skills":["Linux",
"NLP (Natural Language Processing)",
"SQL",
"Bash",
"Python",
"ML (Machine Learning)",
"JavaScript",
"Java"],
"geocode":"42.35819,-71.050674",
"city":"Boston",
"state":"MA"
},
{
"title":"Machine Learning Engineer",
"skills":["machine learning",
"java",
"scala"],
"geocode":"47.60473,-122.32594",
"city":"Seattle",
"state":"WA"
},
{
"title":"Machine Learning Engineer - REMOTE!",
"skills":["Neo4j",
"Hadoop",
"gensim",
"gensim - C++",
"Java",
"R",
"MongoDB",
"elastic search",
"sci-kit learn",
"Python",
"C++"],
"geocode":"37.777115,-122.41733",
"city":"San Francisco",
"state":"CA"
},...
]
}
}
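Likewise, the unsupervised feedback request above (a free-text query plus a geofilt filter) can be sketched as follows; the helper is hypothetical, and the core and field names are taken from the example:

```python
from urllib.parse import urlencode

def build_uf_url(core_url, user_query, pt, d_km=48, rows=10, fl=None):
    """Build an Unsupervised (blind) Feedback request with a geofilt
    filter query restricting results to within d_km km of point pt."""
    params = {
        "q": user_query,
        "start": 0,
        "rows": rows,
        "uf.logtf": "false",
        # sfield assumed to be the same geocode field shown in the results
        "fq": f"{{!geofilt sfield=geocode d={d_km} pt={pt}}}",
        "wt": "json",
    }
    if fl:
        params["fl"] = ",".join(fl)
    return f"{core_url}/ufselect?{urlencode(params)}"

url = build_uf_url(
    "http://localhost:8983/solr/dicejobscp",
    "machine learning engineer",
    pt="39.6955,-105.0841",
    fl=["title", "skills", "geocode", "city", "state"],
)
```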
Although it is loosely based on the code and algorithm of the Solr MLT handler (which is essentially the Rocchio algorithm), there are some key differences in the algorithm's design. When constructing the MLT query, the MLT handler takes the top k terms across all configured fields. If one of your fields has a much larger vocabulary than the others, the average document frequency of its terms will be lower than in fields with smaller vocabularies. This means those terms will have higher relative IDF scores and will tend to dominate the top terms selected by the Solr MLT handler. Our request handler instead takes the top k terms per field. It also ensures that, no matter how many terms match in each field (up to the configured limit), each field carries equal weight in the resulting query before the field-specific weights specified in the rf.qf parameter are applied. This is the second issue with the Solr MLT handler that we address. We also provide a lot of extra functionality: we allow content streams to be passed in, matching against multiple documents ("more like these" rather than "more like this"), and applying the boost query parser to the resulting MLT query so that any arbitrary Solr boost function can be applied (multiplicatively). We also support the mm parameter, so we can require that returned documents match at least a set percentage of the top terms.
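The per-field term selection and normalization described above can be illustrated with a short sketch. This is not the plugin's actual code; it simply mimics the interplay of rf.maxflqt (top-k per field), rf.normflboosts (unit-length normalization within a field), and rf.qf (per-field boosts):

```python
import math

def select_rf_terms(field_term_scores, max_terms_per_field=10,
                    field_boosts=None, norm_field_boosts=True):
    """field_term_scores: {field: {term: tf.idf score}}.
    Returns {field: {term: final query weight}}."""
    field_boosts = field_boosts or {}
    query_terms = {}
    for field, scores in field_term_scores.items():
        # per-field top-k (rf.maxflqt), instead of top-k across all fields
        top = sorted(scores.items(), key=lambda kv: -kv[1])[:max_terms_per_field]
        if norm_field_boosts:
            # unit-length norm (rf.normflboosts): stops fields with large
            # vocabularies (and hence high-IDF terms) dominating the query
            norm = math.sqrt(sum(w * w for _, w in top)) or 1.0
        else:
            norm = 1.0
        boost = field_boosts.get(field, 1.0)  # rf.qf weight for this field
        query_terms[field] = {t: (w / norm) * boost for t, w in top}
    return query_terms

# Example using the rf.qf weights from the config above (scores invented)
rf_query = select_rf_terms(
    {"skillFromSkill": {"c++": 3.1, "unity": 2.4, "java": 0.9},
     "extractTitles": {"game programmer": 4.2, "engineer": 1.3}},
    field_boosts={"skillFromSkill": 3.0, "extractTitles": 4.5},
)
```

After normalization, the sum of squared term weights in each field equals the square of that field's rf.qf boost, so the fields influence the query in exactly the configured ratio regardless of how many terms each contributed.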
If you wish to use this to perform search personalization, as demonstrated in my Lucene Revolution 2017 talk, you need to pass in the user's current search query using the regular q parameter, and the information used to generate the Rocchio query is passed via the rf.q parameter (when using documents to generate the Rocchio query) or via the content stream parameters (rf.stream.head and rf.stream.body, which take strings of content). Note, however, that because of the normalization process the algorithm applies, the boosts on the terms in the Rocchio query carry no comparable weighting to the terms in the user's query. You will therefore need to experiment with different rf.qf values until you find the right level of query influence for your search configuration. Also, given that the Rocchio query generated for a user will likely stay the same throughout that user's search session (depending on your use case, of course), a more efficient way to implement personalization with this handler is to use the RF handler to generate the Rocchio query once, when the user logs in, and then embed it as a boost query in your regular searches. The handler returns the generated Rocchio query in the rf.query parameter of the response. If you only want to use the handler to obtain that query (and not execute a search), you can set the rows parameter to 0. You can also iterate over the set of "interesting terms" returned by the algorithm, along with their weights, if you set rf.interestingTerms=details, and use this information to build your own boost query.
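The two-step workflow above can be sketched as follows. This is a hypothetical sketch: the core URL, the qf weights, and the use of bq (an additive edismax boost) are assumptions, not part of the plugin's API; the plugin itself also supports multiplicative boosts via rf.boostfn:

```python
from urllib.parse import urlencode

SOLR_CORE = "http://localhost:8983/solr/jobs"  # assumed core URL

def rocchio_fetch_url(doc_ids):
    """Step 1 (e.g. at login): ask the /rf handler for the Rocchio query
    only, without executing a search, by setting rows=0. The generated
    query comes back in the rf.query field of the response."""
    params = {
        "q": " ".join(f"id:{d}" for d in doc_ids),
        "rows": 0,
        "rf.interestingTerms": "details",
        "wt": "json",
    }
    return f"{SOLR_CORE}/rf?{urlencode(params)}"

def personalized_search_url(user_query, rocchio_query, rows=10):
    """Step 2: run the user's regular searches, embedding the stored
    Rocchio query as a boost query (bq gives an additive boost; a
    multiplicative boost could be applied via the boost qparser)."""
    params = {
        "q": user_query,
        "defType": "edismax",
        "qf": "title^1.5 skills^1.25 description^1.1",  # example weights
        "bq": rocchio_query,
        "rows": rows,
        "wt": "json",
    }
    return f"{SOLR_CORE}/select?{urlencode(params)}"
```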
In addition to making it work with more versions of Solr (please leave feedback about which versions you would like), there are a number of possible enhancements:
If you have a feature request, please submit it to the issues list. That is also a good place to post questions, but if you don't hear back from me there, you can also contact me at [email protected].