Dice.com's Solr plugins for performing personalized search and recommendations (via the relevancy feedback plugin) and conceptual / semantic search (via the unsupervised feedback plugin).
A pre-built jar file can be found in the ./target folder. The project contains a Maven pom.xml file, which can also be used to build it from source.
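To build from source, the standard Maven invocation applies (assuming Maven is installed):

mvn clean package

The jar is written to the ./target folder.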
If you need this for a specific version of Solr, please create a GitHub issue and I will see what I can do. To compile it manually for a particular version, update the versions of the Solr and Lucene libraries in the pom.xml file, use Maven to pull in those dependencies and compile the plugin, and then fix any compilation errors.
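As a sketch, pinning the versions in the pom.xml might look like the following; the property and artifact entries here are illustrative assumptions, so check them against this project's actual pom.xml:

<properties>
  <!-- illustrative property names; the project's pom may differ -->
  <solr.version>6.6.6</solr.version>
  <lucene.version>6.6.6</lucene.version>
</properties>
<dependencies>
  <dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-core</artifactId>
    <version>${solr.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>${lucene.version}</version>
  </dependency>
</dependencies>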
See the official Solr guide for registering plugins with Solr. This basically involves placing the jar file into one of the folders that Solr checks for classes and jar files on core reload.
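Alternatively, solrconfig.xml can load the jar directly with a lib directive; the directory below is an assumption and should point at wherever you placed the jar:

<!-- path is an example; point dir at the folder containing the plugin jar -->
<lib dir="/opt/solr/lib/dice-plugins" regex=".*\.jar" />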
An example request handler configuration for solrconfig.xml is shown below, with comments outlining the main parameters:
<requestHandler name="/rf" class="org.dice.solrenhancements.relevancyfeedback.RelevancyFeedbackHandler">
<lst name="defaults">
<str name="omitHeader">true</str>
<str name="wt">json</str>
<str name="indent">true</str>
<!-- Regular query configuration - query parser used when parsing the rf query-->
<str name="defType">lucene</str>
<!-- fields returned -->
<str name="fl">jobTitle,skill,company</str>
<!-- fields to match on-->
<str name="rf.fl">skillFromSkill,extractTitles</str>
<!-- field weights. Note that term weights are normalized so that each field is weighted exactly in this ratio
as different fields can get different numbers of matching terms-->
<str name="rf.qf">skillFromSkill^3 extractTitles^4.5</str>
<int name="rows">10</int>
<!-- How many terms to extract per field (max) -->
<int name="rf.maxflqt">10</int>
<bool name="rf.boost">true</bool>
<!-- normalize the weights for terms in each field (custom to dice rf, not present in solr MLT) -->
<bool name="rf.normflboosts">true</bool>
<!-- Take the raw term frequencies (false) or log of the term frequenies (true) -->
<bool name="rf.logtf">true</bool>
<!-- Minimum should match settings for the rf query - determines what proportion of the terms have to match -->
<!-- See Solr edismax mm parameter for specifics -->
<bool name="rf.mm">25%</bool>
<!-- Returns the top k terms (see regular solr MLT handler) -->
<str name="rf.interestingTerms">details</str>
<!-- Wraps the rf query in a boost query with a multiplicative boost, allowing arbitrary boost functions to be applied -->
<str name="rf.boostfn"></str>
<!-- q parameter - If you want to execute one query, and use the rf query to boost the results (e.g. for personalizing search),
pass the user's query into this parameter. Can take regular query syntax, e.g. q={!edismax df=title qf=.... v=$qq}&qq=Java
Otherwise the regular q parameter is used for the rf query itself (see the rf.q notes below)
-->
<str name="q"></str>
<!-- Query parser to use for the q query, if passed -->
<str name="defType"></str>
<!-- rf.q parameter - Note that the regular q parameter is used only for personalized search scenarios, where you have a main query
and you want to use the generated rf query to boost the main query's documents. Typically the rf.q query is a query that identifies
one or more documents, e.g. rf.q=(id:686867 id:98928980 id:999923), but it can be any query. Note that it will process ALL documents
matched by the rf.q query (as it's intended mainly for looking up docs by id), so use it cautiously.
-->
<str name="rf.q"></str>
<!-- query parser to use for the rf.q query -->
<str name="rf.defType"></str>
<!-- Settings for personalized search - use the regular parameter names for the query parser defined by defType parameter -->
<str name="df">title</str>
<str name="qf"> company_text^0.01 title^12 skill^4 description^0.3</str>
<str name="pf2">company_text^0.01 title^12 skill^4 description^0.6</str>
<!-- Content-based recommendation settings (post a document to the endpoint in a POST request). The stream.body and stream.head are form parameters.
You can send them in a GET request, but a POST handles larger data. If you have really large documents, you will need to change the buffer settings
so that the request doesn't exceed the buffer limits in Solr or your web server.
-->
<!-- Fields used for processing documents posted to the stream.body and stream.head parameters in a POST call -->
<str name="stream.head.fl">title,title_syn</str>
<str name="stream.body.fl">extractSkills,extractTitles</str>
<!-- pass a url in this parameter for Solr to download the webpage, and process the Html using the fields configured in the stream.qf parameters -->
<str name="stream.url"></str>
<!-- Note that we have two different content stream fields to pass over in the POST request. This allows different analyzers to be applied to each.
For instance, we pass the job title into the stream.head field and parse out job titles, while we pass the job description into the stream.body parameter
to parse out skills -->
<!-- Pass the document body in this parameter as a form parameter. Analysed using the stream.body.fl fields-->
<str name="stream.body"></str>
<!-- Pass the second document field in this parameter. Analysed using the stream.head.fl fields-->
<str name="stream.head"></str>
<!-- Specifies a separate set of field weights to apply when processing a document posted to the request handler via the
stream.body and stream.head parameters -->
<str name="stream.qf">extractSkills^4.5 extractTitles^2.25 title^3.0 title_syn^3.0</str>
</lst>
</requestHandler>
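As a sketch of the content-based recommendation usage described in the stream.* comments above, a document could be posted to the handler roughly as follows (the job title and description strings are invented for illustration):

# Sketch only: stream.head/stream.body are form parameters per the config above
curl 'http://localhost:8983/solr/jobs/rf?rows=10&wt=json' \
     --data-urlencode 'stream.head=Senior Java Developer' \
     --data-urlencode 'stream.body=Looking for 5+ years of Java, Spring and SQL experience building web services.'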
http://localhost:8983/solr/jobs/rf?q=id:11f407d319d6cc707437fad874a097c0+id:a2fd2f2e34667d61fadcdcabfd359cf4&rows=10&df=title&fl=title,skills,geocode,city,state&wt=json
{
"match":{
"numFound":2,
"start":0,
"docs":[
{
"id":"a2fd2f2e34667d61fadcdcabfd359cf4",
"title":"Console AAA Sports Video Game Programmer.",
"skills":["Sports Game Experience a plus.",
"2-10 years plus Console AAA Video Game Programming Experience"],
"geocode":"38.124447,-122.55051",
"city":"Novato",
"state":"CA"
},
{
"id":"11f407d319d6cc707437fad874a097c0",
"title":"Game Engineer - Creative and Flexible Work Environment!",
"skills":["3D Math",
"Unity3d",
"C#",
"3D Math - game programming",
"game programming",
"C++",
"Java"],
"geocode":"33.97331,-118.243614",
"city":"Los Angeles",
"state":"CA"
}
]
},
"response":{
"numFound":5333,
"start":0,
"docs":[
{
"title":"Software Design Engineer 3 (Game Developer)",
"skills":["C#",
"C++",
"Unity"],
"geocode":"47.683647,-122.12183",
"city":"Redmond",
"state":"WA"
},
{
"title":"Game Server Engineer - MMO Mobile Gaming Start-Up!",
"skills":["AWS",
"Node.JS",
"pubnub",
"Websockets",
"pubnub - Node.JS",
"Vagrant",
"Linux",
"Git",
"MongoDB",
"Jenkins",
"Docker"],
"geocode":"37.777115,-122.41733",
"city":"San Francisco",
"state":"CA"
},...
]
}
}
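For the personalized-search scenario, the user's query goes into the q parameter and the seed document(s) into rf.q, so that the generated rf query boosts the main query's results. A hedged sketch, reusing a document id from the example above:

http://localhost:8983/solr/jobs/rf?q=java+developer&defType=edismax&rf.q=id:a2fd2f2e34667d61fadcdcabfd359cf4&rows=10&wt=json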
An example request handler configuration for the unsupervised feedback handler is shown below, with comments outlining the main parameters:
<requestHandler name="/ufselect" class="org.dice.solrenhancements.unsupervisedfeedback.UnsupervisedFeedbackHandler">
<lst name="defaults">
<str name="omitHeader">true</str>
<str name="wt">json</str>
<str name="indent">true</str>
<!-- Regular query configuration -->
<str name="defType">edismax</str>
<str name="df">title</str>
<str name="qf">title^1.5 skills^1.25 description^1.1</str>
<str name="pf2">title^3.0 skills^2.5 description^1.5</str>
<str name="mm">1</str>
<str name="q.op">OR</str>
<str name="fl">jobTitle,skills,company</str>
<int name="rows">30</int>
<!-- Unsupervised Feedback (Blind Feedback) query configuration-->
<str name="uf.fl">skillsFromskills,titleFromJobTitle</str>
<!-- How many docs to extract the top terms from -->
<str name="uf.maxdocs">50</str>
<!-- How many terms to extract per field (max) -->
<int name="uf.maxflqt">10</int>
<bool name="uf.boost">true</bool>
<!-- Relative per-field boosts on the extracted terms (similar to the edismax qf parameter) -->
<!-- NOTE: with uf.normflboosts=true, all terms are normalized so that the total importance of each
field on the query is the same, then these relative boosts are applied per field-->
<str name="uf.qf">skillsFromskills^4.5 titleFromJobTitle^6.0</str>
<!-- Returns the top k terms (see regular solr MLT handler) -->
<str name="uf.interestingTerms">details</str>
<!-- unit-length norm all term boosts within a field (recommended) - see talk for details -->
<bool name="uf.normflboosts">true</bool>
<!-- use raw term counts or log term counts? -->
<bool name="uf.logtf">false</bool>
</lst>
</requestHandler>
http://localhost:8983/solr/dicejobscp/ufselect?q=machine+learning+engineer&start=0&rows=10&uf.logtf=false&fl=title,skills,geocode,city,state&fq={!geofilt+sfield=geocode+d=48+pt=39.6955,-105.0841}&wt=json
{
"match":
{
"numFound":7729,
"start":0,
"docs":[
{
"title":"NLP/Machine Learning Engineer",
"skills":["Linux",
"NLP (Natural Language Processing)",
"SQL",
"Bash",
"Python",
"ML (Machine Learning)",
"JavaScript",
"Java"],
"geocode":"42.35819,-71.050674",
"city":"Boston",
"state":"MA"
},
{
"title":"Machine Learning Engineer",
"skills":["machine learning",
"java",
"scala"],
"geocode":"47.60473,-122.32594",
"city":"Seattle",
"state":"WA"
},
{
"title":"Machine Learning Engineer - REMOTE!",
"skills":["Neo4j",
"Hadoop",
"gensim",
"gensim - C++",
"Java",
"R",
"MongoDB",
"elastic search",
"sci-kit learn",
"Python",
"C++"],
"geocode":"37.777115,-122.41733",
"city":"San Francisco",
"state":"CA"
},...
]
}
}
Although this is loosely based on the Solr MLT handler's code and algorithm (which is roughly the Rocchio algorithm), there are some key differences in the algorithm's design. When constructing the MLT query, the MLT handler takes the top k terms across all of the configured fields. If one of your fields has a broader vocabulary than the others, the average document frequency of its terms will be lower than in fields with smaller vocabularies. This means those terms will have higher relative IDF scores and will tend to dominate the top terms selected by the Solr MLT handler. Our request handler instead takes the top k terms per field. It also ensures that, no matter how many terms match in each field (up to the configured limit), every field carries the same weight in the resulting query before the field-specific weights specified in the rf.qf parameter are applied; for example, if one field contributes ten matching terms and another only two, each field's term weights are normalized so that both fields contribute equally before the rf.qf boosts are applied. This is the second issue with the Solr MLT handler that we address.

We also provide a lot of extra functionality: we allow content streams to be passed in, we can match against multiple documents ("more like these" rather than "more like this"), and we can apply the boost query parser to the resulting MLT query so that arbitrary Solr boost functions can be applied (multiplicatively). We also support the mm parameter, so we can require that returned documents match a set percentage of the top terms.
If you wish to use this to perform search personalization, as demonstrated in my Lucene Revolution 2017 talk, you need to pass in the user's current search query using the regular q parameter, while the information used to generate the Rocchio query is passed via the rf.q parameter (when using documents to generate the Rocchio query) or via the content stream parameters (rf.stream.head and rf.stream.body, which take strings of content). Note, however, that due to the normalization process the algorithm applies, the boosts applied to the terms in the Rocchio query are not of comparable magnitude to the weights of the terms in the user's query. You will therefore need to experiment with different rf.qf values until you find the right level of query influence for your search configuration. Also, given that the Rocchio query generated for a user will likely stay the same across that user's search session (depending on your use case, of course), a more efficient way to implement personalization is to use the rf handler to generate the Rocchio query once, when the user logs in, store it, and then apply it as a boost query in your regular searches. The handler returns the generated Rocchio query in the rf.query parameter of the response. If you only want to use the handler to obtain the query, and not to execute a search, you can set the rows parameter to 0. If you set rf.interestingTerms=details, you can also iterate over the set of "interesting terms" returned by the algorithm, along with their weights, and use these to construct your own boost query.
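For instance, a request along these lines (reusing a document id from the earlier example) should return the generated Rocchio query and its weighted terms without executing a search:

http://localhost:8983/solr/jobs/rf?q=id:a2fd2f2e34667d61fadcdcabfd359cf4&rows=0&rf.interestingTerms=details&wt=json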
Aside from ensuring this works with more versions of Solr (please leave feedback about which versions you would like), there are a number of possible enhancements:
If you have feature requests, please submit them to the issues list. That is also a good place to post questions, but you can also reach me at [email protected] if I don't get back to you here.