Dice.com's Solr plugins for performing personalized search and recommendations (via the relevancy feedback plugin) and conceptual / semantic search (via the unsupervised feedback plugin).
A pre-built jar file can be found in the ./target folder. The project contains a Maven pom.xml file, which can also be used to build it from source.
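To build from source, the standard Maven invocation applies (assuming Maven is installed):

mvn clean package

The jar is written to the ./target folder.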
If you need this for a specific version of Solr, please create a GitHub issue and I will see what I can do. To compile it manually for a particular version, update the versions of the Solr and Lucene libraries in the pom.xml file, use Maven to pull in those dependencies and compile the plugin, and then fix any compilation errors.
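As a sketch, pinning the versions in the pom.xml might look like the following; the property and artifact entries here are illustrative assumptions, so check them against this project's actual pom.xml:

<properties>
  <!-- illustrative property names; the project's pom may differ -->
  <solr.version>6.6.6</solr.version>
  <lucene.version>6.6.6</lucene.version>
</properties>
<dependencies>
  <dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-core</artifactId>
    <version>${solr.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>${lucene.version}</version>
  </dependency>
</dependencies>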
See the official Solr guide for registering plugins with Solr. This basically involves placing the jar file into one of the folders that Solr checks for classes and jar files on core reload.
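Alternatively, solrconfig.xml can load the jar directly with a lib directive; the directory below is an assumption and should point at wherever you placed the jar:

<!-- path is an example; point dir at the folder containing the plugin jar -->
<lib dir="/opt/solr/lib/dice-plugins" regex=".*\.jar" />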
An example request handler configuration for solrconfig.xml is shown below, with comments outlining the main parameters:
<requestHandler name="/rf" class="org.dice.solrenhancements.relevancyfeedback.RelevancyFeedbackHandler">
<lst name="defaults">
<str name="omitHeader">true</str>
<str name="wt">json</str>
<str name="indent">true</str>
<!-- Regular query configuration - query parser used when parsing the rf query-->
<str name="defType">lucene</str>
<!-- fields returned -->
<str name="fl">jobTitle,skill,company</str>
<!-- fields to match on-->
<str name="rf.fl">skillFromSkill,extractTitles</str>
<!-- field weights. Note that term weights are normalized so that each field is weighted exactly in this ratio
as different fields can get different numbers of matching terms-->
<str name="rf.qf">skillFromSkill^3 extractTitles^4.5</str>
<int name="rows">10</int>
<!-- How many terms to extract per field (max) -->
<int name="rf.maxflqt">10</int>
<bool name="rf.boost">true</bool>
<!-- normalize the weights for terms in each field (custom to dice rf, not present in solr MLT) -->
<bool name="rf.normflboosts">true</bool>
<!-- Take the raw term frequencies (false) or log of the term frequenies (true) -->
<bool name="rf.logtf">true</bool>
<!-- Minimum should match settings for the rf query - determines what proportion of the terms have to match -->
<!-- See Solr edismax mm parameter for specifics -->
<bool name="rf.mm">25%</bool>
<!-- Returns the top k terms (see regular solr MLT handler) -->
<str name="rf.interestingTerms">details</str>
<!-- Wraps the rf query in a boost query with a multiplicative boost, allowing arbitrary boost functions to be applied -->
<str name="rf.boostfn"></str>
<!-- q parameter - If you want to execute one query, and use the rf query to boost the results (e.g. for personalizing search),
pass the user's query into this parameter. Can take regular query syntax, e.g. q={!edismax df=title qf=.... v=$qq}&qq=Java
Otherwise the regular q parameter is used for the rf query itself (see the rf.q notes below)
-->
<str name="q"></str>
<!-- Query parser to use for the q query, if passed -->
<str name="defType"></str>
<!-- rf.q parameter - Note that the regular q parameter is used only for personalized search scenarios, where you have a main query
and you want to use the generated rf query to boost the main query's documents. Typically the rf.q query is a query that identifies
one or more documents, e.g. rf.q=(id:686867 id:98928980 id:999923), but it can be any query. Note that it will process ALL documents
matched by the rf.q query (as it's intended mainly for looking up docs by id), so use it cautiously.
-->
<str name="rf.q"></str>
<!-- query parser to use for the rf.q query -->
<str name="rf.defType"></str>
<!-- Settings for personalized search - use the regular parameter names for the query parser defined by defType parameter -->
<str name="df">title</str>
<str name="qf"> company_text^0.01 title^12 skill^4 description^0.3</str>
<str name="pf2">company_text^0.01 title^12 skill^4 description^0.6</str>
<!-- Content-based recommendation settings (post a document to the endpoint in a POST request). The stream.body and stream.head are form parameters.
You can send them in a GET request, but a POST handles larger data. If you have really large documents, you will need to change the buffer settings
so that the request doesn't exceed the buffer limits in Solr or your web server.
-->
<!-- Fields used for processing documents posted to the stream.body and stream.head parameters in a POST call -->
<str name="stream.head.fl">title,title_syn</str>
<str name="stream.body.fl">extractSkills,extractTitles</str>
<!-- pass a url in this parameter for Solr to download the webpage, and process the Html using the fields configured in the stream.qf parameters -->
<str name="stream.url"></str>
<!-- Note that we have two different content stream fields to pass over in the POST request. This allows different analyzers to be applied to each.
For instance, we pass the job title into the stream.head field and parse out job titles, while we pass the job description into the stream.body parameter
to parse out skills -->
<!-- Pass the document body in this parameter as a form parameter. Analysed using the stream.body.fl fields-->
<str name="stream.body"></str>
<!-- Pass the second document field in this parameter. Analysed using the stream.head.fl fields-->
<str name="stream.head"></str>
<!-- Specifies a separate set of field weights to apply when processing a document posted to the request handler via the
stream.body and stream.head parameters -->
<str name="stream.qf">extractSkills^4.5 extractTitles^2.25 title^3.0 title_syn^3.0</str>
</lst>
</requestHandler>
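As a sketch of the content-based recommendation usage described in the stream.* comments above, a document could be posted to the handler roughly as follows (the job title and description strings are invented for illustration):

# Sketch only: stream.head/stream.body are form parameters per the config above
curl 'http://localhost:8983/solr/jobs/rf?rows=10&wt=json' \
     --data-urlencode 'stream.head=Senior Java Developer' \
     --data-urlencode 'stream.body=Looking for 5+ years of Java, Spring and SQL experience building web services.'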
http://localhost:8983/solr/jobs/rf?q=id:11f407d319d6cc707437fad874a097c0+id:a2fd2f2e34667d61fadcdcabfd359cf4&rows=10&df=title&fl=title,skills,geocode,city,state&wt=json
{
"match":{
"numFound":2,
"start":0,
"docs":[
{
"id":"a2fd2f2e34667d61fadcdcabfd359cf4",
"title":"Console AAA Sports Video Game Programmer.",
"skills":["Sports Game Experience a plus.",
"2-10 years plus Console AAA Video Game Programming Experience"],
"geocode":"38.124447,-122.55051",
"city":"Novato",
"state":"CA"
},
{
"id":"11f407d319d6cc707437fad874a097c0",
"title":"Game Engineer - Creative and Flexible Work Environment!",
"skills":["3D Math",
"Unity3d",
"C#",
"3D Math - game programming",
"game programming",
"C++",
"Java"],
"geocode":"33.97331,-118.243614",
"city":"Los Angeles",
"state":"CA"
}
]
},
"response":{
"numFound":5333,
"start":0,
"docs":[
{
"title":"Software Design Engineer 3 (Game Developer)",
"skills":["C#",
"C++",
"Unity"],
"geocode":"47.683647,-122.12183",
"city":"Redmond",
"state":"WA"
},
{
"title":"Game Server Engineer - MMO Mobile Gaming Start-Up!",
"skills":["AWS",
"Node.JS",
"pubnub",
"Websockets",
"pubnub - Node.JS",
"Vagrant",
"Linux",
"Git",
"MongoDB",
"Jenkins",
"Docker"],
"geocode":"37.777115,-122.41733",
"city":"San Francisco",
"state":"CA"
},...
]
}
}
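For the personalized-search scenario, the user's query goes into the q parameter and the seed document(s) into rf.q, so that the generated rf query boosts the main query's results. A hedged sketch, reusing a document id from the example above:

http://localhost:8983/solr/jobs/rf?q=java+developer&defType=edismax&rf.q=id:a2fd2f2e34667d61fadcdcabfd359cf4&rows=10&wt=json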
An example request handler configuration for the unsupervised feedback handler is shown below, with comments outlining the main parameters:
<requestHandler name="/ufselect" class="org.dice.solrenhancements.unsupervisedfeedback.UnsupervisedFeedbackHandler">
<lst name="defaults">
<str name="omitHeader">true</str>
<str name="wt">json</str>
<str name="indent">true</str>
<!-- Regular query configuration -->
<str name="defType">edismax</str>
<str name="df">title</str>
<str name="qf">title^1.5 skills^1.25 description^1.1</str>
<str name="pf2">title^3.0 skills^2.5 description^1.5</str>
<str name="mm">1</str>
<str name="q.op">OR</str>
<str name="fl">jobTitle,skills,company</str>
<int name="rows">30</int>
<!-- Unsupervised Feedback (Blind Feedback) query configuration-->
<str name="uf.fl">skillsFromskills,titleFromJobTitle</str>
<!-- How many docs to extract the top terms from -->
<str name="uf.maxdocs">50</str>
<!-- How many terms to extract per field (max) -->
<int name="uf.maxflqt">10</int>
<bool name="uf.boost">true</bool>
<!-- Relative per-field boosts on the extracted terms (similar to the edismax qf parameter) -->
<!-- NOTE: with uf.normflboosts=true, all terms are normalized so that the total importance of each
field on the query is the same, then these relative boosts are applied per field-->
<str name="uf.qf">skillsFromskills^4.5 titleFromJobTitle^6.0</str>
<!-- Returns the top k terms (see regular solr MLT handler) -->
<str name="uf.interestingTerms">details</str>
<!-- unit-length norm all term boosts within a field (recommended) - see talk for details -->
<bool name="uf.normflboosts">true</bool>
<!-- use raw term counts or log term counts? -->
<bool name="uf.logtf">false</bool>
</lst>
</requestHandler>
http://localhost:8983/solr/dicejobscp/ufselect?q=machine+learning+engineer&start=0&rows=10&uf.logtf=false&fl=title,skills,geocode,city,state&fq={!geofilt+sfield=geocode+d=48+pt=39.6955,-105.0841}&wt=json
{
"match":
{
"numFound":7729,
"start":0,
"docs":[
{
"title":"NLP/Machine Learning Engineer",
"skills":["Linux",
"NLP (Natural Language Processing)",
"SQL",
"Bash",
"Python",
"ML (Machine Learning)",
"JavaScript",
"Java"],
"geocode":"42.35819,-71.050674",
"city":"Boston",
"state":"MA"
},
{
"title":"Machine Learning Engineer",
"skills":["machine learning",
"java",
"scala"],
"geocode":"47.60473,-122.32594",
"city":"Seattle",
"state":"WA"
},
{
"title":"Machine Learning Engineer - REMOTE!",
"skills":["Neo4j",
"Hadoop",
"gensim",
"gensim - C++",
"Java",
"R",
"MongoDB",
"elastic search",
"sci-kit learn",
"Python",
"C++"],
"geocode":"37.777115,-122.41733",
"city":"San Francisco",
"state":"CA"
},...
]
}
}
Although this is loosely based on the Solr MLT handler's code and algorithm (which is roughly the Rocchio algorithm), there are some key differences in the algorithm's design. When constructing the MLT query, the MLT handler takes the top k terms across all of the configured fields. If one of your fields has a broader vocabulary than the others, the average document frequency of its terms will be lower than in fields with smaller vocabularies. This means those terms will have higher relative IDF scores and will tend to dominate the top terms selected by the Solr MLT handler. Our request handler instead takes the top k terms per field. It also ensures that, no matter how many terms match in each field (up to the configured limit), every field carries the same weight in the resulting query before the field-specific weights specified in the rf.qf parameter are applied; for example, if one field contributes ten matching terms and another only two, each field's term weights are normalized so that both fields contribute equally before the rf.qf boosts are applied. This is the second issue with the Solr MLT handler that we address.

We also provide a lot of extra functionality: we allow content streams to be passed in, we can match against multiple documents ("more like these" rather than "more like this"), and we can apply the boost query parser to the resulting MLT query so that arbitrary Solr boost functions can be applied (multiplicatively). We also support the mm parameter, so we can require that returned documents match a set percentage of the top terms.
If you wish to use this to perform search personalization, as demonstrated in my Lucene Revolution 2017 talk, you need to pass in the user's current search query using the regular q parameter, while the information used to generate the Rocchio query is passed via the rf.q parameter (when using documents to generate the Rocchio query) or via the content stream parameters (rf.stream.head and rf.stream.body, which take strings of content). Note, however, that due to the normalization process the algorithm applies, the boosts applied to the terms in the Rocchio query are not of comparable magnitude to the weights of the terms in the user's query. You will therefore need to experiment with different rf.qf values until you find the right level of query influence for your search configuration. Also, given that the Rocchio query generated for a user will likely stay the same across that user's search session (depending on your use case, of course), a more efficient way to implement personalization is to use the rf handler to generate the Rocchio query once, when the user logs in, store it, and then apply it as a boost query in your regular searches. The handler returns the generated Rocchio query in the rf.query parameter of the response. If you only want to use the handler to obtain the query, and not to execute a search, you can set the rows parameter to 0. If you set rf.interestingTerms=details, you can also iterate over the set of "interesting terms" returned by the algorithm, along with their weights, and use these to construct your own boost query.
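For instance, a request along these lines (reusing a document id from the earlier example) should return the generated Rocchio query and its weighted terms without executing a search:

http://localhost:8983/solr/jobs/rf?q=id:a2fd2f2e34667d61fadcdcabfd359cf4&rows=0&rf.interestingTerms=details&wt=json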
Aside from ensuring this works with more versions of Solr (please leave feedback about which versions you would like), there are a number of possible enhancements:
If you have feature requests, please submit them to the issues list. That is also a good place to post questions, but you can also reach me at [email protected] if I don't get back to you here.