Dice.com's Solr plugins for performing personalized search and recommendations (via the Relevancy Feedback handler) and conceptual/semantic search (via the Unsupervised Feedback handler).
Pre-built jar files can be found in the ./target folder. The project contains a Maven pom.xml file, which can also be used to build it from source.
If you need a specific version of Solr, please create a GitHub issue and I will see what I can do. To compile it manually for a particular version, update the versions of the Solr and Lucene libraries in the pom.xml file, use Maven to pull in those dependencies, and then compile the plugin with Maven, fixing any compilation errors.
See the official Solr guide for registering plugins with Solr. This essentially involves placing the JAR file into one of the folders that Solr checks for classes and JAR files on core reload.
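As a minimal sketch (the directory path here is an assumption; point it at wherever you copied the jar), a <lib> directive in solrconfig.xml makes the jar visible to the core:

```xml
<!-- solrconfig.xml: load the plugin jar(s) on core (re)load.
     The dir below is illustrative; adjust it to your installation. -->
<lib dir="/opt/solr/dice-solr-plugins" regex=".*\.jar" />
```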
An example request handler configuration for solrconfig.xml is shown below, with comments outlining the main parameters:
<requestHandler name="/rf" class="org.dice.solrenhancements.relevancyfeedback.RelevancyFeedbackHandler">
<lst name="defaults">
<str name="omitHeader">true</str>
<str name="wt">json</str>
<str name="indent">true</str>
<!-- Regular query configuration - query parser used when parsing the rf query-->
<str name="defType">lucene</str>
<!-- fields returned -->
<str name="fl">jobTitle,skill,company</str>
<!-- fields to match on-->
<str name="rf.fl">skillFromSkill,extractTitles</str>
<!-- field weights. Note that term weights are normalized so that each field is weighted exactly in this ratio
as different fields can get different numbers of matching terms-->
<str name="rf.qf">skillFromSkill^3 extractTitles^4.5</str>
<int name="rows">10</int>
<!-- How many terms to extract per field (max) -->
<int name="rf.maxflqt">10</int>
<bool name="rf.boost">true</bool>
<!-- normalize the weights for terms in each field (custom to dice rf, not present in solr MLT) -->
<bool name="rf.normflboosts">true</bool>
<!-- Take the raw term frequencies (false) or the log of the term frequencies (true) -->
<bool name="rf.logtf">true</bool>
<!-- Minimum should match settings for the rf query - determines what proportion of the terms have to match -->
<!-- See Solr edismax mm parameter for specifics -->
<str name="rf.mm">25%</str>
<!-- Returns the top k terms (see regular solr MLT handler) -->
<str name="rf.interestingTerms">details</str>
<!-- Turns the rf query into a boost query using a multiplicative boost, allowing arbitrary boost functions to be applied -->
<str name="rf.boostfn"></str>
<!-- q parameter - If you want to execute one query, and use the rf query to boost the results (e.g. for personalizing search),
pass the user's query into this parameter. Can take regular query syntax e.g. rf.q={!edismax df=title qf=.... v=$qq}&qq=Java
The regular q parameter is reserved for the rf query (see above)
-->
<str name="q"></str>
<!-- Query parser to use for the q query, if passed -->
<str name="defType"></str>
<!-- rf.q parameter - Note that the regular q parameter is used only for personalized search scenarios, where you have a main query
and you want to use the generated rf query to boost the main query's documents. Typically the rf.q query is a query that identifies
one or more documents, e.g. rf.q=(id:686867 id:98928980 id:999923). But it can be any query. Note that it will process ALL documents
matched by the rf.q query (as it's intended mainly for looking up docs by id), so use cautiously.
-->
<str name="rf.q"></str>
<!-- query parser to use for the rf.q query -->
<str name="rf.defType"></str>
<!-- Settings for personalized search - use the regular parameter names for the query parser defined by defType parameter -->
<str name="df">title</str>
<str name="qf"> company_text^0.01 title^12 skill^4 description^0.3</str>
<str name="pf2">company_text^0.01 title^12 skill^4 description^0.6</str>
<!-- Content based recommendations settings (post a document to the endpoint in a POST request). The stream.body and stream.head are form parameters
You can send them in a GET request, but a POST handles larger data. If you have really large documents, you will need to change the buffer settings
so that the request doesn't blow the buffer limits in Solr or your web server.
-->
<!-- Fields used for processing documents posted to the stream.body and stream.head parameters in a POST call -->
<str name="stream.head.fl">title,title_syn</str>
<str name="stream.body.fl">extractSkills,extractTitles</str>
<!-- pass a url in this parameter for Solr to download the webpage, and process the Html using the fields configured in the stream.qf parameters -->
<str name="stream.url"></str>
<!-- Note that we have two different content stream fields to pass over in the POST request. This allows different analyzers to be applied to each.
For instance, we pass the job title into the stream.head field and parse out job titles, while we pass the job description to the stream.body parameter
to parse out skills -->
<!-- Pass the document body in this parameter as a form parameter. Analysed using the stream.body.fl fields-->
<str name="stream.body"></str>
<!-- Pass the second document field in this parameter. Analysed using the stream.head.fl fields-->
<str name="stream.head"></str>
<!-- Specifies a separate set of field weights to apply when processing a document posted to the request handler via the
stream.body and stream.head parameters -->
<str name="stream.qf">extractSkills^4.5 extractTitles^2.25 title^3.0 title_syn^3.0</str>
</lst>
</requestHandler>
http://localhost:8983/solr/jobs/rf?q=id:11f407d319d6cc707437fad874a097c0+id:a2fd2f2e34667d61fadcdcabfd359cf4&rows=10&df=title&fl=title,skills,geocode,city,state&wt=json
{
"match":{
"numFound":2,
"start":0,
"docs":[
{
"id":"a2fd2f2e34667d61fadcdcabfd359cf4",
"title":"Console AAA Sports Video Game Programmer.",
"skills":["Sports Game Experience a plus.",
"2-10 years plus Console AAA Video Game Programming Experience"],
"geocode":"38.124447,-122.55051",
"city":"Novato",
"state":"CA"
},
{
"id":"11f407d319d6cc707437fad874a097c0",
"title":"Game Engineer - Creative and Flexible Work Environment!",
"skills":["3D Math",
"Unity3d",
"C#",
"3D Math - game programming",
"game programming",
"C++",
"Java"],
"geocode":"33.97331,-118.243614",
"city":"Los Angeles",
"state":"CA"
}
]
},
"response":{
"numFound":5333,
"start":0,
"docs":[
{
"title":"Software Design Engineer 3 (Game Developer)",
"skills":["C#",
"C++",
"Unity"],
"geocode":"47.683647,-122.12183",
"city":"Redmond",
"state":"WA"
},
{
"title":"Game Server Engineer - MMO Mobile Gaming Start-Up!",
"skills":["AWS",
"Node.JS",
"pubnub",
"Websockets",
"pubnub - Node.JS",
"Vagrant",
"Linux",
"Git",
"MongoDB",
"Jenkins",
"Docker"],
"geocode":"37.777115,-122.41733",
"city":"San Francisco",
"state":"CA"
},...
]
}
}
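A request like the one above can be assembled with Python's standard library. This is an illustrative sketch; the core name (jobs) and the field names come from the example URL and may differ in your schema:

```python
from urllib.parse import urlencode

def build_rf_url(core_url, doc_ids, rows=10, fl=None):
    """Build a Relevancy Feedback request that looks up the source
    documents by id (the typical 'more like these' usage)."""
    params = {
        # the q parameter holds the rf query: one or more id lookups
        "q": " ".join(f"id:{d}" for d in doc_ids),
        "rows": rows,
        "df": "title",
        "wt": "json",
    }
    if fl:
        params["fl"] = ",".join(fl)
    return f"{core_url}/rf?{urlencode(params)}"

url = build_rf_url(
    "http://localhost:8983/solr/jobs",
    ["11f407d319d6cc707437fad874a097c0", "a2fd2f2e34667d61fadcdcabfd359cf4"],
    fl=["title", "skills", "geocode", "city", "state"],
)
```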
An example request handler configuration for the Unsupervised Feedback handler in solrconfig.xml is shown below, with comments outlining the main parameters:
<requestHandler name="/ufselect" class="org.dice.solrenhancements.unsupervisedfeedback.UnsupervisedFeedbackHandler">
<lst name="defaults">
<str name="omitHeader">true</str>
<str name="wt">json</str>
<str name="indent">true</str>
<!-- Regular query configuration -->
<str name="defType">edismax</str>
<str name="df">title</str>
<str name="qf">title^1.5 skills^1.25 description^1.1</str>
<str name="pf2">title^3.0 skills^2.5 description^1.5</str>
<str name="mm">1</str>
<str name="q.op">OR</str>
<str name="fl">jobTitle,skills,company</str>
<int name="rows">30</int>
<!-- Unsupervised Feedback (Blind Feedback) query configuration-->
<str name="uf.fl">skillsFromskills,titleFromJobTitle</str>
<!-- How many docs to extract the top terms from -->
<str name="uf.maxdocs">50</str>
<!-- How many terms to extract per field (max) -->
<int name="uf.maxflqt">10</int>
<bool name="uf.boost">true</bool>
<!-- Relative per-field boosts on the extracted terms (similar to the edismax qf parameter) -->
<!-- NOTE: with uf.normflboosts=true, all terms are normalized so that the total importance of each
field on the query is the same, then these relative boosts are applied per field-->
<str name="uf.qf">skillsFromskills^4.5 titleFromJobTitle^6.0</str>
<!-- Returns the top k terms (see regular solr MLT handler) -->
<str name="uf.interestingTerms">details</str>
<!-- unit-length norm all term boosts within a field (recommended) - see talk for details -->
<bool name="uf.normflboosts">true</bool>
<!-- Use raw term counts or log term counts? -->
<bool name="uf.logtf">false</bool>
</lst>
</requestHandler>
http://localhost:8983/solr/dicejobscp/ufselect?q=machine+learning+engineer&start=0&rows=10&uf.logtf=false&fl=title,skills,geocode,city,state&fq={!geofilt+sfield=geocode+d=48+pt=39.6955,-105.0841}&wt=json
{
"match":
{
"numFound":7729,
"start":0,
"docs":[
{
"title":"NLP/Machine Learning Engineer",
"skills":["Linux",
"NLP (Natural Language Processing)",
"SQL",
"Bash",
"Python",
"ML (Machine Learning)",
"JavaScript",
"Java"],
"geocode":"42.35819,-71.050674",
"city":"Boston",
"state":"MA"
},
{
"title":"Machine Learning Engineer",
"skills":["machine learning",
"java",
"scala"],
"geocode":"47.60473,-122.32594",
"city":"Seattle",
"state":"WA"
},
{
"title":"Machine Learning Engineer - REMOTE!",
"skills":["Neo4j",
"Hadoop",
"gensim",
"gensim - C++",
"Java",
"R",
"MongoDB",
"elastic search",
"sci-kit learn",
"Python",
"C++"],
"geocode":"37.777115,-122.41733",
"city":"San Francisco",
"state":"CA"
},...
]
}
}
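Likewise, the unsupervised feedback request above (a free-text query plus a geofilt filter) can be sketched as follows; the helper is hypothetical, and the core and field names are taken from the example:

```python
from urllib.parse import urlencode

def build_uf_url(core_url, user_query, pt, d_km=48, rows=10, fl=None):
    """Build an Unsupervised (blind) Feedback request with a geofilt
    filter query restricting results to within d_km km of point pt."""
    params = {
        "q": user_query,
        "start": 0,
        "rows": rows,
        "uf.logtf": "false",
        # sfield assumed to be the same geocode field shown in the results
        "fq": f"{{!geofilt sfield=geocode d={d_km} pt={pt}}}",
        "wt": "json",
    }
    if fl:
        params["fl"] = ",".join(fl)
    return f"{core_url}/ufselect?{urlencode(params)}"

url = build_uf_url(
    "http://localhost:8983/solr/dicejobscp",
    "machine learning engineer",
    pt="39.6955,-105.0841",
    fl=["title", "skills", "geocode", "city", "state"],
)
```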
Although it is loosely based on the code and algorithm of the Solr MLT handler (which is essentially the Rocchio algorithm), there are some key differences in the algorithm's design. When constructing the MLT query, the MLT handler takes the top k terms across all configured fields. If one of your fields has a much larger vocabulary than the others, the average document frequency of its terms will be lower than in fields with smaller vocabularies. This means those terms will have higher relative IDF scores and will tend to dominate the top terms selected by the Solr MLT handler. Our request handler instead takes the top k terms per field. It also ensures that, no matter how many terms match in each field (up to the configured limit), each field carries equal weight in the resulting query before the field-specific weights specified in the rf.qf parameter are applied. This is the second issue with the Solr MLT handler that we address. We also provide a lot of extra functionality: we allow content streams to be passed in, matching against multiple documents ("more like these" rather than "more like this"), and applying the boost query parser to the resulting MLT query so that any arbitrary Solr boost function can be applied (multiplicatively). We also support the mm parameter, so we can require that returned documents match at least a set percentage of the top terms.
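The per-field term selection and normalization described above can be illustrated with a short sketch. This is not the plugin's actual code; it simply mimics the interplay of rf.maxflqt (top-k per field), rf.normflboosts (unit-length normalization within a field), and rf.qf (per-field boosts):

```python
import math

def select_rf_terms(field_term_scores, max_terms_per_field=10,
                    field_boosts=None, norm_field_boosts=True):
    """field_term_scores: {field: {term: tf.idf score}}.
    Returns {field: {term: final query weight}}."""
    field_boosts = field_boosts or {}
    query_terms = {}
    for field, scores in field_term_scores.items():
        # per-field top-k (rf.maxflqt), instead of top-k across all fields
        top = sorted(scores.items(), key=lambda kv: -kv[1])[:max_terms_per_field]
        if norm_field_boosts:
            # unit-length norm (rf.normflboosts): stops fields with large
            # vocabularies (and hence high-IDF terms) dominating the query
            norm = math.sqrt(sum(w * w for _, w in top)) or 1.0
        else:
            norm = 1.0
        boost = field_boosts.get(field, 1.0)  # rf.qf weight for this field
        query_terms[field] = {t: (w / norm) * boost for t, w in top}
    return query_terms

# Example using the rf.qf weights from the config above (scores invented)
rf_query = select_rf_terms(
    {"skillFromSkill": {"c++": 3.1, "unity": 2.4, "java": 0.9},
     "extractTitles": {"game programmer": 4.2, "engineer": 1.3}},
    field_boosts={"skillFromSkill": 3.0, "extractTitles": 4.5},
)
```

After normalization, the sum of squared term weights in each field equals the square of that field's rf.qf boost, so the fields influence the query in exactly the configured ratio regardless of how many terms each contributed.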
If you wish to use this to perform search personalization, as demonstrated in my Lucene Revolution 2017 talk, you need to pass in the user's current search query using the regular q parameter, and the information used to generate the Rocchio query is passed via the rf.q parameter (when using documents to generate the Rocchio query) or via the content stream parameters (rf.stream.head and rf.stream.body, which take strings of content). Note, however, that because of the normalization process the algorithm applies, the boosts on the terms in the Rocchio query carry no comparable weighting to the terms in the user's query. You will therefore need to experiment with different rf.qf values until you find the right level of query influence for your search configuration. Also, given that the Rocchio query generated for a user will likely stay the same throughout that user's search session (depending on your use case, of course), a more efficient way to implement personalization with this handler is to use the RF handler to generate the Rocchio query once, when the user logs in, and then embed it as a boost query in your regular searches. The handler returns the generated Rocchio query in the rf.query parameter of the response. If you only want to use the handler to obtain that query (and not execute a search), you can set the rows parameter to 0. You can also iterate over the set of "interesting terms" returned by the algorithm, along with their weights, if you set rf.interestingTerms=details, and use this information to build your own boost query.
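The two-step workflow above can be sketched as follows. This is a hypothetical sketch: the core URL, the qf weights, and the use of bq (an additive edismax boost) are assumptions, not part of the plugin's API; the plugin itself also supports multiplicative boosts via rf.boostfn:

```python
from urllib.parse import urlencode

SOLR_CORE = "http://localhost:8983/solr/jobs"  # assumed core URL

def rocchio_fetch_url(doc_ids):
    """Step 1 (e.g. at login): ask the /rf handler for the Rocchio query
    only, without executing a search, by setting rows=0. The generated
    query comes back in the rf.query field of the response."""
    params = {
        "q": " ".join(f"id:{d}" for d in doc_ids),
        "rows": 0,
        "rf.interestingTerms": "details",
        "wt": "json",
    }
    return f"{SOLR_CORE}/rf?{urlencode(params)}"

def personalized_search_url(user_query, rocchio_query, rows=10):
    """Step 2: run the user's regular searches, embedding the stored
    Rocchio query as a boost query (bq gives an additive boost; a
    multiplicative boost could be applied via the boost qparser)."""
    params = {
        "q": user_query,
        "defType": "edismax",
        "qf": "title^1.5 skills^1.25 description^1.1",  # example weights
        "bq": rocchio_query,
        "rows": rows,
        "wt": "json",
    }
    return f"{SOLR_CORE}/select?{urlencode(params)}"
```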
In addition to making it work with more versions of Solr (please leave feedback about which versions you would like), there are a number of possible enhancements:
If you have a feature request, please submit it to the issues list. That is also a good place to post questions, but if you don't hear back from me there, you can also contact me at [email protected].