I remember that the java teacher once said an interview question on Baidu, which roughly means "There are 10,000 disordered records, how to quickly find the records you want from them." This is equivalent to a simple search engine. Recently, I have already realized this in the work I have sorted out this year. Today I will share with you further abstract it.
First write specific implementation code, and write specific implementation ideas and logic after the code.
Beans used for sorting during search
/** *@Description: */ package cn.lulei.search.engine.model; public class SortBean { private String id; private int times; public String getId() { return id; } public void setId(String id) { this.id = id; } public int getTimes() { return times; } public void setTimes(int times) { this.times = times; } } Constructed search data structure and simple search algorithm
/** *@Description: */ package cn.lulei.search.engine; import java.util.ArrayList; import java.util.Collections; import java.util.Comparator; import java.util.HashMap; import java.util.HashSet; import java.util.List; import cn.lulei.search.engine.model.SortBean; public class SerachBase { //details Store detailed information of search objects, where key is the unique identifier to distinguish Object private HashMap<String, Object> details = new HashMap<String, Object>(); //For keywords participating in the search, the sparse array storage used here can also be stored using HashMap. The definition format is as follows//private static HashMap<Integer, HashSet<String>> keySearch = new HashMap<Integer, HashSet<String>>(); //The key value of the HashMap medium is equivalent to the subscript in the sparse array, and the value is equivalent to the value of the sparse array at this position private final static int maxLength = Character.MAX_VALUE; @SuppressWarnings("unchecked") private HashSet<String>[] keySearch = new HashSet[maxLength]; /** *@Description: Implement singleton mode, using Initialization on Demand Holder to load*@Version:1.1.0 */ private static class lazyLoadSerachBase { private static final SerachBase serachBase = new SerachBase(); } /** * The singleton mode is set to private function here*/ private SerachBase() { } /** * @return * @Description: Get singleton*/ public static SerachBase getSerachBase() { return lazyLoadSerachBase.serachBase; } /** * @param id * @return * @Description: Get details based on id */ public Object getObject(String id) { return details.get(id); } /** * @param ids * @return * @Description: Get details based on ids, separate the ids with ","*/ public List<Object> getObjects(String ids) { if (ids == null || "".equals(ids)) { return null; } List<Object> objs = new ArrayList<Object>(); String[] idArray = ids.split(","); for (String id : idArray) { objs.add(getObject(id)); } return objs; } /** * @param key * @return * @Description: Find the corresponding id according to the search terms, and use "," to split between ids*/ public String getIds(String key) { if (key == null || "".equals(key)) { return null; } //Find//idTimes stores whether each character in the search term appears in the id in the id. idTimes = new HashMap<String, Integer> idTimes = new HashMap<String, Integer>(); //ids stores the id of the characters in the search term. HashSet<String> ids = new HashSet<String>(); //Finding for (int i = 0; i < key.length(); i++) { int at = key.charAt(i); //There is no corresponding character in the search term. Then match the next character if (keySearch[at] == null) { continue; } for (Object obj: keySearch[at].toArray()) { String id = (String) obj; int times = 1; if (ids.contains(id)) { times += idTimes.get(id); idTimes.put(id, times); } else { ids.add(id); idTimes.put(id, times); } } } //Sort with array List<SortBean> sortBeans = new ArrayList<SortBean>(); for (String id : ids) { SortBean sortBean = new SortBean(); sortBeans.add(sortBean); sortBean.setId(id); sortBean.setTimes(idTimes.get(id)); } Collections.sort(sortBeans, new Comparator<SortBean>(){ public int compare(SortBean o1, SortBean o2){ return o2.getTimes() - o1.getTimes(); } }); //Construct return string StringBuffer sb = new StringBuffer(); for (SortBean sortBean : sortBeans) { sb.append(sortBean.getId()); sb.append(","); } //Release the resource idTimes.clear(); idTimes = null; ids.clear(); ids = null; sortBeans.clear(); sortBeans = null; //Return return sb.toString(); } /** * @param id * @param searchKey * @param obj * @Description: Add search history*/ public void add(String id, String searchKey, Object obj) { //The parameters are partially empty and do not load if (id == null || searchKey == null || obj == null) { return; } //Save object details.put(id, obj); //Save search term addSearchKey(id, searchKey); } /** * @param id * @param searchKey * @Description: Add search terms to the search domain*/ private void addSearchKey(String id, String searchKey) { //Separate parameters are empty and not loaded//This is a private method, and the following judgment can be made, but for the sake of design specifications, if (id == null || searchKey == null) { return; } //The following is character participle. Here you can also use other mature word participles for (int i = 0; i < searchKey.length(); i++) { //The value of at is equivalent to the subscript of the array, and the HashSet composed of id is equivalent to the value of the array int at = searchKey.charAt(i); if (keySearch[at] == null) { HashSet<String> value = new HashSet<String>(); keySearch[at] = value; } keySearch[at].add(id); } } } Test cases:
/** *@Description: */ package cn.lulei.search.engine.test; import java.util.List; import cn.lulei.search.engine.SerachBase; public class Test { public static void main(String[] args) { // TODO Auto-generated method stub SerachBase serachBase = SerachBase.getSerachBase(); serachBase.add("1", "Hello!", "Hello!"); serachBase.add("2", "Hello! I am Zhang San.", "Hello! I'm Zhang San."); serachBase.add("3", "The weather is pretty good today."); serachBase.add("4", "Who are you?", "Who are you?"); serachBase.add("5", "The subject of advanced mathematics is difficult", "High mathematics is indeed difficult."); serachBase.add("6", "Test", "The above is just test"); String ids = serachBase.getIds("Your High mathematics"); System.out.println(ids); List<Object> objs = serachBase.getObjects(ids); if (objs != null) { for (Object obj : objs) { System.out.println((String) obj); } } } } The test output results are as follows:
5, 3, 2, 1, 4, High numbers are indeed difficult. The weather today is pretty good. Hello! I am Zhang San. Hello! Who are you?
Such a simple search engine is considered to be completed.
Question 1: The word participle here uses character participle, which is quite good at handling Chinese, but it is very weak in handling English.
Improvement method: Use the current mature word segmentation method, such as IKAnalyzer, StandardAnalyzer, etc. In this way, the data structure of keySearch needs to be modified, which can be modified to private HashMap<String, String>[] keySearch = new HashMap[maxLength]; where the key stores the word element of the part, and the value stores the unique identifier id.
Question 2: The search engine implemented in this article does not set weights for word elements like lucene, but simply determines whether word elements appear in the object.
Improvement method: None yet. Adding weight processing makes the data structure more complex, so it has not been processed for the time being, and weight processing will be implemented in future articles.
Let’s briefly introduce the implementation ideas of search engines .
Set the two properties of details and keySearch in the SerachBase class. details are used to store the detailed information of the Object, and keySearch is used to index the search domain. The details data format is HashMap, and the data format of keySearch is a sparse array (can also be HashMap, the HashMap medium key value is equivalent to the subscript in the sparse array, and the value is equivalent to the value of the sparse array at this position).
I won't give too much introduction to details.
The calculation method of array subscripts in keySearch (such as HashMap is key) is to obtain the first character int value of the word element (because the word participle in this article uses character participle, so a character is a word element). The int value is the subscript of the array, and the corresponding array value is the unique identifier of the Object. In this way, the data structure of keySearch is as follows
Therefore, when you want to add a new record, you only need to call the add method.
The implementation logic for search is similar to the keySearch above. For id search, just use HashMap's get method. For a search for search terms, the overall process also uses word segmentation first, query second, and sorted last. Of course, the word participle here should be consistent with the word participle used to create (that is, character participle is used when creating, and character participle is also used when searching).
In the getIds method, the HashMap<String, Integer> idTimes = new HashMap<String, Integer>();idTimes variable is used to store how many word elements in the search term appear in keySearch. The key value is the unique identifier id and the value is the number of word elements that appear. HashSet<String> ids = new HashSet<String>(); ids variable is used to store the ids of the occurrence of the word. The complexity of searching in this way is the number of word elements n of the search term. Obtain the ids containing word elements, construct the SortBean array, and sort it. The sorting rule is to descending order of the number of word elements. Finally, the ids string is returned, and each id is divided by ",". If you want to get detailed information, use the getObjects method.
The above is just a simple search engine and has not designed too many calculation methods. I hope it will inspire everyone's learning.