In scenarios like forums and chat rooms, in order to ensure user experience, we often need to block a lot of bad words. For single keyword search, it is naturally more efficient in indexOf and regular methods. However, if there are many keywords, if you repeatedly call indexOf or regular words to match the full text, the performance consumption is very high. Since the target string is usually large in size, it is necessary to ensure that the result is obtained in one traversal. Based on such needs, it is easy to think of ways to match each character in the whole text in sequence. For example, for this text: "Mike Jordan had said "Just do IT", so Mark has been a coder." If our keyword is "Mike" and "Mark", then you can traverse the whole sentence. When you find "M", then see if you can match "i" or "a". If you can match until the end, you will successfully find a keyword, otherwise you will continue to traverse. Then the structure of the keywords should be like this:
var keywords = { M: { i: { k: { e: { end: true} } } }, a: { r: { k: {end: true} } } }}From the above, we can see that this data is a tree structure, and it is still time-consuming to create a tree structure based on the keyword group, while the keywords are already given, so you can create such a data structure in advance before matching. The code is as follows:
function buildTree(keywords) { var tblCur = {}, key, str_key, Length, j, i; var tblRoot = tblCur; for(j = keywords.length - 1; j >= 0; j -= 1) { str_key = keywords[j]; Length = str_key.length; for(i = 0; i < Length; i += 1) { key = str_key.charAt(i); if(tblCur.hasOwnProperty(key)) { tblCur = tblCur[key]; } else { tblCur = tblCur[key] = {}; } } tblCur.end = true; //The last keyword tblCur = tblRoot; } return tblRoot;}This code uses a continuous statement: tblCur = tblCur[key] = {}. The execution order here is the execution order of the statements. Since the operation level of [] is higher than =, the first thing is to create a key attribute in the tblCur object. Combined with tblRoot = tblCur = {}, the execution order is:
var tblRoot = tblCur = {};tblRoot = tblCur;tblCur['key'] = undefined; // now tblRoot = {key: undefined}tblCur['key'] = {};tblCur = tblCur['key'];Through the above code, the required query data is constructed. Let’s take a look at the writing method of the query interface.
For each word of the target string, we start from the top of this keywords. First, keywords[a] . If there is, look at keyword[a][b]. If the last keyword[a][b]…[x]=true, it means that the match is successful. If keyword[a][b]…[x]=undefined, then the matching will be restarted from the next position.
function search(content) { var tblCur, p_star = 0, n = content.length, p_end, match, //Whether to find a match_key, match_str, arrMatch = [], //Storage the result arrLength = 0; //The length index of arrMatch while(p_star < n) { tblCur = tblRoot; //Tracking back to the root p_end = p_star; match_str = ""; match = false; do { match_key = content.charAt(p_end); if(!(tblCur = tblCur[match_key])) { //The end of this match p_star += 1; break; } else { match_str += match_key; } p_end += 1; if(tblCur.end) //Whether to match to the tail { match = true; } } while (true); if(match) { //Maximum match arrMatch[arrLength] = { key: match_str, begin: p_star - 1, end: p_end }; arrLength += 1; p_star = p_end; } } return arrMatch;}The above is the core of the entire keyword matching system. Here we use the language features of js very well, and the efficiency is very high. I used a 500,000-word "Soushen Ji" to test it and found the given 300 idioms. The matching effect is about 1 second. Importantly, since the target text is traversed at once, the length of the target text has little effect on query time. The number of keywords that have a greater impact on query time is the number of keywords. Each word in the target text traverses the keywords, so it has a certain impact on the query.
Simple analysis
I guess you are wondering when you read the above article. You can traverse all keywords for each word. Even if some keywords are partially the same, it is quite time-consuming to traverse them completely. However, the properties of objects in js are constructed using a hash table. The data of this structure is very different from simple array traversal, and the efficiency is much higher than that of order-based array traversal. Some students may not be familiar with data structures, so I will briefly talk about the relevant content of the hash table.
First, let’s take a look at the storage of data.
The storage of data in memory consists of two parts, one is the value and the other is the address. Think of memory as a Xinhua dictionary, the explanation of the word is the value, and the directory is the address. The dictionary is sorted by pinyin, for example, "ni" with the same pronunciation is arranged in the same piece, that is, the array is neatly arranged in a memory area. This structure is an array, you can specify "ni" number 1 and 10 to access it. The structure diagram is as follows:
The advantage of arrays is that they are simple to traverse, and they can directly access the corresponding data through subscripts. But it is very difficult to add or delete a certain item. For example, if you want to delete item 6, then the data after item 5 must be moved forward by one position. If you want to delete the first bit, the entire array needs to be moved, which consumes a lot.
In order to solve the problem of array addition and deletion, linked lists appear. If we divide the value into two parts, part is used to store the original value, and the other part is used to store an address, which points to another same structure, and so on, form a linked list. The structure is as follows:
It can be clearly seen from the above figure that adding and deleting the linked list is very simple. Just rewrite the target item and the next item of the previous item and it will be completed. However, it is very difficult to query the value of an item. You must traverse it in turn to access the target location.
In order to integrate the advantages of these two structures, you must have thought of the following structure.
This data structure is a hash table structure. The header address of the linked list is stored in the array, and a two-dimensional data table can be formed. As for how data is distributed, this is a hashing algorithm, and a regular translation should be a hashing algorithm. Although there are many types of algorithms, in principle, they solve the key through a function, and then place the data according to the results obtained from the solution. In other words, a mapping is formed between the key and the actual address, so at this time we no longer access the array by subscripting the array or simply traversing it, but instead locate the data by the inverse function of the hash function. The object in js is a hash structure. For example, we define an obj. Obj.name through hashing, and its position in memory may be 90 in the figure above. When we want to operate obj.name, the underlying layer will automatically help us locate the position of 90 through the hash algorithm, which means that we directly start to look up the linked list from the 12 items of the array, rather than traversing the entire memory block from 0.
Define an object obj{key: value} in js. The key is converted into a string and then hashed to obtain a memory address, and then put the value into it. This allows us to understand why we can add and delete attributes at will, and why we can also assign attributes to arrays in js, and there is no so-called cross-border array.
In situations where the data volume is large, the hash table has a very obvious advantage because it reduces a lot of unnecessary calculations through the hash algorithm. The so-called performance optimization is actually to make computers less computing; the biggest optimization is to not calculate!
Optimization of algorithms
Now that you understand the underlying implementation of the algorithm, you can consider optimizing the algorithm in retrospect. However, before optimizing, you should emphasize one thing: Don’t blindly pursue performance! For example, in this case, we can match up to 5,000 words at most, so the existing algorithm is enough, and all optimizations are unnecessary. The reason why I still talk about optimization is to improve my understanding of the algorithm and the program, rather than really doing the 1ms optimization.
We found that none of our keywords have a single word, so it would be a waste for us to traverse keywords according to the unit of one word. The optimization here is to pre-statistic the maximum and minimum length of the keyword, and search in units of the minimum length each time. For example, the keyword in my test case is an idiom, and the shortest is 4 characters, so every time I match, I match 4 characters together. If I hit, continue to search in depth to find the maximum length. In other words, when we first construct the tree, it is first built with the minimum length, and then it is added verbatim.
A simple calculation is made. According to our test case, for 300 idioms, we only need to compare one word once, and for single-word query, we need to compare 4 times, and each comparison we have to access our tree structure, which is the avoidable performance consumption. More importantly, the comparison here is not a string comparison. Our keywords here all exist as keys, and the effect is the same as key in obj, which is hashed the key and then access the corresponding address! So don’t worry about the difference between comparing one word and comparing four words. We didn’t compare strings!
That’s all about matching multiple keywords. I won’t post the optimized version of the code because it is generally not available.