Solr is a standalone, enterprise-grade search application server that exposes a web-service-like API. Users can submit XML files in a prescribed format to the search server over HTTP to build indexes, and can submit search requests via HTTP GET to receive results in XML/JSON format. Solr is developed in Java 5 and built on Lucene.
Lucene is a subproject of the Apache Software Foundation's Jakarta project. It is an open-source full-text search engine toolkit; that is, it is not a complete full-text search engine but a full-text search engine architecture, providing a complete query engine and index engine, plus a partial text-analysis engine (for Western languages such as English and German).
The basic principles of Lucene's full-text search are the same as those taught in web-search courses (such as the one taught by Guo Jun): word segmentation, lexical and syntactic analysis, the vector space model, and related techniques. A more detailed blog post for reference: http://www.cnblogs.com/guochunguang/articles/3641008.html
1. General discussion
Definition according to http://lucene.apache.org/java/docs/index.html:
Lucene is an efficient, Java-based full-text search library.
So before studying Lucene, it is worth taking some time to understand full-text search.
What, then, is full-text search? This starts with the data in our lives.
The data in our lives are generally divided into two types: structured data and unstructured data.
•Structured data: refers to data with fixed format or finite length, such as databases, metadata, etc.
•Unstructured data: refers to data with uncertain length or no fixed format, such as emails, word documents, etc.
Of course, some sources also mention a third type, semi-structured data, such as XML and HTML, which can either be handled as structured data when appropriate, or have their plain text extracted and handled as unstructured data.
Unstructured data is also called full-text data.
According to the classification of data, searches are also divided into two types:
•Search of structured data: for example, searching a database with SQL statements, or searching metadata, such as using Windows search to find files by name, type, or modification time.
•Search of unstructured data: for example, searching file contents with Windows search or the grep command under Linux, or using Google and Baidu to search massive amounts of content.
There are two main methods for searching for unstructured data, that is, full-text data:
The first is sequential scanning (Serial Scanning). To find files containing a certain string, you examine each document in turn, reading it from beginning to end. If a document contains the string, it is one of the files we are looking for; then you move on to the next file, until every file has been scanned. Windows search can also search file contents this way, but it is quite slow: on an 80 GB hard disk, finding a file containing a certain string could easily take hours. The grep command under Linux works the same way. This method may seem primitive, and for small amounts of data it is still the most direct and convenient approach. For large numbers of files, however, it is very slow.
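As a minimal illustration (not from the original article), a sequential scan might look like the following Java sketch; the directory name "docs" and the search string are placeholders:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

public class SerialScan {
    public static void main(String[] args) throws IOException {
        String needle = "lucene";                    // string we are looking for
        try (Stream<Path> files = Files.walk(Paths.get("docs"))) {
            files.filter(Files::isRegularFile)
                 .filter(p -> {
                     try {                           // read each file end to end
                         return new String(Files.readAllBytes(p)).contains(needle);
                     } catch (IOException e) {
                         return false;
                     }
                 })
                 .forEach(System.out::println);      // print every matching file
        }
    }
}
```

Every query repeats the full scan, which is exactly the cost full-text indexing is designed to avoid.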
Some people may say that sequential scanning of unstructured data is very slow, while searching structured data is relatively fast (because structured data has a certain structure, search algorithms can exploit it to speed things up). So why not find a way to give our unstructured data some structure?
This idea is natural, and it forms the basic idea of full-text search: extract part of the information from the unstructured data, reorganize it so that it has a certain structure, and then search this restructured data, thereby making the search relatively fast.
This part of the information extracted from the unstructured data and then reorganized is called the index.
This statement is rather abstract, so a few examples make it easier to understand. Consider a dictionary: its pinyin table and radical lookup table are effectively the dictionary's index, while the explanation of each character is the unstructured data. If a dictionary had no pinyin table and no radical lookup table, you could only scan through the vast sea of characters. But some information about a character can be extracted and structured. Its pronunciation, for instance, is relatively structured, dividing into initials and finals, of which there are only a few that can be enumerated. So the pronunciations are extracted and arranged in a fixed order, each pointing to the page number of that character's detailed explanation. When we search, we look up the pronunciation in the structured pinyin table, and from there reach our unstructured data, that is, the explanation of the character.
This process of first creating an index and then searching the index is called Full-text search.
The picture below, taken from "Lucene in Action", describes not only Lucene's search process but also the general process of full-text search.
Full-text search is generally divided into two processes: index creation (Indexing) and search index (Search).
• Index creation: The process of extracting information from all structured and unstructured data in the real world and creating an index.
•Search index: It is the process of obtaining the user's query request, searching for the created index, and then returning the result.
Therefore, there are three important issues in full text search:
1. What exactly exists in the index? (Index)
2. How to create an index? (Indexing)
3. How to search for indexes? (Search)
Below we study each problem in sequence.
2. What exactly exists in the index
What exactly needs to be stored in the index?
First, let’s see why sequential scanning speed is slow:
In fact, it is caused by a mismatch between the information we want to search for and the information stored in the unstructured data.
The information stored in unstructured data is which strings each file contains; that is, given a file, finding its strings is relatively easy: a mapping from file to string. The information we want to search for, however, is which files contain a given string; that is, given a string, we want the files: a mapping from string to file. The two are exactly opposite. Therefore, if the index stores the mapping from string to file, search speed will improve enormously.
Since the mapping from string to file is the reverse of the file-to-string mapping, an index that stores this information is called an inverted index (also rendered as "reverse index").
The information stored in an inverted index is generally as follows:
Suppose my document collection contains 100 documents. For convenience, we number them from 1 to 100 and obtain the following structure.
On the left is a series of strings, called the dictionary.
Each string points to a linked list of documents containing that string, called a posting list (Posting List).
With the index, the stored information matches the information we search for, which greatly speeds up searching.
For example, to find documents containing both the string "lucene" and the string "solr", we only need the following steps:
1. Fetch the posting list for the string "lucene".
2. Fetch the posting list for the string "solr".
3. Merge (intersect) the two linked lists to find the files containing both "lucene" and "solr" (see the sketch after this list).
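A minimal sketch of this lookup, assuming the inverted index is an in-memory map from term to a sorted list of document IDs (this data structure is an illustration, not Lucene's actual on-disk format):

```java
import java.util.*;

public class InvertedIndexDemo {
    // term -> sorted list of document IDs containing the term (toy data)
    static Map<String, List<Integer>> index = Map.of(
            "lucene", List.of(1, 3, 7, 9),
            "solr",   List.of(3, 5, 9));

    // Intersect two sorted posting lists with two pointers.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) { out.add(a.get(i)); i++; j++; }
            else if (cmp < 0) i++;
            else j++;
        }
        return out;
    }

    public static void main(String[] args) {
        // documents containing both "lucene" and "solr" -> [3, 9]
        System.out.println(intersect(index.get("lucene"), index.get("solr")));
    }
}
```

Because both posting lists are kept sorted, the merge runs in time linear in the lengths of the two lists.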
Seeing this, some may say that full-text search does speed up searching, but with the added process of building the index, the combination may not be much faster than sequential scanning. Indeed, with the indexing step included, full-text retrieval is not necessarily faster than sequential scanning, especially when the amount of data is small; and creating an index over a large amount of data is itself a slow process.
However, there is still a difference between the two: sequential scanning must scan everything on every search, whereas the index is created once and for all. Each subsequent search skips the index-creation step and only searches the already-built index.
This is also one of the advantages of full-text search over sequential scanning: indexing once, using multiple times.
3. How to create an index
The index creation process of full-text search generally has the following steps:
Step 1: Some original documents to be indexed (Documents).
To facilitate the explanation of the index creation process, here we use two files as examples:
Document 1: Students should be allowed to go out with their friends, but not allowed to drink beer.
Document 2: My friend Jerry went to school to see his students but found them drunk which is not allowed.
Step 2: Pass the original documents to the tokenization component (Tokenizer).
The tokenization component (Tokenizer) does the following things (this process is called Tokenize):
1. Divide the document into separate words.
2. Remove punctuation marks.
3. Remove the Stop word.
So-called stop words are some of the most common words in a language. Since they carry no special meaning, in most cases they cannot serve as search keywords, so they are removed when the index is created; this also reduces the size of the index.
English stop words include "the", "a", "this", etc.
Each language-specific Tokenizer has its own set of stop words.
The result produced by the tokenizer is called a token (Token).
In our example, we obtain the following tokens (Token):
"Students", "allowed", "go", "their", "friends", "allowed", "drink", "beer", "My", "friend", "Jerry", "went", "school", "see", "his", "students", "found", "them", "drunk", "allowed".
Step 3: Pass the obtained token to the language processing component (Linguistic Processor).
The language processing component (Linguistic Processor) performs language-specific processing on the tokens produced by the tokenizer.
For English, the language processing component (Linguistic Processor) generally does the following:
1. Convert to lowercase (Lowercase).
2. Reduce words to their root form, such as "cars" to "car". This operation is called stemming.
3. Transform words into their root form, such as "drove" to "drive". This operation is called lemmatization.
Similarities and differences between stemming and lemmatization:
•Similarity: both stemming and lemmatization reduce a word to a root form.
•Their methods differ:
◦Stemming uses "reduction": "cars" to "car", "driving" to "drive".
◦Lemmatization uses "transformation": "drove" to "drive", "driving" to "drive".
•Their algorithms differ:
◦Stemming mainly applies fixed rules, such as removing "s", removing "ing" and adding "e", changing "ational" to "ate", and changing "tional" to "tion".
◦Lemmatization mainly relies on a stored dictionary of mappings, such as "driving" to "drive", "drove" to "drive", and "am, is, are" to "be". To transform a word, simply look it up in the dictionary.
•Stemming and lemmatization are not mutually exclusive; they overlap, and some words can be converted to the same form by either method.
The result of a linguistic processor is called a Term.
In our example, after language processing, the word (Term) obtained is as follows:
"student", "allow", "go", "their", "friend", "allow", "drink", "beer", "my", "friend", "jerry", "go", "school", "see", "his", "student", "find", "them", "drink", "allow".
It is precisely because of this language-processing step that a search for "drive" can also match documents containing "driving" or "drove".
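A toy linguistic processor combining the rules and dictionary mentioned above (the rule set and lemma dictionary are illustrative fragments chosen to reproduce the example; real stemmers such as the Porter stemmer are far more elaborate):

```java
import java.util.*;

public class LinguisticProcessor {
    // Tiny lemma dictionary, as described above (illustrative fragment).
    static final Map<String, String> LEMMAS = Map.of(
            "drove", "drive", "am", "be", "is", "be", "are", "be",
            "went", "go", "found", "find", "drunk", "drink");

    static String process(String token) {
        String w = token.toLowerCase();                  // 1. lowercase
        if (LEMMAS.containsKey(w)) return LEMMAS.get(w); // 3. lemmatization by lookup
        // 2. stemming by fixed rules (illustrative, not the full Porter algorithm)
        if (w.endsWith("ational")) return w.substring(0, w.length() - 7) + "ate";
        if (w.endsWith("tional"))  return w.substring(0, w.length() - 6) + "tion";
        if (w.endsWith("ing") && w.length() > 5) return w.substring(0, w.length() - 3) + "e";
        if (w.endsWith("ed")  && w.length() > 4) return w.substring(0, w.length() - 2);
        if (w.endsWith("s")   && w.length() > 3) return w.substring(0, w.length() - 1);
        return w;
    }

    public static void main(String[] args) {
        System.out.println(process("Students")); // student
        System.out.println(process("drove"));    // drive
        System.out.println(process("driving"));  // drive
    }
}
```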
Step 4: Pass the obtained word (Term) to the index component (Indexer).
Indexer mainly does the following:
1. Create a dictionary using the resulting word (Term).
In our example the dictionary is as follows:
| Term | Document ID |
| --- | --- |
| student | 1 |
| allow | 1 |
| go | 1 |
| their | 1 |
| friend | 1 |
| allow | 1 |
| drink | 1 |
| beer | 1 |
| my | 2 |
| friend | 2 |
| jerry | 2 |
| go | 2 |
| school | 2 |
| see | 2 |
| his | 2 |
| student | 2 |
| find | 2 |
| them | 2 |
| drink | 2 |
| allow | 2 |
2. Sort the dictionary alphabetically.
| Term | Document ID |
| --- | --- |
| allow | 1 |
| allow | 1 |
| allow | 2 |
| beer | 1 |
| drink | 1 |
| drink | 2 |
| find | 2 |
| friend | 1 |
| friend | 2 |
| go | 1 |
| go | 2 |
| his | 2 |
| jerry | 2 |
| my | 2 |
| school | 2 |
| see | 2 |
| student | 1 |
| student | 2 |
| their | 1 |
| them | 2 |
3. Merge the same words (Term) into document posting lists (Posting List).
In this table, there are several definitions:
•Document Frequency (df): the document frequency, i.e., how many documents contain this term.
•Frequency (tf): the term frequency, i.e., how many times this term appears within a given document.
So for the term "allow", two documents contain it, and the term is followed by a posting list with two entries. The first entry indicates that the first document containing "allow" is document 1, in which "allow" appears twice; the second entry indicates that the second document containing "allow" is document 2, in which "allow" appears once.
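A compact sketch of this merge step, producing term -> (document ID, frequency) entries (again an in-memory toy, not Lucene's real index format):

```java
import java.util.*;

public class SimpleIndexer {
    // term -> (docId -> term frequency in that doc); TreeMaps keep both sorted
    static Map<String, SortedMap<Integer, Integer>> index = new TreeMap<>();

    static void add(int docId, List<String> terms) {
        for (String term : terms) {
            index.computeIfAbsent(term, t -> new TreeMap<>())
                 .merge(docId, 1, Integer::sum);   // bump tf for (term, doc)
        }
    }

    public static void main(String[] args) {
        add(1, List.of("student", "allow", "go", "their", "friend",
                       "allow", "drink", "beer"));
        add(2, List.of("my", "friend", "jerry", "go", "school", "see", "his",
                       "student", "find", "them", "drink", "allow"));
        // allow -> {1=2, 2=1}: df = 2 postings; tf = 2 in doc 1, 1 in doc 2
        System.out.println("allow -> " + index.get("allow"));
    }
}
```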
So far, the index has been created, and we can quickly find the document we want through it.
And in this process we find, perhaps to our pleasant surprise, that a search for "drive" will also find "driving", "drove", and "driven", because in our index "driving", "drove", and "driven" are all turned into "drive" by language processing. When searching, an input such as "driving" also goes through steps one to three here and becomes the query "drive", so the desired documents are found.
4. How to search the index?
It might seem that we could now declare, "we have found the documents we want."
However, the matter is not over; this is only one aspect of full-text search. Isn't it? If only one or ten documents contain the string we query, we have indeed found them. But what if there are a thousand, or even tens of thousands? Which file do you want most?
Open Google, for example. Say you want to find a job at Microsoft, so you enter "Microsoft job" and get back a total of 22,600,000 results. What a huge number. Suddenly, finding nothing is one problem, and finding too much is another. With so many results, how do you put the most relevant ones first?
Of course Google does this well; you find jobs at Microsoft in one go. Imagine how terrible it would be if the first few results were all "Microsoft does a good job in the software industry...".
How to find the most relevant query statements among thousands of search results like Google?
How to determine the correlation between the searched document and the query statement?
This is back to our third question: How to search for indexes?
Search is mainly divided into the following steps:
Step 1: The user enters the query statement.
Query statements, like our ordinary language, have a certain syntax.
Different query statements have different syntax; SQL statements, for example, have their own.
The query syntax varies with the implementation of the full-text retrieval system, but the most basic operators include AND, OR, and NOT.
For example, the user input statement: lucene AND learned NOT hadoop.
Explain that the user wants to find a document that contains lucene and learned but does not include hadoop.
Step 2: Perform lexical analysis, grammatical analysis, and language processing of the query statement.
Since the query statement has a syntax, it too must go through lexical analysis, syntactic analysis, and language processing.
1. Lexical analysis is mainly used to identify words and keywords.
As in the above example, after lexical analysis, the words include lucene, learned, hadoop, and keywords include AND, NOT.
If a keyword is misspelled, lexical analysis does not recognize it as a keyword. In lucene AMD learned, for example, because AND is misspelled as AMD, AMD participates in the query as an ordinary word.
2. Syntax analysis mainly forms a syntax tree based on the grammatical rules of the query statement.
If the query statement does not conform to the syntax rules, an error is reported. For example, lucene NOT AND learned produces an error.
As in the example above, lucene AND learned NOT hadoop forms the following syntax tree:
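The tree itself appears as a figure in the original post; an ASCII sketch of the same structure (one reasonable parse, with NOT applied to the result of the AND):

```
           NOT
          /    \
       AND     hadoop
      /    \
  lucene  learned
```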
3. Language processing is almost the same as the language processing performed during indexing.
For example, learned becomes learn, and so on.
After this second step, we obtain a language-processed syntax tree.
Step 3: Search for the index and get documents that match the syntax tree.
This step is divided into several small steps:
1. First, in the inverted index, find the posting lists for "lucene", "learn", and "hadoop" respectively.
2. Next, merge (intersect) the posting lists for "lucene" and "learn", obtaining a linked list of documents containing both "lucene" and "learn".
3. Then take the difference of this list against the "hadoop" posting list, removing documents that contain "hadoop"; this yields a linked list of documents that contain both "lucene" and "learn" but not "hadoop" (a sketch of this difference operation follows the list).
4. This linked list of documents is exactly what we are looking for.
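The difference step can reuse the two-pointer pattern from the intersection sketch earlier; a minimal fragment with made-up posting lists:

```java
import java.util.*;

public class PostingDifference {
    // Keep IDs in 'a' that do not appear in 'b'; both lists are sorted.
    static List<Integer> difference(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int j = 0;
        for (int id : a) {
            while (j < b.size() && b.get(j) < id) j++;   // advance b up to id
            if (j >= b.size() || b.get(j) != id) out.add(id);
        }
        return out;
    }

    public static void main(String[] args) {
        // docs with lucene AND learn = [2, 5, 8]; docs with hadoop = [5, 9]
        System.out.println(difference(List.of(2, 5, 8), List.of(5, 9))); // [2, 8]
    }
}
```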
Step 4: Sort the results based on the correlation between the obtained document and the query statement.
Although the previous step gives us the desired documents, the query results should still be sorted by their relevance to the query statement, with the most relevant ranked first.
How do we calculate the relevance between a document and the query statement?
One approach is to regard the query statement as a short document and score the relevance between documents: highly scored documents are more relevant and should be ranked first.
So how do we score the relationship between documents?
This is not an easy task. First, let’s take a look at judging the relationship between people.
First, when looking at a person, there are often many elements, such as character, beliefs, hobbies, clothing, height, and build.
Secondly, for the relationship between two people, different elements carry different importance: character, beliefs, and hobbies may matter more, while clothing, height, and build matter less. So two people with the same or similar character, beliefs, and hobbies are more likely to become good friends, while people of different clothing, height, and build can still become good friends.
Therefore, to judge the relationship between people, first find the elements that matter most to that relationship, such as character, beliefs, and hobbies. Then judge how these elements of the two people relate: say one person has a cheerful character and the other is extroverted; one believes in Buddhism, the other in God; one loves basketball, the other football. We find that both are positive in character, kind in belief, and athletic in hobbies, so the two should have a good relationship.
Let’s take a look at the relationship between companies.
First, look at a company, which consists of many people, such as general manager, manager, chief technical officer, ordinary employees, security guards, doormen, etc.
Secondly, different people have different importance for the relationship between companies . General managers, managers, and chief technical officers may be more important, and ordinary employees, security guards, and doormen may be less important. So if the relationship between the general managers, managers and chief technology officers of two companies is relatively good, the two companies are prone to have a better relationship. However, even if an ordinary employee has a deep hatred with an ordinary employee of another company, it is unlikely that it will affect the relationship between the two companies.
Therefore, to judge the relationship between companies, first find the people who matter most to that relationship, such as the general managers, managers, and chief technology officers. Then judge the relationships between these people: say the two general managers were once classmates, the managers are from the same hometown, and the chief technology officers were once startup partners. We find that at the level of general manager, manager, and chief technology officer alike, the relationships are good, so the two companies should have a good relationship.
After analyzing the two relationships, let’s take a look at how to judge the relationship between documents .
First of all, a document consists of many words (Term) , such as search, lucene, full-text, this, a, what, etc.
Secondly, different terms carry different importance for the relationship between documents. For example, in this document, search, Lucene, and full-text are relatively important, while this, a, and what may be relatively unimportant. So if two documents both contain search, Lucene, and full-text, they are more closely related; whereas whether or not a document contains this, a, and what has little effect on the relevance between the two documents.
Therefore, to judge the relationship between documents, first find out which terms matter most for that relationship, such as search, Lucene, and full-text; then judge the relationship between these terms.
The process of finding out the importance of a word (Term) to a document is called the process of calculating the weight of a word.
There are two parameters to calculate the term weight. The first is the word (Term) and the second is the document (Document).
The weight of a term indicates how important the term is in its document. The more important the term, the greater its weight (Term weight), and thus the larger its role in the calculation of relevance between documents.
Judging the relationship between terms, and from it the relevance between documents, generally uses the Vector Space Model (VSM) algorithm.
Let’s analyze these two processes carefully:
1. The process of calculating the weight (Term weight).
There are two main factors that influence the importance of a word (Term) in a document:
•Term Frequency (tf): how many times the term appears in this document. The larger the tf, the more important the term.
•Document Frequency (df): how many documents contain the term. The larger the df, the less important the term.
Is this easy to understand? The more often a term appears in a document, the more important it is to that document: if the word "search" appears many times in a document, the document is mainly about that topic. But in an English document, "this" also appears often; does that make it more important? No, and that is what the second factor adjusts for. The more documents contain a term, the more ordinary the term is, and the less able it is to distinguish between documents, hence the less important it is.
It is like the technologies we programmers learn. For a programmer, the more deeply a technology is mastered, the better (mastering it deeply means investing more time, so the tf is larger), and the more competitive you are when looking for a job. For programmers as a whole, however, the fewer people who know a technology, the better (the smaller the df), and the more competitive those who do know it become. This is the sense in which a person's value lies in being hard to replace.
Once the truth is understood, let's take a look at the formula:
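The formula appears as an image in the original post; reconstructed from the surrounding description (and consistent with the example weights used later, e.g. $0.176 = \log_{10}(3/2)$), a typical form is:

$$
w_{t,d} = tf_{t,d} \times \log\frac{n}{df_t}
$$

where $w_{t,d}$ is the weight of term $t$ in document $d$, $tf_{t,d}$ is its frequency in $d$, $df_t$ is its document frequency, and $n$ is the total number of documents in the collection.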
This is just one simple, typical implementation of the term-weight formula. Implementers of full-text retrieval systems have their own variants, and Lucene's differs slightly from this.
2. The process of judging the relationship between terms to obtain document relevance, namely the Vector Space Model (VSM) algorithm.
We regard documents as a series of words (Term), each word (Term) has a weight (Term weight), and different words (Term) affect the scoring calculation of document relevance based on their own weight in the document.
So we regard the term weights of a document as a vector:
Document = {term1, term2, …, termN}
Document Vector = {weight1, weight2, …, weightN}
Similarly, we regard the query statement as a simple document and also express it as a vector:
Query = {term1, term2, …, termN}
Query Vector = {weight1, weight2, …, weightN}
We put all the searched document vectors and query vectors into an N-dimensional space, and each word (term) is one-dimensional.
As shown in the picture:
We believe that the smaller the angle between two vectors, the greater the correlation.
Therefore, we calculate the cosine value of the included angle as a score of correlation. The smaller the angle, the larger the cosine value, the higher the score, and the greater the correlation.
Some people may ask: query statements are usually very short and contain very few terms, so the query vector should have few dimensions, while documents are long and contain many terms, so the document vector should have many dimensions. Why does the figure give both dimension N?
Because they are to be placed in the same vector space, they naturally have the same dimension: we take the union of the two term sets, and wherever one side lacks a term, that term's weight (Term Weight) is 0.
The correlation scoring formula is as follows:
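The scoring formula is likewise an image in the original; the standard cosine measure it describes is:

$$
score(q, d) = \cos\theta = \frac{\vec{V}_q \cdot \vec{V}_d}{\left|\vec{V}_q\right| \left|\vec{V}_d\right|}
= \frac{\sum_{i=1}^{N} w_{i,q}\, w_{i,d}}{\sqrt{\sum_{i=1}^{N} w_{i,q}^2}\ \sqrt{\sum_{i=1}^{N} w_{i,d}^2}}
$$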
For example, suppose the query statement contains 11 terms in total and three documents are retrieved. Their respective term weights (Term weight) are as follows.
|    | t1 | t2 | t3 | t4 | t5 | t6 | t7 | t8 | t9 | t10 | t11 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| D1 | 0 | 0 | .477 | 0 | .477 | .176 | 0 | 0 | 0 | .176 | 0 |
| D2 | 0 | .176 | 0 | .477 | 0 | 0 | 0 | 0 | .954 | 0 | .176 |
| D3 | 0 | .176 | 0 | 0 | 0 | .176 | 0 | 0 | 0 | .176 | .176 |
| Q  | 0 | 0 | 0 | 0 | 0 | .176 | 0 | 0 | .477 | 0 | .176 |
Therefore, the correlation scores between the three documents and the query statement are calculated as:
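The computed scores appear as an image in the original; redoing the arithmetic with the cosine formula above (vector norms: $|\vec{V}_q| \approx 0.538$, $|\vec{V}_{D1}| \approx 0.719$, $|\vec{V}_{D2}| \approx 1.095$, $|\vec{V}_{D3}| = 0.352$):

$$
\begin{aligned}
score(q, D1) &= \frac{0.176 \times 0.176}{0.538 \times 0.719} \approx 0.08,\\
score(q, D2) &= \frac{0.477 \times 0.954 + 0.176 \times 0.176}{0.538 \times 1.095} \approx 0.82,\\
score(q, D3) &= \frac{0.176 \times 0.176 + 0.176 \times 0.176}{0.538 \times 0.352} \approx 0.33.
\end{aligned}
$$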
Therefore, document 2 has the highest relevance and is returned first, followed by document 3, and finally document 1.
So far, we can find the documentation we want the most.
Having said all this, we have not actually reached Lucene yet; this is just basic theory in information retrieval. But once we look at Lucene, we will find that Lucene is precisely a practical implementation of this basic theory. So in the articles analyzing Lucene, we will often see the above theory applied.
Before entering Lucene, here is a summary of the index-creation and search processes above, as shown in the figure:
This figure is based on the article "Lucene, the open-source full-text search engine" at http://www.lucene.com.cn/about.htm
1. Indexing process:
1) There are a series of files to be indexed.
2) The files to be indexed go through tokenization and language processing to form a series of words (Term).
3) Indexing creates the dictionary and the inverted index table.
4) The index is written to disk through index storage.
2. Search process:
a) The user enters a query statement.
b) The query statement goes through lexical analysis and language processing to yield a series of words (Term).
c) Syntactic analysis produces a query tree.
d) The index is read into memory through index storage.
e) The query tree is used to search the index, obtaining the posting list of each word (Term); the posting lists are then intersected and differenced to obtain the result documents.
f) The result documents are sorted by relevance to the query.
g) Return the query result to the user.
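To connect this theory to practice, here is a minimal end-to-end example using Lucene's classic Java API (StandardAnalyzer, IndexWriter, and QueryParser are from Lucene's core and queryparser modules; exact class availability and signatures vary across Lucene versions, so treat this as an illustrative sketch rather than a definitive implementation):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;

public class LuceneHelloWorld {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(); // tokenize + lowercase + stop words
        Directory dir = new ByteBuffersDirectory();         // in-memory index for the demo

        // Indexing process: analyze documents and write the inverted index.
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        Document doc = new Document();
        doc.add(new TextField("content",
                "Students should be allowed to go out with their friends, "
              + "but not allowed to drink beer.", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Search process: parse the query, search the index, rank by relevance.
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new QueryParser("content", analyzer).parse("students AND beer");
        TopDocs hits = searcher.search(query, 10);          // top 10 by score
        for (ScoreDoc sd : hits.scoreDocs) {
            System.out.println(sd.score + " : "
                    + searcher.doc(sd.doc).get("content"));
        }
        reader.close();
    }
}
```

The same analyzer is deliberately used for both indexing and query parsing, so that a document term and a query term go through identical tokenization and language processing, as the theory above requires.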