0 Introduction
With the growth of the World Wide Web and the arrival of the big data era, enormous amounts of digital information are produced, stored, transmitted, and converted every day. Finding the information that meets a particular need within this mass of data, so that it can be organized and used, has become a major problem. Full-text search is the most common form of information retrieval today: the search engines used every day to find information in blogs and forums rest on the full-text search technology implemented in this article. As document information becomes fully digital, storing it effectively and extracting it promptly and accurately is a foundation that every company, enterprise, and organization must lay. For English text, full-text search already has many mature theories and methods. The open-source full-text search engine Lucene is a sub-project of the Apache Software Foundation's Jakarta project; its purpose is to provide software developers with a simple, easy-to-use toolkit for implementing full-text search in a target system. Lucene's standard analyzer does not segment Chinese, but there are many open-source Chinese word segmenters that can index Chinese content. Based on a study of Lucene's core principles, this paper implements crawling and retrieval of Chinese and English web pages.
1 Introduction to Lucene
1.1 Introduction to Lucene
Lucene is a full-text search engine toolkit written in Java. It implements two core functions, indexing and search, which are independent of each other, making it easy for developers to extend either one. Lucene provides a rich API for interacting with the information stored in its indexes. Note that Lucene is not a complete full-text search application; it supplies indexing and search functions to an application, so some secondary development on top of it is necessary before Lucene can do real work.
Lucene's structure resembles that of a database, but its index differs substantially from a database index. Both exist to make searching convenient, but a database index is built only over selected fields, and the data must be converted into structured records before being saved, whereas full-text search indexes all of the information in a uniform way. The similarities and differences between the two kinds of search are shown in Table 1-1.
Table 1-1: Comparison of database search and Lucene search

| Comparison | Lucene search | Database retrieval |
| --- | --- | --- |
| Data retrieval | Matches are looked up in Lucene's index files | Records are retrieved via the database index |
| Index unit | Document | Record |
| Query results | Hits: the Documents that satisfy the query | Result set: the records containing the keywords |
| Full-text search | Supported | Not supported |
| Fuzzy query | Supported | Not supported |
| Result ranking | Weights can be set; results ranked by relevance | No relevance ranking |
1.2 Lucene overall structure
Lucene is released as a JAR file; versions are updated quickly and differ considerably from one another. This article uses version 5.3.1. The main subpackages used are shown in Table 1-2.
Table 1-2: Subpackages and their functions

| Package name | Function |
| --- | --- |
| org.apache.lucene.analysis | Analysis (word segmentation) |
| org.apache.lucene.document | Document management for the index |
| org.apache.lucene.index | Index operations, including adding and deleting entries |
| org.apache.lucene.queryparser | Query parser; constructs search expressions |
| org.apache.lucene.search | Search management |
| org.apache.lucene.store | Data storage management |
| org.apache.lucene.util | Utility classes |
1.3 Lucene architecture design
Lucene's functionality is extensive, but fundamentally it consists of two parts: segmenting text content and writing the terms into the index library, and returning results that match the query conditions. In other words: building an index, and querying it.
As shown in Figure 1-1, this article sets aside the external interfaces and information sources, and focuses on indexing and querying the text content crawled from web pages.
Figure 1-1: Lucene's architecture design
2 JDK installation and environment variable configuration
1. Download the JDK:
Download the archive matching your system from Oracle's official website at the URL below, then run the installer and follow the prompts. During installation you will be asked whether to install the JRE; click Yes.
http://www.oracle.com/technetwork/java/javase/downloads/index.html
2. Set the environment variables:
(1) Right-click Computer > Properties > Advanced system settings > Environment Variables > System variables > New > JAVA_HOME: the installation path
(2) Append %JAVA_HOME%\bin to Path
3. Verify the installation:
Open Start > Run > CMD and enter the following in the DOS window:
java -version should display the version information.
javac should display the javac usage information.
Output like that in Figure 2-1 indicates success.
Figure 2-1: Testing the Java configuration in the cmd window
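As a further check that the toolchain works end to end, a trivial program can be compiled and run. This is an optional verification step, not part of the original setup, and the class name is arbitrary:

```java
// HelloJava.java - optional sanity check for the JDK installation.
// Compile: javac HelloJava.java
// Run:     java HelloJava
public class HelloJava {
    public static void main(String[] args) {
        // Prints the JVM version in use, which should match "java -version".
        System.out.println("Java version: " + System.getProperty("java.version"));
    }
}
```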
3 Writing Java code to obtain web content
Lucene needs different analyzers (word segmenters) for different languages: the standard analyzer is used for English, and the smartcn analyzer for Chinese. To obtain a web page, first save it as an HTML file. The tags in the HTML interfere with retrieval, so they must be removed and the text content saved as a txt file. Apart from the analyzer, the Chinese and English pipelines are essentially the same, so the subsequent code and experimental results are demonstrated for one language at a time. This article uses fifty web pages of Chinese and English stories as examples.
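To make the difference between the two analyzers concrete, the sketch below prints the tokens each one produces. It is an illustration only: the sample sentences are made up, and it assumes lucene-core, lucene-analyzers-common, and lucene-analyzers-smartcn 5.3.1 on the classpath.

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {
    // Print the tokens an analyzer produces for a piece of text.
    static void printTokens(Analyzer analyzer, String text) throws IOException {
        try (TokenStream ts = analyzer.tokenStream("contents", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.print("[" + term.toString() + "] ");
            }
            ts.end();
            System.out.println();
        }
    }

    public static void main(String[] args) throws IOException {
        // StandardAnalyzer splits English on whitespace/punctuation and lowercases.
        printTokens(new StandardAnalyzer(), "The stupid wolf goes to school");
        // SmartChineseAnalyzer segments Chinese, which has no word delimiters.
        printTokens(new SmartChineseAnalyzer(), "笨狼上学");
    }
}
```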
The code is designed as follows: Url2Html.java downloads the web page at the input URL into an HTML file, and Html2Txt.java removes the HTML tags and converts the document into a txt file. The code is shown in Figures 3-1 and 3-2.
public void way(String filePath, String url) throws Exception {
    File dest = new File(filePath);                            // target file
    FileOutputStream fos = new FileOutputStream(dest);         // byte output stream
    URL wangzhi = new URL(url);                                // the URL to fetch
    InputStream is = wangzhi.openStream();                     // byte input stream
    BufferedInputStream bis = new BufferedInputStream(is);     // buffer the input stream
    BufferedOutputStream bos = new BufferedOutputStream(fos);  // buffer the output stream
    /* read and copy the bytes */
    int length;
    byte[] bytes = new byte[1024 * 20];
    while ((length = bis.read(bytes, 0, bytes.length)) != -1) {
        bos.write(bytes, 0, length);
    }
    /* close the buffered streams and the input and output streams */
    bos.close();
    fos.close();
    bis.close();
    is.close();
}

public String getBody(String val) {
    String zyf = val.replaceAll("</?[^>]+>", ""); // strip the HTML tags
    return zyf;
}

public void writeTxt(String Str, String writePath) {
    File writename = new File(writePath);
    try {
        writename.createNewFile();
        BufferedWriter out = new BufferedWriter(new FileWriter(writename));
        out.write(Str);
        out.flush();
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Taking the web page of the fairy tale "The Stupid Wolf Goes to School" as an example, the document paths are set to "E:/work/lucene/test/data/html" and "E:/work/lucene/test/data/txt". Two parameters must be set when reading the web page: the file name filename and the target URL url. A main function is created to call the two methods, as shown in Figure 3-3:
public static void main(String[] args) {
    String filename = "jingdizhi"; // file name
    String url = "http://www.51test.net/show/8072125.html"; // the web page to crawl
    String filePath = "E://work//lucene//test//data//html//" + filename + ".html"; // path of the html file
    String writePath = "E://work//lucene//test//data//txt//" + filename + ".txt";  // path of the txt file
    Url2Html url2html = new Url2Html();
    try {
        url2html.way(filePath, url);
    } catch (Exception e) {
        e.printStackTrace();
    }
    Html2Txt html2txt = new Html2Txt();
    String read = html2txt.readfile(filePath); // read the html file
    String txt = html2txt.getBody(read);       // remove the html tags
    System.out.println(txt);
    try {
        html2txt.writeTxt(txt, writePath);
    } catch (Exception e) {
        e.printStackTrace();
    }
}

After the program executes, "Silly Wolf School.html" and "Silly Wolf School.txt" are created in the two folders respectively.
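The readfile method called above does not appear in the original figures; the following is a minimal sketch of what it might look like inside Html2Txt, assuming the platform default encoding and imports from java.io.

```java
// A possible Html2Txt.readfile: load the saved HTML file into one string.
// Assumes imports: java.io.BufferedReader, java.io.FileReader, java.io.IOException.
public String readfile(String filePath) {
    StringBuilder sb = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append('\n'); // keep line breaks
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return sb.toString();
}
```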
4 Creating an index
The basic principles of indexing and querying are as follows:
Indexing: a search engine index is, concretely, the data structure that implements the "word-document matrix", and building it is the first step of full-text retrieval. Lucene provides the IndexWriter class for index management, mainly through addDocument(), deleteDocuments(), and updateDocument(). Weights can also be set: by giving index entries different weights, results can be returned by relevance at search time.
Search: the original, direct approach is to scan documents sequentially. Once an index has been built, the positions where a term occurs can be found by looking the term up in the index, and the documents containing it can then be returned. Lucene provides the IndexSearcher class for searching documents. Searches fall into two broad categories: the first is the Term query, which looks up a single term; the second is the Parser query, which accepts custom search expressions and supports more query forms. The specific methods are demonstrated later.
4.1 Experimental environment
The PC used runs Windows 10 x64 with 8 GB of memory and a 256 GB SSD. The development environment is MyEclipse 10, and the JDK version is 1.8. Because of syntax changes between versions, several classes were implemented against JDK 1.6 during the experiment.
4.2 Creating an index
Creating an index library means adding index records to it, and Lucene provides the interfaces for adding records and building the index.
Three classes are mainly involved: the index writer (IndexWriter), the document (Document), and the field (Field). To create an index, first construct a Document object and determine its fields. This resembles defining a table structure in a relational database: a Document corresponds to a row of the table, and a Field to a column of that row. In Lucene, different index/store rules can be chosen for a field according to its properties and output requirements. In this experiment, the file name fileName, the file path fullPath, and the text contents serve as the fields of each Document.
IndexWriter receives newly added documents and writes them into the index library. When creating the IndexWriter, the language analyzer to use must be specified. Index creation falls into two cases: first, the unweighted index; second, the weighted index.
public Indexer(String indexDir) throws Exception {
    Directory dir = FSDirectory.open(Paths.get(indexDir));
    Analyzer analyzer = new StandardAnalyzer(); // standard analyzer, for English
    // SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer(); // for Chinese
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    writer = new IndexWriter(dir, iwc);
}

Next, set the index fields; Store indicates whether the field's content is stored. fileName and fullPath take up little space, so they can be stored to make it easy to return them with query results.
private Document getDocument(File f) throws Exception {
    Document doc = new Document();
    doc.add(new TextField("contents", new FileReader(f)));               // index the file contents
    doc.add(new TextField("fileName", f.getName(), Store.YES));          // index and store the file name
    doc.add(new TextField("fullPath", f.getCanonicalPath(), Store.YES)); // index and store the path
    return doc;
}

When a file is indexed, the program prints "index file: " followed by the file path, and the total time spent indexing all the files is computed; the result of executing the main code is shown in the figure.
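The main code itself is not reproduced in the original. A minimal driver consistent with the description (printing each indexed path and timing the whole run), placed inside the Indexer class, might look like the sketch below; the index() method name and the index directory path are assumptions.

```java
// A sketch of driving the Indexer over the txt directory from section 3.
public int index(String dataDir) throws Exception {
    File[] files = new File(dataDir).listFiles();
    if (files != null) {
        for (File f : files) {
            System.out.println("index file: " + f.getCanonicalPath());
            writer.addDocument(getDocument(f)); // one Document per txt file
        }
    }
    return writer.numDocs(); // number of documents now in the index
}

public static void main(String[] args) throws Exception {
    long start = System.currentTimeMillis();
    Indexer indexer = new Indexer("E://work//lucene//test//index"); // hypothetical index dir
    int num = indexer.index("E://work//lucene//test//data//txt");
    indexer.writer.close();
    System.out.println("indexed " + num + " files in "
            + (System.currentTimeMillis() - start) + " ms");
}
```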
4.3 Deleting and updating the index
Operations on a database generally consist of CRUD (create, read, update, delete). Creation here means selecting fields and building index entries. Querying, as the core function, is discussed later. This section records the methods used to delete and update index entries.
Deletion comes in two forms, ordinary deletion and complete deletion, because deleting from the index affects the whole library; moreover, for a large system, deleting indexes means changing the system's underlying data, which is time-consuming, labor-intensive, and irreversible. When an index is first built, several small files are generated, and at search time these files are merged before being searched. Ordinary deletion merely marks a previously built entry so that it can no longer be found and returned. Complete deletion destroys the entry and cannot be revoked. Take deleting the entry whose "id" term is "1" as an example:
Ordinary deletion (mark only, before merging):

writer.deleteDocuments(new Term("id", "1"));
writer.commit();

Complete deletion (purged when segments are merged):
writer.deleteDocuments(new Term("id", "1"));
writer.forceMergeDeletes(); // force the deleted documents to be merged away
writer.commit();

The principle of updating an index is relatively simple: the new entry overwrites the original one. The implementation code is the same as adding an index, shown above, and is not repeated here.
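For reference, Lucene 5.x also offers IndexWriter.updateDocument, which deletes the documents matching a term and adds the replacement in one step; the sketch below is illustrative, with assumed field names.

```java
// A minimal update sketch: replace the document whose "id" term is "1".
Document doc = new Document();
doc.add(new StringField("id", "1", Store.YES));               // key field, not tokenized
doc.add(new TextField("contents", "revised text", Store.NO)); // new content
writer.updateDocument(new Term("id", "1"), doc);
writer.commit();
```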
4.4 Weighting of indexes
By default, Lucene sorts results by relevance. Lucene's Field provides a boost parameter that indicates the importance of a record: when the search conditions are met, records of high importance are given priority and placed at the top of the returned results, while records with low weight fall behind the first page when there are many hits. Weighting the index is therefore an important factor in how satisfying the returned results are. When designing a real information system, there should be a rigorous weight formula so that Field weights can be adjusted to better meet users' needs.
For example, a search engine ranks pages with high click-through rates and many inbound links on the first page of its results. The implementation code is shown in Figure 4-1, and the unweighted and weighted results are compared in Figures 4-2 and 4-3.
TextField field = new TextField("fullPath", f.getCanonicalPath(), Store.YES);
if ("A GREAT GRIEF.txt".equals(f.getName())) {
    field.setBoost(2.0f); // raise this file's weight; the default is 1.0, and any larger value increases it
}
doc.add(field);

Figure 4-1: Index weighting
Figure 4-2: Before weighting
Figure 4-3: After weighting
As Figures 4-2 and 4-3 show, the results are returned in dictionary order before weighting; after the boost is applied, the weighted file moves forward in the returned order, which verifies that the weight takes effect.
5 Performing queries
Lucene's search interface consists mainly of three classes: QueryParser, IndexSearcher, and the returned hits. QueryParser is the query parser, responsible for parsing the query keywords submitted by the user; when constructing a parser, the field to parse and the language analyzer must be specified, and this analyzer must be the same one used when the index library was built, otherwise the query results will be wrong. IndexSearcher is the index searcher; when instantiating it, the directory where the index library resides must be specified. IndexSearcher provides a search method that performs the search: it takes a Query as a parameter and returns the hits, sorted by score (a TopDocs object in Lucene 5.x; early versions returned a Hits collection). Each hit corresponds to a Document, and Document's get method retrieves information about the file behind that document, such as the file name, file path, and file contents.
5.1 Basic Query
As shown in the figure, there are two main ways to query. The first, constructing a QueryParser expression, is recommended because it allows flexible combinations, including Boolean logic expressions and fuzzy matching; the second, the Term query, can only look up a single term.
1. Construct the QueryParser query form:
QueryParser parser = new QueryParser("fullPath", analyzer);
Query query = parser.parse(q);

2. Query for a specific term:
Term t = new Term("fileName", q);
Query query = new TermQuery(t);

Take a query for file names (fileName) containing "big" as an example; the result is shown in Figure 5-1.
Figure 5-1: "Big" query results
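For context, a minimal end-to-end search that produces output along the lines of Figure 5-1 might look like the sketch below; the index directory path is an assumption, and the analyzer must match the one used at indexing time.

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Searcher {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("E://work//lucene//test//index")); // hypothetical index dir
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);

        Analyzer analyzer = new StandardAnalyzer(); // must match the indexing analyzer
        QueryParser parser = new QueryParser("fileName", analyzer);
        Query query = parser.parse("big");

        TopDocs hits = searcher.search(query, 10); // top 10 hits
        System.out.println("total hits: " + hits.totalHits);
        for (ScoreDoc sd : hits.scoreDocs) {
            Document d = searcher.doc(sd.doc); // fetch the stored fields
            System.out.println(d.get("fileName") + "  " + d.get("fullPath"));
        }
        reader.close();
    }
}
```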
5.2 Fuzzy query
When constructing the QueryParser query, exact or fuzzy matching is chosen by modifying the query term q: fuzzy matching is enabled by appending "~" to q. The result is shown in Figure 5-2:
Figure 5-2: Fuzzy Matching
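As a sketch, assuming the parser from section 5.1, the two lines below express the same fuzzy match through the parser syntax and through the FuzzyQuery API; the misspelled term "bg" is illustrative.

```java
Query q1 = parser.parse("bg~");                        // trailing "~" enables fuzzy matching
Query q2 = new FuzzyQuery(new Term("fileName", "bg")); // API form; default max edit distance is 2
```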
5.3 Range and conditional queries
Boolean logic queries and fuzzy queries only require changing the query term q, whereas range and conditional queries require constructing the query expression explicitly. They fall mainly into the following categories:
term-range search, numeric-range search, string-prefix search, and multi-condition search; the queries used for each are listed below. The true parameters indicate whether the lower and upper bounds are inclusive.
Specify a term range:

TermRangeQuery query = new TermRangeQuery("desc",
        new BytesRef("b".getBytes()), new BytesRef("c".getBytes()), true, true);

Specify a number range:
NumericRangeQuery<Integer> query = NumericRangeQuery.newIntRange("id", 1, 2, true, true);

Specify the beginning of a string:
PrefixQuery query = new PrefixQuery(new Term("city", "a"));

Multi-condition query:
NumericRangeQuery<Integer> query1 = NumericRangeQuery.newIntRange("id", 1, 2, true, true);
PrefixQuery query2 = new PrefixQuery(new Term("city", "a"));
BooleanQuery.Builder booleanQuery = new BooleanQuery.Builder();
booleanQuery.add(query1, BooleanClause.Occur.MUST);
booleanQuery.add(query2, BooleanClause.Occur.MUST);

5.4 Highlight query
In search engines such as Baidu and Google, the pages returned for a query display the query keywords in red and show a summary, that is, an excerpt of the content containing the keywords. A highlight query implements these style changes on the keywords. This experiment runs inside MyEclipse, so the returned results show no visible style change: HTML tags are simply wrapped around the keywords in the returned content, and the styling appears once the content is displayed on a web page.
The highlight configuration code is shown in Figure 5-3 and the results in Figure 5-4. The words matching "Nanjing" are wrapped in tags that render bold and red on a web page.
QueryScorer scorer = new QueryScorer(query);
Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);
SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<b><font color='red'>", "</font></b>");
Highlighter highlighter = new Highlighter(simpleHTMLFormatter, scorer);
highlighter.setTextFragmenter(fragmenter);

Figure 5-3: Highlight settings
Figure 5-4: Highlighted results
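To actually retrieve the highlighted fragment for a hit, the configured highlighter can be applied to the stored field text, as in the minimal sketch below. It assumes the "contents" field was stored at index time; note that the indexing code in section 4.2 indexes contents from a Reader without storing it, so the field would need Store.YES for this to work.

```java
// A minimal sketch: produce the highlighted fragment for each hit.
// getBestFragment may throw IOException or InvalidTokenOffsetsException.
for (ScoreDoc sd : hits.scoreDocs) {
    Document doc = searcher.doc(sd.doc);
    String text = doc.get("contents"); // null unless the field was stored
    if (text != null) {
        String fragment = highlighter.getBestFragment(analyzer, "contents", text);
        System.out.println(fragment); // keywords wrapped in <b><font color='red'>...</font></b>
    }
}
```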
6 Problems and shortcomings encountered during the experiment
Lucene versions are updated quickly, and the JDK version, the Eclipse version, and the Lucene version need to match well, otherwise many incompatibilities arise. Much of the debugging effort went into choosing between JDK 1.6 and JDK 1.8: for example, the append method used in the web-crawling code no longer worked under 1.8, while FSDirectory.open(), which takes the index path as a java.nio Path, requires the newer JDK to be supported.
The shortcomings of this experiment are mainly the following:
The code is not flexible. Web pages are crawled manually, and Chinese and English have to be processed separately. The code should be improved to detect the language of the page and automatically select and run the appropriate analyzer.
The code has low reusability, with no proper decomposition into classes and methods. For simplicity, the effect was achieved by annotating and adapting a few pieces of core code, which needs improvement.
The code has low portability: web crawling uses JDK 1.6 while the Lucene part uses JDK 1.8, so running it on another machine requires some reconfiguration of the environment; one-click operation is not possible.
7 Summary
Building on Lucene's principles, this article has worked through the ideas and methods of full-text search and tested the commonly used functions experimentally. During the experiment I learned how search engines work and gained hands-on experience that complements the information retrieval course. Lucene is an excellent open-source full-text search framework; studying it in depth made its implementation mechanism familiar, and along the way I absorbed many object-oriented programming methods and ideas. Its clean architecture and extensibility are well worth learning from.