Simple version of a site search
A search engine built on campus news
- Implementation idea: crawl all news pages from the campus network and store them in a MySQL database; segment the titles stored in the database and turn the segmentation results into an index table. When a query is entered, it is segmented the same way, the resulting words are matched against the word list in the database, the matching entries are mapped back to their URLs, and the results are returned.
Development Environment
Dependency library
- pymysql: the interface between Python and MySQL
- jieba: a Python library for Chinese word segmentation
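For orientation, a minimal sketch of how the two libraries are typically used together; the connection parameters and the sample title below are placeholders, not values taken from this project:

```python
import jieba
import pymysql

# Hypothetical connection parameters; adjust to your own MySQL setup.
conn = pymysql.connect(host="localhost", user="root",
                       password="password", database="mytable",
                       charset="utf8mb4")

title = "校园新闻搜索引擎简易版"
# jieba.lcut returns the segmentation result as a list of words.
words = jieba.lcut(title)
print(words)

conn.close()
```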
Overall architecture
The crawler uses the Scrapy framework to crawl the news site of Liaoning University of Engineering and Technology. The main parts of the Scrapy project are described below:
- IntuSpider.py: the main extraction logic for page information. Nested, depth-first recursive calls parse all HTML news pages of Liaoning University of Technology, extract the required information (title, url), and save it into the item object. The framework then hands the stored objects to pipeline.py for further processing. The spider parses pages with XPath.
- items.py: defines the item (the fields) to be crawled.
- pipeline.py: stores the scraped items in the MySQL database through the MySQL interface; the database fields are title and url. The other files are configuration files with almost no changes, and the changed locations are commented. This concludes the crawler part; a sketch of how the spider, items, and pipeline fit together is given below.
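The following is a hedged sketch of how the three parts fit together under Scrapy; the start URL, the XPath expressions, the item and pipeline class names, and the MySQL connection settings are illustrative assumptions, not the repository's actual code:

```python
# items.py -- the item only carries a title and a url, as described above.
import scrapy

class IntuItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()


# IntuSpider.py -- crawl the news site; the start URL and XPath
# expressions below are placeholders, not the ones used in this repo.
class IntuSpider(scrapy.Spider):
    name = "Intu"
    start_urls = ["http://news.example-university.edu.cn/"]

    def parse(self, response):
        # Extract title/url pairs from the news links on the current page.
        for link in response.xpath("//a[contains(@href, 'news')]"):
            item = IntuItem()
            item["title"] = link.xpath("string(.)").get()
            item["url"] = response.urljoin(link.xpath("@href").get())
            yield item
        # Follow links to further pages; Scrapy's dupefilter avoids revisits.
        for href in response.xpath("//a/@href").getall():
            yield response.follow(href, callback=self.parse)


# pipeline.py -- insert each item into MySQL; register this class in
# settings.py via ITEM_PIPELINES. Table and column names are assumptions.
import pymysql

class MysqlPipeline:
    def open_spider(self, spider):
        self.conn = pymysql.connect(host="localhost", user="root",
                                    password="password", database="mytable",
                                    charset="utf8mb4")

    def process_item(self, item, spider):
        with self.conn.cursor() as cur:
            cur.execute("INSERT INTO intu (title, url) VALUES (%s, %s)",
                        (item["title"], item["url"]))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```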
The overall idea of the search engine: segment the titles stored in the database and build a keyword index; then, based on keyword frequency, build an index table of keywords and their occurrences. The main files are described below:
- Intu.py: creates the database tables, takes the data crawled by the crawler, performs word segmentation, and stores the results in the forward and backward (inverted) tables respectively.
- forward.py: the forward table. Defines the class forwardIndexTableItem, which specifies the fields of a table row, and the forward table class forwardIndexTable, in which the titles are segmented and stored in the database table.
- lexicon: word segmentation dictionary. Defines operations to obtain a word's ID from the word, obtain a word from its ID, build the word list, and load the word list.
- backwardList: the backward (inverted) table, which processes the data in the forward table. Its main function is to take the contents of the forward table and store each word's ID together with the docIDs of the news titles containing it, as a collection, in the database.
- linesEngine: the search engine class. Run this file directly and it returns the titles and URLs matching the words you enter. The core idea is to segment the input, look up the corresponding titles by keyword, rank them by the number of keyword hits, and print the top 10 results. Sketches of the index-building and query steps follow below.
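To make the forward/backward split concrete, here is a hedged sketch of how the index side could be built. All table and column names, and the in-memory structures, are illustrative assumptions rather than the definitions actually used in forward.py, lexicon, and backwardList:

```python
import jieba
import pymysql
from collections import defaultdict

conn = pymysql.connect(host="localhost", user="root", password="password",
                       database="mytable", charset="utf8mb4")

word_to_id = {}                 # lexicon: word -> wordID
forward = {}                    # forward table: docID -> list of wordIDs
backward = defaultdict(set)     # backward (inverted) table: wordID -> set of docIDs

with conn.cursor() as cur:
    # The id column of the crawled table is an assumption.
    cur.execute("SELECT id, title FROM intu")
    for doc_id, title in cur.fetchall():
        word_ids = []
        for word in jieba.lcut(title):
            # Assign a new wordID the first time a word is seen.
            word_id = word_to_id.setdefault(word, len(word_to_id) + 1)
            word_ids.append(word_id)
            backward[word_id].add(doc_id)
        forward[doc_id] = word_ids

# The real project persists these structures into MySQL tables;
# they are kept in memory here only to show how they relate.
conn.close()
```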
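A matching sketch of the query side, the role linesEngine plays: segment the query with jieba, count keyword hits per document via the inverted table, and print the top 10 titles and URLs. The backward_index and lexicon table names and columns are again assumptions:

```python
import jieba
import pymysql
from collections import Counter

def search(query, conn, top_n=10):
    hits = Counter()
    with conn.cursor() as cur:
        for word in set(jieba.lcut(query)):
            # Look up which documents contain this keyword; the
            # backward_index and lexicon tables are hypothetical names.
            cur.execute(
                "SELECT doc_id FROM backward_index "
                "JOIN lexicon ON backward_index.word_id = lexicon.word_id "
                "WHERE lexicon.word = %s", (word,))
            for (doc_id,) in cur.fetchall():
                hits[doc_id] += 1           # rank by number of keyword hits
        for doc_id, count in hits.most_common(top_n):
            cur.execute("SELECT title, url FROM intu WHERE id = %s", (doc_id,))
            row = cur.fetchone()
            if row:
                print(count, *row)

if __name__ == "__main__":
    conn = pymysql.connect(host="localhost", user="root", password="password",
                           database="mytable", charset="utf8mb4")
    search(input("Enter your query: "), conn)
    conn.close()
```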
Notes and shortcomings:
- First of all, the crawler is static: after one crawl the results are stored in the database and are not updated in real time as the web pages change. If a duplicate title already exists in the database, inserting the data will fail, so the table has to be cleared and re-crawled.
- The database content is fixed. If a query keyword is not indexed in the database, the search returns no results.
- The hit rate of the search depends on the accuracy of jieba's word segmentation. The crawler is not very efficient: it took nearly 5 hours to crawl about 30,000 records. The layout of the site is very regular and the news pages I crawled contain no duplicates, so the URL deduplication is only a safeguard against duplicate data; a plain Python list is used for deduplication (see the note after this list).
- In the future, a web page may be added so that searches can be made through a web interface.
- The crawler and the search are independent, so other news sites can also be indexed by replacing only the crawler part.
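On the deduplication point above: membership tests on a Python list are linear scans, so a set is the more common choice; a minimal sketch of the idea:

```python
seen_urls = set()

def is_new(url):
    # Returns True the first time a URL is seen, False for duplicates.
    if url in seen_urls:
        return False
    seen_urls.add(url)
    return True
```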
How to use
- Prepare the environment: Python 3, the Scrapy framework, pymysql, the jieba word-segmentation library, and MySQL. In MySQL, create the mytable database and the intu data table (a setup sketch is given at the end of this section).
- First git clone the repository into a directory of your choice.
- Open a console under Windows, change into the project folder, and run scrapy crawl Intu.
- Wait for the crawl to finish.
- Run the seachEngine.py file and enter the text you want to query.
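For the database preparation mentioned in the first step, a minimal sketch in Python of creating the mytable database and the intu table; the column types are assumptions, since the project only documents the title and url fields:

```python
import pymysql

# Connect without selecting a database so that it can be created first;
# credentials are placeholders.
conn = pymysql.connect(host="localhost", user="root", password="password",
                       charset="utf8mb4")
with conn.cursor() as cur:
    cur.execute("CREATE DATABASE IF NOT EXISTS mytable CHARACTER SET utf8mb4")
    cur.execute("USE mytable")
    # Column types are assumptions; only title and url are documented fields.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS intu (
            id INT AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(255),
            url VARCHAR(512)
        )
    """)
conn.commit()
conn.close()
```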