When implementing anti-collection measures, you need to consider whether they will also interfere with search engines crawling your site, so let's first compare how ordinary collectors (scrapers) and search engine crawlers fetch content.
Similarities:
a. Both must fetch the page's HTML source directly in order to work;
b. Both request a large amount of the site's content many times per unit of time;
c. Viewed over time, both access the site from changing IP addresses;
d. Neither is patient enough to crack the obstacles you put on your pages, for example content obfuscated by JavaScript, content that requires a verification code, or content that requires a login.
Differences:
A search engine crawler first strips out the scripts, styles, and HTML tags from the page source, then runs the remaining text through a series of complex steps such as word segmentation and grammatical and syntactic analysis. A collector, by contrast, usually locates the data it needs through the structure of the HTML itself: when writing a collection rule you specify a start marker and an end marker for the target content, or you write a regular expression tailored to a specific page to filter out what you need. Either way, start/end markers or regular expressions, the collector has to analyze the page's HTML structure.
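To make the collector's approach concrete, here is a minimal Python sketch (not from the original article) that pulls out the text between a start marker and an end marker with a regular expression; the URL and the marker strings are made-up assumptions for illustration.

```python
import re
import urllib.request

# Hypothetical target page and content markers -- placeholders for illustration.
URL = "http://example.com/article/1.html"
START_MARK = '<div class="article-body">'
END_MARK = "</div>"

html = urllib.request.urlopen(URL, timeout=10).read().decode("utf-8", errors="ignore")

# Non-greedy match between the start and end markers, spanning newlines (re.S).
pattern = re.escape(START_MARK) + r"(.*?)" + re.escape(END_MARK)
match = re.search(pattern, html, re.S)
if match:
    # Strip the remaining HTML tags to leave plain text.
    text = re.sub(r"<[^>]+>", "", match.group(1)).strip()
    print(text)
```

A search engine crawler, by contrast, discards the tags wholesale and analyzes the remaining text, which is why the two react differently to the measures below.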
With that in mind, here are some anti-collection methods.
1. Limit the number of requests a single IP address can make per unit time
Analysis: No ordinary visitor requests the same site five times in one second; only a program does that, and the programs with this habit are search engine crawlers and unwelcome collectors (a minimal server-side sketch of this check follows this item).
Disadvantages: It is a blunt instrument that will also stop search engines from indexing the site.
Applicable websites: Websites that do not rely heavily on search engines
What the collector will do: Reduce the number of requests per unit time, at the cost of collection efficiency
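As a concrete illustration of this method, here is a minimal per-IP rate-limiting sketch in Python, assuming a simple in-memory sliding window; the threshold of 5 requests per second and the function name are illustrative assumptions, not any particular server's API.

```python
import time
from collections import defaultdict, deque

# Illustrative limit: more than 5 requests in 1 second from one IP is suspicious.
MAX_REQUESTS = 5
WINDOW_SECONDS = 1.0

_recent = defaultdict(deque)  # ip -> timestamps of that IP's recent requests

def allow_request(ip: str) -> bool:
    """Return True if this IP is under the per-second limit, False otherwise."""
    now = time.monotonic()
    timestamps = _recent[ip]
    # Drop timestamps that have fallen outside the sliding window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    if len(timestamps) >= MAX_REQUESTS:
        return False  # over the limit: reject or delay this request
    timestamps.append(now)
    return True
```

In practice such a check would sit in the web server or a reverse proxy, and known search engine crawlers would need to be whitelisted (by user agent plus reverse DNS) to soften the drawback noted above.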
2. Block IP addresses
Analysis: Use background counters to record visitor IPs and their request frequency, analyze the access logs manually, and block suspicious IP addresses.
Disadvantages: Hardly any, except that it keeps the webmaster busy.
Applicable websites: All websites, and it also lets the webmaster see which visitors are Google or Baidu robots
What the collector will do: Fight guerrilla warfare! Rotate through IP proxies on every request, which lowers both collection efficiency and network speed (a proxy-rotation sketch follows this item).
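From the collector's side, the "guerrilla warfare" usually means rotating proxies. Here is a minimal Python sketch, assuming a hypothetical pool of proxy addresses (the placeholder IPs below are not real endpoints):

```python
import itertools
import urllib.request

# Hypothetical proxy pool -- addresses are placeholders for illustration.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]
_proxy_cycle = itertools.cycle(PROXIES)

def fetch_via_rotating_proxy(url: str) -> bytes:
    """Fetch a URL through the next proxy in the pool."""
    proxy = next(_proxy_cycle)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    return opener.open(url, timeout=10).read()
```

Each request appears to come from a different address, so per-IP blocking only slows the collector down rather than stopping it.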
3. Use JavaScript to encrypt (obfuscate) web content
Note: I have never encountered this method myself; I have only read about it elsewhere.
Analysis: There is nothing to analyze; it shuts out search engine crawlers and collectors alike.
Applicable websites: Websites that want nothing to do with either search engines or collectors
What the collector will do: If you are determined enough to give up search engine traffic entirely, the collector simply will not bother collecting from you.
4. Hide the website copyright notice or random junk text inside the page, with the styles that hide it written in an external CSS file
Analysis: Although this cannot prevent collection, it fills the collected content with your copyright statement or junk text. Collectors generally do not fetch your CSS files along with the pages, so when the content is republished the hidden text loses its styling and is displayed (a small sketch of the idea follows this item).
Applicable websites: all websites
What the collector will do: Copyright text is easy to find and replace; there is not much to be done about random junk text except to clean it up diligently.
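A small sketch of the hidden-text idea, in Python for consistency with the other examples; the CSS class name and the junk phrases are assumptions for illustration, and the class is hidden only by a rule in the site's external stylesheet (e.g. `.cpnote { display: none; }`):

```python
import random

# Class name and junk phrases are illustrative assumptions.  The class is
# hidden by the site's external CSS file, which collectors rarely fetch.
HIDDEN_CLASS = "cpnote"
JUNK_PHRASES = [
    "Content copyright example.com, all rights reserved.",
    "Original article from example.com.",
]

def sprinkle_hidden_text(paragraphs):
    """Insert hidden copyright/junk spans between paragraphs of article HTML."""
    out = []
    for p in paragraphs:
        out.append("<p>%s</p>" % p)
        # Invisible on your site (CSS hides it), visible once the page is
        # collected and republished without the stylesheet.
        out.append('<span class="%s">%s</span>'
                   % (HIDDEN_CLASS, random.choice(JUNK_PHRASES)))
    return "\n".join(out)

print(sprinkle_hidden_text(["First paragraph.", "Second paragraph."]))
```

On your own site the spans stay invisible because the stylesheet hides them; a collector that grabs only the HTML republishes them in plain view.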
5. Require users to log in to access website content
Analysis: Search engine crawlers will not implement a login procedure for every kind of website, so this shuts them out. I have heard, however, that a collector can be written to simulate user login and form submission for a specific site.
Applicable websites: Websites that do not mind losing search engines and want to block most collectors
What the collector will do: Build a module that simulates user login and form submission (a minimal sketch follows this item)
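A minimal sketch of such a module using only the Python standard library; the login URL and the form field names are assumptions, since a real collector would read them from the target site's actual login form.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

# Login URL, protected URL, and form field names are assumptions for illustration.
LOGIN_URL = "http://example.com/login"
PROTECTED_URL = "http://example.com/members/article/1.html"

# Keep cookies across requests so the session survives the login.
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Submit the login form the same way a browser would.
form = urllib.parse.urlencode({"username": "demo", "password": "secret"}).encode()
opener.open(LOGIN_URL, data=form, timeout=10)

# The session cookie is now stored in the jar; fetch the protected content.
html = opener.open(PROTECTED_URL, timeout=10).read().decode("utf-8", errors="ignore")
print(html[:200])
```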
6. Use a scripting language to generate pagination (hide the paging links)
Analysis: Again, search engine crawlers will not analyze every website's hidden pagination, which hurts the site's indexing. When writing collection rules, however, the collector has to analyze the target page's code anyway, and anyone with a little scripting knowledge can work out the real paging URLs.
Applicable websites: Websites that are not highly dependent on search engines, and whose would-be collectors lack scripting knowledge
What the collector will do: The collector has to analyze your page code anyway, so working out your paging script along the way costs little extra time (a sketch of recovering script-generated paging links follows this item).
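Here is a sketch of how a collector might recover the real paging URLs from script-generated pagination; the `gotoPage` function and the URL pattern in the embedded fragment are invented for illustration.

```python
import re

# Hypothetical page fragment: paging is driven by a script call instead of
# plain <a href> links.  Function name and URL pattern are assumptions.
html = """
<a href="javascript:gotoPage(2)">2</a>
<a href="javascript:gotoPage(3)">3</a>
<script>
function gotoPage(n) { location.href = '/news/list.php?page=' + n; }
</script>
"""

# Step 1: read the script to learn how the real URL is built.
template = re.search(r"location\.href\s*=\s*'([^']+)'", html).group(1)

# Step 2: collect the page numbers passed to the script call.
pages = re.findall(r"gotoPage\((\d+)\)", html)

# Step 3: reconstruct the real paging URLs.
for n in pages:
    print(template + n)
```

Once the URL pattern is known, the "hidden" pagination is no different from ordinary links as far as the collection rule is concerned.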