Basic introduction to robots.txt
robots.txt is a plain text file in which a website administrator can declare which parts of the site should not be accessed by robots, or specify which content search engines should index.
When a search robot (sometimes called a search spider) visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the robot determines its scope of access from the contents of that file; if the file does not exist, the robot simply crawls along the site's links.
In addition, robots.txt must be placed in the root directory of the site, and the file name must be all lowercase.
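This check can be simulated with Python's standard urllib.robotparser module. The sketch below parses a hypothetical robots.txt (the rules and example.com URLs are invented for illustration) and asks whether given paths may be fetched:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

# Any robot may fetch public pages, but not anything under /private/
print(rp.can_fetch("*", "http://example.com/index.html"))      # True
print(rp.can_fetch("*", "http://example.com/private/a.html"))  # False
```

In a real crawler you would call rp.set_url(...) and rp.read() to fetch the live file instead of parsing a hard-coded list.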
robots.txt syntax
First, let's look at a robots.txt example: http://www.csswebs.org/robots.txt
Visiting that address shows the following contents:
# Robots.txt file from http://www.csswebs.org
# All robots will spider the domain
User-agent: *
Disallow:
The text above means that all search robots may access every file on the www.csswebs.org site.
Syntax breakdown: text following a # is a comment; User-agent: is followed by the name of a search robot, where * means all search robots; Disallow: is followed by a file or directory that must not be accessed.
Below, I will list some specific usages of robots.txt:
Allow all robots access:
User-agent: *
Disallow:
Alternatively, you can create an empty /robots.txt file.
Block all search engines from any part of the site:
User-agent: *
Disallow: /
Block all search engines from several parts of the site (directories 01, 02, and 03 in this example):
User-agent: *
Disallow: /01/
Disallow: /02/
Disallow: /03/
Block access by a specific search engine (BadBot in this example):
User-agent: BadBot
Disallow: /
Allow access by only one search engine (Crawler in this example):
User-agent: Crawler
Disallow:
User-agent: *
Disallow: /
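The rule sets above can be verified with urllib.robotparser. For instance, the "allow only one search engine" rules behave as follows (Crawler and BadBot are the placeholder names from the text; example.com is an illustrative URL):

```python
from urllib.robotparser import RobotFileParser

# The "allow only Crawler" rule set from the examples above
rules = [
    "User-agent: Crawler",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# Crawler matches its own record (empty Disallow = allow everything);
# every other robot falls through to the * record, which blocks the whole site.
print(rp.can_fetch("Crawler", "http://example.com/page.html"))  # True
print(rp.can_fetch("BadBot", "http://example.com/page.html"))   # False
```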
In addition, it is worth extending this introduction to cover the Robots META tag:
Robots META tags target specific pages. Like other META tags (such as those for language, page description, or keywords), the Robots META tag is placed in the page's <head></head> section; it tells search engines how to crawl and index that page's content.
How to write Robots META tags:
The Robots META tag is case-insensitive. name=Robots addresses all search engines; to address a specific one, write its name instead, e.g. name=BaiduSpider. The content part takes four directives: index, noindex, follow, and nofollow, separated by commas.
The INDEX directive tells the search robot to index the page;
The FOLLOW directive tells the search robot that it may continue crawling the links on the page;
The default values of the Robots META tag are INDEX and FOLLOW for every engine except Inktomi, whose defaults are INDEX and NOFOLLOW.
In this way, there are four combinations:
<META NAME=ROBOTS CONTENT=INDEX,FOLLOW>
<META NAME=ROBOTS CONTENT=NOINDEX,FOLLOW>
<META NAME=ROBOTS CONTENT=INDEX,NOFOLLOW>
<META NAME=ROBOTS CONTENT=NOINDEX,NOFOLLOW>
Among these:
<META NAME=ROBOTS CONTENT=INDEX,FOLLOW> can be written as <META NAME=ROBOTS CONTENT=ALL>;
<META NAME=ROBOTS CONTENT=NOINDEX,NOFOLLOW> can be written as <META NAME=ROBOTS CONTENT=NONE>
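To see how a crawler might read these tags, here is a minimal sketch using Python's standard html.parser module to pull the robots directives out of a page's head (the sample HTML and the RobotsMetaParser class are invented for illustration):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the content of any <meta name=robots> tag (case-insensitive)."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values
        if tag == "meta":
            d = dict(attrs)
            if d.get("name", "").lower() == "robots":
                # Split "NOINDEX,FOLLOW" into normalized directives
                self.directives += [
                    v.strip().lower() for v in d.get("content", "").split(",")
                ]

html = '<html><head><META NAME=ROBOTS CONTENT="NOINDEX,FOLLOW"></head><body></body></html>'
parser = RobotsMetaParser()
parser.feed(html)
print(parser.directives)  # ['noindex', 'follow']
```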
At present, most search-engine robots abide by the rules in robots.txt. Support for the Robots META tag is less widespread, but it is growing: the well-known search engine Google, for example, fully supports it, and has also added an archive directive to control whether Google keeps a snapshot (cached copy) of the page. For example:
<META NAME=googlebot CONTENT=index,follow,noarchive>