Mastering the Use of the Robots.txt File: A Detailed Guide
A robots.txt file is a simple text file that website owners can use to tell search engine bots (crawlers) which pages or sections of their site should not be crawled. It is important to note that the robots.txt file is not a guarantee that a website will not be crawled or indexed by search engines, but it is a strong signal to well-behaved crawlers that certain pages should not be accessed.
How robots.txt works
The way robots.txt works is simple: when web crawlers or “bots” belonging to search engines such as Google, Bing, and Yahoo visit a website, they look for a file named “robots.txt” at the root of the site. If the file is present, the bots read the instructions in it and follow them while crawling the website. If the file is not present, the bots assume they are free to crawl the entire website.
Let's say your domain name is yourdomain.com; the robots.txt URL would then be:
http://yourdomain.com/robots.txt
The instructions provided in the robots.txt file are in the form of a set of “User-agent” and “Disallow” or “Allow” directives.
User-agent directive in robots.txt file
The User-agent line controls which crawler a group of rules applies to. With hundreds of crawlers potentially visiting your site, it can be useful to set different boundaries for each of them based on what they do, and User-agent lets you do this by identifying a specific crawler and applying separate instructions to it.
Some crawlers have their own names and variants, for example Googlebot-Image, Googlebot-News, and Googlebot-Video. You can find the user-agent name of each crawler on its official website.
You can also use the wildcard * to apply the same instructions to all web crawlers that visit your site.
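To see how user-agent targeting plays out, here is a small sketch using Python's built-in urllib.robotparser module against a hypothetical robots.txt with one group for Googlebot-Image and a catch-all group (the crawler names, paths, and domain are placeholders):

import urllib.robotparser

# Hypothetical rules: one group for Googlebot-Image, one catch-all group.
rules = """\
User-agent: Googlebot-Image
Disallow: /private-images/

User-agent: *
Disallow: /drafts/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot-Image matches its own group, so /private-images/ is off limits to it.
print(rp.can_fetch("Googlebot-Image", "http://yourdomain.com/private-images/photo.jpg"))  # False
# Other crawlers fall back to the * group and only lose access to /drafts/.
print(rp.can_fetch("SomeOtherBot", "http://yourdomain.com/drafts/post"))                  # False
print(rp.can_fetch("SomeOtherBot", "http://yourdomain.com/private-images/photo.jpg"))     # True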
Disallow: directive
The “Disallow” directive is used to specify which pages or sections of a website should not be accessed by web crawlers.
User-agent: *
Disallow: /admin
The above example will block all URLs whose path starts with “/admin”:
http://yourdomain.com/admin
http://yourdomain.com/admin?test=0
http://yourdomain.com/admin/somethings
http://yourdomain.com/admin-example-page-keep-them-out-of-search-results
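If you want to verify this prefix matching yourself, the sketch below uses Python's built-in urllib.robotparser module (the domain is a placeholder):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /admin"])

# Any path starting with /admin is blocked for every crawler...
print(rp.can_fetch("*", "http://yourdomain.com/admin"))               # False
print(rp.can_fetch("*", "http://yourdomain.com/admin/somethings"))    # False
print(rp.can_fetch("*", "http://yourdomain.com/admin-example-page"))  # False
# ...while unrelated paths remain crawlable.
print(rp.can_fetch("*", "http://yourdomain.com/blog"))                # True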
Allow: directive
The Allow directive is used to specify which parts of a website should be accessible to web crawlers or search engine bots, typically to make an exception inside a section that is otherwise disallowed.
User-agent: *
Allow: /some-directory/important-page
Disallow: /some-directory/
The above example will block the following URLs:
http://yourdomain.com/some-directory/
http://yourdomain.com/some-directory/everything-blocked-but
But it will not block any of the following:
http://yourdomain.com/some-directory/important-page
http://yourdomain.com/some-directory/important-page-its-someting
http://yourdomain.com/some-directory/important-page/anypage-here
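You can check this behaviour with Python's built-in urllib.robotparser as well. Note that this parser applies rules in the order they appear (first match wins), which gives the same result here only because the Allow line comes before the Disallow line; major crawlers such as Googlebot instead apply the most specific (longest) matching rule:

import urllib.robotparser

rules = [
    "User-agent: *",
    "Allow: /some-directory/important-page",
    "Disallow: /some-directory/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "http://yourdomain.com/some-directory/"))                        # False
print(rp.can_fetch("*", "http://yourdomain.com/some-directory/everything-blocked-but"))  # False
print(rp.can_fetch("*", "http://yourdomain.com/some-directory/important-page"))          # True
print(rp.can_fetch("*", "http://yourdomain.com/some-directory/important-page/anypage"))  # True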
Sitemap directive
The Sitemap directive can be used to specify the location of a sitemap for a website. A sitemap is an XML file that lists all of the URLs on a website and provides information about each URL, such as when it was last updated. This can be useful for search engines when they are crawling a website, as it allows them to find all of the URLs on the site more easily.
Example:
User-agent: *
Sitemap: http://yourdomain.com/sitemap.xml
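If you parse the file with Python's urllib.robotparser (Python 3.8 or newer), you can read back any Sitemap lines it found, which is a quick way to confirm the directive is picked up (the URL is a placeholder):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Sitemap: http://yourdomain.com/sitemap.xml",
])

# site_maps() returns the listed sitemap URLs, or None if there are none.
print(rp.site_maps())  # ['http://yourdomain.com/sitemap.xml']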
Crawl-delay: directive
The Crawl-delay directive can be used to specify the number of seconds that a web crawler should wait between requests to a website. This can be useful for preventing a website from being overwhelmed by too many requests from a single crawler.
Note: This directive is not part of the robots.txt standard, and not all web crawlers support it. In particular, Google does not support it; instead, it controls crawling rate through other means, such as the crawl rate settings in Google Search Console.
Example:
User-agent: *
Crawl-delay: 2
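For crawlers that do honour this directive, you can read the configured delay back with urllib.robotparser (Python 3.6 or newer) as a quick local check:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Crawl-delay: 2"])

# crawl_delay() returns the delay in seconds, or None if none is set for that agent.
print(rp.crawl_delay("*"))            # 2
print(rp.crawl_delay("AnyOtherBot"))  # 2 (falls back to the * group)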
The “*” wildcard (asterisk) in robots.txt
Wildcards can be used to specify a pattern of URLs that should be blocked or allowed for web crawlers. They can be used in both the Disallow and Allow directives. For example:
Disallow: /names/*/details
Below are URLs that will be blocked by the above directive:
http://yourdomain.com/names/ravi/details
http://yourdomain.com/names/rohit/account/details
http://yourdomain.com/names/aarush/details-about-something
http://yourdomain.com/names/varma/search?q=/details
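Python's urllib.robotparser does not support the * wildcard, so the sketch below simply models the matching logic by translating a pattern into a regular expression. This is a simplified illustration of how wildcard-aware crawlers interpret such rules, not any crawler's actual implementation:

import re

def wildcard_rule_to_regex(rule):
    # Simplified model: '*' matches any sequence of characters and the
    # pattern is matched from the start of the URL path.
    return re.compile("^" + ".*".join(re.escape(part) for part in rule.split("*")))

blocked = wildcard_rule_to_regex("/names/*/details")

for path in [
    "/names/ravi/details",                    # blocked
    "/names/rohit/account/details",           # blocked
    "/names/aarush/details-about-something",  # blocked
    "/names/varma/search?q=/details",         # blocked
    "/names/ravi/profile",                    # not blocked: no '/details' after the name
]:
    print(path, "blocked" if blocked.match(path) else "allowed")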
End-of-string operator “$” (Dollar sign)
The dollar sign $ can be used to indicate the end of a URL. This can be useful in cases where you want to block a specific file type or extension.
User-agent: *
Disallow: /junk-page$
Disallow: /*.pdf$
With the above example, any URL whose path ends in “.pdf”, as well as the exact URL “/junk-page”, will be blocked.
But it will not block any of the following:
http://yourdomain.com/junk-page-and-how-to-avoid-creating-them
http://yourdomain.com/junk-page/
http://yourdomain.com/junk-page?a=b
What if you want to block all URLs that contain a dollar sign?
http://yourdomain.com/store?price=$50
The following will not work:
Disallow: /*$
This directive will actually block everything on your website, because the trailing dollar sign is treated as an end-of-URL anchor rather than a literal character. To avoid this mistake, place an extra asterisk after the dollar sign so that it is matched literally:
Disallow: /*$*
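The same simplified regex model can illustrate how the dollar-sign anchor behaves, including the /*$* trick above. Again, this is only a sketch of the matching rules, not a crawler's real implementation:

import re

def rule_to_regex(rule):
    # Simplified model: '*' matches anything, and a trailing '$'
    # anchors the pattern to the end of the URL.
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    regex = "^" + ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + "$" if anchored else regex)

pdf_rule    = rule_to_regex("/*.pdf$")
junk_rule   = rule_to_regex("/junk-page$")
dollar_rule = rule_to_regex("/*$*")

print(bool(pdf_rule.match("/files/report.pdf")))      # True  -> blocked
print(bool(pdf_rule.match("/files/report.pdf?x=1")))  # False -> URL does not end in .pdf
print(bool(junk_rule.match("/junk-page")))            # True  -> blocked
print(bool(junk_rule.match("/junk-page/")))           # False -> not blocked
print(bool(dollar_rule.match("/store?price=$50")))    # True  -> literal $ matched anywhere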
Common Robots.txt Configuration Mistakes
Not placing the robots.txt file in the correct location
Placing the robots.txt file anywhere other than the site root will result in it being ignored by search engines. If you do not have access to the site root, you can block pages using alternative methods such as the robots meta tag or the X-Robots-Tag HTTP header, configured in your .htaccess file (or your server's equivalent).
Blocking subdomains in robots.txt
Trying to target specific subdomains from a single robots.txt file is a common mistake. A robots.txt file applies only to the host it is served from and does not affect any subdomains. To control crawling on a subdomain, you need to create a separate robots.txt file and place it at the root of that subdomain.
Case Consistency in Robots.txt
Robots.txt matching is case-sensitive where it matters most: the URL paths. “Disallow: /Admin” and “Disallow: /admin” match different URLs, and the file itself must be named “robots.txt” in all lowercase. Directive names such as “User-agent” and “Disallow” are generally treated case-insensitively by major crawlers, so the real risk is path casing that does not match the actual URLs on your site.
Forgetting the user-agent line
The “User-agent” line tells search engines which crawlers the rules in the file apply to. Without this line, search engines will not know which rules to follow, and may ignore the entire file.
How to test a robots.txt file?
Testing the robots.txt file can be done with web-based tools like Google Search Console and Bing Webmaster Tools, which allow you to enter a URL and see whether it is allowed or disallowed. If you have technical knowledge, you can also use Google's open-source robots.txt library to test the file locally on your computer.
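As a minimal local alternative, Python's built-in urllib.robotparser can download and evaluate a live robots.txt file (the domain and paths below are placeholders). Keep in mind that this parser follows the original specification and does not support the * and $ wildcard syntax described above, so web-based testers or Google's library are more reliable for such rules:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://yourdomain.com/robots.txt")
rp.read()  # downloads and parses the file

print(rp.can_fetch("Googlebot", "http://yourdomain.com/admin"))
print(rp.can_fetch("*", "http://yourdomain.com/some-directory/important-page"))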
Meta Robots Tag
Robots.txt is not the sole method of communicating with web crawlers. Alternative methods include the Meta Robots tag and the X-Robots-Tag HTTP header, which specify indexing instructions (such as noindex) for specific pages or sections of a website.