
Mastering the Use of the Robots.txt File: A Detailed Guide

The robots.txt file is a simple text file that website owners can use to tell search engine bots (crawlers) which pages or sections of their site should not be crawled. It is important to note that robots.txt does not guarantee that a website will not be crawled or indexed by search engines, but it is a strong indication that certain pages should not be accessed.

How robots.txt works

The way robots.txt works is simple: when web crawlers, or “bots”, belonging to search engines such as Google, Bing, and Yahoo visit a website, they look for a file named “robots.txt” at the root of that site. If the file is present, the bots read the instructions it contains and follow them while crawling the website. If the file is not present, the bots assume they are free to crawl the entire website.

Let’s say your domain name is yourdomain.com; then the robots.txt URL should be:

http://yourdomain.com/robots.txt

The instructions provided in the robots.txt file are in the form of a set of “User-agent” and “Disallow” or “Allow” directives.
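
For example, a minimal robots.txt file combining these directives (the paths and sitemap URL here are purely illustrative) might look like this:

User-agent: *
Disallow: /private/
Allow: /private/public-page
Sitemap: http://yourdomain.com/sitemap.xml

Each of these directives is explained in detail below.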

User-agent directive in robots.txt file

The User-agent directive is useful for managing how different web crawlers access your website. With hundreds of crawlers potentially visiting your site, it can be beneficial to set specific boundaries for each of them. The User-agent line identifies a specific crawler so that you can apply different instructions to each one.

You can also use User-agent to target specific web crawlers and give them different instructions, depending on each crawler’s capabilities and behavior. Some crawlers also have their own names and formats, for example Googlebot-Image, Googlebot-News, and Googlebot-Video. You can find the user-agent name of each crawler on its official website.

Also, you can use the wildcard * to apply the instructions to all web crawlers that visit your site.
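
For example, the following file (with illustrative paths) gives Google’s image crawler its own rule while all other crawlers share a separate group:

User-agent: Googlebot-Image
Disallow: /photos/

User-agent: *
Disallow: /admin

Here Googlebot-Image is kept out of /photos/, while every other crawler is only kept out of /admin.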

Disallow: directive

The “Disallow” directive is used to specify which pages or sections of a website should not be accessed by web crawlers.

User-agent: *
Disallow: /admin

The above example will block all URLs whose path starts with “/admin”:

http://yourdomain.com/admin
http://yourdomain.com/admin?test=0
http://yourdomain.com/admin/somethings
http://yourdomain.com/admin-example-page-keep-them-out-of-search-results

Allow: directive

The Allow directive is used to specify which parts of a website should remain accessible to web crawlers, and is typically used to carve out exceptions inside a section that is otherwise blocked by a Disallow directive.

User-agent: *
Allow: /some-directory/important-page
Disallow: /some-directory/

The above example will block the following URLs:

http://yourdomain.com/some-directory/
http://yourdomain.com/some-directory/everything-blocked-but

But it will not block any of the following:

http://yourdomain.com/some-directory/important-page
http://yourdomain.com/some-directory/important-page-its-something
http://yourdomain.com/some-directory/important-page/anypage-here

Sitemap directive

The Sitemap directive can be used to specify the location of a sitemap for a website. A sitemap is an XML file that lists all of the URLs on a website and provides information about each URL, such as when it was last updated. This can be useful for search engines when they are crawling a website, as it allows them to find all of the URLs on the site more easily.

Example:

User-agent: *
Sitemap: http://yourdomain.com/sitemap.xml

Crawl-delay: directive

The Crawl-delay directive can be used to specify the number of seconds that a web crawler should wait between requests to a website. This can be useful for preventing a website from being overwhelmed by too many requests from a single crawler.

Note: This directive is not part of the robots.txt standard and not all web crawlers support it. In particular, Google does not support it; instead, it controls crawling speed through other mechanisms, such as the crawl rate setting in Google Search Console.

Example:

User-agent: *
Crawl-delay: 2

Wildcard “*” (asterisk) in robots.txt

Wildcards can be used to specify a pattern of URLs that should be blocked or allowed, and they work in both the Disallow and Allow directives. For example:

Disallow: /names/*/details

The following URLs will be blocked by the above directive:

http://yourdomain.com/names/ravi/details
http://yourdomain.com/names/rohit/account/details
http://yourdomain.com/names/aarush/details-about-something
http://yourdomain.com/names/varma/search?q=/details

End-of-string operator “$” (Dollar sign)

The dollar sign $ can be used to indicate the end of a URL. This can be useful in cases where you want to block a specific file type or extension.

User-agent: *
Disallow: /junk-page$
Disallow: /*.pdf$

In the above example, any URL ending in “.pdf” and the exact URL “/junk-page” will be blocked.

But it will not block any of the following:

http://yourdomain.com/junk-page-and-how-to-avoid-creating-them
http://yourdomain.com/junk-page/
http://yourdomain.com/junk-page?a=b

What if you want to block all URLs that contain a dollar sign?

http://yourdomain.com/store?price=$50

The following will not work:

Disallow: /*$

This directive will actually block everything on your website, because the trailing $ simply marks the end of the pattern and /* already matches every URL. To avoid this mistake, place an extra asterisk after the dollar sign so that the $ is matched as a literal character:

Disallow: /*$*

Common Robots.txt Configuration Mistakes

Not placing the robots.txt file in the correct location

Placing the robots.txt file anywhere other than the site root will result in it being ignored by search engines. If you do not have access to the site root, you can block pages using alternative methods, such as robots meta tags or the X-Robots-Tag header set via the .htaccess file (or its equivalent on other servers).
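
As an illustration, a minimal sketch of the X-Robots-Tag approach on an Apache server (assuming mod_headers is enabled; the file pattern is only an example) could look like this in .htaccess:

# Ask crawlers not to index or follow links in any PDF file
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>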

Blocking subdomains in robots.txt

Trying to target specific subdomains from a single robots.txt file is a common mistake. A robots.txt file only applies to the host it is served from and does not affect any subdomains. To control crawling on a subdomain, you need to create a separate robots.txt file for that subdomain and place it in the root directory of that subdomain.
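
For example (blog is a hypothetical subdomain), each host needs its own file at its own root:

http://yourdomain.com/robots.txt        applies only to yourdomain.com
http://blog.yourdomain.com/robots.txt   applies only to blog.yourdomain.com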

Case Consistency in Robots.txt

Robots.txt is case-sensitive where it matters most: the file itself must be named “robots.txt” in lowercase, and the paths in Disallow and Allow rules are matched case-sensitively, so “/Admin” and “/admin” are treated as different URLs. Keep the casing of your rules consistent with the actual URLs on your site.
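
For example, the following rule (with an illustrative path) blocks /Admin but does not block /admin:

User-agent: *
Disallow: /Admin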

Forgetting the user-agent line

The “User-agent” line tells search engines which crawlers the rules in the file apply to. Without this line, search engines will not know which rules to follow, and may ignore the entire file.
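
As a sketch (the path is illustrative), the first rule below has no User-agent line and may be ignored, while the second attaches the same rule to all crawlers:

# Incomplete: no User-agent line, crawlers may ignore this rule
Disallow: /private/

# Correct: the rule applies to all crawlers
User-agent: *
Disallow: /private/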

How to test the robots.txt file?

Testing the robots.txt file can be done with web-based tools such as Google Search Console and Bing Webmaster Tools, which let you enter the URL you want to verify and see whether it is allowed or disallowed. If you have technical knowledge, you can also use Google’s open-source robots.txt library to test the file locally on your computer.
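
For quick local checks, here is a minimal sketch using Python’s standard-library urllib.robotparser instead of Google’s library; the domain and paths are illustrative, and this parser may not support Google-specific extensions such as wildcards:

from urllib import robotparser

# Fetch and parse the live robots.txt file
rp = robotparser.RobotFileParser()
rp.set_url("http://yourdomain.com/robots.txt")
rp.read()

# Check whether specific URLs may be fetched by a given crawler
print(rp.can_fetch("Googlebot", "http://yourdomain.com/admin"))
print(rp.can_fetch("Bingbot", "http://yourdomain.com/some-directory/important-page"))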

Meta Robots Tag

Robots.txt is not the sole method of communicating with web crawlers. Alternative methods include the meta robots tag and the X-Robots-Tag HTTP header, which let you specify crawling and indexing instructions for individual pages or file types.
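
As an illustration, a page-level meta robots tag placed in a page’s <head>, and the equivalent HTTP response header (useful for non-HTML files such as PDFs), might look like this:

<meta name="robots" content="noindex, nofollow">

X-Robots-Tag: noindex, nofollow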
