Robots.txt: A Beginners Guide
A simple file that contains components used to specify the pages on a website that must not be crawled (or in some cases must be crawled) by search engine bots. This file should be placed in the root directory of your site. The standard for this file was developed in 1994 and is known as the Robots Exclusion Standard or Robots Exclusion Protocol.
Some common misconceptions about robots.txt:
- It stops content from being indexed and shown in search results.
If you list a certain page or file under a robots.txt file but the URL to the page is found in external resources, search engine bots may still crawl and index this external URL and show the page in search results. Also, not all robots follow the instructions given in robots.txt files, so some bots may crawl and index pages mentioned under a robots.txt file anyway. If you want an extra indexing block, a robots Meta tag with a ‘noindex’ value in the content attribute will serve as such when used on these specific web pages, as shown below:
<meta name=“robots” content=“noindex”>
Read more about this here.
- It protects private content.
If you have private or confidential content on a site that you would like to block from the bots, please do not only depend on robots.txt. It is advisable to use password protection for such files, or not to publish them online at all.
- It guarantees no duplicate content indexing.
As robots.txt does not guarantee that a page will not be indexed, it is unsafe to use it to block duplicate content on your site. If you do use robots.txt to block duplicate content make sure you also adopt other foolproof methods, such as a rel=canonical tag.
- It guarantees the blocking of all robots.
Unlike Google bots, not all bots are legitimate and thus may not follow the robots.txt file instructions to block a particular file from being indexed. The only way to block these unwanted or malicious bots is by blocking their access to your web server through server configuration or with a network firewall, assuming the bot operates from a single IP address.
Uses for Robots.txt:
In some cases the use of robots.txt may seem ineffective, as pointed out in the above section. This file is there for a reason, however, and that is its importance for on-page SEO.
The following are some of the practical ways to use robots.txt:
- To discourage crawlers from visiting private folders.
- To keep the robots from crawling less noteworthy content on a website. This gives them more time to crawl the important content that is intended to be shown in search results.
- To allow only specific bots access to crawl your site. This saves bandwidth. Search bots request robots.txt files by default. If they do not find one they will report a 404 error, which you will find in the log files. To avoid this you must at least use a default robots.txt, i.e. a blank robots.txt file.
To provide bots with the location of your Sitemap. To do this, enter a directive in your robots.txt that includes the location of your Sitemap:
You can add this anywhere in the robots.txt file because the directive is independent of the user-agent line. All you have to do is specify the location of your Sitemap in the sitemap-location.xml part of the URL. If you have multiple Sitemaps you can also specify the location of your Sitemap index file. Learn more about sitemaps in our blog on XML Sitemaps.
Check out our Robots.txt SEO Guide for a detailed look at how robots.txt files work and how you can use them for SEO.
Examples of Robots.txt Files:
There are two major elements in a robots.txt file: User-agent and Disallow.
User-agent: The user-agent is most often represented with a wildcard (*) which is an asterisk sign that signifies that the blocking instructions are for all bots. If you want certain bots to be blocked or allowed on certain pages, you can specify the bot name under the user-agent directive.
Disallow: When disallow has nothing specified it means that the bots can crawl all the pages on a site. To block a certain page you must use only one URL prefix per disallow. You cannot include multiple folders or URL prefixes under the disallow element in robots.txt.
The following are some common uses of robots.txt files.
To allow all bots to access the whole site (the default robots.txt) the following is used:
To block the entire server from the bots, this robots.txt is used:
User-agent:* Disallow: /
To allow a single robot and disallow other robots:
User-agent: Googlebot Disallow: User-agent: * Disallow: /
To block the site from a single robot:
User-agent: XYZbot Disallow: /
To block some parts of the site:
User-agent: * Disallow: /tmp/ Disallow: /junk/
Use this robots.txt to block all content of a specific file type. In this example we are excluding all files that are Powerpoint files. (NOTE: The dollar ($) sign indicates the end of the line):
User-agent: * Disallow: *.ppt$
To block bots from a specific file:
User-agent: * Disallow: /directory/file.html
To crawl certain HTML documents in a directory that is blocked from bots you can use an Allow directive. Some major crawlers support the Allow directive in robots.txt. An example is shown below:
User-agent: * Disallow: /folder/ Allow: /folder1/myfile.html
To block URLs containing specific query strings that may result in duplicate content, the robots.txt below is used. In this case, any URL containing a question mark (?) is blocked:
User-agent: * Disallow: /*?
Sometimes a page will get indexed even if you include in the robots.txt file due to reasons such as being linked externally. In order to completely block that page from being shown in search results, you can include robots noindex Meta tags on those pages individually. You can also include a nofollow tag and instruct the bots not to follow the outbound links by inserting the following codes:
For the page not to be indexed:
<meta name=“robots” content=“noindex”>
For the page not to be indexed and links not to be followed:
<meta name=“robots” content=“noindex,nofollow”>
NOTE: If you add these pages to the robots.txt and also add the above Meta tag to the page, it will not be crawled but the pages may appear in the URL-only listings of search results, as the bots were blocked specifically from reading the Meta tags within the page.
Another important thing to note is that you must not include any URL that is blocked in your robots.txt file in your XML sitemap. This can happen, especially when you use separate tools to generate the robots.txt file and XML sitemap. In such cases, you might have to manually check to see if these blocked URLs are included in the sitemap. You can test this in your Google Webmaster Tools account if you have your site submitted and verified on the tool and have submitted your sitemap.
Go to Webmaster Tools > Optimization > Sitemaps and if the tool shows any crawl error on the sitemap(s) submitted, you can double check to see whether it is a page included in robots.txt.
Alternatively, there is a robots.txt testing tool within GWT. It is found under Webmaster Tools > Health > Blocked URLs.
This tool is a great way to learn how to use your robots.txt file. You can see how Googlebots will treat URLs after you enter the URL you want to test.
Lastly there are some important points to remember when it comes to robots.txt:
When you use a forward slash after a directory or a folder, it means that robots.txt will block the directory or folder and everything in it, as shown below:
- Verify your syntax with the Google Webmaster Tool or get it done by someone who is well versed in robots.txt, otherwise you risk blocking important content on your site.
If you have two user-agent sections, one for all the bots and one for a specific bot, let’s say Googlebots, then you must keep in mind that the Googlebot crawler will only follow the instructions within the user-agent for Googlebot and not for the general one with the wildcard (*). In this case, you may have to repeat the disallow statements included in the general user-agent section in the section specific to Googlebots as well. Take a look at the text below:
User-agent: * Disallow: /folder1/ Disallow: /folder2/ Disallow: /folder3/ User-agent: googlebot Crawl-delay: 2 Disallow: /folder1/ Disallow: /folder2/ Disallow: /folder3/ Disallow: /folder4/ Disallow: /folder5/