Robots.txt is an important file for modern websites. Developers place it in the site's root directory (the same place as the index file) because that is where search engine robots look for it: the file tells them which pages to crawl and which to skip.
Why Do You Need Robots.txt?
Google cannot crawl every page of every website. Current estimates put the number of active websites at around 1.7 billion, and each site can contain a practically unlimited number of pages. Google therefore prioritizes the pages it considers most important and crawls them more often.
In addition, an aggressive crawler can overwhelm a site as it visits it. With robots.txt, you give clear instructions about what should and should not be crawled, so Google and other search engines spend their limited crawl budget on the pages that matter.
Robots.txt serves several important purposes, including:
- Preventing duplicate content from being indexed – In some cases, a site legitimately needs several copies of the same content (for example, a printable version of a page). When this applies, you can tell Google which copy to crawl and which ones to ignore.
- Hiding pages – For instance, when you rework your site, you can prevent the indexing of unfinished pages.
You can learn more about the robots.txt specification and its rules in Google's official Search Central documentation.
Setting up the file is straightforward. It is built around two main directives (a minimal example follows the list below):
- User-agent – Names the crawler (or all crawlers) that the rules below it apply to.
- Disallow – Defines the paths you want to block from crawling.
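As a minimal sketch (the path "/private/" is only a placeholder), a robots.txt using just these two directives could look like this:

    User-agent: *
    Disallow: /private/

Here the asterisk addresses all crawlers, and the single Disallow line blocks everything under "/private/".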
Besides these two main directives, an "Allow" directive is available. Use it when you want to open up part of an otherwise blocked directory. For example, you might want to block most of a directory while still allowing one subdirectory inside it.
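A sketch of that setup, assuming placeholder names ("directory" and "subdirectory") that you would replace with your own paths:

    User-agent: *
    Disallow: /directory/
    Allow: /directory/subdirectory/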
These lines instruct crawlers to look only at files inside "subdirectory" while avoiding everything else inside "directory". You can also leave Disallow blank when you want crawlers to look at everything on the site.
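An empty Disallow line signals that nothing is off limits; a minimal sketch:

    User-agent: *
    Disallow: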
The content types most commonly blocked are (a combined sketch follows the list):
- Login pages.
- Duplicate content you still need, such as printable versions of pages and PDF documents.
- Thank you pages.
- New pages you create that are not fully developed yet.
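As an illustration (every path here is hypothetical; substitute your own URLs), a robots.txt covering these cases might look like this:

    User-agent: *
    Disallow: /login/
    Disallow: /print/
    Disallow: /pdfs/
    Disallow: /thank-you/
    Disallow: /drafts/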
Making Sure Search Engines Properly Interpret Robots.txt
At first glance, robots.txt looks very simple. In practice, there are several rules you have to follow. The most common ones are:
- Always name “robots.txt” exactly like this, with lower-case letters.
- Always place the file in the server's top-level (root) directory.
- List only one path per "Disallow" line; use a separate line for each URL path you block.
- Use a separate robots.txt file for each subdomain, even when they share the same root domain (see the illustration below).
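To illustrate the last rule (using example.com as a placeholder domain), each of these locations holds its own, independent file:

    https://example.com/robots.txt         applies to example.com only
    https://blog.example.com/robots.txt    applies to blog.example.com only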
After setting up robots.txt, test it. You can do so through Google Search Console (formerly Google Webmasters): go to "Crawl" and open the robots.txt testing tool. When Google reports that the pages you intend to allow are allowed, you wrote the file correctly.
Use robots.txt for better SEO results: it lets you control how search engine spiders crawl your site. Every single site needs such a file.