
How to use the robots.txt file?

How does a robots.txt file work?



A robots.txt file tells search engines which URLs they can crawl and, more importantly, which ones they cannot.

Search engines have two main jobs:


Crawl the web to discover content
Index that content so it can be served to users searching for information

As they crawl, search engine bots find and follow links. This process takes them from site A to site B to site C, across billions of links and websites.

When a bot arrives at a site, the first thing it does is look for a robots.txt file. If it finds one, it reads that file before doing anything else.
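
The file lives at the root of your domain, so a bot would look for it at an address like this (using a placeholder domain):

https://www.yourwebsite.com/robots.txt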

This lets you set rules for bots. The file's syntax is simple and straightforward: each rule names the user agent (search engine bot) it is addressed to, followed by the instructions that bot should follow.

You can also use an asterisk (*) to address all search engines at once. The rule then applies to every bot rather than to a specific one.
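
For example, this block uses the asterisk to tell every bot to stay out of a hypothetical /private/ directory:

User-agent: *
Disallow: /private/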

Note: A robots.txt file provides instructions, but it cannot enforce them. It is simply a recommended code of conduct. Well-behaved bots (such as search engine crawlers) will follow the rules, but malicious bots (such as spam bots) will ignore them completely.

The syntax of the robots.txt file



A robots.txt file consists of:

One or more blocks of directives;
Each block specifies a user agent (a search engine bot);
Each block contains an Allow or Disallow instruction for the bot in question.

A block generally looks like this:

User-agent: Googlebot
Disallow: /not-for-google
User-agent: BingBot
Disallow: /not-for-Bing
Sitemap: https://www.yourwebsite.com/sitemap.xml


User-Agent Directive



The first line of each instruction block is the user-agent line, which identifies the crawler the block is addressed to.

For example, if you want to tell Googlebot not to crawl your WordPress admin page, your block should start with:

User-agent: Googlebot
Disallow: /wp-admin/


Keep in mind that most search engines have several bots: they use different crawlers for their main index, images, videos, and so on.

Search engines always obey the most specific block of directives they can find.

Let’s take an example where you have three sets of directives: one for *, one for Googlebot, and one for Googlebot-Image.
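
Such a file might look like this (the paths here are hypothetical):

User-agent: *
Disallow: /archive/

User-agent: Googlebot
Disallow: /not-for-google/

User-agent: Googlebot-Image
Disallow: /images/not-for-google/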

When the Googlebot-News user agent crawls your site, it will follow the Googlebot directives, since there is no block addressed to it specifically.

On the other hand, the Googlebot-Image user agent will follow the more specific directives assigned to it.

The Disallow Directive



The second line of an instruction block is typically a Disallow line, which forbids crawling of the paths it specifies.

You can have several Disallow directives at once. Each one specifies a part of your site that the bot cannot access.
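
For example, a single block can stack several Disallow lines (the paths here are hypothetical):

User-agent: *
Disallow: /wp-admin/
Disallow: /checkout/
Disallow: /tmp/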

An empty Disallow line means you aren't disallowing anything, so a bot can access every section of your site.

For example, if you wanted to allow all search engines to crawl your entire site, your block would look like this:

User-agent: *
Allow: /

On the other hand, if you wanted to prevent all search engines from crawling your site, your block would look like this:

User-agent: *
Disallow: /


Directive names such as Allow and Disallow are not case-sensitive, but the path values they contain are: for bots, the /photo/ directory is not the same as /Photo/.

You will still often see the directive names capitalized, as this makes the file easier for people to read.
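
For instance, with a hypothetical /photo/ path, the rule below blocks /photo/ but leaves /Photo/ crawlable:

User-agent: *
Disallow: /photo/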

Allow Directive



The Allow directive lets search engines crawl a specific directory or page, even inside a directory that is otherwise disallowed.

For example, if you wanted to prevent Googlebot from accessing every post on your blog except one, your directive might look like this:

User-agent: Googlebot
Disallow: /blog
Allow: /blog/sample-article


Note: Not all search engines recognize this directive, but Google and Bing do follow it.

The Sitemap Directive



The Sitemap directive tells search engines where to find your XML sitemap. A sitemap usually lists the pages you want search engines to crawl and index.

You can place this directive at the top or bottom of a robots.txt file. You can (and should) still submit your sitemap to each search engine through its webmaster tools. The directive is a simple line, similar to this:

Sitemap: https://www.yourwebsite.com/sitemap.xml


Search engines will crawl your site on their own eventually, but submitting a sitemap speeds up the crawling process.

If you don't want to submit your sitemap to every search engine individually, adding a Sitemap directive to your robots.txt file is a good alternative.

Crawl-Delay Directive



The Crawl-delay directive specifies a crawl delay in seconds. It is intended to keep bots from overloading a server and slowing a website down.

Unfortunately, Google no longer honors this directive in the robots.txt file. If you want to adjust Googlebot's crawl rate, you need to do it from Google Search Console. Bing and Yandex, for their part, do obey the Crawl-delay directive.

Here's how the directive works: if you want all bots to wait 15 seconds between requests, set the delay to 15, like this:

User-agent: *
Crawl-delay: 15


Noindex Directive



The robots.txt file tells bots what they can and cannot crawl. However, it cannot tell a search engine which URLs not to index and display in search results.

A blocked page can still appear in search results, but no information about its content will be shown with the URL.

Google never officially supported a Noindex directive in robots.txt, but SEO professionals long believed that it was honored anyway.

However, in September 2019, Google removed any doubt and made it clear that this directive is not supported.

If you want to reliably exclude a page or file from search results, use a noindex meta tag instead.
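
For example, a robots noindex meta tag placed in the page's <head> section looks like this:

<meta name="robots" content="noindex">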

Updated on: 27/04/2023
