Robots.txt and Web crawlers

Web crawlers are programs that navigate and parse the contents of websites. Companies that run web-search engines need to crawl the web to obtain data to build their search index; this is what search engines like Google, Bing, or DuckDuckGo continuously do.

Even if you do not run a web-search engine but simply want to crawl a site, you had better follow its rules, or you risk getting banned from it.

The Robots exclusion standard (or protocol), **robots.txt** for short, defines how websites communicate to web crawlers (also called spiders or bots) which resources may be crawled and which may not.

The main reason for setting such rules is that blindly crawling entire websites has undesirable consequences: it increases the load on the servers and on the network, in some cases to the point that it hampers the experience of legitimate users. Bots therefore have a good reason to follow robots.txt and avoid getting banned.


Websites, on the other hand, may want to use robots.txt to communicate how they wish to be crawled and to state which resources are crawlable and which are not.

Some resources on the site might be private or irrelevant to crawlers, and a site may want to exclude them from crawling. This can also be done with robots.txt.

However, one must be aware that such rules are merely a non-enforceable suggestion. While a well-behaved crawler will honor robots.txt, a misbehaving one may ignore it or do exactly the opposite.

The syntax of robots.txt

The file robots.txt is a plain-text file hosted at the web root of a website. For instance, the robots file for Bunny.net is hosted at a predictable location: https://bunny.net/robots.txt.

At the time of writing, the contents of that file are the following.

User-agent: *
Allow: /
Sitemap: https://bunny.net/sitemap.xml
Host: https://bunny.net
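
To see what these rules mean in practice, a well-behaved crawler could consult them before fetching anything. The following is a minimal sketch using Python's standard urllib.robotparser module; the user-agent name MyCrawler is a made-up placeholder.

import urllib.robotparser

# Download and parse the site's robots.txt.
parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://bunny.net/robots.txt")
parser.read()

# Ask whether our (hypothetical) bot may fetch a given URL.
# With the rules above, every path is allowed, so this prints True.
print(parser.can_fetch("MyCrawler", "https://bunny.net/stream/"))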

Every subdomain should have its own robots.txt. For instance, https://w3.org has one for the root domain at https://www.w3.org/robots.txt and another for the lists subdomain at https://lists.w3.org/robots.txt.

Next, we cover the directives of the robots exclusion standard.

User-Agent directive

The User-Agent directive is used to specify instructions for specific bots.

Generally, the term user-agent denotes a piece of software that acts on behalf of a user. In this case, the User-Agent is the name of the crawler, which typically also identifies its owner. For instance, the following bot names belong to the best-known search engines:

  • Googlebot
  • Bingbot
  • DuckDuckBot

If you want to address all bots, set the User-Agent to the wildcard value denoted by an asterisk: User-Agent: *.
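
For example, a hypothetical site could give one bot its own rules while addressing all other bots with the wildcard; the paths below are made up, and the Disallow directive used here is covered next.

User-Agent: Googlebot
Disallow: /not-for-google/

User-Agent: *
Disallow: /private/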

Disallow directive

The Disallow directive specifies which resources are not to be crawled. It can be used in several ways; a small combined example follows the list.

  1. To disallow crawling a particular resource.

    Disallow: /a-particular-resource.html
    
  2. To disallow crawling a whole directory, including its subdirectories.

    Disallow: /directory-name/
    
  3. To disallow crawling entirely.

    Disallow: /
    
  4. To allow access to the entire site, set the directive to an empty string.

    Disallow:
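
These forms can also be combined: a single User-Agent group may contain several Disallow lines, one rule per line. A hypothetical example with made-up paths:

User-Agent: *
Disallow: /admin/
Disallow: /drafts/
Disallow: /old-report.html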
    

Comments

The last official Robots exclusion protocol directive is the single-line comment. A comment starts with the pound sign #. For instance:

User-Agent: DuckDuckBot # when the crawler is from DuckDuckGo search engine
Disallow:               # allow access to the entire site

User-Agent: ABotNonGrata # tell a search bot from an engine we do not like
Disallow: /              # it is not allowed to crawl

Needless to say, nothing prevents ABotNonGrata from actually crawling the site.

Additional unofficial directives

The User-Agent and Disallow directives, together with comments, are the only constructs of the official Robots exclusion standard.

However, there are a few other directives that, while not officially recognized, are still acknowledged by most bots.

  1. Allow directive

    The Allow directive specifies a resource that is allowed to be crawled. It is commonly used together with a Disallow directive: the Disallow rule disallows a larger set of resources, while the Allow directive exempts specific resources from that rule.

    In the following, crawling /folder/allowed_resource.html is allowed, but crawling anything else from /folder/ is not.

    Allow: /folder/allowed_resource.html
    Disallow: /folder/
    
  2. Crawl-delay directive

    Directive Crawl-delay: value is used to rate-limit the crawler.

    The interpretation of value varies between bots. Some (e.g. Yandex) regard it as the number of milliseconds the crawler should wait between sending subsequent requests; others regard it as the number of seconds. Some, like GoogleBot, do not recognize the directive at all.

  3. Host directive

    Some crawlers support the Host: domain directive that allows websites with multiple mirrors to list their preferred domain for crawling.

  4. Sitemap directive

    Directive Sitemap: URL specifies a URL to a website's sitemap in XML.

    The sitemap contains all resources that are available for crawling, together with their metadata. Here is the robots.txt from Bunny.net again.

    User-agent: *
    Allow: /
    Sitemap: https://bunny.net/sitemap.xml
    Host: https://bunny.net
    

    On https://bunny.net/sitemap.xml we find a set of URLs, each with some additional metadata, such as the modification date and priority. Here we list only a small excerpt.

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
                            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    
      <url>
        <loc>https://bunny.net/</loc>
        <lastmod>2022-08-02T20:28:29+00:00</lastmod>
        <priority>1.00</priority>
      </url>
    
      <url>
        <loc>https://bunny.net/stream/</loc>
        <lastmod>2022-08-02T20:29:38+00:00</lastmod>
        <priority>0.90</priority>
      </url>
    </urlset>
    

    A well-behaved crawler can simply take these URLs and process them directly, without having to parse HTML pages and look for links. Consequently, such crawling inflicts a minimal load on the website. A small sketch of such a crawl follows below.
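
The following is a rough sketch, using only Python's standard library, of what such a sitemap-driven crawl might look like. It assumes Python 3.8+ (for site_maps()), treats Crawl-delay as seconds, and uses the made-up user-agent name MyCrawler.

import time
import urllib.request
import urllib.robotparser
import xml.etree.ElementTree as ET

USER_AGENT = "MyCrawler"  # placeholder bot name for this sketch

# Read and parse robots.txt first.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://bunny.net/robots.txt")
rp.read()

# Honor Crawl-delay if present (assuming the value is in seconds).
delay = rp.crawl_delay(USER_AGENT) or 0

# Sitemap URLs listed in robots.txt (requires Python 3.8+).
sitemaps = rp.site_maps() or []

# Namespace used by the sitemap XML shown above.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

for sitemap_url in sitemaps:
    with urllib.request.urlopen(sitemap_url) as response:
        root = ET.fromstring(response.read())

    # Visit every <loc> entry, but only if robots.txt allows it.
    for loc in root.findall("sm:url/sm:loc", NS):
        page_url = loc.text.strip()
        if rp.can_fetch(USER_AGENT, page_url):
            print("Would crawl:", page_url)
            time.sleep(delay)  # rate-limit between requests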

Conclusion

The Robots exclusion standard, or robots.txt, is a set of rules that should be followed when crawling a website with a bot or a spider.

While a website has no guarantee that a crawler will honor such rules, a robot that crawls the site in accordance with the robots.txt will inflict a tolerable load and is unlikely to get banned.

Glossary

HTTP

Hypertext Transfer Protocol. A protocol that connects web browsers to web servers when they request content.

Crawler

An automated application used to scrape (i.e. take) content from other sources.