Learn about robots.txt and web crawlers.
Web crawlers are programs that navigate and parse the contents of websites. Companies that run web-search engines need to crawl the web to obtain data to build their search index; this is what search engines like Google, Bing, or DuckDuckGo continuously do.
Even if you do not run a web-search engine but would like to crawl a site, you had better follow its rules, or you risk getting banned from it.
The Robots exclusion standard (or protocol), **robots.txt** for short, defines how websites communicate to web crawlers (also called spiders or bots) which resources may be crawled and which may not.
The main reason for setting such rules is that blindly crawling entire websites has undesirable consequences: it increases the load on the servers and on the network, in some cases to the point that it hampers the experience of legitimate users. Bots may therefore want to follow robots.txt to avoid getting banned. Websites, on the other hand, may want to use robots.txt to communicate how they wish to be crawled and to tell which resources are crawlable and which are not.
Some resources on the site might be private or irrelevant to crawlers, and a site may want to exclude them from crawling. This, too, can be done with robots.txt.
However, one must be aware that such rules are merely a non-enforceable suggestion. While a well-behaved crawler will honor robots.txt, a badly behaved one may ignore it or do exactly the opposite.
robots.txt
The file robots.txt is a plain-text file hosted at the webroot of a website. For instance, the robots file for Bunny.net is hosted at a predictable location: https://bunny.net/robots.txt. At the time of writing, its contents are the following.
User-agent: *
Allow: /
Sitemap: https://bunny.net/sitemap.xml
Host: https://bunny.net
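To get a feel for how such a file is consumed, here is a minimal sketch that fetches and queries it with Python's standard-library robots.txt parser; the crawler name MyCrawler is made up for the example.

from urllib import robotparser

# Fetch and parse https://bunny.net/robots.txt over the network.
rp = robotparser.RobotFileParser()
rp.set_url("https://bunny.net/robots.txt")
rp.read()

# With the rules shown above (Allow: / for every user agent), any bot
# may fetch any path.
print(rp.can_fetch("MyCrawler", "https://bunny.net/stream/"))  # True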
Every subdomain should have its own robots.txt. For instance, https://w3.org has one for the root domain, https://www.w3.org/robots.txt, and another for the lists subdomain, https://lists.w3.org/robots.txt.
Next, we cover the directives of the robots exclusion standard.
User-Agent directive
The User-Agent directive is used to specify instructions for specific bots.
Generally, the term user-agent denotes a piece of software that acts on behalf of a user. In this case, the User-Agent gives the name of the crawler, which typically also identifies its owner. For instance, the following bot names belong to the best-known search engines:
Googlebot
Bingbot
DuckDuckBot
If you want to address all bots, set the User-Agent to the wildcard value, denoted by an asterisk: User-Agent: *.
Disallow directive
The Disallow directive specifies which resources are not to be crawled. It can be used in several ways, as the examples and the sketch below show.
To disallow crawling a particular resource.
Disallow: /a-particular-resource.html
To disallow crawling a whole directory, including its subdirectories.
Disallow: /directory-name/
To disallow crawling entirely.
Disallow: /
To allow access to the entire site, set the directive to an empty string.
Disallow:
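Here is a rough sketch of how a parser applies such rules, again using Python's standard-library urllib.robotparser; the example.com URLs and the bot name MyBot are placeholders.

from urllib import robotparser

# Hypothetical rules: block a directory and a single resource.
rules = """
User-Agent: *
Disallow: /private/
Disallow: /secret.html
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)  # parse rules from memory instead of fetching them

print(rp.can_fetch("MyBot", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("MyBot", "https://example.com/secret.html"))        # False
print(rp.can_fetch("MyBot", "https://example.com/index.html"))         # True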
The last element of the official Robots exclusion protocol is the single-line comment, which starts with the pound sign #. For instance:
User-Agent: DuckDuckBot # when the crawler is from DuckDuckGo search engine
Disallow: # allow access to the entire site
User-Agent: ABotNonGrata # tell a search bot from an engine we do not like
Disallow: / # it is not allowed to crawl
Needless to say, nothing prevents ABotNonGrata from actually crawling the site.
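As a sketch of how per-bot rules resolve, the same standard-library parser can be fed the rules above (ABotNonGrata is, of course, a made-up name):

from urllib import robotparser

rules = """
User-Agent: DuckDuckBot
Disallow:

User-Agent: ABotNonGrata
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Each bot is matched against the group that names it.
print(rp.can_fetch("DuckDuckBot", "https://example.com/page.html"))   # True
print(rp.can_fetch("ABotNonGrata", "https://example.com/page.html"))  # False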
The User-Agent and Disallow directives, together with comments, are all that the official Robots exclusion standard defines. However, there are a few other directives that, while not officially recognized, are still acknowledged by most bots.
Allow directive
The Allow directive specifies a resource that is allowed to be crawled. It is commonly used together with a Disallow directive: the Disallow rule excludes a larger set of resources from crawling, while the Allow rule exempts specific resources from that exclusion.
In the following, crawling /folder/allowed_resource.html is allowed, but crawling anything else from /folder/ is not.
Allow: /folder/allowed_resource.html
Disallow: /folder/
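Python's urllib.robotparser understands Allow lines as well, so the exemption can be checked directly. Note that this parser applies rules in the order they appear, so the Allow line is listed first, as in the excerpt above; example.com and MyBot are again placeholders.

from urllib import robotparser

rules = """
User-Agent: *
Allow: /folder/allowed_resource.html
Disallow: /folder/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# The exempted resource is crawlable, the rest of the folder is not.
print(rp.can_fetch("MyBot", "https://example.com/folder/allowed_resource.html"))  # True
print(rp.can_fetch("MyBot", "https://example.com/folder/other.html"))             # False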
Crawl-delay directive
The Crawl-delay: value directive is used to rate-limit the crawler. The interpretation of value varies between bots: some regard it as the number of milliseconds the crawler should wait between subsequent requests, others as the number of seconds, and some, like Googlebot, do not recognize the directive at all.
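The standard-library parser exposes this value (Python 3.6+) but leaves its interpretation to the crawler; a small sketch:

from urllib import robotparser

rules = """
User-Agent: *
Crawl-delay: 10
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Returns 10 here; returns None when no Crawl-delay is given.
print(rp.crawl_delay("MyBot"))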
Host directive
Some crawlers support the Host: domain directive, which allows websites with multiple mirrors to list their preferred domain for crawling.
Sitemap directive
The Sitemap: URL directive specifies the URL of a website's XML sitemap. The sitemap lists the resources that are available for crawling, together with their metadata. Here is the robots.txt from Bunny.net again.
User-agent: *
Allow: /
Sitemap: https://bunny.net/sitemap.xml
Host: https://bunny.net
On https://bunny.net/sitemap.xml we find a set of URLs, each with some additional metadata, such as the modification date and priority. Here we list only a small excerpt.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
                            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>https://bunny.net/</loc>
    <lastmod>2022-08-02T20:28:29+00:00</lastmod>
    <priority>1.00</priority>
  </url>
  <url>
    <loc>https://bunny.net/stream/</loc>
    <lastmod>2022-08-02T20:29:38+00:00</lastmod>
    <priority>0.90</priority>
  </url>
</urlset>
A well-behaved crawler can simply take these URLs and process them directly, without having to parse HTML pages and look for links. Consequently, such crawling inflicts the minimum amount of load on the website.
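For illustration, here is a sketch of pulling the URLs and their metadata out of such a sitemap with Python's standard library, using a shortened copy of the excerpt above. (RobotFileParser.site_maps(), available since Python 3.8, can list the Sitemap URLs declared in a robots.txt.)

import xml.etree.ElementTree as ET

sitemap_xml = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://bunny.net/</loc>
    <lastmod>2022-08-02T20:28:29+00:00</lastmod>
    <priority>1.00</priority>
  </url>
  <url>
    <loc>https://bunny.net/stream/</loc>
    <lastmod>2022-08-02T20:29:38+00:00</lastmod>
    <priority>0.90</priority>
  </url>
</urlset>
"""

# Sitemap elements live in the sitemaps.org namespace.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml.strip())

for url in root.findall("sm:url", ns):
    print(url.findtext("sm:loc", namespaces=ns),
          url.findtext("sm:lastmod", namespaces=ns))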
The Robots exclusion standard, or robots.txt, is a set of rules that should be followed when crawling a website with a bot or a spider. While a website has no guarantee that a crawler will honor such rules, a robot that crawls the site in accordance with robots.txt will inflict a tolerable load and is unlikely to get banned.