Managing Web Crawlers With robots.txt
Many years ago, getting your site picked up by search engines involved filling out forms, registering the site, and manually entering the keywords you wanted it to be found for. That process changed with the advent of search engine web crawlers, or spiders.
What are web crawlers?
Web crawlers are automated programs that crawl the internet, following links from one web page to another and indexing the content they find in their databases. This means that as long as some website the search engine already knows about links to yours, it will find you over time. The more sites that link to yours, the faster this will happen.
Unfortunately, these crawlers can be very intensive visitors: they load every page and file on your site in order to catalog it for their databases. That can put a high load on your VPS and possibly cause issues for your visitors. To help with these load issues, there is a standardized way to control crawler behavior: a file called robots.txt placed in the root directory of your website. Nothing forces adherence to this file, however, so while most search engine crawlers will obey it, some crawlers may not.
robots.txt format
The robots.txt file has a specific format. See an example below:
User-agent: googlebot
Disallow: /images
Allow: /images/metadata
Crawl-delay: 2
Sitemap: https://www.example.com/sitemap.xml
Let’s look at each of the directive lines in order:
- We start with the User-agent line: a robot or web browser identifies itself with a user-agent, and the various search engine crawlers each have their own. Any directives that follow a User-agent line apply only to the given user-agent, and a user-agent of an asterisk (*) refers to all user-agents. In our example file, the directives relate to the googlebot crawler.
- The Disallow directive tells a crawler which directories or files you don't want it to load. Note that although the crawler won't load a disallowed file, the URL can still appear in search results if other pages link to it, so Disallow cannot be used to keep pages out of search results. Disallow is probably the only directive that every crawler supports. In our example, we disallow crawling of the /images directory.
- The Allow directive can be used to specify files or directories in a disallowed directory that the crawler can load. While not all crawlers support this, most do. In our example, we allow the crawler to load the files in the /images/metadata directory.
- The next directive is Crawl-delay, which gives the number of seconds a crawler should wait before loading the next page. This is the most direct way to slow a crawler down, but don't set the number too high unless your site has very few pages, because it caps how many pages the crawler can load each day. A Crawl-delay of 30 seconds, for example, allows at most 2,880 page loads per day.
- Finally, the Sitemap directive points a crawler to your website's XML sitemap file, which it can use to aid its indexing of the site; the value should be the full URL of the sitemap. A fuller annotated example combining these directives is shown below.
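To see how these directives fit together, here is a slightly fuller sketch of a robots.txt file. The directory names and the general section are only illustrative placeholders; robots.txt also allows comment lines starting with #, which are used here to annotate each part.
# Rules for any crawler not named in a more specific section below
User-agent: *
Disallow: /admin
Disallow: /tmp
# Rules for googlebot only; a crawler follows the most specific
# section that matches it and ignores the others
User-agent: googlebot
Disallow: /images
Allow: /images/metadata
Crawl-delay: 2
# The sitemap location is not tied to a user-agent and applies site-wide
Sitemap: https://www.example.com/sitemap.xml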
Take charge of web crawlers
You can fill in sections for as many or as few user-agents in your robots.txt as you like to control the way crawlers visit your site. It makes sense to start with a single user-agent section for all crawlers and then add separate sections for specific crawlers as you find them causing issues for your site. Once you have created your robots.txt, it is well worth testing that it is valid: a typo or syntax mistake may result in a crawler ignoring the rules you set for it. Fortunately, there are a number of tools for testing robots.txt files, and the main search engines, such as Google, provide their own testing tools.
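As a starting point, a robots.txt along the lines of the sketch below covers every crawler with one general section and then singles out one misbehaving crawler; the bot name and paths here are hypothetical and would need to be replaced with the user-agent you actually see causing load in your logs.
# Default rules for every crawler
User-agent: *
Disallow: /admin
Crawl-delay: 5
# A specific crawler that has been hammering the site
User-agent: ExampleAggressiveBot
Disallow: /
Blocking the problem crawler outright with Disallow: / is the bluntest option; if you still want its traffic, giving it its own section with a higher Crawl-delay is a gentler alternative.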