How to crawl a site with a robots.txt crawl-delay greater than one second

When you set up a new crawl for your website, you define a number of parameters during the configuration phase to make sure the crawl runs smoothly.

An important (yet hidden) parameter is the crawl-delay directive of the robots.txt file.

Oncrawl can be used to crawl any website, so our bot follows the crawl-delay directive in the robots.txt file whenever the target website declares one. Otherwise, we limit the crawl rate to one page per second so as not to overload the site or the web server.
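To check which delay applies to a site before launching a crawl, you can read the directive yourself. The following is a minimal sketch using Python's standard urllib.robotparser module, not Oncrawl's own implementation. The user agent name "ExampleBot" and the function name effective_delay are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Fallback described in the article: one page per second when no directive is found.
DEFAULT_DELAY_SECONDS = 1.0


def effective_delay(robots_txt: str, user_agent: str = "ExampleBot") -> float:
    """Return the crawl delay (in seconds) that applies to user_agent.

    "ExampleBot" is a placeholder user agent, not the real Oncrawl bot name.
    """
    parser = RobotFileParser()
    # parse() accepts the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    return float(delay) if delay is not None else DEFAULT_DELAY_SECONDS
```

Note that the standard-library parser only recognizes whole-number Crawl-delay values, so a site that declares a fractional delay would fall back to the default in this sketch.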

When a website has a crawl-delay directive higher than one second, you will receive a warning informing you that the crawl will be slower than the requested speed.

If the crawl-delay directive is higher than 30 seconds, you will receive an error message. The Oncrawl platform does not allow crawls to be configured with such a high crawl delay.
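The two thresholds above can be summarized in a few lines of code. This is only an illustrative sketch of the rules as described in this article. The function name classify_crawl_delay and the returned labels are hypothetical and do not correspond to an Oncrawl API.

```python
# Thresholds taken from the article, expressed in seconds.
WARNING_THRESHOLD_SECONDS = 1
ERROR_THRESHOLD_SECONDS = 30


def classify_crawl_delay(delay_seconds: float) -> str:
    """Map a crawl-delay value to the outcome described in the article."""
    if delay_seconds > ERROR_THRESHOLD_SECONDS:
        return "error"    # crawl cannot be configured as-is
    if delay_seconds > WARNING_THRESHOLD_SECONDS:
        return "warning"  # crawl runs, but slower than the requested speed
    return "ok"
```

For example, a delay of 10 seconds would produce a warning, while a delay of 60 seconds would block the configuration until a virtual robots.txt file is used.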

The only way to set up a crawl under those circumstances is to use a virtual robots.txt file.

To do so, you must first validate the project with your Google Analytics account to confirm your ownership of the domain you wish to crawl.

Once completed, you can activate the virtual robots.txt feature within the crawl configuration settings.

Simply remove the crawl-delay directive from the virtual robots.txt file and you'll be ready.
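If the original robots.txt file is long, you may prefer to strip the directive programmatically before pasting the result into the virtual robots.txt field. The sketch below assumes you already have the file content as a string. The helper name strip_crawl_delay is hypothetical, and the example content is made up.

```python
import re

# Matches a Crawl-delay line regardless of case or leading whitespace.
CRAWL_DELAY_LINE = re.compile(r"^\s*crawl-delay\s*:", re.IGNORECASE)


def strip_crawl_delay(robots_txt: str) -> str:
    """Return robots_txt with every Crawl-delay line removed."""
    kept = [
        line
        for line in robots_txt.splitlines()
        if not CRAWL_DELAY_LINE.match(line)
    ]
    return "\n".join(kept) + "\n"


# Made-up robots.txt content for illustration.
original = "User-agent: *\nCrawl-delay: 60\nDisallow: /admin\n"
virtual = strip_crawl_delay(original)
```

All other rules, such as Disallow lines, are left untouched, so the virtual file keeps the same crawl scope as the original one.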

At that point, you can also configure a crawl speed of up to 10 URLs per second, so you'll get the crawl report sooner. This is especially helpful if the target website has a lot of pages.
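To get a feel for the difference, here is a rough back-of-the-envelope estimate. It assumes a hypothetical site of 100,000 pages and ignores server response time and any other overhead, so real crawl durations will differ.

```python
from datetime import timedelta


def estimated_crawl_time(page_count: int, seconds_per_url: float) -> timedelta:
    """Estimate crawl duration from the page count and the time spent per URL."""
    if seconds_per_url <= 0:
        raise ValueError("seconds_per_url must be positive")
    return timedelta(seconds=page_count * seconds_per_url)


pages = 100_000  # hypothetical site size

# With a crawl-delay of 30 seconds between pages.
slow = estimated_crawl_time(pages, seconds_per_url=30)

# At 10 URLs per second, i.e. 0.1 second per URL.
fast = estimated_crawl_time(pages, seconds_per_url=0.1)
```

Under these assumptions, the crawl would take roughly 34.7 days with the 30-second delay, compared with about 2.8 hours at 10 URLs per second.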
