How to crawl a site with a robots.txt crawl delay greater than 1 second
Our bot honors the Crawl-Delay directive in a site's robots.txt file. If you have validated the project, you can override it. Here's how.
Written by Tanguy
Updated over a week ago

When you start a crawl, OnCrawl runs a series of checks to make sure the crawl will run smoothly.

An important (yet hidden) parameter is the Crawl-Delay directive of the robots.txt file.

Because the OnCrawl app can be used to crawl any website, our bot follows the Crawl-Delay directive in the target website's robots.txt file, if there is one.
Otherwise, we limit the crawl rate to 1 page per second so that our bot is not too aggressive toward the target website.
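
As an illustration of the rule above, here is a minimal sketch using Python's standard urllib.robotparser module. It is not OnCrawl's actual implementation: the function name, the "ExampleBot" user agent, and the sample robots.txt contents are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# From the article: 1 page per second when robots.txt sets no Crawl-Delay.
DEFAULT_DELAY_SECONDS = 1.0


def effective_crawl_delay(robots_txt: str, user_agent: str = "ExampleBot") -> float:
    """Return the delay (in seconds) a polite bot would wait between pages.

    "ExampleBot" is a hypothetical user agent used only for illustration.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # crawl_delay() returns None when no Crawl-delay applies to this agent.
    # Note: the standard-library parser only recognizes whole-number values.
    declared = parser.crawl_delay(user_agent)
    return float(declared) if declared is not None else DEFAULT_DELAY_SECONDS


ROBOTS_WITH_DELAY = """User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

ROBOTS_WITHOUT_DELAY = """User-agent: *
Disallow: /private/
"""

print(effective_crawl_delay(ROBOTS_WITH_DELAY))     # directive is followed
print(effective_crawl_delay(ROBOTS_WITHOUT_DELAY))  # falls back to the default
```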

When a website has a Crawl-Delay directive higher than 1, our application displays a warning to tell you that the crawl will be slower than the requested speed.
If the Crawl-Delay is higher than 30, we display an error: you cannot configure a crawl with such a high crawl delay.
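
The two thresholds can be summarized in a few lines of Python. This is only a sketch of the rule as described here; the function name and the returned labels are hypothetical and do not reflect the application's internal code.

```python
def check_crawl_delay(delay_seconds: float) -> str:
    """Classify a Crawl-Delay value using the thresholds from this article."""
    if delay_seconds > 30:
        return "error"    # crawl cannot be configured
    if delay_seconds > 1:
        return "warning"  # crawl will be slower than the requested speed
    return "ok"


for delay in (1, 5, 30, 31):
    print(delay, check_crawl_delay(delay))
```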

The only way to set up a crawl under those circumstances is to use a virtual robots.txt file.

To do so, you must first validate the project with your Google Analytics account, so we can make sure that you have some form of ownership over the domain you want to crawl.

Once that is done, you can activate the virtual robots.txt feature within the crawl configuration.

Simply remove the Crawl-Delay directive from the virtual robots.txt file and you'll be ready to crawl.
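
If the robots.txt file is long, you may prefer to prepare the edited content with a small script before pasting it into the virtual robots.txt field. The following Python sketch drops every Crawl-Delay line and leaves the other rules untouched; the function name and the sample content are assumptions for illustration only.

```python
import re

# Matches a Crawl-delay line regardless of letter case or leading spaces.
CRAWL_DELAY_LINE = re.compile(r"^\s*crawl-delay\s*:", re.IGNORECASE)


def strip_crawl_delay(robots_txt: str) -> str:
    """Return the robots.txt content without any Crawl-delay directive."""
    kept = [
        line
        for line in robots_txt.splitlines()
        if not CRAWL_DELAY_LINE.match(line)
    ]
    return "\n".join(kept) + "\n"


ORIGINAL = """User-agent: *
Crawl-delay: 60
Disallow: /private/
"""

print(strip_crawl_delay(ORIGINAL))
```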

At that point, you can also configure a crawl speed of up to 10 URLs per second, so you'll get the crawl report even sooner. This is especially useful if the target website is large.
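
To get a feel for the difference, here is a rough back-of-the-envelope estimate in Python. It assumes a hypothetical site of 100,000 URLs and a perfectly steady crawl rate, and it ignores server response times, so treat the figures as orders of magnitude rather than predictions.

```python
def estimated_crawl_seconds(url_count: int, urls_per_second: float) -> float:
    """Estimate crawl duration assuming a constant crawl rate."""
    if urls_per_second <= 0:
        raise ValueError("urls_per_second must be positive")
    return url_count / urls_per_second


SITE_SIZE = 100_000  # hypothetical number of URLs

# A Crawl-Delay of 30 seconds means one URL every 30 seconds.
slow = estimated_crawl_seconds(SITE_SIZE, 1 / 30)
# Maximum speed mentioned in this article: 10 URLs per second.
fast = estimated_crawl_seconds(SITE_SIZE, 10)

print(f"Crawl-Delay of 30 s: about {slow / 86_400:.1f} days")
print(f"10 URLs per second: about {fast / 3_600:.1f} hours")
```

Under these assumptions, the same crawl goes from roughly five weeks to under three hours.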

Leave us a message if you need further guidance!
