Crawl part of a website

Use a virtual robots.txt in order to limit the crawl to a subset of pages on a website

You might want to crawl only part of your website. You can limit the crawl to certain subdomains by listing them as Start URLs and making sure that the Subdomains option is turned off.

But what if the part of the website you want to crawl is not a subdomain?

One way to do this is to use a virtual robots.txt file, which will only apply to the Oncrawl bot during the crawl, and will not affect the robots.txt file on your website.

Before you start: make sure your project is validated

In order to override a robots.txt file, you will need to validate the site you want to crawl. This lets us know you have permission to ignore the way the site is configured.

If you're not sure whether you've already validated your project or not, you can follow steps one and two below. If your project is already validated, we'll let you know on the project validation page.

  1. From your Projects home page (or any other page in the project), in the upper right-hand corner, click on the three dots to open the project menu.

  2. Select Verify ownership.

  3. Follow the steps to provide the information we need to validate your project.

  4. Click on Set up a new crawl to go directly to the crawl settings page.

Set up a new crawl with a virtual robots.txt

Configure your crawl as you need.

To limit the crawl to only URLs under the /blog/ part of our site, we'll now configure a virtual robots.txt file:

  1. At the top of the page, make sure the Extra settings are on.

  2. Scroll down to the Extra settings section and click on Virtual robots.txt.

  3. Tick Enable virtual robots.txt and click the + to create a new virtual robots.txt.

By default, Oncrawl fills the input field with the contents of the original robots.txt file, preceded by commented lines (starting with #) that can be used to give the Oncrawl bot access to the website.

We can edit this part to tell the Oncrawl bot to follow only certain URLs on the website. For example, to follow only links whose path starts with /blog/:

User-Agent: Oncrawl
Allow: /blog/
Disallow: /

Remember to remove the # at the start of each of these lines. The # marks a line as a comment, and commented lines are ignored.

You can now save the configuration. At this point, Oncrawl checks that its bot will be able to crawl the website with the given settings.

For example, if the Start URL is disallowed by the virtual robots.txt file, you will not be able to save the crawl profile or to launch a crawl using it.

Make sure the Start URL you have chosen is not disallowed in your virtual robots.txt file!
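If you would like to test a Start URL against a draft virtual robots.txt before pasting it into the crawl settings, a rough local check is possible with Python's standard library. This is only a sketch: the is_allowed helper and the www.example.com URLs are hypothetical, and Python's parser applies the first matching rule in file order, which may not be exactly how the Oncrawl bot evaluates rules. For the simple Allow/Disallow pair used in this article, it should give a reasonable indication.

```python
from urllib.robotparser import RobotFileParser

# The virtual robots.txt from this article: allow /blog/, block everything else.
VIRTUAL_ROBOTS_TXT = """\
User-Agent: Oncrawl
Allow: /blog/
Disallow: /
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "Oncrawl") -> bool:
    """Return True if `url` may be fetched by `user_agent` under `robots_txt`.

    Hypothetical helper for local testing only. It relies on Python's
    first-match rule evaluation, not on Oncrawl's own implementation.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URLs; replace them with your own Start URL candidates.
    for url in (
        "https://www.example.com/blog/",
        "https://www.example.com/blog/some-post",
        "https://www.example.com/blog",
        "https://www.example.com/",
    ):
        status = "allowed" if is_allowed(VIRTUAL_ROBOTS_TXT, url) else "disallowed"
        print(f"{url}: {status}")
```

In this sketch, a Start URL of https://www.example.com/blog (without the trailing slash) comes out as disallowed, because Allow: /blog/ matches by path prefix. The check that Oncrawl performs when you save the crawl profile remains the authoritative one.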

You can now go ahead and click Save & launch crawl.

Check whether a crawl profile uses a virtual robots.txt

You can have a quick look at the active virtual robots.txt from any crawl analysis page by clicking on the i and then switching to the Crawl profile tab.

You can also see this same information from the project page, by hovering over the i next to the crawl profile name.
