Crawl part of a website

Use a virtual robots.txt in order to limit the crawl to a subset of pages on a website


You might want to crawl only part of your website.

You can limit the crawl to certain subdomains by listing them as Start URLs and making sure that the Subdomains option is turned off.

But what if the part of the website you want to crawl is not a subdomain?

One way to do this is to filter URLs based on patterns, so that the crawler only explores the URLs you want. You can do this in the crawl settings.

Another way to do this is to use a virtual robots.txt file, which will only apply to the Oncrawl bot during the crawl, and will not affect the robots.txt file on your website.

Set up a new crawl with URL exclusion or inclusion rules for filtering

Filtering URLs allows you to choose which URLs the crawler will explore. Rules for including URLs mean that the crawler will keep these URLs; rules for excluding URLs mean that the crawler will skip these URLs.

You can use a mix of both types of rules to target exactly the URLs that you want.

To limit the crawl to only URLs under the /blog/ part of our site, we'll now configure a URL inclusion rule:

  1. Scroll down in the crawl settings to URL filtering.

  2. Tick Enable rules.

  3. Click Edit rules and click the + under Include to add a new rule.

  4. Enter */blog/* and save the rule.

You can now save and launch a crawl. All URLs that do not match the rule requirements will be ignored in the crawl.
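
If it helps to picture how an inclusion rule filters URLs, here is a minimal sketch in Python, assuming a simple wildcard match where * stands for any sequence of characters. It only illustrates the matching logic of a */blog/* rule; it is not Oncrawl's actual implementation, and the is_crawled helper is invented for this example.

from fnmatch import fnmatch

# Illustrative rules: one inclusion pattern, no exclusion patterns.
include_rules = ["*/blog/*"]
exclude_rules = []

def is_crawled(url):
    # A URL is kept if it matches at least one inclusion rule
    # (when inclusion rules exist) and no exclusion rule.
    included = not include_rules or any(fnmatch(url, r) for r in include_rules)
    excluded = any(fnmatch(url, r) for r in exclude_rules)
    return included and not excluded

for url in [
    "https://www.oncrawl.com/blog/",            # kept: matches */blog/*
    "https://www.oncrawl.com/blog/seo-guide/",  # kept: matches */blog/*
    "https://www.oncrawl.com/product/",         # ignored: matches no inclusion rule
]:
    print(url, "->", "crawled" if is_crawled(url) else "ignored")

Note that with this rule, a Start URL such as https://www.oncrawl.com/ would not match */blog/*, which is why the Start URL needs to be chosen carefully (see the warning below).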

Make sure the Start URL you have chosen is included and not excluded by your rules!

Set up a new crawl with a virtual robots.txt

Before you start: make sure the domain you want to crawl is verified in your workspace settings. If you don't have access to workspace settings, contact your workspace administrator or manager.

In order to override a robots.txt file, you will need to validate the site you want to crawl. This lets us know you have permission to ignore the way the site is configured.

Configure your crawl as you need.


To limit the crawl to only URLs under the /blog/ part of our site, we'll now configure a virtual robots.txt file:

  1. At the top of the page, make sure the Extra settings are on.

  2. Scroll down to the Extra settings section and click on Virtual robots.txt.

  3. Tick Enable virtual robots.txt and click the + to create a new virtual robots.txt.

By default, Oncrawl fills the input field with the contents of the original robots.txt file, preceded by commented lines (starting with #) that can be used to give the Oncrawl bot access to the website.

We can edit this part to tell the Oncrawl bot to follow only some URLs on the website, for example only links starting with https://www.oncrawl.com/blog/, as follows:

User-Agent: Oncrawl
Allow: /blog/
Disallow: /

Remember to remove the # characters that mark these lines as comments.
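
To double-check what these three lines do before launching, you can run them through a robots.txt parser. The sketch below uses Python's standard urllib.robotparser as an approximation; the Oncrawl bot's own rule matching may differ in edge cases, but for this simple ruleset the outcome is the same: only URLs under /blog/ are allowed.

from urllib import robotparser

# The virtual robots.txt rules from above, with the # removed.
virtual_rules = """User-Agent: Oncrawl
Allow: /blog/
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(virtual_rules.splitlines())

for url in [
    "https://www.oncrawl.com/blog/",
    "https://www.oncrawl.com/blog/technical-seo/",
    "https://www.oncrawl.com/",
]:
    print(url, "->", "allowed" if parser.can_fetch("Oncrawl", url) else "blocked")

In this example, https://www.oncrawl.com/ is blocked, so it could not be used as the Start URL with this virtual robots.txt (see the check described below).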

You can now save the configuration. When you save, a check is performed to ensure that the Oncrawl bot will be able to crawl the website with the given settings.

For example, if the Start URL is disallowed by the virtual robots.txt file, you will not be able to save the crawl profile or to launch a crawl using it.

Make sure the Start URL you have chosen is not disallowed in your virtual robots.txt file!

You can now go ahead and click Save & launch crawl.

Check whether a crawl profile uses a virtual robots.txt

You can have a quick look at the active virtual robots.txt from any crawl analysis page by clicking on the i and then switching to the Crawl profile tab.

You can also see the same information from the project page, by hovering over the i next to the crawl profile name.
