How (and why) to use a virtual robots.txt file

Use a virtual robots.txt file to tell the Oncrawl bot how to crawl your site


A virtual robots.txt file exists only within Oncrawl for the purpose of the crawl it is associated with. It provides the same type of information and functions the same way as your regular robots.txt file, but is tailored to the requirements of your SEO crawl.

Why configure a virtual robots.txt file?

Sometimes you don't want your SEO audit bot to follow the same rules you show to search engine bots in your robots.txt file.

There are many reasons you might not want to use the default robots.txt file for your site. For example, you might want to:

  • Crawl blocked pages.

  • Crawl only a part of the site.

  • Allow our bot to crawl faster than the speed set in the crawl delay.

Of course, you don't want to replace your normal robots.txt file, since it provides important instructions to search engines. This is where a virtual robots.txt file comes in.

Setting up a virtual robots.txt file

Before you start

In order to override a robots.txt file, you will need to validate ownership of the domain you plan to crawl. This lets Oncrawl know you have permission to ignore the way the site is configured.

If you're not sure whether you've already verified your domain, you can follow the steps below.

  1. Use the workspace menu to navigate to the Workspace settings.

  2. Go to the Verify domain ownership section.

  3. Click on Verify a domain. Follow the steps to provide the information we need to verify your domains.

  4. All of the domains shown in the list can now be used with the virtual robots.txt option.

Enable virtual robots.txt

On the crawl settings page, enable the use of a virtual robots.txt file:

  1. At the top of the page, make sure the extra settings are shown. If the toggle button is gray, click Extra settings to display them.

  2. Scroll down to the Extra settings section and click on Virtual robots.txt to expand the section.

  3. Tick Enable virtual robots.txt.

Provide the content of the virtual robots.txt file

To make creating a virtual robots.txt file easier, Oncrawl uses an existing file as a template.

  1. In the dropdown host menu, choose the domain name (the URL of the website) for which you would like to create a virtual robots.txt file.

  2. If the host doesn't already have a virtual robots.txt file, you can create one: click the + to set one up. Then, provide the host (the URL of the website), and click Create virtual robots.txt.

  3. The Virtual robots.txt rules field displays the content of the virtual robots.txt file. Add, modify, or delete rules to create the robots.txt file that will be used only by the Oncrawl bot.
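For example, a finished set of rules that lets the Oncrawl bot crawl the whole site except one directory might look like this (a minimal sketch; /checkout/ is a hypothetical path used for illustration):

User-Agent: Oncrawl
Allow: /
Disallow: /checkout/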

If you are crawling multiple domains or subdomains, repeat these steps (1-3) for each domain or subdomain that needs a virtual robots.txt.

When you are finished, scroll down to the bottom of the page and click Save or Save and launch crawl to save your virtual robots.txt file.

Common robots.txt modifications

Crawl everything

Allow the Oncrawl bot access to everything by adding:

User-Agent: Oncrawl
Allow: /

Crawl blocked pages

To crawl directories or pages that are currently disallowed, delete or comment out the disallow line:

# Disallow: /blog/
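As a minimal sketch, assuming your live robots.txt blocks hypothetical /blog/ and /admin/ directories, the edited rules in the virtual robots.txt might look like this. The commented line is ignored, so the Oncrawl bot can reach /blog/, while /admin/ remains blocked:

User-Agent: *
# Disallow: /blog/
Disallow: /admin/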

Crawl only a part of the site

To crawl only part of a site, delete or comment out rules applying to the entire site.

Then, disallow the directories you don't want to crawl. Allow the directories you want to crawl.

Disallow: /blog/
Allow: /products/

Crawl only some of the site's subdomains

Make sure that crawls on the subdomains you want to crawl are allowed.

For each subdomain that you do not want to crawl, create a virtual robots.txt and disallow the entire subdomain:

User-Agent: Oncrawl
Disallow: /

For example, to crawl please-crawl.mysite.com but not do-not-crawl-1.mysite.com or do-not-crawl-2.mysite.com:

  • Make sure the robots.txt for please-crawl.mysite.com allows the subdomain to be crawled.

  • Create a robots.txt for do-not-crawl-1.mysite.com and disallow the subdomain.

  • Create a robots.txt for do-not-crawl-2.mysite.com and disallow the subdomain.

For more information on this modification, see How to crawl some subdomains but not others.

Allow the Oncrawl bot to crawl faster than the speed set in the crawl delay

Delete or comment out the crawl delay parameter:

# Crawl-delay: 2

Best practices

  • You can allow the Oncrawl bot access to everything by adding:

User-Agent: Oncrawl
Allow: /
  • If you are using Disallow rules, remember not to disallow the Start URLs! See the example at the end of this list.

  • A virtual robots.txt only applies to the domain or subdomain for which it was created. If your crawl includes several subdomains, create a virtual robots.txt for each subdomain.

  • You can find more information on robots.txt files here: http://www.robotstxt.org/robotstxt.html

  • You can use Google Search Console to check how rules affect the crawl.
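For example, if your Start URL is https://www.example.com/products/ (a hypothetical URL used for illustration), keep that path allowed when you add Disallow rules:

User-Agent: Oncrawl
Disallow: /blog/
Allow: /products/
# Do not disallow /products/ here, or the crawl cannot start.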
