A virtual robots.txt file exists only within OnCrawl, for the purposes of the crawl it is associated with. It provides the same type of information and functions the same way as your regular robots.txt file, but is tailored to the requirements of your SEO crawl.

Why configure a virtual robots.txt file?

Sometimes you don't want your SEO audit bot to follow the same rules you show to search engine bots in your robots.txt file.

There are many reasons you might not want to use the default robots.txt file for your site. For example, you might want to:

  • crawl blocked pages.
  • crawl only a part of the site.
  • allow our bot to crawl faster than the speed set in the crawl delay.

Of course, you don't want to replace your normal robots.txt file, since it provides important instructions to search engines. This is where a virtual robots.txt file comes in.

Setting up a virtual robots.txt file

Before you start

In order to override a robots.txt file, you will need to validate the site you want to crawl. This lets us know you have permission to ignore the way the site is configured.

If you're not sure whether you've already validated your project or not, you can follow steps one and two below. If your project is already validated, we'll let you know on the project validation page.

  1. From your project home page (or any other page in the project), in the upper right-hand corner, click on the three dots to open the project menu.
  2. Select "Verify ownership".
  3. Follow the steps to provide the information we need to validate your project.
  4. Click on "Setup new Crawl" to go directly to the crawl settings page.

Enable virtual robots.txt

On the Crawl settings page, enable the use of a virtual robots.txt file:

  1. At the top of the page, make sure the extra settings are shown. If the toggle button is gray, click "Show extra settings" to display them.
  2. Scroll down to the "Extra settings" section and click on "Virtual robots.txt" to expand the section.
  3. Tick "Enable virtual robots.txt".

Provide the content of the virtual robots.txt file

To make creating a virtual robots.txt file easier, OnCrawl uses an existing file as a template.

  1. Provide the domain name (the URL of the website) for which you would like to create a virtual robots.txt file. If the domain's host does not exist or does not reply, but you want to use it anyway, you can click "no host found, create?" to create a blank robots.txt file.
  2. Click the "+" to add the robots.txt.
  3. The "Virtual robots.txt rules" field displays the content of the virtual robots.txt file. Add, modify, or delete rules to create the robots.txt file that will be used by the OnCrawl bot.

If you are crawling multiple domains or subdomains, repeat these steps (1-3) for each domain or subdomain that needs a virtual robots.txt.

When you are finished, scroll down to the bottom of the page and click "Save" or "Save and launch crawl" to save your virtual robots.txt file.

Common robots.txt modifications

Crawl everything

Allow the OnCrawl bot access to everything by adding:

User-Agent: OnCrawl
Allow: /

Crawl blocked pages

To crawl directories or pages that are currently disallowed, delete or comment out the disallow line:

# Disallow: /blog/
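
One way to confirm the effect of commenting out a rule before launching a crawl is to parse both versions of the file locally. The sketch below uses Python's standard `urllib.robotparser` module. The `is_allowed` helper, the example rules, and the example.com URL are hypothetical, and Python's parser is only a stand-in: the OnCrawl bot may interpret some edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rules: str, url: str, agent: str = "OnCrawl") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical rules, before and after commenting out the Disallow line.
original = "User-agent: *\nDisallow: /blog/\n"
edited = "User-agent: *\n# Disallow: /blog/\n"

url = "https://www.example.com/blog/my-post"  # hypothetical URL
print("before:", is_allowed(original, url))  # blocked by the Disallow rule
print("after: ", is_allowed(edited, url))    # no rule applies any more
```

If the second call still reports the page as blocked, another Disallow rule in the file probably covers the same path.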

Crawl only a part of the site

To crawl only part of a site, delete or comment out rules applying to the entire site.

Then, disallow the directories you don't want to crawl and allow the directories you do want to crawl.

Disallow: /blog/
Allow: /products/
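
To preview which pages a partial-crawl configuration lets through, you can filter a list of sample URLs against the rules. This is a minimal sketch using Python's standard `urllib.robotparser`. The `User-Agent: OnCrawl` line is added here as an assumption so that the rules belong to a group, the example.com URLs are hypothetical, and the OnCrawl bot's own matching may differ from this parser's.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical virtual robots.txt for a crawl limited to part of a site.
RULES = """\
User-Agent: OnCrawl
Disallow: /blog/
Allow: /products/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Hypothetical sample URLs to test against the rules.
candidates = [
    "https://www.example.com/products/red-shoes",
    "https://www.example.com/blog/news",
    "https://www.example.com/about",
]

# Keep only the URLs the "OnCrawl" user agent is allowed to fetch.
crawlable = [u for u in candidates if parser.can_fetch("OnCrawl", u)]
for u in crawlable:
    print(u)
```

In this parser, a URL that matches no rule (here `/about`) is treated as allowed, so every directory you want to leave out of the crawl needs its own Disallow line.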

Crawl only some of the site's subdomains

Make sure that the robots.txt for each subdomain you want to crawl allows crawling.

For each subdomain that you do not want to crawl, create a virtual robots.txt and disallow the entire subdomain:

User-Agent: OnCrawl
Disallow: /

For example, to crawl please-crawl.mysite.com but not do-not-crawl-1.mysite.com or do-not-crawl-2.mysite.com:

  • Make sure the robots.txt for please-crawl.mysite.com allows the subdomain to be crawled.
  • Create a robots.txt for do-not-crawl-1.mysite.com and disallow the subdomain.
  • Create a robots.txt for do-not-crawl-2.mysite.com and disallow the subdomain.

For more information on this modification, see How to crawl some subdomains but not others.
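
The subdomain example above can be modeled as one set of rules per host. The sketch below is illustrative only: the `virtual_robots` mapping and the `can_crawl` helper are hypothetical names, and Python's standard `urllib.robotparser` stands in for the OnCrawl bot's rule handling.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

ALLOW_ALL = "User-Agent: OnCrawl\nAllow: /\n"
BLOCK_ALL = "User-Agent: OnCrawl\nDisallow: /\n"

# One virtual robots.txt per subdomain, keyed by host name.
virtual_robots = {
    "please-crawl.mysite.com": ALLOW_ALL,
    "do-not-crawl-1.mysite.com": BLOCK_ALL,
    "do-not-crawl-2.mysite.com": BLOCK_ALL,
}


def can_crawl(url: str, agent: str = "OnCrawl") -> bool:
    """Check `url` against the rules registered for its own host only."""
    host = urlparse(url).hostname
    parser = RobotFileParser()
    parser.parse(virtual_robots[host].splitlines())
    return parser.can_fetch(agent, url)


for host in virtual_robots:
    print(host, can_crawl(f"https://{host}/"))
```

Each URL is checked only against the rules for its own host, which mirrors the fact that a robots.txt applies to a single domain or subdomain.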

Allow the OnCrawl bot to crawl faster than the speed set in the crawl delay

Delete or comment out the crawl delay parameter:

# Crawl-delay:2

For more information on this modification, see Your robots.txt has a crawl delay setup with a value greater than 1 second.
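
To double-check that no crawl delay remains after editing, you can read the value back with Python's standard `urllib.robotparser`. This is a sketch under assumptions: the `read_crawl_delay` helper and the example rules are hypothetical, and the OnCrawl bot may read the directive differently from this parser.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def read_crawl_delay(rules: str, agent: str = "OnCrawl") -> Optional[int]:
    """Return the Crawl-delay that applies to `agent`, or None if there is none."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.crawl_delay(agent)


# Hypothetical rules, before and after commenting out the crawl delay.
original = "User-agent: *\nCrawl-delay: 2\nDisallow: /admin/\n"
edited = "User-agent: *\n# Crawl-delay: 2\nDisallow: /admin/\n"

print("before:", read_crawl_delay(original))
print("after: ", read_crawl_delay(edited))
```

A result of `None` for the edited file means this parser no longer finds a delay for the bot.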

Best practices

  • You can allow the OnCrawl bot access to everything by adding:
    User-Agent: OnCrawl
    Allow: /
  • If you are using Disallow rules, remember not to disallow the start URLs!
  • A robots.txt will only apply to the domain or the subdomain for which it was created. In the case of a crawl including several subdomains, create a robots.txt for each subdomain.
  • You can find more information on robots.txt files here: http://www.robotstxt.org/robotstxt.html
  • You can use Google Search Console to check how rules affect the crawl.
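
As a quick check on the start URL advice above, the following sketch lists any start URLs that a set of rules would block. It relies on Python's standard `urllib.robotparser`. The `blocked_start_urls` helper, the rules, and the example.com URLs are hypothetical, and the result only approximates how the OnCrawl bot applies the same rules.

```python
from urllib.robotparser import RobotFileParser


def blocked_start_urls(rules, start_urls, agent="OnCrawl"):
    """Return the start URLs that `agent` may not fetch under `rules`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return [u for u in start_urls if not parser.can_fetch(agent, u)]


# Hypothetical virtual robots.txt and start URLs.
rules = "User-Agent: OnCrawl\nDisallow: /blog/\n"
start_urls = [
    "https://www.example.com/",
    "https://www.example.com/blog/",
]

problems = blocked_start_urls(rules, start_urls)
print(problems or "No start URL is blocked")
```

An empty result suggests that the crawl can begin from every start URL you listed.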

Going further

If you still have questions about using a virtual robots.txt, feel free to drop us a line at @oncrawl_cs or click on the Intercom button at the bottom right of your screen to start a chat with us.

Happy crawling!
