A virtual robots.txt file exists only within Oncrawl for the purpose of the crawl it is associated with. It provides the same type of information and functions the same way as your regular robots.txt file, but is tailored to the requirements of your SEO crawl.
Why configure a virtual robots.txt file?
Sometimes you don't want your SEO audit bot to follow the same rules you show to search engine bots in your robots.txt file.
There are many reasons you might not want to use your site's default robots.txt file:
Crawl blocked pages.
Crawl only a part of the site.
Allow our bot to crawl faster than the speed set in the crawl delay.
Of course, you don't want to replace your normal robots.txt file, since it provides important instructions to search engines. This is where a virtual robots.txt file comes in.
Setting up a virtual robots.txt file
Before you start
In order to override a robots.txt file, you will need to validate ownership of the domain you plan to crawl. This lets Oncrawl know you have permission to ignore the way the site is configured.
If you're not sure whether you've already validated your project or not, you can follow the steps below.
Use the workspace menu to navigate to the Workspace settings.
Go to the Verify domain ownership section.
Click on Verify a domain. Follow the steps to provide the information we need to verify your domains.
All of the domains shown in the list can now be used with the virtual robots.txt option.
Enable virtual robots.txt
On the crawl settings page, enable the use of a virtual robots.txt file:
At the top of the page, make sure the extra settings are shown. If the toggle button is gray, click Extra settings to display them.
Scroll down to the Extra settings section and click on Virtual robots.txt to expand the section.
Tick Enable virtual robots.txt
Provide the content of the virtual robots.txt file
To make creating a virtual robots.txt file easier, Oncrawl uses an existing file as a template.
In the dropdown host menu, choose the domain name (the URL of the website) for which you would like to create a virtual robots.txt file.
If the host doesn't already have a virtual robots.txt file, you can create one: click the + to set one up. Then, provide the host (the URL of the website), and click Create virtual robots.txt.
The Virtual robots.txt rules field displays the content of the virtual robots.txt file. Add, modify, or delete rules to create the robots.txt file that will be used only by the Oncrawl bot.
If you are crawling multiple domains or subdomains, repeat these steps (1-3) for each domain or subdomain that needs a virtual robots.txt.
When you are finished, scroll down to the bottom of the page and click Save or Save and launch crawl to save your virtual robots.txt file.
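As a rough sketch of what the rules field might contain (the directory names below are hypothetical), a virtual robots.txt often starts from the live file's rules and is then adjusted for the audit: here, the disallow on the blog is commented out so the Oncrawl bot can reach it, while a private area stays blocked.
User-Agent: Oncrawl
# Disallow: /blog/
Disallow: /private/
Allow: /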
Common robots.txt modifications
Crawl everything
Allow the Oncrawl bot access to everything by adding:
User-Agent: Oncrawl
Allow: /
Crawl blocked pages
To crawl directories or pages that are currently disallowed, delete or comment out the disallow line:
# Disallow: /blog/
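For instance, if the live robots.txt blocked both /blog/ and /archive/ (hypothetical directories), the virtual version for the Oncrawl bot might keep the rules in place but comment them out:
User-Agent: Oncrawl
# Disallow: /blog/
# Disallow: /archive/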
Crawl only a part of the site
To crawl only part of a site, delete or comment out rules applying to the entire site.
Then, disallow the directories you don't want to crawl. Allow the directories you want to crawl.
Disallow: /blog/
Allow: /products/
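Put together, a virtual robots.txt that limits the Oncrawl bot to a products section might look like the following sketch (all directory names are hypothetical):
User-Agent: Oncrawl
Disallow: /blog/
Disallow: /support/
Allow: /products/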
Crawl only some of the site's subdomains
Make sure that the subdomains you want to crawl are allowed in their robots.txt files.
For each subdomain that you do not want to crawl, create a virtual robots.txt and disallow the entire subdomain:
User-Agent: Oncrawl
Disallow: /
For example, to crawl please-crawl.mysite.com but not do-not-crawl-1.mysite.com or do-not-crawl-2.mysite.com:
Make sure the robots.txt for please-crawl.mysite.com allows the subdomain to be crawled.
Create a robots.txt for do-not-crawl-1.mysite.com and disallow the subdomain.
Create a robots.txt for do-not-crawl-2.mysite.com and disallow the subdomain.
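As a sketch, the rules for the three hosts could be as simple as the following (one virtual robots.txt per host; the comments only indicate which host each block belongs to):
# please-crawl.mysite.com
User-Agent: Oncrawl
Allow: /
# do-not-crawl-1.mysite.com and do-not-crawl-2.mysite.com
User-Agent: Oncrawl
Disallow: /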
For more information on this modification, see How to crawl some subdomains but not others.
Allow the Oncrawl bot to crawl faster than the speed set in the crawl delay
Delete or comment out the crawl delay parameter:
# Crawl-delay:2
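In context, the edited block of the virtual robots.txt might then read as follows (the surrounding rules are hypothetical; only the commented-out crawl delay comes from the example above):
User-Agent: *
# Crawl-delay:2
Allow: /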
For more information on this modification, see Your robots.txt has a crawl delay setup with a value greater than 1 second.
Best practices
You can allow the Oncrawl bot access to everything by adding:
User-Agent: Oncrawl
Allow: /
If you are using Disallow rules, remember not to disallow the Start URLs!
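For example, if your start URL were https://www.mysite.com/products/ (hypothetical), a rule like the following would prevent the crawl from ever starting:
Disallow: /products/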
A virtual robots.txt only applies to the domain or subdomain for which it was created. If your crawl includes several subdomains, create a virtual robots.txt for each subdomain.
You can find more information on robots.txt files here: http://www.robotstxt.org/robotstxt.html
You can use Google Search Console to check how rules affect the crawl.