Let's start by creating our project.
Once the project is created, we have to validate it before we can define a virtual robots.txt.
Hit the "add your GA account" button:
Fill in the form and press the "verify" button corresponding to the domain you want to crawl. Be aware that the domain must match exactly, including the TLD.
Once validated, the project page looks like this:
It is now time to hit the "set up a new crawl" button. Configure your crawl as you need.
To limit the crawl to URLs under the /blog/ part of our site, we'll now configure a virtual robots.txt file. By default, we fill the input field with the content of the original robots.txt file, preceded by commented-out lines that can be used to give our bot access to the website:
We can edit this part to tell the OnCrawl bot to follow only some of the URLs on the website. For example, to follow only links starting with http://www.oncrawl.com/blog/, proceed as follows:
We can now save the configuration. At this point, a check is performed to ensure that our bot will be able to crawl the website with the given settings. For example, if the start URL is not allowed by the robots.txt file, you will get an error. Make sure the start URL is allowed by the virtual robots.txt file!
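If you want to test a set of rules before pasting them into the form, a rough local check can be sketched with Python's standard urllib.robotparser module. This is only an illustration: the user-agent token "OnCrawl", the exact rules, and the /pricing/ URL are assumptions made for the example, not values copied from the tool.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical virtual robots.txt. The user-agent token "OnCrawl" and the
# rules below are assumptions for illustration only.
VIRTUAL_ROBOTS_TXT = """\
User-Agent: OnCrawl
Allow: /blog/
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The start URL must be allowed, otherwise saving the crawl would fail.
start_url = "http://www.oncrawl.com/blog/"
print(is_allowed(VIRTUAL_ROBOTS_TXT, "OnCrawl", start_url))

# A URL outside /blog/ (hypothetical path) should be blocked by "Disallow: /".
print(is_allowed(VIRTUAL_ROBOTS_TXT, "OnCrawl", "http://www.oncrawl.com/pricing/"))
```

Note that Python's parser applies the rules in the order they are written and stops at the first match, which is why the Allow line comes before the Disallow line here. The bot itself may resolve conflicting rules differently, so treat this as a sanity check rather than a guarantee.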
Before launching the crawl, you can review all the settings:
You can take a quick look at the active virtual robots.txt definition by clicking on the link.
We can now hit the "launch a new crawl" button to start the crawl.