You might want to crawl only part of your website. You can limit the crawl to certain subdomains by listing them as Start URLs and making sure that the "crawl subdomains" option is turned off.
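For example, here is a minimal sketch of that setup, assuming the section you want to crawl lives on a hypothetical subdomain blog.example.com:

Start URL: https://blog.example.com/
Crawl subdomains: off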
But what if the part of the website you want to crawl is not a subdomain?
One way to do this is to use a virtual robots.txt file, which will only apply to the OnCrawl bot during the crawl, and will not affect the robots.txt file on your website.
Before you start: make sure your project is validated
In order to override a robots.txt file, you will need to validate the site you want to crawl. This lets us know you have permission to ignore the way the site is configured.
If you're not sure whether you've already validated your project, you can follow steps one and two below. If your project is already validated, we'll let you know on the project validation page.
From your project home page (or any other page in the project), in the upper right-hand corner, click on the three dots to open the project menu.
Select "Verify ownership"
Follow the steps to provide the information we need to validate your project.
Click on "Setup new Crawl" to go directly to the crawl settings page.
Set up a new crawl with a virtual robots.txt
It is now time to hit the "set up a new crawl" button. Configure your crawl as you need.
To limit the crawl to only the URLs under the /blog/ section of our site, we'll now configure a virtual robots.txt file:
At the top of the page, make sure the extra settings are shown. If the toggle button is gray, click "Show extra settings" to display them.
Scroll down to the "Extra settings" section and click on "Virtual robots.txt" to expand the section.
Tick "Enable virtual robots.txt" and click the "+" to add a new virtual robots.txt file.
By default, we fill the input field with the content of the original robots.txt file, preceded by commented lines that can be used to give our bot access to the website:
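For illustration only (a sketch, not the actual prefilled content, which depends on your site's own robots.txt; the Disallow rule below is hypothetical), the field might look something like this, with commented lines at the top that can be uncommented to give our bot full access:

# User-Agent: OnCrawl
# Allow: /
User-Agent: *
Disallow: /private/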
We can edit this part to tell the OnCrawl bot to follow only some URLs on the website. For example, to follow only links starting with http://www.oncrawl.com/blog/, use the following rules:
User-Agent: OnCrawl
Allow: /blog/
Disallow: /
We can now save the configuration. At this point, a check is performed to ensure that our bot will be able to crawl the website with the given settings.
For example, if the Start URL is not allowed by the virtual robots.txt file, you will see an error, so make sure your Start URL is allowed!
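As an illustration, assuming the /blog/ rules above and a site at http://www.oncrawl.com/, the check would behave roughly as follows:

# Start URL http://www.oncrawl.com/      -> blocked by "Disallow: /", the check fails
# Start URL http://www.oncrawl.com/blog/ -> matched by "Allow: /blog/", the check passes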
You can now go ahead and hit "Save and launch crawl"!
Check whether a crawl profile uses a virtual robots.txt
You can have a quick look at the active virtual robots.txt for any crawl by hovering over the "i" next to the crawl profile listed on the project home page and clicking "show":