Oncrawl allows you to crawl some subdomains but not others.
There are several ways to do this. Here are two of them:
Use multiple start URLs (and do not crawl encountered subdomains) if you have a lot of subdomains to exclude and few links between subdomains. This is also a good choice if you cannot validate your project.
Use a virtual robots.txt (and crawl encountered subdomains) if you only have a few subdomains to exclude and lots of links between subdomains. This method requires a validated project.
1. Using multiple start URLs to crawl subdomains
This method works by launching the crawl simultaneously on the subdomains you want to crawl, while also telling the Oncrawl bot not to explore any other subdomains it encounters.
Use this method if you have a lot of subdomains you want to exclude, if there are few links between your subdomains, or if you can't validate your project.
🚨 Warning: Because you will tell the Oncrawl bot not to explore discovered subdomains, links from pages in one subdomain to pages in another subdomain will be treated as external links.
Modify your crawl profile or create a new one
1. From the project home page, click Set up a new crawl.
2. Check which crawl profile you are modifying. If necessary, use the Create crawl profile button at the top right to create a new one.
Set the Start URLs
1. Click on Start URL to expand the section.
2. Make sure your crawl is set to Spider mode.
3. In the Start URL field, enter the start URL for one of the subdomains you would like to crawl.
4. Enter the Additional start URLs for any additional subdomains that you want to crawl (see the example below).
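For example, suppose you only want to crawl a site's store and blog subdomains. The subdomain names below are hypothetical placeholders; substitute your own:
Start URL: https://store.example.com
Additional start URLs: https://blog.example.com
With Crawl encountered subdomains turned off (see the next step), the crawl will stay within these two subdomains.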
Do not crawl encountered subdomains
1. Click on Subdomains to expand the section.
2. Make sure that the Crawl encountered subdomains box is not ticked.
When you launch a crawl, only the subdomains you listed as start URLs will be crawled.
2. Using a virtual robots.txt
This method works by allowing all encountered subdomains to be crawled, and then telling the Oncrawl bot to skip the subdomains you want to exclude.
Use this method if you only have a few subdomains that you don't want to crawl, if there are lots of links between your subdomains, or if your project is already verified.
You may want to take a look at the article on How to use a virtual robots.txt.
Before you start: verify ownership to validate your project
Because you'll be using a virtual robots.txt to override the site's own robots.txt, you will need to validate the site you want to crawl. This lets us know that you have permission to ignore the way the site is configured.
If you're not sure whether you've already validated your project or not, you can follow steps one and two below. If your project is already validated, we'll let you know on the project validation page.
1. From your project home page, in the upper right-hand corner, click on the three dots to open the project menu.
2. Select Verify ownership.
3. Follow the steps to provide the information we need to validate your project.
4. Click on Set up a new crawl to go directly to the crawl settings page.
Set up crawl parameters
1. From the project home page, click Set up a new crawl.
2. At the top of the page, make sure the extra settings are shown. If the toggle button is gray, click on it to display them.
Set the Start URLs
1. Click on Start URL to expand the section.
2. Make sure your crawl is set to Spider mode.
3. In the Start URL field, enter the start URL for one of the subdomains you would like to crawl.
4. Enter the Additional start URLs for any additional subdomains that you want to crawl. This ensures that the additional subdomains will be crawled, even if the first link to them is beyond the maximum depth or number of pages (see the example below).
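For example (again with hypothetical subdomain names), if docs.example.com is only linked from pages deep within www.example.com, listing it explicitly guarantees it is crawled from the start:
Start URL: https://www.example.com
Additional start URLs: https://docs.example.com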
Enable crawling encountered subdomains
1. Click on Subdomains to expand the section.
2. Tick the Crawl encountered subdomains box.
At this point, if you launch a crawl, all subdomains will be crawled.
The next steps will limit the subdomains that can be crawled.
Disallow unwanted subdomains
You can use a virtual robots.txt to tell the Oncrawl bot to avoid certain subdomains.
1. Scroll down to Extra settings and click on Virtual robots.txt to expand the section.
2. Tick the Enable virtual robots.txt box to enable this option.
3. Enter the first subdomain that you want to avoid in the Virtual robots.txt field.
4. Click on the + to create a virtual robots.txt file.
5. Replace the contents of the robots.txt file with:
User-Agent: OnCrawl
Disallow: /
Repeat steps 3-5 for each subdomain that you do not want to crawl.
Make sure that the desired subdomains are allowed
1. In the Virtual robots.txt section, enter the first subdomain that you want to crawl in the Virtual robots.txt field.
2. Click on the + to create a virtual robots.txt file.
3. Replace the contents of the robots.txt file with:
User-Agent: OnCrawl
Allow: /
Repeat steps 1-3 for each subdomain that you want to crawl.
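As an illustrative summary, assuming a hypothetical site where blog.example.com should be crawled and shop.example.com should not, the two virtual robots.txt entries would look like this:
For blog.example.com:
# hypothetical subdomain to crawl
User-Agent: OnCrawl
Allow: /
For shop.example.com:
# hypothetical subdomain to exclude
User-Agent: OnCrawl
Disallow: /
With the Crawl encountered subdomains box ticked and these entries in place, the Oncrawl bot will explore blog.example.com but skip every page on shop.example.com.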