OnCrawl allows you to crawl some subdomains but not others.

There are multiple ways to crawl some subdomains and not crawl others. Here are two of them.

  • Use a virtual robots.txt and crawl encountered subdomains if you only have a few subdomains to exclude and there are lots of links between your subdomains.
  • Use multiple start URLs and do not crawl encountered subdomains if you have a lot of subdomains to exclude and few links between them. This is also a good choice if you cannot verify your project.

Using virtual robots.txt

This method works by ensuring that all subdomains will be crawled, and then telling the OnCrawl bot not to crawl certain subdomains.

Use this method if you only have a few subdomains that you don't want to crawl, if there are lots of links between your subdomains, or if your project is already verified.

You may want to take a look at the article on How to use a virtual robots.txt.

Before you start

Because you'll need to use a virtual robots.txt file, you will need to validate the site you want to crawl. This lets us know you have permission to ignore the way the site is configured.

If you're not sure whether you've already validated your project or not, you can follow steps one and two below. If your project is already validated, we'll let you know on the project validation page.

  1. From your project home page (or any other page in the project), in the upper right-hand corner, click on the three dots to open the project menu.
  2. Select "Verify ownership".
  3. Follow the steps to provide the information we need to validate your project.
  4. Click on "Setup new Crawl" to go directly to the crawl settings page.

Set up Crawl parameters

  1. From the project home page, click "Create new crawl".
  2. At the top of the page, make sure the extra settings are shown. If the toggle button is gray, click "Show extra settings" to display them. 

Now let's look at the different settings you'll need.

Set the Start URLs

  1. Click on "Start URL" to expand the section.
  2. Make sure your crawl is set to "Spider mode".
  3. In the "Start URL" field, enter the start URL for one of the subdomains you would like to crawl.
  4. In the field labelled "You can define additional start URLs to your crawl", enter the start URL for any additional subdomains that you want to crawl. This ensures that they will be crawled, even if the first link to them is beyond the maximum depth or number of pages.
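For example, suppose your site had two hypothetical subdomains, www.example.com and blog.example.com, and you wanted to crawl both. You might enter the first as the start URL and the second as an additional start URL:

https://www.example.com/
https://blog.example.com/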

Enable crawling encountered subdomains

  1. Click on "Subdomains" to expand the section.
  2. Tick the "Crawl encountered subdomains" box.

At this point, if you launch a crawl, all subdomains will be crawled.

The next steps will limit the subdomains that can be crawled.

Disallow unwanted subdomains

You can use a virtual robots.txt to tell the OnCrawl bot to avoid certain subdomains.

  1. Scroll down to "Extra settings" and click on "Virtual robots.txt" to expand the section.
  2. Tick the "Enable virtual robots.txt" box to enable this option.
  3. Enter the first subdomain that you want to avoid in the "Virtual robots.txt" field.
  4. Click on the "+" to create a virtual robots.txt file.
  5. Replace the contents of the robots.txt file with:
User-Agent: OnCrawl
Disallow: /

Repeat steps 3-5 for each subdomain that you do not want to crawl.
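To illustrate, suppose you want to exclude a hypothetical subdomain shop.example.com. You would enter shop.example.com in the "Virtual robots.txt" field, click on the "+", and replace the contents of its file with:

User-Agent: OnCrawl
Disallow: /

This tells the OnCrawl bot to avoid every page on that subdomain, while leaving all other subdomains crawlable.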

Make sure that the desired subdomains are allowed

To be safe, it's also best to make sure that the subdomains you want to crawl allow robots.

If this is not the case, you can use a virtual robots.txt to tell the OnCrawl bot to crawl these subdomains anyway.

  1. Scroll down to "Extra settings" and click on "Virtual robots.txt" to expand the section.
  2. Enter the first subdomain that you want to crawl in the "Virtual robots.txt" field.
  3. Click on the "+" to create a virtual robots.txt file.
  4. Replace the contents of the robots.txt file with:
User-Agent: OnCrawl
Allow: /

Repeat steps 2-4 for each subdomain that you want to crawl.
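For example, for a hypothetical subdomain blog.example.com that you do want to crawl, its virtual robots.txt would contain:

User-Agent: OnCrawl
Allow: /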

Using multiple start URLs to crawl subdomains

This method works by launching the crawl simultaneously on the subdomains you want to crawl, while also telling the OnCrawl bot not to explore any other subdomains it encounters.

Use this method if you have a lot of subdomains you want to exclude, if there are few links between your subdomains, or if you can't validate your project.

Warning: Because you will tell the OnCrawl bot not to explore discovered subdomains, links from pages in one subdomain to pages in another subdomain will be treated as external links.

Set up Crawl parameters

  1. From the project home page, click "Create new crawl".

Set the Start URLs

  1. Click on "Start URL" to expand the section.
  2. Make sure your crawl is set to "Spider mode".
  3. In the "Start URL" field, enter the start URL for one of the subdomains you would like to crawl.
  4. In the field labelled "You can define additional start URLs to your crawl", enter the start URL for any additional subdomains that you want to crawl.

Disable crawling encountered subdomains

  1. Click on "Subdomains" to expand the section.
  2. Make sure that the "Crawl encountered subdomains" box is not ticked.

When you launch a crawl, only the subdomains you listed as start URLs will be crawled.

Going further

If you still have questions about crawling subdomains, feel free to drop us a line at @oncrawl_cs or click on the Intercom button at the bottom right of your screen to start a chat with us.

Happy crawling!