How to crawl a staging/pre-prod website

It's good practice to protect staging websites and pre-production websites from bots. Here's how to crawl one before it goes live.

During staging or pre-production, it’s best to prevent Google and other search engines from indexing a site that’s still being modified or tested. This usually involves restricting access to the website in one way or another: a password, meta robots instructions, IP restrictions, a proxy server, or a robots.txt file that blocks bots.

Note: The robots.txt file and the meta robots tags on your pre-prod site should be the same as the ones you intend to use when the site goes live. This best practice helps avoid costly errors at launch. In almost all cases, an htpasswd is sufficient protection for a pre-prod website.
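
For instance, on an Apache server, htpasswd protection can be just a few lines in an .htaccess file. Here is a minimal sketch, assuming Apache with the basic authentication modules enabled; the file paths and username are placeholders, not values from Oncrawl:

    # Create the password file once, on the server (prompts for a password):
    #   htpasswd -c /etc/apache2/.htpasswd staging-user

    # .htaccess at the root of the pre-prod site:
    AuthType Basic
    AuthName "Pre-production"
    AuthUserFile /etc/apache2/.htpasswd
    Require valid-user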

Here’s how to allow the Oncrawl bot access to your staging or pre-production website.

Your pre-prod site is password protected

To crawl a site that is protected by a password, you will need to set up a crawl that provides the correct login credentials to the site.

When setting up your crawl, make sure the Extra settings option at the top of the crawl settings page is turned on. The toggle button is green when the extra settings are visible.

  1. Scroll down and click Authentication to expand the section.

  2. Tick the Enable HTTP authentication box.

  3. Provide the username and the password used to access your site.

  4. Optionally, you can select the type of authentication scheme (Basic, Digest, or NTLM).

  5. To make it easier to remember what this login is for, you can also indicate the Realm it applies to (e.g. blog, CMS, admin…).
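
Before launching the crawl, you can sanity-check the credentials from the command line. A quick sketch using curl; the URL, username, and password below are placeholders:

    # Basic authentication:
    curl -I -u staging-user:secret https://staging.example.com/

    # Digest or NTLM, if that's the scheme your server uses:
    curl -I --digest -u staging-user:secret https://staging.example.com/
    curl -I --ntlm -u staging-user:secret https://staging.example.com/

    # A 200 response means the credentials are accepted; 401 means they were rejected.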

Your pre-prod site uses meta instructions aimed at bots (noindex, nofollow)

The Oncrawl bot respects meta robots instructions.

Before you can run a crawl on your pre-production site, you'll need to ask your developers to set up the pre-production site exactly the way the live site will be set up, including meta robots instructions. This best practice also helps to avoid errors when going live.

For technical reasons, we cannot make exceptions on this point. Oncrawl will be unable to crawl your site if you do not remove meta instructions that prohibit bots (noindex, nofollow) on pages that should be indexed.
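
For reference, these are the meta robots tags involved; they are standard HTML, placed in the page's head:

    <!-- Blocks bots, including Oncrawl. Remove this from pages that should be crawled: -->
    <meta name="robots" content="noindex, nofollow">

    <!-- Explicitly allows crawling and indexing (also the default when no tag is present): -->
    <meta name="robots" content="index, follow">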

Your pre-prod site uses IP restrictions

To crawl a site that limits access to a set of whitelisted IP addresses, you will need to add Oncrawl's static IP addresses to the whitelist.

When setting up your crawl, make sure the Extra settings option at the top of the crawl settings page is turned on. The toggle button is green when the extra settings are visible.

  1. Scroll down to Crawler IP addresses and click to expand the section.

  2. This will display a list of IP addresses that Oncrawl will use to crawl your site. Whitelist these addresses, or ask your web development team to whitelist them, before launching your crawl. 
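
How you whitelist them depends on your server. As an illustration, on nginx it could look like the following; the addresses below are placeholders, so copy the real list from the crawl settings page:

    location / {
        # Oncrawl crawler IPs (placeholders):
        allow 203.0.113.10;
        allow 203.0.113.11;
        # Your own team's range:
        allow 198.51.100.0/24;
        # Block everyone else:
        deny all;
    }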

Your pre-prod site is located behind a proxy server

If your pre-prod site is located behind a proxy server, you can configure Oncrawl to crawl your site by overriding the DNS. Note that this option cannot be used with a JS crawl.

When setting up your crawl, make sure the Extra settings option at the top of the crawl settings page is turned on. The toggle button is green when the extra settings are visible.

  1. Scroll down and click DNS Override to expand the section.

  2. Enter the IP address of the server you want to use and the corresponding domain.
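
In effect, this works like an entry in a local hosts file: during the crawl, requests for that domain are sent directly to the IP address you entered, bypassing public DNS. A hypothetical equivalent, with placeholder values:

    # Hosts-file equivalent of the override:
    # requests for staging.example.com are sent to 203.0.113.42
    203.0.113.42    staging.example.com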

Your pre-prod site uses a robots.txt to prohibit bots

You can override the robots.txt rules for the Oncrawl bot only by using a virtual robots.txt file: Oncrawl applies the virtual file during its crawl, while the real robots.txt on your server stays unchanged.

When setting up your crawl, make sure the Extra settings option at the top of the crawl settings page is turned on. The toggle button is green when the extra settings are visible.

  1. Scroll down and click Virtual robots.txt to expand the section.

  2. Tick the Enable virtual robots.txt box.

  3. Enter the domain for the pre-prod site and click the +.

  4. Modify the virtual robots.txt you have just created to allow the Oncrawl bot access to your site.

If you need help, you can take a look at our documentation on virtual robots.txt files.
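
For example, to keep blocking all other bots while letting the Oncrawl bot in, the virtual robots.txt could look like the sketch below. The user-agent token shown is an assumption; check the documentation above for the exact value.

    # Let the Oncrawl bot in:
    User-agent: Oncrawl
    Allow: /

    # Keep blocking everything else, as on the real pre-prod robots.txt:
    User-agent: *
    Disallow: /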
