During staging or pre-production periods, it’s best to prevent Google from indexing a site that’s still being modified or tested. This usually means restricting access to the website, for example with password protection, IP restrictions, or a proxy server.

Note: The robots.txt file and the meta robots tags on your pre-prod site should be the same as the ones you intend to use when you go live. This best practice helps avoid costly errors. In almost all cases, password protection (for example, an .htpasswd file) is sufficient to protect a pre-prod website.

Here’s how to allow the OnCrawl bot access to your staging or pre-production website.

Your pre-prod site is password protected

To crawl a site that is protected by a password, you will need to set up a crawl that provides the correct password to the site.

When setting up your crawl, make sure to click "Show extra settings" at the top of the crawl settings page. The toggle button should be red if the extra settings are visible.

  1. Scroll down and click "Authentication" to expand the section.
  2. Tick the "Enable authentication" box.
  3. Provide the username and the password used to access your site.
  4. Optionally, indicate the type of authentication (Basic, Digest, or NTLM).
  5. To make it easier to recall what you're logging in to, you can also provide an indication of the "realm" that this login applies to: blog, CMS, admin…
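To illustrate what the "Basic" authentication type involves, here is a minimal Python sketch of how a client builds the Authorization header from a username and password. It is not OnCrawl's implementation; the function name and the credentials are hypothetical and only show the mechanism.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Build the header used by HTTP Basic authentication.

    The credentials are joined with a colon and base64-encoded.
    """
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


if __name__ == "__main__":
    # Hypothetical credentials, for illustration only.
    print(basic_auth_header("staging-user", "example-password"))
```

Because base64 is an encoding and not encryption, Basic authentication only protects the password when the site is served over HTTPS.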

Your pre-prod site uses meta instructions aimed at bots (noindex, nofollow)

The OnCrawl bot respects instructions to bots.

Before you can run a crawl on your pre-production site, you'll need to ask your developers to set up the pre-production site exactly the way the live site will be set up, including meta robots instructions. This best practice also helps to avoid errors when going live.

For technical reasons, we cannot make exceptions on this point. OnCrawl will be unable to crawl your site if you do not remove meta instructions that prohibit bots (noindex, nofollow) on pages that should be indexed.
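If you want to check a page for blocking meta instructions before launching a crawl, a small script along these lines can help. This is a sketch using only the Python standard library; the class and function names are hypothetical, and it only looks at meta robots tags in the HTML, not at HTTP headers.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def blocking_directives(html: str) -> set[str]:
    """Return the directives on the page that tell bots to stay away."""
    parser = MetaRobotsParser()
    parser.feed(html)
    # "none" is shorthand for "noindex, nofollow".
    return parser.directives & {"noindex", "nofollow", "none"}


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(blocking_directives(page))
```

An empty result means the page carries no meta instruction that would keep a bot from indexing it or following its links.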

Your pre-prod site uses IP restrictions

To crawl a site that limits access to a few, whitelisted IP addresses, you will need to authorize the OnCrawl static IP addresses.

When setting up your crawl, make sure to click "Show extra settings" at the top of the crawl settings page. The toggle button should be red if the extra settings are visible.

  1. Scroll down to "Crawler IP addresses" and click to expand the section.
  2. Tick the "Use static IP addresses" box. 

This will display a list of IP addresses that OnCrawl will use to crawl your site. Whitelist these addresses, or ask your web development team to whitelist them, before launching your crawl. 
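For teams that manage their allowlist in code, the check on the server side amounts to something like the sketch below. The addresses are placeholders taken from a range reserved for documentation, not OnCrawl's real IP addresses; use the list displayed in your crawl settings. The function name is hypothetical.

```python
import ipaddress

# Placeholder addresses (documentation range), NOT the real OnCrawl IPs.
CRAWLER_IPS = ["203.0.113.10", "203.0.113.11"]


def is_allowed(client_ip: str, allowlist: list[str]) -> bool:
    """Return True if client_ip matches an address or CIDR range in allowlist."""
    ip = ipaddress.ip_address(client_ip)
    return any(
        ip in ipaddress.ip_network(entry, strict=False)
        for entry in allowlist
    )


if __name__ == "__main__":
    print(is_allowed("203.0.113.10", CRAWLER_IPS))
    print(is_allowed("198.51.100.7", CRAWLER_IPS))
```

Each entry can be a single address or a CIDR range, so the same function works whether your team allows individual IPs or whole blocks.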

Please note: the Static IP Addresses option is not included by default in standard plans. If the option is grayed out and can't be clicked, please contact the OnCrawl business team to discuss adding it to your plan.

If your pre-prod site is built in JavaScript and you are already using the OnCrawl JS crawl, we've likely already set static IP addresses for your website. Please contact us using the blue Intercom chat button at the bottom of the screen so that we can provide you with the list.

Your pre-prod site is located behind a proxy server

If your pre-prod site is located behind a proxy server, you can configure OnCrawl to crawl your site by overriding the DNS.

When setting up your crawl, make sure to click "Show extra settings" at the top of the crawl settings page. The toggle button should be red if the extra settings are visible.

  1. Scroll down and click "DNS Override" to expand the section.
  2. Enter the IP address of the server you want to use and the corresponding domain.
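Conceptually, a DNS override means the crawler connects to the IP address you entered while still asking for the original domain. The sketch below shows that idea in Python; it is an illustration under that assumption, not OnCrawl's code, and the domain, IP address, and function name are hypothetical.

```python
from urllib.parse import urlsplit


def resolve_with_override(
    url: str, overrides: dict[str, str]
) -> tuple[str, int, dict[str, str]]:
    """Return (address to connect to, port, headers) for a URL.

    If the URL's domain appears in overrides, the connection goes to the
    overriding IP address; the Host header keeps the original domain so the
    server knows which site is being requested.
    """
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        raise ValueError(f"URL has no host: {url!r}")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    target = overrides.get(host, host)
    return target, port, {"Host": host}


if __name__ == "__main__":
    # Hypothetical pre-prod domain and server IP.
    overrides = {"preprod.example.com": "203.0.113.20"}
    print(resolve_with_override("http://preprod.example.com/page", overrides))
```

Domains that are not listed in the override table are returned unchanged and resolved through normal DNS.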

Your pre-prod site uses a robots.txt to prohibit bots

You can override the robots.txt file for the OnCrawl bot alone by using a virtual robots.txt file. Other bots will continue to follow the robots.txt file on your site.

When setting up your crawl, make sure to click "Show extra settings" at the top of the crawl settings page. The toggle button should be red if the extra settings are visible.

  1. Scroll down and click "Virtual robots.txt" to expand the section.
  2. Tick the "Enable virtual robots.txt" box.
  3. Enter the domain for the pre-prod site and click the "+".
  4. Modify the virtual robots.txt you have just created to allow the OnCrawl bot access to your site.

If you need help, you can take a look at our documentation on virtual robots.txt files.
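Before pasting rules into the virtual robots.txt, you can test them locally with Python's standard robots.txt parser. The sketch below assumes the bot identifies itself with the user-agent token "OnCrawl" and uses a hypothetical pre-prod URL; check the user-agent shown in your crawl settings before relying on it.

```python
from urllib.robotparser import RobotFileParser

# Example rules: allow one bot everywhere, keep all other bots out.
# The "OnCrawl" user-agent token is an assumption for this example.
VIRTUAL_ROBOTS_TXT = """\
User-agent: OnCrawl
Allow: /

User-agent: *
Disallow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://preprod.example.com/page"  # hypothetical URL
    print(can_fetch(VIRTUAL_ROBOTS_TXT, "OnCrawl", url))
    print(can_fetch(VIRTUAL_ROBOTS_TXT, "Googlebot", url))
```

With these rules, the named bot is allowed and every other bot falls under the catch-all Disallow rule.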

Going further

If you still have questions, drop us a line at @oncrawl_cs or click on the Intercom button at the bottom right of your screen to start a chat with us.

Happy crawling!

