OnCrawl's crawler is designed to crawl real websites. We provide modifiable crawl parameters that will help adapt our crawler to your specific case.

Here's how to set up a crawl.

Where to find the crawl settings

From your project page, click on "+ Set up new crawl".

This will take you to the crawl settings page. Your default crawl settings are displayed. You can either modify them, or create a new crawl configuration.

Crawl settings configurations

A crawl settings configuration is a set of settings that has been saved with a name to be used later. 

You can modify or verify the settings in a crawl settings configuration by selecting the configuration in the drop-down menu next to the "Crawl settings" header.

If you save changes when viewing a crawl settings configuration, this will overwrite the old settings for the configuration.

To create a new crawl settings configuration and leave your current configurations as they are, click the blue "+ Create Crawl Config" button in the upper right.

Name: The name you give a crawl configuration will be used to identify it in the OnCrawl interface. We list the latest crawl configuration on the project's card in the list of projects, and on the project page we list all of the crawls in the project along with the name of the crawl configuration used. A good name might state what is being crawled or the type of additional data included: for example, a "Rankings" crawl, or a "Blog without parameters" crawl of the subdomain https://blog.example.com that excludes query strings appended to the end of the URL.

Copy config from: The new configuration will always be based on an existing configuration, but you can change all of the parameters once the configuration has been created.

After creating the configuration and making the changes you want to the settings on this page, click "Save" to save the changes, or "Save and launch crawl" to launch a crawl right away.

Crawl settings

The "Crawl settings" section covers basic settings required to be able to launch a crawl. This defines what you crawl, with what crawler profile, and how fast.

Start URL

The Start URL parameter determines how the crawler will discover your URLs.

In Spider mode, the crawler starts at a given point and follows links on the pages it encounters until it reaches a stop condition. The start URL sets the URL (or URLs) that the crawler will start from. This will determine the depth ranking of the pages in your site as OnCrawl maps out your site's architecture.

In URL list mode, the crawler will examine the URLs in a list you provide, but will not follow any links it finds on those pages.

Learn more about crawl modes and start URLs.

Crawl limits

The crawl limits tell the crawler when to stop crawling in spider mode. (Consequently, this setting is not available when you've set the crawler to URL list mode.)

You can set:

  • The max URLs: the crawler will stop once it has discovered a given number of URLs, unless it has reached a different limit before then.
  • The max depth: each time the crawler follows a link, it advances one level deeper into the site architecture. The start URL or start URLs constitute depth 1, and all pages linked to from the start URL(s) have a depth of 2. If you set a depth of 3, the crawler will stop after discovering all of the pages at depth 3 (but not following their links), unless it has reached a different limit before then.
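
For example, if you set a max of 50,000 URLs and a max depth of 5, the crawler will stop as soon as it reaches whichever of the two limits comes first.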

Crawl bot

You can set:

  • Whether to use a mobile or desktop bot when crawling. This can be useful if you provide different pages to a mobile visitor than to a desktop visitor. Differences might include a lightweight design or a different page address (such as https://m.example.com or https://www.myexample.com/mobile).
  • The bot name in the User-Agent. This can be used to test your site's behavior if you treat some bots differently, or if you need to identify the OnCrawl bot.

The full user-agent for your chosen bot configuration is provided on the right.
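
As an illustration only (the exact string for your configuration is always displayed in the interface), a bot named "OnCrawl" typically appears inside a user-agent string along these lines, which makes it easy to spot in your server logs:

Mozilla/5.0 (compatible; OnCrawl/1.0; +https://www.oncrawl.com/)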

Learn more about the OnCrawl user-agent.

Max crawl speed

You can set:

  • The number of pages crawled per second.

To the right, you can enter a number of URLs to find the approximate time it will take to crawl your site at the chosen speed.
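
As a rough calculation, at 10 pages per second the crawler fetches about 36,000 pages per hour, so a site of 500,000 pages would take around 14 hours to crawl at that speed.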

The ideal speed is the speed your server and site architecture can handle best. Unlike Google, which spreads its requests out over time rather than hitting your server 10 times per second until it has discovered all of your pages, OnCrawl makes a lot of requests one after another. If OnCrawl's crawl speed is too high, your server might not be able to keep up.

To protect your site, we ask you to verify ownership of your site by linking a Google account before we let you increase the crawl speed beyond 10 URLs/second.

URL with parameters

You can set:

  • Whether or not to crawl URLs with query strings following the main URL (parameters).
  • Whether or not to filter the parameters you want to crawl. You can then set the parameters to exclude ("Keep all parameters except for...") or to include ("Remove all parameters except for..."), if you don't want to crawl all parameters. List the parameters in the space provided with spaces between them.

Crawling parameters can significantly increase the number of URLs you crawl. Note that, when you crawl parameters, OnCrawl (and Google) count the same page with different query strings as separate URLs.
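
For example, the following would be counted as three separate URLs:

https://www.example.com/shoes
https://www.example.com/shoes?color=red
https://www.example.com/shoes?color=red&size=38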

Subdomains

By default, if you start a crawl for https://www.example.com, OnCrawl will consider that any subdomains it encounters, such as https://store.example.com, are part of a different site.

You can override this behavior by checking the "Crawl encountered subdomains" box.

Learn more about crawling subdomains.

Sitemaps

If you provide a sitemap, OnCrawl will compare the contents of your sitemap to the contents and structure of your site as discovered by the crawler.

You can:

  • Specify sitemap URLs: provide the URL to any sitemaps you would like to use. You can use more than one.
  • Allow soft mode: OnCrawl usually follows the rules that apply to bots with regard to sitemaps. If you want to ignore these rules, choose this option. A common use is to ignore the rule that makes a sitemap apply only to the folder where it is stored. This allows sitemaps in https://www.example.com/sitemaps/ to apply to https://www.example.com and to https://www.example.com/news/.

Learn more about using sitemaps to check your URLs.

Virtual robots.txt

If you don't want to follow the instructions in your robots.txt file, you can provide a virtual, temporary version for the OnCrawl bot only. This will not modify or suspend your actual robots.txt file.

You can:

  • Enable virtual robots.txt and provide the contents of the virtual file that will apply only to the OnCrawl bot.
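
For example, a virtual robots.txt that lets the bot crawl everything except one folder might look like this (a minimal sketch; adapt the paths to your own site):

User-agent: *
Disallow: /private/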

Note: you will need to have validated your project before you can override the robots.txt in place for this website.

Learn more about using a virtual robots.txt file.

Analysis

In the "Analysis" section, indicate which third-party data should be used to create cross-analyses and added to the information available for each URL.

Note: your project needs to be an advanced project in order to enable additional analyses. You will need to convert your project to an advanced project before proceeding.

SEO impact report

This report provides cross-analysis with analytics data. Enable it by connecting your Google Analytics or AT Internet account here.

Learn more about the SEO impact report with AT Internet. You can also enable this report with Google Analytics.

Ranking report

This report provides cross-analysis with data on SERP positions. Enable it by connecting your Google Search Console account here.

Learn more about the ranking report.

Backlink report

This report provides cross-analysis with data on external backlinks to your site. Enable it by connecting your Majestic account here.

Crawl over crawl

The crawl over crawl function compares two crawls with the same start URL(s) and the same subdomain settings. Enable the cross-analysis with a similar crawl here.

Alternatively, you can add this analysis later from the project page by scheduling a crawl over crawl between two compatible crawls.

Learn more about crawl over crawl.

Scraping

Data scraping uses rules to harvest information in the source code of your web pages as they are crawled. This data is saved in custom fields and can be used in cross-analysis for custom metrics.

In this section, you will define the scraping rules to find the data for each of the custom fields you want to create.
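
As a sketch of the kind of rule you might define (the exact rule syntax and options are described in the scraping documentation), an XPath expression like the following could capture each article's author into a custom field:

//meta[@name='author']/@content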

Note: if this option is not already part of your plan, you will need to subscribe to it before activating it. Use the Intercom button below to request help from your sales representative.

Learn more about data scraping and custom fields.

Data ingestion

Data ingestion imports information provided in a CSV file and adds it to the information for each URL as it is crawled. Data from any source, such as exports from SEMrush, your CRM, or any other tool, can be provided in this format.

In this section, you will provide the file or files containing the additional information you want to include in your analysis.
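
As a minimal sketch (assuming a CSV with a header row whose first column contains the full URL; check the data ingestion documentation for the exact file requirements), a file adding a conversions metric to each URL might look like this:

url,conversions
https://www.example.com/shoes,120
https://www.example.com/boots,45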

Note: if this option is not already part of your plan, you will need to subscribe to it before activating it. Use the Intercom button below to request help from your sales representative.

Learn more about data ingestion.

Extra settings

The extra settings are hidden by default. To access them, click the "Show extra settings" toggle at the top of the page.

These settings provide technical solutions in cases where your site is not normally accessible to a standard crawler. 

Crawl JS

If your site is built with JavaScript and requires JavaScript rendering to be crawled, OnCrawl's crawler can do it.

Note: JavaScript crawls are more expensive than normal crawls and will cost 10 URLs from your quota for each URL crawled.
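
For example, a JavaScript crawl of 50,000 pages would consume 500,000 URLs from your quota.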

Learn more about JavaScript crawls.

HTTP headers

HTTP headers are name-value pairs provided as part of a request for a URL. Headers can be used, for example, to transmit cookies or session IDs stored by the browser.

If your site requires certain headers, you can add them to the HTTP headers provided by our crawler when it requests a URL.

Provide one header per line in the following format:

Header: value
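
For example, to send a session cookie with each request (the cookie name and value here are purely illustrative):

Cookie: sessionid=abc123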

DNS override

If you need to override the DNS, you can assign a different IP to a given host. This may be the case if you're crawling a site hosted on a pre-production server that still uses your usual domain name.
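
For example, you might map www.example.com to the pre-production server's IP address, such as 203.0.113.10, so that the crawler requests your usual URLs from the pre-production machine (illustrative values; enter them in the format indicated in the interface).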

Authentication

If you need to provide a login and password to access your site, select "Enable HTTP authentication". This is often used to protect a pre-production site.

You can provide:

  • Username (required): the name used to access the site
  • Password (required): the password used to access the site
  • Scheme (optional): choose the type of authentication used. Basic, Digest, and NTLM are supported (see the example after this list).
  • Realm (optional): indicate the protection space (realm) you are accessing with this authentication.
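
With the Basic scheme, for example, the credentials are sent base64-encoded in an Authorization header on each request:

Authorization: Basic dXNlcjpwYXNzd29yZA==

(The value above is the base64 encoding of "user:password".)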

Note: if this option is not already part of your plan, you will need to subscribe to it before activating it. Use the Intercom button below to request help from your sales representative.

Crawler IP Addresses

If you need to whitelist crawler IP addresses to allow them access to your site, you can check the "Use static IP addresses" box. This is useful when your site filters or blocks bots.

Checking "Use static IP addresses" will provide you with a list of four IP addresses that will be used to crawl your site. You will then need to whitelist these addresses.

Note: if this option is not already part of your plan, you will need to subscribe to it before activating it. Use the Intercom button below to request help from your sales representative.

Learn more about the IP addresses OnCrawl uses.

Cookies

If keeping cookies causes crawl issues on your site, you can disable them by unchecking the "Keep cookies returned by server" box here.

Going further

If you still have questions about setting up a crawl, drop us a line at @oncrawl_cs or click on the Intercom button at the bottom right of your screen to start a chat with us.

Happy crawling!
