Crawl settings

Set crawl parameters, save crawl settings as a profile to reuse later, create cross-data reports, and launch your crawl.


Oncrawl's crawler is designed to crawl real websites. We provide modifiable crawl parameters that will help adapt our crawler to your specific case.

Here's how to set up a crawl.

We'll cover where to find the crawl settings, crawl profiles, the crawl settings themselves, notifications for the end of a crawl, analysis options, and extra settings.

Where to find the crawl settings

From your project page, click on + Set up a new crawl.

This will take you to the crawl settings page. Your default crawl settings are displayed. You can either modify them, or create a new crawl configuration.

Crawl profiles

A crawl profile is a set of crawl settings saved under a name so that it can be reused later.

You can modify or verify the settings in a crawl profile by selecting the profile in the drop-down menu at the top of the page.

If you save changes while viewing a crawl profile, the new settings will overwrite the profile's previous settings.

To create a new crawl settings configuration and leave your current configurations as they are, click the + Create crawl profile button in the upper right.

Name: The name you give a crawl profile is used to identify it in the Oncrawl interface. We list the latest crawl profile on the project's card in the list of projects, and on the project page we list all of the crawls in the project along with the name of the crawl profile used. A good name might, for example, state what is being crawled, or the type of additional data included.

For example, you might want to have different types of crawl for different scopes, data types, or frequencies:

  • A "Rankings" crawl that includes GSC data

  • A "Blog without parameters" crawl of the subdomain https://blog.example.com that excludes query strings appended to the end of the URL

  • A "Daily sanity check" that checks key pages every day

  • A "Monthly full crawl (JS)" that crawls the full site every month with JS enabled

Copy config from: The new profile will always be based on an existing profile, but you can change all of the parameters once the profile has been created.

After creating the configuration and making the changes you want to the settings on this page, click Save to save the changes, or Save and launch crawl to launch a crawl right away.

Crawl settings

The Crawl settings section covers the basic settings required to launch a crawl: what you crawl, which crawler identity (user-agent) is used, and how fast the crawl runs.

Start URL

The Start URL parameter determines how the crawler will discover your URLs.

In Spider mode, the crawler will start at a given point and follow links on the pages it encounters until it reaches a stop condition. The start URL sets the URL (or URLs) that the crawler will start from. This determines the depth ranking of the pages on your site as Oncrawl maps out your site's architecture.

In URL list mode, the crawler will examine the URLs in a list you provide, but will not follow any links it finds on those pages.

In Sitemap mode, Oncrawl will use the URLs that appear in your site's sitemaps in order to create a list of URLs to crawl.

You can allow Oncrawl to discover sitemaps from a directory, subdomain, or URL; or you can provide the URLs of one or more sitemaps that you want to use.
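For example, you might let Oncrawl discover sitemaps for https://www.example.com, or list specific files such as https://www.example.com/sitemap_index.xml and https://www.example.com/news/sitemap.xml (hypothetical URLs, used for illustration only).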

Crawl limits

The crawl limits tell the crawler when to stop crawling in spider mode. (Consequently, this setting is not available when you've set the crawler to URL list mode.)

You can set:

  • The max URLs: the crawler will stop once it has discovered a given number of URLs, unless it has reached a different limit before then.

  • The max depth: each time the crawler follows a link, it advances one depth into the site architecture. The start URL or start URLs constitute depth 1, and all pages linked to from the start URL(s) have a depth of 2. If you set a depth of 3, the crawler will stop after discovering all of the pages at depth 3 (but not following their links), unless it has reached a different limit before then.

You can also modify the maximum depth of a crawl while the crawl is running. To do so, go to the Crawl Monitoring page and click on the Pause crawl drop-down button at the top right of the screen. Choose Change max depth.

Crawl bot (user-agent)

You need to verify the domain in the verify domain ownership section of the workspace settings if you want to use a full custom user-agent.

You can set:

  • Whether to use the default mobile or desktop bot, or whether to provide your own custom user-agent when crawling. This can be useful if you serve different pages to mobile visitors than to desktop visitors. Differences might include a lightweight style or a different page address.

  • When using the default bots, you can customize the user-agent by modifying the bot name in the user-agent. This can be used to test your site's behavior if you treat some bots differently from others, or if you need to identify the Oncrawl bot.

  • When using a full custom user-agent, you have full control over the bot identity declared by Oncrawl on your site. A custom user-agent allows you to monitor how your website renders when a bot with that user-agent accesses the site, or even helps you protect your site by only allowing bots with specific user-agents to access it. In this case, you'll need to provide two elements:

    • The bot name

    • The full user-agent string.
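As a purely illustrative example (both values below are hypothetical, not a required format), a full custom user-agent might look like this:

Bot name: MyCompanyBot
Full user-agent string: Mozilla/5.0 (compatible; MyCompanyBot/1.0; +https://www.example.com/bot.html)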

Max crawl speed

You will need to verify the domain in the verify domain ownership section of the workspace settings before you can crawl at a speed above 1 URL/second.

You can set:

  • The number of pages crawled per second.

To the right, you can enter a number of URLs to find the approximate time it will take to crawl your site at the chosen speed.
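For example, at a speed of 10 URLs per second, a hypothetical site of 900,000 URLs would take roughly 900,000 / 10 = 90,000 seconds, or about 25 hours, to crawl.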

The ideal speed is the speed your server and site architecture can handle best. Unlike Google, which spreads out its requests rather than querying your server continuously until it has discovered all of your pages, Oncrawl makes many requests one after another until the crawl is complete. If Oncrawl's crawl speed is too high, your server might not be able to keep up.

You can also modify the maximum crawl speed while the crawl is running. To do so, go to the Crawl Monitoring page and click on the Pause crawl drop-down button at the top right of the screen. Choose Change max speed.

Crawler behavior

This controls which URLs discovered by the Oncrawl bot will be added to the list of URLs to fetch and explore.

You can choose:

  • Follow links (href="")

  • Follow HTTP redirects (3xx)

  • Follow alternates (<link rel="alternate">)

  • Follow canonicals (<link rel="canonical">)

  • Ignore nofollow tags

  • Ignore noindex tags

The crawler will follow href, 3xx, alternate and canonical links by default if you don't make any changes. It will respect nofollow and noindex tags.
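As a purely illustrative reminder, these options refer to standard markup such as the following (hypothetical URLs):

<a href="https://www.example.com/category/page.html">A followed link</a>
<link rel="canonical" href="https://www.example.com/page.html">
<link rel="alternate" hreflang="fr" href="https://www.example.com/fr/page.html">
<a href="https://www.example.com/page.html" rel="nofollow">A nofollow link</a>
<meta name="robots" content="noindex">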

URL pattern filtering

You can enable filters that use Regex rules to target only certain URLs (include) or exclude only certain URLs (exclude), or a combination of both types, in order to crawl only a specific section of a website.

The crawler will only fetch and explore URLs that pass the filtering rules you set.
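For example (hypothetical patterns; adapt them to your own URL structure), you could combine an include rule that targets the blog section with an exclude rule that drops print versions of pages:

Include: ^https://www\.example\.com/blog/
Exclude: /print/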

URL with parameters

You can set:

  • Whether or not to crawl URLs with query strings following the main URL (parameters).

  • Whether or not to filter the parameters you want to crawl. You can then set the parameters to exclude ("Keep all parameters except for...") or to include ("Remove all parameters except for..."), if you don't want to crawl all parameters. List the parameters in the space provided with a comma and a space between them, like this:
    utm_source, utm_medium, utm_term

Crawling parameters can significantly increase the number of URLs you crawl. Note that, when you crawl parameters, Oncrawl (and Google) count each of the following as separate URLs:
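For example, the following hypothetical URLs differ only in their parameters, but each one counts as a separate URL:

https://www.example.com/page
https://www.example.com/page?utm_source=newsletter
https://www.example.com/page?utm_source=newsletter&utm_medium=email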

Subdomains

By default, if you start a crawl for https://www.example.com, Oncrawl will consider that any subdomains it encounters, such as https://store.example.com, are part of a different site.

You can override this behavior by checking the Crawl encountered subdomains box.

Sitemaps

If you provide a sitemap, Oncrawl will compare the contents of your sitemap to the contents and structure of your site as discovered by the crawler.

You can:

  • Disable the sitemap analysis and remove this dashboard from your report.

  • Specify sitemap URLs: provide the URL to any sitemaps you would like to use. You can use more than one.

  • Allow soft mode: Oncrawl usually follows the rules that apply to bots with regard to sitemaps. If you want Oncrawl to ignore these rules, choose this option. A common use is to ignore the rule that limits a sitemap's scope to the folder where it is stored. This allows sitemaps in https://www.example.com/sitemaps/ to apply to https://www.example.com and to https://www.example.com/news/.

Crawl JS

If your site is built with JavaScript and requires JavaScript to be crawled, Oncrawl's crawler can handle it.

Note: JavaScript crawls are more expensive than normal crawls and will cost 3 URLs from your quota for each URL crawled.

Notifications

The notification section consists of two tabs: Email and Webhook.

Any workspace member can receive an alert when a crawl ends and can also be informed of why the crawl ended.

Email notifications

Any individual workspace member, or all members, can receive an email report at the end of a crawl with this profile.

The email will state which crawl has ended, as well as the reason for the end of the crawl.

This list of addresses can be modified for any crawl profile at any time.

Webhook notifications

A webhook works like an HTTP callback link. In Oncrawl, it is possible to define the URL of a webhook in the crawl profile and to trigger specific actions at the end of a crawl.

When a crawl ends, Oncrawl will send an HTTP request (POST or GET) to this URL, with data that can then be used to trigger further actions, such as alerts via the Oncrawl API or Big Data exports. (A minimal example of a receiving endpoint is sketched at the end of this section.)

The data provided contains information related to Workspace_Id and Project_Id, as well as the reason for the end of the crawl:

  • Coded reason: quota_reached
    UI label: Monthly URL quota reached.
    Email label: The crawl ended because it reached the maximum number of URLs in the monthly quota.

  • Coded reason: max_url_reached
    UI label: Configured maximum number of URLs reached.
    Email label: The crawl ended because it reached the maximum number of URLs that was set in the crawl profile.

  • Coded reason: max_depth_reached
    UI label: Configured maximum depth reached.
    Email label: The crawl ended because it reached the maximum crawl depth that was set in the crawl profile.

  • Coded reason: user_cancelled
    UI label: Crawl canceled by workspace member.
    Email label: The crawl was cancelled by a member of the workspace.

  • Coded reason: user_requested
    UI label: Crawl ended early by workspace member.
    Email label: This crawl was ended early by a member of the workspace.

  • Coded reason: no_fetched_urls
    UI label: No URLs fetched.
    Email label: The crawler could not retrieve any URLs.

Oncrawl also provides reasons in the case of scheduled crawls that couldn't start:

  • Coded reason: invalid_configuration
    UI label: Cannot start a crawl (crawl profile contains an invalid element).
    Email label: The scheduled crawl's profile contains an invalid element.

  • Coded reason: already_crawling
    UI label: A crawl with the same profile was already running.
    Email label: The scheduled crawl did not launch because another crawl with the same profile was already running.

  • Coded reason: concurrent_crawl_quota_reached
    UI label: Too many crawls currently running.
    Email label: The scheduled crawl did not launch because other crawls were running at the same time. The workspace has reached its quota for concurrent crawls.

  • Coded reason: missing_ga_feature
    UI label: Cannot start a crawl with Google Analytics (feature is not available).
    Email label: The scheduled crawl did not launch because it includes the Google Analytics feature that is not available in the workspace subscription.

  • Coded reason: missing_majestic_feature
    UI label: Cannot start a crawl with Majestic (feature is not available).
    Email label: The scheduled crawl did not launch because it includes the Majestic feature that is not available in the workspace subscription.

  • Coded reason: missing_custom_fields_feature
    UI label: Cannot start a crawl with scraping (feature is not available).
    Email label: The scheduled crawl did not launch because it includes the scraping feature that is not available in the workspace subscription.

You can test whether your webhook URL works using a provided API endpoint.

For enhanced security, a secret (security code) can be set up so that the receiving application can verify the authenticity of the payload.
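If you need a starting point on the receiving side, here is a minimal sketch of a webhook endpoint, written in Python with only the standard library. The JSON field names used below (workspace_id, project_id, reason) are assumptions made for illustration; inspect the payload your webhook actually receives before relying on them.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class CrawlWebhookHandler(BaseHTTPRequestHandler):
    # Handles the POST variant; a GET variant would carry the same kind of
    # information in the query string instead of the request body.
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")

        # Illustrative field names -- adjust them to the actual payload.
        workspace_id = payload.get("workspace_id")
        project_id = payload.get("project_id")
        reason = payload.get("reason")

        print(f"Crawl ended for project {project_id} (workspace {workspace_id}): {reason}")

        # Trigger follow-up actions here, e.g. a call to the Oncrawl API or a
        # Big Data export, depending on the coded reason.
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), CrawlWebhookHandler).serve_forever()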

Analysis

In the Analysis section, indicate which third-party data should be used to create cross-analyses and added to the information available for each URL.

Note: your project needs to be an advanced project in order to enable additional analyses. You will need to convert your project to an advanced project before proceeding.

SEO impact report

This report provides cross-analysis with analytics data and log file data. Enable it by connecting your Google Analytics or Piano Analytics account here.

Learn more about the SEO impact report with Piano Analytics. You can also enable this report with Google Analytics.

If you use log monitoring and you don't need this report (for example, for highly targeted or quick, basic crawls), you can turn it off by unticking the Log Monitoring check box here. This will significantly speed up the analysis step when processing your crawl.

Ranking report

This report provides cross-analysis with data on SERP positions. Enable it by connecting your Google Search Console account here.

Backlink report

This report provides cross-analysis with data on external backlinks to your site. Enable it by connecting your Majestic account here.

Crawl over crawl

The crawl over crawl function compares two crawls of the same site or two sites with similar pages, such as a production and a staging site, or a desktop and a mobile site. Enable the cross-analysis with a similar crawl here.

Alternatively, you can add this analysis later from the project page by scheduling a crawl over crawl between two compatible crawls.

Scraping

Data scraping uses rules to harvest information in the source code of your web pages as they are crawled. This data is saved in custom fields and can be used in cross-analysis for custom metrics.

In this section, you will define the scraping rules to find the data for each of the custom fields you want to create.
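For example (a purely illustrative rule; the exact rule syntax is defined in the scraping interface), you could capture the author declared in a page's meta tags with an XPath expression such as //meta[@name='author']/@content and store it in a custom field named "author".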

Data ingestion

Data ingestion takes information provided in a CSV file and adds it to the information for each URL as it is crawled. Data from any source, such as exports from SEMrush, your CRM, or any other tool, can be provided in this format.

In this section, you will provide the file or files containing the additional information you want to include in your analysis.
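As a rough illustration only (the column names below are hypothetical, and the exact format requirements are described in the data ingestion documentation), such a file might pair each URL with the values you want to attach to it:

url,monthly_sales,product_category
https://www.example.com/product-1,1250,shoes
https://www.example.com/product-2,430,hats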

Export to

Oncrawl can automatically export crawl results to Google Looker Studio when a crawl finishes.

Extra settings

The extra settings are hidden by default. To access them, click the Extra settings toggle at the top of the page.

These settings provide technical solutions in cases where your site is not normally accessible to a standard crawler. 

Virtual robots.txt

You will need to verify the domain in the verify domain ownership section of the workspace settings before you can override a website's robots.txt.

If you don't want to follow the instructions in your robots.txt file, you can provide a virtual, temporary version for the Oncrawl bot only. This will not modify or suspend your actual robots.txt file.

You can:

  • Enable virtual robots.txt and provide the contents of the virtual file that will apply only for the Oncrawl bot.
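A virtual robots.txt uses the same syntax as a normal robots.txt file. As a hypothetical illustration, the following contents would let the Oncrawl bot crawl everything except a /private/ folder:

User-agent: *
Disallow: /private/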

HTTP headers

HTTP headers are properties and their values that are provided as part of a request for a URL. Headers can be used, for example, to transmit cookies or session IDs stored by the browser.

If your site requires certain headers, you can add them to the HTTP headers provided by our crawler when it requests a URL.

Provide one header per line in the following format:

Header: value

You can use this to pre-set cookies. Enter Cookie: CookieName=CookieValue in the text box. Replace CookieName with the name of your cookie and CookieValue with the value you want to use.

Note that you can only set one value per type of HTTP header. If you need to set multiple cookies, use the following format:

Cookie: Cookie1_Name=Cookie1_Value; Cookie2_Name=Cookie2_Value; etc.

DNS override

If you need to override the DNS, you can assign a different IP to a given host. This may be the case if you're crawling a site hosted on a pre-production server that still uses your usual domain name.

DNS overrides are not compatible with JS crawling.
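The exact input format is set in the interface, but as an illustration, an override might map your production hostname to the IP address of a pre-production server: for example, www.example.com to 203.0.113.10 (a documentation-only IP address).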

Authentication

If you need to provide a login and password to access your site, select "Enable HTTP authentication". This is often used to protect a pre-production site.

You can provide:

  • Username (required): the name used to access the site

  • Password (required): the password used to access the site

  • Scheme (optional): choose the type of authentication used. Basic, Digest, and NTLM are supported.

  • Realm (optional): indicate the protection space (realm) you are accessing with this authentication.
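For reference, with the Basic scheme the crawler sends these credentials in a standard Authorization header, for example Authorization: Basic dXNlcjpwYXNzd29yZA== (the Base64 encoding of user:password).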

Cross-analysis range

By default, Oncrawl collects 45 days of data from sources in the Analysis section, if they are enabled. You can change that period here.

Changes made here apply to all data sources in this crawl profile, and may make it difficult to compare values from previous crawls using this profile with values from new ones.
