Oncrawl's crawler is designed to crawl real websites. We provide modifiable crawl parameters that will help adapt our crawler to your specific case.
Here's how to set up a crawl.
We'll cover the following points:
Where to find the crawl settings: where to go in the interface to set up a crawl
Crawl profiles: how to create and reuse crawl profiles
Where to find the crawl settings
From your project page, click on "+ Set up new crawl".
This will take you to the crawl settings page. Your default crawl settings are displayed. You can either modify them, or create a new crawl configuration.
A crawl profile is a set of settings that has been saved with a name to be used later.
You can modify or verify the settings in a crawl profile by selecting the profile in the drop-down menu at the top of the page.
If you save changes when viewing a crawl settings configuration, this will overwrite the old settings for the configuration.
To create a new crawl settings configuration and leave your current configurations as they are, click the blue "+ Create Crawl Profile" button in the upper right.
Name: The name you give a crawl profile will be used to identify it in the Oncrawl interface. We list the latest crawl profile on the project's card in the list of projects, and on the project page we list all of the crawls in the project, along with the name of the crawl profile used. A good name might, for example, state what is being crawled, or the type of additional data included. For example, you might have a "Rankings" crawl, or a "Blog without parameters" crawl for the subdomain https://blog.example.com that excludes query strings appended to the end of the URL.
Copy config from: The new profile will always be based on an existing profile, but you can change all of the parameters once the profile has been created.
After creating the configuration and making the changes you want to the settings on this page, click "Save" to save the changes, or "Save and launch crawl" to launch a crawl right away.
The "Crawl settings" section covers basic settings required to be able to launch a crawl. This defines what you crawl, with what crawler description, and how fast.
The Start URL parameter determines how the crawler will discover your URLs.
In Spider mode, the crawler starts at a given point and follows links on pages it encounters until it reaches a stop condition. The start URL sets the URL (or URLs) that the crawler will start from. This will determine the depth ranking of the pages in your site as Oncrawl maps out your site's architecture.
In URL list mode, the crawler will examine the URLs in a list you provide, but will not follow any links it finds on those pages.
The crawl limits tell the crawler when to stop crawling when in spider mode. (Consequently, this setting is not available when you've set the crawler to URL list mode.)
You can set:
The max URLs: the crawler will stop once it has discovered a given number of URLs, unless it has reached a different limit before then.
The max depth: each time the crawler follows a link, it advances one depth into the site architecture. The start URL or start URLs constitute depth 1, and all pages linked to from the start URL(s) have a depth of 2. If you set a depth of 3, the crawler will stop after discovering all of the pages at depth 3 (but not following their links), unless it has reached a different limit before then.
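The way the two limits interact can be sketched as a breadth-first traversal. This is a simplified illustration, not Oncrawl's actual implementation; `get_links` stands in for a hypothetical link extractor:

```python
from collections import deque

def crawl(start_urls, get_links, max_urls=100, max_depth=3):
    """Breadth-first crawl honoring both a URL limit and a depth limit."""
    seen = set(start_urls)
    queue = deque((url, 1) for url in start_urls)  # start URLs are depth 1
    crawled = []
    while queue and len(crawled) < max_urls:
        url, depth = queue.popleft()
        crawled.append((url, depth))
        if depth >= max_depth:
            continue  # pages at max depth are fetched, but their links are not followed
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return crawled
```

With a max depth of 2, a page at depth 2 is still crawled, but the links it contains are not queued.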
You can also modify the maximum depth of a crawl while the crawl is running. To do so, go to the Crawl Monitoring and click on the blue "Pause crawl" drop-down button at the top right of the screen. Choose "Change max depth."
You can set:
Whether to use a mobile or desktop bot when crawling. This can be useful if you serve different pages to mobile visitors than to desktop visitors. Differences might include a lightweight style or a different page address (such as https://m.example.com or https://www.example.com/mobile).
The bot name in the User-Agent. This can be used to test your site's behavior if you treat some bots differently than others, or if you need to identify the Oncrawl bot.
The full user-agent for your chosen bot configuration is provided on the right.
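You can also test how your server responds to a given bot name yourself. This sketch uses Python's standard library; the user-agent string here is illustrative, not Oncrawl's exact string, so substitute the full string shown in the interface:

```python
from urllib.request import Request, urlopen

# Illustrative user-agent string; copy the real one from the crawl settings page.
req = Request(
    "https://www.example.com/",
    headers={"User-Agent": "Mozilla/5.0 (compatible; OncrawlBot/1.0)"},
)
# Uncomment to send the request and inspect the response status:
# with urlopen(req) as resp:
#     print(resp.status)
```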
Max crawl speed
You can set:
The number of pages crawled per second.
To the right, you can enter a number of URLs to find the approximate time it will take to crawl your site at the chosen speed.
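The estimate is straightforward arithmetic: the number of URLs divided by the crawl speed gives a lower bound on crawl time. A minimal sketch:

```python
def estimated_crawl_time(url_count, pages_per_second):
    """Rough lower bound on crawl duration, ignoring server response time."""
    seconds = url_count / pages_per_second
    hours, remainder = divmod(int(seconds), 3600)
    minutes = remainder // 60
    return f"{hours}h {minutes:02d}min"

print(estimated_crawl_time(100_000, 10))  # 100,000 URLs at 10 URLs/s -> 2h 46min
```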
The ideal speed is the fastest speed your server and site architecture can handle comfortably. Unlike Google, which spreads its requests out over time, Oncrawl makes many requests in quick succession until it has discovered all of your pages. If Oncrawl's crawl speed is too high, your server might not be able to keep up.
To protect your site, we ask you to verify ownership of your site by linking a Google account before we let you increase the crawl speed beyond 10 URLs/second.
You also can modify the maximum crawl speed while the crawl is running. To do so, go to the Crawl Monitoring and click on the blue "Pause crawl" drop-down button at the top right of the screen. Choose "Change max speed."
This controls which URLs discovered by the Oncrawl bot will be added to the list of URLs to fetch and explore.
You can choose:
Follow links (href="")
Follow HTTP redirects (3xx)
Follow alternates (<link rel="alternate">)
Follow canonicals (<link rel="canonical">)
Ignore nofollow tags
Ignore noindex tags
The crawler will follow href, 3xx, alternate and canonical links by default if you don't make any changes. It will respect nofollow and noindex tags.
URL with parameters
You can set:
Whether or not to crawl URLs with query strings following the main URL (parameters).
Whether or not to filter the parameters you want to crawl. You can then set the parameters to exclude ("Keep all parameters except for...") or to include ("Remove all parameters except for..."), if you don't want to crawl all parameters. List the parameters in the space provided with spaces between them.
Crawling parameters can significantly increase the number of URLs you crawl. Note that, when you crawl parameters, Oncrawl (and Google) count every variation of the query string as a separate URL: for example, https://www.example.com/page, https://www.example.com/page?color=red and https://www.example.com/page?color=red&size=m each count as a separate URL.
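The two filter modes can be sketched with Python's standard library. This is an illustration of the behavior, not Oncrawl's implementation:

```python
from urllib.parse import urlencode, urlsplit, urlunsplit, parse_qsl

def filter_params(url, keep=None, remove=None):
    """Keep or drop query-string parameters, mirroring the two filter modes.

    keep:   "Remove all parameters except for..." -> only these survive
    remove: "Keep all parameters except for..."   -> these are dropped
    """
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if keep is not None:
        pairs = [(k, v) for k, v in pairs if k in keep]
    if remove is not None:
        pairs = [(k, v) for k, v in pairs if k not in remove]
    return urlunsplit(parts._replace(query=urlencode(pairs)))

# "Keep all parameters except for..." with utm_source listed:
print(filter_params("https://www.example.com/p?color=red&utm_source=x",
                    remove={"utm_source"}))
# https://www.example.com/p?color=red
```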
By default, if you start a crawl for https://www.example.com, Oncrawl will consider that any subdomains it encounters, such as https://store.example.com, are part of a different site.
You can override this behavior by checking the "Crawl encountered subdomains" box.
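The distinction can be sketched as a simple host comparison. This is illustrative; Oncrawl's own matching logic may differ:

```python
from urllib.parse import urlsplit

def is_subdomain_of(url, domain):
    """True when the URL's host is the given domain or one of its subdomains."""
    host = urlsplit(url).hostname or ""
    return host == domain or host.endswith("." + domain)

print(is_subdomain_of("https://store.example.com/cart", "example.com"))   # True
print(is_subdomain_of("https://www.another-site.com/", "example.com"))    # False
```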
If you provide a sitemap, Oncrawl will compare the contents of your sitemap to the contents and structure of your site as discovered by the crawler.
Disable the sitemap analysis and remove this dashboard from your report.
Specify sitemap URLs: provide the URL to any sitemaps you would like to use. You can use more than one.
Allow soft mode: Oncrawl usually follows rules for bots with regards to sitemaps. If you want to ignore these rules, choose this option. A common use is to ignore the rule that makes a sitemap apply to the folder where it is stored. This allows sitemaps in https://www.example.com/sitemaps/ to apply to https://www.example.com and to https://www.example.com/news/
If you don't want to follow the instructions in your robots.txt file, you can provide a virtual, temporary version for the Oncrawl bot only. This will not modify or suspend your actual robots.txt file.
Enable virtual robots.txt and provide the contents of the virtual file that will apply only for the Oncrawl bot.
Note: you will need to have validated your project before you can override the robots.txt in place for this website.
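For example, a virtual robots.txt that lets the bot crawl everything except a staging folder might look like this (an illustrative snippet; adapt the paths to your site):

```
User-agent: *
Disallow: /staging/
Allow: /
```

Remember that this file applies only to the Oncrawl bot for the duration of the crawl; your live robots.txt is untouched.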
In the "Analysis" section, indicate which third-party data should be used to create cross-analyses and added to the information available for each URL.
Note: your project needs to be an advanced project in order to enable additional analyses. You will need to convert your project to an advanced project before proceeding.
SEO impact report
This report provides cross-analysis with analytics data and log file data. Enable it by connecting your Google Analytics or Piano Analytics account here.
Learn more about the SEO impact report with Piano Analytics. You can also enable this report with Google Analytics.
If you use log monitoring and you don't need this report (for example, for highly targeted or quick, basic crawls), you can turn it off by unticking the Log Monitoring check box here. This will significantly speed up the analysis step when processing your crawl.
This report provides cross-analysis with data on SERP positions. Enable it by connecting your Google Search Console account here.
This report provides cross-analysis with data on external backlinks to your site. Enable it by connecting your Majestic account here.
Crawl over crawl
The crawl over crawl function compares two crawls of the same site or two sites with similar pages, such as a production and a staging site, or a desktop and a mobile site. Enable the cross-analysis with a similar crawl here.
Alternatively, you can add this analysis later from the project page by scheduling a crawl over crawl between two compatible crawls.
Data scraping uses rules to harvest information in the source code of your web pages as they are crawled. This data is saved in custom fields and can be used in cross-analysis for custom metrics.
In this section, you will define the scraping rules to find the data for each of the custom fields you want to create.
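As a simplified analogy for how a scraping rule maps page source code to a custom field (Oncrawl's rules have their own syntax; this Python regex is only an illustration):

```python
import re

def scrape_field(html, pattern):
    """Apply a regex scraping rule to a page's source and return the captured value."""
    match = re.search(pattern, html)
    return match.group(1) if match else None

html = '<meta name="author" content="Jane Doe">'
print(scrape_field(html, r'<meta name="author" content="([^"]+)"'))  # Jane Doe
```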
Data ingestion absorbs information provided in a CSV file and adds it to the information for each URL as it is crawled. Data from any source, such as exports from SEMrush, your CRM, or any other tool, can be provided in this format.
In this section, you will provide the file or files containing the additional information you want to include in your analysis.
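Conceptually, ingestion joins each CSV row to the crawled URL it describes. A minimal sketch; the column names here are assumptions, so refer to the documented CSV format for the real requirements:

```python
import csv
import io

def load_ingestion(csv_text):
    """Index ingested rows by URL so each crawled URL can pick up its extra fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["url"]: row for row in reader}

data = load_ingestion("url,monthly_visits\nhttps://www.example.com/,1200\n")
print(data["https://www.example.com/"]["monthly_visits"])  # 1200
```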
The extra settings are hidden by default. To access them, click the "Show extra settings" toggle at the top of the page.
These settings provide technical solutions in cases where your site is not normally accessible to a standard crawler.
HTTP headers are properties and their values that are provided as part of a request for a URL. Headers can be used, for example, to transmit cookies or session IDs stored by the browser.
If your site requires certain headers, you can add them to the HTTP headers provided by our crawler when it requests a URL.
Provide one header per line, in the format Header-Name: Header-Value.
You can use this to pre-set cookies. Enter Cookie: CookieName=CookieValue in the text box. Replace CookieName with the name of your cookie and CookieValue with the value you want to use.
Note that you can only set one value per type of HTTP header. If you need to set multiple cookies, use the following format:
Cookie: Cookie1_Name=Cookie1_Value; Cookie2_Name=Cookie2_Value; etc.
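The multi-cookie format above is simply the standard Cookie header syntax: name=value pairs joined with "; ". A sketch of how such a header line is built:

```python
def cookie_header(cookies):
    """Join cookies into a single Cookie header line, per standard Cookie syntax."""
    pairs = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return "Cookie: " + pairs

print(cookie_header({"sessionid": "abc123", "lang": "en"}))
# Cookie: sessionid=abc123; lang=en
```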
If you need to override the DNS, you can assign a different IP to a given host. This may be the case if you're crawling a site hosted on a pre-production server that still uses your usual domain name.
DNS overrides are not compatible with JS crawling.
If you need to provide a login and password to access your site, select "Enable HTTP authentication". This is often used to protect a pre-production site.
You can provide:
Username (required): the name used to access the site
Password (required): the password used to access the site
Scheme (optional): choose the type of authentication used. Basic, Digest, and NTLM are supported.
Realm (optional): indicate the structure you are accessing with this authentication.
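With the Basic scheme, the username and password are sent base64-encoded in an Authorization header. You can reproduce the header a client would send with the standard library (illustrative credentials):

```python
import base64

def basic_auth_header(username, password):
    """Build the Authorization header used by HTTP Basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Authorization: Basic {token}"

print(basic_auth_header("user", "pass"))
# Authorization: Basic dXNlcjpwYXNz
```

Note that Basic credentials are only encoded, not encrypted, which is one reason pre-production sites protected this way should also use HTTPS.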
Crawler IP Addresses
If you need to whitelist crawler IP addresses to allow the crawler access to your site, you can find the list of IP addresses that will be used to crawl your site here.