You've set up a crawl, but it won't start. Or it will start, but it keeps getting cancelled after a URL or two. Or it runs, but you only have one URL in the results. Most of the time, this is due to one of a few common issues.

If you recognize your project in one of the common issues below, you'll be back to crawling in just a few minutes. If not, let us know and we'll get you up and running again as quickly as we can.

Most common issues

Other common issues: What do your results look like?

Common crawl settings issues

1

Are your crawl settings too limiting?

This might be the case if you have one URL in your results and this URL has a 200 status code.

It is possible to choose crawl settings that don't allow the OnCrawl bot to explore beyond the start URL. This might be the case, for example, when you set the max depth to 1, telling the crawler to stop after visiting all of the pages it can find without clicking on a link (that is, all pages at a depth of 1). We don't want to prevent you from using these settings, since sometimes this is intentional. However, if you're trying to crawl your whole site, you should choose other options.

Here are a few examples of crawl settings that will significantly limit your crawl:

  • Start URL: crawl in mode "List of URLs" with only one or two URLs listed. In mode "List of URLs", the bot explores the URLs on the list but does not follow any links.
  • Crawl limits: Max depth set to 1.
  • Virtual robots.txt: "Enable virtual robots.txt" not checked, and the site's robots.txt file disallows robots.
  • Virtual robots.txt: "Enable virtual robots.txt" checked, but the virtual robots.txt file disallows robots.

How to fix it: modify the crawl settings to let the bot discover the rest of the site.

Common issues when the "Launch crawl" button won't work

This section will help if you can't click the "Launch crawl" button because it's grayed out. If you don't see your issue here, check the next section on issues for crawls that won't start.

2

Is there a problem with your crawl settings?

If one of the crawl settings in your crawl profile can't be used as-is, you won't be able to launch a crawl. You might need to scroll down to see which section has an error.

How to fix it: Click on the name of the section to unfold it and resolve the error.

Common issues for crawls that won't start

This section is for you if your crawl won't start, or if your scheduled crawl was cancelled.

3

Have you already used your monthly quota of URLs?

Often your crawl won't start if you've reached your quota of URLs for the month. Your URL quota depends on your plan and can be found on your account page.

To check where you are for this month and the date when your quota will be reset, click on your account name at the top right of the screen and choose "Account".

The Quota block can be found to the right of the Subscription information.

How to fix it: contact us to purchase additional credits for this month, or to upgrade your plan. If that's not an option, you can also wait and launch the crawl again when your credits have been reset for a new month.

4

Have you reached your limit of concurrent crawls?

The number of simultaneous crawls is limited and depends on your plan. If you already have crawls running, you may have reached your limit. If this is the case, you won't be able to launch a crawl while another crawl is running (either crawling or analyzing).

This also applies to scheduled crawls. If a crawl is running when the scheduled crawl is supposed to start, the scheduled crawl will be cancelled.

You'll see this information the next time you log in: there will be a new "Cancelled crawls" tab in the crawl monitoring section on your project homepage, with a red alert listing the number of new cancelled crawls.

How to fix it: wait until the current crawl is finished (or pause or cancel it). When the current crawl is no longer running, you can start the new crawl.

5

Does your start URL use the correct protocol (http:// or https://)?

Your website might use either the http:// or the https:// protocol. The start URL that you provide must match the protocol required by your website.

How to fix it: check which protocol your website uses. Then, enter the correct start URL in the "Start URL" section of the crawl settings.
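If you're not sure which protocol your site uses, one way to check is to build both variants of the start URL and open each one in your browser. The sketch below only rewrites the scheme; the function names `with_scheme` and `protocol_variants` are hypothetical and not part of OnCrawl.

```python
from urllib.parse import urlsplit, urlunsplit


def with_scheme(url: str, scheme: str) -> str:
    """Return the same URL with its scheme replaced ("http" or "https")."""
    parts = urlsplit(url)
    return urlunsplit((scheme, parts.netloc, parts.path, parts.query, parts.fragment))


def protocol_variants(url: str) -> list[str]:
    """Return the http:// and https:// versions of a start URL, in that order."""
    return [with_scheme(url, "http"), with_scheme(url, "https")]


for variant in protocol_variants("http://www.mysite.com/my-start-directory/"):
    print(variant)
```

Whichever variant loads without being redirected to the other is most likely the one to enter as your start URL.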

6

Does your start URL time out or not reply?

OnCrawl will test your start URL before running a crawl. If the request to access the URL times out, if the page can't be found, or if the server does not reply, the crawl will not run.

This can also occur if your site blocks crawls by bots including OnCrawl.

We try to let you know if there's an issue before you save your crawl settings, but sometimes a change occurs after you save your settings and before you launch a crawl.

How to fix it: make sure that your start URL's status is 200 or 3xx (and correct it if it isn't). In the case of server-related errors (5xx), pick a time and conditions for your crawl when the server has the best chance of responding. A response from the start URL is required to launch a crawl.

If your site blocks crawls by bots, you will need to authorize OnCrawl to crawl your site. Depending on how the restriction is set up on your site, you may need to check your meta robots instructions, use a virtual robots.txt, or whitelist OnCrawl in your host's HTTP settings.
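The status rule above (200 or 3xx lets the crawl launch) can be sketched as a small check. This is only an illustration of the rule as stated in this article, not OnCrawl's actual implementation; the function names and the rough grouping of error causes are assumptions.

```python
def start_url_can_launch(status_code: int) -> bool:
    """True when the start URL's status is 200 or a 3xx redirect."""
    return status_code == 200 or 300 <= status_code <= 399


def likely_cause(status_code: int) -> str:
    """Rough, assumed grouping of statuses to the sections of this article."""
    if start_url_can_launch(status_code):
        return "ok"
    if status_code in (401, 403):
        return "bot or visitor authorization"
    if 400 <= status_code <= 499:
        return "page not found or other client error"
    if 500 <= status_code <= 599:
        return "server error"
    return "unexpected status"


for code in (200, 301, 401, 404, 503):
    print(code, start_url_can_launch(code), likely_cause(code))
```

You can read the status code of your start URL in your browser's developer tools (network tab) and compare it against this rule.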

Common issues that produce incomplete results

Keep reading if you have empty graphs and a few (but not all) pages in your results.

7

Are all of your pages non-canonical?

Non-canonical pages tell bots to visit a different page and not to take the non-canonical page into account. If this is the case, your crawl will run, but you might only have one page in your results, and your graphs will all be empty.

To check this, head over to the Data Explorer (under the "Tools" tab in the analysis sidebar) and use the OnCrawl Query Language to search for:

Canonical evaluation - is - not matching

(You can also find this information by clicking on the "Canonicalized pages" chart in the Crawl Report Summary.)

This will find all of the pages for which the canonical URL is different than the URL the bot requested.

If you have many results for this search and few (or no) results for the searches "Canonical evaluation - is - not set" (pages that don't declare a canonical URL) and "Canonical evaluation - is - matching" (pages that declare themselves as the canonical URL), you've found the problem.

How to fix it: Good news: you've just used OnCrawl to uncover a major issue on your website that's affecting Google as well as our bot. You'll want to make sure that pages with non-duplicate content are canonical. Remove the unnecessary canonical links that point away from the page where they appear. Then, run another crawl.
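To spot-check a single page outside of OnCrawl, you can compare the canonical URL declared in its HTML with the URL you requested. The sketch below is a simplified approximation that uses an exact string comparison; OnCrawl's own "Canonical evaluation" may normalize URLs differently, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class CanonicalParser(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def canonical_evaluation(requested_url: str, html: str) -> str:
    """Return "not set", "matching" or "not matching" for one page."""
    parser = CanonicalParser()
    parser.feed(html)
    if not parser.canonical:
        return "not set"
    return "matching" if parser.canonical == requested_url else "not matching"


page = '<html><head><link rel="canonical" href="https://www.mysite.com/"></head></html>'
print(canonical_evaluation("https://www.mysite.com/page-a/", page))
```

A page that reports "not matching" here is pointing bots to a different URL, which is the situation described in this section.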

8

Are all of your pages non-indexable?

Non-indexable pages tell bots to ignore them. If this is the case, your crawl will run, but you might only have one page in your results.

To check whether this is what happened, head over to the Data Explorer (under the "Tools" tab in the analysis sidebar) and use the OnCrawl Query Language to search for:

Is indexable - is - true

If all (or most) of your pages are non-indexable, you'll have no (or only a few) results.

Now change the OnCrawl Query Language search to:

Is indexable - is - false

If you now have a long list of pages, you've found your problem.

How to fix it: Good news: you've just used OnCrawl to uncover a major issue on your website that's affecting Google as well as our bot. You'll want to make your pages indexable. We can help you figure out what you need to change for each page.

Using the results you obtained for the "Is indexable - is - false" search, you can either click on the URL for a detailed analysis, or add columns to view possible reasons that a page might not be indexed. This might include columns for:

  • Meta robots: forbidding bots in the meta declarations in the page HTML will prevent a page from being indexed 
  • Denied by robots.txt: forbidding bots via the robots.txt file will prevent a page from being indexed
  • Canonical evaluation: a non-matching canonical declaration will prevent a page from being indexed
  • Status code: pages with 5xx, 4xx and 3xx status codes are not crawled, and therefore not indexed

Once you've made these changes, run another crawl.

Common redirection issues

Your crawl has only one result, and its status code is 3xx? Keep reading.

9

Does your start URL redirect to another URL?

When you redirect your start URL to a different page, it sometimes happens that the redirect location returns an error. When this is the case, the crawl stops here.

You might suspect this is the case if your start URL is redirected. (You can check in the "Start URL" section of the crawl settings: we'll tell you if your Start URL is redirected.)

How to fix it: correct the error on the redirected page or correct the redirection from the start URL. If neither of these is possible, try starting from a different URL.

10

Does your start URL redirect to itself?

It's possible to redirect a start URL to itself, for example, to initialize a cookie.

Since OnCrawl only crawls each known page once, it won't return to crawl the page with the required cookie. Consequently, it can't advance.

How to fix it: You can add the content of the expected cookie in the crawl settings, under "HTTP headers". This will allow you to bypass the first redirection in order to access the rest of the site. (Make sure the extra settings are visible.)
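For reference, a cookie sent by a client travels in a standard `Cookie` HTTP header made of `name=value` pairs separated by `; `. The sketch below builds such a header; the cookie name `session_init` is made up for illustration, and the exact format expected by the "HTTP headers" field is not described here, so treat this as an assumption to confirm.

```python
def cookie_header(cookies: dict[str, str]) -> tuple[str, str]:
    """Build a (header name, header value) pair for a standard Cookie header."""
    value = "; ".join(f"{name}={val}" for name, val in cookies.items())
    return ("Cookie", value)


# "session_init" is a hypothetical cookie name; use the one your site sets.
name, value = cookie_header({"session_init": "1"})
print(f"{name}: {value}")
```

You can find the real cookie name and value in your browser's developer tools after loading the start URL once.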

Common error issues / bot authorization issues

If your start URL returns an error status code, it's likely an issue related to what bots are allowed to access on your website. If that's the case, this is the section for you.

11

Do your page tags, your robots.txt or your virtual robots.txt disallow robots?

If your Start URL or your entire site forbids access to bots, OnCrawl's bot won't be able to crawl it.

Alternatively, sometimes a robots.txt file is set up to disallow all robots. If this is the case, you should find a line in your robots.txt file such as:

Disallow: /

or

Disallow: /my-start-directory/  

(The latter is a problem when your start URL is https://www.mysite.com/my-start-directory/.)

If you're already using a virtual robots.txt file, your virtual file might disallow robots in the same manner.

How to fix it: Check for a <meta name="robots" content="noindex"> or <meta name="robots" content="nofollow"> tag on your Start URL. If one is present, but you want OnCrawl (and Google) to crawl the page and follow the links on it, you'll need to remove it.
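If you'd rather check the page source programmatically, the following sketch lists the directives found in a page's robots meta tag. It assumes you have already saved the HTML of your start URL; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, nofollow".
        for directive in content.split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())


def meta_robots_directives(html: str) -> set[str]:
    """Return the set of robots directives declared in the page's meta tags."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(sorted(meta_robots_directives(page)))
```

If the result contains `noindex` or `nofollow` for your start URL, that tag is a likely reason the bot can't get past the first page.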

If the issue comes from a robots.txt file, you can set up a virtual robots.txt file for OnCrawl's bot and remove the lines that disallow robots. If you need help doing this, take a look at the help article on the subject.
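Before launching a new crawl, you can test a robots.txt (or a draft of your virtual robots.txt) against your start URL. The sketch below uses Python's standard `urllib.robotparser`; its interpretation of the rules may differ in edge cases from that of OnCrawl's bot, and the user agent name `ExampleBot` is a placeholder, not the real bot name.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under these robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules matching the problematic case described above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /my-start-directory/
"""

start_url = "https://www.mysite.com/my-start-directory/"
print(is_allowed(ROBOTS_TXT, "ExampleBot", start_url))
```

If this prints `False` for your start URL, remove or narrow the matching `Disallow` line in the virtual robots.txt.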

12

Are you getting a "Your start url requires HTTP authentication (HTTP status code: 401 - Unauthorized)." error?

If your website uses an aggressive protection to forbid crawls from external crawlers, it could be blocking the OnCrawl bot. This will often return the following error:

Your start url requires HTTP authentication (HTTP status code: 401 - Unauthorized).

How to fix it: There are two effective ways to solve this problem. You will have to ask your IT department or your webmaster for help.

  1. Configure the crawl settings to use a bot with a specific user agent. You can do this by changing the bot name in the "Crawl bot" section of the crawl settings. Provide your IT manager with the full user agent listed on the right and ask them to allow the user agent you configured.
  2. Configure the crawl settings to use static IP addresses. You can force this in the crawl settings, under "Crawler IP Addresses". (Make sure the extra settings are visible.) Ask your website administrator to whitelist these IP addresses. This paid option is not included by default. Please contact us by clicking on the Intercom button at the bottom right of this page. Let us know you want to get in touch with your sales representative in order to discuss adding it to your plan.

13

Has your website blacklisted any IPs? / Does your website require IPs to be whitelisted?

If your website uses a data shield with an aggressive whitelist/blacklist policy, it might be blocking the IPs used by the OnCrawl bot to crawl your site.

How to fix it: You can force the OnCrawl bot to use static IPs in the crawl settings, under "Crawler IP Addresses". (Make sure the extra settings are visible.) You will need to ask the administrator of your website to whitelist the list of IPs used by the OnCrawl bot.

This paid option is not included by default. Please contact us by clicking on the Intercom button at the bottom right of this page in order to discuss adding it to your plan.

14

Does your website have a paywall or other security that treats bots as unauthorized visitors?

Paywalls and other security measures can sometimes block bots as well as unauthorized visitors. If the OnCrawl bot is blocked, it cannot carry out the crawl.

How to fix it: the fix for this problem will depend on the security measure that is in place on your site. Here are a few possibilities:

  • Your site checks for a certain cookie in order to authorize visitors. You can add the content of the cookie in the crawl settings, under "HTTP headers". (Make sure the extra settings are visible.)
  • Your site uses cookies to limit the number of pages a visitor can view. You can disable cookies in the crawl settings, under "Cookies". (Make sure the extra settings are visible.)
  • Your site requires all visitors to log in. You can enable HTTP authentication in the crawl settings, under "Authentication". (Make sure the extra settings are visible.) You will need to provide valid login information for the bot.
  • Your site requires all visitors to come from an authorized address. You can force the OnCrawl bot to use static IPs in the crawl settings, under "Crawler IP Addresses". (Make sure the extra settings are visible.) You will need to ask the administrator of your website to whitelist the list of IPs used by the OnCrawl bot.
    This paid option is not included by default. Please contact us by clicking on the Intercom button at the bottom right of this page in order to discuss adding it to your plan.

Issues related to JavaScript

15

Is your website rendered in JS (JavaScript)?

If your website is built using JavaScript, you may need to pre-render pages for a bot to discover the content of the page. If the bot cannot see the page content, it can't see any links, either.

You can test this yourself by disabling JavaScript temporarily in your browser.

How to fix it: Set up your crawl to treat the website as a JavaScript website. Keep in mind that JavaScript crawls use more resources, including more URLs in your quota: a JavaScript crawl uses 10 URLs from your quota for every 1 URL crawled. If you need help crawling your site with JavaScript, take a look at the help article on the subject.
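To estimate the impact on your quota before launching, the 10-to-1 ratio above can be worked out as follows. This is simple arithmetic based on the ratio stated in this article; it assumes a standard crawl counts 1 quota URL per URL crawled, and the function names are hypothetical.

```python
JS_QUOTA_MULTIPLIER = 10  # quota URLs used per URL crawled with JavaScript


def quota_used(urls_crawled: int, javascript: bool) -> int:
    """Quota consumed by a crawl of `urls_crawled` pages."""
    return urls_crawled * (JS_QUOTA_MULTIPLIER if javascript else 1)


def max_urls_crawlable(remaining_quota: int, javascript: bool) -> int:
    """Largest number of pages a crawl can cover with the remaining quota."""
    return remaining_quota // (JS_QUOTA_MULTIPLIER if javascript else 1)


print(quota_used(5_000, javascript=True))
print(max_urls_crawlable(100_000, javascript=True))
```

For example, a 5,000-page JavaScript crawl would use 50,000 URLs of quota, so check your remaining quota on the account page first.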

Other problems

If you're still having problems with getting your crawl to run correctly, drop us a line at @oncrawl_cs or click on the Intercom button at the bottom right of your screen to start a chat with us. We're here to help get you unstuck!
