How does the Oncrawl bot find and crawl pages?

What does Oncrawl's bot do to find and crawl pages? In this article, we explain how it works.


In the source code provided for a URL, there may be many references to other URLs. Below we'll discuss which references our crawler regards as links and how they are treated.

What is the identity of the Oncrawl bot?

Oncrawl uses two bots:

  • Desktop: Mozilla/5.0 (compatible; Oncrawl/1.0; +http://www.oncrawl.com/) 

  • Mobile: Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Oncrawl/1.0; +http://www.oncrawl.com/) 

The full user agent for both bots can be modified, but the bot identifies itself as Oncrawl by default.
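For example, if you want to spot Oncrawl requests in your server logs, checking the User-Agent string is usually enough. Here is a minimal Python sketch (an illustration, not an official Oncrawl tool); it assumes the default bot name has not been customized:

def is_oncrawl(user_agent: str) -> bool:
    # Both the desktop and mobile user agents above contain the token "Oncrawl".
    # If you have customized the user agent in your crawl settings, adapt this check.
    return "oncrawl" in user_agent.lower()

# is_oncrawl("Mozilla/5.0 (compatible; Oncrawl/1.0; +http://www.oncrawl.com/)")  -> True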

What types of crawls can the Oncrawl bot do?

Our bots can:

  • Obtain information about the pages in a list, without following any links. If a page is not included in the list, it will not be crawled.

  • Follow links to discover new pages on a website and obtain information about all of the pages that are discovered. If a page cannot be reached by navigating through links from the start page, it will not be crawled.

What links does the Oncrawl bot follow?

The Oncrawl bot follows all links it encounters in the domain(s) listed as Start URLs. If crawling subdomains is authorized, this includes all linked subdomains; if not, links to subdomains that are not Start URLs are treated as links to external sites.

The table below gives the default behavior, but you can change this in the crawl settings when you set up your crawl.

Link Type | Link followed by crawler? | Depth of target | Used to calculate Inrank? | Treated as Inlink in charts?
--- | --- | --- | --- | ---
a href | Yes ("follow" only) | N+1 | Yes ("follow" and same hostname only) | Yes
area | Yes | N+1 | Yes | Yes
img | No | N/A | No | No
link rel=prev/next | Yes | N+1 | No | No
link rel=canonical | Yes | N (same as origin) | No | No
link rel=alternate hreflang | Yes | N (same as origin) | No | No
link rel=alternate media | Yes | N (same as origin) | No | No
link rel=alternate * | No | N/A | No | No
link rel=amphtml | No | N/A | No | No
link rel=shortlink | No | N/A | No | No
link rel=stylesheet | No | N/A | No | No
iframe | No | N/A | No | No
frame | No | N/A | No | No
script | No | N/A | No | No
http Location | Yes | N (same as origin) | No | No
3xx (redirections) | Yes | N (same as origin) | No | No

Note: link rel=amphtml links are not followed. AMP information can be obtained in a crawl by scraping this data from the source code.
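As an illustration of the first rows in the table, here is a minimal Python sketch (using the BeautifulSoup library; this is not Oncrawl's own code) that extracts a href and area links from a page while skipping rel="nofollow" links:

from bs4 import BeautifulSoup

def extract_followable_links(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for tag in soup.find_all(["a", "area"], href=True):
        # Per the table above: <a href> links are followed only when they are not
        # marked rel="nofollow"; <area> links are always followed.
        rel = tag.get("rel") or []
        if tag.name == "a" and "nofollow" in rel:
            continue
        links.append(tag["href"])
    return links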

How does the Oncrawl bot treat instructions to robots?

Oncrawl follows all instructions to robots, including meta robots directives and instructions found in the robots.txt file.

This means that robots.txt instructions targeting User-agent: Oncrawl  (or the custom bot name if you have modified it) will apply to your crawl.

To give full access to the Oncrawl bot, you can replace your robots.txt file with a virtual one for Oncrawl. Add the following text at the top of the virtual file:

User-Agent: Oncrawl
Allow: /
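If you want to check which URLs your current robots.txt rules allow the Oncrawl bot to fetch, Python's standard robotparser module applies the same kind of user-agent matching. A minimal sketch, assuming your robots.txt is served at its usual location on an example domain:

from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://www.example.com/robots.txt")
parser.read()

# can_fetch() applies the rules that match the "Oncrawl" user agent
# (use your custom bot name here if you have changed it in the crawl settings).
print(parser.can_fetch("Oncrawl", "https://www.example.com/some-page"))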

What type of pages can the Oncrawl bot crawl and render?

The Oncrawl bot crawls (fetches, or requests information about) all internal links that are not forbidden and renders all HTML pages it discovers that are not restricted from rendering. Here are some of the types of pages it can handle:

  • Redirected pages

  • Pages with HTTP status error codes (404, 503...)

  • Pages with meta robots instructions (noindex, nofollow)

  • JavaScript pages (requires specific crawl settings)

  • Staging and pre-production sites (requires specific crawl settings)

How does the Oncrawl bot handle 5xx server errors?

When crawling a website, our crawler may encounter server errors (5xx status codes, such as 500 Internal Server Error). These errors are generally due to temporary problems on the target site's server. To handle these situations, our crawler uses retry logic to minimize the impact of these errors and maximize the completion of crawls.

Our crawler applies a three-attempt retry strategy when a 500 error is encountered on a page:

First request

If a GET request to a page returns a 500 error, the crawler waits 30 seconds before making another attempt.

First retry

A second request is sent. If it still returns a 500 error, the crawler waits another 30 seconds before making a third and final attempt.

Second retry

If the third request also returns a 500 error, the crawler records this last response as the final result. No further attempts are made.
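The behavior described above can be sketched roughly as follows (a simplified Python sketch using the requests library, not Oncrawl's actual implementation):

import time
import requests

def fetch_with_retries(url: str, max_attempts: int = 3, wait_seconds: int = 30) -> requests.Response:
    response = requests.get(url)
    attempts = 1
    # Retry only on 500 errors, waiting 30 seconds between attempts,
    # up to three requests in total.
    while response.status_code == 500 and attempts < max_attempts:
        time.sleep(wait_seconds)
        response = requests.get(url)
        attempts += 1
    # Whatever the last request returned is recorded as the final result.
    return response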

How does the Oncrawl bot handle sitemaps?

Oncrawl uses your sitemaps to find orphan pages and to analyze the type of information they contain. If you want to crawl the URLs listed in your sitemap, you'll need to convert it to a list of URLs and crawl in list mode.

Oncrawl respects sitemap standards. For example, a sitemap should be placed in the same directory as the URLs it lists (the URLs themselves may be in subdirectories below it).

For example, we would expect to find the sitemap containing the URL https://www.example.com/pages/example-page:

  • ✅ at https://www.example.com/ 

  • ✅ or at https://www.example.com/pages/ 

  • 🚫 but not at https://www.example.com/sitemaps/ 
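The directory rule illustrated above is easy to check programmatically: the sitemap's own directory must be a prefix of each listed URL's path. Here is a minimal Python sketch (an illustration, not Oncrawl's implementation):

from urllib.parse import urlparse

def sitemap_can_list(sitemap_url: str, page_url: str) -> bool:
    sitemap = urlparse(sitemap_url)
    page = urlparse(page_url)
    # The sitemap's directory must be a prefix of the page's path,
    # and both must be on the same host.
    sitemap_dir = sitemap.path.rsplit("/", 1)[0] + "/"
    return sitemap.netloc == page.netloc and page.path.startswith(sitemap_dir)

# sitemap_can_list("https://www.example.com/sitemap.xml", "https://www.example.com/pages/example-page")  -> True
# sitemap_can_list("https://www.example.com/sitemaps/sitemap.xml", "https://www.example.com/pages/example-page")  -> False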

If your sitemaps do not conform to sitemaps.org standards, you should choose Allow soft mode in the crawl settings.

How does the Oncrawl bot know when to stop crawling?

The Oncrawl bot stops crawling when:

  • It finishes analyzing a depth and does not encounter any new internal links.

  • It discovers the maximum number of URLs indicated in the crawl settings.

  • It finishes discovering all pages at the maximum depth indicated in the crawl settings.

  • You abort the crawl.
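In simplified terms, the stopping conditions above work together roughly like this (a Python-style sketch, not Oncrawl's actual crawler; fetch_and_extract_links is a hypothetical helper that returns the internal links found on a page):

def crawl(start_urls, max_urls, max_depth):
    seen = set(start_urls)
    current_depth = list(start_urls)
    depth = 0
    # Stop when a depth yields no new internal links, when the maximum
    # number of URLs is reached, or when the maximum depth is reached.
    while current_depth and depth < max_depth and len(seen) < max_urls:
        next_depth = []
        for url in current_depth:
            for link in fetch_and_extract_links(url):  # hypothetical helper
                if link not in seen and len(seen) < max_urls:
                    seen.add(link)
                    next_depth.append(link)
        current_depth = next_depth
        depth += 1
    return seen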

How fast does the Oncrawl bot crawl?

Oncrawl can crawl as fast as you want it to.

We recommend 2-5 URLs/second as a good speed for most sites. At this speed, crawling and analyzing 1,000 URLs takes about 15 minutes. (This can vary depending on how complex your website's pages are to render.)
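To put those numbers in perspective, the fetch time alone is easy to estimate; rendering and analysis account for the rest. A quick back-of-the-envelope sketch:

# Rough estimate of fetch time only; rendering and analysis add to this.
urls = 1000
speed = 2  # URLs per second, at the low end of the recommended 2-5 range
fetch_minutes = urls / speed / 60
print(f"about {fetch_minutes:.1f} minutes of fetching")  # about 8.3 minutes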

To increase the crawl speed to more than 10 URLs/second, you will need to validate your site. This lets us make sure you have access to the website.

Because requests for pages are made one after another at the speed you have set, it is possible to overload a server with a crawl.

Make sure your server can handle the volume of requests at whatever crawl speed you set. 
