In the source code provided for a URL, there may be many references to other URLs. Below we'll discuss which references our crawler regards as links, and how they are treated.

What is the identity of the OnCrawl bot?

OnCrawl uses two bots:

  • Desktop: Mozilla/5.0 (compatible; OnCrawl/1.0; +http://www.oncrawl.com/) 
  • Mobile: Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; OnCrawl; +http://www.oncrawl.com/) 

The User Agent name for both bots is modifiable, but is listed as OnCrawl by default.
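
If you want to spot these bots in your server logs, a substring test on the User-Agent header is usually enough. The following is a minimal sketch that assumes the default bot name; the function name is_oncrawl_request and its bot_name parameter are illustrative and not part of any OnCrawl tool.

```python
# The two default User-Agent strings listed above.
DESKTOP_UA = "Mozilla/5.0 (compatible; OnCrawl/1.0; +http://www.oncrawl.com/)"
MOBILE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) "
    "AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 "
    "Mobile/12F70 Safari/600.1.4 (compatible; OnCrawl; +http://www.oncrawl.com/)"
)


def is_oncrawl_request(user_agent: str, bot_name: str = "OnCrawl") -> bool:
    """Return True if the User-Agent header mentions the bot name (case-insensitive)."""
    return bot_name.lower() in user_agent.lower()
```

If you have changed the bot name in your crawl settings, pass that name as bot_name instead of relying on the default.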

What types of crawls can the OnCrawl bot do?

Our bots can:

  • Obtain information about the pages in a list, without following any links. If a page is not included in the list, it will not be crawled.
  • Follow links to discover new pages on a website, and obtain information about all of the pages that are discovered. If a page cannot be reached by navigating through links from the start page, it will not be crawled.
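
The difference between the two modes can be illustrated on a toy site. This is only a sketch: the SITE dictionary and both function names are hypothetical, and the real crawler works over HTTP rather than an in-memory map.

```python
from collections import deque

# Hypothetical site: each page maps to the internal pages it links to.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b"],
    "/b": [],
    "/orphan": ["/a"],  # no other page links to this one
}


def crawl_list(urls):
    """List mode: look at the listed pages only; their links are recorded, not followed."""
    return {url: SITE.get(url, []) for url in urls}


def crawl_from_start(start):
    """Link-following mode: breadth-first discovery starting from the start page."""
    seen = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in SITE.get(page, []):
            if link not in seen:
                seen.append(link)
                queue.append(link)
    return seen
```

Here crawl_from_start("/") never reaches "/orphan" because nothing links to it, while crawl_list(["/orphan"]) covers that page but not "/a", which only appears as a link.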

What links does the OnCrawl bot follow?

The OnCrawl bot follows all links it encounters in the domain(s) listed as Start URLs. If crawling subdomains is authorized, this includes all linked subdomains; if not, links to subdomains that are not Start URLs are treated as links to external sites.
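
The rule above can be sketched as a small classifier. This is an illustration under stated assumptions, not OnCrawl's implementation: the function names are hypothetical, and stripping a leading "www." is a simplification for finding the parent domain (a production crawler would more likely rely on the Public Suffix List).

```python
from urllib.parse import urlsplit


def _host(url: str) -> str:
    """Lower-cased hostname of a URL, or an empty string if there is none."""
    return (urlsplit(url).hostname or "").lower()


def _base_domain(host: str) -> str:
    # Simplification (assumption): "www.example.com" and "example.com"
    # share the parent domain "example.com".
    return host[4:] if host.startswith("www.") else host


def is_internal(link: str, start_urls: list[str], crawl_subdomains: bool) -> bool:
    """Decide whether a link is followed as internal or treated as external."""
    host = _host(link)
    start_hosts = {_host(url) for url in start_urls}
    if host in start_hosts:
        return True
    if not crawl_subdomains:
        return False
    bases = {_base_domain(h) for h in start_hosts}
    return any(host == base or host.endswith("." + base) for base in bases)
```

With https://www.example.com/ as the only Start URL, a link to https://blog.example.com/post is internal only when crawl_subdomains is True.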

Note: links declared with link rel=amphtml are not followed. AMP information can still be obtained during a crawl by scraping it from the source code.
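
To show what kind of data such a scraping rule targets, here is a minimal Python sketch that pulls the AMP URL out of a page's source code. It only illustrates the extraction; it is not how scraping is configured in OnCrawl, and the names AmpLinkParser and find_amp_url are hypothetical.

```python
from html.parser import HTMLParser


class AmpLinkParser(HTMLParser):
    """Remember the href of the first <link rel="amphtml"> tag encountered."""

    def __init__(self):
        super().__init__()
        self.amp_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.amp_url is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "amphtml" in rel_values:
            self.amp_url = attributes.get("href")


def find_amp_url(html: str):
    """Return the AMP URL declared in the HTML, or None if there is none."""
    parser = AmpLinkParser()
    parser.feed(html)
    return parser.amp_url
```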

How does the OnCrawl bot treat instructions to robots?

OnCrawl follows all instructions to robots, including meta robots directives and instructions found in the robots.txt file.

This means that robots.txt instructions targeting User-agent: OnCrawl (or the custom bot name if you have modified it) will apply to your crawl.

To give full access to the OnCrawl bot, you can replace your robots.txt file with a virtual one for OnCrawl. Add the following text at the top of the virtual file:

User-Agent: OnCrawl
Allow: /
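
Before launching a crawl, you can check how a standard robots.txt parser reads these rules. The sketch below uses Python's built-in urllib.robotparser; the extra User-Agent: * group that blocks everyone else is an assumption added for contrast, and the helper name can_crawl is hypothetical. OnCrawl's own parser may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# The OnCrawl group from above, followed by an assumed catch-all group
# that blocks every other bot (added here only for contrast).
ROBOTS_TXT = """\
User-Agent: OnCrawl
Allow: /

User-Agent: *
Disallow: /
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given bot name may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Pass the bot name (OnCrawl, or your custom name) rather than the full User-Agent string, since this parser matches on the product token.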

What type of pages can the OnCrawl bot crawl and render?

The OnCrawl bot crawls (fetches, or requests information about) all internal links that are not forbidden and renders all HTML pages it discovers and is not restricted from rendering. Here are some of the types of pages it can handle:

  • Redirected pages
  • Pages with HTTP status error codes (404, 503...)
  • Pages with meta robots instructions (noindex, nofollow)
  • JavaScript pages (requires specific crawl settings)
  • Staging and pre-prod sites (requires specific crawl settings)

How does the OnCrawl bot handle sitemaps?

OnCrawl uses your sitemap to find orphan pages and analyze the type of information present on your sitemap. If you want to crawl the URLs on your sitemap, you'll need to convert it to a list and crawl in list mode.

OnCrawl respects sitemap standards. For example, the sitemap should be placed in the same directory as the URLs it lists (URLs may be in subdirectories).

For example, we would expect to find the sitemap containing the URL https://www.example.com/pages/example-page:

  • ✅ at https://www.example.com/
  • ✅ or at https://www.example.com/pages/
  • 🚫 but not at https://www.example.com/sitemaps/
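
The location rule amounts to a prefix check: a sitemap covers URLs in its own directory and below. Here is a minimal sketch of that check; the function name sitemap_covers is hypothetical.

```python
def sitemap_covers(sitemap_dir: str, page_url: str) -> bool:
    """Return True if page_url lies in sitemap_dir or one of its subdirectories."""
    if not sitemap_dir.endswith("/"):
        sitemap_dir += "/"
    return page_url.startswith(sitemap_dir)


PAGE = "https://www.example.com/pages/example-page"

# The three locations from the example above.
for location in (
    "https://www.example.com/",
    "https://www.example.com/pages/",
    "https://www.example.com/sitemaps/",
):
    print(location, sitemap_covers(location, PAGE))
```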

If your sitemap does not conform to the sitemaps.org standard, you can ask OnCrawl to analyze it in "soft mode" in the crawl settings.

How does the OnCrawl bot know when to stop crawling?

The OnCrawl bot stops crawling when:

  • It finishes analyzing a depth and does not encounter any new internal links.
  • It discovers the maximum number of URLs indicated in the crawl settings.
  • It finishes discovering all pages at the maximum depth indicated in the crawl settings.
  • You abort the crawl.
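
The first three stop conditions can be sketched as a depth-by-depth loop over a toy link graph. This is an illustration, not OnCrawl's implementation: the function name, the max_urls and max_depth parameters, and the choice to count the start page as depth 1 are all assumptions. A manual abort is not modeled.

```python
def crawl(links, start, max_urls, max_depth):
    """Crawl depth by depth; return (discovered URLs, reason for stopping)."""
    discovered = [start]
    current_level = [start]
    depth = 1  # assumption: the start page counts as depth 1
    while True:
        if depth >= max_depth:
            return discovered, "max depth reached"
        next_level = []
        for page in current_level:
            for link in links.get(page, []):
                if link in discovered:
                    continue
                if len(discovered) >= max_urls:
                    return discovered, "max URLs reached"
                discovered.append(link)
                next_level.append(link)
        if not next_level:
            return discovered, "no new internal links"
        current_level = next_level
        depth += 1


# Hypothetical site: each page maps to the internal pages it links to.
SITE = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c"], "/c": ["/d"], "/d": []}
```

For instance, crawl(SITE, "/", max_urls=100, max_depth=2) stops after the pages linked from the start page, while generous limits let the loop run until a depth yields no new links.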

How fast does the OnCrawl bot crawl?

OnCrawl can crawl as fast as you want it to.

We recommend 2-5 URLs/second as a good speed for most sites. At this speed, it can take about 15 minutes to crawl and analyze 1000 URLs. (This can vary depending on how complex your website's pages are to render.)
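
As a rough guide, the time spent requesting pages is the number of URLs divided by the crawl speed. The sketch below computes only that lower bound; the function name fetch_minutes is hypothetical, and rendering and analysis time are not included.

```python
def fetch_minutes(url_count: int, urls_per_second: float) -> float:
    """Minimum time spent requesting pages, in minutes, at a fixed crawl speed."""
    if urls_per_second <= 0:
        raise ValueError("urls_per_second must be positive")
    return url_count / urls_per_second / 60


for speed in (2, 5):
    print(f"{speed} URLs/second: {fetch_minutes(1000, speed):.1f} minutes")
```

For 1000 URLs this works out to roughly 8.3 minutes at 2 URLs/second and 3.3 minutes at 5 URLs/second; the remainder of the 15 minutes mentioned above is presumably spent on rendering and analysis.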

To increase the crawl speed to more than 10 URLs/second, you will need to validate your site. This lets us make sure you have access to the website.

Because requests for pages are made one after another at the speed you have set, it is possible to overload a server with a crawl.
Make sure your server can handle the volume of requests at whatever crawl speed you set. 

Going further

If you still have questions about how OnCrawl's bot works, drop us a line at @oncrawl_cs or click on the Intercom button at the bottom right of your screen to start a chat with us.

Happy crawling!
