Crawl mode: spider mode VS URL list

Understand the different ways to crawl pages with OnCrawl

Updated over a week ago

This guide describes the two ways to crawl your pages, so that you can choose the one that fits your needs:

Spider mode

In this mode, OnCrawlbot starts from a given URL and follows all the "followed" outlinks that it encounters. It explores the pages of the site this way until it reaches the maximum number of URLs or the maximum crawl depth, whichever comes first.

You can modify the types of links it follows in the Crawler behavior section of the crawl settings.

This discovery mode gives you an overview of the pages that are accessible through internal linking. It helps you understand how your site is structured by analyzing depth and internal linking, and shows how popularity is distributed among your pages (Inrank).
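The exploration described above can be pictured as a breadth-first traversal with two stopping limits. The following is a minimal Python sketch, not OnCrawl's actual implementation: the function name spider_crawl, the in-memory links dictionary and the parameter names are assumptions made for illustration. The start URL is given a depth of 1, matching the convention this article uses for start URLs.

```python
from collections import deque


def spider_crawl(links, start_url, max_urls, max_depth):
    """Explore pages breadth-first from start_url, within two limits.

    links: hypothetical dict mapping a URL to the "followed" outlinks
           found on that page (stands in for fetching and parsing).
    Returns a dict mapping each discovered URL to its depth.
    """
    depths = {start_url: 1}  # the start URL sits at depth 1
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if depths[url] >= max_depth:
            continue  # pages at the maximum depth are not expanded further
        for target in links.get(url, []):
            if target in depths:
                continue  # already discovered at the same or a shallower depth
            if len(depths) >= max_urls:
                return depths  # the URL limit was reached first
            depths[target] = depths[url] + 1
            queue.append(target)
    return depths


# Hypothetical site: the home page links to /a and /b, /a links to /c, /c to /d.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/c": ["/d"]}
result = spider_crawl(site, "/", max_urls=10, max_depth=3)
```

In this example the depth limit stops the exploration before /d is discovered, while lowering max_urls would stop it earlier instead.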

From list of URLs

In this mode, OnCrawlbot crawls only a static list of URLs.

By default, the links discovered on these pages are not followed, limiting the analysis to the pages contained in the list.

However, you can set the crawler to follow certain types of links even in URL list mode, in the Crawler behavior section of the crawl settings.

Note that if you follow links discovered on URLs from your list, the crawl results will most likely contain URLs that were not on your original list.

In this mode, all pages on the list are considered to be start URLs, so every page on the list has a depth of 1.

If you haven't modified the default crawler behavior, the analysis is limited to the pages on the list and redirects are not followed.
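To make the difference between the default behavior and link following concrete, here is a small Python sketch. It is an illustration only: list_crawl, the links dictionary and the follow_links flag are hypothetical names, and the sketch assumes, for simplicity, that followed links are explored only one step away from the list and receive a depth of 2.

```python
def list_crawl(links, url_list, follow_links=False):
    """Return a dict of URL -> depth for a crawl started from a URL list.

    links: hypothetical dict mapping a URL to the outlinks found on it.
    follow_links: stands in for enabling link following in the settings.
    """
    # Every URL on the list is a start URL, so each one has depth 1.
    depths = {url: 1 for url in url_list}
    if follow_links:
        for url in url_list:
            for target in links.get(url, []):
                # Assumption: a newly discovered page is one step deeper.
                # URLs already on the list keep their depth of 1.
                depths.setdefault(target, 2)
    return depths


pages = {"/a": ["/b", "/c"]}
default_result = list_crawl(pages, ["/a", "/b"])
followed_result = list_crawl(pages, ["/a", "/b"], follow_links=True)
```

Here default_result contains only the two listed URLs, while followed_result also contains /c, a URL that was not on the original list.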

If you are trying to crawl all URLs in a sitemap, first extract the URLs from the sitemap, then provide them in a file in the format described below.
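One possible way to extract the URLs is sketched below in Python, using only the standard library. It assumes a standard sitemap file (a urlset in the sitemaps.org namespace) whose content you have already downloaded; the function name urls_from_sitemap is hypothetical, and sitemap index files, which point to other sitemaps, are not handled.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_data):
    """Return the page URLs listed in the <url><loc> entries of a sitemap."""
    root = ET.fromstring(xml_data)
    locations = root.findall(SITEMAP_NS + "url/" + SITEMAP_NS + "loc")
    return [loc.text.strip() for loc in locations if loc.text]


# Hypothetical sitemap content, used here instead of a downloaded file.
example = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/about</loc></url>
</urlset>"""

url_lines = "\n".join(urls_from_sitemap(example))  # one URL per line
```

The resulting text, with one URL per line, can then be saved to a file and packaged as described below.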

How to upload this list

  • From the crawl settings screen, under Start URL, choose the List of URLs option.

  • From the drop-down menu labeled "Select a list of URLs to crawl," pick a file. You can also upload a new one by clicking the Upload files button, which will take you to the data sources management interface.

You can head over to the data sources management interface at any time from the project homepage, by clicking the Add data sources button.

If you're uploading a new list, don't forget that you'll need to head back to the crawl settings to choose your uploaded list and launch your crawl!

Required file format

The file must be inside a ZIP archive.

  • The archive should contain a plain text file containing a list of URLs, one per line.

  • You can upload a ZIP file containing one or more CSV files, or one or more TXT files.

If you provide multiple files in a ZIP archive, they must obey the following rule:

  • All files must have the same format: all CSV, or all TXT.

Additionally, the content of each file must obey the following rules:

  • The file must be UTF-8 encoded if you need to handle non-ASCII characters in values.

  • Each line must be less than 1024 characters.

  • The full URL must be provided.
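The rules above can be checked before uploading. The following Python sketch builds a ZIP archive containing a single UTF-8 text file with one URL per line. The names check_url_line, write_url_archive and urls.txt are hypothetical, and the sketch assumes that a "full URL" means one that includes both a scheme (such as https) and a host.

```python
import zipfile
from urllib.parse import urlsplit

MAX_LINE_LENGTH = 1024  # each line must be shorter than this


def check_url_line(line):
    """Return a list of problems found for one line of the URL list."""
    problems = []
    if len(line) >= MAX_LINE_LENGTH:
        problems.append("line is 1024 characters or longer")
    parts = urlsplit(line)
    # Assumption: a full URL has both a scheme and a host.
    if not (parts.scheme and parts.netloc):
        problems.append("not a full URL (scheme or host missing)")
    return problems


def write_url_archive(urls, zip_path, inner_name="urls.txt"):
    """Write the URLs, one per line, to a UTF-8 text file inside a ZIP."""
    invalid = {}
    for url in urls:
        problems = check_url_line(url)
        if problems:
            invalid[url] = problems
    if invalid:
        raise ValueError(f"invalid lines: {invalid}")
    text = "\n".join(urls) + "\n"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(inner_name, text.encode("utf-8"))
```

For example, write_url_archive(["https://www.example.com/"], "urls.zip") would create urls.zip in the current directory, ready to be uploaded as a data source.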

Thank you for reading this article, and enjoy your crawl!

You can also find this article by searching for:

crawlear, rastrear una lista de URLs
crawler une liste d'URLs
