In this guide we will look at the different ways to crawl your site's pages according to your specific needs. In the Oncrawl app, you have the choice of three different crawling modes:
Spider
From a list of URLs
From sitemaps
Spider
In this mode, the Oncrawl bot begins the crawl from a given start URL, then follows every outlink tagged as "dofollow" that it encounters. The bot can thus explore all the pages of the site that are reachable through links, within the limits set by the maximum number of URLs or the maximum crawl depth, whichever is reached first.
You can modify the types of links it follows in the Crawler behavior section of the crawl settings.
Using the spider mode allows you to:
Have an overview of which of your pages are accessible via the internal linking.
Understand how your site is structured thanks to the depth analysis and the internal linking.
See how the internal popularity is distributed between your pages (Inrank).
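The link-following behavior of spider mode, with its two stopping limits, can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions, not Oncrawl's actual implementation: the `links` dictionary stands in for fetching and parsing real pages, the function and parameter names are hypothetical, and the start URL is assumed to have a depth of one.

```python
from collections import deque


def spider_crawl(links, start_url, max_urls=1000, max_depth=15):
    """Breadth-first walk over a link graph, stopping at either limit.

    `links` maps a URL to the list of URLs it links to (a hypothetical
    stand-in for fetching a page and extracting its followed links).
    Returns a dict of {url: depth}.
    """
    depths = {start_url: 1}  # assumption: the start URL has depth 1
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if depths[url] >= max_depth:
            continue  # pages at the depth limit are kept, but not expanded
        for target in links.get(url, []):
            if target in depths:
                continue  # already discovered at a shallower or equal depth
            if len(depths) >= max_urls:
                return depths  # URL limit reached first
            depths[target] = depths[url] + 1
            queue.append(target)
    return depths


# A tiny hypothetical site: the home page links to /a and /b, and so on.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c"], "/c": ["/d"]}
full_crawl = spider_crawl(site, "/")
shallow_crawl = spider_crawl(site, "/", max_depth=2)
```

A page that no other page links to never enters the queue, which is why this mode shows which pages are accessible through internal linking.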
From a list of URLs
When you choose to crawl your site from a list of URLs, the Oncrawl bot crawls a static, predefined list of URLs.
By default, the links discovered on these pages are not followed, limiting the analysis to the pages contained in the list.
However, you can set the crawler to follow certain types of links even in URL list mode, in the Crawler behavior section of the crawl settings.
Note that if you follow links discovered on URLs from your list, the crawl results will most likely contain URLs that were not on your original list.
In this mode, all pages are considered to be start URLs, so every page on the list has a depth of one.
If you haven't modified the default crawler behavior, the analysis will be limited to the pages on the list, and redirects will not be followed.
If you are trying to crawl all URLs in a sitemap, first extract the URLs from the sitemap, then provide them in a file in the format described below.
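Extracting the URLs from a sitemap could look something like the sketch below. It assumes a standard XML sitemap (a `urlset` in the sitemaps.org namespace) whose content you have already downloaded; the function names and the sample URLs are hypothetical, and sitemap index files are not handled.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return the <loc> value of every <url> entry in a sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def to_url_list_text(urls):
    """Format URLs as plain text, one per line."""
    return "\n".join(urls)


# Hypothetical sitemap content, used only for illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc> https://www.example.com/pricing </loc></url>
</urlset>"""

sample_urls = urls_from_sitemap(SAMPLE)
sample_text = to_url_list_text(sample_urls)
```

The resulting text can then be saved as a TXT file and placed in an archive as described in the file format requirements.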
How to upload your list
From the crawl settings screen, under Start URL, choose the List of URLs option.
From the drop-down menu Select a list of URLs to crawl, pick a file. You can also upload a new one by clicking the Upload files button, which will take you to the data sources management interface.
You can head over to the data sources management interface at any time from the project homepage, by clicking the Add data sources button.
If you're uploading a new list, don't forget that you'll need to head back to the crawl settings to choose your uploaded list before launching your crawl.
Required file format
The list must be uploaded as a ZIP archive.
The archive should contain plain text files listing the URLs, one per line.
The ZIP file can contain one or more CSV files, or one or more TXT files.
If you provide multiple files in a ZIP archive, they must all have the same format: all CSV, or all TXT.
Additionally, the content of each file must adhere to the following rules:
The file must be UTF-8 encoded if it contains non-ASCII characters.
Each line must be fewer than 1024 characters long.
The full URL must be provided.
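Before uploading, it may help to check a list against these rules and package it locally. The sketch below is one possible way to do so, under assumptions that are not stated in the rules themselves: "full URL" is interpreted as having an http or https scheme and a host, line length is counted in characters, and the names `check_url_line`, `build_zip` and `urls.txt` are hypothetical.

```python
import io
import zipfile
from urllib.parse import urlsplit

MAX_LINE_LENGTH = 1024  # each line must be shorter than this


def check_url_line(line):
    """Return a list of problems found on one line (empty if none)."""
    problems = []
    if len(line) >= MAX_LINE_LENGTH:
        problems.append("line too long")
    parts = urlsplit(line)
    # Assumption: a "full URL" has an http(s) scheme and a host name.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        problems.append("not a full URL")
    return problems


def build_zip(urls, member_name="urls.txt"):
    """Return the bytes of a ZIP archive holding one UTF-8 TXT file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # One URL per line, encoded as UTF-8.
        archive.writestr(member_name, "\n".join(urls).encode("utf-8"))
    return buffer.getvalue()


# Hypothetical list, used only for illustration.
candidates = ["https://www.example.com/", "/pricing"]
valid_urls = [url for url in candidates if not check_url_line(url)]
archive_bytes = build_zip(valid_urls)
```

Writing `archive_bytes` to a file with a `.zip` extension produces an archive containing a single TXT file, so the same-format rule for multiple files is satisfied trivially.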
From sitemaps
In this mode, the Oncrawl bot will crawl only the URLs found in your sitemaps.
When choosing this option, you can either:
Provide a list of sitemap URLs (the locations of your sitemaps): Oncrawl will only crawl the URLs present in the sitemaps you list in this section.
Provide a URL from which Oncrawl will discover the list of sitemaps on its own at the beginning of the crawl. If new sitemaps have been added to your site since the last crawl with this profile, Oncrawl will find them and take them into account.