This guide describes the different ways to crawl your pages, depending on your needs:
In this mode, OnCrawlbot starts from a given URL and follows every "followed" outlink it encounters. It thus explores all the pages of the site, within the limits set by the maximum number of URLs or the maximum crawl depth, whichever is reached first.
This discovery mode gives you an overview of the pages that are accessible through internal linking. It helps you understand how your site is structured, by analyzing depth and internal linking, and shows how popularity (Inrank) is distributed among your pages.
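To make the two stopping conditions concrete, here is a minimal sketch of a link-following crawl in Python. It is not OnCrawlbot's actual implementation: the function name crawl, the outlinks dictionary (which stands in for fetching and parsing real pages) and the default limits are all assumptions made for illustration. The start URL is given a depth of 1, matching the depth that start URLs receive in list mode.

```python
from collections import deque


def crawl(start_url, outlinks, max_urls=1000, max_depth=15):
    """Breadth-first walk over a link graph, stopping at either limit.

    `outlinks` is a hypothetical dict mapping a URL to the followed links
    found on that page; a real crawler would fetch and parse each page.
    Returns a dict of URL -> depth, with the start URL at depth 1.
    """
    depths = {start_url: 1}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if depths[url] >= max_depth:
            continue  # do not follow links from pages at the maximum depth
        for link in outlinks.get(url, []):
            if link in depths:
                continue  # already discovered at the same or a shallower depth
            if len(depths) >= max_urls:
                return depths  # URL budget exhausted
            depths[link] = depths[url] + 1
            queue.append(link)
    return depths


# Hypothetical four-level site used only to exercise the limits.
site = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/c": ["https://example.com/d"],
}
by_depth = crawl("https://example.com/", site, max_urls=10, max_depth=3)
by_count = crawl("https://example.com/", site, max_urls=2, max_depth=10)
```

With max_depth=3 the page /d is never reached, and with max_urls=2 the walk stops as soon as two URLs are known, whichever limit applies first.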
From list of URLs
In this mode, OnCrawlbot crawls only a static list of URLs. The links discovered on these pages are not followed, which limits the analysis to the pages contained in the list.
In this mode, every page is treated as a start URL, so all pages have a depth of 1 and an Inrank of 10.
Because the analysis is limited to the pages contained in the list, redirects are not followed.
If you are trying to crawl all URLs in a sitemap, first extract the URLs from the sitemap, then provide them in a file in the format described below.
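As an illustration of the extraction step, the following Python sketch pulls the page URLs out of a standard XML sitemap using only the standard library. The function name urls_from_sitemap and the sample sitemap are assumptions for the example. It reads only url/loc entries, so a sitemap index file (which lists other sitemaps) would need to be resolved first.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    locs = root.findall(f"{SITEMAP_NS}url/{SITEMAP_NS}loc")
    # Skip empty <loc> elements and trim surrounding whitespace.
    return [loc.text.strip() for loc in locs if loc.text and loc.text.strip()]


# Hypothetical sitemap content used only for demonstration.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc> https://example.com/page </loc></url>
</urlset>"""

sitemap_urls = urls_from_sitemap(sample)
```

The resulting list can then be written one URL per line to a text file.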
How to upload this list?
- From the crawl settings screen, choose the "List of URLs" option.
- Then pick an existing file, or upload a new one by clicking the "Upload files" button, which takes you to the data sources management interface.
You can reach the data sources management interface at any time from the project homepage by clicking the "Add data sources" button.
Required file format
The file must be inside a ZIP archive.
- The archive should contain a plain text file containing a list of URLs, one per line.
- You can upload a ZIP file containing one or more CSV file(s), or one or more TXT file(s)
If you provide multiple files in a ZIP archive, they must obey the following rules:
- All files must have the same format: all CSV, or all TXT
Additionally, the content of each file must obey the following rules:
- The file must be UTF-8 encoded if its values contain non-ASCII characters.
- Each line must be shorter than 1024 characters.
- The full URL must be provided.
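The rules above can be checked before uploading. The Python sketch below filters a list of URLs and writes the valid ones, one per line, to a UTF-8 text file inside an in-memory ZIP archive. The names is_valid_line, build_url_list_zip and urls.txt are hypothetical, and the sketch assumes that "full URL" means an http or https scheme plus a host name, which the rules do not spell out.

```python
import io
import zipfile
from urllib.parse import urlsplit

MAX_LINE_LENGTH = 1024  # each line must be shorter than this many characters


def is_valid_line(url):
    """Check the length limit and that the URL has a scheme and a host.

    Assumption: "full URL" is read here as http(s) scheme plus host name.
    """
    parts = urlsplit(url)
    return (
        len(url) < MAX_LINE_LENGTH
        and parts.scheme in ("http", "https")
        and bool(parts.netloc)
    )


def build_url_list_zip(urls, member_name="urls.txt"):
    """Return (zip_bytes, kept_urls) for the URLs that pass the checks."""
    kept = [u.strip() for u in urls if is_valid_line(u.strip())]
    payload = "\n".join(kept) + "\n"  # one URL per line
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Encode explicitly as UTF-8 so non-ASCII characters survive.
        archive.writestr(member_name, payload.encode("utf-8"))
    return buffer.getvalue(), kept


candidates = [
    "https://example.com/café",            # kept: full URL, non-ASCII is fine
    "/relative/path",                      # dropped: no scheme or host
    "https://example.com/" + "a" * 1100,   # dropped: line too long
]
zip_bytes, kept_urls = build_url_list_zip(candidates)
```

Writing zip_bytes to a file with a .zip extension produces an archive holding a single TXT file that follows the rules listed above.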
Thank you for reading this article, and enjoy your crawl!
You can also find this article by searching for:
crawlear, rastrear una lista de URLs
crawler une liste d'URLs