Accepted sitemap formats
OnCrawl can analyze sitemaps in the following formats:
- XML: sitemap.xml, sitemap_index.xml
- Gzip: sitemap.xml.gz, sitemap_index.xml.gz
- Text files: sitemap.txt
- Syndication feeds: sitemap.rss (RSS 2.0), sitemap.atom (Atom 0.3 or 1.0)
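To illustrate how these formats differ in practice, here is a minimal Python sketch that reads the URLs out of an XML, gzip, or text sitemap. It is not OnCrawl's implementation: the function name `extract_urls` is hypothetical, the format is guessed from the file extension only, and RSS and Atom feeds are left out to keep the example short.

```python
import gzip
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemap protocol (sitemaps.org).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_urls(filename: str, data: bytes) -> List[str]:
    """Return the URLs listed in a sitemap, given its file name and raw bytes.

    Hypothetical helper for illustration; the format is inferred from the
    file extension, which is an assumption of this sketch.
    """
    # Gzip files are decompressed, then handled according to the inner name.
    if filename.endswith(".gz"):
        return extract_urls(filename[:-3], gzip.decompress(data))

    # Text sitemaps list one URL per line.
    if filename.endswith(".txt"):
        lines = data.decode("utf-8").splitlines()
        return [line.strip() for line in lines if line.strip()]

    # XML sitemaps and sitemap indexes both store addresses in <loc> elements.
    if filename.endswith(".xml"):
        root = ET.fromstring(data)
        return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]

    raise ValueError("unsupported sitemap format: " + filename)
```

For a sitemap index, the returned values are the addresses of the child sitemaps rather than page URLs, so each one would need to be fetched and parsed in turn.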
Best practices for sitemaps
When setting up a crawl:
- Specify your sitemap URLs if you have sitemaps that won't be found in the robots.txt file or that don't have a standard name.
- Use soft mode if you want OnCrawl to ignore the standard sitemap protocol. This can be useful, for example, if your sitemaps are not located at the root of your site (https://www.mysite/sitemap.xml.gz).
- Remember that the sitemap's location determines the URLs that can be included in the sitemap. A sitemap file stored at https://mysite.com/directory/sitemap.xml can contain all URLs starting with https://mysite.com/directory/, but cannot include URLs starting with https://mysite.com/other_directory/.
- Remember that each sitemap must contain no more than 50,000 URLs and must be no larger than 50 MB uncompressed. If you need to list more URLs, you can split them across several sitemaps and reference them from a sitemap index file (sitemap_index.xml).
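The location and size rules above can be checked before you submit a sitemap. The Python sketch below is one possible way to do it, not the check OnCrawl runs: `in_sitemap_scope` and `split_into_sitemaps` are hypothetical helper names, and the scope test assumes the rule is "same scheme, same host, and a path under the sitemap's directory".

```python
from typing import Iterable, List
from urllib.parse import urlsplit

# Limit from the sitemap protocol: at most 50,000 URLs per sitemap file.
MAX_URLS_PER_SITEMAP = 50_000


def in_sitemap_scope(sitemap_url: str, page_url: str) -> bool:
    """Return True if page_url may be listed in the sitemap at sitemap_url."""
    sitemap = urlsplit(sitemap_url)
    page = urlsplit(page_url)
    # Directory that holds the sitemap file, with a trailing slash.
    directory = sitemap.path.rsplit("/", 1)[0] + "/"
    same_site = (sitemap.scheme, sitemap.netloc) == (page.scheme, page.netloc)
    # A URL with no path (https://mysite.com) is treated as the root "/".
    return same_site and (page.path or "/").startswith(directory)


def split_into_sitemaps(urls: Iterable[str], limit: int = MAX_URLS_PER_SITEMAP) -> List[List[str]]:
    """Split a list of URLs into chunks that each fit in one sitemap file."""
    url_list = list(urls)
    return [url_list[i:i + limit] for i in range(0, len(url_list), limit)]


sitemap = "https://mysite.com/directory/sitemap.xml"
print(in_sitemap_scope(sitemap, "https://mysite.com/directory/page.html"))        # True
print(in_sitemap_scope(sitemap, "https://mysite.com/other_directory/page.html"))  # False
```

Each chunk returned by `split_into_sitemaps` would be written to its own sitemap file, and those files would then be listed in a sitemap index. The sketch only counts URLs; the 50 MB size limit would need a separate check on the generated files.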
If you still have questions about sitemaps, drop us a line at @oncrawl_cs or click on the Intercom button at the bottom right of your screen to start a chat with us.