Accepted sitemap formats

OnCrawl can analyze sitemaps in the following formats:

  • XML: sitemap.xml, sitemap_index.xml
  • Gzip: sitemap.xml.gz, sitemap_index.xml.gz
  • Text files: sitemap.txt
  • Syndication feeds: sitemap.rss (RSS 2.0), sitemap.atom (Atom 0.3 or 1.0)

Best practices for sitemaps

When setting up a crawl:

  • Specify your sitemap URLs if you have sitemaps that won't be found in the robots.txt file or that don't have a standard name.
  • Use soft mode if you want OnCrawl to ignore the standard sitemap protocol. This can be useful, for example, if your sitemaps are not located at the root of your site (a root-level sitemap looks like https://www.mysite/sitemap.xml.gz).
  • Remember that the sitemap's location determines the URLs that can be included in the sitemap. A sitemap file stored at https://mysite.com/directory/sitemap.xml can contain all URLs starting with https://mysite.com/directory/, but cannot include URLs starting with https://mysite.com/other_directory/.
  • Remember that a sitemap must contain no more than 50,000 URLs and must be no larger than 50 MB. If you need to list more URLs, split them across several sitemaps and reference those sitemaps from a sitemap index file (sitemap_index.xml).
  • Sitemaps cannot be used as start URLs.
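
The location and size rules above can be checked before you point a crawl at a sitemap. The following Python sketch is only an illustration and is not part of OnCrawl: the function names sitemap_scope and check_sitemap and their parameters are hypothetical, and it assumes the 50 MB limit applies to the uncompressed file.

```python
from urllib.parse import urlsplit

MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50 MB, assumed to mean the uncompressed size


def sitemap_scope(sitemap_url):
    """Return the URL prefix that a sitemap stored at this location may cover."""
    parts = urlsplit(sitemap_url)
    # Keep everything up to and including the last "/" of the path.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return f"{parts.scheme}://{parts.netloc}{directory}"


def check_sitemap(sitemap_url, urls, size_in_bytes):
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []
    scope = sitemap_scope(sitemap_url)
    for url in urls:
        if not url.startswith(scope):
            problems.append(f"outside scope {scope}: {url}")
    if len(urls) > MAX_URLS:
        problems.append(f"too many URLs: {len(urls)} > {MAX_URLS}")
    if size_in_bytes > MAX_BYTES:
        problems.append(f"file too large: {size_in_bytes} bytes")
    return problems
```

For example, calling check_sitemap("https://mysite.com/directory/sitemap.xml", ["https://mysite.com/other_directory/page"], 1024) reports one problem, because the URL falls outside https://mysite.com/directory/.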

When using a sitemap:

  • Remember that scanning a sitemap is not the same as crawling the URLs in the sitemap. (If that's what you're trying to do, you can extract the URLs from your sitemap, then crawl the resulting list in URL list mode.)
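
One possible way to extract the URLs from an XML sitemap is sketched below in Python. This is a minimal example under stated assumptions, not an OnCrawl tool: the function name extract_urls is hypothetical, the sitemap content is assumed to be already downloaded as bytes, and only entries in the standard sitemap namespace are read. For a sitemap index, the values returned are the locations of the child sitemaps rather than page URLs.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace defined by the standard sitemap protocol.
SITEMAP_LOC = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"


def extract_urls(sitemap_bytes):
    """Return the <loc> values found in an XML sitemap or sitemap index."""
    # Gzip files start with the magic bytes 0x1f 0x8b.
    if sitemap_bytes[:2] == b"\x1f\x8b":
        sitemap_bytes = gzip.decompress(sitemap_bytes)
    root = ET.fromstring(sitemap_bytes)
    urls = []
    for element in root.iter(SITEMAP_LOC):
        if element.text and element.text.strip():
            urls.append(element.text.strip())
    return urls
```

Writing the returned values to a text file, one URL per line, should give you a list you can then use for a crawl in URL list mode.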

Going further

If you still have questions about sitemaps, drop us a line at @oncrawl_cs or click on the Intercom button at the bottom right of your screen to start a chat with us.

Happy crawling!

You can also find this article by searching for:
formato mapa del sitio
format du sitemap
