What format can I use for my sitemaps?

You can use sitemaps to compare and add data to a crawl. Here are the formats Oncrawl supports.


A sitemap.xml file lists the pages on your website that you want search engines to index. The Oncrawl bot can read your sitemaps and cross-analyze the URLs they contain with your crawl data, which helps you spot ways to improve your SEO.

Accepted sitemap formats

Oncrawl can analyze sitemaps in the following formats:

  • XML: sitemap.xml, sitemap_index.xml

  • Gzip: sitemap.xml.gz, sitemap_index.xml.gz

  • Text files: sitemap.txt

  • Syndication feeds: sitemap.rss (RSS 2.0), sitemap.atom (Atom 0.3 or 1.0)
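
To make the first three formats concrete, here is a minimal Python sketch that writes the same URL list as an XML sitemap, a gzip-compressed XML sitemap, and a text sitemap. The function names and the example URLs are hypothetical and chosen for illustration only; this is not Oncrawl code.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_xml_sitemap(urls):
    """Return the bytes of a sitemap.xml document listing the given URLs."""
    # Serialize the sitemap namespace as the default (unprefixed) namespace.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in urls:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
    # The xml_declaration argument requires Python 3.8 or later.
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


def build_text_sitemap(urls):
    """Return the bytes of a sitemap.txt document: one URL per line, UTF-8."""
    return ("\n".join(urls) + "\n").encode("utf-8")


# Hypothetical example URLs.
urls = [
    "https://mysite.com/directory/page-1",
    "https://mysite.com/directory/page-2",
]

xml_bytes = build_xml_sitemap(urls)   # content of sitemap.xml
gz_bytes = gzip.compress(xml_bytes)   # content of sitemap.xml.gz
txt_bytes = build_text_sitemap(urls)  # content of sitemap.txt
```

The gzip variant is simply the XML file compressed, so all three outputs list the same URLs.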

Best practices for sitemaps

When setting up a crawl:

  • Specify your sitemap URLs if you have sitemaps that won't be found in the robots.txt file or that don't have a standard name.

  • Use soft mode if you want Oncrawl to ignore the standard sitemap protocol. This can be useful, for example, if your sitemaps are not located at the root of your site (a root location looks like https://www.mysite/sitemap.xml.gz).

  • Remember that a sitemap's location determines which URLs it can include. A sitemap file stored at https://mysite.com/directory/sitemap.xml can contain any URL starting with https://mysite.com/directory/, but cannot include URLs starting with https://mysite.com/other_directory/.

  • Remember that each sitemap must contain no more than 50,000 URLs and must be no larger than 50 MB (uncompressed). If you need to list more URLs, split them across several sitemaps and reference those from a sitemap_index.xml file.

  • Sitemaps cannot be used as start URLs.
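
The location rule and the 50,000-URL limit from the list above can be checked before you upload a sitemap. The following Python sketch is one way to do that, under stated assumptions: the helper names `in_sitemap_scope` and `split_into_sitemaps` are hypothetical, the scope check compares only scheme, host, and path prefix, and the 50 MB size limit is not checked here.

```python
from urllib.parse import urlsplit

MAX_URLS_PER_SITEMAP = 50_000  # limit stated in the best practices above


def in_sitemap_scope(sitemap_url, page_url):
    """Return True if page_url is on the same host as the sitemap and
    sits at or below the directory that holds the sitemap file."""
    sitemap = urlsplit(sitemap_url)
    page = urlsplit(page_url)
    same_origin = (sitemap.scheme, sitemap.netloc.lower()) == (
        page.scheme,
        page.netloc.lower(),
    )
    if not same_origin:
        return False
    # "/directory/sitemap.xml" -> "/directory/"; "/sitemap.xml" -> "/"
    directory = sitemap.path.rsplit("/", 1)[0] + "/"
    return (page.path or "/").startswith(directory)


def split_into_sitemaps(urls, max_urls=MAX_URLS_PER_SITEMAP):
    """Split a URL list into chunks that each fit in one sitemap file."""
    return [urls[i:i + max_urls] for i in range(0, len(urls), max_urls)]


sitemap_url = "https://mysite.com/directory/sitemap.xml"
print(in_sitemap_scope(sitemap_url, "https://mysite.com/directory/page.html"))
print(in_sitemap_scope(sitemap_url, "https://mysite.com/other_directory/page.html"))

# 120,000 hypothetical URLs need three sitemap files under the 50,000 limit.
many_urls = [f"https://mysite.com/directory/page-{n}" for n in range(120_000)]
chunks = split_into_sitemaps(many_urls)
print(len(chunks), [len(chunk) for chunk in chunks])
```

Each chunk would become its own sitemap file, and the sitemap index would then list those files.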

When using a sitemap:

  • Remember that scanning a sitemap is not the same as crawling the URLs in the sitemap. (If that's what you're trying to do, you can extract the URLs from your sitemap, then crawl the resulting list in URL list mode.)
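
Extracting the URLs from a sitemap, as suggested above, can be done with a few lines of Python. This is a minimal sketch that assumes an XML sitemap (optionally gzip-compressed) already loaded as bytes. The function name `extract_urls` and the sample content are hypothetical, and the one-URL-per-line output is an assumption about what a URL list should look like, so check the format your crawl settings expect.

```python
import gzip
import xml.etree.ElementTree as ET


def extract_urls(sitemap_bytes):
    """Return the <loc> values found in an XML sitemap (plain or gzipped)."""
    # Gzip files start with the two magic bytes 0x1f 0x8b.
    if sitemap_bytes[:2] == b"\x1f\x8b":
        sitemap_bytes = gzip.decompress(sitemap_bytes)
    root = ET.fromstring(sitemap_bytes)
    urls = []
    for element in root.iter():
        # Tags look like "{namespace}loc"; compare the local name only.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


# Hypothetical sitemap content used for illustration.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://mysite.com/directory/page-1</loc></url>
  <url><loc>https://mysite.com/directory/page-2</loc></url>
</urlset>"""

url_list = extract_urls(sample)
print("\n".join(url_list))  # one URL per line
```

If the file is a sitemap index, the `<loc>` values returned are the child sitemaps rather than pages, so each child sitemap would need to be fetched and passed through the same function.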
