Welcome to OnCrawl's Getting Started Tutorials. Today we're going to look at how to set up and run a crawl.
Click on +Set up a new crawl from your project home page.
Each crawl uses a crawl profile, or a group of settings. We're currently looking at (and editing) the default settings. You can switch profiles here, but in this project, we only have the default profile for now.
Let's create a new one.
The best profile names explain what the crawl does, like "All subdomains" or "JS". I'm going to call this one "Full site - crawl only": it will crawl my whole site, without any cross-analysis.
This top set of settings is the main group of settings for this profile. They tell the OnCrawl bot:
- How to crawl. We're going to use the spider mode to explore pages by following links, but you could also crawl only the URLs in a list.
- What page or pages to start on. We can add additional start pages here, for example, to examine a multi-language site or a site with multiple home pages.
- When to stop crawling. I want to crawl my entire site, so I'm going to put in some wild numbers. This is limited only by your plan.
- What bot to use: desktop, mobile, and the user-agent name. The full user agent is listed here, if you need it.
- How fast to crawl. Don't take your own server down.
- What to do with parameters. If you want to crawl some, but not all, parameters, you can filter them here. Just list the parameter, not the values.
- How to treat subdomains. Generally, as Google considers these to be separate sites, you might not want to follow links to subdomains.
- How to deal with your sitemaps. Unless you're a real stickler for the sitemaps.org protocol, it's a good idea to analyze them in soft mode.
- And whether or not to use static IP addresses for the OnCrawl bot, if you need to whitelist bots.
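To make the parameter setting a little more concrete, here's a rough sketch in Python of what "list the parameter, not the values" means in practice. This is not how the OnCrawl bot is implemented; the function name filter_parameters and the example URL are made up for illustration. The idea is simply that you name the parameters to keep, and everything else is dropped from the URL, whatever its value.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def filter_parameters(url, keep):
    """Return url with only the query parameters whose names are in keep.

    Hypothetical helper for illustration: it filters on parameter names
    only and never looks at the values.
    """
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in keep
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Keep pagination, drop session and sorting parameters.
example = "https://www.example.com/shoes?page=2&sessionid=abc123&sort=price"
print(filter_parameters(example, {"page"}))
```

Filtering by name means two URLs that differ only in an ignored parameter end up as the same URL, which is usually what you want for things like session IDs.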
At the top of the page, click here to show extra, advanced settings. These options can be extremely helpful for some sites, but aren't always necessary.
They include things like:
- A robots.txt override for the OnCrawl bot only
- The ability to specify HTTP headers to be passed by the crawler, which can be useful in many cases, such as with geographic redirects
- A DNS override, which can be really helpful if you're crawling a site on a pre-prod server
- Server authentication parameters
- And what to do with cookies. In most cases, you can keep them.
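If you're wondering what a robots.txt override for one bot looks like, here's a minimal sketch using Python's standard urllib.robotparser module. The user-agent name ExampleCrawler and both robots.txt files are assumptions for the example, not the real OnCrawl user agent or rules. It compares a live file that blocks every bot with an override that opens the site to one crawler only.

```python
from urllib.robotparser import RobotFileParser

# Assumed live robots.txt: every bot is blocked.
LIVE = """User-agent: *
Disallow: /
"""

# Assumed override: one named crawler is allowed, everyone else stays blocked.
OVERRIDE = """User-agent: ExampleCrawler
Allow: /

User-agent: *
Disallow: /
"""


def allowed(robots_txt, user_agent, url):
    """Check whether user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "https://www.example.com/page"
print(allowed(LIVE, "ExampleCrawler", page))      # blocked by the live file
print(allowed(OVERRIDE, "ExampleCrawler", page))  # allowed by the override
print(allowed(OVERRIDE, "OtherBot", page))        # other bots are unaffected
```

The point of an override like this is that the rules change for one crawler only; the file that other bots see stays exactly as it is.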
Finally, you can integrate additional data into this crawl, such as data from:
- An analytics solution such as Google Analytics
- Google Search Console
- A backlink tool, such as Majestic
- A previous crawl with the same start page
- Your web pages. We'll pick up--or scrape--the information in their source code or their text as our bot crawls them
- Any other source, as long as you can provide it in CSV format.
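For the CSV option, the general idea is that each row is matched to a crawled page by its URL. Here's a small sketch of that matching in Python, assuming a CSV with a url column and one extra column called revenue. Both column names and the helper functions are hypothetical, and the format OnCrawl actually expects may differ.

```python
import csv
import io


def load_csv_by_url(csv_text, url_column="url"):
    """Index CSV rows by URL, keeping the remaining columns as extra fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row[url_column]: {k: v for k, v in row.items() if k != url_column}
        for row in reader
    }


def enrich(crawled_urls, extra):
    """Attach the extra fields to each crawled URL; pages without a row get {}."""
    return {url: extra.get(url, {}) for url in crawled_urls}


# Assumed sample data: one crawled page has a matching CSV row, one does not.
csv_text = "url,revenue\nhttps://www.example.com/,120\n"
crawled = ["https://www.example.com/", "https://www.example.com/about"]
print(enrich(crawled, load_csv_by_url(csv_text)))
```

In this sketch, the match is on the exact URL string, so the URLs in your file need to be written the same way the crawler sees them.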
Since I won't be connecting additional data today, I can simply save the new profile.
Or I can save it and launch my crawl.
To keep you occupied while we crawl, we take you straight to the crawl monitoring page, where you can track the crawl in real time. But you don't have to wait here; we'll send you an email when it's done.
Questions? Reach out to us from the OnCrawl interface by clicking on the blue Intercom button at the bottom of the screen, or tweet to us at @OnCrawl_CS.
See you next time!
Until then, happy crawling.