All status codes in the 300s. These codes are reported for a URL when the URL’s content has been moved to a new address. They are sometimes called “redirects”. Common 3xx codes include 301 (permanent redirect), 302 (temporary redirect) and 307 (temporary redirect that preserves the request method).
All status codes in the 400s. These codes are reported for a URL when the URL does not have content associated with it. They are considered “client” (user or browser) errors: the web server didn’t have a problem understanding the request or serving the content. 4xx errors can occur when a page’s URL is changed but not redirected, or when there is a typo in the URL or in the link to the URL (404), or when the user does not have permission to access the content (403).
All status codes in the 500s. These occur when the server is unable to complete the request. The server might be down (503) or it might have encountered an error (500).
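As a rough illustration of how the status code families above can be told apart, the following Python sketch maps a code to its family. The function name and the labels it returns are illustrative choices, not OnCrawl terminology.

```python
def status_class(code: int) -> str:
    """Map an HTTP status code to a broad family (labels are illustrative)."""
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "redirect (3xx)"
    if 400 <= code < 500:
        return "client error (4xx)"
    if 500 <= code < 600:
        return "server error (5xx)"
    return "other"
```

For example, `status_class(404)` falls in the client error family and `status_class(503)` in the server error family.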
Pages that receive organic visits from Google.
Ratio of the number of pages that receive organic visits from Google to the number of all pages crawled.
Text to be displayed when the image cannot be shown. It is provided as a parameter in the image tag: <img src="https://www.examplesite.com/exampleimage.jpg" alt="my example image">. This text is used by crawlers to understand the content of images. A best practice is to provide alt text for all images that play an important role as part of the page's content.
Clickable text that links to another page. Used by search engines as a factor in evaluating link value.
A means of protecting a page or website from public access by requiring the visitor to provide credentials via a username and password. This method is commonly implemented with HTTP Basic authentication backed by an .htpasswd file.
Links from other websites to pages on your website. High-quality backlinks attract traffic and provide authority to your site.
Selector that allows you to filter all graphics and tables in the current dashboard to show only one group of pages. Choose among any of the groups in the current segmentation.
A program that moves from webpage to webpage, usually by following links found on the page, and records information found on the page. Among other things, bots are used by search engines to index the web, and by SEOs to audit websites.
Authoritative or original version of content that might be found at multiple URLs. The canonical version of the content is indicated in the <head> of all pages with the content, using the tag <link rel="canonical" href="https://www.examplesite.com/canonicalURL/">. If multiple valid versions of the page exist, the canonical version is the one that should be indexed.
The case where all URLs with similar or identical content all indicate the same canonical URL.
Number of times a search listing is clicked on a search engine results page during the period being analyzed (by default: the past 45 days).
Group of pages sharing the same content or purporting to share the same content. This might be all pages with similar content, all pages indicating the same canonical URL, or all pages pointing to one another using hreflang declarations.
Small files left by a website on your computer to initialize or save information from one website visit to another. These might allow you to be tracked for analytics purposes, for example, or to keep you logged in.
The average number of times that pages are visited by Googlebots within a given period of time.
Crawl over crawl
OnCrawl’s crawl comparison feature that analyzes the differences between two crawls of the same website.
How fast the OnCrawl bot crawls your site, in pages per second. Higher speeds shorten the crawl, but might overwhelm your server.
Pages the OnCrawl crawler was able to request and obtain a valid response (a status code) for. These pages must be accessible to a crawler (not forbidden by robots.txt…), and be within the scope of the crawl (crawlable within the limits of URLs and depth that you set, on a subdomain you authorized for the crawl…). These pages may also be called "fetched" pages.
The addition of non-crawl information to the information collected for each URL. This allows the relationship between data from different sources to be analyzed, such as number of Googlebot visits vs page depth vs average SERP rank. Cross-analysis can provide particularly important insights into website performance and Google behaviors.
Abbreviation of “click-through rate”, or the rate at which a listing in the search engine results pages is clicked. CTR can indicate how attractive or appropriate a search listing is. CTR data is provided by Google through the Search Console.
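A minimal sketch of the calculation, assuming the usual definition of CTR as clicks divided by impressions, expressed as a percentage. The function name and the choice to return 0.0 when there are no impressions are assumptions made for illustration.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return CTR as a percentage; 0.0 when there are no impressions."""
    if impressions == 0:
        return 0.0
    return 100 * clicks / impressions
```

A listing shown 200 times and clicked 5 times would have a CTR of 2.5%.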
A personalized selection and order of charts created and saved using Tools > Dashboard builder. These dashboards allow you to follow certain metrics, focus on certain types of pages, or create custom versions of the report dashboards that OnCrawl makes available to all users.
Data for each URL that is obtained from scraping the website is saved in a custom field. Custom fields can be used in Data Explorer tables and filters, segmentations, and anywhere else OnCrawl metrics can be used.
A page of charts that is part of an analysis. Under “Crawl report”, for example, there is a “Summary” dashboard provided by default by OnCrawl.
Interface providing access to all data in the analysis. Best practice is to use the OnCrawl Query Language to obtain data that matches a specific query, or set of parameters, then to adjust which fields are shown as columns in the table. The table can then be filtered, sorted, and exported. Predefined filters can replace the use of OnCrawl Query Language. The Data Explorer also shows the detail behind any OnCrawl chart or chart section if you click on the visualization in a dashboard.
Integration of any outside data in an analysis, using JSON or CSV files you provide. This flexible feature allows you to include and analyze any data you have that can be associated with a URL. A few common uses include: GoogleAds integration, SEMRush integration, revenue integration, and page value integration.
Dataset (Data Explorer)
Data inspected through the Data Explorer. Data is saved in different sets to improve speed and processing of analyses. Complete link data, for example, is saved separately from crawl data. An exhaustive record of server log events, if you use log analysis, is also available. Datasets are inspected separately.
Desktop, mobile, or tablet. Device type is used as a filter option at the top of some OnCrawl dashboards.
Ability to use specific host names and IP addresses instead of using Domain Name System (DNS) servers to translate website host names into IP addresses. This crawl setting can sometimes be useful when crawling sandbox, pre-production, and other protected sites.
<title>, <h1> or <meta name="description" content="..." /> tags that have the same value(s) on multiple pages. This is not the same as duplicate content.
Duplicate content, managed
Page content that is identical or nearly identical on multiple pages, where the pages coherently indicate which page should be indexed. This can be achieved through using matching canonical tags in the cluster of duplicate pages, or by correctly using hreflang tags.
Duplicate content, problematic
Page content that is identical or nearly identical on multiple pages, but pages don’t coherently indicate which one should be indexed. This might mean there are errors in canonical declarations, errors in hreflang declarations, or no duplicate management strategy in place for these pages.
For a file (a website resource), the letters following the . in the file name that indicate what type of file it is. Common extensions on a website can include: .css, .js, .jpg, .png...
Average number of days between the date of the first Googlebot hit on a page and the date of the page's first organic visit from Google, according to the log files analyzed by OnCrawl.
Abbreviation for “file transfer protocol”, a standard method of transferring files to a server or online storage location. Plain FTP does not encrypt the transfer; secured variants such as FTPS and SFTP exist for that purpose. To use it, for example to provide log files to OnCrawl, you will need an FTP client such as FileZilla.
A request event recorded in a server log file. These events can be counted, and provide an accurate view of activity. Hits are used to measure Googlebot behavior on a website: every time a Googlebot requests a page or a resource, it is said to “hit” that page or resource.
Any HTML heading tag in the series <h1>, <h2>, <h3>, <h4>, <h5>, <h6>.
A tag used on international sites or groups of sites to indicate pages that contain the same content but are destined for different markets. Markets can be differentiated by language and by country. Hreflang tags can be implemented in the page <head>, in the HTTP header, or in an XML sitemap. They specify the URL of each version of the content, including the current URL, and the language and country associated with each.
Hreflang self declaration
An hreflang tag that provides the targeted language (and country, if necessary) of the URL on which it appears. Hreflang self declarations are required, unlike self-canonical declarations, which are only recommended.
Part of the standard response to an HTTP request. This includes fields and data about the web page that the server provides to a browser or client along with the content of the page.
The number of times a URL is shown in Google's search results. To be counted as an impression, the search listing has to be visible on the page being viewed by the search user. Google provides additional information regarding what counts as an impression.
Whether or not a URL can be indexed, or saved by Google for use in future search results. Indexability can be affected by instructions to bots ("noindex") or by additional characteristics. For example, Google does not index non-canonical pages. Indexable pages are sometimes called "compliant" pages in OnCrawl.
A step in a search engine's process of connecting web pages to users' search queries. Indexation is the process of saving a page to be used when creating search engine results pages; it occurs after pages are discovered. A page that is not indexed cannot be shown as a search result.
Inlinks (follow, nofollow)
Links to a page from other pages on the same website. Internal inlinks can be "nofollow" links, which indicate to bots that they should be ignored, or "follow" links. Follow links transmit Inrank to the page they link to.
OnCrawl's PageRank metric, which helps measure how Google understands the rank, or popularity, of a page within the website.
IP addresses (dynamic, static)
A numeric address of a computer connected to a network, like the internet. Computers use these addresses to communicate, rather than domain names. You might need the IP address of the OnCrawl bot to whitelist it, or allow it to access your website. Our bot uses addresses that can be dynamic (created each time the bot is used) or static (set and unchanging). If you need to whitelist the OnCrawl bot, you will need to use OnCrawl's static IP addresses.
Search terms your site ranks for that include the name of your brand or terms very closely linked to a search for the brand, rather than the products or services it offers. It is good practice to separate these terms from the rest of your keyword analysis because someone looking for your brand is almost always looking for your website, which can positively skew your site's search performance.
Search terms your site ranks for that do not include your brand name. It is good practice to analyze these terms apart from branded keywords.
Pages the OnCrawl bot has found a reference to, whether or not it has been able to crawl the page. Known pages include crawled pages, but also pages available in data from connectors that were not found by the crawler, links to pages forbidden to the crawler, links to pages that can't be explored because the crawl limits have been reached, etc.
Language code (hreflang)
A two-character reference in the hreflang declaration to the language in ISO 639-1 format.
The time it takes to load the page. OnCrawl measures the TTLB (time to last byte), from the time the bot makes the request to the time the bot receives the last byte from your web server.
A record of all requests for files kept by a web server. Log files are one of the few sources of information regarding Googlebot behavior on a website, among other data that can be extremely useful to SEO.
Log manager tools
OnCrawl's tool that allows you to monitor the upload, history and processing of log files.
HTML tags in the <head> section of a page that provide information to bots about what they are allowed to do on the page. These tags take the form: <meta name="robots" content="noindex, nofollow">. They can target specific bots by User-Agent, or all bots. In addition to the general commands in the "content" property, some bots may respond to specific commands. Google documents the list of directives it supports.
Missing hreflang declarations
Hreflang declarations that are missing between some pages in a cluster of translated pages. When OnCrawl groups all of the pages that reference one another through hreflang declarations, it analyzes the full group. Often, some pages are missing references to one or more other pages in the group.
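The check described above can be sketched as follows. This is not OnCrawl's implementation: the input format (a dictionary mapping each page to the set of URLs it declares through hreflang) and the function name are assumptions. The sketch expects every page in the cluster to reference every version, itself included.

```python
def missing_hreflang(declarations):
    """Return, for each page in the cluster, the versions it fails to declare.

    declarations: dict mapping a page URL to the set of URLs it lists in its
    hreflang declarations (hypothetical input format).
    """
    # The cluster is every page that declares or is declared by another page.
    cluster = set(declarations)
    for targets in declarations.values():
        cluster.update(targets)

    missing = {}
    for page in sorted(cluster):
        absent = cluster - declarations.get(page, set())
        if absent:
            missing[page] = sorted(absent)
    return missing
```

With three versions `/en`, `/fr` and `/de`, where `/fr` only declares itself and `/en`, the result reports that `/fr` is missing a reference to `/de`.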
Near-duplicate content (similar content)
Content that is nearly the same as content on another URL. OnCrawl uses the same algorithm as Google to analyze the content of each page and calculate a similarity ratio, based on all of the content on the page. Pages with high ratios share a large portion of their content. While not 100% identical, high similarity ratios indicate that the pages are near-duplicates, or that almost all of the content on the page is identical to content on a different page. These pages are often considered to be duplicate content by search engines.
Newly crawled pages
Pages crawled by Googlebot that have no history of previous Googlebot visits since the beginning of log monitoring by OnCrawl. These pages are usually pages that have been discovered by Google for the first time.
A term or phrase containing N number of words, where N represents any number. OnCrawl reports on the top Ngrams (1-, 2-, and 3-word phrases) used on your site. Grams are used in natural language processing.
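To make the idea concrete, here is a small Python sketch that counts the most frequent N-word phrases in a text. The tokenization rule (lowercase letters, digits, and apostrophes) is a simplifying assumption and is not how OnCrawl necessarily splits text.

```python
import re
from collections import Counter

def top_ngrams(text, n, k=3):
    """Return the k most frequent n-word phrases in text with their counts."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Slide a window of n words over the text.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams).most_common(k)
```

Calling `top_ngrams(page_text, 2)` would list the three most common 2-word phrases on a page.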
Everything that can be done on a page-level to improve a URL's visibility in search.
OnCrawl Query Language (OQL)
A system of search and filtering that is used to find information in OnCrawl after an analysis. OQL usually contains a series of statements or blocks of statements connected by AND or OR operators. A statement usually takes the form: [OnCrawl metric] [operator such as "contains" or "is greater than"] [value].
Protocol that allows a webpage to provide structured information to social networks in order to allow titles, images, descriptions, etc. to appear when a link is provided. Read more about the Open Graph Protocol on the dedicated website.
A page with no inlinks. It has been "orphaned" by the website structure, from which it is excluded. This can occur by error, or be intentional, for example in the case of a landing page for paid campaigns. Orphan pages are difficult to rank in search engines, and orphan pages that rank under-perform compared to similar pages because the website's architecture indicates that they are of little or no importance.
Outlinks (internal, external, follow, nofollow)
Links going out from a page to a different page, whether on the same website (internal outlinks) or on a different one (external outlinks). Like other links, they are followable by bots by default but they can be marked "nofollow" to prevent bots from following them and passing "link juice", or page importance, to the link's target page.
Number of steps from the home page (or Start URL) at which a given URL is found. This is measured in the minimum number of clicks required to get to the URL from the home page. For both technical and algorithmic reasons, pages deep in a website perform less well than pages high up in a website.
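The minimum number of clicks can be computed with a breadth-first search over the internal link graph. The sketch below assumes the link graph is available as a dictionary of page to linked pages, and gives the start page a depth of 0; a tool may just as well start counting at 1.

```python
from collections import deque

def page_depths(links, start):
    """Return the minimum click depth of every page reachable from start.

    links: dict mapping a page to the list of pages it links to
    (hypothetical input format).
    """
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            # The first time a page is reached is via a shortest path.
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths
```

Pages that never appear in the result are not reachable by following links from the start page.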
In a segmentation, pages that should be analyzed together. Pages can be grouped based on any metric: pages that convert, pages that use the same template, pages that have the same theme, pages in a specific website folder, pages that attract the most attention from Googlebots, slowest pages on the website, pages published within the past week...
Pagination (prev / next)
Pages containing tags that indicate that they are part of a paginated series. These tags take the form <link rel="prev" href="https://www.example.com/page1"> or <link rel="next" href="https://www.example.com/page3">. While no longer used by Google, these tags are useful for marking paginated series such as catalogs, archives, etc.
Parameters (on URLs)
Additional information added to the end of a URL, for example https://www.example.com/page?utm=email-list. Parameters are commonly used in ecommerce when sorting or filtering products, in marketing to track referring sources, as well as in other contexts. URLs with parameters often contain the same content as the same URL with different parameters and the same URL with no parameters, yet each set of parameters is considered to be a different URL. Consequently, crawling URLs with parameters can both increase the number of URLs from your monthly quota consumed by the crawl, and increase the number of duplicate or near-duplicate pages found in the crawl. You may want to limit the parameters included in a crawl in the crawl settings. Crawls that ignore parameters will treat all URLs with ignored parameters as the same URL, as if the parameters didn't exist.
Note that OnCrawl always normalizes certain URLs. OnCrawl re-orders parameters and their values alphabetically so that URLs with the same parameters in different orders can be treated as the same URL. For example, https://www.example.com/page?size=L&color=black will be saved and treated by OnCrawl as if it were https://www.example.com/page?color=black&size=L
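The re-ordering described above can be approximated with Python's standard library. This is a sketch of the idea, not OnCrawl's actual normalization code; the function name is illustrative.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def normalize_url(url):
    """Sort query parameters (and their values) alphabetically."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(params)))
```

Applied to https://www.example.com/page?size=L&color=black, this returns https://www.example.com/page?color=black&size=L, so both orderings end up as the same URL.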
A program's ability to break something down into its parts. In OnCrawl, the log analyzer must be able to parse entries in log files: that is, to be able to identify each type of information in a log line and extract it correctly. In most cases, this is done by recognizing patterns that appear in every line, and matching each pattern to the type of information it represents. If your log files are missing information, have unusual formatting, or combine pattern formats within a single file, it may make parsing difficult.
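As an illustration of pattern-based parsing, the sketch below extracts fields from one line in the widely used "combined" access log format. The pattern, the field names, and the sample line (including its IP address) are illustrative assumptions; your server's format may differ, and this is not the parser OnCrawl uses.

```python
import re

# One named group per type of information in the log line.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

SAMPLE_LINE = (
    '66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /page HTTP/1.1" 200 2326 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not fit the pattern."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None
```

A line that does not follow the expected pattern returns `None`, which is the situation described above where unusual formatting makes parsing difficult.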
The average rank of a URL on search engine results pages over the past 45 days, according to Google Search Console. It has been shown that the first 20 positions receive almost all of user clicks, and that the top three results on the first two pages perform better than other results on the same page. This is why OnCrawl groups pages that rank in the following categories: 1, 2-3, 4-10, 11-13, 14-20, 21-100, >100.
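The position groups listed above can be expressed as a simple lookup. How fractional averages such as 1.4 are assigned is an assumption here: the sketch places a value in the first group whose upper bound it does not exceed, which may not match OnCrawl's own rounding.

```python
# (upper bound, label) pairs taken from the groups named in the definition.
POSITION_GROUPS = [
    (1, "1"), (3, "2-3"), (10, "4-10"),
    (13, "11-13"), (20, "14-20"), (100, "21-100"),
]

def position_group(avg_position):
    """Return the label of the position group for an average position."""
    if avg_position < 1:
        raise ValueError("positions start at 1")
    for upper, label in POSITION_GROUPS:
        if avg_position <= upper:
            return label
    return ">100"
```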
A grouping of the data and the analyses pertaining to a specific website. As the highest-level object in OnCrawl, projects can also have properties that can affect the analyses that are run in them: they can be shared, they can be validated, and certain paid options can be applied to individual projects.
Text of parameters and their values that is added to the end of a URL, beginning with a "?". For more information on parameters, see "Parameters".
Preset OQL filters in the Data Explorer that allow you to quickly generate reports for common SEO issues such as all pages with external follow out-going links; pages with hreflang; indexable pages in structure not crawled by Google; indexable pages that can likely be optimized; and more.
A permanent or temporary server setting via an HTTP status code in the 300s that sends a visitor to a different URL. Redirects allow you to seamlessly provide the right content to visitors that arrive on an old URL when you modify a URL to make a correction or while migrating a website, or when you delete a page.
A series of redirects so that when visitors request page A, they are sent to page B, which sends them to page C… Google states that its bots only follow 5 redirects in a chain per crawling session. After a certain number of chained redirects, most browsers will also stop attempting to reach a displayable page.
A series of redirects in which a page in the series redirects to an earlier page in the series. This makes it impossible to reach a displayable page. Browsers will show an error message.
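Redirect chains and loops can both be detected by following redirects one hop at a time while remembering the URLs already seen. The sketch below works on a dictionary of known redirects rather than live requests; the input format, the function name, and the default limit of 5 hops (borrowed from the figure mentioned above) are illustrative assumptions.

```python
def follow_redirects(redirects, start, max_hops=5):
    """Follow redirects from start and report how the chain ends.

    redirects: dict mapping a URL to the URL it redirects to.
    Returns (chain, outcome) where outcome is "ok", "loop",
    or "too many redirects".
    """
    chain = [start]
    seen = {start}
    url = start
    while url in redirects:
        url = redirects[url]
        if url in seen:
            # Redirecting to an earlier page in the series is a loop.
            return chain + [url], "loop"
        chain.append(url)
        seen.add(url)
        if len(chain) - 1 > max_hops:
            return chain, "too many redirects"
    return chain, "ok"
```

A chain A to B to C ends with the outcome "ok" and a chain of length 3, whereas A to B to A is reported as a loop.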
Abbreviation for "Regular Expression", which is a pattern-matching tool that allows a program, a script, or a function to find one or more sequences that match the Regex criteria within a longer series of characters, such as a list of URLs or within a page. It can be used to find all URLs with a similar structure, for example. OnCrawl uses Regex in multiple ways. For example, Regex can be used to create a scraping rule, to define OQL filters, or to create page groups in a segmentation.
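For instance, a Regex can pick out all URLs that share a structure. The pattern below is hypothetical: it assumes product pages live under /products/ followed by a lowercase slug, which will not match every site.

```python
import re

# Hypothetical pattern: product pages under /products/<slug>
PRODUCT_URL = re.compile(r"^https://www\.example\.com/products/[a-z0-9-]+/?$")

def matching_urls(urls, pattern=PRODUCT_URL):
    """Return the URLs from the list that match the pattern."""
    return [url for url in urls if pattern.search(url)]
```

The same kind of pattern could serve as the basis for a page group in a segmentation.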
A tag found in the <head> part of the page that indicates the URL of alternate versions of the page's content, such as printer versions, mobile versions… It usually takes the form <link rel="alternate" href="https://www.example.com/alternate-version/">
A group of dashboards in OnCrawl analysis results: Crawl Report, SEO Impact Report, Ranking Report, Social Media Report, Backlink Report… Reports contain one or more dashboards of charts on a broad topic, often associated with the source(s) of the data in them.
Resource (log files)
A file that contains instructions for bots, particularly which directories on the website the bots are allowed or not allowed to visit. This file is usually the first file visited during any crawl session and is always found at the root (the home folder) of the web domain: https://www.example.com/robots.txt. Returning an HTTP status in the 500s for this file can temporarily prevent Google from crawling your site; a status in the 400s is generally treated by Google as if no robots.txt file existed.
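To see how such rules are interpreted, Python's standard library includes a robots.txt parser. The rules and the bot name "ExampleBot" below are made up for illustration, and individual crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: every bot is kept out of /private/
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if the rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, a URL under /private/ is refused while other URLs on the site remain crawlable.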
A crawl that is currently being executed. You can follow the progress of running crawls from the crawl monitoring page, where you can also cancel, pause, or end the crawl ahead of time.
A crawl that has been planned to run in the future. Scheduled crawls can be run once, or programmed to repeat. They can be consulted and modified from the project home page.
Obtaining and saving information on a web page when it is crawled by a bot. OnCrawl allows you to scrape any information from your web pages by defining the rules that describe where and how to identify the information you want to record. This is often used to record product prices, article publication dates, and other information.
SEA and vertical bots
Bots used by Google in addition to Googlebot. While Googlebot is the principal bot used in indexing, Google uses different bots for different purposes. The behavior of bots associated with SEA practices, such as GoogleAds, and bots associated with specific Google verticals (images, news…) is analyzed separately in OnCrawl.
A means of grouping URLs in order to analyze meaningful parts of a website. Segmentations are a set of page groups; each group uses OQL and any OnCrawl metric to define a set of similar pages within your website. Segmentations can be created at any time, and can be applied to any analysis (as long as the metrics used in the segmentation are also available in the analysis). You can switch between segmentations using the drop-down menu at the top of any dashboard.
An organic visit, or visit from a Google search engine results page, recorded by an analytics solution or recorded in a log file.
How similar the content of a cluster of two or more URLs is, given as a percentage. This is calculated using the same algorithms as Google.
An XML file that lists URLs on the website and, optionally, information about them, in order to help bots find and understand the website content. Sitemaps can be submitted to search engines like Google to help inform the search engine of important or new pages on the website in order to facilitate discovery and indexing. Sitemaps are generally located in the root folder where the URLs they contain are found.
Sitemap, type (news, image, video)
Special types of XML files for news articles, images or videos. Google uses these sitemaps in its indexing process. More information can be found in Google's documentation.
Soft mode (sitemaps)
An option in the crawl settings that allows the OnCrawl bot to ignore the standard sitemap protocol, particularly where sitemap location is concerned. If you do not follow the standard sitemap protocol, this allows OnCrawl to find and analyze your sitemaps anyway.
The URL or URLs where the OnCrawl bot should begin its analysis. Typically, this will be the home page or home pages (in the case of a multi-language site, for instance).
A 3-digit number provided by the server when replying to a request for a page or a resource. This code indicates the URL's availability status: OK (200), redirected (300 series), or various errors (400 and 500 series). As this code is part of the server's HTTP response, it is sometimes called an HTTP status code.
Structured data (Schema.org)
Information organized in a machine-readable format following the protocol provided by Schema.org. This format allows search engines to better identify and classify information, which in turn allows them to present that information as part of enriched search listings: breadcrumbs, review stars, thumbnails, FAQ lists… are all made possible by structured data. Structured data can also be used to reinforce a website's E-A-T.
A section of a site treated separately from the main site. Subdomains are visible in URLs as the part that comes between the protocol (https://) and the domain, for example, the "shop" in https://shop.example.com. To a certain extent, subdomains are treated by search engines as separate websites. OnCrawl allows you to limit your analysis to the same subdomain as your Start URL by default, but you can also choose in the crawl settings to include any subdomains that are linked to from pages on the same domain as your Start URL.
Tags (HTML, SEO)
HTML markup on a webpage. HTML tags that appear in OnCrawl analyses include encoded structured data, Open Graph data, Twitter card data, and "SEO tags". OnCrawl uses the term "SEO tags" to refer to HTML tags that play a significant, traditional role in on-page SEO: <title>, <meta name="description" content="...">, <h1>, and other heading tags.
Target (of 3xx page, of a link)
The page to which a redirect or a link points. That is, where the visitor will end up.
A comparison of the amount of content text to the amount of tags and code in the page source code. When there is significantly more code than text, it can be a sign of thin content (see below), or of code bloat (massive use of code that does not provide useful information to users or browsers, unnecessarily weighing down the page in terms of size and load time).
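One possible way to compute such a comparison is sketched below. It assumes the ratio is defined as the length of the visible text divided by the length of the full page source, and it ignores the contents of script and style elements; OnCrawl's exact formula may differ.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def text_to_code_ratio(html):
    """Return visible text length divided by total source length."""
    if not html:
        return 0.0
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so indentation does not count as text.
    text = " ".join(" ".join(parser.chunks).split())
    return len(text) / len(html)
```

A low value means that most of the page source is markup or code rather than text.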
Content, usually short content, that provides little or no information to a visitor.
Protocol that allows a webpage to provide structured information to Twitter in order to allow titles, images, descriptions, etc. to appear when a link is provided.
Unique pages crawled
Total number of unique (individual) pages crawled by Googlebot during the log monitoring period you set. Multiple hits on a single page are not counted; the number of individual pages crawled in that case would be 1.
An analysis tool that shows all data collected in an analysis for a single URL, including the source code as seen by the OnCrawl bot.
URL list/list mode
A crawl mode that examines all URLs in a list you provide, without following any links.
The identification of a bot or browser sent to a server as part of the request for website content.
A project with some assurance that you have permission to crawl the associated website. Validating a project unlocks crawl functions that require knowledge of a website or that require care and attention for responsible use, such as faster crawl speeds. You can validate a project by choosing “Verify ownership” from the project menu in the upper right-hand corner of the project home page.
A robots.txt file that is used ONLY by the OnCrawl bot to replace the actual robots.txt file on the website. This allows you to test robots.txt rules, or to bypass existing rules in order to crawl certain sections of your site without modifying the live version of your robots.txt file.
How big a page is, in bytes. This metric is often closely related to page speed, as it takes more time to transfer and load more bytes than it does to transfer and load fewer bytes.
The number of words on the page. Some SEOs believe this used to be a ranking factor. Today, it can be used to find pages with thin content. Correlations between word count and SEO performance are also common, though the ideal word count will vary from page type to page type and from website to website. “Treating a subject in depth” as recommended by Google and editorial style both influence content quality and word count, though more is not always better.
Language/region code used in an hreflang declaration to specify the URL to default to when no language or region is set. The page should be one that isn't targeted at any specific language or region, such as a homepage that uses a country selector.
Method of finding a particular piece of information on a webpage (or any XML document) by indicating at which part in the structure the information is found. XPath is commonly used in web scraping, including the creation of scraping rules for OnCrawl’s custom fields. There are multiple XPath cheat sheets available online.
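A short illustration of an XPath expression at work, using Python's standard library. The page snippet, the class name "price", and the function name are invented for the example. Note that ElementTree supports only a subset of XPath and requires well-formed XML, so scraping real-world HTML usually calls for a more complete engine.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed snippet standing in for a product page.
PAGE = """<html><body>
  <h1>Blue shoes</h1>
  <div class="product"><span class="price">49.90</span></div>
</body></html>"""

def scrape(document, xpath):
    """Return the text of every element matched by the XPath expression."""
    root = ET.fromstring(document)
    return [element.text for element in root.findall(xpath)]
```

Here, `scrape(PAGE, ".//span[@class='price']")` locates the price by describing where it sits in the page structure: a span element whose class attribute is "price".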