Data Ingestion

How to add external data to a crawl without using a connector


Data ingestion is the method used to include external information that cannot be found on your website. It lets you add additional fields (metrics) to your crawl analysis.

When analyzing a website or monitoring certain key web pages, you might need information about these pages that cannot be collected directly by Oncrawl:

  • Identify pages related to paid/SEA campaigns

  • Include analysis of price margins or earnings on key URLs

  • Analyze URLs appearing in specific SERP or Search features, such as featured snippets or Google Discover

  • Draw information from an outside source like a PIM

  • Blend technical SEO data with data from tools like Semrush or Ahrefs

  • Ingest analytics data for conversions

Oncrawl can easily integrate and analyze this information — or any information you can list by URL and in CSV or JSON format. This type of information can even be used to break down your URLs into segments.

Required File Format

Oncrawl supports ingestion files delivered as a ZIP archive.

ZIP archives

The ZIP archive must contain one or more CSV or JSON files.

  • All files in the same archive must have the same format: all CSV, or all JSON

  • All files in the same archive must have the same field set (columns in a CSV file or object properties in a JSON file)
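
As a rough sketch, an archive like this could be put together with Python's standard zipfile module (the file names below are hypothetical, and both files are assumed to use the same CSV columns):

import zipfile

# Pack the ingestion files into a single ZIP archive.
# Every file in the archive must use the same format and the same field set.
with zipfile.ZipFile("ingestion.zip", "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.write("sea_campaigns_2023.csv")
    archive.write("sea_campaigns_2024.csv")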

CSV and JSON files

  • Files must be UTF-8 encoded if they contain non-ASCII characters.

  • Each file can contain up to 30 fields (columns or properties).

  • Field (column or property) names must be compatible with analysis. Column labels or JSON properties should only use the letters A-Z in uppercase, a-z in lowercase, numbers 0-9, and _ (underscores).

  • Lines must be less than 1024 characters long.

  • Fields can be string, integer, or float types.

  • If using numbers: use a point to separate decimal values. Don't use a comma.

  • If you don't have information for each field for every URL in the file, that's ok. Leave the column blank or skip the property for URLs that don't have that information.

  • You do not need to list all URLs on your site. If you have no new information for a URL, it does not need to be listed in your ingestion file.

  • Every file must contain the full URL in a field named URL or url.

Example:

  • Correct URL format: https://www.oncrawl.com/seo-for-news-website-3-takeaways/

  • Incorrect URL format: /seo-for-news-website-3-takeaways/

Make sure that your file contains URLs listed the same way they appear in the Oncrawl crawl report. You can check the format of URLs in the Data explorer for any analysis, by adding the column Source and filtering for Oncrawl bot. This list shows the exact format of URLs the Oncrawl crawler used.

Example:

https://www.oncrawl.com/seo-for-news-website-3-takeaways/

is different than the same URL without a trailing slash:

https://www.oncrawl.com/seo-for-news-website-3-takeaways

and

https://www.oncrawl.com/seo-for-news-website-3-takeaways/

is not the same as the same URL in http (not https):

http://www.oncrawl.com/seo-for-news-website-3-takeaways/
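
If you want to sanity-check a file against the constraints above before zipping it, a rough Python sketch could look like this (the file name ingestion.csv is hypothetical, the sketch assumes a comma-separated CSV, and it is not an official Oncrawl tool):

import csv
import re

# Field names may only use A-Z, a-z, 0-9 and underscores.
FIELD_NAME = re.compile(r"^[A-Za-z0-9_]+$")

with open("ingestion.csv", encoding="utf-8", newline="") as f:
    header = next(csv.reader(f))

assert len(header) <= 30, "no more than 30 fields per file"
assert all(FIELD_NAME.match(name) for name in header), "invalid field name"
assert "url" in header or "URL" in header, "a URL or url field is required"

# Lines must be less than 1024 characters long.
with open("ingestion.csv", encoding="utf-8") as f:
    assert all(len(line.rstrip("\n")) < 1024 for line in f), "line too long"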

Requirements for CSV formats

  • CSV files must begin with a header row that contains all of the field (column) names in the file.

  • You must have a field named URL or url.

  • Separators: use either , or ; to separate columns. Oncrawl will detect which you are using.

  • All rows or lines must contain a value in the URL field.

  • Rows can optionally contain values for the additional fields you want to add to the URL mentioned in the same row.

Example:
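
A minimal CSV along these lines would meet the requirements above (the campaign and margin columns are hypothetical, and the URLs are illustrative; only the url column is required):

url,campaign,margin
https://www.oncrawl.com/seo-for-news-website-3-takeaways/,summer-sale,12.5
https://www.oncrawl.com/another-page/,,8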

Requirements for JSON formats

  • JSON files must contain exactly one object per line.

  • The object properties will be used as the names of fields, with their corresponding values.

  • The object must contain a property named URL or url.

  • The object can optionally contain additional properties with values for the additional fields you want to add to the URL mentioned in the same object.

  • All values must be primitive: string, integer, or float. Complex values like lists or nested objects are not supported.

Example:
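
A minimal JSON file along these lines would meet the requirements above, with exactly one object per line (the campaign and margin properties are hypothetical, and the URLs are illustrative):

{"url": "https://www.oncrawl.com/seo-for-news-website-3-takeaways/", "campaign": "summer-sale", "margin": 12.5}
{"url": "https://www.oncrawl.com/another-page/", "margin": 8}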

Using Data ingestion in Oncrawl 

Before launching a crawl, you provide one or more CSV or JSON files (in a ZIP archive) containing the external information you want to add to your report.

Adding files from the Data sources page

Files can be added from the Data sources page. From the project home, click on Add data sources.

Switch to the Data ingestion tab.

Drop a ZIP file containing one or more CSV or JSON files into the import area.

The files are processed. Click on a file to view more information about the number of lines processed, the fields (metrics) that were extracted, and any lines that could not be ingested.

Adding uploaded files to a crawl profile

Go to the project home, under + Set up a new crawl. Then, click on Data ingestion at the bottom of the Analysis section.

Check the Enable data ingestion box and then choose your file or files from the drop-down menu. You can add multiple files to the same crawl.

During the crawl analysis process, the ingested data are merged into the crawl for each URL.

Where and how to use the data ingestion feature?

When the analysis is complete, there are multiple ways to leverage ingested data:

  • In the Data explorer reports

  • By using segmentations

  • By using alerts

Data explorer reports

Use the User data metrics to filter your website data using OQL queries, or to add columns to report tables.

Data you ingested will appear under the User data subsection in the lists of available metrics.

Segmentations

Use ingested data to create page groups to help you understand your website based on information you imported. 

Just as in the Data explorer, you can create segments to categorize groups of pages according to the values of the metrics you created through data ingestion.

Here is an example using analytics data from Matomo:

Alerts

Use ingested data to create alerts to help you monitor key metrics or groups of pages on your website.

As seen before, you can use metrics created through data ingestion to focus on alerts for a group of pages, or even set up alerts based on the ingested data itself.

Here is an example using analytics data from Matomo, which will send an alert if the pages with very few SEO visitor sessions have indexability or duplication issues:
