In this guide, we will cover:

How does OnCrawl Data Ingestion work?

OnCrawl Data Ingestion lets you add custom fields from a third-party dataset to each URL of a crawl report, enriching your analysis.

Before launching a crawl, you can supply one or more files containing the third-party dataset fields you want to add to your report.

Go to your project home page and select ‘Launch a new crawl’. Then, if this option is activated on your account, click ‘Data Ingestion’ at the end of the Analysis section.

Check the ‘Enable data ingestion’ box and upload your .zip file. 

The file is then processed, and you can follow the number of lines parsed and their status.

At the end of the crawl, the ingested data is merged into the crawl on the URL field. It is very important to check that the URL field in the ingested data matches the URLs used in the crawl report, that is, the URLs fetched by the oncrawlbot.

Example:

https://www.oncrawl.com/seo-for-news-website-3-takeaways/

is different from

https://www.oncrawl.com/seo-for-news-website-3-takeaways

and

https://www.oncrawl.com/seo-for-news-website-3-takeaways/

is not the same as

http://www.oncrawl.com/seo-for-news-website-3-takeaways/
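The two mismatches above (trailing slash and protocol) can be spotted before upload with a small script. The following is a minimal sketch, assuming the merge requires the two URL strings to be identical; the function name explain_mismatch is hypothetical and not part of OnCrawl.

```python
from urllib.parse import urlsplit


def explain_mismatch(ingested_url: str, crawled_url: str) -> str:
    """Describe why two URLs differ, assuming an exact string match is required."""
    if ingested_url == crawled_url:
        return "match"

    a, b = urlsplit(ingested_url), urlsplit(crawled_url)
    reasons = []
    if a.scheme != b.scheme:
        reasons.append("scheme")  # http vs https
    if a.netloc != b.netloc:
        reasons.append("host")
    if a.path != b.path:
        # Paths that are equal once trailing slashes are removed differ only by the slash.
        if a.path.rstrip("/") == b.path.rstrip("/"):
            reasons.append("trailing slash")
        else:
            reasons.append("path")
    if a.query != b.query:
        reasons.append("query string")
    return "differs by: " + (", ".join(reasons) if reasons else "other")


print(explain_mismatch(
    "https://www.oncrawl.com/seo-for-news-website-3-takeaways/",
    "https://www.oncrawl.com/seo-for-news-website-3-takeaways",
))
```

Running a check like this against a list of crawled URLs exported from a previous crawl can help catch formatting differences before the data is ingested.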

Where and how can you use the data ingestion feature?

Following the crawl, you can view the ingested third-party data in:

  • The Data Explorer interface
  • The segmentation setup interface

The Data Explorer interface

Use the third-party custom fields to extract and visualize data using queries or by adding columns to datasets.

To browse your newly created custom fields, look for fields prefixed with "User data".

The filters available depend on the field type. For example, numerical values can be filtered with ‘Greater Than’, ‘Less Than’ or ‘Equals’.

Example

  • Ingest data from Google Search Console to detect URLs that have impressions on Google but are not crawled by OnCrawl.
  • Ingest data from ranking tools to analyze revenues by visits, average sales, signup conversions, etc.

The Segmentation setup interface

Use the third-party datasets to create Pages Segmentations (groups of pages) and obtain new insights regarding your website. 

Just as in the Data Explorer, you can create segments that take these fields into account and categorize groups of pages according to the values taken by these fields. 

Required File Format

The data can be in CSV or JSON format and must be uploaded as a ZIP archive containing either one or more CSV files, or one or more JSON files.

If you provide multiple files in a ZIP archive, they must obey the following rules:

  • All files must have the same format: all CSV, or all JSON
  • All files must have the same field set

Additionally, the content of each file must obey the following rules:

  • The file must be UTF-8 encoded if you need to handle non-ASCII characters in values.
  • You can supply up to 30 fields per file. Field names may only contain characters in the range [a-zA-Z0-9_-].
  • Each line must be less than 1024 characters long.
  • Field values can be of type String, Integer or Float.
  • Number format: use a point as the decimal separator. Don't use a comma.
  • If you have sparse data, you do not need to supply every field for URLs that lack a value for it.
  • If you have no data for a URL, it can be absent from the file.
  • The full URL must be provided in the field named URL (or url).

Example

/seo-for-news-website-3-takeaways/ is not correct

https://www.oncrawl.com/seo-for-news-website-3-takeaways/ is correct

It is very important to check that the URL field in the supplied data matches the URLs in the crawl report.

Example:

https://www.oncrawl.com/seo-for-news-website-3-takeaways/

is different from

https://www.oncrawl.com/seo-for-news-website-3-takeaways

and

https://www.oncrawl.com/seo-for-news-website-3-takeaways/

is not the same as

http://www.oncrawl.com/seo-for-news-website-3-takeaways/
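Several of the file rules in this section can be checked locally before uploading. The following is a minimal sketch of such a check, not an OnCrawl tool: all function names are hypothetical, the 30-field limit is assumed to count every field including URL, and a "full URL" is assumed to mean one with an http or https scheme and a host.

```python
import re
from urllib.parse import urlsplit

FIELD_NAME = re.compile(r"[a-zA-Z0-9_-]+")   # allowed characters for field names
COMMA_DECIMAL = re.compile(r"-?\d+,\d+")     # e.g. "4,7" instead of "4.7"
MAX_FIELDS = 30                              # assumed to include the URL field
MAX_LINE_LENGTH = 1024                       # each line must be shorter than this


def check_field_names(names):
    """Return a list of problems found in the field names of one file."""
    problems = []
    if len(names) > MAX_FIELDS:
        problems.append(f"{len(names)} fields supplied, limit is {MAX_FIELDS}")
    if "URL" not in names and "url" not in names:
        problems.append("missing URL (or url) field")
    for name in names:
        if not FIELD_NAME.fullmatch(name):
            problems.append(f"invalid field name: {name!r}")
    return problems


def check_line(line):
    """Return a list of problems for one raw line of the file."""
    if len(line) >= MAX_LINE_LENGTH:
        return [f"line has {len(line)} characters, must be fewer than {MAX_LINE_LENGTH}"]
    return []


def is_full_url(value):
    """True if the value has an http(s) scheme and a host (assumed meaning of 'full URL')."""
    parts = urlsplit(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def uses_comma_decimal(value):
    """True if a raw value looks like a number written with a comma."""
    return bool(COMMA_DECIMAL.fullmatch(value))


print(check_field_names(["URL", "clicks", "avg position"]))
print(is_full_url("/seo-for-news-website-3-takeaways/"))
```

These checks only cover the rules that can be verified without access to the crawl itself. Whether the URLs match those in the crawl report still has to be compared against the crawled URLs.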

CSV

In the CSV format, you must have a header row that contains all the field names in your CSV. You must have a field named URL.

  • Separators: both , and ; are supported and auto-detected.
  • Subsequent rows must contain a value for the URL field, and a value for each additional field you want to add to this URL.

Example:
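As an illustration of the CSV rules above, the following sketch builds a small CSV file with a header row and packages it into a ZIP archive using Python's standard library. The field names gsc_impressions and gsc_position, the file names, and all values are made up for the example.

```python
import csv
import io
import zipfile

# Hypothetical field names and values, for illustration only.
FIELDS = ["URL", "gsc_impressions", "gsc_position"]
ROWS = [
    {"URL": "https://www.oncrawl.com/seo-for-news-website-3-takeaways/",
     "gsc_impressions": 1200, "gsc_position": 4.7},
    {"URL": "https://www.oncrawl.com/",
     "gsc_impressions": 5400, "gsc_position": 2.1},
]


def build_csv(rows, fieldnames):
    """Return CSV text with a header row, using ',' as the separator."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def build_zip(files):
    """Return the bytes of a ZIP archive; files maps file name to text content."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in files.items():
            zf.writestr(name, text.encode("utf-8"))  # UTF-8, as required
    return archive.getvalue()


csv_text = build_csv(ROWS, FIELDS)
zip_bytes = build_zip({"gsc_data.csv": csv_text})

print(csv_text)

# Write the archive to disk so it can be uploaded.
with open("data_ingestion.zip", "wb") as f:
    f.write(zip_bytes)
```

The printed CSV starts with the header row URL,gsc_impressions,gsc_position, followed by one row per URL. Floats are written with a point as the decimal separator, which matches the number format rule.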

JSON

In the JSON format, you must provide exactly one JSON object per line. The object's properties are the field names, with their corresponding values.

 
The object must contain at least a field named URL.

All values must be primitive: String, Integer or Float. Complex values like lists or nested objects are not supported.

Example:
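As an illustration of the JSON rules above, the following sketch writes one JSON object per line and packages the result into a ZIP archive. The field names, file names and values are made up for the example, and rejecting booleans is an assumption based on the list of supported types.

```python
import io
import json
import zipfile

# Hypothetical records. The second one omits a field, as allowed for sparse data.
RECORDS = [
    {"URL": "https://www.oncrawl.com/seo-for-news-website-3-takeaways/",
     "gsc_impressions": 1200, "gsc_position": 4.7},
    {"URL": "https://www.oncrawl.com/", "gsc_impressions": 5400},
]


def to_json_lines(records):
    """Serialize records as one JSON object per line, rejecting non-primitive values."""
    lines = []
    for record in records:
        if "URL" not in record and "url" not in record:
            raise ValueError("each object needs a URL (or url) field")
        for key, value in record.items():
            # Only String, Integer and Float are listed as supported types.
            if isinstance(value, bool) or not isinstance(value, (str, int, float)):
                raise ValueError(f"unsupported value for {key!r}: {value!r}")
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


json_text = to_json_lines(RECORDS)

archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("gsc_data.json", json_text.encode("utf-8"))

print(json_text)

# Write the archive to disk so it can be uploaded.
with open("data_ingestion_json.zip", "wb") as f:
    f.write(archive.getvalue())
```

Each printed line is a complete JSON object on its own. The file as a whole is not a single JSON array.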
