Data Ingestion with CSV files

Data ingestion with a CSV file is a powerful tool that lets you add outside data to any crawl. A step-by-step guide.

Updated over a week ago

Why use a CSV file for data ingestion?

Ingested data from a CSV or Json file allows you to do things like:

  • Create Page Segmentations based on the data in your CSV file (and not just the URL).

  • Add external data to your crawls even without premium/pro accounts with API access. You just need to be able to export your data from your additional tool.

  • Include performance insights for pages, such as Product Revenue, Average Spent… 

You will need to have access to the data ingestion option in OnCrawl.

Step 1: Create your CSV file

First, export or compile your data into a list where each line starts with a URL and then the various types of information for that URL.

If you're exporting data, it's best to export it directly in the CSV format, but other spreadsheet formats will also work. (If you're using a spreadsheet, remember that you'll lose formatting when you switch to a CSV format.)

This is what your file might look like:

As a spreadsheet:

In a text file:

Creating the perfect CSV file

Producing the perfect CSV file isn't as hard as it sounds. It's all in the format.

The format is what makes your CSV file readable by OnCrawl.

Here's what the right format looks like:

  • A header row listing the name of each field, in the correct order.

  • Field names in the header row with no special characters or spaces. You can use uppercase letters, lowercase letters, - and _ (they will be replaced by space at the end of process) but no accents and no other characters.

  • A URL field called "URL" or "url" in the header row.

  • One line per URL. You can't list your URL in multiple lines. (Be careful with some third-party data exports. For example, if you're pulling data from Semrush, you will need to use the export for the percentage of traffic per URL.)

  • A maximum of 30 fields per file. (If you need more than 30 fields, make multiple files, one for the first 30 fields, one for the next 29 fields and the URL field, and so on. If a URL appears in multiple files, we'll know what to do with the data!)

  • Full URLs listed the same way the crawler will find them. This is very important. If you're not sure what format to use, take a look at your last crawl report and use the format found there.

  • Decimal numbers expressed with a point instead of a comma. For example, use 2.5 and avoid 2,5. If you're providing percentages, write them as decimal values and do not use a % sign. For example, use 0.3 and avoid 30%.

  • If a field contains text, it's best to put quotes around the field content. This prevents conflicts when there are special characters within the field. Some software will do this for you.

  • If you have special characters (non-ASCII values), you must encode the file in UTF-8. This can be tricky in Excel, but we'll walk you through it below.

  • No more than 1024 characters per line.

Some characteristics of CSV files don't matter:

  • You can use a comma (,) or a semicolon (;) to separate fields in each line. Whichever format you pick, use it for the whole file and for any other files you want to ingest at the same time.

  • You can have blank fields for some URLs. The only required field is the URL.

  • If there's no data for a URL, you do not need to include it in the file.

Creating the right CSV format with Excel (PC only)

Excel for Mac will not export a UTF-8 CSV file. If you only have alphanumeric characters, you can still use Excel to view and modify your file. However, to be safe, we don't recommend using Excel on a Mac for CSV files.

The instructions below are for Excel for PC, but most spreadsheet applications can (and will) do the same work.

  1. Open Excel. Click on the Windows menu at the bottom left of your screen and type "Excel". Click on the Microsoft Excel icon.

  2. To open your csv file with UTF-8 encoding, you will need to import it. In the Data ribbon, choose "Import file".

  3. Follow the guided steps to import your file. You will have to tell Excel whether your fields are separated by commas (,) or by semicolons (;).

  4. Modify your file. For example, you may need to change relative links (/my-category/my-page.html) to absolute ones (https://mysite.com/my-category/my-page.html).

  5. To save your file, open the File menu and click Save as…

  6. Choose the format CSV.

  7. Under "Advanced options", choose the formatting "UTF-8"

 What should you do if you have a Mac? If you want to use a spreadsheet program, we recommend using Numbers instead. When you're ready to save, use the option File > Export to… and choose CSV.... Under "Advanced options", check that the encoding is listed as "UTF-8". That's all!

Step 2: Create your zip file

Now that you have created your CSV file, you will need to zip it. You can group multiple files in the same zip, as long as they have:

  • the same format. (They're all CSV files.)

  • the same fields. (They contain the same information for each URL, in the same order. This means that the first line of each file should be identical.)

If you have a file compression program installed, you can probably zip your files by selecting the files you want to zip together and right clicking on them. Choose "Zip with..." or "Compress". Make sure to choose the .zip type of compression. We do not support other types of compression at this time.

Step 3: Ingest away :)

Once you have a zip file, you can add it to OnCrawl as an additional data source.

From your project page, in the first block under Tasks, click "Add data sources".

Click on the "Data Ingestion" tab at the top of the page.

Drag your zip into the blue box, or click on the upload icon in the same box to open an upload window and navigate to your zip file.

We will process your file. Information for each file you upload will be displayed on this page.

File Status

  • Uploading: the zip file is being uploaded

  • Queued for processing: the zip file has been uploaded successfully and is waiting to be processed

  • Processing: the CSV file is being analyzed

  • Processed: the analyzation has finished and your file is ready

  • Error: an error occurred and your file is not ready. You can contact us for more information (click on the blue Intercom button at the bottom of your screen.)

Valid and invalid lines

We let you know how many lines we were able to process. These are valid lines.

We also tell you how many lines we couldn't handle. These are invalid lines. Usually, this is because there was an error or a typo somewhere in the line.

Types of data

Data can be expressed in different formats:

  • Text

  • Numbers

We analyze your files to assign the correct type of data you've entered. We also check each column to make sure that the type of data it contains is consistent from line to line.

Step 4: set up and launch your crawl

Make sure you've uploaded your data and that your files have been analyzed (they should have the status "Processed") before running your crawl!

Set up a crawl with data ingestion.

On the crawl setup page, scroll down to the Analysis section.

Click on "Data ingestion" to expand the section.

Tick the "Enable data ingestion" box.

From the "select a file" drop-down list, choose the file you uploaded earlier. You can also choose to upload a new file at this point.

Note: You can add multiple files. 

Save your crawl setup.

Launch the crawl with data ingestion

Launch a crawl. As URLs in the CSV file are discovered by the crawler, the information in the file is added to the information the crawler finds for that URL.

Step 5: enjoy your enhanced crawl results

Congratulations!

You can now use the new information from your CSV file to create page segmentations.

You'll also find all of these new metrics available for filtering or adding columns to any report in the Data Explorer, or to create new segmentations. You'll find them listed "User data: [the name of your field]". 

Going further

If you still have questions, drop us a line at @oncrawl_cs or click on the Intercom button at the bottom right of your screen to start a chat with us.

Happy crawling!

Did this answer your question?