Big Data Export for Google Bucket

Follow this tutorial to use the Big Data Export to send crawl and link data to your Google bucket.


We'll cover setting up the Big Data Export to send your data to your Google bucket in the following steps:

  1. Add Oncrawl's service account

  2. Request a new data export

  3. Check the status of your export

  4. Get a list of your exports

We'll also explain the notifications associated with this feature, and how to access the exported data directly from your Google bucket.

Google Cloud Configuration

Step 1: Add our Service Account

To export data to a Google Cloud Storage bucket, you must allow our service account to write to the desired bucket.

You must grant the following roles to our service account using IAM (a command-line sketch follows the list):

  • roles/storage.legacyBucketReader
    - storage.objects.list
    - storage.buckets.get

  • roles/storage.legacyBucketWriter
    - storage.objects.list
    - storage.objects.create
    - storage.objects.delete
    - storage.buckets.get

  • roles/storage.legacyObjectCreator
    - storage.objects.create
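
For example, you can grant these roles from the command line with gsutil. This is a minimal sketch: ONCRAWL_SERVICE_ACCOUNT and YOUR_BUCKET are placeholders for the Oncrawl service account email and your own bucket name.

# Minimal sketch: grant the three legacy roles to the Oncrawl service account on your bucket.
# ONCRAWL_SERVICE_ACCOUNT and YOUR_BUCKET are placeholders.
SA="serviceAccount:ONCRAWL_SERVICE_ACCOUNT"
gsutil iam ch \
  "$SA:roles/storage.legacyBucketReader" \
  "$SA:roles/storage.legacyBucketWriter" \
  "$SA:roles/storage.legacyObjectCreator" \
  gs://YOUR_BUCKET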

Oncrawl Configuration

What you need before you start

  • A crawl (completed and not archived) and its ID

  • An active Oncrawl subscription

  • An Oncrawl API token with the account:read and account:write scopes. See the Oncrawl help center for help with creating API tokens.

This procedure uses the Oncrawl API. You can find the full documentation in the Oncrawl API reference.
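
Before going further, you can verify that your token is accepted by calling the data export listing endpoint documented in Step 4 below. This is a minimal sketch; ACCESS_TOKEN is your API token.

# A 200 response with a "data_exports" array confirms the token is accepted.
curl "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"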

Step 2: Request a new data export

You will need to use the following commands to create a new data export.

You can export the following datasets:

  • page

  • link

  • raw_html

  • cluster

  • keyword

  • structured_data

Here are some examples:

  • Exports your page data as JSON to gs://export-parquet/pages. You can also use parquet or csv as the output format.

HTTP Request

curl -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
  "data_export": {
    "data_type": "page",
    "resource_id": "YOUR_CRAWL_ID",
    "output_format": "json",
    "target": "gcs",
    "target_parameters": {"gcs_bucket": "export-parquet", "gcs_prefix": "pages"}
  }
}
EOF

  • Parameters:

data_type (required)
One of the data types listed above, for example page or link

resource_id (required)
ID of the crawl to export; the crawl must not be archived

output_format (required)
Can be either json, csv or parquet

output_format_parameters (optional)
An object with the configuration for the selected output_format

target (required)
Can be either s3 or gcs; use gcs to export to a Google Cloud Storage bucket

target_parameters (required)
An object with the configuration for the selected target
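
Putting these parameters together, a request that exports page data in Parquet format might look like the following sketch. The bucket and prefix values are illustrative; reuse your own.

curl -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d '{"data_export":{"data_type":"page","resource_id":"YOUR_CRAWL_ID","output_format":"parquet","target":"gcs","target_parameters":{"gcs_bucket":"export-parquet","gcs_prefix":"pages"}}}'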

  • Exports your link data as CSV to gs://export-parquet/links

HTTP Request

curl -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ***********************************" \
-H "Content-Type: application/json" \
-d '{"data_export":{"data_type":"link","resource_id":"5e9e8be0726794000141abf0","output_format":"csv","target":"gcs","target_parameters":{"gcs_bucket":"export-parquet","gcs_prefix":"links"}}}'

HTTP Response

{
  "data_export": {
    "data_type": "link",
    "export_failure_reason": null,
    "id": "5eb97604451c95250a96b41e",
    "output_format": "csv",
    "output_format_parameters": null,
    "output_row_count": null,
    "output_size_in_bytes": null,
    "requested_at": 1589212676000,
    "resource_id": "5e9e8be0726794000141abf0",
    "status": "REQUESTED",
    "target": "gcs",
    "target_parameters": {
      "gcs_bucket": "export-parquet",
      "gcs_prefix": "links"
    }
  }
}
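
The id field of this response is what you will use in Step 3 to check on the export. As a minimal sketch, assuming jq is installed, you can send the same request and capture the id in a shell variable:

# Create the export and keep the returned export id for later status checks.
EXPORT_ID=$(curl -s -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d '{"data_export":{"data_type":"link","resource_id":"YOUR_CRAWL_ID","output_format":"csv","target":"gcs","target_parameters":{"gcs_bucket":"export-parquet","gcs_prefix":"links"}}}' \
| jq -r '.data_export.id')
echo "$EXPORT_ID"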

Step 3: Check the status of your export

  • Requests the status of your export.

HTTP Request

curl "https://app.oncrawl.com/api/v2/account/data_exports/<data_export.id>" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"

HTTP Response

{
  "data_export": {
    "id": string,
    "data_type": enum,
    "resource_id": string,
    "output_format": enum,
    "output_format_parameters": object,
    "target": enum,
    "target_parameters": object,
    "status": enum,
    "export_failure_reason": string,
    "output_row_count": integer,
    "output_size_in_bytes": integer,
    "requested_at": integer
  }
}
  • Properties are:

id
The unique identifier of the file.

data_type
Data type that was exported; can be either page or link.

resource_id
The unique identifier of the crawl.

output_format
Format used for the data export, can be either json, csv or parquet.

output_format_parameters
Parameters that were used by the output_format.

target
Target used for the data export, can be either s3 or gcs.

target_parameters
Parameters that were used by the target.

status
Current status of the export, can be:

  • REQUESTED: Request received but not yet handled

  • EXPORTING: Data is currently being exported in the desired format

  • UPLOADING: Data is currently being uploaded to the desired target

  • FAILED: An error occurred and the export could not complete successfully

  • DONE: Export completed successfully

export_failure_reason
Exists only if status is FAILED; contains the reason why the export failed.

output_row_count
Number of items that were exported.

output_size_in_bytes
Total size of the exported data.

requested_at
UTC timestamp when the export was requested.

Example:

curl "https://app.oncrawl.com/api/v2/account/data_exports/5e8b351c451c952cd1664381" \
-H "Authorization: Bearer ***********************************" \
-H "Content-Type: application/json"

Step 4: Get a list of your exports

  • Requests a list of your exports.

HTTP Request

curl "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"
  • You can filter on the following properties:
    - status
    - resource_id
    - data_type
    - output_format

See the Oncrawl API documentation for details on how to filter and paginate.

HTTP Response

{
  "data_exports": [ data_export ]
}

Example:

curl "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ***********************************" \
-H "Content-Type: application/json"

Notification

If everything goes well, you will receive an email with the Google Cloud Storage link to your export.

If we encounter a problem with the export, we will send you an email listing the error we received when trying to copy your data.

Get your data directly from your bucket

You can use the gsutil cp command:

gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [SAVE_TO_LOCATION]

Where:

  • [BUCKET_NAME] is the name of the bucket containing the object you are downloading. For example, my-bucket.

  • [OBJECT_NAME] is the name of the object you are downloading. For example, pets/dog.png.

  • [SAVE_TO_LOCATION] is the local path where you are saving your object. For example, Desktop/Images.

If successful, the response will look like the following example:

Operation completed over 1 objects/58.8 KiB.
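
For the exports created in this tutorial, downloading everything under the pages prefix could look like the following sketch, where ./oncrawl-pages is an arbitrary local folder and -m copies files in parallel.

# Download all exported files under the "pages" prefix of the bucket used in Step 2.
gsutil -m cp -r gs://export-parquet/pages ./oncrawl-pages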

The Google Cloud Storage documentation also provides equivalent download examples for common programming languages.
