Big Data Export for Google Bucket

Follow this tutorial to use the Big Data Export to send crawl and link data to your Google bucket.


We'll cover setting up the Big Data Export to send your data to your Google bucket in the following steps:

  1. Add Oncrawl's service account

  2. Request a new data export

  3. Check the status of your export

  4. Get a list of your exports

We'll also explain the notifications associated with this feature, and how to access the exported data directly from your Google bucket.

Google Cloud Configuration

Step 1: Add our Service Account

To export data to a Google Cloud Storage bucket, you must allow our service account to write to the desired bucket.

You must grant the following roles to our service account using IAM (a command-line sketch follows the list):

  • roles/storage.legacyBucketReader
    - storage.objects.list
    - storage.buckets.get

  • roles/storage.legacyBucketWriter
    - storage.objects.list
    - storage.objects.create
    - storage.objects.delete
    - storage.buckets.get

  • roles/storage.legacyObjectCreator
    - storage.objects.create
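
For example, you can grant these roles from the command line with gsutil. This is a minimal sketch: ONCRAWL_SERVICE_ACCOUNT and YOUR_BUCKET are placeholders for the Oncrawl service account email and your own bucket name.

# Minimal sketch: grant the three legacy roles to the Oncrawl service account on your bucket.
# ONCRAWL_SERVICE_ACCOUNT and YOUR_BUCKET are placeholders.
SA="serviceAccount:ONCRAWL_SERVICE_ACCOUNT"
gsutil iam ch \
  "$SA:roles/storage.legacyBucketReader" \
  "$SA:roles/storage.legacyBucketWriter" \
  "$SA:roles/storage.legacyObjectCreator" \
  gs://YOUR_BUCKET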

Oncrawl Configuration

What you need before you start

  • A crawl (completed and not archived) and its ID

  • An active Oncrawl subscription

  • An Oncrawl API token with the account:read and account:write scopes. See the Oncrawl help center for help with creating API tokens.

This procedure uses the Oncrawl API. You can find the full documentation in the Oncrawl API reference.
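
Before going further, you can verify that your token is accepted by calling the data export listing endpoint documented in Step 4 below. This is a minimal sketch; ACCESS_TOKEN is your API token.

# A 200 response with a "data_exports" array confirms the token is accepted.
curl "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"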

Step 2: Request a new data export

You will need to use the following commands to create a new data export.

You can export the following datasets:

  • page

  • link

  • raw_html

  • cluster

  • keyword

  • structured_data

Here are some examples:

  • Exports your page data as JSON to gs://export-parquet/pages. You can also use parquet or csv as the output format.

HTTP Request

curl -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
  "data_export": {
    "data_type": "page",
    "resource_id": "YOUR_CRAWL_ID",
    "output_format": "json",
    "target": "gcs",
    "target_parameters": {"gcs_bucket": "export-parquet", "gcs_prefix": "pages"}
  }
}
EOF

  • Parameters:

data_type (required)
One of the data types listed above, for example page or link

resource_id (required)
ID of the crawl to export; the crawl must not be archived

output_format (required)
Can be either json, csv or parquet

output_format_parameters (optional)
An object with the configuration for the selected output_format

target (required)
Can be either s3 or gcs; use gcs to export to a Google Cloud Storage bucket

target_parameters (required)
An object with the configuration for the selected target
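
Putting these parameters together, a request that exports page data in Parquet format might look like the following sketch. The bucket and prefix values are illustrative; reuse your own.

curl -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d '{"data_export":{"data_type":"page","resource_id":"YOUR_CRAWL_ID","output_format":"parquet","target":"gcs","target_parameters":{"gcs_bucket":"export-parquet","gcs_prefix":"pages"}}}'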

  • Exports your link data as CSV to gs://export-parquet/links

HTTP Request

curl -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ***********************************" \
-H "Content-Type: application/json" \
-d '{"data_export":{"data_type":"link","resource_id":"5e9e8be0726794000141abf0","output_format":"csv","target":"gcs","target_parameters":{"gcs_bucket":"export-parquet","gcs_prefix":"links"}}}'

HTTP Response

{
  "data_export": {
    "data_type": "link",
    "export_failure_reason": null,
    "id": "5eb97604451c95250a96b41e",
    "output_format": "csv",
    "output_format_parameters": null,
    "output_row_count": null,
    "output_size_in_bytes": null,
    "requested_at": 1589212676000,
    "resource_id": "5e9e8be0726794000141abf0",
    "status": "REQUESTED",
    "target": "gcs",
    "target_parameters": {
      "gcs_bucket": "export-parquet",
      "gcs_prefix": "links"
    }
  }
}
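
The id field of this response is what you will use in Step 3 to check on the export. As a minimal sketch, assuming jq is installed, you can send the same request and capture the id in a shell variable:

# Create the export and keep the returned export id for later status checks.
EXPORT_ID=$(curl -s -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d '{"data_export":{"data_type":"link","resource_id":"YOUR_CRAWL_ID","output_format":"csv","target":"gcs","target_parameters":{"gcs_bucket":"export-parquet","gcs_prefix":"links"}}}' \
| jq -r '.data_export.id')
echo "$EXPORT_ID"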

Step 3: Check the status of your export

  • Requests the status of your export.

HTTP Request

curl "https://app.oncrawl.com/api/v2/account/data_exports/<data_export.id>" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"

HTTP Response

{
  "data_export": {
    "id": string,
    "data_type": enum,
    "resource_id": string,
    "output_format": enum,
    "output_format_parameters": object,
    "target": enum,
    "target_parameters": object,
    "status": enum,
    "export_failure_reason": string,
    "output_row_count": integer,
    "output_size_in_bytes": integer,
    "requested_at": integer
  }
}
  • Properties are:

id
The unique identifier of the file.

data_type
Data type that was exported; can be either page or link.

resource_id
The unique identifier of the crawl.

output_format
Format used for the data export, can be either json, csv or parquet.

output_format_parameters
Parameters that were used by the output_format.

target
Target used for the data export, can be either s3 or gcs.

target_parameters
Parameters that were used by the target.

status
Current status of the export, can be:

  • REQUESTED: Request received but not yet handled

  • EXPORTING: Data is currently being exported in the desired format

  • UPLOADING: Data is currently being uploaded to the desired target

  • FAILED: An error occurred and the export could not complete successfully

  • DONE: Export completed successfully

export_failure_reason
Exists only if status is FAILED; contains the reason why the export failed.

output_row_count
Number of items that were exported.

output_size_in_bytes
Total size of the exported data.

requested_at
UTC timestamp when the export was requested.

Example:

curl "https://app.oncrawl.com/api/v2/account/data_exports/5e8b351c451c952cd1664381" \
-H "Authorization: Bearer ***********************************" \
-H "Content-Type: application/json"

Step 4: Get a list of your exports

  • Requests a list of your exports.

HTTP Request

curl "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"
  • You can filter on the following properties:
    - status
    - resource_id
    - data_type
    - output_format

See the Oncrawl API documentation for details on how to filter and paginate.

HTTP Response

{
  "data_exports": [ data_export ]
}

Example:

curl "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ***********************************" \
-H "Content-Type: application/json"

Notification

If everything goes well, you will receive an email with the Google Cloud Storage link to your export.

If we encounter a problem with the export, we will send you an email listing the error we received when trying to copy your data.

Get your data directly from your bucket

You can use the gsutil cp command:

gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [SAVE_TO_LOCATION]

Where:

  • [BUCKET_NAME] is the name of the bucket containing the object you are downloading. For example, my-bucket.

  • [OBJECT_NAME] is the name of the object you are downloading. For example, pets/dog.png.

  • [SAVE_TO_LOCATION] is the local path where you are saving your object. For example, Desktop/Images.

If successful, the response will look like the following example:

Operation completed over 1 objects/58.8 KiB.
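
For the exports created in this tutorial, downloading everything under the pages prefix could look like the following sketch, where ./oncrawl-pages is an arbitrary local folder and -m copies files in parallel.

# Download all exported files under the "pages" prefix of the bucket used in Step 2.
gsutil -m cp -r gs://export-parquet/pages ./oncrawl-pages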

The Google Cloud Storage documentation also provides equivalent download examples for common programming languages.
