We'll cover setting up the Big Data Export to send your data to your Google bucket in the following steps:
Add Oncrawl's service account
Request a new data export
Check the status of your export
Get a list of your exports
We'll also explain the notifications associated with this feature, and how to consult the exported data directly from your Google Bucket.
Google Cloud Configuration
Step 1: Add our Service Account
To export data to a Google Cloud Storage bucket, you must allow our service account to write to the desired bucket.
Our service account is: oncrawl-data-transfer@oncrawl.iam.gserviceaccount.com
You MUST grant the following IAM roles to our service account:
roles/storage.legacyBucketReader
- storage.objects.list
- storage.buckets.get
roles/storage.legacyBucketWriter
- storage.objects.list
- storage.objects.create
- storage.objects.delete
- storage.buckets.get
roles/storage.legacyObjectCreator
- storage.objects.create
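As an illustration of what granting these roles means, here is a minimal Python sketch that merges the required bindings into a bucket IAM policy represented as a plain dictionary. It assumes you have already retrieved the policy as JSON (for example with `gsutil iam get`) and will apply the result yourself (for example with `gsutil iam set`). The function name `add_oncrawl_bindings` is hypothetical and not part of any Oncrawl or Google tooling.

```python
import copy

# Service account and roles as listed in this step.
ONCRAWL_MEMBER = "serviceAccount:oncrawl-data-transfer@oncrawl.iam.gserviceaccount.com"
REQUIRED_ROLES = [
    "roles/storage.legacyBucketReader",
    "roles/storage.legacyBucketWriter",
    "roles/storage.legacyObjectCreator",
]


def add_oncrawl_bindings(policy):
    """Return a copy of an IAM policy dict with the required bindings added.

    Existing bindings and members are preserved; the input is not modified.
    """
    updated = copy.deepcopy(policy)
    bindings = updated.setdefault("bindings", [])
    for role in REQUIRED_ROLES:
        # Reuse the binding for this role if the policy already has one.
        binding = next((b for b in bindings if b.get("role") == role), None)
        if binding is None:
            binding = {"role": role, "members": []}
            bindings.append(binding)
        members = binding.setdefault("members", [])
        if ONCRAWL_MEMBER not in members:
            members.append(ONCRAWL_MEMBER)
    return updated
```

Merging into the existing policy, rather than writing a fresh one, avoids accidentally removing access that other users or services already have on the bucket.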
Oncrawl Configuration
What you need before you start
A crawl (completed and not archived) and its ID
An active Oncrawl subscription
An Oncrawl API token with account:read and account:write. Help with API tokens.
This procedure uses the Oncrawl API. You can find full documentation here.
Step 2: Request a new data export
You will need to use the following commands to create a new data export.
You can export the following datasets:
page
link
raw_html
cluster
keyword
structured_data
Here are some examples:
Exports your page data as JSON to gs://export-parquet/pages. You can also use 'parquet' or 'csv' as the output format.
HTTP Request
curl -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"data_export": {
"data_type": "page",
"resource_id": "YOUR_CRAWL_ID",
"output_format": "json",
"target": "gcs",
"target_parameters": {"gcs_bucket":"export-parquet","gcs_prefix":"pages"}
}
}
EOF
Parameters:
data_type (required)
Can be either page or link
resource_id (required)
ID of the crawl to export; the crawl must not be archived
output_format (required)
Can be either json, csv or parquet
output_format_parameters (optional)
An object with the configuration for the selected output_format
target_parameters (required)
An object with the configuration for the selected target
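The request body can also be assembled and checked in code before it is sent. The following Python sketch builds the payload for a Google Cloud Storage export and rejects values that are not listed in this article. The helper name `build_gcs_export_payload` is hypothetical, and the accepted datasets are taken from the dataset list in this step; the API itself may accept or reject other values.

```python
# Values copied from the lists in this article; the API is the final authority.
DATASETS = {"page", "link", "raw_html", "cluster", "keyword", "structured_data"}
OUTPUT_FORMATS = {"json", "csv", "parquet"}


def build_gcs_export_payload(data_type, crawl_id, output_format, bucket, prefix):
    """Build the JSON-serializable body for a data export to a GCS bucket."""
    if data_type not in DATASETS:
        raise ValueError(f"Unknown data_type: {data_type!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"Unknown output_format: {output_format!r}")
    return {
        "data_export": {
            "data_type": data_type,
            "resource_id": crawl_id,
            "output_format": output_format,
            "target": "gcs",
            "target_parameters": {"gcs_bucket": bucket, "gcs_prefix": prefix},
        }
    }
```

For example, `build_gcs_export_payload("page", "YOUR_CRAWL_ID", "json", "export-parquet", "pages")` produces the same body as the page export request shown in this step.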
Exports your link data as CSV to gs://export-parquet/links
HTTP Request
curl -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ***********************************" \
-H "Content-Type: application/json" \
-d '{"data_export":{"data_type":"link","resource_id":"5e9e8be0726794000141abf0","output_format":"csv","target":"gcs","target_parameters":{"gcs_bucket":"export-parquet","gcs_prefix":"links"}}}'
HTTP Response
{
"data_export": {
"data_type": "link",
"export_failure_reason": null,
"id": "5eb97604451c95250a96b41e",
"output_format": "csv",
"output_format_parameters": null,
"output_row_count": null,
"output_size_in_bytes": null,
"requested_at": 1589212676000,
"resource_id": "5e9e8be0726794000141abf0",
"status": "REQUESTED",
"target": "gcs",
"target_parameters": {
"gcs_bucket": "export-parquet",
"gcs_prefix": "links"
}
}
}
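If you prefer to call the API from a script rather than from curl, the same POST request can be sketched with Python's standard library. This is a minimal sketch under assumptions: the function names `make_export_request` and `request_export` are hypothetical, the token is passed in by the caller, and error handling is left to the caller (`urllib` raises `HTTPError` on non-2xx responses).

```python
import json
import urllib.request

API_URL = "https://app.oncrawl.com/api/v2/account/data_exports"


def make_export_request(token, payload):
    """Build (but do not send) the POST request that creates a data export."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def request_export(token, payload):
    """Send the request and return the data_export object from the response."""
    with urllib.request.urlopen(make_export_request(token, payload)) as response:
        return json.load(response)["data_export"]
```

The `id` field of the returned object is the value to use when checking the export status.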
Step 3: Status of your export
Requests the status of your export.
HTTP Request
curl "https://app.oncrawl.com/api/v2/account/data_exports/<data_export.id>" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"
HTTP Response
{
"data_export": {
"id": string,
"data_type": enum,
"resource_id": string,
"output_format": enum,
"output_format_parameters": object,
"target": enum,
"target_parameters": object,
"status": enum,
"export_failure_reason": string,
"output_row_count": integer,
"output_size_in_bytes": integer,
"requested_at": integer
}
}
Properties are:
id
The unique identifier of the file.
data_type
Data type that was exported, can be either page or link.
resource_id
The unique identifier of the crawl.
output_format
Format used for the data export, can be either json, csv or parquet.
output_format_parameters
Parameters that were used by the output_format.
target
Target used for the data export, can be either s3 or gcs.
target_parameters
Parameters that were used by the target.
status
Current status of the export, can be:
REQUESTED: Request received but not yet handled
EXPORTING: Data is currently being exported in the desired format
UPLOADING: Data is currently being uploaded to the desired target
FAILED: An error occurred and the export could not complete successfully
DONE: Export completed successfully
export_failure_reason
Exists only if status is FAILED; contains the reason why the export failed.
output_row_count
Number of items that were exported.
output_size_in_bytes
Total size of the exported data.
requested_at
UTC timestamp when the export was requested.
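The requested_at value in the example response earlier in this article (1589212676000) appears to be expressed in milliseconds since the Unix epoch. Assuming that is the case, a small Python sketch can convert it to a readable UTC date. The helper name `requested_at_utc` is hypothetical.

```python
from datetime import datetime, timezone


def requested_at_utc(data_export):
    """Convert requested_at (assumed to be epoch milliseconds) to a UTC datetime."""
    return datetime.fromtimestamp(data_export["requested_at"] / 1000, tz=timezone.utc)


print(requested_at_utc({"requested_at": 1589212676000}).isoformat())
# 2020-05-11T15:57:56+00:00
```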
Example:
curl "https://app.oncrawl.com/api/v2/account/data_exports/5e8b351c451c952cd1664381" \
-H "Authorization: Bearer ***********************************" \
-H "Content-Type: application/json"
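Because an export moves through several statuses before it finishes, a script will usually check the status repeatedly. The sketch below polls until the export reaches DONE or FAILED. It is an assumption-laden example: `wait_for_export` is a hypothetical helper, the `fetch_export` argument is any function you supply that returns the data_export object for an ID (for instance, one that performs the GET request above), and the polling interval and retry limit are arbitrary defaults.

```python
import time


def wait_for_export(fetch_export, export_id, interval_seconds=30, max_checks=120,
                    sleep=time.sleep):
    """Poll an export until it is DONE; raise if it FAILED or never finishes."""
    for _ in range(max_checks):
        export = fetch_export(export_id)
        status = export["status"]
        if status == "FAILED":
            reason = export.get("export_failure_reason")
            raise RuntimeError(f"Export {export_id} failed: {reason}")
        if status == "DONE":
            return export
        # REQUESTED, EXPORTING and UPLOADING are intermediate states: wait and retry.
        sleep(interval_seconds)
    raise TimeoutError(f"Export {export_id} did not finish after {max_checks} checks")
```

Passing `fetch_export` and `sleep` as arguments keeps the polling logic independent of any HTTP client and makes it easy to try out with canned responses.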
Step 4: List of your exports
Requests a list of your exports.
HTTP Request
curl "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"
You can filter on the following properties:
- status
- resource_id
- data_type
- output_format
You can view how to filter and paginate here.
HTTP Response
{
"data_exports": [ data_export ]
}
Example:
curl "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ***********************************" \
-H "Content-Type: application/json"
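The server-side filter syntax is covered by the linked API documentation and is not reproduced here. As an alternative for small result sets, the Python sketch below filters and summarizes an already-fetched data_exports list on the client side, using the same property names. The helper names `filter_exports` and `count_by_status` are hypothetical.

```python
from collections import Counter


def filter_exports(data_exports, **criteria):
    """Keep only the exports whose properties match every given criterion."""
    return [
        export for export in data_exports
        if all(export.get(key) == value for key, value in criteria.items())
    ]


def count_by_status(data_exports):
    """Count how many exports are in each status."""
    return Counter(export["status"] for export in data_exports)
```

For example, `filter_exports(exports, status="DONE", data_type="link")` returns only the completed link exports from the list.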
Notification
If everything goes well, you will receive an email with the Google Cloud Storage link.
If we encounter a problem with the export, we will send you an email listing the error we received when trying to copy your data.
Get your data directly from your bucket
You can use the gsutil cp command:
gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [SAVE_TO_LOCATION]
Where:
[BUCKET_NAME] is the name of the bucket containing the object you are downloading. For example, my-bucket.
[OBJECT_NAME] is the name of the object you are downloading. For example, pets/dog.png.
[SAVE_TO_LOCATION] is the local path where you are saving your object. For example, Desktop/Images.
If successful, the response will look like the following example:
Operation completed over 1 objects/58.8 KiB.
Below are methods for various common languages:
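As one possible approach in Python, the sketch below simply wraps the gsutil cp command shown above using the standard library. It assumes gsutil is installed and authenticated on the machine running the script. The helper names `gsutil_cp_command` and `download_object` are hypothetical.

```python
import subprocess


def gsutil_cp_command(bucket_name, object_name, save_to_location):
    """Build the gsutil cp command as an argument list (no shell quoting needed)."""
    return ["gsutil", "cp", f"gs://{bucket_name}/{object_name}", save_to_location]


def download_object(bucket_name, object_name, save_to_location):
    """Run gsutil cp; raises CalledProcessError if the copy fails."""
    subprocess.run(
        gsutil_cp_command(bucket_name, object_name, save_to_location),
        check=True,
    )
```

Building the command as a list rather than a single string avoids shell interpretation of bucket or object names that contain spaces or special characters.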