Big Data Export for AWS (Amazon S3)

Follow this tutorial to use the Big Data Export to send crawl and link data to AWS (Amazon S3).


We'll cover setting up the Big Data Export to send your data to your AWS S3 bucket in the following steps:

  1. Add access to your AWS S3

  2. Encode your access_key and secret_key in base64 format

  3. Send your base64 string

  4. Request a new data export

  5. Status of your export

  6. List of your exports

We'll also explain the notifications associated with this feature, and how to consult the exported data directly from your AWS S3.

AWS Configuration

Step 1: Add access to your S3 bucket

To export data into an S3 bucket, you must create a secret_key/access_key pair with the following permissions on the desired bucket:

  • s3:PutObject

  • s3:GetObject
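
If you grant these permissions through an IAM policy, it could look like the sketch below. This is only an illustration, not an official Oncrawl or AWS template: YOUR_BUCKET_NAME is a placeholder for your own bucket, and your organization may require a more restrictive policy.

Example IAM policy [json]

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME/*"
    }
  ]
}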

Step 2: Encode your access_key and secret_key in base64 format

The expected value is a base64-encoded string of a JSON credentials file, which you can generate with:

$ base64 /path/to/credentials.json

The file content is expected to have the following syntax:

{
"access_key": "**********************",
"secret_key": "*********************************"
}
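
Putting the two together, here is a minimal sketch of the whole encoding step. The file name credentials.json and the base64 flags are assumptions: -w 0 produces single-line output with GNU coreutils, while macOS uses base64 -i credentials.json.

Shell example [bash]

# Write the credentials file (replace the placeholder values with your own keys).
cat > credentials.json <<'EOF'
{
  "access_key": "YOUR_ACCESS_KEY",
  "secret_key": "YOUR_SECRET_KEY"
}
EOF

# Encode it as a single base64 line (GNU coreutils; on macOS: base64 -i credentials.json).
base64 -w 0 credentials.json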

Oncrawl configuration

What you need before you start

  • A crawl (completed and not archived) and its ID

  • An active Oncrawl subscription

  • An Oncrawl API token with account:read and account:write. Help with API tokens.

This procedure uses the Oncrawl API. You can find full documentation here.

Step 3: Send your base64 string

The secrets API allows you to store sensitive information once and reuse it by referencing it elsewhere.

Parameters are:

  • name (required) Must be unique and match the RegEx ^[a-zA-Z][a-zA-Z0-9_-]{2,63}$

  • value (required) The secret's value; the actual content depends on the type property

You will need to substitute ACCESS_TOKEN with your token, and the value with the base64-encoded string generated above.

HTTP Request [bash]

curl -X POST "https://app.oncrawl.com/api/v2/account/secrets" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"secret": {
"name": "secret_aws",
"type": "s3_credentials",
"value":" -- need to change PLACEHOLDER with base64 string created at step 2 -- "
}
}
EOF
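
If you prefer not to paste the base64 string by hand, here is a minimal sketch that injects it from a shell variable. It assumes the credentials.json file from step 2 and GNU coreutils base64:

Shell example [bash]

# Capture the encoded credentials in a variable and inject it into the request body.
B64_CREDS=$(base64 -w 0 credentials.json)

curl -X POST "https://app.oncrawl.com/api/v2/account/secrets" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d "{\"secret\": {\"name\": \"secret_aws\", \"type\": \"s3_credentials\", \"value\": \"$B64_CREDS\"}}"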

HTTP Response [json]

A successful response will return an HTTP status 200 with the following body:

{
"secret": {
"creation_date": integer,
"id": string,
"name": string,
"owner_id": string,
"type": enum
}
}

  • Properties:

id
The unique identifier of the secret.

name
Name of the secret.

owner_id
The unique identifier of the secret's owner.

type
Type of secret.

creation_date
Date of creation as a UTC timestamp.

Step 4: Request a new data export

You can export the following datasets:

  • page

  • link

  • raw_html

  • cluster

  • keyword

  • structured_data

You will need to use the following commands to create a new data export.

  • Exports your crawl data in CSV format to s3://myawsdataseo/crawl. You can also use parquet or json formats.

HTTP Request [bash]

curl -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"data_export": {
"data_type": enum,
"resource_id": string,
"output_format": enum,
"target": β€˜s3’,
"target_parameters": object
}
}
EOF

  • Parameters:

data_type (required)
Can be either page or link

resource_id (required)
ID of the crawl to export; the crawl must not be archived

output_format (required)
Can be either json, csv or parquet

target_parameters (required)
An object with the configuration for the selected target

  • For the s3 target, the parameters are:

s3_credentials (required)

URI of the secret, it must be of type s3_credentials

A secret's URI has the following format: secrets://<owner_id>/<secret_name>

s3_bucket (required)

Name of the bucket to upload data to

s3_region (required)

Valid S3 region where the bucket is located

s3_prefix (required)

Path on the bucket where the files will be uploaded (e.g. oncrawl-exports/)

Example for copying to s3://myawsdataseo

curl -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d '{"data_export":{"data_type":"crawl","resource_id":"5e9e8be0726794200141acf0","output_format":"csv","target":"s3","target_parameters":{"s3_credentials": "secrets://5cd9aa8e451c95700f32aa90/secret1","s3_bucket": "myawsdataseo","s3_region": "eu-west-3","s3_prefix": "crawl"}}}'

Step 5: Status of your export

  • Requests the status of your export.

HTTP Request [bash]

curl "https://app.oncrawl.com/api/v2/account/data_exports/<data_export.id>" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"

HTTP Response [json]

{
"data_export": {
"id": string,
"data_type": enum,
"resource_id": string,
"output_format": enum,
"output_format_parameters": object,
"target": enum,
"target_parameters": object,
"status": enum,
"export_failure_reason": string,
"output_row_count": integer,
"output_size_in_bytes": integer,
"requested_at": integer
}
}
  • Properties are:

id
The unique identifier of the export.

data_type
Data type that was exported; can be either page or link.

resource_id
The unique identifier of the crawl.

output_format
Format used for the data export, can be either json, csv or parquet.

output_format_parameters
Parameters that were used by the output_format.

target
Target used for the data export, can be either s3 or gcs.

target_parameters
Parameters that were used by the target.

status
Current status of the export, can be:

  • REQUESTED: Request received but not yet handled

  • EXPORTING: Data is currently being exported in the desired format

  • UPLOADING: Data is currently being uploaded to the desired target

  • FAILED: An error occurred and the export could not complete successfully

  • DONE: Export completed successfully

export_failure_reason
Exists only if status is FAILED, contains the reason why the export failed.

output_row_count
Number of items that were exported.

output_size_in_bytes
Total size of the exported data.

requested_at
UTC timestamp when the export was requested.

Example:

curl "https://app.oncrawl.com/api/v2/account/data_exports/5e8b351c451c952cd1664381" \
-H "Authorization: Bearer **********************************" \
-H "Content-Type: application/json"

Step 6: List of your exports

  • Requests a list of your exports.

HTTP Request [bash]

curl "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"

  • You can filter on the following properties:
    - status
    - resource_id
    - data_type
    - output_format

You can view how to filter and paginate here.

HTTP Response [json]

{
"data_exports": [ data_export ]
}

Example:

curl "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer **********************************" \
-H "Content-Type: application/json"

Optional: Delete your secret

HTTP Request [bash]

curl -X DELETE "https://app.oncrawl.com/api/v2/account/secrets/SECRET_ID" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"

HTTP Response

It returns an HTTP 204 if successful.

Optional: Check your key

HTTP Request [bash]

curl "https://app.oncrawl.com/api/v2/account/secrets" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"

You can filter on either the name or type properties. See how to filter and paginate here.

HTTP Response [json]

{
"secrets": [
{
"creation_date": integer,
"id": string,
"name": string,
"owner_id": string,
"type": enum
}
]
}

Example:

curl "https://app.oncrawl.com/api/v2/account/secrets" \
-H "Authorization: Bearer **********************************" \
-H "Content-Type: application/json"

The result:

{
"meta": {
"filters": {},
"limit": 20,
"offset": 0,
"sort": null,
"total": 1
},
"secrets": [
{
"creation_date": 1585678486000,
"id": "5e836c76451c956fa1202cf5",
"name": "secret1",
"owner_id": "5cd9aa8f451c957015k30aa90",
"type": "gcs_credentials"
}
]
}

Notification

When the export completes, you will receive an email with a link to your data on AWS.

If the export fails, Oncrawl will notify you by email with the error it received when trying to copy your data.

Get your Data directly from your AWS S3

Please follow the instructions provided by AWS S3 to browse and download the exported files.
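
For example, if you have the AWS CLI installed and configured, you could list and download the exported files. The bucket name and prefix below match the example from step 4; replace them with your own values:

Shell example [bash]

# List the exported files under the prefix used in step 4.
aws s3 ls s3://myawsdataseo/crawl/

# Download all exported files to a local folder.
aws s3 cp s3://myawsdataseo/crawl/ ./oncrawl-export/ --recursive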
