Big Data Export for AWS (Amazon S3)

Follow this tutorial to use the Big Data Export to send crawl and link data to AWS (Amazon S3).


We'll cover setting up the Big Data Export to send your data to your AWS S3 bucket in the following steps:

  1. Add access to your AWS S3

  2. Encode your access_key and secret_key in base64 format

  3. Send your base64 string

  4. Request a new data export

  5. Status of your export

  6. List of your exports

We'll also explain the notifications associated with this feature, and how to consult the exported data directly from your AWS S3.

AWS Configuration

Step 1: Add access to your S3 bucket

To export data into an S3 bucket, you must create a secret_key/access_key pair with the following permissions on the desired bucket:

  • s3:PutObject

  • s3:GetObject
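
If you grant these permissions through an IAM policy, it could look like the sketch below. This is only an illustration, not an official Oncrawl or AWS template: YOUR_BUCKET_NAME is a placeholder for your own bucket, and your organization may require a more restrictive policy.

Example IAM policy [json]

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME/*"
    }
  ]
}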

Step 2: Encode your access_key and secret_key in base64 format

The expected value is a base64-encoded string of a JSON credentials file, which you can generate with:

$ base64 /path/to/credentials.json

The file content is expected to have the following syntax:

{
"access_key": "**********************",
"secret_key": "*********************************"
}
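
Putting the two together, here is a minimal sketch of the whole encoding step. The file name credentials.json and the base64 flags are assumptions: -w 0 produces single-line output with GNU coreutils, while macOS uses base64 -i credentials.json.

Shell example [bash]

# Write the credentials file (replace the placeholder values with your own keys).
cat > credentials.json <<'EOF'
{
  "access_key": "YOUR_ACCESS_KEY",
  "secret_key": "YOUR_SECRET_KEY"
}
EOF

# Encode it as a single base64 line (GNU coreutils; on macOS: base64 -i credentials.json).
base64 -w 0 credentials.json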

Oncrawl configuration

What you need before you start

  • A crawl (completed and not archived) and its ID

  • An active Oncrawl subscription

  • An Oncrawl API token with account:read and account:write. Help with API tokens.

This procedure uses the Oncrawl API. You can find full documentation here.

Step 3: Send your base64 string

The secrets API allows you to store sensitive information once and reuse it by referencing it elsewhere.

Parameters are:

  • name (required) Must be unique and match the RegEx ^[a-zA-Z][a-zA-Z0-9_-]{2,63}$

  • value (required) The secret's value; the actual content depends on the type property

You will need to substitute ACCESS_TOKEN with your token, and the value with the base64-encoded string generated above.

HTTP Request [bash]

curl -X POST "https://app.oncrawl.com/api/v2/account/secrets" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"secret": {
"name": "secret_aws",
"type": "s3_credentials",
"value":" -- need to change PLACEHOLDER with base64 string created at step 2 -- "
}
}
EOF
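
If you prefer not to paste the base64 string by hand, here is a minimal sketch that injects it from a shell variable. It assumes the credentials.json file from step 2 and GNU coreutils base64:

Shell example [bash]

# Capture the encoded credentials in a variable and inject it into the request body.
B64_CREDS=$(base64 -w 0 credentials.json)

curl -X POST "https://app.oncrawl.com/api/v2/account/secrets" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d "{\"secret\": {\"name\": \"secret_aws\", \"type\": \"s3_credentials\", \"value\": \"$B64_CREDS\"}}"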

HTTP Response [json]

A successful response will return an HTTP status 200 with the following body:

{
"secret": {
"creation_date": integer,
"id": string,
"name": string,
"owner_id": string,
"type": enum
}
}

  • Properties:

id
The unique identifier of the secret.

name
Name of the secret.

owner_id
The unique identifier of the secret's owner.

type
Type of secret.

creation_date
Date of creation as a UTC timestamp.

Step 4: Request a new data export

You can export the following datasets:

  • page

  • link

  • raw_html

  • cluster

  • keyword

  • structured_data

You will need to use the following commands to create a new data export.

  • Exports your crawl data in CSV format to s3://myawsdataseo/crawl. You can also use parquet or json formats.

HTTP Request [bash]

curl -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"data_export": {
"data_type": enum,
"resource_id": string,
"output_format": enum,
"target": β€˜s3’,
"target_parameters": object
}
}
EOF

  • Parameters:

data_type (required)
Can be either page or link

resource_id (required)
ID of the crawl to export; the crawl must not be archived

output_format (required)
Can be either json, csv or parquet

target_parameters (required)
An object with the configuration for the selected target

  • For the s3 target, the parameters are:

s3_credentials (required)

URI of the secret, it must be of type s3_credentials

A secret's URI has the following format: secrets://<owner_id>/<secret_name>

s3_bucket (required)

Name of the bucket to upload data to

s3_region (required)

Valid S3 region where the bucket is located

s3_prefix (required)

Path on the bucket where the files will be uploaded (e.g. oncrawl-exports/)

Example for copying to s3://myawsdataseo

curl -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d '{"data_export":{"data_type":"crawl","resource_id":"5e9e8be0726794200141acf0","output_format":"csv","target":"s3","target_parameters":{"s3_credentials": "secrets://5cd9aa8e451c95700f32aa90/secret1","s3_bucket": "myawsdataseo","s3_region": "eu-west-3","s3_prefix": "crawl"}}}'

Step 5: Status of your export

  • Requests the status of your export.

HTTP Request [bash]

curl "https://app.oncrawl.com/api/v2/account/data_exports/<data_export.id>" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"

HTTP Response [json]

{
"data_export": {
"id": string,
"data_type": enum,
"resource_id": string,
"output_format": enum,
"output_format_parameters": object,
"target": enum,
"target_parameters": object,
"status": enum,
"export_failure_reason": string,
"output_row_count": integer,
"output_size_in_bytes": integer,
"requested_at": integer
}
}
  • Properties are:

id
The unique identifier of the export.

data_type
Data type that was exported; can be either page or link.

resource_id
The unique identifier of the crawl.

output_format
Format used for the data export, can be either json, csv or parquet.

output_format_parameters
Parameters that were used by the output_format.

target
Target used for the data export, can be either s3 or gcs.

target_parameters
Parameters that were used by the target.

status
Current status of the export, can be:

  • REQUESTED: Request received but not yet handled

  • EXPORTING: Data is currently being exported in the desired format

  • UPLOADING: Data is currently being uploaded to the desired target

  • FAILED: An error occurred and the export could not complete successfully

  • DONE: Export completed successfully

export_failure_reason
Exists only if status is FAILED, contains the reason why the export failed.

output_row_count
Number of items that were exported.

output_size_in_bytes
Total size of the exported data.

requested_at
UTC timestamp when the export was requested.

Example:

curl "https://app.oncrawl.com/api/v2/account/data_exports/5e8b351c451c952cd1664381" \
-H "Authorization: Bearer **********************************" \
-H "Content-Type: application/json"

Step 6: List of your exports

  • Requests a list of your exports.

HTTP Request [bash]

curl "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"

  • You can filter on the following properties:
    - status
    - resource_id
    - data_type
    - output_format

You can view how to filter and paginate here.

HTTP Response [json]

{
"data_exports": [ data_export ]
}

Example:

curl "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer **********************************" \
-H "Content-Type: application/json"

Optional: Delete your secret

HTTP Request [bash]

curl -X DELETE "https://app.oncrawl.com/api/v2/account/secrets/SECRET_ID" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"

HTTP Response

It returns an HTTP 204 if successful.

Optional: Check your key

HTTP Request [bash]

curl "https://app.oncrawl.com/api/v2/account/secrets" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"

You can filter on either the name or type properties. See how to filter and paginate here.

HTTP Response [json]

{
"secrets": [
{
"creation_date": integer,
"id": string,
"name": string,
"owner_id": string,
"type": enum
}
]
}

Example:

curl "https://app.oncrawl.com/api/v2/account/secrets" \
-H "Authorization: Bearer **********************************" \
-H "Content-Type: application/json"

The result:

{
"meta": {
"filters": {},
"limit": 20,
"offset": 0,
"sort": null,
"total": 1
},
"secrets": [
{
"creation_date": 1585678486000,
"id": "5e836c76451c956fa1202cf5",
"name": "secret1",
"owner_id": "5cd9aa8f451c957015k30aa90",
"type": "gcs_credentials"
}
]
}

Notification

When the export completes, you will receive an email with a link to your data on AWS.

If the export fails, Oncrawl will notify you by email with the error it received when trying to copy your data.

Get your Data directly from your AWS S3

Please follow the instructions provided by AWS S3 to browse and download the exported files.
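
For example, if you have the AWS CLI installed and configured, you could list and download the exported files. The bucket name and prefix below match the example from step 4; replace them with your own values:

Shell example [bash]

# List the exported files under the prefix used in step 4.
aws s3 ls s3://myawsdataseo/crawl/

# Download all exported files to a local folder.
aws s3 cp s3://myawsdataseo/crawl/ ./oncrawl-export/ --recursive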
