This tutorial explains how to export your crawl and link data to your AWS S3 bucket.
We'll cover setting up the Big Data Export to send your data to AWS S3 in the following steps:
Add access to your AWS S3
Encode your access_key and secret_key in base64 format
Send your base64 string
Request a new data export
Status of your export
List of your exports
We'll also explain the notifications associated with this feature, and how to access the exported data directly from your AWS S3 bucket.
AWS Configuration
Step 1: Add access to your S3 bucket
To export data into an S3 bucket, you must create an access_key/secret_key pair with the following permissions on the desired bucket (an example IAM policy is sketched after the list):
s3:PutObject
s3:GetObject
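If you manage access with IAM, a minimal policy granting these two permissions could look like the sketch below. This is only an illustration: the file name oncrawl-export-policy.json and the bucket myawsdataseo are example values to replace with your own.
# Example IAM policy allowing uploads and reads on the example bucket "myawsdataseo".
cat > oncrawl-export-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::myawsdataseo/*"
    }
  ]
}
EOF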
Step 2: Encode your access_key and secret_key in base64 format
The expected value is a base64-encoded string of a JSON credentials file:
$ base64 /path/to/credentials.json
The file content is expected to have the following syntax:
{
"access_key": "**********************",
"secret_key": "*********************************"
}
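For example, assuming the key pair above is saved in credentials.json, you could generate the base64 string like this (a minimal sketch; -w0 is the GNU base64 option that disables line wrapping, on macOS use base64 -i credentials.json instead):
# Encode the credentials file and keep the result in a shell variable for step 3.
B64_CREDENTIALS=$(base64 -w0 credentials.json)
echo "$B64_CREDENTIALS"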
Oncrawl configuration
What you need before you start
A crawl (completed and not archived) and its ID
An active Oncrawl subscription
An Oncrawl API token with account:read and account:write. Help with API tokens.
This procedure uses the Oncrawl API. You can find full documentation here.
Step 3: Send your base64 string
The secrets API allows you to store sensitive information once and reuse it by referencing it elsewhere.
Parameters are:
name (required) Must be unique and match the RegEx
^[a-z-A-Z][a-zA-Z0-9_-]{2,63}$
value (required) The secret's value; the actual content depends on the type property
You will need to substitute ACCESS_TOKEN with your token, and the value with the base64-encoded string generated above.
HTTP Request [bash]
curl -X POST "https://app.oncrawl.com/api/v2/account/secrets" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"secret": {
"name": "secret_aws",
"type": "s3_credentials",
"value": "<base64 string created at step 2>"
}
}
EOF
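As a convenience, here is a sketch of the same request with the value injected from the B64_CREDENTIALS variable created in the step 2 sketch; the secret name secret_aws is only an example.
# B64_CREDENTIALS comes from the step 2 sketch; the secret name is an example.
curl -X POST "https://app.oncrawl.com/api/v2/account/secrets" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"secret": {
"name": "secret_aws",
"type": "s3_credentials",
"value": "$B64_CREDENTIALS"
}
}
EOF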
HTTP Response [json]
A successful response will return an HTTP status 200 with the following body:
{
"secret": {
"creation_date": integer,
"id": string,
"name": string,
"owner_id": string,
"type": enum
}
}
Properties:
id
The unique identifier of the secret.
name
Name of the secret
owner_id
The unique identifier of the secret's owner.
type
Type of secret.
creation_date
Date of creation as a UTC timestamp.
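You will reference this secret in step 4 through its URI, which has the format secrets://<owner_id>/<secret_name>. As a sketch, assuming you saved the response body above to secret_response.json and have jq installed, you could build the URI like this:
# Assumes the step 3 response was saved to secret_response.json; secret_aws is the example name.
OWNER_ID=$(jq -r '.secret.owner_id' secret_response.json)
SECRET_URI="secrets://${OWNER_ID}/secret_aws"
echo "$SECRET_URI"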
Step 4: Request a new data export
You can export the following datasets:
page
link
raw_html
cluster
keyword
structured_data
Use the following command to create a new data export.
The example below exports your crawl data in CSV format to s3://myawsdataseo/crawl; you can also use the parquet or json formats.
HTTP Request [bash]
curl -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d @- <<EOF
{
"data_export": {
"data_type": enum,
"resource_id": string,
"output_format": enum,
"target": "s3",
"target_parameters": object
}
}
EOF
Parameters:
data_type (required)
The dataset to export; one of the types listed above (for example page or link)
resource_id (required)
ID of the crawl to export; the crawl must not be archived
output_format (required)
Can be either json, csv or parquet
target_parameters (required)
An object with the configuration for the selected target
Parameters are:
s3_credentials (required)
URI of the secret; it must be of type s3_credentials
A secret's URI has the following format: secrets://<owner_id>/<secret_name>
s3_bucket (required)
Name of the bucket to upload data to
s3_region (required)
Valid S3 region where the bucket is located
s3_prefix (required)
Path in the bucket where the files will be uploaded (e.g. oncrawl-exports/)
Example for copying to s3://myawsdataseo:
curl -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d '{"data_export":{"data_type":"crawl","resource_id":"5e9e8be0726794200141acf0","output_format":"csv","target":"s3","target_parameters":{"s3_credentials": "secrets://5cd9aa8e451c95700f32aa90/secret1","s3_bucket": "myawsdataseo","s3_region": "eu-west-3","s3_prefix": "crawl"}}}'
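If you want to reuse the export identifier in step 5 without copying it by hand, a sketch with jq could look like this; EXPORT_BODY is a hypothetical variable holding the same JSON body as in the example above.
# EXPORT_BODY is assumed to contain the JSON body shown in the example above.
EXPORT_ID=$(curl -s -X POST "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d "$EXPORT_BODY" | jq -r '.data_export.id')
echo "$EXPORT_ID"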
Step 5: Status of your export
Requests the status of your export.
HTTP Request [bash]
curl "https://app.oncrawl.com/api/v2/account/data_exports/<data_export.id>" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"
HTTP Response [json]
{
"data_export": {
"id": string,
"data_type": enum,
"resource_id": string,
"output_format": enum,
"output_format_parameters": object,
"target": enum,
"target_parameters": object,
"status": enum,
"export_failure_reason": string,
"output_row_count": integer,
"output_size_in_bytes": integer,
"requested_at": integer
}
}
Properties are:
id
The unique identifier of the data export.
data_type
Data type that was exported; can be either page or link.
resource_id
The unique identifier of the crawl.
output_format
Format used for the data export; can be either json, csv or parquet.
output_format_parameters
Parameters that were used by the output_format.
target
Target used for the data export; can be either s3 or gcs.
target_parameters
Parameters that were used by the target.
status
Current status of the export, can be:
REQUESTED: Request received but not yet handled
EXPORTING: Data is currently being exported in the desired format
UPLOADING: Data is currently being uploaded to the desired target
FAILED: An error occurred and the export could not terminate successfully
DONE: Export completed successfully
export_failure_reason
Exists only if status is FAILED; contains the reason why the export failed.
output_row_count
Number of items that were exported.
output_size_in_bytes
Total size of the exported data.
requested_at
UTC timestamp when the export was requested.
Example:
curl "https://app.oncrawl.com/api/v2/account/data_exports/5e8b351c451c952cd1664381" \
-H "Authorization: Bearer **********************************" \
-H "Content-Type: application/json"
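Since exports run asynchronously, you may want to poll this endpoint until a terminal status is reached. A minimal polling sketch, assuming jq is installed and EXPORT_ID holds the identifier returned in step 4:
# Check the export status every 30 seconds until it is DONE or FAILED.
while true; do
  STATUS=$(curl -s "https://app.oncrawl.com/api/v2/account/data_exports/$EXPORT_ID" \
    -H "Authorization: Bearer ACCESS_TOKEN" | jq -r '.data_export.status')
  echo "Export status: $STATUS"
  [ "$STATUS" = "DONE" ] || [ "$STATUS" = "FAILED" ] && break
  sleep 30
done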
Step 6: List of your exports
Requests a list of your exports.
HTTP Request [bash]
curl "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"
You can filter on the following properties:
- status
- resource_id
- data_type
- output_format
You can view how to filter and paginate here.
HTTP Response [json]
{
"data_exports": [ data_export ]
}
Example:
curl "https://app.oncrawl.com/api/v2/account/data_exports" \
-H "Authorization: Bearer **********************************" \
-H "Content-Type: application/json"
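If you only want certain exports, the filterable properties above can, as an assumption based on the filtering documentation linked above, be passed as query-string parameters. For example, to list only successful exports of a given crawl:
# Assumed filter syntax; check the filtering documentation linked above.
curl "https://app.oncrawl.com/api/v2/account/data_exports?status=DONE&resource_id=5e9e8be0726794200141acf0" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"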
Optional: Delete your secret
HTTP Request [bash]
curl -X DELETE "https://app.oncrawl.com/api/v2/account/secrets/SECRET_ID" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"
HTTP Response
It returns an HTTP 204 if successful.
Optional: Check your key
HTTP Request [bash]
curl "https://app.oncrawl.com/api/v2/account/secrets" \
-H "Authorization: Bearer ACCESS_TOKEN" \
-H "Content-Type: application/json"
You can filter on either the name or type properties. See how to filter and paginate here.
HTTP Response [json]
{
"secrets": [
{
"creation_date": integer,
"id": string,
"name": string,
"owner_id": string,
"type": enum
}
]
}
Example:
curl "https://app.oncrawl.com/api/v2/account/secrets" \
-H "Authorization: Bearer **********************************" \
-H "Content-Type: application/json"
The result:
{
"meta": {
"filters": {},
"limit": 20,
"offset": 0,
"sort": null,
"total": 1
},
"secrets": [
{
"creation_date": 1585678486000,
"id": "5e836c76451c956fa1202cf5",
"name": "secret1",
"owner_id": "5cd9aa8f451c957015k30aa90",
"type": "gcs_credentials"
}
]
}
Notifications
Once the export is complete, you will receive an email with a link to your data on AWS.
If the export fails, Oncrawl will notify you by email with the error Oncrawl received when trying to copy your data.
Get your data directly from your AWS S3
To retrieve the exported files, follow the instructions provided by AWS S3, or use the AWS CLI as in the sketch below.
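A minimal sketch, assuming the AWS CLI is configured with an access key that can read the bucket, and using the example bucket and prefix from step 4:
# List the exported files, then download them to a local folder.
aws s3 ls s3://myawsdataseo/crawl/
aws s3 cp s3://myawsdataseo/crawl/ ./oncrawl-export/ --recursive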