Data scraping and custom fields

Data scraping is an option that allows you to analyze a portion of the source code extracted during a crawl using custom fields.


You can use data scraping to create custom fields and obtain analyses based on the content or structure of your pages. You might also know this function as "custom extraction."

To create a custom field, you will need to set up a scraping configuration in the crawl setting interface. When you define new crawl settings, you can add specific extraction rules to retrieve data when the page is fetched.

What is a Custom Field?

A Custom Field is some attribute of a web page that you want to collect for each page in your website. A Custom Field becomes a new column that you can add to any report in the Data Explorer.

You will be able to use the different values of your new attribute to analyze the pages of your site.

Custom Fields require the source code to be examined. This examination is carried out while your website is being crawled. Because of this, Custom Fields must be set up before running a crawl.

How do I get the data for my Custom Field?

Custom fields use extraction rules to find attributes in the source code of your pages during a crawl. These rules can be written as REGEX, as XPATH queries, or as a combination of both.

The choice of REGEX or XPATH depends on what you are looking for in the source code. You can use the following rules of thumb:

  • Use XPATH when you want to capture the text marked by HTML tags (for example, H2 text) or the text of an HTML attribute (for example, image alt text). If you're familiar with CSS selectors, XPATH covers the same capabilities.

  • Use REGEX when you want to capture a specific pattern of characters (for example, a date).

Once you've found what you're looking for, you can use Transformations to modify the final results. This can be useful, for example, if you want to count the number of times something occurs, or measure other characteristics of the content you were looking for.

The rules and transformations are configured when you set up a crawl. You can skip ahead to "Creating a Custom Field" for more information.

Regex expressions

REGEX is used to find a specific string of text in the source code. A REGEX expression describes the text or type of text you are looking for using generic characters that can be interpreted by a search program.

Because REGEX uses certain special characters to represent the characteristics of what you're looking for, you have to escape these characters with a \ when you want to match them literally. Characters that must be escaped include: ? . / ( ) [ ] { }

A single REGEX may search for several strings. By default, only the first search string, or group, will be taken into account. To use a different group, or multiple groups, you will need to provide an output pattern.

Use the expression {n} to indicate a group, and replace "n" with the number of the group in the order it appeared in the search pattern. (The first group is number 0.)

For example, to search for a date written as 2018/07/19 and return 19-07-2018:

  • REGEX: (\d{4})/(\d{2})/(\d{2})

  • Output pattern: {2}-{1}-{0}
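To make the group numbering and output pattern concrete, here is a minimal Python sketch (an illustration only, not the crawler's own implementation; the function name apply_regex_rule is hypothetical):

  import re

  # Illustrative only: capture groups are numbered from 0 in the order they
  # appear, and each {n} placeholder in the output pattern is replaced by
  # the corresponding group.
  def apply_regex_rule(source, regex, output_pattern="{0}"):
      match = re.search(regex, source)
      if not match:
          return None
      return output_pattern.format(*match.groups())

  html = '<span class="date">2018/07/19</span>'
  print(apply_regex_rule(html, r"(\d{4})/(\d{2})/(\d{2})", "{2}-{1}-{0}"))
  # -> 19-07-2018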

For help with REGEX, please see our quick guide here.

XPATH queries

XPATH is a query language that can be used to find elements in a structured document, such as a webpage in HTML or XML. It describes the structural element you are looking for in your page source.

If you're used to CSS selectors, you'll want to use XPATH. To convert any existing CSS selector expressions to XPath, you can use this site: https://css2xpath.github.io/

For example, it can be used to find the text of all H2 headers on a page:

  • Step 1: XPATH: //h2

  • Step 2: XPATH: string(*) 

For help with XPATH queries, you can find more information in the XPath tutorial at w3schools.com.
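If it helps to see these two steps outside the crawl interface, here is a rough Python equivalent using the lxml library (an illustration only, under those assumptions; it is not how the crawler itself evaluates the rules):

  from lxml import html

  # Illustrative only: step 1 selects the nodes, step 2 turns each node into
  # its text content, similar to applying string(*) to each result.
  page = html.fromstring(
      "<html><body>"
      "<h2>First subtitle</h2>"
      "<h2>Second <em>subtitle</em></h2>"
      "</body></html>"
  )
  nodes = page.xpath("//h2")                                  # step 1: //h2
  values = [node.text_content().strip() for node in nodes]    # step 2: string(*)
  print(values)  # ['First subtitle', 'Second subtitle']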

Creating a Custom Field

Before you start

Data scraping is an optional feature. If you can't enable scraping in the "Scraping" section in your crawl settings, you don't have access to this option.

Click on the Intercom button at the bottom right of the screen to talk to our team about adding data scraping to your plan.

Activate Data Scraping in your Crawl Settings

  1. From the project home page, click on "Set up a new Crawl" or choose one of your existing crawl settings.

  2. Under "Analysis", click on "Scraping" to expand the section. If you can't access this section, you may need to upgrade your plan to include data scraping.

  3. Check the box "Enable scraping."

Name your custom field

Give a name to your Custom Field. This name will be used in the Data Explorer to find and display the corresponding column.

Try to give it a clear, descriptive name based on the information you want to extract.

Pick a name made up of lowercase and uppercase letters, hyphens, underscores, and numbers. Special characters, including accents, cannot be used in names of custom fields.
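As a hedged illustration, the naming rule above amounts to a check like the following (the exact rule enforced by the interface may differ in its details):

  import re

  # Illustrative only: letters, digits, hyphens and underscores are allowed;
  # accents and other special characters are not.
  def is_valid_field_name(name):
      return re.fullmatch(r"[A-Za-z0-9_-]+", name) is not None

  print(is_valid_field_name("breadcrumb_depth"))  # True
  print(is_valid_field_name("catégorie"))         # False (accented character)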

Set the parsing rules

Set up the parsing rules. You can use multiple steps. Each successive step will be applied to the result of the previous step. 

For each step:

  1. Choose the kind of rule: REGEX or XPATH.

  2. Specify the rule. If it's a REGEX, you can also provide an optional output pattern.

(We give you a few examples of frequently used rules at the end of this article. Go ahead and jump to the end if you want.)
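To show how successive steps chain together, here is a minimal Python sketch of a two-step rule, using lxml for the XPATH step and Python's re module for the REGEX step (an illustration only; the crawler's own engine may differ in its details):

  import re
  from lxml import html

  page = html.fromstring('<p class="meta">Published on 2018/07/19</p>')

  # Step 1 (XPATH): grab the text of the metadata paragraph.
  step1 = [node.text_content() for node in page.xpath('//p[@class="meta"]')]

  # Step 2 (REGEX with an output pattern): applied to the result of step 1.
  step2 = []
  for text in step1:
      match = re.search(r"(\d{4})/(\d{2})/(\d{2})", text)
      if match:
          step2.append("{2}-{1}-{0}".format(*match.groups()))

  print(step2)  # ['19-07-2018']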

Select transformations to perform on the results

Drop empty values: if any of the values found do not contain any characters, ignore them. This might be the case if, for example, you searched for the name of a WordPress category, but have pages that do not have a category.

Normalize URLs: if the results are URLs, format them all as "http://yourdomain.com/path". This converts relative URLs ("/path") into absolute URLs.

Apply a pattern: add characters before or after the values you find. The value found is represented by the {0} in the field below. You can place text before or after the {0}. Make sure not to delete or modify the {0} if you want to see the value itself in your results.

Get number of values: instead of listing the values found, count the number of results.

Replace HTML entities with display characters: if the value found contains characters written as HTML entities, such as accented characters or &amp; for the character "&", display the character itself instead of the HTML code.

Condense white spaces: if your result contains a series of adjacent spaces, line returns, and other "white" spaces, replace the series with a single space.

Get value's length: count the number of characters in the found value and replace the value with the number of characters.
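Here are hedged Python sketches of a few of these transformations, each shown independently on a small sample (illustrative only, not the crawler's own code):

  import html
  import re
  from urllib.parse import urljoin

  values = [" Category  A ", "", "Caf&eacute;", "/blog/post-1"]

  # Drop empty values
  non_empty = [v for v in values if v.strip()]

  # Normalize URLs: resolve a relative path against the page URL (assumed here)
  absolute = urljoin("http://yourdomain.com/blog/", "/blog/post-1")
  # -> http://yourdomain.com/blog/post-1

  # Replace HTML entities with display characters
  decoded = html.unescape("Caf&eacute;")  # -> Café

  # Condense white spaces
  condensed = re.sub(r"\s+", " ", " Category  A ").strip()  # -> "Category A"

  # Get number of values / Get value's length
  count = len(non_empty)   # 3
  length = len(condensed)  # 10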

Select the format of values to export

Decide in what format you want to export the results.

You can keep all values, or just the first value found.

A value can be saved as one of the following (a short illustration follows the list):

  1. String: a series of characters

  2. Number: a numerical value expressed as a whole number (1, 2, 3, …)

  3. Decimal number: a numerical value expressed as a number with decimals (1.23, 4.5, 6.789, ...)

  4. Boolean value: true or false

  5. Date value: a date, expressed in the format YYYY-MM-DD

  6. Datetime value: a date and time, expressed in the format YYYY-MM-DD HH:mm:ss
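As a rough illustration of the six formats, using Python's built-in types (the platform's own parsing may differ in its details):

  from datetime import datetime

  as_string   = "lorem ipsum"                    # String
  as_number   = int("123")                       # Number
  as_decimal  = float("1.23")                    # Decimal number
  as_boolean  = True                             # Boolean value
  as_date     = datetime.strptime("2018-07-19", "%Y-%m-%d").date()              # Date value
  as_datetime = datetime.strptime("2018-07-19 10:30:00", "%Y-%m-%d %H:%M:%S")   # Datetime value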

Check the rule and its output

Before saving, it's a good idea to make sure that your scraping rule works the way you expect it to. Check your rule by running it on a sample URL, or by providing text content for a sample analysis.

The result will display on the right.

Don't forget to click "Save Custom Field" in the verification box when you're satisfied!

Limits

  • 100 scraping rules per crawl configuration

  • The total size of the scraped data cannot be bigger than 100 KB per page

  • A multi-valued field cannot have more than 100 values

If one of these limits is exceeded, the result will be reduced by either:

  • Truncating the value itself (e.g. a string) to fit the limit

  • Dropping the field altogether

Because of this, fixed-size fields (boolean, integer, ...) are parsed before variable-size fields (strings, ...) to ensure that we scrape as much data as possible.

You can use the Data Explorer with the field processing_warning to get more details about truncated or dropped scraped fields.

Examples of Custom Field configurations

Extract the URL of the AMP version of the page

Custom field name: amphtml

Parsing step 1:

  Kind of rule: XPATH

  Rule: //link[@rel="amphtml"]

Parsing step 2:

  Kind of rule: XPATH

  Rule: string(//@href)

Transformations: Drop empty values, Normalize URLs

Export:

  Keep... All values

  As… String (e.g. "lorem ipsum")

Extract the entire breadcrumb as a series of values (schema.org based)

Custom field name: breadcrumb

Parsing step 1:

  Kind of rule: XPATH

  Rule: //*[@itemtype="http://schema.org/BreadcrumbList"]//*[@itemprop="name"]/text()

Transformations: none

Export:

  Keep... All values

  As… String (e.g. "lorem ipsum")

This is for a schema.org breadcrumb integration. XPATH should be adapted to your integration.

Extract breadcrumb depth

Custom field name: breadcrumb_depth

Parsing step 1:

  Kind of rule: XPATH

  Rule: //*[@itemtype="http://schema.org/BreadcrumbList"]//*[@itemprop="name"]/text()

Transformations: Get number of values

Export:

  Keep... All values

  As… Number (e.g. 123)

This is for a schema.org breadcrumb integration. XPATH should be adapted to your integration.

Check for the presence of a Google Analytics tag

Custom field name: GA_is_present

Parsing step 1:

  Kind of rule: REGEX

  Rule: _gaq.push\(\['_setAccount', 'UA-XXXXXXXX-X'\]\);

  Output pattern: none

Transformations: none

Export:

  Keep... First value

  As… Boolean value (e.g. true)

NB: replace 'UA-XXXXXXXX-X' with your account ID
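To see why this can be exported as a Boolean, here is a minimal Python sketch of the same check (illustrative only; the sample source string and variable names are assumptions, not the crawler's code):

  import re

  # The field is true when the pattern is found in the page source, false
  # otherwise. `source` stands in for the crawled page's HTML.
  source = "<script>_gaq.push(['_setAccount', 'UA-XXXXXXXX-X']);</script>"
  pattern = r"_gaq\.push\(\['_setAccount', 'UA-XXXXXXXX-X'\]\);"
  print(re.search(pattern, source) is not None)  # True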

Going further

If you still have questions about data scraping and custom fields, feel free to drop us a line at @oncrawl_cs or click on the Intercom button at the bottom right of your screen to start a chat with us.

Happy crawling!

You can also find this article by searching for:
extracción de datos campos personalizados métricos personalizados
extraire données du code source, champs personnalisées, métriques customisés
