The Custom Fields configuration is available in the crawl settings interface. When you define a new crawl setting, you can add extraction rules that retrieve data from each page as it is fetched.

How to create a Custom Field

Click "Set up a new Crawl", choose one of your crawl settings, click "Edit Settings", then go to the "Custom Fields" section. If you cannot find this section, ask the team to activate it by clicking the Intercom button at the bottom right.

What kind of extraction is possible

Custom fields are based on two kinds of extraction rules: REGEX or XPATH.

REGEX is used to find a specific string in the source code.
Remember that you have to escape special characters like ? . / ( ) [ ] { } with a \
XPATH is a query language based on the DOM tree.
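
To illustrate the REGEX escaping rule above, here is a minimal sketch in Python. The page fragment and the searched string are hypothetical, and Python's re module is used only as a stand-in: the crawler's own regex engine may differ in details.

```python
import re

# Hypothetical fragment of page source, used only for illustration.
source = '<a href="/search?q=shoes&page=2">Next</a>'

# The literal text we want to find in the source code.
literal = "/search?q=shoes"

# Unescaped, "?" is a quantifier (it makes the preceding "h" optional),
# so the pattern no longer describes the literal text and nothing is found.
unescaped_match = re.search(literal, source)

# re.escape() adds the backslashes in front of special characters for us.
escaped_pattern = re.escape(literal)
escaped_match = re.search(escaped_pattern, source)

print(unescaped_match is None)   # the unescaped rule finds nothing
print(escaped_match.group(0))    # the escaped rule finds the literal text
```

The same idea applies when writing a rule by hand: put a \ in front of each special character that should be matched literally.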

For each kind of rule, you can choose between several extraction types:

Mono-valued returns a single string;
Multi-valued returns an array;
Check if exists returns a true or false value;
Length counts the number of extracted characters;
Number of occurrences counts how many extracted terms match the rule.
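
The difference between these extraction types can be sketched in Python as follows. This only approximates the behavior described above and is not the crawler's implementation: the page source and rule are hypothetical, and the sketch assumes that Mono-valued keeps the first match and that Length is measured on that first match.

```python
import re

# Hypothetical page source and REGEX rule, for illustration only.
source = "<h2>Red shoes</h2><h2>Blue shoes</h2><h2>Green boots</h2>"
rule = r"<h2>(.*?)</h2>"

# Every value captured by the rule, in document order.
matches = re.findall(rule, source)

# Mono-valued: a single string (assumption: the first match is kept).
mono_valued = matches[0] if matches else None

# Multi-valued: an array of all extracted values.
multi_valued = matches

# Check if exists: true or false.
check_if_exists = bool(matches)

# Length: number of extracted characters (assumption: of the first match).
length = len(mono_valued) if mono_valued else 0

# Number of occurrences: how many extracted terms match the rule.
number_of_occurrences = len(matches)

print(mono_valued, multi_valued, check_if_exists, length, number_of_occurrences)
```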

Custom Field creation

  1. Give your Custom Field a name. This name is used in the Data Explorer to find and display the column, so choose an explicit name based on what you extract.
  2. Choose the kind of rule: REGEX or XPATH.
  3. Specify the rule (some of our best rules are compiled at the end of this article).
  4. Choose the extraction type: Mono-valued (string), Multi-valued (array), Check if exists (boolean), or Length / Number of occurrences (counts).
  5. (Optional) Choose the field format: you can use $1 and $2 to access the first and second captured parts of the REGEX pattern.
  6. Check the rule: enter a URL that can validate your rule, or test it against a part of the source code.
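
The field format from step 5 can be pictured with the short Python sketch below. The helper apply_format, the page fragment, and the rule are all hypothetical, and the substitution logic is an assumption about how $1 and $2 behave, not the tool's actual code.

```python
import re

def apply_format(rule, field_format, source):
    """Return field_format with $1, $2, ... replaced by the captured groups.

    Hypothetical helper: returns None when the rule does not match.
    """
    match = re.search(rule, source)
    if match is None:
        return None
    # Replace each $n with the text of group n (empty if the group is unset).
    return re.sub(
        r"\$(\d+)",
        lambda m: match.group(int(m.group(1))) or "",
        field_format,
    )

# Hypothetical page fragment: the currency comes first, the amount second.
source = '<span class="price" data-currency="EUR">49.90</span>'
rule = r'data-currency="(\w+)">([\d.]+)<'

# "$2 $1" reorders the two captured parts: amount first, then currency.
formatted = apply_format(rule, "$2 $1", source)
print(formatted)
```

Here the rule captures two parts, and the format decides which parts are kept and in what order.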

Custom field examples

Extract the AMP URL from the link tag
Custom field name : amphtml
Rules kind :                XPATH
Extract type :             Mono-valued
rules :                         string(//link[@rel="amphtml"]/@href)
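
To try out the logic of this rule outside the tool, here is a rough Python sketch. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no string() function and no /@href attribute step), so the rule is approximated rather than run as written. The page content is hypothetical and must be well-formed XML for this parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page head used only for illustration.
page = """<html><head>
<link rel="canonical" href="https://www.example.com/article" />
<link rel="amphtml" href="https://www.example.com/amp/article" />
</head><body></body></html>"""

root = ET.fromstring(page)

# Approximation of string(//link[@rel="amphtml"]/@href):
# select the element with the limited XPath, then read the attribute in Python.
link = root.find(".//link[@rel='amphtml']")

# string() yields an empty string when nothing is selected, so mirror that.
amphtml = link.get("href", "") if link is not None else ""
print(amphtml)
```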

Extract the entire breadcrumb in an array (schema.org based)
Custom field name : breadcrumb
Rules kind :                XPATH
Extract type :             Multi-valued
rules :                         //*[@itemtype="http://schema.org/BreadcrumbList"]//*[@itemprop="name"]/text()
NB : this is for a schema.org breadcrumb integration; the XPATH should be adapted to your integration

Extract breadcrumb depth
Custom field name : breadcrumb_depth
Rules kind :                XPATH
Extract type :             Number of occurrences
rules :                         //*[@itemtype="http://schema.org/BreadcrumbList"]//*[@itemprop="name"]/text()
NB : this is for a schema.org breadcrumb integration; the XPATH should be adapted to your integration
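
The two breadcrumb rules share the same XPATH and differ only in extraction type. The Python sketch below shows that relationship on a hypothetical, well-formed breadcrumb. It relies on xml.etree.ElementTree, whose XPath subset has no text() step, so the text is read from each selected element instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema.org breadcrumb markup, written as well-formed XML.
page = """<html><body>
<ol itemscope="itemscope" itemtype="http://schema.org/BreadcrumbList">
  <li itemprop="itemListElement"><a href="/"><span itemprop="name">Home</span></a></li>
  <li itemprop="itemListElement"><a href="/shoes"><span itemprop="name">Shoes</span></a></li>
  <li itemprop="itemListElement"><a href="/shoes/running"><span itemprop="name">Running shoes</span></a></li>
</ol>
</body></html>"""

root = ET.fromstring(page)

# Same selection as the rule, minus the final /text() step.
names = root.findall(
    ".//*[@itemtype='http://schema.org/BreadcrumbList']//*[@itemprop='name']"
)

# Multi-valued: the whole breadcrumb as an array.
breadcrumb = [element.text for element in names]

# Number of occurrences: the breadcrumb depth.
breadcrumb_depth = len(breadcrumb)

print(breadcrumb, breadcrumb_depth)
```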

Check the presence of Google Analytics tag
Custom field name : GA_is_present
Rules kind :                REGEX
Extract type :             Check if exists
rules :                         _gaq\.push\(\['_setAccount', 'UA-XXXXXXXX-X'\]\);
NB : replace 'UA-XXXXXXXX-X' with your account ID
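
A quick way to sanity-check a rule like this one before saving it is to run it against a snippet of source code. The sketch below does so in Python with a hypothetical page fragment and the placeholder account ID. Python's re module stands in for the crawler's regex engine, which may behave slightly differently.

```python
import re

# Hypothetical page fragment containing the classic Google Analytics snippet.
page_with_ga = """<script>
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-XXXXXXXX-X']);
_gaq.push(['_trackPageview']);
</script>"""

# Hypothetical page fragment without any analytics tag.
page_without_ga = "<script>console.log('hello');</script>"

# The rule, with . ( ) [ ] escaped so they are matched literally.
rule = r"_gaq\.push\(\['_setAccount', 'UA-XXXXXXXX-X'\]\);"

# "Check if exists" boils down to: does the rule match anywhere in the source?
ga_is_present = re.search(rule, page_with_ga) is not None
ga_is_missing = re.search(rule, page_without_ga) is None

print(ga_is_present, ga_is_missing)
```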
