If you manage news articles, you may want to look at data for articles published in the last seven days, between eight and thirty days ago, between thirty-one and ninety days ago, and more than ninety days ago.

With OnCrawl, you can set up a segmentation for these relative dates (or "rolling dates") and reuse it, even for future crawls.

To set up and use this segmentation, the initial and future crawls must be run with data scraping, so that the publication date, found in the schema.org data in the HTML of article pages, is available for analysis.

We provide an example segmentation by month at the bottom of this article, which you can copy and paste.

Set up a crawl with data scraping for datePublished

To obtain the publication date for all article pages, we'll look for schema.org structured data for news articles, and will scrape the value of the datePublished field into an OnCrawl metric that we will name "datePublished".

We will show you two different methods based on how your publication dates are marked up: as JSON-LD, or as microdata (itemprop attributes).

Method 1: For schema.org structured data using JSON-LD

Start by setting up a crawl with data scraping:

  1. From the project home page, click "+ Set up new crawl".
  2. Under "Analysis", click "Scraping" to expand the section.
  3. Check the "Enable scraping" box.
    Note: if your plan does not include scraping, please talk to your account manager to adapt your plan.
  4. In the field "Custom field name", enter "datePublished".
    You must use this exact name if you intend to use the example segmentations provided below.
  5. Under "Parsing", fill in the following information for Step 1:
     - Rule kind: REGEX
     - Rule: \"datePublished\":\"([0-9]{4}-[0-9]{2}-[0-9]{2})
    This expression looks for the phrase "datePublished":" followed by an expression in the format 0000-00-00, where 0 is any digit from 0 to 9.
     - Output pattern (optional): {0}
    This output pattern indicates that we want to save the first expression captured by the regex above, in exactly the format that we found it. This is the default output pattern, so we can also leave this field blank.
  6. Under Export, select:
     - Keep…: First value
     - As…: Date value (e.g. "YYYY-MM-DD")
  7. In the "Check output" box at the bottom left, select "Using URL".
     - Provide a "Sample URL" from your site that includes a datePublished.
     - Click "Check" to make sure this scraping rule works for your site. The correct publication date should appear in the "Check result" box on the right.
  8. Click "Save custom field".

Now, launch a crawl to scrape the datePublished from your pages.
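If you'd like to sanity-check the regex before launching the crawl, you can run it against a sample of your page source outside of OnCrawl, for example in Python. The HTML snippet below is a hypothetical, minified JSON-LD block; your pages will differ.

```python
import re

# Hypothetical, minified JSON-LD block from an article page; your pages will
# differ. Note that the rule assumes no space between ":" and the date value.
html = ('<script type="application/ld+json">'
        '{"@type":"NewsArticle","datePublished":"2023-04-18T09:30:00+02:00"}'
        '</script>')

# Same pattern as the scraping rule; the \" escapes used in the OnCrawl
# rule field are not needed inside a Python raw string.
match = re.search(r'"datePublished":"([0-9]{4}-[0-9]{2}-[0-9]{2})', html)
print(match.group(1))  # 2023-04-18
```

The {0} output pattern in OnCrawl corresponds to the first capture group here.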

Method 2: For schema.org structured data using microdata in the format <time itemprop="datePublished">

Start by setting up a crawl with data scraping:

  1. From the project home page, click "+ Set up new crawl".
  2. Under "Analysis", click "Scraping" to expand the section.
  3. Check the "Enable scraping" box.
    Note: if your plan does not include scraping, please talk to your account manager to adapt your plan.
  4. In the field "Custom field name", enter "datePublished".
    You must use this exact name if you intend to use the example segmentations provided below.
  5. Under "Parsing", fill in the following information for Step 1:
     - Rule kind: XPATH
     - Rule: //time[@itemprop="datePublished"]
    This rule looks for a time tag with an itemprop property set to "datePublished".
  6. Click "+" to add a step. Fill in the following information for Step 2:
     - Rule kind: XPATH
     - Rule: string(//@datetime)
    This rule looks for the string of text contained in the datetime property.
  7. Click "+" to add a step. Fill in the following information for Step 3:
     - Rule kind: REGEX
     - Rule: ([0-9]{4}-[0-9]{2}-[0-9]{2})
    This rule looks for an expression in the format 0000-00-00, where 0 is any digit from 0 to 9.
     - Output pattern (optional): {0}
    This output pattern indicates that we want to save the first expression captured by the regex above, in exactly the format that we found it. This is the default output pattern, so we can also choose to leave this field blank.
  8. Under Export, select:
     - Keep…: First value
     - As…: Date value (e.g. "YYYY-MM-DD")
  9. In the "Check output" box at the bottom left, select "Using URL".
     - Provide a "Sample URL" from your site that includes a datePublished.
     - Click "Check" to make sure this scraping rule works for your site. The correct publication date should appear in the "Check result" box on the right.
  10. Click "Save custom field".

Now, launch a crawl to scrape the datePublished from your pages.
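As with Method 1, you can dry-run the logic of these three steps outside of OnCrawl. The Python sketch below mirrors them using the standard library; unlike OnCrawl's scraper, ElementTree requires well-formed markup, and the sample HTML is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical article markup; real pages will differ and may not be
# well-formed XML, which OnCrawl's scraper handles but ElementTree does not.
html = ('<html><body>'
        '<time itemprop="datePublished" datetime="2023-04-18T09:30:00+02:00">'
        'April 18, 2023</time>'
        '</body></html>')

root = ET.fromstring(html)

# Step 1: find the time tag with an itemprop property set to "datePublished"
time_tag = root.find(".//time[@itemprop='datePublished']")

# Step 2: read the string of text contained in its datetime property
datetime_value = time_tag.get("datetime")

# Step 3: keep only the part in the format 0000-00-00
match = re.search(r"([0-9]{4}-[0-9]{2}-[0-9]{2})", datetime_value)
print(match.group(1))  # 2023-04-18
```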

Set up a segmentation for rolling date ranges based on datePublished

Note: To use datePublished as a segmentation metric, you must have a prior crawl that scraped the datePublished from the HTML of your pages during the crawl.

This segmentation will apply to the original crawl and to future crawls that have scraped data into a date field called datePublished.

Here are two ways to go about it: from scratch, or by copying and pasting our example.

Method 1: Create a segmentation from scratch using your own date ranges

Prepare the segmentation:

  1. From the project home page, scroll down to the "Analysis" section and click on the "Configure Segmentation" button. This will take you to the Segmentation page.
  2. Click on the "+ Create segmentation" button at the top of the page.
  3. Select "From scratch" and click "Continue".
  4. Enter a name for the segmentation. Let's call it "Publication date".
  5. Click "Create segmentation." This will take you to the segmentation page for your new segmentation.

Create a page group for each date range, and set the definition for each group using a regular expression.

  1. Click on "+ Create page group" at the top right of the page.
  2. Enter a name for the group of pages.
  3. Choose a color that will represent this group in all of the OnCrawl charts.
  4. Place the group last in the series of page groups.
  5. Click "Create page group". This will take you to the page where you can indicate which pages to put in this group.
  6. Choose the metric to apply the regular expression to: "Custom field: datePublished".
  7. Choose the operator, and enter the earliest date and the latest date, relative to now. OnCrawl uses date calculations based on the syntax provided by Elastic.

    Here are some examples:

    Published in the last week
    Group name: Last week
    Operator: between
    Earliest value: now-1w
    Latest value: now

    Published within the last month, but before the last week
    Group name: Last month
    Operator: between
    Earliest value: now-1M
    Latest value: now-1w

    Published within the last 90 days, but before the last month
    Group name: Last 90 days
    Operator: between
    Earliest value: now-90d
    Latest value: now-1M

    Published before the last 90 days
    Group name: Older
    Operator: less than
    Value: now-90d

    Has no date value
    Group name: No date
    Operator: has no value
  8. Click "Refresh matching URLs" to make sure everything is working right.
  9. Click "Save changes".
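To get a feel for which publication dates a "between" range captures, here is a rough Python sketch of the comparison logic. This is only an illustration: the real date math (expressions like now-1M) is evaluated by Elastic on OnCrawl's side, and calendar months are approximated below with fixed 30-day periods.

```python
from datetime import date, timedelta

today = date.today()  # stands in for "now"

# Approximate relative boundaries (Elastic computes these exactly)
one_week_ago = today - timedelta(weeks=1)
one_month_ago = today - timedelta(days=30)  # rough stand-in for now-1M

def in_group(published, earliest, latest):
    # The "between" operator keeps pages whose date falls inside the range
    return earliest <= published <= latest

# An article published 3 days ago lands in "Last week" ...
print(in_group(today - timedelta(days=3), one_week_ago, today))           # True
# ... while one published 10 days ago falls in "Last month" instead
print(in_group(today - timedelta(days=10), one_week_ago, today))          # False
print(in_group(today - timedelta(days=10), one_month_ago, one_week_ago))  # True
```

Because the boundaries are shared between consecutive groups, every dated page falls into exactly one range.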

Method 2: Create a segmentation by copying and pasting our month-based segmentation

Prepare the segmentation:

  1. From the project home page, scroll down to the "Analysis" section and click on the "Configure Segmentation" button. This will take you to the Segmentation page.
  2. Click on the "+ Create segmentation" button at the top of the page.
  3. Select "From existing set or import" and click "Continue".
  4. Choose the method "Paste JSON". A text field will appear.
  5. Paste the full text of the segmentation below and click "Continue":
[
  {
    "name": "current month",
    "color": "#F1C8AE",
    "oql": {
      "field": [
        "custom_datePublished",
        "between",
        [
          "now/M",
          "now"
        ]
      ]
    }
  },
  {
    "name": "month -1 ",
    "color": "#E69F9E",
    "oql": {
      "field": [
        "custom_datePublished",
        "between",
        [
          "now-1M/M",
          "now/M"
        ]
      ]
    }
  },
  {
    "name": "month - 2",
    "color": "#DC778E",
    "oql": {
      "field": [
        "custom_datePublished",
        "between",
        [
          "now-2M/M",
          "now-1M/M"
        ]
      ]
    }
  },
  {
    "name": "month -3",
    "color": "#C65787",
    "oql": {
      "field": [
        "custom_datePublished",
        "between",
        [
          "now-3M/M",
          "now-2M/M"
        ]
      ]
    }
  },
  {
    "name": "quarter -1",
    "color": "#9B448B",
    "oql": {
      "field": [
        "custom_datePublished",
        "between",
        [
          "now-6M/M",
          "now-3M/M"
        ]
      ]
    }
  },
  {
    "name": "quarter -2",
    "color": "#703290",
    "oql": {
      "field": [
        "custom_datePublished",
        "between",
        [
          "now-9M/M",
          "now-6M/M"
        ]
      ]
    }
  },
  {
    "name": "quarter -3",
    "color": "#32164B",
    "oql": {
      "field": [
        "custom_datePublished",
        "between",
        [
          "now-12M/M",
          "now-9M/M"
        ]
      ]
    }
  },
  {
    "name": "no date",
    "color": "#333333",
    "oql": {
      "field": [
        "custom_datePublished",
        "has_no_value",
        ""
      ]
    }
  }
]
  6. Enter a name for the segmentation. Let's call it "Publication date".
  7. Click "Create segmentation". This will take you to the segmentation page for your new segmentation.

You will see that the page groups have already been created for the current month, month-1, month-2, month-3, quarter-1, quarter-2, quarter-3, and no date.
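In these expressions, the /M suffix rounds a date down to the first day of its month, so each group spans whole calendar months: "now-1M/M", for example, means "the start of last month". Here is a rough Python sketch of that rounding, which Elastic actually performs server-side (the "now" value is hypothetical):

```python
from datetime import date

def round_down_to_month(d):
    # Equivalent in spirit to Elastic's "/M": snap a date to the 1st of its month
    return d.replace(day=1)

def minus_months(d, n):
    # Go back n calendar months (the stdlib has no built-in month arithmetic)
    total = d.year * 12 + (d.month - 1) - n
    return date(total // 12, total % 12 + 1, 1)

today = date(2023, 4, 18)            # hypothetical "now"
print(round_down_to_month(today))    # "now/M"    -> 2023-04-01
print(minus_months(today, 1))        # "now-1M/M" -> 2023-03-01
```

This is why, unlike the day-based ranges in Method 1, the month-based groups never split a calendar month between two groups.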

Apply a segmentation based on datePublished

Return to your project home page and click on any analysis of a crawl that has scraped publication dates to view the analysis results.

At the top of the page, use the "Segmentation" drop-down menu to select the segmentation "Publication date" that you just created.

Note: The groups that are displayed are based on today's date. This means, for example, that if your crawl dates from more than a week ago, you won't see any articles in the "Last week" group. This is because the most recent publication date known to OnCrawl for this crawl is from before the current week.

You can also focus on one of the page groups in the segmentation by changing the group selected in the second drop-down menu, "Base filter".

Going further

If you still have questions, drop us a line at @oncrawl_cs or click on the Intercom button at the bottom right of your screen to start a chat with us.

Happy crawling!

You can also find this article by searching for:
dates glissantes, segmentation glissante, fecha de publicación, fechas evolutivas, grupos de páginas, groupes de pages
