Extract, convert and adjust dates from a Coveo Cloud V2 Web source

Let’s say the pages in your Web source have a date/time value somewhere in the markup of the page. You’d like to be able to extract that string, convert it to a date type and then set one of your fields / mappings with that value, so you can use it in your search results, facets or in other components. This was my exact situation just two weeks ago, and through a lot of Python research and syntactical hoops, I was able to achieve the desired result.


I had one Web source that had some interestingly formatted event dates, and our company didn’t want to burden the client with updating their date formatting on the site to match the formatting of other dates in other sources, so we had to take what we were given and find a way to convert it. Here is one example for one of the client’s sites:

Saturday 3 February 2018 9:00 am CST

At first glance this didn’t seem too difficult. Until I learned how much of a pain date formatting in Python can be.


Phase 1: Extract

First, we had to scrape that date out of there and into a raw / temporary field. To do this we utilized the Coveo Web Scraping Configuration, which is essentially a field on your Web source that lets you extract data, exclude certain parts of a page, and other things. What you enter into this field must be in a JSON format. In this case I also had to brush up on my XPath skills, since I would need to provide a path to the value I wanted to extract. My web scraping configuration looked like this:

 "for": {
 "urls": [
 "metadata": {
 "textpubdate": {
 "type": "XPATH",
 "path": "//div[contains(@id,\"formField_dateTime_event_start\")]/div[contains(@class,\"fieldResponse\")]/text()"

What this means: I’m specifying that I would like to extract data from the page as metadata, and I specify my temporary string field textpubdate. The XPath selector above is looking for a div with class fieldResponse inside of a div with class formField_dateTime_event_start and then simply extracts the inner text by calling text() at the end. After I rebuilt my source, it worked – the mapping was showing the string value.

Phase 2: Convert

Time to learn Python! The next step is to create an indexing pipeline extension, using Python as the programming language, which will handle the back end work of converting the dates. To be specific, Coveo Cloud V2 extensions seem to be running version 2.7.6 of Python (at least from what I found), so there are some solutions that won’t work if they don’t work with v2.7.6. Also, solutions I found on the internet did not always apply, as some formatting directives that work on Linux for example, don’t work on Windows. Some examples:

  • If your extracted date string uses lower case time parts such as ‘am’ and ‘pm’, those aren’t supported for use with strptime in the en_US locale. They are supported if you update the locale to de_DE (German), but switching locales didn’t seem to be supported by the Coveo Python OS from what I could tell.
  • If your date uses a single digit numerical day, you’re out of luck because the %e directive is not supported in the standard Windows C library, and you will get an error in the log browser if you try using it.

Thankfully, I found an official list of Windows-supported directives for use with the strptime function. Those should all work in a Coveo extension too.

So, my problem remained: certain time parts could not be converted because no working directive existed. The only option left? Get rid of it.


I wasn’t printing it out in my search results anyway. I decided to make a Regex string that would find the time part of the date and remove it (some parts of this were Regex found online), which involved learning re.compile, re.search,re.sub and a bunch other fun Python Regex functions and gotchas – such as:

  • Despite seeing it in the vast majority of articles, the r prefix should not be entered before a Regex string if you are using standard escape sequences. Since my Regex definitely was, I just dropped the r and my re.compile succeeded.

Phase 3: Adjust (if necessary)

If you rebuild your source now, you should have dates coming into your field (you can check on this in the Content Browser).  However, you might notice they are the wrong day! In fact, they might be one day before what the actual date was on the page. This is because of the conversion that you are doing. When the date string gets converted, it gets converted to a date using your local time. Using the timedelta library, you would need to add a certain amount of hours so that the date matches UTC (see full code below).


Final code used:

from datetime import datetime, timedelta
import string
import re

 # Get Coveo field value (string)
 pubdate = document.get_meta_data_value('textpubdate')
 # Try to strip out time, am/pm and timezone
 pattern = re.compile('([0-9]|0[0-9]|1[0-9]|2[0-3]):[0-5][0-9]\s(am|pm|AM|PM)\s[a-zA-Z]{3}$')
 if (re.search(pattern, pubdate[0])):
 log('Match succeeded', 'Debug')
 # Replace the found pattern with nothing
 newpubdate = re.sub(pattern, '', pubdate[0])
 # Convert string value to Coveo date type, strip out ending spaces if any
 pdate = datetime.strptime(newpubdate.rstrip(), '%A %d %B %Y')
 # Add 6 hours to date due to time zone used in conversion
 newDate = pdate + timedelta(hours=6)
 # Set date field
except Exception as e:

The timedelta library caused some hiccups; I had to use it in the exact way I have it written below or it didn’t work. One final comment: make good use of the log() command while doing these so you can see any errors or custom messages you enter in the Log Browser. Hope this helps someone and let me know if you have any questions, comments or suggestions!


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s