Extract, convert and adjust dates from a Coveo Cloud V2 Web source

Let’s say the pages in your Web source have a date/time value somewhere in the markup of the page. You’d like to extract that string, convert it to a date type, and then set one of your fields/mappings with that value so you can use it in your search results, facets, or other components. This was my exact situation just two weeks ago, and through a lot of Python research and syntactical hoops, I was able to achieve the desired result.

Background

I had one Web source with some interestingly formatted event dates, and our company didn’t want to burden the client with updating their date formatting on the site to match the formatting of dates in other sources, so we had to take what we were given and find a way to convert it. Here is an example from one of the client’s sites:

Saturday 3 February 2018 9:00 am CST

At first glance this didn’t seem too difficult. Until I learned how much of a pain date formatting in Python can be.


Phase 1: Extract

First, we had to scrape that date out of there and into a raw/temporary field. To do this we used the Coveo Web Scraping Configuration, which is essentially a field on your Web source that lets you extract data, exclude certain parts of a page, and more. What you enter into this field must be valid JSON. In this case I also had to brush up on my XPath skills, since I would need to provide a path to the value I wanted to extract. My web scraping configuration looked like this:

[
  {
    "for": {
      "urls": [
        ".*"
      ]
    },
    "metadata": {
      "textpubdate": {
        "type": "XPATH",
        "path": "//div[contains(@id,\"formField_dateTime_event_start\")]/div[contains(@class,\"fieldResponse\")]/text()"
      }
    }
  }
]

What this means: I’m specifying that I would like to extract data from the page as metadata, into my temporary string field textpubdate. The XPath selector above looks for a div whose class contains fieldResponse inside a div whose id contains formField_dateTime_event_start, and then simply extracts the inner text by calling text() at the end. After I rebuilt my source, it worked: the mapping was showing the string value.
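
If you want to sanity-check the XPath before rebuilding the source, you can run it locally against a snippet of the page markup. This isn’t part of the Coveo configuration; it’s just a quick check using lxml (my choice of library here, not something Coveo requires), and the sample HTML below is a made-up approximation of the client’s markup:

from lxml import html

# Hypothetical snippet mimicking the client's markup (not the real page)
sample = """
<div id="formField_dateTime_event_start_123">
  <div class="fieldResponse">Saturday 3 February 2018 9:00 am CST</div>
</div>
"""

tree = html.fromstring(sample)
print(tree.xpath('//div[contains(@id,"formField_dateTime_event_start")]'
                 '/div[contains(@class,"fieldResponse")]/text()'))
# ['Saturday 3 February 2018 9:00 am CST']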

Phase 2: Convert

Time to learn Python! The next step is to create an indexing pipeline extension, written in Python, which will handle the back-end work of converting the dates. To be specific, Coveo Cloud V2 extensions appear to run Python 2.7.6 (at least from what I found), so solutions that require a newer version won’t work. Solutions I found online also didn’t always apply, since some formatting directives that work on Linux, for example, don’t work on Windows. Some examples:

  • If your extracted date string uses lowercase time parts such as ‘am’ and ‘pm’, those aren’t supported for use with strptime in the en_US locale. They are supported if you switch the locale to de_DE (German), but changing locales didn’t seem to be supported in the Coveo Python environment from what I could tell.
  • If your date uses a single-digit numerical day, you’re out of luck: the %e directive is not supported in the standard Windows C library, and you will get an error in the Log Browser if you try using it.

Thankfully, I found an official list of Windows-supported directives for use with the strptime function. Those should all work in a Coveo extension too.
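
As a quick sanity check (run outside of Coveo), the date-only portion of the example parses fine using directives from that list; the format string here is the same one used in the final extension code further down:

from datetime import datetime

# %A, %d, %B and %Y are all on the Windows-supported list, and %d accepts a single-digit day
print(datetime.strptime('Saturday 3 February 2018', '%A %d %B %Y'))
# 2018-02-03 00:00:00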

So, my problem remained: certain time parts could not be converted because no working directive existed. The only option left? Get rid of it.


I wasn’t printing it out in my search results anyway. I decided to write a regex that would find the time part of the date and remove it (some parts of this were regex found online), which involved learning re.compile, re.search, re.sub and a bunch of other fun Python regex functions and gotchas, such as:

  • Despite seeing it in the vast majority of articles, the r prefix shouldn’t be added before a regex string if the string relies on standard escape sequences. Mine definitely did, so I just dropped the r and my re.compile succeeded.
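
Here’s a quick standalone test of that approach on the example date string; the pattern is the same one that ends up in the extension code below:

import re

raw = 'Saturday 3 February 2018 9:00 am CST'
# Match the time, am/pm and three-letter timezone at the end of the string
pattern = re.compile('([0-9]|0[0-9]|1[0-9]|2[0-3]):[0-5][0-9]\s(am|pm|AM|PM)\s[a-zA-Z]{3}$')

if re.search(pattern, raw):
    stripped = re.sub(pattern, '', raw).rstrip()
    print(stripped)  # Saturday 3 February 2018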

Phase 3: Adjust (if necessary)

If you rebuild your source now, you should have dates coming into your field (you can check this in the Content Browser). However, you might notice they are the wrong day! In fact, they might be one day before the actual date on the page. This is because of the conversion you are doing: when the date string gets converted, it gets converted to a date using your local time. Using timedelta from the datetime module, you need to add a certain number of hours so that the date matches UTC (see the full code below).

Completion

Final code used:

from datetime import datetime, timedelta
import string
import re

try:
    # Get Coveo field value (string)
    pubdate = document.get_meta_data_value('textpubdate')
    # Try to strip out time, am/pm and timezone
    pattern = re.compile('([0-9]|0[0-9]|1[0-9]|2[0-3]):[0-5][0-9]\s(am|pm|AM|PM)\s[a-zA-Z]{3}$')
    if re.search(pattern, pubdate[0]):
        log('Match succeeded', 'Debug')
        # Replace the found pattern with nothing
        newpubdate = re.sub(pattern, '', pubdate[0])
        # Convert string value to Coveo date type, strip out ending spaces if any
        pdate = datetime.strptime(newpubdate.rstrip(), '%A %d %B %Y')
        # Add 6 hours to date due to time zone used in conversion
        newDate = pdate + timedelta(hours=6)
        # Set date field
        document.add_meta_data({'aoparesultdate': newDate})
except Exception as e:
    log(str(e), 'Error')

timedelta caused some hiccups; I had to use it in the exact way it’s written above or it didn’t work. One final comment: make good use of the log() command while doing this so you can see any errors or custom messages you enter in the Log Browser. Hope this helps someone, and let me know if you have any questions, comments or suggestions!

 


Source mapping with different values based on the document URL

I’ve been working with Web sources in the Coveo Cloud V2 Platform a lot lately. One interesting predicament I ran into with one of my sources was the need for a “multi-value” source mapping, meaning a mapping that returns a different string based on a condition. In my case, I had a mapping on my source for my content type field, and I needed it to return “Event” if the indexed item’s URL contained “/events”, and “Page” for everything else. I thought Item Types were exactly what I needed, but I never really found out because I wasn’t able to create them. Instead, I decided to create an indexing pipeline extension to handle this logic and hook it up to my source.

Creating the extension

If you aren’t familiar with Indexing Pipeline Extensions, they are essentially Python scripts you can write and attach to your sources to apply complex logic to each individual item as it goes through the indexing pipeline. I suggest going through the above linked document and the related documents below it for more information. Extensions can be really powerful and can do many things, like rejecting content based on a condition, setting the value of a mapping, and so on.

In this case, I had to set the value of my content type field based on the URL of a document. I knew I could use the clickableURI out-of-the-box field for this, so I wrote the following:

import string

try:
    my_uri = document.get_meta_data_value('clickableuri')
    if "/events" in my_uri[0]:
        document.add_meta_data({'aopacontenttype': "Event"})
    else:
        document.add_meta_data({'aopacontenttype': "Page"})
except Exception as e:
    log(str(e), 'Error')

It took some research into Python (V2.7.6), its syntax and available methods, but I was able to do it.
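
If you want to try the URL logic outside of the pipeline first, the same condition can be wrapped in a plain function (content_type_for is just a hypothetical helper for local testing, not part of the Coveo API):

def content_type_for(url):
    # Mirror the extension's condition: event pages live under /events
    return "Event" if "/events" in url else "Page"

print(content_type_for('https://example.com/events/annual-gala'))  # Event
print(content_type_for('https://example.com/about'))               # Page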

Adding the extension to the source

Next, I had to go back to the Sources screen, select my source, click (…) More, then Manage extensions, and add my extension to the source at the Post-Conversion stage, applying it to all items.

Completion

Lastly, I had to rebuild the source. Upon checking the newly indexed content in the Content Browser, I was able to see “Event” as the field value for event pages, and “Page” for all other pages. Hope this helps! I will be writing another post or two about my other experiences with indexing pipeline extensions soon.