Creating a Coveo facet with Sitecore sub-domains and external sites as options

Take this example: you have multiple sub-domains hosted in your Sitecore instance, each with its own node under Content and its own Home item. You also have various external sites, not hosted in Sitecore, that are completely separate. You're building a search page with Coveo and want to develop a basic facet listing all of your internal sub-domains and external sites, so that end users can filter pages and content by site. How do you do this? In this blog post, I will explain. Disclaimer: when I did this, I was working with Coveo for Sitecore v4.1 (Hive framework), so if your version differs significantly, it's possible not everything in this post will be accurate.

Create a Coveo Field for Your Facet

Every new facet you create will need its own field, so let’s create one. This is going to be a string, single-value field representing the name of a particular site (i.e. “Store” or “Marketing”, etc). In your Coveo.SearchProvider.Custom.config file, add a new field entry in the fieldMap/fieldNames node. In my case, my field’s name was aopaSiteName, so that naming convention will be used in other parts of this post as well. Your field entry should look something like this:

&lt;fieldType fieldName="aopaSiteName" settingType="Coveo.Framework.Configuration.FieldConfiguration, Coveo.Framework" isExternal="true" isFacet="true" /&gt;

Note that the field definition has isExternal="true" – this specifies that the field is not a normal Coveo for Sitecore field, so the field name won't be converted. If you set isExternal="false" instead, your field name will be translated to something like fmyfield99999 rather than just myfield. I suggest keeping this setting true to make development easier. Finally, isFacet="true" is needed so that this field can be used in a facet.



Extract, convert and adjust dates from a Coveo Cloud V2 Web source

Let’s say the pages in your Web source have a date/time value somewhere in the markup of the page. You’d like to be able to extract that string, convert it to a date type and then set one of your fields / mappings with that value, so you can use it in your search results, facets or in other components. This was my exact situation just two weeks ago, and through a lot of Python research and syntactical hoops, I was able to achieve the desired result.

Background

I had one Web source that had some interestingly formatted event dates, and our company didn’t want to burden the client with updating their date formatting on the site to match the formatting of other dates in other sources, so we had to take what we were given and find a way to convert it. Here is one example for one of the client’s sites:

Saturday 3 February 2018 9:00 am CST

At first glance this didn’t seem too difficult. Until I learned how much of a pain date formatting in Python can be.


Phase 1: Extract

First, we had to scrape that date out of the page and into a raw, temporary field. To do this we used the Coveo Web Scraping Configuration, which is essentially a field on your Web source that lets you extract data as metadata, exclude certain parts of a page, and more. The value you enter into this field must be valid JSON. In this case I also had to brush up on my XPath skills, since I would need to provide a path to the value I wanted to extract. My web scraping configuration looked like this:

[
  {
    "for": {
      "urls": [
        ".*"
      ]
    },
    "metadata": {
      "textpubdate": {
        "type": "XPATH",
        "path": "//div[contains(@id,\"formField_dateTime_event_start\")]/div[contains(@class,\"fieldResponse\")]/text()"
      }
    }
  }
]

What this means: I'm specifying that I would like to extract data from the page as metadata, and I specify my temporary string field textpubdate. The XPath selector above looks for a div whose class contains fieldResponse inside a div whose id contains formField_dateTime_event_start, and then simply extracts the inner text by calling text() at the end. After I rebuilt my source, it worked – the mapping was showing the string value.
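Since the configuration must be valid JSON, a quick local sanity check with Python's standard json module can save a rebuild cycle. This is just a sketch that parses the configuration shown above and reads back one of its values:

```python
import json

# The web scraping configuration from above; a raw string keeps the
# backslash-escaped quotes intact for the JSON parser.
config = r'''
[
  {
    "for": { "urls": [".*"] },
    "metadata": {
      "textpubdate": {
        "type": "XPATH",
        "path": "//div[contains(@id,\"formField_dateTime_event_start\")]/div[contains(@class,\"fieldResponse\")]/text()"
      }
    }
  }
]
'''

parsed = json.loads(config)  # raises ValueError if the JSON is malformed
print(parsed[0]['metadata']['textpubdate']['type'])  # XPATH
```

If json.loads raises an error here, the platform will not accept the configuration either.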

Phase 2: Convert

Time to learn Python! The next step is to create an indexing pipeline extension, written in Python, which will handle the back-end work of converting the dates. To be specific, Coveo Cloud V2 extensions appear to run Python 2.7.6 (at least from what I found), so solutions that require a newer version won't work. Also, solutions I found on the internet did not always apply, as some formatting directives that work on Linux, for example, don't work on Windows. Some examples:

  • If your extracted date string uses lower-case time parts such as 'am' and 'pm', those aren't supported for use with strptime in the en_US locale. They are supported if you switch the locale to de_DE (German), but switching locales didn't seem to be supported in Coveo's Python environment from what I could tell.
  • If your date uses a single digit numerical day, you’re out of luck because the %e directive is not supported in the standard Windows C library, and you will get an error in the log browser if you try using it.

Thankfully, I found an official list of Windows-supported directives for use with the strptime function. Those should all work in a Coveo extension too.
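As a quick sanity check, the directives I ended up relying on (%A, %d, %B, %Y – all on the Windows-supported list) do parse the example date once the time part is gone:

```python
from datetime import datetime

# The example date from above, with the "9:00 am CST" part removed
date_str = 'Saturday 3 February 2018'

# %A = full weekday name, %d = day of month, %B = full month name, %Y = four-digit year
parsed = datetime.strptime(date_str, '%A %d %B %Y')
print(parsed)  # 2018-02-03 00:00:00
```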

So, my problem remained: certain time parts could not be converted because no working directive existed. The only option left? Get rid of it.


I wasn’t printing it out in my search results anyway. I decided to write a regex that would find the time part of the date and remove it (some parts of this were regex found online), which involved learning re.compile, re.search, re.sub and a bunch of other fun Python regex functions and gotchas – such as:

  • Despite appearing in the vast majority of articles, the r prefix should not be placed before a regex string if the string relies on standard escape sequences. Mine definitely did, so I dropped the r and my re.compile succeeded.

Phase 3: Adjust (if necessary)

If you rebuild your source now, you should have dates coming into your field (you can check on this in the Content Browser). However, you might notice they are the wrong day! In fact, they might be one day before the actual date shown on the page. This happens because the date string is converted using your local time. Using the timedelta library, you need to add a certain number of hours so that the date matches UTC (see the full code below).
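The adjustment itself is a one-liner with timedelta. A minimal illustration – the six-hour offset matches my CST-to-UTC case, and the sample values are made up; yours may differ:

```python
from datetime import datetime, timedelta

# A parsed date that came out six hours "early" because the string was
# converted using local (CST) time
local_date = datetime(2018, 2, 2, 18, 0)

# Shift forward so the date matches UTC
utc_date = local_date + timedelta(hours=6)
print(utc_date)  # 2018-02-03 00:00:00
```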

Completion

Final code used:

from datetime import datetime, timedelta
import string
import re

try:
    # Get Coveo field value (string)
    pubdate = document.get_meta_data_value('textpubdate')
    # Try to strip out time, am/pm and timezone
    pattern = re.compile('([0-9]|0[0-9]|1[0-9]|2[0-3]):[0-5][0-9]\s(am|pm|AM|PM)\s[a-zA-Z]{3}$')
    if re.search(pattern, pubdate[0]):
        log('Match succeeded', 'Debug')
        # Replace the found pattern with nothing
        newpubdate = re.sub(pattern, '', pubdate[0])
        # Convert string value to a date, stripping any trailing spaces
        pdate = datetime.strptime(newpubdate.rstrip(), '%A %d %B %Y')
        # Add 6 hours to the date due to the time zone used in conversion
        newDate = pdate + timedelta(hours=6)
        # Set date field
        document.add_meta_data({'aoparesultdate': newDate})
except Exception as e:
    log(str(e), 'Error')

The timedelta library caused some hiccups; I had to use it in the exact way I have it written above or it didn't work. One final comment: make good use of the log() command while developing these extensions so you can see any errors or custom messages you enter in the Log Browser. Hope this helps someone, and let me know if you have any questions, comments or suggestions!

 

Source mapping with different values based on the document URL

I’ve been working with Web sources in the Coveo Cloud V2 Platform a lot lately. One interesting predicament I ran into with one of my sources was the need for a “multi-value” source mapping; meaning a mapping that would return a different string based on a condition. In my case, I had a mapping on my source for my content type field, and I needed it to return “Event” if the indexed item’s URL contained “/events”, and “Page” for everything else. I thought Item Types were exactly what I needed, but I never really found out, because I wasn’t able to create them. Instead, I decided to create an indexing pipeline extension to handle this logic and hook it up to my source.
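The decision itself is plain string logic. Here is a framework-free sketch of the mapping rule before any Coveo wiring (the example.com URLs are just illustrations):

```python
# Map an indexed item's URL to a content type: anything under /events is an
# "Event", everything else is a "Page".
def content_type_for(uri):
    return "Event" if "/events" in uri else "Page"

print(content_type_for('https://example.com/events/annual-meeting'))  # Event
print(content_type_for('https://example.com/about-us'))  # Page
```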

Creating the extension

If you aren’t familiar with Indexing Pipeline Extensions, they are essentially Python scripts you can write and attach to your sources to apply complex logic to each individual item as it goes through the indexing pipeline. I suggest going through Coveo’s documentation on the topic and its related documents for more information. Extensions can be really powerful and can do many things, like rejecting content based on a condition, setting the value of a mapping, and so on.

In this case, I had to set the value of my content type field based on the URL of a document. I knew I could use the clickableURI out-of-the-box field for this, so I wrote the following:

import string

try:
    my_uri = document.get_meta_data_value('clickableuri')
    if "/events" in my_uri[0]:
        document.add_meta_data({'aopacontenttype': "Event"})
    else:
        document.add_meta_data({'aopacontenttype': "Page"})
except Exception as e:
    log(str(e), 'Error')

It took some research into Python (V2.7.6), its syntax and available methods, but I was able to do it.

Adding the extension to the source

Next, I had to go back to the Sources screen, select my source, click (…) More, then Manage extensions, and add my extension to the source at the Post-Conversion stage, applying it to all items.

Completion

Lastly, I had to rebuild the source. Upon checking the newly indexed content in the Content Browser, I was able to see “Event” as the field value for event pages, and “Page” for all other pages. Hope this helps! I will be writing another post or two about my other experiences with indexing pipeline extensions soon.