Making More of Ferret Queries


Oh Ferret, how lovely your speed, how confusing your documentation. Maybe we’ll go over that in another post, but for now, let’s see how we can make Ferret a bit kinder to normal users by better understanding their queries.

Over at FindYourDoc we’re searching not only a large number of records, but a large variety of fields. When we get a query such as:

  Doctor in Nashville TN

You and I know the user has given us a lot to go on, but Ferret doesn’t. When humans look at that query we pull out a human understanding.

  • Doctor - a type of care provider
  • in - a throwaway word
  • Nashville - a city
  • TN - a state

So if I were a programmer (imagine that), I would form the same query using Ferret’s query syntax:

  type:Doctor city:Nashville state:TN

Well great, but our visitors are not programmers, they’re grandmothers and dog trainers, patients and college students. Let’s look at that query again, and see how we could break it down:

  • Doctor & TN - These are words that match against small, specific sets.
  • Nashville - matches data in a very large set.
  • in - throwaway word.

So let’s see how the raw query might be moved closer to my programmer’s query.

Regex to the Rescue

Wow, that was a whole intro paragraph with no code! Let’s take a basic transformation that we could do to move toward our programmer’s query, that of changing TN to state:TN. Code time!

%w( AL AK AZ AR CA CO CT DE FL GA HI ID
    IL IA KS KY LA ME MD MA MI MN
    MS MO MT NE NV NH NJ NM NY
    NC ND OH OK OR PA RI SC SD TN
    TX UT VT VA WA WV WI WY ).each do |state|
  query.gsub!( /(?:\A|\s)(#{state})(?=\s|\z)/i, " state:#{state}" )
end

And what a block of code it is. Naturally, we wouldn’t normally have a big block of states there, it would be in self.states or something similar. So what’s going on?

  query.gsub!( /(?:\A|\s)(#{state})(?=\s|\z)/i, " state:#{state}" )

For each state we run this gsub line. The regex in it has three parts:

  (?:\A|\s)

This section is a “grouping”. We know it’s a grouping because it’s in (). The magic of this particular grouping is the use of ?:, which tells the regex engine not to save what it matches for reference later on. The group still has to match, and the whitespace it matches is still part of what gsub replaces, which is why our replacement string starts with a space to put it back. We just never need to refer to that whitespace by number, so we mark the group with ?:.

Inside our non-referenced group we have a short snippet “\A|\s”. Well, that’s simple:

  • \A - The beginning of the string
  • \s - A space or other white-space

The pipe symbol, |, is an “or” in regex. So we have a non-referenced group that matches either the beginning of the string, or a space. The next segment is easy:

  (#{state})

Super easy. We’re matching a state. Note that the state is in (), which captures it so the replacement could refer to it as \1 (here we simply reuse #{state} instead). Our last section is quite similar to the first:

  (?=\s|\z)

Look at how we’ve used ?= inside the (). Adding ?= first thing in our parentheses turns it into a look-ahead assertion. We’re looking forward in the string to check that something follows, but it isn’t captured and it isn’t part of the match, so gsub leaves it where it is. The \s|\z is looking for:

  • \s - a space or other white-space
  • \z - the end of the string
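To see how the three pieces behave together, here is a small sketch with TN filled in for #{state}. The sample query and the constant name are my own, made up for illustration.

```ruby
# The pattern under discussion, with TN standing in for #{state}.
TN_PATTERN = /(?:\A|\s)(TN)(?=\s|\z)/i

match = TN_PATTERN.match("Nashville TN today")  # sample query, made up

p match[0]          # => " TN"     whole match: the leading space plus the state
p match[1]          # => "TN"      capture 1: only the state
p match.post_match  # => " today"  the look-ahead left this space unconsumed
```

The leading space is part of the match even though it has no capture number, while the trailing space is only checked and never consumed.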

So remembering |, we want to find a space or the end of the string after our match. Take a peek at the whole thing again:

%w( AL AK AZ AR CA CO CT DE FL GA HI ID
    IL IA KS KY LA ME MD MA MI MN
    MS MO MT NE NV NH NJ NM NY
    NC ND OH OK OR PA RI SC SD TN
    TX UT VT VA WA WV WI WY ).each do |state|
  query.gsub!( /(?:\A|\s)(#{state})(?=\s|\z)/i, " state:#{state}" )
end

Notice the regex also uses an i at the end. That will make our regex case insensitive so we can match TN and tn. Also notice that we don’t look for IN. Well, sorry Indiana, but we don’t want queries like:

  Cardiology in New York

to become:

  Cardiology state:IN New York

It just wouldn’t work.
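If you’d like to poke at this yourself, here’s a rough sketch that wraps the same idea in a method. It is not the FindYourDoc code: the name tag_states is hypothetical, the state list is cut down to three entries for brevity, and it swaps the per-state loop for one alternation so the query is scanned in a single pass.

```ruby
# Tag whole-word state abbreviations with a "state:" prefix.
# `states` is a list of two-letter codes (a shortened sample list below).
def tag_states(query, states)
  pattern = /(?:\A|\s)(#{states.join("|")})(?=\s|\z)/i
  # gsub (no bang) always returns a string; gsub! returns nil
  # when nothing was replaced, which is easy to trip over.
  query.gsub(pattern) { " state:#{$1.upcase}" }.strip
end

sample = %w( NY TN TX )

p tag_states("doctor in nashville tn", sample)
# => "doctor in nashville state:TN"

p tag_states("Cardiology in New York", sample)
# => "Cardiology in New York"

# Put IN back on the list and the throwaway word gets tagged:
p tag_states("Cardiology in New York", sample + %w( IN ))
# => "Cardiology state:IN New York"
```

The strip at the end only tidies up the leading space that appears when the match sits at the very start of the query.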

Wash, Rinse, Repeat

Well, neato, what other kinds of data could we apply this same technique to? Two types that I can come up with:

  • Discrete values in a set (matching a state)
  • Structured values (like a zipcode)

Take a peek at an example of the latter:

  query.gsub!( /(?:\A|\s)([0-9]{5})(?=\s|\z)/i, ' zipcode:\1' )

We’re looking for five digits standing on their own, then tacking zipcode: onto the front of them. We’ve taken our human understanding of a structure and explained it to Ferret.
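Here’s a quick sketch of how that pattern treats a few inputs. The method name tag_zipcode and the sample queries are made up for illustration; the pattern is the one above, minus the i flag, which does nothing for digits.

```ruby
# Five digits bounded by whitespace or the ends of the string.
ZIP_PATTERN = /(?:\A|\s)([0-9]{5})(?=\s|\z)/

def tag_zipcode(query)
  # \1 in a single-quoted replacement refers to the captured digits.
  query.gsub(ZIP_PATTERN, ' zipcode:\1').strip
end

p tag_zipcode("pediatrician 37201")  # => "pediatrician zipcode:37201"
p tag_zipcode("call 6155551234")     # => "call 6155551234"
p tag_zipcode("37201-1234 dentist")  # => "37201-1234 dentist"
```

The look-ahead is what keeps the first five digits of a phone number or a ZIP+4 code from being tagged: whatever follows the digits has to be whitespace or the end of the string.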

Ferret has a concept of weighting certain fields, and that can help tweak your results to better match your queries, but tricks like this can help a lot. Searching for:

  Doctor in Nashville TN

Without FindYourDoc’s query tweaking, my top result is scored at ~0.46. With it turned on, the top result is ~7.97. That’s a sign Ferret is doing much better at understanding what I was asking for. We can’t manage to trap city names, since there are just too many, but we can trap the provider type and state to get this:

  type:Doctor in Nashville state:TN
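Chaining the pieces might look something like the sketch below. Everything named here (tag_field, rewrite_query, and the short PROVIDER_TYPES and STATE_CODES lists) is hypothetical and only there to show the shape of the idea; the real tweaker surely differs.

```ruby
# Hypothetical, shortened lists; a real site would load these from its data.
PROVIDER_TYPES = %w( Doctor Dentist Chiropractor ).freeze
STATE_CODES    = %w( NY TN TX ).freeze

# Prefix any whole word found in `values` with "field:", using the
# spelling from the list rather than whatever the user typed.
def tag_field(query, field, values)
  canonical = values.map { |v| [v.downcase, v] }.to_h
  pattern = /(?:\A|\s)(#{values.join("|")})(?=\s|\z)/i
  query.gsub(pattern) { " #{field}:#{canonical[$1.downcase]}" }
end

# Run each rewrite in turn, then trim the stray leading space.
def rewrite_query(query)
  query = tag_field(query, "type", PROVIDER_TYPES)
  query = tag_field(query, "state", STATE_CODES)
  query = query.gsub(/(?:\A|\s)([0-9]{5})(?=\s|\z)/, ' zipcode:\1')
  query.strip
end

p rewrite_query("Doctor in Nashville TN")
# => "type:Doctor in Nashville state:TN"
```

Already-tagged terms are safe from a second pass, because the character before the value is a colon rather than whitespace.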

Other fields besides provider type and state are captured by our query tweaker as well. Those tweaks give every visitor a personal programmer to help re-phrase what they say, and that makes our results far better for grandmothers and dog trainers. Try it out on your own Ferret site, it doesn’t disappoint.