Testing Search Relevance the Lean Way

Mayank Jaiswal
12 min read · Aug 2, 2019

Previous Article: Lean Way to Build Product Search

What is Search Relevance?

Recall the feeling of frustration when you are searching for your car key in your house — you know it’s somewhere nearby but you just can’t seem to find it. Either you lost the key outside the home or misplaced it somewhere in the house itself. You will only know after finding it after some hustle, or you may never find it at all. With me yet?

Now imagine going to a website and searching for something, and the results look terrible. They don’t clearly show what you are looking for. They also don’t say clearly that the item you are looking for is not available. They just give you some vague results, as if to say “Maybe I have what you are looking for, keep looking”.

Alas! — you have come to a website which has a low “Search Relevance”.

I searched for an Apple iPhone! This is what I got. I am not hungry, duh! Well, this is what irrelevance looks like.

If you search for “Apple iPhone” on some e-commerce website and you get apples (the fruit) in the results, the results are certainly irrelevant! But if you get all the variants of the iPhone (white, gold, black, 32GB, 64GB, etc.) as search results, you can say that the results are indeed “relevant” to you.

Another example: if you search for the book “Blue Ocean Strategy” on Amazon, you can see the second result is “Blue Ocean Shift”, and the third result is “Zero to One” by Peter Thiel. One might argue that the second and third results are not a textual match and hence irrelevant. But a reader who reads Blue Ocean Strategy might also be interested in reading its sequel, “Blue Ocean Shift”, and will very likely be interested in reading another gem on entrepreneurship, “Zero to One”. The results are indeed relevant!

They look oceans apart (pun intended), but actually they aren’t. These books are indeed correlated and hence relevant.

From the above two examples, you can guess how broad a definition Search Relevance can take. If defining relevance is so hard and vague, how difficult will it be to test relevance! How can you test something that is difficult even to define?

Testing Search Relevance & Pragmatism

People who have gotten their hands dirty with testing in regular applications will agree with me that writing test cases to get decent coverage is not an easy task. It takes significant effort from the developer’s end to write all those unit tests. You have to think about the application from different vantage points to come up with a set of solid test cases.

In contrast to application testing, search relevance testing is an even more elusive target. You can never reach a point where you can say that relevance is now in perfect condition. You can’t say something like “I have 80% relevance coverage”. Striving for perfection in any software engineering project is a recipe for disaster, and search relevance is no different. We need a strategy of pragmatism.

According to Google:
Pragmatism = An approach that evaluates theories or beliefs in terms of the success of their practical application.

In this article, we will discuss several pragmatic approaches to discovering test cases whose results can be measured. A relevance engineer can then use these test cases as her measuring yardstick to make progress towards the elusive goal of Relevant Search Testing.

Emotions felt by developers and teams who invest and don’t invest in test cases.

This article is all about how to become the God of Search Relevance!

Manual Harvesting

Generate Test Cases — Manually

Problems well-defined are problems solved — Raymond Hettinger, Python Core Developer

The first step to fixing relevance is to understand what is important to users.

Top 100 Search Queries — Goldmine for relevance

The fastest way to find out the most important search queries for a business is to simply look at the top 100 search queries from the past. These are what users search for the most; hence, the most gain will be achieved if we optimise for these queries. Courtesy: the age-old 80/20 Principle.

Manually search all these 100 queries yourself on the existing website or the app to figure out what’s broken in the current experience:

  1. Ask your data engineer to find out the top 100 search queries over the past month or so (a minimal script for this follows the list below).
  2. Take these 100 search queries and manually search all of them on the website to figure out, qualitatively, what you do not like in the search results. Create a list of problems and work backwards towards the solution. The quality of the “problems” you define by looking at this list will govern how your relevance improvement efforts work out in the near future. Try to identify the entities that people are searching for. For example, in the case of an e-commerce site, you might find that people search using these entities (entities are in italics):

    - Category names like “mobile phones”
    - Brand names like “Samsung”
    - Product features like “yellow top”
    - Occasions like “party wear”, “Diwali dresses”
    - Domain-specific cohorts like the age group “baby”, so one may search “baby clothes”

    For a website like Goodreads, authors and books will be the first-class entities. For Facebook, people, groups, pages, events etc. can be the first-class entities. You, as a Relevance Engineer, need to somehow figure out who the first-class citizens are in your domain.
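Step 1 is easy to automate. Below is a minimal sketch in Python, assuming a plain-text query log with one raw query per line; the file name queries.log is hypothetical and would be whatever your logging pipeline produces.

```python
# Minimal sketch: pull the top 100 search queries out of a raw query log.
# The log format (one query per line) and the file name are assumptions.
from collections import Counter

def top_queries(log_path: str, n: int = 100) -> list[tuple[str, int]]:
    counts: Counter = Counter()
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            query = line.strip().lower()
            if query:
                counts[query] += 1
    return counts.most_common(n)

if __name__ == "__main__":
    for query, count in top_queries("queries.log"):
        print(f"{count:6d}  {query}")
```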

Later, when you have more time, schedule a daily report and set alarms on how search is doing on the top 100 keywords. You can measure CTR for these, and any drop will be worth looking at. More on this later.

The better the quality of the domain-specific entities you can identify, the easier it will be to optimise Search Relevance.

Ask the domain experts

We talked about entities and taxonomies in the above section. They are the nouns that dominate the domain whose relevance you are trying to debug.

If you are working with Product Managers who are passionate about the product, you need to let them speak and come up with their pain points. Hidden between the words will lie a lot of entities and taxonomies that they think are important. You need to have this mental framework of hunting for these “entities” to be able to translate relevance complaints into relevance test-cases!

Listen! Listen to your users. Listen to your PMs. Listen to your data. Don’t invent imaginary use cases.

Make them create a list of problems that are important to them. You can use tools like Quepid to do so.

Automated Harvesting

Automated Test-Case Generation

What if we could just configure some rules, write a few scripts, and have an automation suite tell us how good our search is?

Automated Entity Testing

Let’s say that, for a company, you have found the important entities to be: product title, category, brand, and email IDs.

Example 1: Product title. An obvious test is taking a document’s title, issuing it to the search engine as a query, and checking whether that document is returned as the first result.

Example 2: Search for a category name; if products from some other category show up, it could be a potential issue. Similarly, search for a brand name and see if any other brands show up.

Example 3: Search for a person’s email ID and check that that person indeed comes up as the top result.
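Examples 1–3 are straightforward to script. Here is a minimal sketch of Example 1 against Elasticsearch; the URL, the index name products, the field name title, and the sample documents are all assumptions, so adapt them to your setup.

```python
# Minimal sketch of automated entity testing against Elasticsearch.
# Index name ("products"), field name ("title") and the URL are assumptions.
import requests

ES_URL = "http://localhost:9200/products/_search"

def title_is_top_result(doc_id: str, title: str) -> bool:
    """Search for a document's own title and check it comes back first."""
    body = {"size": 1, "query": {"match": {"title": title}}}
    hits = requests.get(ES_URL, json=body).json()["hits"]["hits"]
    return bool(hits) and hits[0]["_id"] == doc_id

def run_suite(docs: dict[str, str]) -> None:
    failures = [(i, t) for i, t in docs.items() if not title_is_top_result(i, t)]
    print(f"{len(docs) - len(failures)}/{len(docs)} title tests passed")
    for doc_id, title in failures:
        print(f"FAIL: {doc_id!r} not top result for its own title {title!r}")

# Hypothetical (doc_id, title) pairs sampled from the catalog.
run_suite({"42": "Blue Ocean Strategy", "43": "Zero to One"})
```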

Automated Shingles Detection

Consider the example “make-up”. It can be written in many forms: make up, make-up, makeup, etc. Matching all of these to each other can be done using query-time synonyms and shingles.
Query time synonyms:

  • make-up → makeup
  • make up → makeup

Shingles in the index analyzer should do something similar:

  • make up → makeup
  • make-up → makeup

If we do the above, all of these will match each other — make up, make-up, makeup.

It’s easy to configure an index shingle analyser, but synonym detection (i.e. make-up, makeup, make up) needs to be done by scanning the corpus, and it can be automated using a simple Python script.
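Here is one such simple Python script, sketched under the assumption that a compound candidate is a frequent word that splits into two other frequent words from the same vocabulary:

```python
# Minimal sketch: scan a corpus vocabulary for compound variants such as
# "make up" / "make-up" / "makeup", which are candidates for synonym rules.
import re
from collections import Counter

def find_compound_variants(corpus: list[str], min_count: int = 2) -> list[tuple[str, str]]:
    vocab = Counter(w for text in corpus for w in re.findall(r"[a-z]+", text.lower()))
    candidates = []
    for word, count in vocab.items():
        if count < min_count:
            continue
        # Try every split point: if both halves are themselves frequent
        # words, "word" is probably a compound ("makeup" -> "make" + "up").
        for i in range(2, len(word) - 1):
            left, right = word[:i], word[i:]
            if vocab[left] >= min_count and vocab[right] >= min_count:
                candidates.append((f"{left} {right}", word))
    return candidates

print(find_compound_variants(["makeup kit", "make up tips", "make-up remover",
                              "make it up", "makeup brush set"]))
# [('make up', 'makeup')]
```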

Automated Synonym Detection

For a fashion website, “ethnic dresses” will mean the same as “traditional dress”, and “swimwear” == “swimming costume”. For digital electronics, Windows and Microsoft are heavily correlated. There are various ways to automatically detect synonyms, which are out of scope here.

Automated Synonym Testing

There are methods to detect synonyms, but there is a chance that not all synonyms will work well with the system. We can track the performance of every synonym that we add to the system. If any synonym backfires, i.e. CTR drops because of it, we can remove it from the system.
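A minimal sketch of this tracking, assuming you can aggregate (clicks, impressions) for the affected queries before and after each synonym goes live; the event schema here is hypothetical:

```python
# Minimal sketch: flag synonyms whose introduction coincides with a CTR drop.
# The (clicks, impressions) aggregation per period is an assumption.

def ctr(clicks: int, impressions: int) -> float:
    return clicks / impressions if impressions else 0.0

def flag_backfiring_synonyms(stats: dict[str, dict[str, tuple[int, int]]],
                             max_drop: float = 0.10) -> list[str]:
    """stats maps synonym -> {"before": (clicks, impressions), "after": (...)}.
    Returns synonyms whose affected queries lost more than `max_drop` CTR."""
    flagged = []
    for synonym, periods in stats.items():
        before, after = ctr(*periods["before"]), ctr(*periods["after"])
        if before - after > max_drop:
            flagged.append(synonym)
    return flagged

stats = {"make-up => makeup": {"before": (120, 1000), "after": (115, 1000)},
         "swimwear => swimming costume": {"before": (80, 400), "after": (30, 420)}}
print(flag_backfiring_synonyms(stats))  # ['swimwear => swimming costume']
```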

Automated Important Attributes Detection

For every word in the queries from the past six months or so, look for the places where that word lies in the documents. For example, for the query “red sweater”, you will find “red” in the “colour” attribute and “sweater” in the “category” attribute. This way you can generate a list of important attributes you should search.
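A minimal sketch of this attribute counting, assuming queries and documents are already loaded in memory as plain strings and dicts:

```python
# Minimal sketch: for each query word, count which document attribute it
# appears in; attributes that match often are the ones worth searching/boosting.
from collections import Counter

def important_attributes(queries: list[str], docs: list[dict[str, str]]) -> Counter:
    hits: Counter = Counter()
    for query in queries:
        for word in query.lower().split():
            for doc in docs:
                for attr, value in doc.items():
                    if word in value.lower().split():
                        hits[attr] += 1
    return hits

# Hypothetical documents with colour/category/title attributes.
docs = [{"colour": "red", "category": "sweater", "title": "red wool sweater"}]
print(important_attributes(["red sweater"], docs))
# Counter({'title': 2, 'colour': 1, 'category': 1})
```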

Automated Analyser Configuration Detection

Let’s say you are contemplating whether to use the english or the minimal_english stemmer in your analysis chain. Take all the words in the corpus and pass them through both analyzers. Group together the words that map to the same output token. Then see which analyzer works better for your data by looking at ~100 entries in the resulting list of groups.
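This comparison can be scripted against Elasticsearch’s _analyze API, which accepts an ad-hoc tokenizer and filter chain. A minimal sketch follows; the local URL is an assumption, while english and minimal_english are real stemmer names:

```python
# Minimal sketch: compare two stemmers via Elasticsearch's _analyze API and
# group corpus words by the token each stemmer produces.
from collections import defaultdict
import requests

ES_ANALYZE = "http://localhost:9200/_analyze"

def stem_groups(words: list[str], language: str) -> dict[str, list[str]]:
    groups: dict[str, list[str]] = defaultdict(list)
    for word in words:
        body = {"tokenizer": "standard",
                "filter": [{"type": "stemmer", "language": language}],
                "text": word}
        tokens = requests.get(ES_ANALYZE, json=body).json()["tokens"]
        groups[tokens[0]["token"]].append(word)
    return groups

words = ["running", "runs", "ran", "dress", "dresses", "dressing"]
for language in ("english", "minimal_english"):
    print(language, dict(stem_groups(words, language)))
```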

Automated Broad Words Detection

Look for words that have broad meanings, i.e. words that cover a big span of documents, for example “wear” in “partywear”. If a word has results in many categories, it may be a word with a broad meaning. If we know that the word “wear” can mean anything a person can wear, then any category which is a wearable and has “party” in it should also match “partywear”.
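A minimal sketch of broad-word detection, assuming each document carries a title and a category field (both names are assumptions):

```python
# Minimal sketch: flag "broad" tokens, i.e. tokens whose matching documents
# span many distinct categories (like "wear" in "partywear").
from collections import defaultdict

def broad_words(docs: list[dict[str, str]], min_categories: int = 3) -> dict[str, set]:
    categories_per_token: dict[str, set] = defaultdict(set)
    for doc in docs:
        for token in doc["title"].lower().split():
            categories_per_token[token].add(doc["category"])
    return {t: c for t, c in categories_per_token.items() if len(c) >= min_categories}

# Hypothetical catalog rows.
docs = [{"title": "party wear saree", "category": "sarees"},
        {"title": "party wear kurta", "category": "kurtas"},
        {"title": "casual wear shirt", "category": "shirts"},
        {"title": "leather wallet", "category": "accessories"}]
print(broad_words(docs))  # {'wear': {'sarees', 'kurtas', 'shirts'}}
```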

Top 100 Low-Recall Search Queries

There may exist use cases that you have never thought about. Look for search queries where the number of results itself is very low. This list will automatically tell you where your users are finding it difficult to find products. You can apply techniques like increasing fuzziness, decreasing the percentage query match, showing recommendations etc. to increase recall.
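A minimal sketch that ranks past queries by result count against Elasticsearch; the URL, index, and field names are assumptions:

```python
# Minimal sketch: rank past queries by how few results they return.
# Index name, field and URL are assumptions; `queries` would come from logs.
import requests

ES_URL = "http://localhost:9200/products/_search"

def result_count(query: str) -> int:
    body = {"size": 0, "query": {"match": {"title": query}},
            "track_total_hits": True}
    return requests.get(ES_URL, json=body).json()["hits"]["total"]["value"]

def lowest_recall(queries: list[str], n: int = 100) -> list[tuple[str, int]]:
    return sorted(((q, result_count(q)) for q in queries), key=lambda x: x[1])[:n]

for query, hits in lowest_recall(["bedside table", "blue kurta", "iphone"]):
    print(f"{hits:6d}  {query}")
```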

Note that there will be some queries where the number of search results will indeed be zero because relevant products do not exist in the system. In this case, it’s better to clearly tell the user that there are no matches, rather than diluting the results by showing less relevant ones. In some domains users often like to know when “they’re out of results” so that they don’t waste time searching or scrolling through over-diluted listings.

Top 100 Low-CTR Search Queries

You can look at queries that are issued by a significant number of people but still have an extremely low Click-Through Rate. This can be an entry point for you to reason about what can be fixed in terms of search relevance.
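A minimal sketch, assuming your analytics can hand you (query, impressions, clicks) tuples; the volume and CTR thresholds are arbitrary starting points:

```python
# Minimal sketch: surface high-volume, low-CTR queries from aggregated logs.
# The (query, impressions, clicks) tuples are assumed to come from analytics.

def low_ctr_queries(rows: list[tuple[str, int, int]],
                    min_impressions: int = 1000,
                    max_ctr: float = 0.02) -> list[tuple[str, float]]:
    suspects = [(q, clicks / imps) for q, imps, clicks in rows
                if imps >= min_impressions and clicks / imps <= max_ctr]
    return sorted(suspects, key=lambda x: x[1])

rows = [("red sweater", 5000, 600), ("apple iphone", 8000, 90), ("saree", 200, 1)]
print(low_ctr_queries(rows))  # [('apple iphone', 0.01125)]
```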

Top 10 Complaints

Make it extremely easy for people to show their emotions about how badly your search sucks. Create a portal where people can lodge complaints about search “not working”. Open this portal to all employees, or even to the public, and see what they have to say. They will complain a lot in plain English, and your job will then be to translate that into a Search Relevance problem.

How The New York Times Tackles Relevance

On one hand, this can give you extremely good ideas; on the other hand, it can give you utopian unicorn ideas which are just not possible. Henry Ford: “If I had asked people what they wanted, they would have said faster horses.” So be careful here.

Starvation Detection

There will be some products which are not showing up in search at all and are getting absolutely zero visibility. These are the products that are dying a slow death. Maybe it’s time to give them a chance by randomising the search results and letting those products make it to the top. If, after a few chances, they still don’t perform, those products can be safely removed from the system.
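A minimal sketch of starvation detection, assuming you can collect the set of product IDs that appeared in any search results over a period:

```python
# Minimal sketch: products that exist in the catalog but received zero
# search impressions over the period are "starved" candidates.

def starved_products(catalog_ids: set[str], impression_log: list[str]) -> set[str]:
    """impression_log: flat list of product IDs shown in any search results."""
    return catalog_ids - set(impression_log)

catalog = {"sku-1", "sku-2", "sku-3", "sku-4"}   # hypothetical catalog IDs
shown = ["sku-1", "sku-3", "sku-1"]              # hypothetical impression events
print(starved_products(catalog, shown))          # {'sku-2', 'sku-4'}
```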

Positive Feedback Loop Detection

Store the results returned by your system for a few queries q1, q2 and q3 for a few days or months in a row. If the results are not changing much over time for a given query, there is a high chance that your system is suffering from a positive feedback loop. You can stop this from happening by giving a negative boost to non-performing documents rather than a positive boost to documents that performed well in the past.
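A minimal sketch of this detection using Jaccard similarity between top-N result snapshots taken over time; the 0.9 staleness threshold is an arbitrary starting point:

```python
# Minimal sketch: detect a positive feedback loop by measuring how little the
# top-N result set for a query changes over time (high Jaccard = stale).

def jaccard(a: list[str], b: list[str]) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def is_stale(snapshots: list[list[str]], threshold: float = 0.9) -> bool:
    """snapshots: top-N result IDs for one query, sampled over days/months."""
    pairs = zip(snapshots, snapshots[1:])
    return all(jaccard(a, b) >= threshold for a, b in pairs)

# Hypothetical snapshots of the top results for query q1 over three samples.
q1_snapshots = [["d1", "d2", "d3", "d4"], ["d1", "d2", "d3", "d4"],
                ["d1", "d2", "d4", "d3"]]
print(is_stale(q1_snapshots))  # True: results barely move over time
```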

How not to measure Search Relevance

You might get tempted to invent use cases and come up with creative ways to search. You might complain that when you search for “table” in an online book store, the search does not show you tables. My recommendation is to first understand what your users are looking for and the way they search, then optimise your search engine accordingly. Don’t solve imaginary issues.

Machine Learning

Machine Learning techniques are the last hammer to reach for. All the techniques mentioned in this article are a precursor to any machine learning technique like Learning to Rank or genetic parameter optimisation. The optimisations mentioned in this article will lay a solid foundation for any advanced ML you may later use on your corpus for relevance tuning.

What People Say on the Internet

  • Your search is bad because I can’t find the thing I know you have.
    The techniques mentioned above will help you tackle this.
  • Your search is bad because your site doesn’t have the content I’m looking for.
    Find the words that are being searched on your website that have low CTR and genuinely don’t have a match in your app. Let’s say you have a furniture shop and people are searching for “bedside table”, but you only have beds. It’s time to add bedside tables to your inventory! Rather than confusing your users, be honest: tell them that you did not find any relevant search results, but here are products they may like.

Tools for Fixing Search Relevance

Fixing Search Relevance is mostly a matter of specifying the problem and then using the most generic approach that solves it.

Under-fitting is better than Over-fitting

Analysers, Tokenisers and Filters

Keep your analyser chain simple and easy to reason about. Explainability is a virtue worth preserving. It will keep the barrier to entry low for the people who will work on search improvements in your project in the future.

Query-time Rewriting

There is a lot of literature out there on how to do this. In my experience, adding a small set of synonyms solves a big part of the problem. Use the techniques mentioned above to hunt for synonyms.

Index-time Processing

A lot of cleanup can be done before a document reaches Elasticsearch or Solr. Do this liberally, to keep things simple by the time they enter the domain of Elasticsearch or Solr. Complexity should be dropped as early in the pipeline as possible. Simplicity is the virtue we are after.

Fixing the Catalog

Many times, documents in the corpus simply have bad data. Instead of trying too hard to cover for a bad catalog by tuning ES and Solr, it’s better to just ask the Catalog team to fix the catalog. Maybe you can help them with guidelines on how the catalog should be fixed so that it becomes more search friendly.

Don’t fix it

There will be some problems whose fixes will require so much effort that they are simply not worth it. If these are only a handful, ignore them. You want a search that works nicely for the most common use cases, not for every use case. Perfection is a sickness.

The keys to maintainable Search Relevance are Simplicity and Explainability.

Future Work

  • Add more pragmatic approaches to debug and fix Search Relevance.

How Can Euler Help You?

Ours is a team of Search and Machine Learning nerds who constantly attack the above problems for various companies. We have built pipelines for some of these problems that run on top of your existing catalog to extract this higher-level information and feed it back to the search engine. While doing this for your company, we train your engineers so that search can be improved in-house after the engagement with Euler ends.

Let us know if you would like us to help you kickstart!

Connect with us — Send an email to mayank@euler-systems.com

Thanks to Peter Dixon-Moses for early review and critical comments on the write-up.


Mayank Jaiswal

Software Engineer. Studied Computer Science at the Indian Institute of Technology, Kharagpur. Work interests: Search. Non-work interest: Personal Finance.