Qualitative evaluation of search ranking algorithms

by Bharadwaj Ramachandran

When a customer lands on the Thumbtack homepage, they implicitly trust our product to help them find the best business for the job they have in mind — be it painting their home, hiring a photographer, or finding an accountant to do their taxes. The first step in this experience is interacting with our search bar and sifting through a list of ranked businesses.

Browsing search results is an integral part of the customer journey towards finding the right pro to hire on Thumbtack. It stands to reason that this surface is the subject of many of our experiments. However, iterating on search ranking poses unique challenges during development. We rely heavily on machine learning to power search ranking, a topic we’ve discussed in a previous post. As we develop a new search ranking algorithm, we need to verify that it works as intended, both qualitatively and quantitatively.

Figure 1: A search results page on Thumbtack

To evaluate results quantitatively, we use a simulator tool that replays historical search requests through two different algorithms and compares their results. We’ve covered the details of this in a previous blog post. On top of that, new models are quantitatively evaluated offline against test and validation datasets. However, neither offline evaluation in Jupyter notebooks nor the simulator tool is designed to help us with the following:

  1. Quickly debug why a particular model ranks a business at a specific position in the results, relative to other businesses
  2. Perform qualitative (human) evaluation of the new ranking model’s overall performance on a set of searches, relative to the baseline model
  3. Visualize how different parts of the ranking algorithm interact with each other
  4. View model inputs and outputs for a particular search result in one place

To address these problems, we built an internal tool that we call the “side-by-side evaluator tool”, or “SxS tool” for short. To understand how the tool works, it helps to understand the basics of our ranking architecture. Our modular ranking system allows us to compose different ranking algorithms and additional components such as filters on top of one another. To illustrate how this works, let us examine the structure of a ranker.

The structure of a ranker

Our ranking system consists of two stages. The first stage is candidate generation, where we take a customer’s query and whittle down the relatively large search space of businesses across the country into an un-ranked list of a few hundred businesses deemed relevant to the search. While this stage is interesting, it is not the focus of the SxS tool. The tool deals with the second stage, ranking, which takes in an unranked set of relevant businesses and orders them.
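As a rough sketch, the two-stage flow might look like the following Python. The `Business` type and function names here are hypothetical placeholders to illustrate the structure, not Thumbtack's actual code.

```python
# A minimal sketch of the two-stage structure described above, under assumed types.
from dataclasses import dataclass
from typing import List


@dataclass
class Business:
    business_id: str
    category: str
    zip_code: str


def generate_candidates(query: str, location: str, index: List[Business]) -> List[Business]:
    """Stage 1: narrow the nationwide search space to a few hundred relevant,
    un-ranked businesses (simplified here to a category/location match)."""
    return [b for b in index if b.category == query and b.zip_code == location][:300]


def rank(candidates: List[Business]) -> List[Business]:
    """Stage 2: order the candidate list. The SxS tool focuses on this stage."""
    # Placeholder ordering; the real ranker is an ML ensemble plus filters and
    # re-rankers, which the rest of the post walks through.
    return sorted(candidates, key=lambda b: b.business_id)
```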

Figure 2: The structure of an example ranker

As can be seen in Figure 2, the example ranker runs three steps sequentially.

It first runs an ensemble of machine learning models. In this case, there are two models to run: the contact model and the response model. Each model has a list of features nested beneath it, and the results of the models are combined to produce a ranking score. The list is reordered based on the ranking score and passed into the next step.

Next, it runs a series of filters. Each filter in this component is responsible for removing businesses from the list if they satisfy a certain condition. In this example, there are two filters: the DedupeFilter and the TruncateFilter. The first ensures that there are no duplicates, and the second truncates the list of businesses to reduce the size of the response we return to the front end.

Lastly, the re-ranker step is responsible for reordering the list based on heuristics that aren’t used in our machine learning models. In our example, there is a single re-ranker called the SortByReviewsReranker. It gives customers the ability to reorder the list based on an additional search term that searches over business reviews.
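Putting the three steps together, a simplified, hypothetical version of this composable ranker could look like the sketch below. The component names mirror the post (models, filters, re-ranker), but the interfaces, feature names, and model formulas are placeholders rather than our production code.

```python
# A minimal sketch of the composable ranker in Figure 2, under assumed interfaces.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScoredBusiness:
    business_id: str
    features: Dict[str, float]
    reviews_text: str = ""
    score: float = 0.0


# Step 1: an ensemble of models, each scoring a business from its own features.
def contact_model(features: Dict[str, float]) -> float:
    return features.get("historical_contact_rate", 0.0)  # placeholder feature name


def response_model(features: Dict[str, float]) -> float:
    return features.get("response_rate", 0.0)  # placeholder feature name


def run_models(businesses: List[ScoredBusiness]) -> List[ScoredBusiness]:
    for b in businesses:
        # Combine model outputs into a single ranking score (weighted sum as a placeholder).
        b.score = 0.7 * contact_model(b.features) + 0.3 * response_model(b.features)
    return sorted(businesses, key=lambda b: b.score, reverse=True)


# Step 2: filters remove businesses that satisfy a condition.
def dedupe_filter(businesses: List[ScoredBusiness]) -> List[ScoredBusiness]:
    seen, deduped = set(), []
    for b in businesses:
        if b.business_id not in seen:
            seen.add(b.business_id)
            deduped.append(b)
    return deduped


def truncate_filter(businesses: List[ScoredBusiness], limit: int = 50) -> List[ScoredBusiness]:
    return businesses[:limit]


# Step 3: re-rankers reorder the list using heuristics outside the ML models.
def sort_by_reviews_reranker(businesses: List[ScoredBusiness], review_term: str) -> List[ScoredBusiness]:
    if not review_term:
        return businesses
    # Stable sort: businesses whose reviews mention the term move to the front.
    return sorted(businesses, key=lambda b: review_term.lower() in b.reviews_text.lower(), reverse=True)


def rank(businesses: List[ScoredBusiness], review_term: str = "") -> List[ScoredBusiness]:
    ranked = run_models(businesses)
    ranked = truncate_filter(dedupe_filter(ranked))
    return sort_by_reviews_reranker(ranked, review_term)
```

In a structure like this, swapping out a model, filter, or re-ranker, or toggling one off, yields a different ranked list, which is exactly the kind of comparison the SxS tool makes side by side.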

Now that we know the rough structure of a ranker and understand this example, let’s take a look at what the side-by-side evaluator tool allows us to do.

The SxS tool

Figure 3: Parameters of the SxS tool

The image above shows the top half of the SxS tool. Manual testing and evaluation of new algorithms is a core use case of the tool, so we added drop-downs at the top of the page to vary the geographic region and search term for which we’re comparing results. In this case, the page we’ve selected is roofers in Dallas, TX. The tool also lets us choose which rankers we’re evaluating “side-by-side” in the two “Ranker” drop-downs. The remaining options allow us to parametrize various aspects of the request, as well as verify whether either list is sorted according to a specific field. These options are especially useful when debugging in our development environment. However, the two most important parts of the tool are hidden behind the “show ranker config” and “show metadata selector” buttons.
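To make these parameters concrete, here is a hypothetical sketch of the comparison request that could be assembled from the drop-downs; the `SxSRequest` type and its field names are illustrative, not the tool's actual schema.

```python
# Illustrative only: a comparison request assembled from the tool's parameters.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SxSRequest:
    search_term: str                        # e.g. "roofers"
    region: str                             # e.g. "Dallas, TX"
    left_ranker: str                        # ranker selected in the left drop-down
    right_ranker: str                       # ranker selected in the right drop-down
    sort_check_field: Optional[str] = None  # verify a list is sorted by this field


request = SxSRequest("roofers", "Dallas, TX", "baseline_ranker", "candidate_ranker")
```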

Ranker configuration

A “ranker config” is a complete diagram of the ranker, similar to the nested structure of a ranker that we walked through earlier in this post. Two ranker configurations are visible in Figure 4. In the SxS tool, ranker configurations are reconfigurable on the fly. Checkboxes next to each part of the ranker allow the user to deselect entire components and models, or even do something as granular as deselecting a particular feature in a model to see how zeroing out that feature changes the output. In the image below, you can see an example where we deselect various features on the right-hand side to see how doing so affects the search results.

Figure 4: Editing the ranker configuration on the SxS tool
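As an illustration of what deselecting a feature could mean under the hood, the following hypothetical sketch zeroes out disabled features before they reach the models; the config structure shown is an assumption, not the tool's actual implementation.

```python
# Illustrative sketch: zero out features the user deselected in the ranker config.
from typing import Dict, List


def apply_ranker_config(features: Dict[str, float], config: Dict[str, List[str]]) -> Dict[str, float]:
    """Return a copy of the feature map with deselected features zeroed out."""
    disabled = set(config.get("disabled_features", []))
    return {name: (0.0 if name in disabled else value) for name, value in features.items()}


# Example: see how zeroing out "response_rate" changes the model's input.
config = {"disabled_features": ["response_rate"]}
print(apply_ranker_config({"historical_contact_rate": 0.4, "response_rate": 0.9}, config))
# -> {'historical_contact_rate': 0.4, 'response_rate': 0.0}
```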

Metadata selector

The metadata selector allows the developer to view values for particular features so that they can better understand the rank of a given business in the results. Not only does this make debugging much easier, but it also improves our ability to understand why a given business ranks higher or lower in one ranker versus the other. You can also see in Figure 5 that the SxS tool points out when a business increases or decreases in rank, which makes drastic changes easier to spot and investigate prior to launching an experiment. For example, 1 → 7 in the top left of the image indicates that the business ranked in the 1st position on the ranker on the left has moved to the 7th position in the ranker on the right. By examining the metadata fields in both rankers, the developer can deduce what caused the change in rank.

Figure 5: Side-by-side comparison with metadata selections made
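The rank-change annotations (such as 1 → 7) could be computed with logic along these lines; this is a purely illustrative sketch rather than the tool's code.

```python
# Illustrative sketch: annotate how each business's rank changes between two rankers.
from typing import Dict, List


def rank_changes(left: List[str], right: List[str]) -> Dict[str, str]:
    """Map each business id in the left ranker to its position change on the right."""
    right_pos = {business_id: i + 1 for i, business_id in enumerate(right)}
    return {
        business_id: f"{i + 1} -> {right_pos.get(business_id, 'filtered out')}"
        for i, business_id in enumerate(left)
    }


print(rank_changes(["biz_a", "biz_b", "biz_c"], ["biz_b", "biz_c", "biz_a"]))
# -> {'biz_a': '1 -> 3', 'biz_b': '2 -> 1', 'biz_c': '3 -> 2'}
```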

Conclusion and next steps

Let’s revisit the goals of the SxS tool that we stated at the beginning and outline how it addresses each of our four asks.

  1. The metadata selector gives the developer the ability to inspect the data behind each result at a granular level, fulfilling our first goal of making debugging easier.
  2. Viewing changes side by side in the SxS tool makes human evaluation of a particular search easier. This is especially true when the evaluation is done across members of the ranking team, and when we use the analytics tools at our disposal to ensure that our human evaluation covers a wide variety of search terms, geographies, and markets.
  3. The ability to change rankers and toggle their sub-components in the ranker config allows us to visualize how different parts of the ranker affect search results.
  4. The metadata selector tool functions as a shortcut to see how model inputs translate to model outputs for a particular search result.

As for next steps, there are many quality-of-life improvements we could make to the tool. Right now, developers need to do a significant amount of scrolling to compare ranking metadata for a business whose rank changed between the ranker on the left and the ranker on the right. For example, if a business was ranked in the 2nd spot on the left but the 8th spot on the right, the developer would have to scroll between those two positions to compare ranking metadata for the same business across the two rankers. On the more quantitative side, we might want to view some descriptive statistics about each ranker at a glance to make it easier to spot data issues. Going forward, we want not only to visualize the features, configurations, and machine learning model outputs, but also to introduce machine learning model explainability into the SxS tool so we can better visualize model predictions.

If problems like search ranking, machine learning, and model evaluation interest you, join us as we build out a robust marketplace for local services!

Special thanks to Navneet Rao, Joe Tsay, Richard Demsyn-Jones, Mark Andrew Yao, Karen Lo, and Dhananjay Sathe for feedback on this post.

Bharadwaj is a Senior Software Engineer at Thumbtack. This story was originally published at https://medium.com on May 18, 2021.
