Test 5 - Scoring Tiers
Last updated
Two candidates were compared: a control (the AND query from test 4) and a candidate which stacked a loose, generic query with a set of much more constrained and highly boosted queries. #246
By layering up the queries from a low-precision, high-recall generic query with no boost, to a highly boosted set of precise queries on a specific set of fields, we're able to tune the precision and recall of our queries and match our queries directly to user intentions. We can also continuously fine-tune these queries as more intentions/expectations are added.
Here, we're stacking a base query with two equally weighted AND queries across subject and genre, followed by an even more heavily weighted OR query on the title. The code itself can be seen here.
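The stacking described above can be sketched as an Elasticsearch `bool` query built in Python. This is a minimal illustration rather than the deployed query: the field names (`title`, `subjects`, `genres`, `description`), the boost values, and the function name `scoring_tiers_query` are all assumptions made for the example.

```python
import json


def scoring_tiers_query(text):
    """Build a tiered query as a plain dict (Elasticsearch query DSL).

    Field names and boost values are hypothetical, chosen only to
    illustrate the tiering; they are not taken from the real query.
    """
    return {
        "bool": {
            "should": [
                # Tier 0: generic base query. Low precision, high recall, no boost.
                {
                    "multi_match": {
                        "query": text,
                        "fields": ["title", "subjects", "genres", "description"],
                        "operator": "or",
                    }
                },
                # Tier 1: two equally weighted AND queries on subject and genre.
                {"match": {"subjects": {"query": text, "operator": "and", "boost": 4}}},
                {"match": {"genres": {"query": text, "operator": "and", "boost": 4}}},
                # Tier 2: a more heavily weighted OR query on the title.
                {"match": {"title": {"query": text, "operator": "or", "boost": 8}}},
            ]
        }
    }


if __name__ == "__main__":
    print(json.dumps(scoring_tiers_query("everest chest"), indent=2))
```

Because every clause sits under `should`, a document only has to match the base query to be returned, while each boosted clause it also matches adds to its score. That is what lets the tiers shape ordering without reducing recall.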
A bug in the deployment of the query meant that all traffic was directed to the new query, and none to the AND query.
We haven't observed a drop in any of the most significant metrics, like clicks per search, over the first few days of the candidate's release:
first page only: 0.235, 0.234
beyond first page: 0.557, 0.552
While we don't have a parallel dataset to check the candidate against, we can look at the data we have from the candidate and the data from the previous week of the AND query to produce the following results, showing no significant quantitative difference in the results:
Subjectively, the results from this query seem significantly better than those from the AND query: we're not seeing so many queries returning 0 results, precision seems high (the top of the list of results seems intuitively to be the most relevant), and we're matching enough terms that the recall also seems high. Recall here is driven by the generic base query, which we may want to continue to tune to return fewer results in future, perhaps by setting some value of minimum_should_match.
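As a rough sketch of that tuning, `minimum_should_match` can be set on the base query so that a document has to match a minimum share of the search terms before it is returned. The helper name `base_query`, the field list, and the `"75%"` value below are assumptions for illustration only, not values from the real query.

```python
def base_query(text, minimum_should_match=None):
    """Build the generic base clause, optionally tightened.

    `minimum_should_match` accepts the usual Elasticsearch forms,
    e.g. an integer such as 2 or a percentage string such as "75%".
    Field names here are hypothetical.
    """
    clause = {
        "query": text,
        "fields": ["title", "subjects", "genres", "description"],
        "operator": "or",
    }
    if minimum_should_match is not None:
        # Require at least this many (or this share of) query terms to match.
        clause["minimum_should_match"] = minimum_should_match
    return {"multi_match": clause}


loose = base_query("everest chest")
tighter = base_query("everest chest", minimum_should_match="75%")
```

Note that for the default `best_fields` type of `multi_match`, `minimum_should_match` is applied per field, so the effect on recall depends on how the search terms are spread across fields.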
As an example of the improvements we're seeing, 'everest chest' now returns the expected results at the top of the list, followed by pictures of Everest, followed by pictures of other chests.
We're going forward with the scoring tiers query, and will fine-tune our ordering of fields and intentions in the next test, probably adding n-grams to a few of the fields to increase precision and shrink recall slightly.