
Test 5 - Scoring Tiers



Candidates

Two candidates were compared: one control (the AND query from test 4), and one which stacked a loose, generic query with a set of much more constrained and highly boosted queries. #246

By layering up the queries, from a low-precision, high-recall generic query with no boost to a highly boosted set of precise queries on a specific set of fields, we can tune the precision and recall of our results and match queries directly to user intentions. We can also continuously fine-tune these queries as more intentions and expectations are added.

Here, we're stacking a base query with two equally weighted AND queries across subject and genre, followed by an even more heavily weighted OR query on the title. The code itself can be seen here.
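
To make the structure concrete, here's a minimal sketch of how a tiered query like this could be expressed as an Elasticsearch bool query, with should clauses ranging from a loose, unboosted base to tightly constrained, heavily boosted clauses. The field names, boost values, and search terms are illustrative assumptions, not the production query (see the linked code for the real thing).

```python
# Illustrative sketch of a "scoring tiers" query body, built as a plain dict
# for Elasticsearch. Field names and boost values are assumptions.
search_terms = "everest chest"

scoring_tiers_query = {
    "bool": {
        "should": [
            # Tier 0: loose, unboosted base query -- high recall, low precision.
            {
                "multi_match": {
                    "query": search_terms,
                    "fields": ["data.*"],
                    "operator": "or",
                }
            },
            # Tier 1: two equally weighted AND queries on subjects and genres.
            {
                "match": {
                    "data.subjects.label": {
                        "query": search_terms,
                        "operator": "and",
                        "boost": 1000,
                    }
                }
            },
            {
                "match": {
                    "data.genres.label": {
                        "query": search_terms,
                        "operator": "and",
                        "boost": 1000,
                    }
                }
            },
            # Tier 2: an even more heavily weighted OR query on the title.
            {
                "match": {
                    "data.title": {
                        "query": search_terms,
                        "operator": "or",
                        "boost": 2000,
                    }
                }
            },
        ],
        "minimum_should_match": 1,
    }
}

# e.g. es.search(index="works", body={"query": scoring_tiers_query})
```

Because all of the clauses sit in a single should array, a document matching only the base tier still scores something, while matches on the boosted tiers dominate the ordering of results.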

Results

A bug in the deployment of the query meant that all traffic was directed to the new query, and none to the AND query.

We haven't observed a drop in any of the most significant metrics, like clicks per search, over the first few days of the candidate's release:

|                   | AND query | scoring tiers |
| ----------------- | --------- | ------------- |
| first page only   | 0.235     | 0.234         |
| beyond first page | 0.557     | 0.552         |

While we don't have a parallel dataset to check the candidate against, we can compare the data we have from the candidate's release with the data from the previous week of the AND query, which shows no significant quantitative difference in the results.

Subjectively, this query seems significantly better than the AND query: we're seeing far fewer queries return 0 results, precision seems high (the top of the list of results is intuitively the most relevant), and we're matching enough terms that recall also seems high. Recall here is driven by the generic base query, which we may want to tune to return fewer results in future, perhaps by setting some value of minimum_should_match.
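
As a rough illustration of that tuning (the 75% value below is an arbitrary assumption, not a tested setting), minimum_should_match can be set on the loose base tier so that only documents matching a reasonable proportion of the user's terms are returned:

```python
# Sketch: tighten the unboosted base tier by requiring a proportion of the
# user's search terms to match. The "75%" threshold is illustrative only.
base_tier = {
    "multi_match": {
        "query": "everest chest",
        "fields": ["data.*"],
        "operator": "or",
        "minimum_should_match": "75%",
    }
}
```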

As an example of the improvements we're seeing with the scoring tiers query, 'everest chest' now returns the expected results at the top of the list, followed by pictures of Everest, then pictures of other chests.

Conclusions

We're going forward with the scoring tiers query, and will fine-tune our ordering of fields and intentions in the next test, probably adding n-grams to a few of the fields to increase precision and shrink recall slightly.
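
As a very rough sketch of what adding n-grams to a field might involve (the analyzer configuration, gram sizes, and field names here are assumptions, not the planned change), a field like the title could gain an n-gram subfield for a future query tier to match against:

```python
# Hypothetical index settings adding a trigram subfield to the title field.
# Gram sizes, analyzer names, and field paths are assumptions for illustration.
index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "trigram_tokenizer": {"type": "ngram", "min_gram": 3, "max_gram": 3}
            },
            "analyzer": {
                "trigram_analyzer": {
                    "type": "custom",
                    "tokenizer": "trigram_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "data": {
                "properties": {
                    "title": {
                        "type": "text",
                        "fields": {
                            "trigrams": {"type": "text", "analyzer": "trigram_analyzer"}
                        },
                    }
                }
            }
        }
    },
}
```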