Query structure
Complete query json
click here for the complete query json
Intentions
The following section lists the broad intentions which we try to reflect in the structure of our query. Rank uses a list of search term / work ID pairs which illustrate these intentions to validate the performance of each candidate query. The precise examples for works can be seen here.
Users should be able to match against general work data
We include a multi_match on a list of the work's core fields.
Users should be able to match documents by their identifiers
Identifiers are matched with a multi_match and heavily boosted, so that matches always appear at the top of the list.
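A minimal sketch of how the two intentions above might sit together, written as a Python dict in Elasticsearch query DSL. The field names and the boost value are illustrative assumptions, not the ones used in the real query.

```python
import json


def build_query(search_terms: str) -> dict:
    """Combine a general clause with a heavily boosted identifier clause.

    All field names and the boost value are hypothetical placeholders.
    """
    return {
        "bool": {
            "should": [
                {
                    # General match over a list of core fields (assumed names).
                    "multi_match": {
                        "query": search_terms,
                        "fields": ["title", "description", "contributors"],
                    }
                },
                {
                    # Identifier match, boosted so that it outranks general matches.
                    "multi_match": {
                        "query": search_terms,
                        "fields": ["identifiers"],
                        "boost": 1000,
                    }
                },
            ]
        }
    }


print(json.dumps(build_query("b12345678"), indent=2))
```

Because both clauses are in should, a document scores from whichever clauses it matches, and the boost means an identifier match dominates the total.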
Users should be able to match archive identifiers
We include a match on the search.relations field.
The field is analysed with a with_slashes_char_filter, which allows us to capture hierarchical IDs like PP/CRI/A as a single token. Slashes are converted to double underscores (__) and the query is split on whitespace. See https://github.com/wellcomecollection/catalogue-pipeline/pull/1654
The field is heavily boosted to ensure that matching documents appear at the top of the list of results.
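The effect of the slash handling can be illustrated with a small Python stand-in for the analyser, alongside one way such a char filter could be declared in index settings. The analyser name with_slashes_analyzer is hypothetical, and the real analysis chain may include further filters (for example lowercasing) that this sketch leaves out.

```python
# One possible declaration, using Elasticsearch's "mapping" char filter
# and the built-in whitespace tokenizer. The analyser name is hypothetical.
index_settings = {
    "analysis": {
        "char_filter": {
            "with_slashes_char_filter": {
                "type": "mapping",
                "mappings": ["/ => __"],
            }
        },
        "analyzer": {
            "with_slashes_analyzer": {
                "type": "custom",
                "char_filter": ["with_slashes_char_filter"],
                "tokenizer": "whitespace",
            }
        },
    }
}


def analyse_with_slashes(text: str) -> list[str]:
    """Toy equivalent: replace "/" with "__", then split on whitespace."""
    return text.replace("/", "__").split()


print(analyse_with_slashes("PP/CRI/A"))         # ['PP__CRI__A']
print(analyse_with_slashes("papers PP/CRI/A"))  # ['papers', 'PP__CRI__A']
```

Since the same transformation is applied to the indexed value and to the query, an archive reference only matches when the whole path matches, rather than matching on fragments such as "PP" or "A".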
Users should be able to match documents by their title and contributors (in the same query)
For example, a user might want to find Time lapse by Cassils by searching for "cassils time lapse".
We construct a search.titlesAndContributors field with data copied from the title and contributors fields, and analyse it in multiple languages. We then match against these fields, preferring matches which are analysed in English. The highest scoring of those multi_match clauses is added to the total document score. See https://github.com/wellcomecollection/catalogue-pipeline/pull/1654
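As a rough sketch of the "highest scoring clause wins" behaviour, the clauses could be wrapped in a dis_max query, which scores a document by its best-matching sub-query. This is an assumption about the shape of the query, not a copy of it: the subfield names (.english, .french, .german), the "and" operator, and the boost value are all hypothetical.

```python
def titles_and_contributors_query(search_terms: str) -> dict:
    """Match title and contributor terms together, preferring English analysis.

    Subfield names, operator, and boost are illustrative assumptions.
    """
    base = "search.titlesAndContributors"
    return {
        "dis_max": {
            "queries": [
                {
                    # English-analysed subfield, preferred via a higher boost.
                    "multi_match": {
                        "query": search_terms,
                        "fields": [f"{base}.english"],
                        "operator": "and",
                        "boost": 2,
                    }
                },
                {
                    # Other language analysers, unboosted.
                    "multi_match": {
                        "query": search_terms,
                        "fields": [f"{base}.french", f"{base}.german"],
                        "operator": "and",
                    }
                },
            ]
        }
    }


query = titles_and_contributors_query("cassils time lapse")
```

Because the title and contributors are copied into one field, all three terms in "cassils time lapse" can be satisfied by a single field even though they come from different parts of the work.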
Users should see documents whose titles are most obviously connected to their query at the top of the list
We use a span_first query to match the first tokens in the title field. The order of those terms is taken into account by using the title.shingles subfield. These matches are heavily boosted to ensure that the most obviously matched titles appear first.
For example, if we had three works:
Human genetic information : science, law, and ethics
International journal of law and information technology
Information law : compliance for librarians and information professionals
and somebody searches for "Information law", then all other things being equal we want to prioritise the third result. This is based on user feedback documented at https://github.com/wellcomecollection/catalogue-api/issues/466
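The intended ranking can be illustrated with a toy Python check that asks whether a title begins with the query's tokens, in order. This is only a simplified stand-in for the span_first and shingles behaviour described above; the tokeniser here (lowercased alphanumeric words) is an assumption.

```python
import re


def tokens(text: str) -> list[str]:
    """Lowercase the text and keep alphanumeric words only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def starts_with_query(title: str, query: str) -> bool:
    """True if the title's first tokens equal the query tokens, in order."""
    query_tokens = tokens(query)
    return tokens(title)[: len(query_tokens)] == query_tokens


titles = [
    "Human genetic information : science, law, and ethics",
    "International journal of law and information technology",
    "Information law : compliance for librarians and information professionals",
]

# Titles that start with the query sort first; the rest keep their order.
ranked = sorted(titles, key=lambda t: starts_with_query(t, "Information law"), reverse=True)
print(ranked[0])
# Information law : compliance for librarians and information professionals
```

All three titles contain both "information" and "law", so a plain term match cannot separate them. Only the position and order of the terms distinguishes the third title.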
Users should see documents which contain exact matches to their query above documents which contain partial matches
We've introduced a section of the query which is designed to match documents which contain more exact matches to the user's query, including casing and some punctuation, multi-matched over a few key title fields. These fields are analysed with a shingle filter, meaning spans of multiple matched tokens will score even more highly. The field mappings also discard the lowercase filter and include a hyphens char filter, meaning that casing is preserved and hyphenated words are treated as a single token. This is important because we want to match "x-ray" over "x ray" when someone searches for "x-ray", or "AIDS" over "aids" when someone searches for "AIDS".
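A toy Python model of this exact-match behaviour, assuming a simple whitespace tokeniser that keeps casing and hyphens and two-token shingles. The real analysers handle punctuation more selectively, and the scoring function here is hypothetical, so treat this as an illustration of the idea only.

```python
def exact_tokens(text: str) -> list[str]:
    """Split on whitespace only, so casing and hyphens are preserved."""
    return text.split()


def shingles(toks: list[str], size: int = 2) -> list[str]:
    """Join each run of `size` adjacent tokens into a single term."""
    return [" ".join(toks[i : i + size]) for i in range(len(toks) - size + 1)]


def exact_score(document: str, query: str) -> int:
    """Count query tokens and query shingles that appear verbatim in the document."""
    doc_toks = exact_tokens(document)
    doc_terms = set(doc_toks) | set(shingles(doc_toks))
    query_toks = exact_tokens(query)
    query_terms = query_toks + shingles(query_toks)
    return sum(1 for term in query_terms if term in doc_terms)


print(exact_score("Early x-ray photographs", "x-ray"))          # 1
print(exact_score("Early x ray photographs", "x-ray"))          # 0
print(exact_score("AIDS awareness posters", "AIDS"))            # 1
print(exact_score("aids to navigation", "AIDS"))                # 0
print(exact_score("AIDS awareness posters", "AIDS awareness"))  # 3
```

The last call shows why shingles help: the two-word query matches on both single tokens and on the adjacent pair, so a multi-token exact span scores higher than the same words matched separately.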