Query structure

Last updated 10 months ago

Complete query json

Intentions

This section lists the broad intentions which we try to reflect in the structure of our query. Rank uses a list of search term / work ID pairs which illustrate these intentions to validate the performance of each candidate query. The precise examples for works can be seen here.
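As a minimal sketch of that kind of validation: each example pairs a search term with the ID of a work that should be found, and a candidate query passes an example if that ID appears within the first few results. The `search` callable, the `top_k` cut-off, and the work IDs below are hypothetical stand-ins for illustration, not Rank's actual interface.

```python
from typing import Callable, Iterable

# Hypothetical signature: takes a search term, returns ranked work IDs.
SearchFn = Callable[[str], list[str]]


def failing_examples(
    search: SearchFn,
    examples: Iterable[tuple[str, str]],
    top_k: int = 10,
) -> list[tuple[str, str]]:
    """Return the (term, work_id) pairs whose work is missing from the top_k results."""
    return [
        (term, work_id)
        for term, work_id in examples
        if work_id not in search(term)[:top_k]
    ]


# Toy stand-in for a candidate query, with made-up work IDs.
def fake_search(term: str) -> list[str]:
    index = {"information law": ["work-c", "work-b", "work-a"]}
    return index.get(term.lower(), [])


examples = [("Information law", "work-c"), ("x-ray", "work-x")]
failures = failing_examples(fake_search, examples)
```

An empty `failures` list would mean the candidate query satisfies every example; here the second, deliberately unsatisfied example is reported.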

Users should be able to match against general work data

We include a multi_match on a list of the work's core fields:

{
  "multi_match": {
    "query": "{{query}}",
    "fields": [
      "data.contributors.agent.label^1000.0",
      "data.subjects.concepts.label^10.0",
      "data.genres.concepts.label^10.0",
      "data.production.*.label^10.0",
      "data.description",
      "data.physicalDescription",
      "data.language.label",
      "data.edition",
      "data.notes.contents",
      "data.lettering"
    ],
    "type": "cross_fields",
    "operator": "And",
    "_name": "data"
  }
}
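With "type": "cross_fields" and the And operator, the query terms do not all have to appear in a single field; each term has to appear in at least one of the listed fields. The sketch below imitates only that matching rule on a toy document. It assumes naive lowercase, whitespace-based tokenisation and ignores boosts and scoring entirely; the document contents are invented.

```python
def tokens(text: str) -> set[str]:
    """Naive analysis: lowercase and split on whitespace (an assumption)."""
    return set(text.lower().split())


def cross_fields_and(query: str, doc: dict[str, str], fields: list[str]) -> bool:
    """True if every query term occurs in at least one of the given fields."""
    combined: set[str] = set()
    for field in fields:
        combined |= tokens(doc.get(field, ""))
    return tokens(query) <= combined


# Invented document, using field names from the query above.
doc = {
    "data.contributors.agent.label": "Florence Nightingale",
    "data.description": "Letters concerning hospital reform",
    "data.language.label": "English",
}
fields = list(doc)

# Terms spread over two fields still match; a term found nowhere does not.
spread_match = cross_fields_and("nightingale letters", doc, fields)
missing_term = cross_fields_and("nightingale diaries", doc, fields)
```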

Users should be able to match documents by their identifiers

Identifiers are multi_matched and heavily boosted so that matches will always appear at the top of the list.

{
  "multi_match": {
    "query": "{{query}}",
    "fields": [
      "state.canonicalId^1000.0",
      "state.sourceIdentifier.value^1000.0",
      "data.otherIdentifiers.value^1000.0",
      "data.items.id.canonicalId^1000.0",
      "data.items.id.sourceIdentifier.value^1000.0",
      "data.items.id.otherIdentifiers.value^1000.0",
      "data.imageData.id.canonicalId^1000.0",
      "data.imageData.id.sourceIdentifier.value^1000.0",
      "data.imageData.id.otherIdentifiers.value^1000.0",
      "data.referenceNumber^1000.0"
    ],
    "type": "best_fields",
    "analyzer": "whitespace_analyzer",
    "operator": "Or",
    "_name": "identifiers"
  }
}
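Because this clause uses the Or operator, a query that mixes an identifier with other words can still match on the identifier alone. A minimal sketch of that behaviour, assuming that whitespace_analyzer does nothing more than split on whitespace and that identifier values are compared exactly; the identifier values are invented:

```python
def matches_identifier(query: str, identifiers: set[str]) -> bool:
    """Or operator: one whitespace-separated token equal to an identifier is enough."""
    return any(token in identifiers for token in query.split())


# Invented identifier values for a single work.
work_identifiers = {"a2bc3def", "b12345678", "MS.1234"}

id_only = matches_identifier("a2bc3def", work_identifiers)
id_plus_words = matches_identifier("manuscript MS.1234", work_identifiers)
no_id = matches_identifier("manuscript 1234", work_identifiers)
```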

Users should be able to match archive identifiers

We include a match on the search.relations field.

{
  "match": {
    "search.relations": {
      "query": "{{query}}",
      "_name": "relations",
      "boost": 1000,
      "operator": "AND"
    }
  }
}

The field's analyser uses the following char filter:

{
  "with_slashes_char_filter" : {
    "type" : "mapping",
    "mappings" : [
      "/=> __"
    ]
  }
}

The field is heavily boosted to ensure that matching documents appear at the top of the list of results.
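The effect of the with_slashes_char_filter mapping can be sketched in a few lines. This assumes the char filter is followed by a plain whitespace split and leaves out any further token filters the real analyser may apply.

```python
def with_slashes_char_filter(text: str) -> str:
    """Apply the "/ => __" mapping before tokenisation."""
    return text.replace("/", "__")


def analyse(text: str) -> list[str]:
    """Char filter, then split on whitespace (further token filters omitted)."""
    return with_slashes_char_filter(text).split()


path_tokens = analyse("PP/CRI/A")
mixed_tokens = analyse("papers PP/CRI/A/1")
```

Because the slashes never reach the tokeniser, a hierarchical reference such as PP/CRI/A stays a single token instead of being broken into PP, CRI and A.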

Users should be able to match documents by their title and contributors (in the same query)

{
  "dis_max": {
    "queries": [
      {
        "multi_match": {
          "query": "{{query}}",
          "fields": [
            "search.titlesAndContributors^100.0",
            "search.titlesAndContributors.english^100.0",
            "search.titlesAndContributors.shingles^100.0"
          ],
          "type": "best_fields",
          "operator": "And",
          "_name": "title and contributor exact spellings"
        }
      },
      {
        "multi_match": {
          "query": "{{query}}",
          "fields": [
            "search.titlesAndContributors.arabic",
            "search.titlesAndContributors.bengali",
            "search.titlesAndContributors.french",
            "search.titlesAndContributors.german",
            "search.titlesAndContributors.hindi",
            "search.titlesAndContributors.italian"
          ],
          "type": "best_fields",
          "operator": "And",
          "_name": "non-english titles and contributors"
        }
      }
    ]
  }
}
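A dis_max query scores a document with the highest score among its matching sub-queries, plus a tie_breaker share of the other matching scores (tie_breaker defaults to 0 and is not set in the query above). A minimal sketch of that combination rule, using invented sub-query scores:

```python
from typing import Optional


def dis_max(scores: list[Optional[float]], tie_breaker: float = 0.0) -> float:
    """Combine sub-query scores the way dis_max does; None means no match."""
    matched = [s for s in scores if s is not None]
    if not matched:
        return 0.0
    best = max(matched)
    return best + tie_breaker * (sum(matched) - best)


# Invented scores: the boosted first clause, then the non-English clause.
both_match = dis_max([240.0, 3.5])
only_non_english = dis_max([None, 3.5])
with_tie_breaker = dis_max([240.0, 3.5], tie_breaker=0.1)
```

With no tie_breaker, a document matching both clauses is not rewarded twice; only its best clause counts.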

Users should see documents whose titles are most obviously connected to their query at the top of the list

We use a span_first query to match the first tokens in the title field. The order of those terms is taken into account by using the title.shingles subfield. These matches are heavily boosted to ensure that the most obviously matched titles appear first.

For example, suppose we had three works:

1. Human genetic information : science, law, and ethics
2. International journal of law and information technology
3. Information law : compliance for librarians and information professionals

{
  "span_first": {
    "match": {
      "span_term": {
        "data.title.shingles": "{{query}}"
      }
    },
    "end": 1,
    "boost": 1000
  }
}
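The intent of this clause can be sketched for the three example titles and the query "Information law": only a title whose leading tokens, taken in order, equal the query should receive the boost. The sketch assumes a simple lowercase, punctuation-stripping analysis and a hypothetical maximum shingle size of 4; the real analyser settings are not shown on this page.

```python
import string

MAX_SHINGLE_SIZE = 4  # assumption: the real filter settings are not shown here


def analyse(text: str) -> list[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (word.strip(string.punctuation) for word in text.lower().split())
    return [word for word in stripped if word]


def leading_terms(title: str) -> set[str]:
    """The unigram and shingles that start at the first token of the title."""
    words = analyse(title)
    sizes = range(1, min(MAX_SHINGLE_SIZE, len(words)) + 1)
    return {" ".join(words[:size]) for size in sizes}


def span_first_match(query: str, title: str) -> bool:
    """True if the analysed query is one of the title's leading terms."""
    return " ".join(analyse(query)) in leading_terms(title)


titles = [
    "Human genetic information : science, law, and ethics",
    "International journal of law and information technology",
    "Information law : compliance for librarians and information professionals",
]
boosted = [title for title in titles if span_first_match("Information law", title)]
```

All three titles contain both "information" and "law", but only the third begins with them in that order, so it is the only one in `boosted`.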

Users should see documents which contain exact matches to their query above documents which contain partial matches

We've introduced a section of the query which is designed to match documents which contain more exact matches to the user's query, including casing and some punctuation. It is a multi_match over a few key title fields. These fields are analysed with a shingle filter, meaning that spans of multiple matched tokens will score even more highly. The field mappings also omit the lowercase filter and include a hyphens char filter, meaning that case is preserved and hyphenated words are treated as a single token. This is important because we want to rank x-ray above x ray when someone searches for x-ray, or AIDS above aids when someone searches for AIDS.

{
    "multi_match": {
        "query": "{{query}}",
        "fields": [
            "query.title.shingles_cased^1000.0",
            "query.alternativeTitles.shingles_cased^100.0",
            "query.partOf.title.shingles_cased^10.0"
        ],
        "type": "most_fields",
        "operator": "And",
        "_name": "shingles_cased"
    }
}
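The difference that preserved case and hyphens make can be sketched by comparing two toy analyses. This is an illustration only: the cased analysis below simply splits on whitespace, the contrasting one lowercases and splits on any non-alphanumeric character, shingles are left out, and the titles are invented.

```python
import re


def cased_tokens(text: str) -> list[str]:
    """Keep case and keep hyphenated words whole (split on whitespace only)."""
    return text.split()


def standard_like_tokens(text: str) -> list[str]:
    """Rough contrast: lowercase and split on any non-alphanumeric character."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def exact_match(query: str, title: str) -> bool:
    """And operator: every cased query token must appear in the cased title."""
    return set(cased_tokens(query)) <= set(cased_tokens(title))


def loose_match(query: str, title: str) -> bool:
    """The same rule, applied to the lowercased, hyphen-splitting analysis."""
    return set(standard_like_tokens(query)) <= set(standard_like_tokens(title))


titles = ["An x-ray atlas", "An x ray atlas", "Living with AIDS", "Teaching aids"]
exact_xray = [t for t in titles if exact_match("x-ray", t)]
loose_xray = [t for t in titles if loose_match("x-ray", t)]
exact_aids = [t for t in titles if exact_match("AIDS", t)]
loose_aids = [t for t in titles if loose_match("AIDS", t)]
```

The loose analysis cannot tell the paired titles apart, whereas the cased one matches only the title written the way the user typed it, which is what lets this clause lift exact matches above partial ones.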

For archive identifiers, the search.relations field is analysed with a with_slashes_char_filter, which allows us to capture hierarchical IDs like PP/CRI/A as a single token. Slashes are converted to __ and the query is split on whitespace. See https://github.com/wellcomecollection/catalogue-pipeline/pull/1654

For titles and contributors in the same query, a user might want to find Time lapse by Cassils by searching for "cassils time lapse". We construct a search.titlesAndContributors field with data copied from the title and contributors fields, and analyse it in multiple languages. We then match against these fields, preferring matches which are analysed in English. The highest-scoring of those multi_matches is added to the total document score.

For the three example works listed under the span_first intention, if somebody searches for "Information law" then, all other things being equal, we want to prioritise the third result. This is based on user feedback documented here:

https://github.com/wellcomecollection/catalogue-pipeline/pull/1654
https://github.com/wellcomecollection/catalogue-api/issues/466