Pipeline

Our ingest pipeline is made up of a number of steps, which are illustrated below. Each orange box is one of our applications.

We have a series of data sources (for now, catalogue data from Calm and image data from Miro, but we'll add others). Each presents data in a different format, or with a different API.

Each data source has an adapter that sits in front of it, and knows how to extract data from its APIs. The adapter copies the complete contents of the original data source into a DynamoDB table, one table per data source.
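
As a very rough sketch (not the production adapter code), the core of this step amounts to writing each source record into a per-source table. Here it is in Python with boto3; the table name and record fields are placeholders, not the real schema:

```python
# A minimal sketch of the adapter's "store" step, using Python and boto3.
# The table name and record fields are placeholders, not the real schema.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("miro-source-records")  # hypothetical table name

def store_source_record(record_id: str, version: int, payload: dict) -> None:
    """Copy one record from the source catalogue into DynamoDB."""
    table.put_item(
        Item={
            "id": record_id,     # the source system's own identifier
            "version": version,  # lets us detect stale or out-of-order updates
            "payload": payload,  # the complete, unmodified source record
        }
    )
```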

The DynamoDB tables present a consistent interface to the data. In particular, they produce an event stream of updates to the table – new records, or updates to existing records.
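
One common way to consume such a stream is a function subscribed to the table's DynamoDB Stream. The sketch below shows that pattern as an AWS Lambda handler; it is illustrative wiring, not necessarily how the pipeline is deployed, and `process_update` is a hypothetical downstream step:

```python
# A sketch of consuming the table's event stream via a DynamoDB Stream
# delivered to an AWS Lambda function. This is one common wiring, not
# necessarily the pipeline's; process_update is a hypothetical step.
def handler(event, context):
    for record in event["Records"]:
        # INSERT = a brand new record, MODIFY = an update to an existing one
        if record["eventName"] in ("INSERT", "MODIFY"):
            new_image = record["dynamodb"]["NewImage"]
            process_update(new_image)

def process_update(new_image: dict) -> None:
    # NewImage values arrive in DynamoDB's typed format, e.g. {"S": "b1234567"}
    print("Received update for", new_image["id"]["S"])
```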

Each table has a transformer that listens to the event stream, takes new records from DynamoDB, and turns them into our unified Work type. The transformer then pushes the cleaned-up records onto an SQS queue.
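
In outline, the transformer does something like the following. The Work fields and queue URL here are simplified placeholders, not the real model:

```python
# A sketch of the transformer: source record in, unified Work out, then
# onto an SQS queue. The Work fields and queue URL are illustrative only.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/transformed-works"

def transform(source_record: dict) -> dict:
    """Map a source-specific record onto the common Work model."""
    return {
        "sourceIdentifier": source_record["id"],
        "title": source_record.get("Title"),
        "description": source_record.get("Description"),
    }

def handle_record(source_record: dict) -> None:
    work = transform(source_record)
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(work))
```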

Each data source has its own identifiers. These may overlap or be inconsistent – and so we mint our own identifiers.

After an item has been transformed into our unified model, an ID minter gives each record a canonical identifier. We keep a record of IDs in a DynamoDB table so we can assign the same ID consistently, and match records between data sources. The identified records are pushed onto a second SQS queue.
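
A minimal sketch of the minting step, assuming a conditional write so the same source identifier always maps back to the same canonical ID. The table name and ID format are illustrative, not the real scheme:

```python
# A sketch of ID minting with a conditional write, so the same source
# identifier always maps back to the same canonical ID. The table name
# and ID format are illustrative, not the real scheme.
import uuid
import boto3
from botocore.exceptions import ClientError

id_table = boto3.resource("dynamodb").Table("identifiers")

def canonical_id_for(source_identifier: str) -> str:
    new_id = uuid.uuid4().hex[:8]
    try:
        # Only succeeds if no ID has been minted for this source identifier yet
        id_table.put_item(
            Item={"sourceIdentifier": source_identifier, "canonicalId": new_id},
            ConditionExpression="attribute_not_exists(sourceIdentifier)",
        )
        return new_id
    except ClientError as err:
        if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
        # An ID already exists for this record: look it up and reuse it
        item = id_table.get_item(Key={"sourceIdentifier": source_identifier})
        return item["Item"]["canonicalId"]
```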

In turn, we have an ingestor that reads items from the queue of identified records, and indexes them into Elasticsearch. This is the search index used by our API.
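
At its simplest, the ingestor boils down to indexing each identified Work into Elasticsearch. The sketch below uses the Elasticsearch Python client (8.x); the index name and document shape are assumptions for illustration:

```python
# A sketch of the ingestor using the Elasticsearch Python client (8.x).
# The index name and document shape are assumptions for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def ingest(identified_work: dict) -> None:
    """Index one identified Work so it becomes searchable through the API."""
    es.index(
        index="works",
        id=identified_work["canonicalId"],
        document=identified_work,
    )
```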