Pipeline

Our ingest pipeline is made up of a number of steps, which are illustrated below. Each orange box is one of our applications.

We have a series of data sources (for now, catalogue data from Calm and image data from Miro, but we'll add others). Each presents data in a different format, or with a different API.

Each data source has an adapter that sits in front of it, and knows how to extract data from its APIs. The adapter copies the complete contents of the original data source into a DynamoDB table, one table per data source.
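
As a very rough sketch (not the production adapter code), the core of this step amounts to writing each source record into a per-source table. Here it is in Python with boto3; the table name and record fields are placeholders, not the real schema:

```python
# A minimal sketch of the adapter's "store" step, using Python and boto3.
# The table name and record fields are placeholders, not the real schema.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("miro-source-records")  # hypothetical table name

def store_source_record(record_id: str, version: int, payload: dict) -> None:
    """Copy one record from the source catalogue into DynamoDB."""
    table.put_item(
        Item={
            "id": record_id,     # the source system's own identifier
            "version": version,  # lets us detect stale or out-of-order updates
            "payload": payload,  # the complete, unmodified source record
        }
    )
```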

The DynamoDB tables present a consistent interface to the data. In particular, they produce an event stream of updates to the table – new records, or updates to existing records.
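
One common way to consume such a stream is a function subscribed to the table's DynamoDB Stream. The sketch below shows that pattern as an AWS Lambda handler; it is illustrative wiring, not necessarily how the pipeline is deployed, and `process_update` is a hypothetical downstream step:

```python
# A sketch of consuming the table's event stream via a DynamoDB Stream
# delivered to an AWS Lambda function. This is one common wiring, not
# necessarily the pipeline's; process_update is a hypothetical step.
def handler(event, context):
    for record in event["Records"]:
        # INSERT = a brand new record, MODIFY = an update to an existing one
        if record["eventName"] in ("INSERT", "MODIFY"):
            new_image = record["dynamodb"]["NewImage"]
            process_update(new_image)

def process_update(new_image: dict) -> None:
    # NewImage values arrive in DynamoDB's typed format, e.g. {"S": "b1234567"}
    print("Received update for", new_image["id"]["S"])
```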

Each table has a transformer that listens to the event stream, takes new records from DynamoDB, and turns them into our unified Work type. The transformer then pushes the cleaned-up records onto an SQS queue.
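
In outline, the transformer does something like the following. The Work fields and queue URL here are simplified placeholders, not the real model:

```python
# A sketch of the transformer: source record in, unified Work out, then
# onto an SQS queue. The Work fields and queue URL are illustrative only.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/transformed-works"

def transform(source_record: dict) -> dict:
    """Map a source-specific record onto the common Work model."""
    return {
        "sourceIdentifier": source_record["id"],
        "title": source_record.get("Title"),
        "description": source_record.get("Description"),
    }

def handle_record(source_record: dict) -> None:
    work = transform(source_record)
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(work))
```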

Each data source has its own identifiers. These may overlap or be inconsistent – and so we mint our own identifiers.

After an item has been transformed into our unified model, an ID minter gives each record a canonical identifier. We keep a record of IDs in a DynamoDB table so we can assign the same ID consistently, and match records between data sources. The identified records are pushed onto a second SQS queue.
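
A minimal sketch of the minting step, assuming a conditional write so the same source identifier always maps back to the same canonical ID. The table name and ID format are illustrative, not the real scheme:

```python
# A sketch of ID minting with a conditional write, so the same source
# identifier always maps back to the same canonical ID. The table name
# and ID format are illustrative, not the real scheme.
import uuid
import boto3
from botocore.exceptions import ClientError

id_table = boto3.resource("dynamodb").Table("identifiers")

def canonical_id_for(source_identifier: str) -> str:
    new_id = uuid.uuid4().hex[:8]
    try:
        # Only succeeds if no ID has been minted for this source identifier yet
        id_table.put_item(
            Item={"sourceIdentifier": source_identifier, "canonicalId": new_id},
            ConditionExpression="attribute_not_exists(sourceIdentifier)",
        )
        return new_id
    except ClientError as err:
        if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
        # An ID already exists for this record: look it up and reuse it
        item = id_table.get_item(Key={"sourceIdentifier": source_identifier})
        return item["Item"]["canonicalId"]
```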

In turn, we have an ingestor that reads items from the queue of identified records, and indexes them into Elasticsearch. This is the search index used by our API.
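
At its simplest, the ingestor boils down to indexing each identified Work into Elasticsearch. The sketch below uses the Elasticsearch Python client (8.x); the index name and document shape are assumptions for illustration:

```python
# A sketch of the ingestor using the Elasticsearch Python client (8.x).
# The index name and document shape are assumptions for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def ingest(identified_work: dict) -> None:
    """Index one identified Work so it becomes searchable through the API."""
    es.index(
        index="works",
        id=identified_work["canonicalId"],
        document=identified_work,
    )
```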