RFC 005: Reporting Pipeline

This RFC proposes a reporting pipeline for the Wellcome Collection data, allowing for analytics and reporting on data from various sources.

Last modified: 2018-11-02T16:46:57+00:00

Background

The collection aggregates data from a number of sources:

Archival records
Library systems
Digital asset metadata

Data will flow from these systems into our data store.

Problem Statement

In order to make decisions about collection data, we need to run analytics and reporting on data from various sources.

Proposed Solution

We propose to add a simple reporting pipeline powered by Lambda functions feeding an Elasticsearch cluster.

Note: Kinesis Firehose is unsuitable for this purpose at the time of writing, as it cannot perform Elasticsearch document updates.

Process flow

The event stream from the SourceData "Versioned Hybrid Store" triggers the following:

A transformation lambda, which performs a custom transformation on the source data to make it suitable for ingest into Elasticsearch.
- This lambda passes a JSON object and an index identifier to SNS.
An ingestion lambda, which receives the message from SNS and PUTs the object into the specified index.

There may be multiple transformation lambdas, each providing a custom transform. There will be a single ingestion lambda, which attempts to PUT any object it receives into whichever index is specified.
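
The transformation side of this flow can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation: the function names (`transform_record`, `build_message`, `handle_record`), the fields of the source record, and the choice of fields kept are all hypothetical. The `publish` argument stands in for whatever call sends the message to SNS (with boto3 this might be a thin wrapper around `sns_client.publish(TopicArn=..., Message=...)`).

```python
import json
from typing import Any, Callable, Dict


def transform_record(source: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical custom transform: keep only the fields wanted for reporting."""
    return {"id": source["id"], "title": source.get("title", "")}


def build_message(index: str, obj: Dict[str, Any]) -> str:
    """Serialise a message carrying the target index and the object to ingest."""
    return json.dumps({"index": index, "object": obj})


def handle_record(
    source: Dict[str, Any],
    index: str,
    publish: Callable[[str], None],
) -> None:
    """Transform one source record and hand the resulting message to `publish`."""
    publish(build_message(index, transform_record(source)))
```

Passing `publish` in as an argument keeps the transform independent of SNS, so each transformation lambda can be exercised locally, for example with `handle_record(record, "my-index-1", print)`.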

Ingestion Lambda proposed message format

The ingestion lambda takes a message that specifies which index the object should be added to.

{
  "index": "my-index-1",
  "object": {
    "foo": "bar"
  }
}
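
As a sketch of how the ingestion lambda might consume this message, the following parses and validates each SNS record before handing it on. The names `parse_message` and `ingest_event` are hypothetical, and `put_document` is an assumed placeholder for the call that actually writes to Elasticsearch; how the document ID is chosen is not specified by this RFC, so it is left to that callable. The `Records[].Sns.Message` path is the shape in which Lambda receives SNS notifications.

```python
import json
from typing import Any, Callable, Dict, Tuple


def parse_message(raw: str) -> Tuple[str, Dict[str, Any]]:
    """Parse and validate one message of the form {"index": ..., "object": {...}}."""
    message = json.loads(raw)
    if not isinstance(message, dict):
        raise ValueError("message must be a JSON object")
    index = message.get("index")
    obj = message.get("object")
    if not isinstance(index, str) or not index:
        raise ValueError("message must contain a non-empty string 'index'")
    if not isinstance(obj, dict):
        raise ValueError("message must contain a JSON object under 'object'")
    return index, obj


def ingest_event(
    event: Dict[str, Any],
    put_document: Callable[[str, Dict[str, Any]], None],
) -> int:
    """Ingest every SNS record in a Lambda event; return how many were ingested."""
    count = 0
    for record in event.get("Records", []):
        index, obj = parse_message(record["Sns"]["Message"])
        put_document(index, obj)  # assumed to perform the PUT to Elasticsearch
        count += 1
    return count
```

Rejecting malformed messages before any write means a faulty transformation lambda fails loudly rather than creating unexpected indices or documents.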

Elasticsearch mappings

Strict mappings will not be provided. Instead, it is the job of each transformation to provide representative data.
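
One way to read "representative data" is that a transformation should emit each field with the same type in every document, since without strict mappings Elasticsearch infers field types from the documents it receives, and a value that does not fit the inferred type may be rejected. The sketch below illustrates that idea only; the field names `title` and `page_count` and the helper names are hypothetical.

```python
from typing import Any, Dict, Optional


def to_int(value: Any) -> Optional[int]:
    """Coerce a value to int, or return None if it cannot be read as one."""
    try:
        return int(str(value).strip())
    except ValueError:
        return None


def representative(source: Dict[str, Any]) -> Dict[str, Any]:
    """Emit every field with one consistent type, whatever the source supplied."""
    return {
        "title": str(source.get("title") or ""),
        "page_count": to_int(source.get("page_count")),
    }
```

Here a source value of `" 12 "` and a source value of `12` both become the integer `12`, and a missing or unparseable value becomes `None` (JSON `null`) rather than a string.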

