RFC 005: Reporting Pipeline
This RFC proposes a reporting pipeline for Wellcome Collection data, enabling analytics and reporting across data from various sources.
Last modified: 2018-11-02T16:46:57+00:00
Background
The collection aggregates data from a number of sources:
Archival records
Library systems
Digital asset metadata
Data will flow from these systems into our data store.
Problem Statement
To make decisions about collection data, we need to run analytics and reporting on data from various sources.
Proposed Solution
We propose adding a simple reporting pipeline, powered by AWS Lambda functions that feed an Elasticsearch cluster.

Note: Kinesis Firehose is unsuitable for this purpose at the time of writing, as it cannot perform Elasticsearch document updates.
Process flow
The event stream from the SourceData "Versioned Hybrid Store" triggers the following sequence:
A transformation lambda performs a custom transformation on the source data, making it suitable for ingest into Elasticsearch.
This lambda passes a JSON object and an index identifier to SNS.
An ingestion lambda PUTs the object it receives into the specified index.
There may be multiple transformation lambdas, each providing a custom transform. There will be a single ingestion lambda, which attempts to PUT any object it receives into whichever index is specified.
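The transformation step in the process flow above can be sketched as follows. This is a minimal illustration, not the RFC's implementation: the field names (`id`, `title`), the function names, and the injected `publish` callable are all assumptions made so the example runs without AWS. In a deployment, `publish` could wrap an SNS publish call.

```python
import json
from typing import Callable


def transform(source_record: dict) -> dict:
    """Hypothetical custom transform: keep only the fields wanted for reporting."""
    return {
        "id": source_record["id"],
        "title": source_record.get("title"),
    }


def build_message(source_record: dict, index: str) -> str:
    """Wrap the transformed object in the envelope the ingestion lambda expects."""
    return json.dumps({"index": index, "object": transform(source_record)})


def handle_record(
    source_record: dict, index: str, publish: Callable[[str], None]
) -> None:
    """Transform one source record and hand the message to a publisher.

    `publish` is injected so the sketch runs without AWS; in a deployment it
    could wrap an SNS publish call (an assumption, not part of the RFC).
    """
    publish(build_message(source_record, index))


if __name__ == "__main__":
    # Print the message instead of sending it anywhere.
    handle_record(
        {"id": "b123", "title": "Example", "internal": True},
        "my-index-1",
        print,
    )
```

Keeping the transform separate from the envelope means each additional transformation lambda only needs its own `transform`, while the message shape stays the same for the single ingestion lambda.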
Ingestion Lambda proposed message format
The ingestion lambda takes a message that specifies both the object and the index it should attempt to add that object to.
{
  "index": "my-index-1",
  "object": {
    "foo": "bar"
  }
}
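As a hedged sketch of how the ingestion lambda might consume this message, the example below parses and validates the envelope from an SNS-triggered Lambda event, where each record carries the published text under `Records[].Sns.Message`. The function names and the injected `put_document` callable are hypothetical. The actual write to Elasticsearch is deliberately left out, because the message format above does not specify a document identifier.

```python
import json
from typing import Callable, Optional, Tuple


def parse_ingest_message(raw: str) -> Tuple[str, dict]:
    """Parse the proposed message format and return (index, object)."""
    message = json.loads(raw)
    if not isinstance(message, dict):
        raise ValueError("message must be a JSON object")
    index = message.get("index")
    obj = message.get("object")
    if not isinstance(index, str) or not index:
        raise ValueError("'index' must be a non-empty string")
    if not isinstance(obj, dict):
        raise ValueError("'object' must be a JSON object")
    return index, obj


def handler(
    event: dict,
    context: object = None,
    put_document: Optional[Callable[[str, dict], None]] = None,
) -> list:
    """Handle an SNS-triggered event; return the indexes that were targeted.

    `put_document` is a hypothetical stand-in for whatever client performs
    the PUT against Elasticsearch.
    """
    targeted = []
    for record in event["Records"]:
        index, obj = parse_ingest_message(record["Sns"]["Message"])
        if put_document is not None:
            put_document(index, obj)
        targeted.append(index)
    return targeted
```

Rejecting malformed messages before any write is attempted keeps the single ingestion lambda generic: it needs no knowledge of what a given transformation produced, only that the envelope is well formed.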
Elasticsearch mappings
We do not intend to provide strict mappings. Instead, it is the job of each transformation to provide representative data.
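Without strict mappings, Elasticsearch's dynamic mapping infers each field's type from the first value it sees for that field, so a transform that emits inconsistent types can cause later documents to be rejected. One way a transform might keep its output representative is to coerce values to a fixed set of types before publishing. The sketch below assumes a hypothetical `FIELD_TYPES` table and field names; none of these come from the RFC.

```python
# Hypothetical field-to-type table for one transform; not part of the RFC.
FIELD_TYPES = {
    "title": str,
    "page_count": int,
    "height_cm": float,
}


def normalise(record: dict) -> dict:
    """Coerce known fields to a consistent type and drop everything else.

    Missing or null fields are omitted rather than emitted as null, so a
    field only appears once it has a value of the intended type.
    """
    out = {}
    for field, expected_type in FIELD_TYPES.items():
        value = record.get(field)
        if value is None:
            continue
        out[field] = expected_type(value)
    return out
```

A value that cannot be coerced (for example a non-numeric `page_count`) raises `ValueError` here, which surfaces the problem in the transformation lambda rather than at ingest time.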