RFC 041: Tracking changes to the Miro data

This RFC describes a proposal for tracking changes to the Miro data, which is used to populate the Catalogue API.

Last modified: 2021-05-19T09:17:59+01:00

Context

The Miro data was originally exported as a collection of XML files, one per letter prefix. We have split these XML files into a series of JSON files, one per image. These JSON files are stored in S3, with a pointer in DynamoDB:

MiroSourcePayload {
  id: String
  isClearedForCatalogueAPI: Boolean
  location: S3ObjectLocation
  version: Int
}

S3ObjectLocation {
  bucket: String
  key: String
}

Although the Miro data is static, we may change how we want to use it. For example:

  • A contributor may ask us to remove an image from the site

  • We may change the license on an image

  • We may make an image available that we were previously unsure about

We need a way to apply these changes and record them.

Principles

  • We should be able to override specific values in the Miro data/transformer. We should assume we will be asked to make changes on an ongoing basis -- this isn't a one-off operation.

  • We should keep a record of our changes: who made them, when, and why

  • Our changes should be separate from the Miro exports

Proposal

We extend the MiroSourcePayload model with two optional fields:

MiroSourcePayload {
  id: String
  isClearedForCatalogueAPI: Boolean
  location: S3ObjectLocation
  version: Int
  events: List[MiroUpdateEvent]?
  overrides: MiroSourceOverride?
}

The MiroUpdateEvent model will track our changes to the data:

MiroUpdateEvent {
  description: String
  message: String
  date: Datetime
  user: String
}

The description will be an automatically generated description of the change, e.g.

Change license override from "None" to "cc-by"

and the message will be a human-written explanation of why we made the change, e.g.

We realised we could make this available under a more permissive licence.

The date and user will be automatically populated.
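
As a rough illustration of how these four fields might be filled in, here is a minimal sketch. The function names `describe_license_change` and `build_update_event` are hypothetical, and the RFC does not say how the user is determined or which timezone is used; the login-name fallback and UTC timestamps below are assumptions.

```python
from datetime import datetime, timezone
import getpass


def describe_license_change(old, new):
    """Build the automatically generated description of a license override change."""
    return f'Change license override from "{old}" to "{new}"'


def build_update_event(old_license, new_license, message, user=None, now=None):
    """Return a dict with the four MiroUpdateEvent fields."""
    return {
        # Generated by the tooling, not typed by the person making the change.
        "description": describe_license_change(old_license, new_license),
        # Human-written explanation, passed in by the caller.
        "message": message,
        # Assumption: timestamps are recorded in UTC.
        "date": now or datetime.now(timezone.utc),
        # Assumption: fall back to the local login name if no user is given.
        "user": user or getpass.getuser(),
    }
```

Only the message comes from the caller; everything else is derived, which keeps the event log consistent across scripts.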

The MiroSourceOverride model will allow us to track overrides:

MiroSourceOverride {
  license: License?
}

This model can be extended to add new overrides as necessary.

When we make changes to a Miro record, we add a new MiroUpdateEvent to record the change, and we update the DynamoDB record. This gives us change tracking that preserves the integrity of the original Miro data in S3.
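
The models above could be expressed in Python along the following lines. This is only a sketch: the RFC defines the models in a language-neutral notation, so the dataclass layout, the use of a string code for `License`, and the empty-list default for `events` are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class S3ObjectLocation:
    bucket: str
    key: str


@dataclass
class MiroUpdateEvent:
    description: str  # automatically generated
    message: str      # human-written explanation
    date: datetime
    user: str


@dataclass
class MiroSourceOverride:
    # Assumption: a license is represented by its string code, e.g. "cc-by".
    license: Optional[str] = None


@dataclass
class MiroSourcePayload:
    id: str
    isClearedForCatalogueAPI: bool
    location: S3ObjectLocation
    version: int
    # Both new fields are optional, so records without any changes stay valid.
    # Assumption: a missing events list is treated as an empty list.
    events: List[MiroUpdateEvent] = field(default_factory=list)
    overrides: Optional[MiroSourceOverride] = None
```

Because the new fields have defaults, a record that has never been changed can be constructed from the original four fields alone.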

Python API

As part of this change, there will be a collection of Python functions that you can use to write scripts for modifying the Miro data.

# miro_updates.py

def make_image_available(image_id: str, message: str): ...

def suppress_image(image_id: str, message: str): ...

def set_license_override(image_id: str, license_code: str, message: str): ...

def remove_license_override(image_id: str, message: str): ...

You could use these to, for example, write a script to suppress three images:

from miro_updates import suppress_image

for image_id in ["A0000001", "A0000002", "A0000003"]:
    suppress_image(image_id, message="We were asked to take these images down; see email from John Smith on 18 May 2021")

These functions will send a message to the Miro updates topic, so the record gets re-transformed by the Miro transformer.
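
To make the flow concrete, here is a minimal sketch of what one of these helpers might do internally. It is not the real implementation: the `records` dict and `published` list are hypothetical in-memory stand-ins for the DynamoDB table and the Miro updates topic, the function name and the `user` default are invented for illustration, and the shape of the published message is an assumption.

```python
from datetime import datetime, timezone

# Hypothetical stand-ins for the DynamoDB table and the Miro updates topic.
records = {
    "A0000001": {"id": "A0000001", "isClearedForCatalogueAPI": True, "version": 1},
}
published = []


def set_license_override_sketch(image_id, license_code, message, user="example-user"):
    """Record a license override on a stored record and notify downstream."""
    record = records[image_id]
    overrides = record.setdefault("overrides", {})
    old = overrides.get("license")

    # Record who changed what, when, and why.
    record.setdefault("events", []).append({
        "description": f"Change license override from '{old}' to '{license_code}'",
        "message": message,
        "date": datetime.now(timezone.utc).isoformat(),
        "user": user,
    })

    # Apply the override and bump the version of the stored record.
    overrides["license"] = license_code
    record["version"] += 1

    # Assumption: the notification carries the id and new version, which is
    # enough for the transformer to pick up the record again.
    published.append({"id": image_id, "version": record["version"]})
    return record
```

The original Miro JSON in S3 is never touched in this sketch; only the pointer record changes.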

Worked example

Suppose we have the following Miro record:

MiroSourcePayload {
  id = "A0000001"
  isClearedForCatalogueAPI = true
  location = S3ObjectLocation { bucket = "vhs-miro", key = "A0000001.json" }
  version = 1
}

The metadata stored in S3 for this image means it is mapped to an "in-copyright" license.

We get an email from the contributor, who tells us we can release it under the CC-BY-NC license. We call the Python helper:

set_license_override(
    image_id="A0000001",
    license_code="cc-by-nc",
    message="An email from John Smith (the contributor) explained we can use CC-BY-NC"
)

The helper will add an appropriate MiroUpdateEvent and MiroSourceOverride:

 MiroSourcePayload {
   id = "A0000001"
   isClearedForCatalogueAPI = true
   location = S3ObjectLocation { bucket = "vhs-miro", key = "A0000001.json" }
+  events = [
+    MiroUpdateEvent {
+      description = "Change license override from 'None' to 'cc-by-nc'"
+      message = "An email from John Smith (the contributor) explained we can use CC-BY-NC"
+      date = 2001-01-01T01:01:01Z
+      user = "Alex Chan <chana@wellcomecloud.onmicrosoft.com>"
+    },
+  ]
+  overrides = MiroSourceOverride {
+    license = "cc-by-nc"
+  }
-  version = 1
+  version = 2
 }

Later we get another email from the contributor, saying we can now use CC-BY. We call the helper a second time:

set_license_override(
    image_id="A0000001",
    license_code="cc-by",
    message="An email from John Smith (the contributor) said we can use CC-BY"
)

And the record gets updated again:

 MiroSourcePayload {
   id = "A0000001"
   isClearedForCatalogueAPI = true
   location = S3ObjectLocation { bucket = "vhs-miro", key = "A0000001.json" }
   events = [
     MiroUpdateEvent {
       description = "Change license override from 'None' to 'cc-by-nc'"
       message = "An email from John Smith (the contributor) explained we can use CC-BY-NC"
       date = 2001-01-01T01:01:01Z
       user = "Alex Chan <chana@wellcomecloud.onmicrosoft.com>"
     },
+    MiroUpdateEvent {
+      description = "Change license override from 'cc-by-nc' to 'cc-by'"
+      message = "An email from John Smith (the contributor) said we can use CC-BY"
+      date = 2002-02-02T02:02:02Z
+      user = "Henry Wellcome <wellcomeh@wellcomecloud.onmicrosoft.com>"
+    },
   ]
   overrides = MiroSourceOverride {
-    license = "cc-by-nc"
+    license = "cc-by"
   }
-  version = 2
+  version = 3
 }
 }
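
The RFC does not spell out how the Miro transformer consumes the overrides. One plausible reading, sketched below with a hypothetical `effective_license` function, is that an override takes precedence when present and the license derived from the Miro metadata is used otherwise.

```python
def effective_license(source_license, overrides):
    """Prefer an override when one is set; otherwise use the license from the Miro metadata."""
    # Assumption: overrides is a dict like {"license": "cc-by"}, or None
    # when no overrides have ever been recorded for this image.
    if overrides and overrides.get("license"):
        return overrides["license"]
    return source_license
```

Under this reading, the record in the worked example would resolve to "in-copyright" before any changes and to "cc-by" after the second update.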