RFC 029: Work state modelling

This RFC proposes a new way of modelling works in the catalogue pipeline and API, separating the type of work from its state in the pipeline. This aims to improve composability, clarity, and ease of adding new types or states.

Last modified: 2020-09-07T14:40:36+01:00

Context

The central model to the catalogue pipeline and API is the Work model. We currently have a relatively complex type hierarchy to represent different sorts of work, as well as their state within the pipeline. For example works extending from IdentifiedBaseWork (IdentifiedWork, IdentifiedInvisibleWork, IdentifiedRedirectedWork) indicate works that have been minted by the ID minter, whereas corresponding works extending from TransformedBaseWork have not. Data that is common to these different sort of works is held in the WorkData case class.

Having these as types is generally good as it makes a compile time guarantee that some particular piece of data exists at some point in the code: we would not have the same guarantees if for example we used the same work model regardless of whether the work had an ID or not (using an Option field), which would lead to more error handling code and potentially the introduction of subtle bugs which the type system would otherwise prevent.

There are some upcoming use cases where we need to modify the work modelling, such as adding relations (parts, partOf, precededBy, succeededBy) to the model for works after denormalisation by the relation embedder. Also @jamesgorrie and @jtweed have been investigating having additional sorts of works in the type hierarchy (such as collections, series, items) similarly to how there are different sorts of concepts.

Problem

There are a number of issues with the way the modelling is currently implemented, primarily to do with composability of the types and the fact that they do not express our business logic as clearly as they could:

  1. To add a new sort of work we need to add a large amount of subtypes. For example to add a work such as a Collection we would potentially need the addition of up to 6 new types: UnidentifiedCollection, UnidentifiedInvisibleCollection etc.

  2. There is no way to add new state dependent data in a type safe manner without also adding more types of work for each case. To take the relation embedder as an example, that would require new types such as DenormalisedWork, DenormalisedCollection etc.

  3. There are already 2 type parameters on WorkData, both for the identified state of things such as concepts, and also for the identified state of images. Also, if we decided to add relation data directly to WorkData (as optional fields) rather than creating new work types we would need an additional third type parameter (to express whether the referenced works are pre or post minter). These are all tied together so ideally we should be able to specify them with a single type parameter.

  4. The type hierarchy from BaseWork downwards is quite complex and it can be hard to understand the meaning of particular types of work within the code. The naming could also do with changes in places (e.g. TransformedBaseWork does not make much sense as a name, given that the concept of a work does not exist before the transformer).

Proposed Solution

This RFC proposes having a cleaner separation of the pipeline state of a work with the sort of work it is, namely with different forms of polymorphism:

  1. Sub-typing is used for differentiating different sorts of work, with a Work being a sum type ADT consisting of the finite number of possible case classes. Note this proposed sub-typing is much simpler than the current implementation, having only a single level of depth and a single parent type.

  2. Parameterisation of the Work by a state parameter indicates what stage of the pipeline it is in, and which can hold specific data depending on stage. By constraining the parameter to one of a set of known types, we can consider the Work model as a finite state machine with the stages in our pipeline each containing a particular transition.

By separating these two things like this we are able to add new types of work or new pipeline states without worrying about how one will impact the other. The following is a sketch of how this might look:

Potential Issues

  • There may be some problems with the implementation of this with regards to the id_minter pipeline stage, which works on raw JSON data rather than Scala types. For example, in the WorkState.Denormalised case class above sourceIdentifier has been included again (it already exists in the work itself) as the id_minter will need to generate the canonicalId here for the output to be correct. Personally I think the id_minter could be implemented pretty simply on our types rather than raw JSON, which would make things a bit safer any avoid some of these idiosyncrasies, although a disadvantage of this is it would require a separate id_minter for Work, Image and any future types we wanted to attach IDs to.

  • We use circe generated type names in a few places, so code relying on this might be messed up. For example, IdentifiedWork becomes Work[Identified].

  • The suggestion above although a big improvement does not completely solve the composability issue, as with the way it has been written still may require things like InvisbleCollection etc. However, I am not sure whether indicating invisibility of a work actually requires having a separate type rather than simply a boolean flag, which would be much more composable.

Last updated