Work relationships in Sierra, part 2
This is an update to the original RFC-044: Sierra Series. What is presented here is a distillation of our current understanding of this issue, and the practicalities of implementation.
RFC-044 is a year old, and engendered a large amount of discussion, making it unwieldy to review. Much of the information in that document is still valid, and provides essential context, so it is left intact.
These RFCs concern the linking of Works to Series, and of Works to other Works coming from Sierra.
It is apparent that these are two different things, with two different manifestations in the API and website, though they do have some common features.
Summary
At the broadest level, there are three features to be added resulting from this RFC.
Linking of Works to Series
Linking of parent Works to child Works
Linking of child Works to parent Works
API change
This requires two changes to the API:
id is currently required in partOf objects; it needs to be optional.
A new field, partName, needs to be added to part and partOf objects.
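A minimal sketch of the resulting shape, written in Python purely for illustration. The class name `Relation` and the example id are hypothetical; only `id`, `title`, `type` and `partName` come from this document, and the API's actual model may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Relation:
    """Hypothetical shape of one entry in a part or partOf list."""

    title: str
    type: str = "Work"              # "Series" for parents known only by name
    id: Optional[str] = None        # optional: a Series parent has no id
    partName: Optional[str] = None  # new field: text from $g or $v, if present


# A Series parent, identified by title only (partName kept verbatim from $v).
series = Relation(
    title="Published papers (Wellcome Chemical Research Laboratories)",
    type="Series",
    partName="no. 149.",
)

# A Work parent, identified by id. The id value is made up for this example.
parent = Relation(title="A parent Work", id="abcd1234")
```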
Preliminary Work
We must ensure that the website will not attempt to render inappropriate hierarchies when we start adding new part/partOf relationships.
Staged development proposal
All parent relationships described in this document to be displayed as series links.
This includes 773 relationships with identifiers, which will eventually link between Works.
The partName ($g or $v) is ignored
Improve handling of identified Work->Work links for hierarchical display.
Improve handling of asymmetric partOf links to link directly by id.
Display the partName if desired.
This approach allows us to first deliver an improvement to the linking of as many records as possible, and then to iterate on how that linking can be improved for those records where we have more information.
It also represents a cone of uncertainty. It is clear that we want to link from Works to Series via a filtered search for the Series title. The correct presentation of links between parent and child Works is less clear, in particular how best to handle asymmetric links. The value of presenting the partName text to users is also not clear.
Relevant MARC fields
Common features
Most Precise Wins
The 773 field is not restrictive about what it can point to, whereas the 4x0 and 830 fields explicitly refer to a Series. Records with a 773 field that matches one of the series fields should ignore the 773 field and just present it as a Series link.
Where a 773 field does not match one of the Series fields, then it is treated as a partOf in its own right.
If the title of a 773 field matches one of the Series fields, but the 773 also contains a $w subfield, the correct behaviour is undecided. Before developing a solution, investigate where this happens:
Find out if it ever happens
Find out how many objects there are in each "series" where it does happen
Ask C&I what it means
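The rules above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: the function name `merge_parents` and its input shapes are hypothetical, "matches" is taken to mean exact string equality of titles, and the identifiers in the usage example are made up. The undecided case (a title match that also carries $w) is only collected, not resolved.

```python
def merge_parents(series_titles, host_items):
    """Combine Series fields (4x0/830) and 773 fields into partOf entries.

    series_titles: titles taken from 4x0/830 fields.
    host_items: (title, w_identifier_or_None) pairs taken from 773 fields.
    Returns (part_of, undecided).
    """
    part_of = [{"title": t, "type": "Series"} for t in series_titles]
    undecided = []
    known = set(series_titles)
    for title, identifier in host_items:
        if title in known:
            if identifier is not None:
                # Matches a Series field *and* has $w: behaviour not yet decided.
                undecided.append((title, identifier))
            # In either case the Series field wins and no second entry is added.
            continue
        # No matching Series field: the 773 is a partOf in its own right.
        entry = {"title": title}
        if identifier is not None:
            entry["id"] = identifier
        part_of.append(entry)
    return part_of, undecided


part_of, undecided = merge_parents(
    ["Series A"],
    [("Series A", None), ("Journal B", "(Wcat)123"), ("Series A", "(Wcat)456")],
)
```

Here `part_of` holds one entry for "Series A" and one for "Journal B", while the third 773 lands in `undecided` for investigation.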
partName property name
The subfields $g (Related parts) and $v (Volume/sequential designation) are used in the Sierra data in 773/774 and 440/490/830 fields, respectively.
Giving them a common name in the API is simpler than having two different names. With two names, a client would either have to try one and then the other, or look at one property if the parent is a Series and another if the parent is a Work or Unknown.
No Nesting of part/partOf
Whereas hierarchies from CALM data can be arbitrarily deep and broad, and benefit from nested partOf and part values, values derived from these fields are effectively flat. Although the first layer in either may be arbitrarily broad (multiple parents/children), the next layer would have maximally one value.
In addition, where $g or $v are used, the immediate parent/child as proposed in the original RFC is of less importance, being a volume number or page number.
They denote membership of a container, and optionally denote some kind of subsection within that container.
Semantically, it may be correct to define an object by nesting, but in practice, using a structure that allows arbitrarily deep and wide nesting adds unnecessary complexity to both server and client.
For each field value from Sierra creating a partOf object
There is optionally one grandparent
The terminal ancestor is the "important" one.
The same is true in a part relationship derived from these fields:
Each child optionally has one child
The terminal descendant is the "important" one.
Using nesting in this situation would mean that a client would have to iterate to find terminal ancestors/descendants. This pattern may be clear to Wellcome Collection developers, but not obvious to external consumers of the API.
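To illustrate the cost to a client, the sketch below compares reading the "important" ancestor from a nested partOf with reading it from the flat form proposed here. Both dictionaries are written for illustration; the nested shape is an assumption based on the description of the original RFC's proposal, and `terminal_ancestor` is a hypothetical client-side helper.

```python
def terminal_ancestor(part_of_entry):
    """Walk nested partOf values until reaching an entry with no parent."""
    node = part_of_entry
    while node.get("partOf"):
        node = node["partOf"][0]  # at most one grandparent is expected
    return node


SERIES = "Published papers (Wellcome Chemical Research Laboratories)"

# Nested form: the immediate parent is only a volume designation,
# so a client has to iterate to reach the container.
nested = {"title": "no. 149.", "partOf": [{"title": SERIES}]}

# Flat form: the entry is the container itself; the volume is partName.
flat = {"title": SERIES, "partName": "no. 149."}

container_from_nested = terminal_ancestor(nested)["title"]
container_from_flat = flat["title"]
```

With the flat form the client reads `title` directly; with nesting it needs a loop like `terminal_ancestor` before it can build a link.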
Linking a Work to a Series
Manifestation
These are not to be presented within a hierarchy. On the website, the name of the series should present as a link to a filtered search for other objects in the same series. The behaviour of the API will need to be updated in order to support filtering partOf by title as well as by id, as it currently does.
In the API, this should be represented in a partOf value with no id, and a type of "Series", thus:
A MARC value may have a subfield denoting a "part" of the series. This should be separated from the series title and presented in a new field, partName.
In the case of 830, 440 and 490 fields, this is in the $v subfield.
830 Published papers (Wellcome Chemical Research Laboratories) ;|vno. 149.
becomes
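As a hedged sketch of that conversion in Python: the helper names are hypothetical, the text before the first `|` is assumed to be the $a subfield, and the input is the field content after the tag and indicators. Stripping the trailing " ;" from the title is one possible choice; punctuation handling is still an open question, so partName is kept verbatim.

```python
def parse_subfields(content, default_tag="a"):
    """Split Sierra-style field content ("text|vmore") into (tag, value) pairs.

    Text before the first "|" is assumed to belong to `default_tag`.
    """
    first, *rest = content.split("|")
    pairs = [(default_tag, first)] if first.strip() else []
    pairs += [(chunk[0], chunk[1:]) for chunk in rest if chunk]
    return pairs


def series_part_of(content):
    """Build a partOf entry (no id, type "Series") from 4x0/830 field content."""
    subfields = dict(parse_subfields(content))
    entry = {"title": subfields["a"].rstrip(" ;"), "type": "Series"}
    if "v" in subfields:
        entry["partName"] = subfields["v"].strip()
    return entry


entry = series_part_of(
    "Published papers (Wellcome Chemical Research Laboratories) ;|vno. 149."
)
# entry == {
#     "title": "Published papers (Wellcome Chemical Research Laboratories)",
#     "type": "Series",
#     "partName": "no. 149.",
# }
```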
Some 773 entries do not have a $w subfield, e.g. Think of Me.
These should be treated as a Series.
In some cases, there is a 773 with a matching 830 or 4x0 property. This should result in only one partOf.
A record may be part of multiple Series, e.g. Derrida after the end of writing:
This will result in a partOf entry for each series.
As there is no object corresponding to the parent, there is no symmetrical parts list.
In Depth
A Series may be something like:
A run of books, possibly but not necessarily on a common subject.
e.g. the Usborne Touchy-Feely series, or Oxford Very Short Introductions
A grouping of things, possibly from an external source.
These are designated using the 4xx and 8xx codes, and (normally) only identified by name.
These should not be represented in the same fashion as a CALM hierarchy because:
There may be very many objects in any given series,
There may not be any inherent order, the series is just a bag of items.
However, membership of a series is useful information, as is the ability to find other objects from that series.
series title and partName
Separating the partName from the series title allows us to create an accurate filtered search. The partOf title is the exact match to use.
Why not store it all in title?
Storing the series title and partName in the same field makes this difficult.
One could expect there to be only one "Published papers (Wellcome Chemical Research Laboratories) ; no. 149", and at least 149 "Published papers (Wellcome Chemical Research Laboratories)". One would also expect there to be many Series with titles containing most of those tokens.
A search for all those tokens would return too many results (anything containing Published, or papers, or Wellcome etc.)
A search for the exact string would return only one result (i.e. this one)
A client would have to somehow know how to parse out the partName to request the right exact phrase (without the benefit of the MARC subfield markers).
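The difference can be sketched with an in-memory filter standing in for the API's filtered search. The function `in_series` and the Work titles are invented for illustration; only the series title and part numbering echo the example above.

```python
def in_series(works, series_title):
    """Return works with a partOf entry whose title equals series_title exactly."""
    return [
        w for w in works
        if any(p.get("title") == series_title for p in w.get("partOf", []))
    ]


SERIES = "Published papers (Wellcome Chemical Research Laboratories)"

# Title and partName stored separately: one exact match finds every member.
separated = [
    {"title": "Paper one", "partOf": [{"title": SERIES, "partName": "no. 148."}]},
    {"title": "Paper two", "partOf": [{"title": SERIES, "partName": "no. 149."}]},
]

# Everything stored in title: no single exact string matches both members.
combined = [
    {"title": "Paper one", "partOf": [{"title": SERIES + " ; no. 148."}]},
    {"title": "Paper two", "partOf": [{"title": SERIES + " ; no. 149."}]},
]

print(len(in_series(separated, SERIES)))                 # 2
print(len(in_series(combined, SERIES)))                  # 0
print(len(in_series(combined, SERIES + " ; no. 149.")))  # 1
```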
Links from child Works to parent Works
Manifestation
These are to be presented with a hierarchy. On the website, this should behave in the same manner as CALM-derived Collection hierarchies.
Order can be determined from the order of 774 properties in the parent object.
The value of partName should be extracted from the $g subfield, if present.
If the relationship is asymmetric, there should be no hierarchy. The child should link to the parent Work in such a way as to allow the user to find other children of that parent.
In Depth
A Host Item/Constituent Unit relationship may describe something like:
Pictures in an album or montage (e.g. Basil Hood)
Articles in a Journal (e.g. Edinburgh Medical and Surgical Journal)
A single work published in multiple volumes? (I don't have any examples of this)
773 without corresponding 774
Given a Work A that has a 773 that refers to Work B, but Work A is not listed as a 774 of Work B, there is insufficient information to attempt to render a hierarchy.
Some articles in the Edinburgh Medical and Surgical Journal refer to the journal with an id in a 773 field ($w subfield), e.g. Notice of an instance of molluscum chronicum. A search for Edinburgh Medical and Surgical Journal yields over 1000 hits, which indicates that this should behave more like a Series link (though the hits could include other matches, not just articles in the journal).
These are to be treated as a Series.
Links from parent Works to child Works
Manifestation
These are to be presented with a hierarchy. On the website, this should behave in the same manner as CALM-derived Collection hierarchies.
Whether to include the partName in the display value can be decided on implementation.
The value of partName should be extracted from the $g subfield, if present.
In the API, this will manifest in the parts list.
This extract from the Basil Hood Photograph Album
becomes
Where a 774 field has no associated id, it is to be ignored, e.g. 774 1 |tLists of snake names in Malay.|h4 p.; 34 x 21 cm
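A minimal sketch of this mapping, assuming each 774 field has already been split into (subfield tag, value) pairs. The function name `parts_from_774` is hypothetical, the raw $w value is used as the id without resolving it to a catalogue identifier, and separator punctuation in $g is left untrimmed. The two input fields are the 774 examples quoted in this document.

```python
def parts_from_774(fields):
    """Build a parts list from 774 fields given as lists of (tag, value) pairs.

    Fields without a $w identifier are skipped, as there is nothing to link to.
    The order of the input fields is preserved.
    """
    parts = []
    for subfields in fields:
        values = dict(subfields)
        if "w" not in values:
            continue
        part = {"id": values["w"], "title": values.get("t", "")}
        if "g" in values:
            part["partName"] = values["g"]
        parts.append(part)
    return parts


fields = [
    [
        ("g", "Page 6 :"),
        ("t", "Charing Cross Hospital: a portrait of house surgeons. Photograph, 1906."),
        ("w", "(Wcat)28916i"),
    ],
    [("t", "Lists of snake names in Malay."), ("h", "4 p.; 34 x 21 cm")],
]

parts = parts_from_774(fields)  # only the first field yields an entry
```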
In Depth
774 without id
Some 774 values have no id, e.g. Catalogues of Malayan plants, birds and snakes Wellcome Malay 8 on wellcomecollection.org.
This cannot be presented as any kind of list of links in a UI, because there is nothing to link to.
This example is a TEI manuscript, and the data for the API comes from TEI rather than Sierra, so the Sierra data can safely be ignored.
Unanswered Questions
Worldcat ids
Because it may be difficult to match a Worldcat id to an id in the pipeline, it may be simpler to check for symmetry by title. For example, having found a record with a 774 field containing an identifier:
Search for the document with the title from the 774.
Having found that that document contains a 773 with the title of the main document, we can then link them.
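The title-based check could look roughly like this. Everything here is an assumption for illustration: the record shape (`title`, `hostTitles` for 773 titles), the lookup dictionary `works_by_title`, and the use of exact title equality, which would fail wherever punctuation differs between the 773 and 774 copies of a title.

```python
def is_symmetric(parent, child_title, works_by_title):
    """True if the child named in a parent's 774 has a 773 naming the parent."""
    child = works_by_title.get(child_title)
    if child is None:
        return False
    return parent["title"] in child.get("hostTitles", [])


# Invented records: an album whose 774 names a photograph, and the
# photograph whose 773 names the album in return.
album = {"title": "An album", "hostTitles": []}
photo = {"title": "A photograph", "hostTitles": ["An album"]}
orphan = {"title": "Another photograph", "hostTitles": []}

works_by_title = {w["title"]: w for w in (album, photo, orphan)}

linked = is_symmetric(album, "A photograph", works_by_title)          # True
unlinked = is_symmetric(album, "Another photograph", works_by_title)  # False
```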
Other potential partName subfields
There are also subfields $p (Name of part/section of a work) and $n (Number of part/section of a work). Both of these are present on at least one example:
Madame Delait, the bearded woman of Plombières
However, in this case, the values do not look useful:
830 0 Fallaize Collection.|pname of grouping ;|nnumber
It may be worth investigating whether these subfields ever have useful information.
Punctuation between subfields
MARC data is designed to be printed out verbatim (after stripping subfield tags), so fields and subfields often contain punctuation to permit this, e.g. the colon and semicolon at the ends of subfields in these two examples.
830 Published papers (Wellcome Chemical Research Laboratories) ;|vno. 149.
774 0 |gPage 6 :|tCharing Cross Hospital: a portrait of house surgeons. Photograph, 1906.|w(Wcat)28916i
Ideally, these separators would not be presented unless the fields are output together. We will need to iterate towards the correct way to trim them.
Note that the order of subfields is defined in the data, so the full string cannot be simply reconstructed from individual fields.
It may be the case that we need to store the whole field as a separate string for presentation, as well as storing the main or title field to facilitate linking.
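One possible starting point, sketched in Python. The function names are hypothetical, and the choice to trim only a trailing semicolon or colon (leaving full stops, which may belong to abbreviations such as "no.") is an assumption to iterate on, not a settled rule. `display_string` shows the alternative of keeping the whole field as a presentation string, joined in the order the subfields appear in the data.

```python
import re

# Trailing separator: optional whitespace, then ";" or ":", at the end of a value.
_TRAILING_SEPARATOR = re.compile(r"\s*[;:]\s*$")


def trim_separator(value):
    """Remove a trailing ";" or ":" separator from a single subfield value."""
    return _TRAILING_SEPARATOR.sub("", value)


def display_string(subfields, skip=("w",)):
    """Join subfield values in their original order, omitting tags in `skip`."""
    return " ".join(v.strip() for tag, v in subfields if tag not in skip)


field_774 = [
    ("g", "Page 6 :"),
    ("t", "Charing Cross Hospital: a portrait of house surgeons. Photograph, 1906."),
    ("w", "(Wcat)28916i"),
]

part_name = trim_separator("Page 6 :")  # "Page 6"
series_title = trim_separator(
    "Published papers (Wellcome Chemical Research Laboratories) ;"
)
whole = display_string(field_774)
```

Colons inside a value, such as the one in the 774 title above, are untouched because only the end of the string is matched.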