Work relationships in Sierra
This RFC has been superseded by sierra-work-relationships and is preserved here to provide context.
There are various ways to represent relationships between Works in the library and the archive. One example is collections in archives, where Works are organized in a hierarchy with parents, children and siblings.
In Sierra, there are multiple ways to represent relationships between bib records (which become Works):
A group of bibs can form a series (for example, a series of books written by the same author)
A bib can be part of another bib (for example, a chapter is part of a book)
Conversely, a bib can contain other bibs (for example, a book can contain chapters)
There are other ways in Sierra to represent relationships, but this RFC focuses on these three. The goal is to define an approach that's flexible enough to be reused for other types of relationship.
These relationships are exposed in Encore (https://search.wellcomelibrary.org), but not on our new site (https://wellcomecollection.org/works). We need to add these relationships to the new site, so we can finish migrating users away from Encore.
If you look at an individual bib record in Encore, these links are shown as part of the bib metadata. For example, b31787691:
The "Series" values are links, and take the user to a pre-filled search for the title of the series. For example, clicking the second link "Perspectives in continental philosophy ; no. 39." takes you to this search:
This only does a free text search on the record, and sometimes returns inaccurate results. If you click the first link on b31787691, the search is empty – even though we know of at least one work in this series!
This is an obvious opportunity for improvement – we can provide a more accurate way to browse works by series link.
Users should be able to:
See whether a work is part of a series, part of another work, or contains other works
Find other works in the same series, or which are part of the same work, or contained by this work
We can reuse some of the functionality we have for modelling relations and archive trees, but the size of some series means we can't reuse the UI components.
There are five variable-length fields ("varfields") in Sierra that we'll look at for this work:
490 Series Statement. This is used to mark a bib as part of a series, for example:
The field is structured to distinguish between the title of the series (Perspectives in Continental philosophy series) and the bib's place within this series (no. 39).
This is present on ~60k bibs.
The interesting subfields are $a (series statement) and $v (volume). We will ignore the other subfields – either they're not useful for this work, they're non-standard fields, or their use is likely an error on the original Sierra record.
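As a rough illustration of reading only $a and $v, here is a minimal sketch in Python. The varfield shape (a marcTag plus a list of subfields with tag and content) and the function name series_statement are assumptions made for this example, not the pipeline's actual code; the $x subfield in the sample data is invented to show that other subfields are ignored.

```python
from typing import Optional, TypedDict


class Subfield(TypedDict):
    tag: str
    content: str


class VarField(TypedDict):
    # Assumed shape of a Sierra varfield; illustrative only.
    marcTag: str
    subfields: list[Subfield]


def series_statement(varfield: VarField) -> tuple[Optional[str], Optional[str]]:
    """Return (series title, volume) from the first non-empty $a and $v."""
    title: Optional[str] = None
    volume: Optional[str] = None
    for subfield in varfield["subfields"]:
        content = subfield["content"].strip()
        if subfield["tag"] == "a" and title is None:
            title = content or None
        elif subfield["tag"] == "v" and volume is None:
            volume = content or None
        # Every other subfield is deliberately ignored.
    return title, volume


example: VarField = {
    "marcTag": "490",
    "subfields": [
        {"tag": "a", "content": "Perspectives in Continental philosophy series"},
        {"tag": "v", "content": "no. 39"},
        {"tag": "x", "content": "hypothetical subfield, ignored"},
    ],
}
title, volume = series_statement(example)
```

The same function would apply unchanged to 440, which uses the same two subfields.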
440 Series Statement. For example:
This is legacy data that will eventually be migrated to 490, but it's currently present on 60k+ bibs, so we have to include it in this work.
The interesting subfields are $a (series statement) and $v (volume).
830 Series Added Entry-Uniform Title. This is another field for marking a bib as part of a series, for example:
This is present on ~36k bibs.
773 Host Item Entry. If this is present on a bib, it tells us about the containing bib. For example:
The bib on which this appears is part of this volume of the bulletin of the history of dentistry.
This is present on ~492k bibs.
774 Constituent Unit Entry. This is the opposite of 773, and tells us about the component parts of this bib. For example:
The bib on which these fields appear has two component parts.
This is only present on 749 bibs.
We'll continue to refine our use of these fields and subfields as we go along – it's very hard to define an exact specification upfront.
We will likely do some cleanup (e.g. deduplication across fields) once we've done the first transformation of the data, but where possible we should try to fix data at source (in Sierra) rather than writing code to deal with it.
Some series have thousands of items, e.g. "ACLS Humanities E-Book." appears in 5k+ instances of 830 Series Added Entry-Uniform Title.
"Early European Books : Printed sources to 1700" appears in 31k+ instances of 440 Series Statement, divided into individual volumes.
We should consider dismissing or removing certain values in subfield $a as unlikely to be useful for navigation, including:
(a single space) which appears in 1k+ instances of 830
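One way to dismiss such values is a small filter applied to subfield $a before a series link is created. This is a sketch only: the function name is_navigable and the IGNORED_TITLES set are hypothetical, and the set is left empty because the only unhelpful value identified so far is whitespace.

```python
# Hypothetical denylist; extend it as other unhelpful values are identified.
IGNORED_TITLES: frozenset[str] = frozenset()


def is_navigable(title: str) -> bool:
    """Return False for blank or denylisted values of subfield $a."""
    normalised = title.strip()
    return normalised != "" and normalised not in IGNORED_TITLES
```

With this in place, a value of " " would be skipped, while "ACLS Humanities E-Book." would be kept.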
Previously, all the relationships in the catalogue API have been to other Works. We're not going to create Works for series (unless the Library team have created one in Sierra).
We will use the parts and partOf properties on the Work model, which are currently used to represent archive collections.
This is how relations are modelled for the archive collection on dq3spb42:
These are rendered on the website as a collapsible hierarchy:
Some series can have thousands of entries, so this UI isn't practical – we'll need to render the information differently. We'll determine how to render them based on the type property of the entry in partOf.
For archive works, we'll use partOf: Collection and render an archive tree
For works which are part of a series (440, 490, 830), we'll use partOf: Series and display a link as part of the work metadata (similar to Encore)
For works which have constituent parts or are part of something else, we'll use partOf: Work
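The mapping from MARC tag to the type of the partOf entry could be sketched as below. The function name part_of_type and the constant names are invented for this illustration; archive works (Collection) are not derived from these Sierra fields, so they are not covered here.

```python
SERIES_TAGS = {"440", "490", "830"}  # series statements and series added entries
WORK_TAGS = {"773", "774"}           # host item and constituent unit entries


def part_of_type(marc_tag: str) -> str:
    """Return the relation type for one of the five varfields in this RFC."""
    if marc_tag in SERIES_TAGS:
        return "Series"
    if marc_tag in WORK_TAGS:
        return "Work"
    raise ValueError(f"MARC tag {marc_tag} is not covered by this RFC")
```

Raising on unknown tags, rather than returning a default, is a design choice for this sketch: it makes it obvious when a new relationship field is introduced without a decision about how to render it.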
If a series or host item entry has a volume, we'll use a nested partOf property to record the series and the volume individually (see example below).
If a series or host item entry has an identifier (such as an ISSN), we'll add it to the partOf and mint a canonical identifier for it.
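To make the nesting concrete, here is a minimal sketch of building a partOf entry. Several things are assumptions for illustration: the key names title, type and sourceIdentifier, the choice to put the volume on the outside with the series nested inside it, and giving the volume entry the same type as the series. Minting the canonical identifier is out of scope for the sketch.

```python
from typing import Any, Optional


def build_part_of(
    title: str,
    relation_type: str,
    volume: Optional[str] = None,
    identifier: Optional[str] = None,
) -> dict[str, Any]:
    """Build one partOf entry, nesting the series inside the volume if present."""
    parent: dict[str, Any] = {"title": title, "type": relation_type}
    if identifier is not None:
        # Hypothetical key; a canonical identifier would be minted elsewhere.
        parent["sourceIdentifier"] = identifier
    if volume is None:
        return parent
    # Assumed direction: the work is part of the volume, which is part of the series.
    return {"title": volume, "type": relation_type, "partOf": [parent]}


entry = build_part_of(
    "Perspectives in Continental philosophy series", "Series", volume="no. 39"
)
```

Here entry is a volume entry titled "no. 39" whose own partOf list holds the series entry.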
b10747850
We'll deduplicate because the information in 490 and 830 is the same:
We have a nested partOf because there's volume information in subfield $v.
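Deduplicating identical series statements across fields could look something like the following sketch. The normalisation rules (case-folding, collapsing whitespace, dropping trailing punctuation) and the sample statements are assumptions chosen for illustration; only entries that are the same after normalisation are merged, so genuinely different statements are left for cleanup in Sierra.

```python
from typing import Optional

SeriesEntry = tuple[str, Optional[str]]  # (series title, volume)


def normalise(value: str) -> str:
    """Case-fold, collapse whitespace, and drop trailing cataloguing punctuation."""
    return " ".join(value.casefold().split()).rstrip(" .;:,/")


def dedupe(entries: list[SeriesEntry]) -> list[SeriesEntry]:
    """Keep the first occurrence of each (title, volume) pair, preserving order."""
    seen: set[tuple[str, str]] = set()
    unique: list[SeriesEntry] = []
    for title, volume in entries:
        key = (normalise(title), normalise(volume or ""))
        if key not in seen:
            seen.add(key)
            unique.append((title, volume))
    return unique


# Hypothetical statements, as they might appear in 490 and 830 on the same bib.
statements: list[SeriesEntry] = [
    ("Perspectives in continental philosophy ;", "no. 39"),
    ("Perspectives in Continental Philosophy.", "no. 39."),
    ("Another series", None),
]
deduplicated = dedupe(statements)
```

The first two statements collapse into one entry, and the original wording of the first occurrence is kept.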
b31098058
Here we have 490 and 773 with slightly different information.
Although we could write logic to deduplicate these, it's better to do the deduplication (if we want to) in Sierra itself.
b31787
2125597 490 with ISSN + 830 and 773. There are 1037 bibs that have the same ISSN in a series statement but with a different volume subfield.
3001878 Only 830 with no 490/440 or 773
1110225 Series statement in 440 with only the title
1204561 440 with id
2301867 773 not overlapping with series and no id.
1186777 773 with an id in subfield $w which links to another Work in our library (see example below)
1172977 774 with ids (related to above 773)
3017508 774 no ids