
RFC 064: Graph data model



An update to the previous RFC on the knowledge graph, focusing on a new graph data model for concept enrichment and linking to external ontologies.

Last modified: 2024-12-05T16:31:45+00:00

Context

Following from RFC #62, which provides an overview of a past prototype and a proposal for a new graph based on data from the Collection and external ontologies, this RFC focuses on a new graph data model. While the current focus of the suggested model is the enrichment of concepts, other data and attributes are also included, as these may enable future machine learning work and visualisation. The graph data model is designed to enable enrichment of concepts while keeping original labels and sources intact. This ensures we can always tell where a new link or piece of information came from, whether that is manual tagging, a particular source ontology, or ML inference.

Overview

Here is a visual overview of the proposed data model:

Nodes

Work

Series

Image

Location and language (optional)

Some works have information available on their production location and/or language. While this may not be a priority, there is an opportunity to link these to external ontologies in the future (this may require extensive data cleaning and/or an ML approach). In particular, LoC and Wikidata have rich information available about both geographical locations and languages, which can be distinguished from other concepts via properties such as instance of. Note that in the current graph data model, these nodes are disconnected from Work nodes (but indirectly queryable via Work properties). This is to avoid the formation of supernodes (highly connected nodes for very common categories, such as the English language), which can negatively impact query performance.

Concept

Concept type    Count
Person          364483
Concept         122812
Organisation    57517
Place           11149
Agent           8828
Meeting         5153
Genre           1494
Period          1032

In some cases, these manually tagged concepts already come with an identifier from an external ontology (see table below). These sources are also added to concept nodes as a property.

Concept source    Count
label-derived     308908
lc-names          197509
lc-subjects       37343
nlm-mesh          28425
viaf              153
fihrist           130
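As a quick consistency check on the two tables above, the counts by type and by source can be totalled and compared. The sketch below uses only the figures listed in the tables; the variable names are illustrative, and it assumes that label-derived marks concepts without an identifier from an external ontology.

```python
# Counts copied from the concept type and concept source tables above.
concept_type_counts = {
    "Person": 364483,
    "Concept": 122812,
    "Organisation": 57517,
    "Place": 11149,
    "Agent": 8828,
    "Meeting": 5153,
    "Genre": 1494,
    "Period": 1032,
}
concept_source_counts = {
    "label-derived": 308908,
    "lc-names": 197509,
    "lc-subjects": 37343,
    "nlm-mesh": 28425,
    "viaf": 153,
    "fihrist": 130,
}

total_by_type = sum(concept_type_counts.values())
total_by_source = sum(concept_source_counts.values())

# Assumption: 'label-derived' means the concept has no external identifier.
label_derived_share = concept_source_counts["label-derived"] / total_by_source

print(total_by_type, total_by_source)
print(f"label-derived share: {label_derived_share:.1%}")
```

Both tables sum to 572468 concepts, and roughly 54% of them are label-derived, which gives a sense of how much linkage depends on label matching rather than existing identifiers.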

SourceConcept

SourceName

SourceLocation

Edges

Work-->Work

Work-->Concept

  • HAS_CONCEPT: Edges from works to their manually tagged concepts.

  • CONTRIBUTED_TO: Edges to works from their manually tagged contributors.

Work-->Image

  • HAS_IMAGE: Edges from works to their images, if available.

Image-->Image

  • VISUALLY_SIMILAR: Edges between similar image embeddings. This is optional, as the most similar image embeddings can alternatively be retrieved directly from the vector store as needed.

Edges from concepts to source nodes

  • HAS_SOURCE_CONCEPT: Edges between manually tagged concepts and their source ontologies, if a match can be made via a provided source ID, a label, or an ML algorithm. Information about the source of the match is stored in the matched_by edge attribute. For example, if we want to match a subset of label-derived concepts to LCSH, we can log the source of these matches under matched_by='label'. It is also worth noting that there are 770 MeSH concepts whose labels do not match the true MeSH label, and 617 MeSH IDs which are not correctly formatted with a 'D' at the start (at least some of these look like LoC IDs). A data cleaning step is therefore likely required before linking concept IDs to their source.
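The matching and cleaning steps described for HAS_SOURCE_CONCEPT could be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's implementation: the dictionary layout of a concept, the helper names, the 'identifier' value for matched_by, and the pattern "a 'D' followed by digits" for a well-formed MeSH ID are all hypothetical, and the IDs in the example are placeholders.

```python
import re
from dataclasses import dataclass

# Assumption: a well-formed MeSH ID is a 'D' followed by digits.
MESH_ID_PATTERN = re.compile(r"^D\d+$")


def is_well_formed_mesh_id(source_id: str) -> bool:
    """Return True if the ID looks like a MeSH ID under the assumed pattern."""
    return bool(MESH_ID_PATTERN.match(source_id.strip()))


def normalise(label: str) -> str:
    """Lowercase, collapse whitespace and drop trailing full stops before matching."""
    return " ".join(label.lower().split()).rstrip(".")


@dataclass(frozen=True)
class HasSourceConcept:
    concept_id: str
    source_id: str
    matched_by: str  # 'identifier' (hypothetical value) or 'label'


def link_concept(concept: dict, lcsh_by_label: dict):
    """Return a HAS_SOURCE_CONCEPT edge for one concept, or None if no match."""
    source = concept.get("source")
    source_id = concept.get("source_id")
    if source == "nlm-mesh" and source_id:
        if is_well_formed_mesh_id(source_id):
            return HasSourceConcept(concept["id"], source_id, "identifier")
        return None  # malformed ID: leave for the data cleaning step
    if source == "label-derived":
        match = lcsh_by_label.get(normalise(concept["label"]))
        if match is not None:
            return HasSourceConcept(concept["id"], match, "label")
    return None


# Placeholder data: none of these IDs are taken from the catalogue.
concepts = [
    {"id": "c1", "label": "Malaria.", "source": "label-derived"},
    {"id": "c2", "label": "Heart Diseases", "source": "nlm-mesh", "source_id": "D000001"},
    {"id": "c3", "label": "Unknown", "source": "nlm-mesh", "source_id": "n00000001"},
]
lcsh_by_label = {"malaria": "sh00000001"}

edges = [e for e in (link_concept(c, lcsh_by_label) for c in concepts) if e]
needs_cleaning = [
    c["id"]
    for c in concepts
    if c.get("source") == "nlm-mesh"
    and not is_well_formed_mesh_id(c.get("source_id", ""))
]
print(edges)
print(needs_cleaning)
```

Keeping matched_by on the edge rather than on the node means the same concept can later gain further links from other matching methods without losing track of where each link came from.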

Edges between source nodes

  • SAME_AS: Edges between nodes which refer to the same entity. These links can be matched on label (such as between label-derived concepts and LCSH or MeSH), derived by machine learning, or taken directly from the source ontologies (exactly and closely matching concepts from LoC, or Wikidata to MeSH via property P486). In each case, the source of the link can be added as the edge attribute source.

  • RELATED_TO: Edges between source nodes which are closely related, but do not refer to the same entity and are not hierarchical. This includes similar entries from MeSH to MeSH via SeeRelatedDescriptor and Related Terms from LoC to LoC.

  • HAS_PARENT: Edges between source nodes which are hierarchical. This includes MeSH tree parent terms, as well as properties P31/instance of and P279/subclass of from Wikidata. The hierarchical nature of this relationship can be understood like this: if a work is tagged with a concept that has a parent, it should, in theory, make sense to also display that work under the parent concept. For example, a work tagged with 'Cardiotoxicity' could reasonably also be tagged with its parent term 'Heart Diseases'.

  • NARROWER_THAN: Edges between related source nodes where one is broader/narrower than the other, but which may not represent a parent/child relationship in the strictest sense. This includes LoC Broader Terms and Narrower Terms as well as component terms, where the composite concept is narrower than either component (for example Malaria--Prevention, which is narrower than both Malaria and Prevention).

  • LOCATED_IN: Hierarchical relationship between locations. For example, when mapping locations from LoC to Wikidata, countries can be identified via property P17, and cities/states/counties via P131.
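The Wikidata properties named in the list above, and the "display under parent" reading of HAS_PARENT, could be sketched as below. The mapping table and function names are hypothetical, and the second hierarchy level in the example ('Cardiovascular Diseases') is added purely for illustration.

```python
# Wikidata properties mentioned above, mapped to the edge type they would produce.
WIKIDATA_PROPERTY_TO_EDGE = {
    "P31": "HAS_PARENT",   # instance of
    "P279": "HAS_PARENT",  # subclass of
    "P17": "LOCATED_IN",   # country
    "P131": "LOCATED_IN",  # cities/states/counties
    "P486": "SAME_AS",     # link from Wikidata to MeSH
}


def ancestors(node, has_parent):
    """Collect every node reachable by following HAS_PARENT edges upwards."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in has_parent.get(current, ()):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


def concepts_to_display_under(tagged_concept, has_parent):
    """The tagged concept plus all of its ancestors."""
    return {tagged_concept} | ancestors(tagged_concept, has_parent)


# Illustrative hierarchy: child -> list of parents.
has_parent = {
    "Cardiotoxicity": ["Heart Diseases"],
    "Heart Diseases": ["Cardiovascular Diseases"],
}

print(sorted(concepts_to_display_under("Cardiotoxicity", has_parent)))
```

The same traversal would not be appropriate for NARROWER_THAN edges if those are deliberately kept separate from strict parent/child relationships, which is one reason to model the two as distinct edge types.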

Future directions and other considerations

The graph data model includes a variety of information and links from the different source ontologies. This can enable various improvements to concept pages in the future, such as:

  • Filtering and aggregating works related to a single, unified concept which exists in multiple source ontologies. This extends to label-derived concepts which can be matched to these.

  • Displaying relevant information from external data sources on concept pages, such as descriptions, birth dates, and links to other data.

  • Providing onward journeys from concept pages to related, broader/narrower concepts and concepts that frequently co-occur on works.

Furthermore, a graph enables network analysis which can identify isolated concept pages, highly connected clusters of interlinked concepts, and concepts acting as bridges between such groups.
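As a rough illustration of this kind of network analysis, the sketch below finds isolated concepts and connected clusters in a small in-memory graph using only the standard library. The node labels and edges are illustrative, and the function names are hypothetical. Identifying bridging concepts would need a further step, such as articulation points or a centrality measure, which is not shown here.

```python
def build_adjacency(nodes, edges):
    """Build an undirected adjacency map from a node list and (a, b) edge pairs."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def isolated_nodes(adjacency):
    """Nodes with no edges at all."""
    return sorted(node for node, neighbours in adjacency.items() if not neighbours)


def connected_components(adjacency):
    """Group nodes into clusters that are connected to each other."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Illustrative concept graph.
nodes = ["Malaria", "Prevention", "Malaria--Prevention", "Quinine", "Folklore"]
edges = [
    ("Malaria", "Malaria--Prevention"),
    ("Prevention", "Malaria--Prevention"),
    ("Malaria", "Quinine"),
]

adjacency = build_adjacency(nodes, edges)
print(isolated_nodes(adjacency))
print(sorted(len(c) for c in connected_components(adjacency)))
```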

Appendix

Overall concept linkage

An estimate of the amount of linkage which can be achieved through a combination of (i) exact matches between label-derived concepts and LCSH, (ii) exact matches between label-derived concepts and MeSH, (iii) existing links between ontologies via SAME_AS relationships between SourceConcept nodes:

Number of concept IDs before linkage:

Number of concept IDs after linkage:
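One way to arrive at such an estimate is to treat every HAS_SOURCE_CONCEPT and SAME_AS link as merging two identifiers into one group, then count the groups that contain at least one concept ID. The sketch below does this with a small union-find structure. It is an illustration under that assumption rather than the method actually used for the figures in this appendix, and all identifiers in the example are placeholders.

```python
class UnionFind:
    """Minimal union-find (disjoint set) over hashable identifiers."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        self.parent.setdefault(item, item)
        while self.parent[item] != item:
            # Path halving: point each visited item at its grandparent.
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def count_after_linkage(concept_ids, has_source_concept, same_as):
    """Count distinct concept groups once all links have been applied."""
    groups = UnionFind()
    for concept_id in concept_ids:
        groups.find(concept_id)
    for concept_id, source_id in has_source_concept:
        groups.union(concept_id, source_id)
    for source_a, source_b in same_as:
        groups.union(source_a, source_b)
    return len({groups.find(concept_id) for concept_id in concept_ids})


# Placeholder identifiers for illustration only.
concept_ids = ["c1", "c2", "c3"]
has_source_concept = [("c1", "lcsh:x"), ("c2", "mesh:y")]  # c3 has no match
same_as = [("lcsh:x", "mesh:y")]

before = len(concept_ids)
after = count_after_linkage(concept_ids, has_source_concept, same_as)
print(before, after)
```

In this toy example, c1 and c2 collapse into one unified concept because their source concepts are linked by SAME_AS, while c3 remains on its own.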

Top 100 most common concepts

Of the top 100 most common concept labels, 94 can be matched to a vocabulary entry in at least one source via existing identifiers and the above linkage process (click on image for interactive version):

Work: Properties on Work nodes can be found in work.yaml. These are derived from information available in the works snapshot. Several properties, such as language and production location, could be linked directly to source nodes in the future. All attributes containing potentially useful text and descriptions are included, since any free text which provides useful information about the work may be relevant for future machine learning work. This includes title, description, physical description, notes, and lettering.

Series: Series nodes only have a title property (see series.yaml). These are extracted from works, and their main purpose is to represent meaningful links between different works which are part of the same series.

Image: Nodes for images which are part of works. Suggested attributes can be viewed in image.yaml, and include information such as the identifier and IIIF URL. While images could feasibly be linked to the subjects of the works they are associated with, these edges are not strictly needed to get this information (the suggested graph model enables traversal of edges which can provide this information via Image<-HAS_IMAGE-Work-HAS_CONCEPT->Concept).
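The traversal described above can be sketched against a small in-memory edge list. The tuples and IDs are hypothetical, and in practice this would be a query against the graph database rather than Python lists.

```python
def concepts_for_image(image_id, has_image_edges, has_concept_edges):
    """Follow Image<-HAS_IMAGE-Work-HAS_CONCEPT->Concept for one image.

    has_image_edges: iterable of (work_id, image_id) pairs
    has_concept_edges: iterable of (work_id, concept_id) pairs
    """
    # Step 1: works that own this image (reverse HAS_IMAGE).
    works = {work for work, image in has_image_edges if image == image_id}
    # Step 2: concepts those works are tagged with (HAS_CONCEPT).
    return sorted({concept for work, concept in has_concept_edges if work in works})


# Hypothetical IDs for illustration only.
has_image_edges = [("work-1", "image-a"), ("work-2", "image-b")]
has_concept_edges = [
    ("work-1", "concept-malaria"),
    ("work-1", "concept-prevention"),
    ("work-2", "concept-folklore"),
]

print(concepts_for_image("image-a", has_image_edges, has_concept_edges))
```

Because the concepts are reachable in two hops, storing direct Image-->Concept edges would duplicate information that the graph already holds.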

Concept: Concept nodes represent the concepts that works have been manually tagged with. Any concept which has a unique 8-digit identifier will be represented as a Concept node. In contrast to the previous graph model, there is no split into Person and Concept nodes at this level. This is because (i) concepts tagged as Person can also include other terms which are not names, and (ii) there are additional types other than Concept and Person (see the concept type table in the Concept section). As these can potentially all be linked to external vocabularies, it makes more sense to keep all of them as Concept nodes and add the type tag as a node property (see concept.yaml).

SourceConcept: Nodes for concepts from the following external ontologies: Library of Congress Subject Headings (LCSH), Wikidata, or Medical Subject Headings (MeSH). Aside from the source identifier and label, properties include a description, alternative identifiers, and alternative labels/synonyms (see sourceconcept.yaml). Additional properties can be added to incorporate more information from source ontologies, if needed. A full list of source properties is available for MeSH and Wikidata. It is worth noting that Wikidata in particular has a wide range of properties, including links to various other external databases (e.g. the National Portrait Gallery, OpenAlex and many more), and a decision may need to be made whether to include all of these under alternative_ids.

SourceName: Nodes for names from the Library of Congress Name Authority File (LCNAF) (excluding concepts which are an instance of MADS/RDF Geographic, see SourceLocation) and corresponding Wikidata concepts. MeSH is not included here as it does not include any names. Additional properties which specifically provide relevant context for names are included on SourceName nodes (see sourcename.yaml): date of birth, date of death, and place of birth. Note that, for simplicity, there is currently no further split into different name categories such as organisation or meeting names (so the birth and death properties will be empty for these). However, it should be relatively straightforward to split these further, if necessary, as they are identifiable via source vocabulary subdivisions and/or parent terms. For instance, there may be a need to incorporate specific information for organisations, such as their location or founder.

SourceLocation: Nodes for concepts from LCSH or LCNAF which are an instance of MADS/RDF Geographic, and for MeSH Geographicals (Z tree codes). Location-specific properties, such as coordinates, are included on SourceLocation nodes (see sourcelocation.yaml).

Edges: All edges can be viewed in edges.yaml.

Work-->Work edges (PART_OF and SUCCEEDED_BY): Works can be linked to other works when a work is succeeded by, or is part of, another work. For example, rjdt9j3h links various records from the Medical Women's Federation. These relationships can be represented by SUCCEEDED_BY and PART_OF edges.

A caveat on the future directions listed above: the intention is not to make any assumptions about what should eventually be displayed on concept pages, or how, as this will require more extensive user research. Furthermore, while enrichment of concept pages via source ontologies is the current focus of the graph, it is only one of its possible use cases. For example, a graph can also facilitate visualisation of the Collection and support ML tasks via graph embeddings, as described previously in RFC #62. Having said that, it is possible to build the graph incrementally based on the above data model, starting with concepts and adding more data as needed.

Linked files and references: work.yaml, series.yaml, image.yaml, concept.yaml, sourceconcept.yaml, MeSH, Wikidata, sourcename.yaml, sourcelocation.yaml, edges.yaml, rjdt9j3h, RFC #62, 062-knowledge-graph, graph_model