
RFC 048: Concepts work plan

Status: Draft 🏗️

Last updated: 29/04/2022

Context

We are starting work on introducing "concepts" to the Wellcome digital platform: identifiable entities such as subjects, people, organisations and genres. Works can be tagged with concepts, and concepts can be linked to other concepts (for example, by predicates), forming a knowledge graph. We will start with concepts that are manually tagged in the source catalogues; in future we might also infer a work's concepts.

This opens the door to a variety of possible user outcomes. A few of these might be:

  • Dedicated, search-engine-indexable pages for (eg) subjects and people. These can possibly include curated content.

  • Searching by concept (rather than by keyword).

  • Navigating between concepts by their relationships to one another.

  • Having access to data on concepts from other sources (for example, Wikipedia) within the wc.org experience.

The initial stage of this work will be to implement a minimal version of the first of these points: we will call these pages concept pages. The pages themselves will likely not use the term "concepts" externally; instead, they will be restricted to the aforementioned subjects and people.

This RFC outlines the major technical decisions that will need to be made around data modelling, API design, and system architecture in order to serve this outcome. It does not aim to make any of these decisions.

What needs to be done

This approach presupposes the following architectural decisions:

  • There is a separate API endpoint within the catalogue for concepts

  • There is a knowledge graph of concepts that exists independently of works

  • Works will be tagged with identifiable concepts which (may) exist in the knowledge graph

1. Concepts API Design

We need to design an API that will allow us to populate concept pages. This may be one API for all concept types (i.e. subjects and people), or different APIs may serve different types. It should be designed so that curated content can be included in future. Each concept should carry enough information to disambiguate it from similarly named concepts (e.g. birth/death dates for people).
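As an illustration only (this RFC does not decide the API shape), the requirements above could be met by a single response type shared by subjects and people. The sketch below is in Python for brevity; every field name, the `type` values, the `"lc-names"` scheme label and the example identifiers are hypothetical placeholders, not a proposed contract.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class SourceIdentifier:
    scheme: str  # hypothetical scheme label, e.g. "lc-names"
    value: str   # identifier within that scheme (placeholder values below)


@dataclass
class Concept:
    id: str     # hypothetical canonical identifier
    label: str
    type: str   # "Subject" or "Person" in this sketch
    identifiers: List[SourceIdentifier] = field(default_factory=list)
    # Disambiguating details, only meaningful for people
    birth_date: Optional[str] = None
    death_date: Optional[str] = None
    # Reserved so curated content can be added later without reshaping the response
    curated_content: Optional[dict] = None


def to_response(concept: Concept) -> str:
    """Serialise a concept as the JSON body a concept page might consume."""
    return json.dumps(asdict(concept), indent=2)


# Placeholder data, not a real catalogue record
example = Concept(
    id="example-canonical-id",
    label="Example Person",
    type="Person",
    identifiers=[SourceIdentifier(scheme="lc-names", value="example-value")],
    birth_date="1900",
    death_date="1980",
)
print(to_response(example))
```

Using one type with optional fields corresponds to the "one API for all concept types" option; the alternative of separate APIs per type would split `Concept` into type-specific models.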

2. Knowledge graph population

We need to populate a knowledge graph (in which the nodes are concepts, or have an injective mapping to concepts) from arbitrary sources, e.g. LCSH, LC names, MeSH, Wikidata, Wikipedia. Initially, this will likely be a single source, with no consideration of edges.
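To make the "injective mapping" constraint concrete, here is a minimal in-memory sketch, assuming nodes and concepts are both identified by strings. The class, method and field names are hypothetical, and a real implementation would presumably sit on a proper graph or document store. Edges are modelled but left empty, matching the initial single-source, nodes-only stage.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set, Tuple


@dataclass
class KnowledgeGraph:
    node_to_concept: Dict[str, str] = field(default_factory=dict)
    concept_to_node: Dict[str, str] = field(default_factory=dict)
    # (subject node, predicate, object node); unused in the first stage
    edges: Set[Tuple[str, str, str]] = field(default_factory=set)

    def add_node(self, node_id: str, concept_id: str) -> None:
        """Add a node, refusing anything that would break injectivity."""
        if self.node_to_concept.get(node_id, concept_id) != concept_id:
            raise ValueError(f"node {node_id!r} already maps to another concept")
        if self.concept_to_node.get(concept_id, node_id) != node_id:
            raise ValueError(f"concept {concept_id!r} already has a node")
        self.node_to_concept[node_id] = concept_id
        self.concept_to_node[concept_id] = node_id


def populate(graph: KnowledgeGraph, records: Iterable[Tuple[str, str]]) -> int:
    """Load (node_id, concept_id) pairs from one source; returns the number processed."""
    count = 0
    for node_id, concept_id in records:
        graph.add_node(node_id, concept_id)
        count += 1
    return count
```

Keeping the reverse dictionary makes the injectivity check a constant-time lookup, and re-adding an identical pair is a no-op, so a source can be reloaded safely.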

3. Identified concepts on works

Where works are tagged with identifiable (in practice, Library of Congress and MeSH) concepts, we need to be able to match these up with concepts in our knowledge graph. The outcome here is that, given a work, we should be able to know its subjects/agents/etc, and given a concept, we should be able to query for the works that are tagged with it.
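The two lookups described above (work to concepts, concept to works) can be sketched as a pair of indexes built while matching source identifiers against the graph. This is an assumption-laden illustration: `ConceptIndex`, its methods, the scheme labels and all identifiers are hypothetical, and a source identifier that is not in the graph is simply left unmatched.

```python
from collections import defaultdict
from typing import Dict, Optional, Set, Tuple

# (scheme, value), e.g. ("lc-subjects", "example-1"); scheme labels are placeholders
SourceId = Tuple[str, str]


class ConceptIndex:
    def __init__(self, source_to_canonical: Dict[SourceId, str]) -> None:
        # Several source identifiers may resolve to the same canonical id
        self._source_to_canonical = dict(source_to_canonical)
        self._concepts_by_work: Dict[str, Set[str]] = defaultdict(set)
        self._works_by_concept: Dict[str, Set[str]] = defaultdict(set)

    def tag_work(self, work_id: str, source_id: SourceId) -> Optional[str]:
        """Record a tag; returns the canonical id, or None if the concept is unknown."""
        canonical = self._source_to_canonical.get(source_id)
        if canonical is None:
            return None
        self._concepts_by_work[work_id].add(canonical)
        self._works_by_concept[canonical].add(work_id)
        return canonical

    def concepts_for(self, work_id: str) -> Set[str]:
        return set(self._concepts_by_work.get(work_id, set()))

    def works_for(self, concept_id: str) -> Set[str]:
        return set(self._works_by_concept.get(concept_id, set()))


# Placeholder identifiers: two source ids resolving to one canonical concept
index = ConceptIndex({
    ("lc-subjects", "example-1"): "concept-a",
    ("mesh", "example-2"): "concept-a",
})
index.tag_work("work-1", ("lc-subjects", "example-1"))
index.tag_work("work-2", ("mesh", "example-2"))
index.tag_work("work-2", ("mesh", "not-in-graph"))  # unmatched, returns None
```

With this placeholder data, `index.works_for("concept-a")` contains both works, because the two source identifiers share a canonical identifier.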

Extra thoughts

  • Is there any level of denormalisation of concepts onto works / vice versa?

  • How will we handle multiple source identifiers mapping to individual canonical identifiers?

🚧