RFC 007: Goobi Upload

Last updated: 02 November 2018.

Problem statement

There are currently four mechanisms in use for uploading assets to Goobi workflows:

  • IA harvesting: Fully automated download from IA, coordinated by Goobi

  • FTP: Bulk upload mechanism, automatically matches to existing processes

  • Home directory: Bulk upload mechanism, requires manual matching to existing processes

  • Hot folder: Bulk upload for editorial photography, automatically creates new process

Two of these (home directories and hot folders) rely on SMB network shares, and a third (FTP) relies on an insecure, outdated technology that we don't want to run in AWS.

We want to rationalise this to the following:

  • Web upload: Built in Goobi web upload for small numbers of files

  • S3: A new bulk upload mechanism, which automatically matches or creates processes

This allows us to replace the three existing bulk upload mechanisms, one of which is semi-manual, with one that is fully automated and works regardless of network location.

Suggested solution

Web

This is already available in Goobi; no changes are required.

S3

Package format

Packages should be uploaded to S3 as zip files, one per process, with all assets and metadata in a single directory at the root of the archive. Compressing each package into a single file ensures that a package is only processed once it has been completely uploaded.
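As a sketch of the package format (the helper name `build_package` is illustrative, not part of the RFC), the root-level constraint can be enforced at the point where the zip is created:

```python
import io
import zipfile


def build_package(files: dict[str, bytes]) -> bytes:
    """Build an in-memory zip package for one process.

    All entries must sit at the root of the archive, as the
    upload format requires; nested paths are rejected.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            if "/" in name or "\\" in name:
                raise ValueError(f"entries must be at the root level: {name}")
            zf.writestr(name, data)
    return buf.getvalue()
```

The resulting bytes would then be uploaded as a single object to the appropriate prefix, so the package only becomes visible to downstream processing as one complete unit.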

S3 layout

s3://wellcomecollection-workflow-upload
├── digitised/
├── editorial/
└── failed/

Processing

Initiation

Processing should be triggered automatically by S3 event notifications.
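One way to wire this up (a sketch, not the deployed configuration; the queue ARN, region, and account ID are placeholders) is an S3 bucket notification configuration that sends object-created events for each prefix to a queue:

```json
{
  "QueueConfigurations": [
    {
      "QueueArn": "arn:aws:sqs:eu-west-1:123456789012:workflow-upload-queue",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": {
          "FilterRules": [
            {"Name": "prefix", "Value": "digitised/"},
            {"Name": "suffix", "Value": ".zip"}
          ]
        }
      }
    }
  ]
}
```

A second, analogous configuration would cover the editorial prefix; filtering on the .zip suffix avoids triggering on partial or unrelated objects.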

Digitised content

Packages placed in the digitised prefix should be automatically matched to an existing process.

Editorial photography

Packages placed in the editorial prefix should automatically create an editorial photography process.
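The prefix-based routing described above could be sketched as follows (the function name and action labels are illustrative, not part of the RFC):

```python
def route(key: str) -> str:
    """Decide how to handle an uploaded package based on its S3 prefix.

    Digitised content is matched to an existing Goobi process;
    editorial photography creates a new process.
    """
    if key.startswith("digitised/"):
        return "match-existing-process"
    if key.startswith("editorial/"):
        return "create-editorial-process"
    raise ValueError(f"unexpected upload prefix: {key}")
```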

Completion

Successfully processed packages should be deleted from the upload bucket. Failed packages should be moved to the failed prefix.
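The completion behaviour could be sketched like this, with the S3 operations abstracted as callables so the cleanup logic stands alone (names and signatures are assumptions, not the RFC's API):

```python
from typing import Callable

FAILED_PREFIX = "failed/"


def process_package(key: str,
                    handle: Callable[[str], None],
                    delete: Callable[[str], None],
                    move: Callable[[str, str], None]) -> bool:
    """Process one uploaded package, then clean up.

    On success the object is deleted from the upload bucket;
    on failure it is moved under the failed/ prefix for inspection.
    Returns True on success, False on failure.
    """
    try:
        handle(key)
    except Exception:
        # Keep the original filename so failed packages are identifiable.
        move(key, FAILED_PREFIX + key.split("/", 1)[-1])
        return False
    delete(key)
    return True
```

In a real deployment `delete` and `move` would wrap the corresponding S3 calls; keeping them as parameters makes the success/failure policy easy to test in isolation.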
