RFC 014: Born digital workflow

Last updated: 10 June 2019.

Background

As part of the move to our new storage service, we will be adopting Archivematica to manage the preservation workflow for born-digital archives. This is analogous to our use of Goobi for managing digitisation workflows.

This document describes an initial workflow that will allow us to start using Archivematica as soon as possible. This means that, for now, we will continue to perform some parts of the workflow manually, outside of Archivematica.

We will be actively contributing to the development of Archivematica, including ensuring that it better addresses our needs around appraisal and cataloguing. This will mean we are able to streamline the workflow over time, as the amount of born-digital material increases.

Proposed solution

Principles

There are three key principles underlying this initial workflow:

  • Appraisal and selection will remain manual until the functionality in Archivematica improves

  • Cataloguing will be done in Calm and file-level description will not be part of the catalogue record

  • All processing steps within Archivematica will be fully automated and require no intervention

Overview

The workflow consists of the following high-level stages:

  1. Appraisal

  2. Cataloguing

  3. Transfer

  4. Processing

  5. Storage

Each of these is described below.

Appraisal

During the appraisal stage, an archivist will determine which files are in scope for the collections and perform a policy-led sensitivity check. We intend to move towards system-assisted appraisal, using a future version of Archivematica, but for now this stage will build on existing manual processes.

| Step | Description |
| --- | --- |
| Disk level appraisal | High level appraisal to determine what storage media is in scope. |
| File level appraisal | Selection of in scope files from selected storage media. |
| Sensitivity check | Policy-led review of selected files to determine whether they contain sensitive material. |

Cataloguing

Once an initial appraisal has been completed, the files in scope for archiving need to be arranged into archive items and catalogued in Calm. Items should be created in Calm to represent each related set of files, which will be submitted and stored as a single package in the storage service.

This means that an item in Calm corresponds to a bag in the storage service, not an individual file. File-level description is fully automated and the metadata generated is stored in a METS file created by the Archivematica processing workflow.

| Step | Description |
| --- | --- |
| Find/create collection | Find, or create if necessary, the collection in Calm that the items will be part of. |
| Find/create series | Find, or create if necessary, the series and sub-series within the collection that the items will be part of. |
| Create items | For each set of related files, create an item in Calm that represents and describes them collectively in the catalogue. This should describe the files as a set, not list details of individual files. |

Transfer

Once each set of related files has been identified, they need to be assembled into a package that can be transferred into Archivematica for processing and storage. This package is a simple zip file, with the addition of a minimal set of descriptive metadata.

This metadata is used for identification only. The canonical version of the metadata, which is also the data that will be displayed to end users, remains the data in Calm. Similarly, whilst Archivematica does have support for rights information, we will not be supplying this through the workflow. Calm will be the sole location for rights information.

The package created for transfer into Archivematica should reflect the original order of the files, as this structure will be maintained in the storage service and in the file-level metadata presented to users.

The zip file should have the following structure:

PPABC_1234.zip
|-- file1.txt
|-- file2.doc
|-- file3.jpg
|-- folder1
    \-- [...] 
|-- folder2
    \-- [...] 
|-- [...]
\-- metadata
    \-- metadata.csv

Note the addition of a metadata directory, with a single file named metadata.csv, which contains item level metadata from Calm. metadata.csv should have the following structure:

filename,dc.title,dc.creator,dc.date,dc.identifier
objects,<Title>,<CreatorName>,<Date>,<RefNo>
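
As an illustration of the layout above, here is a minimal Python sketch that writes metadata/metadata.csv for one item. The function name, its parameters, and the use of Python's csv module are assumptions made for this example and are not part of the workflow; only the column names and the literal `objects` value come from the structure shown above.

```python
import csv
from pathlib import Path

# Column order copied from the metadata.csv structure above.
FIELDNAMES = ["filename", "dc.title", "dc.creator", "dc.date", "dc.identifier"]


def write_metadata_csv(package_dir, title, creator, date, ref_no):
    """Write metadata/metadata.csv inside package_dir and return its path.

    Hypothetical helper: the values would be copied by hand from the
    item record in Calm.
    """
    metadata_dir = Path(package_dir) / "metadata"
    metadata_dir.mkdir(parents=True, exist_ok=True)
    csv_path = metadata_dir / "metadata.csv"
    # newline="" lets the csv module control line endings itself.
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerow({
            "filename": "objects",  # literal value used in the structure above
            "dc.title": title,
            "dc.creator": creator,
            "dc.date": date,
            "dc.identifier": ref_no,
        })
    return csv_path
```

Using the csv module rather than string concatenation means that titles containing commas or quotes are escaped correctly.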

Once the zip files have been created, they need to be uploaded to AWS S3. As soon as the upload of a zip completes, it should be picked up by Archivematica and a processing workflow started automatically.

Access to S3 requires the installation of a client application. Until this can be made available more widely, zip files should be passed to the Digital Production team for upload to S3.
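
For whoever performs the upload, the step could look roughly like the sketch below. It is not a description of the Digital Production team's actual tooling: the function name, the bucket name in the usage comment, and the choice of the zip file name as the object key are all assumptions. The only real API relied on is the boto3 S3 client's `upload_file(Filename, Bucket, Key)` method.

```python
from pathlib import Path


def upload_package(s3_client, zip_path, bucket):
    """Upload one zip to the given bucket and return the object key used.

    s3_client is expected to behave like a boto3 S3 client. Using the
    file name as the key is an assumption made for this sketch.
    """
    zip_path = Path(zip_path)
    if zip_path.suffix.lower() != ".zip":
        raise ValueError(f"not a zip file: {zip_path}")
    key = zip_path.name
    # upload_file(Filename, Bucket, Key) is the boto3 S3 client method.
    s3_client.upload_file(str(zip_path), bucket, key)
    return key


# Example use (needs boto3 and AWS credentials; the bucket name is made up):
#   import boto3
#   upload_package(boto3.client("s3"), "PPABC_1234.zip", "example-transfer-bucket")
```

Passing the client in as an argument keeps the function easy to exercise with a stand-in object, without touching AWS.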

| Step | Description |
| --- | --- |
| Create packages | Move the set of files for each item into its own directory, preserving the original order. |
| Add package metadata | Add a metadata.csv file to each item, inside a directory called metadata. This file should include the appropriate item level metadata from Calm. |
| Upload packages | Create a separate zip file from the parent directory for each item and upload to AWS S3, via the Digital Production team. |
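
The packaging steps above could be scripted along the following lines. This is a sketch under stated assumptions, not an agreed tool: the function name and the check for metadata/metadata.csv are choices made for the example. It zips the contents of an item directory with paths relative to that directory, so the directory structure that records the original order is kept inside the zip.

```python
import zipfile
from pathlib import Path


def build_package(item_dir, zip_path):
    """Zip every file under item_dir and return the entry names written.

    Hypothetical helper. Empty directories are not written to the zip.
    """
    item_dir = Path(item_dir)
    zip_path = Path(zip_path)
    if not (item_dir / "metadata" / "metadata.csv").is_file():
        raise FileNotFoundError("expected metadata/metadata.csv in the item directory")
    # Writing the zip inside the directory being zipped would include it in itself.
    if item_dir.resolve() in zip_path.resolve().parents:
        raise ValueError("zip_path must be outside item_dir")
    names = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # sorted() only makes the archive listing reproducible.
        for path in sorted(item_dir.rglob("*")):
            if path.is_file():
                # Paths are stored relative to item_dir, so files sit at the
                # top level of the zip, as in the structure shown earlier.
                arcname = path.relative_to(item_dir).as_posix()
                zf.write(path, arcname)
                names.append(arcname)
    return names
```

For example, `build_package("PPABC_1234", "PPABC_1234.zip")` would produce a zip whose top level holds the item's files and folders alongside the metadata directory.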

Processing

Archivematica will be configured to perform all processing steps automatically. No manual intervention should be required once zip files created with the current manual appraisal process have been uploaded to S3.

We will use the default format policies and tools that come pre-configured in Archivematica, ensuring ongoing best practice for format identification, metadata extraction and file normalisation. All files are virus checked by Archivematica before processing.

| Step | Description |
| --- | --- |
| Format identification | Determine the most reliable way to identify the format of each file, as described in PRONOM. |
| Metadata extraction | Run format-specific tools to extract as much metadata as possible from within each file. |
| File normalisation | Convert files into the formats best suited for preservation and access, in addition to keeping the original file. |

To achieve a fully automated workflow, we will use a pipeline with the following processing configuration:

| Setting | Value |
| --- | --- |
| Assign UUIDs to directories | No |
| Send transfer to quarantine | No |
| Remove from quarantine after (days) | 28 |
| Generate transfer structure report | No |
| Perform file identification (Transfer) | Yes |
| Extract packages | No |
| Delete packages after extraction | No |
| Perform policy checks on originals | No |
| Examine contents | Yes |
| Create SIP(s) | Create single SIP and continue processing |
| Perform file format identification (Ingest) | No, use existing data |
| Normalize | Normalize for preservation and access |
| Approve normalization | Yes |
| Generate thumbnails | No |
| Perform policy checks on preservation derivatives | No |
| Perform policy checks on access derivatives | No |
| Bind PIDs | No |
| Document empty directories | No |
| Reminder: add metadata if desired | Continue |
| Transcribe files (OCR) | No |
| Perform file format identification command (Submission documentation & metadata) | No |
| Select compression algorithm | Gzipped tar |
| Select compression level | 5 - normal compression mode |
| Store AIP | Yes |
| Store AIP location | Wellcome AIP Storage |
| Upload DIP | Do not upload |
| Store DIP | Do not store |
| Store DIP location | Default location |
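
One way to keep a pipeline aligned with the processing configuration above is to hold the expected values as data and compare them with what is actually configured. The sketch below is purely illustrative: it covers only a subset of the settings, and the dictionary, the function name, and the idea of reading the actual values into a plain mapping are assumptions. It does not use or imitate any Archivematica API.

```python
# A subset of the settings above, copied verbatim. Illustrative only.
EXPECTED = {
    "Send transfer to quarantine": "No",
    "Create SIP(s)": "Create single SIP and continue processing",
    "Normalize": "Normalize for preservation and access",
    "Approve normalization": "Yes",
    "Store AIP": "Yes",
    "Store AIP location": "Wellcome AIP Storage",
    "Upload DIP": "Do not upload",
    "Store DIP": "Do not store",
}


def config_drift(actual, expected=EXPECTED):
    """Return {setting: (expected, actual)} for settings that differ.

    A setting missing from actual is reported with None as its value.
    """
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }
```

An empty result means every listed setting matches; anything else names the settings that differ from the configuration above.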

Storage

The final stage of the workflow involves the creation of Archival Information Packages (AIPs) and safely moving them to long-term storage. These parts of the process are handled automatically by Archivematica and the new storage service.

Once an AIP has been successfully stored in the storage service, its unique identifier should be added to the equivalent item record in Calm, so that it can be easily located using the catalogue. For now this addition needs to be a manual process, but we hope to be able to automate it in the future.

| Step | Description |
| --- | --- |
| Create SIP | Archivematica automatically creates a Submission Information Package (SIP) that describes the files that will be stored as an Archival Information Package (AIP). |
| Store AIP | Archivematica automatically creates an AIP from the SIP and ensures it is correctly stored in the storage service. |
| Link from catalogue | Add the unique identifier of the AIP to the associated item record in Calm, so that the catalogue can be used to find files in the storage service. |

Future enhancement

There are a number of future enhancements that we intend to explore to improve the workflow outlined above. These include using Archivematica for appraisal and integration with Calm.

Appraisal in Archivematica

Whilst Archivematica does already include support for appraisal, we believe it's not currently easy enough to use or functional enough to replace the current manual process.

By working to improve Archivematica, we could:

  • Transfer all files into Archivematica on acquisition, providing automated generation of submission documentation and increased safety against data loss.

  • Provide automated analysis of file formats and associated file metadata to help with appraisal tasks.

  • Use machine learning as part of this analysis, to make suggestions and increase the scope of sensitivity checking.

  • Arrange files into SIPs within Archivematica, streamlining the creation of packages for ingest.

Integration with Calm

Integration between Archivematica and Calm would allow for easier management of associated metadata, further streamlining the creation of packages for ingest.

With full integration, we could:

  • Find and create associated records in Calm during SIP creation in Archivematica.

  • Automatically include metadata in SIPs from associated Calm records.

  • Automatically update associated Calm records with AIP identifiers and file-level description, e.g. extent.

[Figure: Workflow diagram]