RFC 018: Pipeline Tracing

This RFC outlines a proposal for adding distributed tracing to the Wellcome Collection catalogue pipeline. The goal is to improve debugging and monitoring of the pipeline by tracking the flow of data through it, from the adapters right through to ingest.

Last modified: 2020-01-29T10:36:52+00:00

Overview / background

When things go wrong in the pipeline, debugging them involves a confusing and slow mix of checking DLQs and reading through the logs of several separate services in order to figure out what went wrong, for which documents, and where. This slows down development across the catalogue and makes it harder to catch bugs.

As well as more comprehensive, structured logging in the constituent services of the pipeline, it would be beneficial to track the flow of data through it, from the adapters right through to ingest.

Proposed solution

The OpenTracing project defines a common API and methodology for distributed tracing: tracing requests through multiple services whilst propagating and accumulating context throughout, enabling profiling and monitoring of the whole application.

Whilst tracing is typically used in a request/response context where traces/spans stack hierarchically, it can also be used for monitoring unidirectional data flow: ie, for a pipeline.

Even with no further tracing within a service, a top-level trace can tell us which data is in which service, and when. This would be a great starting point for knowing where to focus debugging efforts.

We currently use Elastic APM for performance monitoring of the catalogue API, and it contains an OpenTracing compatibility layer. Using this, rather than the unwrapped Elastic APM public API, means the tracing can be augmented or swapped out as we wish.
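
As a rough sketch of what this would look like (the artifact name, span name, and identifier below are illustrative assumptions, not existing pipeline code), a service could open and close spans against the standard OpenTracing interface, with no direct dependency on the Elastic APM public API:

```scala
// A minimal sketch, assuming the co.elastic.apm:apm-opentracing bridge and the
// io.opentracing API are on the classpath and the Elastic APM agent is attached.
import co.elastic.apm.opentracing.ElasticApmTracer
import io.opentracing.Tracer

object TracerSketch extends App {
  // The bridge exposes a standard OpenTracing Tracer, so this code could be
  // pointed at another OpenTracing-compatible backend later.
  val tracer: Tracer = new ElasticApmTracer()

  val span = tracer.buildSpan("transform-work").start()
  try {
    // ... process a single message here ...
    span.setTag("sourceIdentifier", "b12345678") // illustrative identifier
  } finally {
    span.finish() // reported as a transaction/span in Elastic APM
  }
}
```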

An example of a distributed trace in Elastic APM:

[Image: a distributed trace in Elastic APM]

Implementation details

  • Tracing context to be propagated in SNS/SQS messages in their attributes. maxMessageSize in big_messaging may need to be slightly decreased in order to account for the size overhead of this. (A sketch of this propagation, and of the span lifecycle below, follows this list.)

  • Spans to be opened on message receipt, mapped onto the Akka Source, and to be closed either on error or on message ACK, at the Sink.

  • Spans to be annotated with the source identifiers and (post-minter) the minted identifiers.

  • At this stage, no further tracing to be implemented within services, and auto-instrumentation + transaction "activation" to be disabled due to the multithreading problems that we know these can cause.
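
The sketch below, written against the io.opentracing API, illustrates the propagation and span lifecycle described above. The helper names, the "process-message" span name, and the use of a plain java.util.Map as the carrier are assumptions for illustration; in practice the carrier would be the SNS/SQS message attributes handled by big_messaging.

```scala
import io.opentracing.{Span, SpanContext, Tracer}
import io.opentracing.propagation.{Format, TextMapAdapter}

// Hypothetical helpers sketching how trace context could ride along in
// SNS/SQS message attributes (names are illustrative, not existing code).
object PipelineTracing {

  // On publish: copy the current span's context into a carrier map. These
  // entries would be written to the outgoing message attributes, which is
  // why maxMessageSize may need a small decrease.
  def inject(tracer: Tracer, span: Span): java.util.Map[String, String] = {
    val carrier = new java.util.HashMap[String, String]()
    tracer.inject(span.context(), Format.Builtin.TEXT_MAP, new TextMapAdapter(carrier))
    carrier
  }

  // On receipt (at the head of the Akka Source): rebuild the upstream context
  // from the message attributes and open a span that continues the same trace,
  // annotated with the source identifier. Post-minter services would also add
  // the minted identifier with another setTag call.
  def startSpan(tracer: Tracer, attributes: java.util.Map[String, String], sourceId: String): Span = {
    val parent: SpanContext =
      tracer.extract(Format.Builtin.TEXT_MAP, new TextMapAdapter(attributes))

    val builder = tracer
      .buildSpan("process-message")
      .withTag("sourceIdentifier", sourceId)

    if (parent != null) builder.asChildOf(parent).start() else builder.start()
  }

  // At the Sink: close the span when the message is ACKed, or tag it as
  // errored first if processing failed.
  def finishSpan(span: Span, failure: Option[Throwable]): Unit = {
    failure.foreach(_ => span.setTag("error", true))
    span.finish()
  }
}
```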

Potential drawbacks

There are few-to-no drawbacks to adding tracing to the pipeline. However, the lack of auto-instrumentation provided by Elastic APM for our specific stack will make adding further tracing (ie, outside the scope of this RFC) more arduous than it would be if we were using Spring/Java, for example. This is partially addressed in the following section.

Alternatives

This example is a useful reference.

There are a handful of Scala-specific tracing tools which could potentially be easier to integrate and provide better auto-instrumentation than Elastic APM: for example, Kamon, Zipkin (including akka-tracing), and Lightbend Telemetry. However, most of these are (a) paid and (b) even where they can send data to Elasticsearch, they don't work with Elastic APM, which we get for free and are already using with the catalogue API.

If, when we come to add more detailed tracing, we encounter more problems of the sort we dealt with here, we may wish to bring in one of these libraries. Hopefully, due to the aforementioned use of the OpenTracing standard, we could do this without having to rip out Elastic APM tracing.
