RFC 078: Name Reconciliation Service (NARESE)
Last modified: 2025-09-08T00:00:00Z
Executive Summary
We have many slightly different textual versions of the same person / agent name in our Works metadata (e.g. `John Smith`, `J. Smith`, `Smith, John (1870-1932)`). This fragmentation:
Lowers search relevance (relevant works are split across variants).
Makes aggregation, analytics, and future enrichment harder.
Increases manual effort for curators who need to judge if two labels refer to the same underlying individual.
Existing duplicate detection handles whole-record duplicates, not name surface form reconciliation.
Why It Matters
Better reconciliation improves: user search experience (grouping all works by the same person), data quality (cleaner, more consistent metadata), and downstream enrichment (e.g. linking to authority files, building richer entity graphs). It also lays groundwork for reconciling other entity types later (organisations, places, subjects) using a repeatable pattern.
Goals
Produce machine-generated clusters of person/agent name variants with traceable reasoning (provenance + confidence tier).
Optimise for precision first so we avoid incorrect merges that damage trust.
Be lightweight & incrementally runnable: can resume, can process new data snapshots without redoing validated work.
Provide a clear outward-facing API contract for future integration once promoted from PoC.
Non-Goals (Initial Phase)
Full canonical Person entity graph with persistent new IDs (that is a later, longer-term step).
Real-time interactive service from day one (initial PoC is batch oriented).
Reconciliation beyond Person / Agent types (other entity classes are future roadmap).
Perfect recall; we prefer to miss some hard matches early rather than introduce false merges.
Proposed Approach
For each name label we create an embedding: a numeric representation capturing semantic similarity.
We retrieve the top similar candidate labels using a fast local index (FAISS) and apply a simple similarity threshold.
A constrained Large Language Model (LLM) is given just those candidates plus concise rules and returns which ones really refer to the same person (or nothing if unsure).
We save a small JSON file per processed label containing the decision, candidates, and evidence elements (similarity scores).
Later, these JSON outputs are aggregated into clusters and exposed via an API with confidence tiers.
Option Trade-offs
| Option considered | Drawback | Trade-off vs chosen approach |
| --- | --- | --- |
| Train & tune a traditional fuzzy / rule system only | Struggles with initials, dates, transliteration nuance; brittle scaling | Slightly higher infra complexity (embeddings + LLM) is accepted |
| Cluster all names in one global pass (e.g. HDBSCAN) | Hard to tune thresholds globally; no contextual reasoning | Per-label LLM calls add cost but raise precision |
| Larger / hosted embedding model | Higher cost with marginal quality gain for short strings | Smaller model keeps latency & cost low |
Value Realisation & Success Signals
Early manual evaluation (“eyeball” review) shows high precision on the non-trivial reconciliation opportunities.
Reduction in duplicate person facets / improved aggregation in search experiments.
Stable or improving precision after iterative prompt / threshold changes (monitored with a reproducible sampled set).
Risks & Mitigations (Plain Language)
| Risk | Impact | Mitigation |
| --- | --- | --- |
| Incorrect merges (false positives) | Damaged trust; harder rollback | Precision-first rules, conservative threshold, manual sampled QA |
| Cost creep from many LLM calls | Budget pressure | Cache identical candidate sets, threshold pruning, potential lightweight heuristic pre-filter |
| Lack of ground truth delays quantitative metrics | Slower optimisation | Introduce authority file (VIAF) subset + curator-reviewed gold set |
Glossary (Quick Read)
Embedding: Numeric vector capturing similarity of text strings.
FAISS: A fast library for approximate nearest neighbour search over embeddings.
LLM: Large Language Model; here used in a tightly scoped, deterministic (temperature 0) judgement role.
Reconciliation: Deciding multiple text labels refer to the same real-world entity (person/agent).
Candidate Set: The shortlist of potentially matching names retrieved by similarity search before LLM filtering.
Technical Details
Existing deduplication focuses on record-level duplicate detection but does not reconcile variant surface forms of the same underlying Person/Agent entity at scale. We need a lightweight, incrementally runnable pipeline to propose reconciled clusters of person names with explainable provenance, leveraging recent improvements in embedding retrieval + LLM reasoning, while minimising manual curation load.
Proposal
Implement a batch-oriented NAme REconciliation SErvice (NARESE) that produces JSON reconciliation artifacts for candidate Person/Agent labels, using an enhanced vector search and LLM workflow to make reconciliation judgements (adjudication).
High-level Flow
1. Load the label dataset derived from a snapshot of the Works Denormalised Index (`dn_labels_dedup_data.csv`).
2. Filter to `type in {Person, Agent}`.
3. For each target label:
   - Embed the label with the instruction-tuned embedding model (`Qwen/Qwen3-Embedding-0.6B`) using a fixed prompt to stabilise retrieval semantics.
   - Perform vector similarity search (FAISS index `name_rec_faiss.index`, k=100).
   - Compute the dynamic cutoff: `cutoff = max(sim0, 0.8) * 0.7`; rows above the cutoff form the context set; all k results form the candidate set.
   - Construct a compact CSV (`label,idx`) and pass it to an LLM (`ChatBedrock` model `openai.gpt-oss-120b-1:0`) with a system instruction that encodes the reconciliation rules.
   - Parse the returned JSON list of reconciled indices and map them back to labels.
   - Persist the result to `data/name_rec/name_rec_<id>_<idx>.json` if not already present (idempotent incremental run behaviour).
The cutoff multiplier of 0.7 was chosen from empirical observations during experiments to maximise recall while keeping most irrelevant names out.
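A minimal sketch of the retrieval and cutoff step, assuming a flat inner-product FAISS index over L2-normalised embeddings (so scores are cosine similarities) and taking `sim0` to be the top-1 score; function and variable names are illustrative, not the PoC's actual code:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

EMBED_PROMPT = ("Instruct: Given a search query that contains a person's name, "
                "retrieve relevant passages that mention the same person.\nQuery: ")

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
index = faiss.read_index("name_rec_faiss.index")

def retrieve(label: str, k: int = 100):
    # Embed with the fixed instruction prompt; normalise so inner product == cosine.
    vec = model.encode([EMBED_PROMPT + label], normalize_embeddings=True)
    sims, ids = index.search(np.asarray(vec, dtype="float32"), k)
    sims, ids = sims[0], ids[0]
    # Dynamic cutoff anchored on the best hit (floored at 0.8), scaled by 0.7.
    cutoff = max(float(sims[0]), 0.8) * 0.7
    context = [(int(i), float(s)) for i, s in zip(ids, sims) if s >= cutoff]
    candidates = [(int(i), float(s)) for i, s in zip(ids, sims)]
    return context, candidates  # context set goes to the LLM; all k go to the artifact
```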
NARESE Architecture Diagram

Components
Orchestrator: `name_rec_run.py` (batch driver / resumable loop).
Embedding Model: SentenceTransformer wrapper (local inference, minimal latency, deterministic prompt).
Vector Store: FAISS flat index file (pre-built separately; future: pipeline to rebuild on new data ingest).
LLM Adjudicator: ChatSession abstraction over the Bedrock ChatBedrock client (temperature=0) with LangGraph memory (currently unused beyond session scoping).
Output Artifacts: per-target JSON containing:

```json
{
  "label": "<target label>",
  "idx": "<WellcomeId>_<FaissIndex>",
  "reconciled_labels": [
    { "label": "<matched label>", "idx": "<WellcomeId>_<FaissIndex>" }
  ],
  "candidates": [
    { "label": "<candidate label>", "idx": "<WellcomeId>_<FaissIndex>", "similarity": <float> }
  ]
}
```
NARESE Per Sample Sequence Diagram

Architecture Diagrams: `docs/name_rec_architecture.puml` (component & data), `docs/name_rec_sequence.puml` (per-record sequence).
Current PoC vs Target Production
| Aspect | Current PoC | Target Production |
| --- | --- | --- |
| Input Source | Static CSV extracted from the denormalised Elasticsearch works index (snapshot taken July 2025) -> `dn_labels_dedup_data.csv` | Live incremental feed (stream or periodic export) from canonical Works & Entities stores |
| Execution Mode | Manual batch script (`name_rec_run.py`) sampling ~3000 rows for evaluation | On-demand API + scheduled full refresh job |
| Storage | Per-record JSON files on local filesystem | Durable object storage + backing relational / graph store for clusters |
| Index Refresh | Manual, ad-hoc FAISS build | Automated pipeline with dimension / checksum guard & blue/green swap |
| Reconciliation Engine | Embedding + Bedrock LLM per target | Same core flow with caching, possible lightweight rules model for high-confidence auto matches |
| Monitoring | None | Metrics (latency, success rate, empty cluster ratio) + structured logs + tracing |
| Governance | Developer-only | Product + Data curation review workflow (UI) |
The current PoC validates technical feasibility (semantic retrieval + LLM filtering) using a static July 2025 snapshot. This RFC defines the production API that will sit in front of an evolved pipeline able to serve real-time reconciliation queries and batch enrichment.
Reconciliation Logic (LLM Rule Set)
System prompt enforces:
Require medium disambiguation level: full name match OR initials + supporting date.
Initial-based matches without date are low certainty and excluded unless promoted by additional context.
Ambiguity => return empty list (preference for precision over recall initially).
Output a strict JSON list of `id_idx` strings.
Verbatim Prompts are included in the Appendix below.
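Because the rule set demands a strict JSON list, the output parse can be equally strict. A minimal sketch of that validation (the retry wiring and the ChatBedrock call are omitted; this is not the PoC's exact code):

```python
import json

def parse_reconciled_indices(raw: str) -> list[str] | None:
    """Accept only a JSON list of id_idx strings; reject anything else.

    A None return signals the caller to retry the LLM call (see Mitigations below).
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, list) or not all(isinstance(x, str) for x in parsed):
        return None
    return parsed
```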
File / Data Contracts
| File | Purpose | Cadence | Notes |
| --- | --- | --- | --- |
| `dn_labels_dedup_data.csv` | Source label universe (name retains historical 'dedup' misnomer) | Input snapshot | Legacy naming from earlier Dedupe lib prep; NOT a strictly deduplicated set |
| `name_rec_faiss.index` | Embedding ANN search | Rebuilt on data refresh | Must align embedding dimension |
| `data/name_rec_sample.csv` | Quality sampling / manual audit | Each run | 3000-row deterministic sample (seed=420) |
| `data/name_rec/name_rec_*.json` | Reconciliation results | Incremental append | Skip-if-exists enables resumability |
Samples of the above CSV and JSON are found in the Appendix below.
The FAISS vector store file for 414,721 records is 1.6 GB in size.
Quality / Evaluation Strategy
Initial evaluation via:
Manual spot check of sampled outputs (precision focus).
Cluster cohesion metric: average intra-cluster cosine similarity vs a random baseline (see the sketch after this list).
False positive guard: monitor the proportion of empty reconciliation outputs (too high => overly strict threshold; too low => likely drift).
Future: introduce a ground truth subset from authority files (VIAF / LC) to compute sampled precision, recall, and F1.
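A possible implementation of the cohesion metric, assuming cluster embeddings are available as L2-normalised rows of a NumPy array; the random baseline draws an equal-sized set from the full label universe (the seed reuses the sampling convention, illustratively):

```python
import numpy as np

def mean_pairwise_cosine(vecs: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs in a cluster.

    Assumes rows are L2-normalised embeddings, so dot product == cosine.
    """
    n = len(vecs)
    if n < 2:
        return 1.0
    sims = vecs @ vecs.T                   # pairwise cosine similarities
    upper = sims[np.triu_indices(n, k=1)]  # distinct pairs only
    return float(upper.mean())

def cohesion_vs_random(cluster_vecs: np.ndarray, all_vecs: np.ndarray,
                       rng=np.random.default_rng(420)) -> float:
    """Cohesion lift of a cluster over a same-sized random draw from all labels."""
    baseline = all_vecs[rng.choice(len(all_vecs), size=len(cluster_vecs), replace=False)]
    return mean_pairwise_cosine(cluster_vecs) - mean_pairwise_cosine(baseline)
```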
Human Eyeball Evaluation Rationale
The PoC and subsequent tuning phases use an eyeball evaluation via an Evaluation App: approximately 3,000 Person/Agent rows (drawn uniformly at random, seeded for reproducibility) are reviewed before promoting material logic changes. This size is chosen to achieve an acceptable margin of error for estimating key performance proportions (e.g. the proportion of correctly reconciled non-trivial cases) under the following assumptions:
We treat the evaluation as estimating a binomial proportion (success = reconciliation judged correct & non-trivial, i.e. not an obvious full-text identical variant).
A preliminary pilot (3000-row inspection) observed roughly 1000 non-trivial reconciliation opportunities (≈36%).
We target a 95% confidence level and an absolute margin of error of ±3 percentage points for the non-trivial correctness rate after filtering obvious cases.
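Under these assumptions the standard binomial sample-size formula gives the required number of evaluated non-trivial cases as n = z² · p(1−p) / E² = 1.96² × 0.5 × 0.5 / 0.03² ≈ 1,067 (taking the conservative p = 0.5). At the observed ≈36% non-trivial rate, a 3,000-row sample yields roughly 1,080 such cases, just clearing that requirement.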
Operational policy:
Production (full run) is unsampled; sampling is only for manual QA / regression inspection.
Any significant algorithmic change (embedding model swap, similarity cutoff adjustment, LLM prompt revision, introduction of heuristic pre-filters) triggers a fresh 3,000-row eyeball pass before acceptance.
If observed precision in the non-trivial subset remains ≥ the prior accepted baseline within the target margin of error, the change is approved; otherwise iterate.
A re-sample may also be initiated if monitoring shows drift (e.g. sharp change in proportion of empty reconciliations or candidate distribution shift).
We deliberately avoid larger ad‑hoc samples to reduce evaluator/curator fatigue; future automation (Improved Evaluation/Curator App + partial ground truth) will allow reducing or eliminating manual eyeball passes.
NARESE Evaluation App Screenshot

NARESE Evaluation App - Expert Mode - Screenshot

NARESE Evaluation App Architecture Diagram

Operational Characteristics
Pure batch; no service yet. Re-runnable; safe to interrupt (idempotent file emission).
Deterministic sampling for reproducibility in reviews.
No PII retention beyond transient in-memory embedding.
Fails fast if FAISS index absent.
Bedrock dependency isolated (one model id reference); swappable.
Extensibility Roadmap
Short-term:
Add full dataset (non-sampled) execution mode.
Persist reconciliation confidence tiers (low/medium) instead of binary include/exclude.
Add optional fuzzy lexical features (edit distance, Jaro-Winkler) pre-LLM to prune trivially unrelated candidates; see the sketch after this list.
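An illustrative shape for that pre-filter, using the `rapidfuzz` library; the 0.5 floor is a placeholder, not a tuned value:

```python
from rapidfuzz.distance import JaroWinkler

def lexical_prefilter(target: str, candidates: list[tuple[str, str]],
                      floor: float = 0.5) -> list[tuple[str, str]]:
    """Drop candidates trivially unrelated to the target before the LLM call.

    candidates: (label, idx) pairs from the vector search.
    """
    return [
        (label, idx) for label, idx in candidates
        if JaroWinkler.similarity(target.lower(), label.lower()) >= floor
    ]
```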
Mid-term:
Build incremental indexer (append-only vector index update pipeline).
Introduce authority file alignment (VIAF ingest) to seed / validate clusters.
Provide REST/GraphQL endpoint for on-demand reconciliation resolution.
Long-term:
Feedback loop with curator validation UI producing reinforcement signals.
Transition reconciled groups into canonical Person entity graph (new ID space) powering search facet normalisation.
Multi-lingual expansion (add language-aware transliteration heuristics).
Production API Specification
This section specifies the external contract to be implemented once the PoC is accepted.
Guiding Principles
Deterministic, idempotent responses for identical requests (modulo data version).
Precision-first: service returns only medium+ confidence matches by default.
Transparent confidence & provenance metadata.
Backwards compatible evolution via versioned path: `/v1/`.
Endpoints
GET /v1/health
Returns `{ status: "ok", indexVersion: "2025-07-<hash>", model: "Qwen/Qwen3-Embedding-0.6B", dataDate: "2025-07-xx" }`.
GET /v1/reconcile
Query params:
`q` (required): raw name string.
`type` (optional, default `Person`): enum `[Person, Agent]`.
`limit` (optional, default 10, max 50): max reconciled labels.
`include` (optional): comma list of extra blocks `[candidates,context,debug]`.
Response (200):

```json
{
  "query": "John Smith",
  "type": "Person",
  "reconciled": [
    {"label": "Smith, John (1870-1932)", "idx": "abcd1234_567", "confidence": "medium"}
  ],
  "candidates": [
    {"label": "J. Smith", "idx": "efgh5678_1234", "similarity": 0.84}
  ],
  "contextVersion": "2025-07-<hash>",
  "latencyMs": 42
}
```
Errors:
400: missing `q` / invalid param
429: rate limit
503: index loading / degraded mode
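An illustrative client call for this endpoint (host name hypothetical):

```python
import requests

resp = requests.get(
    "https://api.example.org/v1/reconcile",  # hypothetical host
    params={"q": "John Smith", "type": "Person", "limit": 10, "include": "candidates"},
    headers={"X-API-Key": "..."},
    timeout=5,
)
resp.raise_for_status()
for match in resp.json()["reconciled"]:
    print(match["label"], match["confidence"])
```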
POST /v1/reconcile/batch
Each item in the response mirrors the single response shape plus a `position` field.
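The batch request schema is not pinned down here; one illustrative shape (field names hypothetical):

```json
{
  "items": [
    { "q": "John Smith", "type": "Person" },
    { "q": "J. Smith", "type": "Person" }
  ]
}
```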
GET /v1/cluster/{idx}
Returns the full reconciled cluster (all members & metadata) anchored at a canonical representative.
Response adds: `canonicalLabel`, `members`, `evidence` (list citing similarity scores & LLM rationale if stored).
GET /v1/meta/index
Operational metadata: `{ "dimension": 1536, "size": 1200341, "builtAt": "2025-07-29T12:00:00Z", "buildId": "abc123" }`.
Data Model (Production)
| Field | Meaning |
| --- | --- |
| `idx` | Composite identifier `{WellcomeId}_{FaissIndex}`, where `WellcomeId` is the underlying work/entity id from source data and `FaissIndex` is the integer position of its embedding in the loaded FAISS index (stable for a given index build) |
| `label` | Original surface form |
| `canonicalLabel` | Selected representative (cluster-level) |
| `confidence` | Confidence tier: `low`, `medium`, `high` |
| `similarity` | Raw vector similarity (float) |
| `version` | Index / data snapshot identifier |
| `evidence` | Optional structured rationale entries |
Confidence Derivation (Future)
Start with a rule mapping (e.g. medium if full name match OR initials+date; high if multiple orthographic & date matches; low otherwise, not returned unless explicitly requested via `include=low`). Potential addition of a logistic calibration model trained on curated pairs.
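A sketch of that initial rule mapping (the boolean inputs are assumed to be derived upstream from the evidence entries):

```python
def derive_confidence(full_name_match: bool, initials_match: bool,
                      date_match: bool, orthographic_matches: int) -> str:
    """Map match evidence to a confidence tier per the rules above."""
    if orthographic_matches >= 2 and date_match:
        return "high"
    if full_name_match or (initials_match and date_match):
        return "medium"
    return "low"  # only surfaced when explicitly requested via include=low
```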
Performance Targets (Initial SLOs)
P50 < 100ms, P95 < 350ms for `/v1/reconcile` (excluding cold start) with index resident in memory.
Availability target 99.5% (business hours) Phase 1.
Error budget tracked on 5xx + malformed 4xx ratio.
Rate Limiting & Quotas
Token bucket per API key (default 10 QPS burst 30).
Batch endpoint counts each query individually toward quota.
Versioning Strategy
Breaking changes -> new major path `/v2/`.
Additive fields allowed silently; clients advised to ignore unknown keys.
Authentication / Authorization
Phase 1: API key header `X-API-Key` validated against secrets store.
Phase 2: move to OAuth2 client credentials for internal services.
Observability
Structured log per request: requestId, latency, candidateCount, reconciledCount, emptyResult flag, indexVersion.
Metrics: histogram (latency), counter (errors by code), gauge (index size), ratio (empty / total).
Trace spans: encode, search, llm_call, assemble_response.
Caching
Memory LRU for recent queries (normalised label) keyed by `(q,type,indexVersion,include)`.
CDN-friendly `Cache-Control: public, max-age=60` for non-empty responses without debug blocks.
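A minimal in-memory sketch of that LRU, keyed as above (capacity is an arbitrary placeholder):

```python
from collections import OrderedDict

class ReconcileCache:
    """LRU cache for reconciliation responses, keyed by (q, type, indexVersion, include)."""

    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self._store: OrderedDict[tuple, dict] = OrderedDict()

    @staticmethod
    def _key(q: str, type_: str, index_version: str, include: str) -> tuple:
        return (q.strip().lower(), type_, index_version, include)  # normalised label

    def get(self, q, type_, index_version, include):
        key = self._key(q, type_, index_version, include)
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark most recently used
        return self._store[key]

    def put(self, q, type_, index_version, include, value: dict):
        key = self._key(q, type_, index_version, include)
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used
```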
Failure Modes
LLM timeout: 503 with `retryAfter`.
Index not loaded: 503 with `retryAfter`.
OpenAPI (Excerpt)
Illustrative subset (not exhaustive):
```yaml
paths:
  /v1/reconcile:
    get:
      parameters:
        - name: q
          in: query
          required: true
          schema: { type: string }
        - name: type
          in: query
          schema: { type: string, enum: [Person, Agent] }
        - name: limit
          in: query
          schema: { type: integer, minimum: 1, maximum: 50, default: 10 }
      responses:
        '200': { description: OK }
        '400': { description: Bad Request }
        '503': { description: Unavailable }
```
Migration Path from PoC
Add CLI to emit consolidated cluster file from JSON fragments.
Build index build job -> artifact store (S3) with manifest including dimension & checksum.
Wrap reconciliation logic in a service (FastAPI / Flask) with warm index load; see the sketch after this list.
Introduce persistence layer for cluster canonicalisation (optionally Postgres + JSONB for evidence).
Add rate limiting & API key gateway.
Backfill existing JSON outputs through new service for parity comparison.
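A minimal FastAPI sketch of that wrapping step, with the warm index load done in the lifespan hook; `name_rec_core`, `retrieve` and `adjudicate` are hypothetical names standing in for the PoC logic:

```python
from contextlib import asynccontextmanager

import faiss
from fastapi import FastAPI, HTTPException, Query

from name_rec_core import adjudicate, retrieve  # hypothetical module wrapping PoC steps

state: dict = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Warm load: read the FAISS index once at startup, not per request.
    state["index"] = faiss.read_index("name_rec_faiss.index")
    yield
    state.clear()

app = FastAPI(lifespan=lifespan)

@app.get("/v1/reconcile")
def reconcile(q: str = Query(...), type: str = "Person", limit: int = Query(10, le=50)):
    if "index" not in state:
        raise HTTPException(status_code=503, detail="index loading")
    context, candidates = retrieve(q, k=100)      # embedding + FAISS step
    reconciled = adjudicate(q, context)[:limit]   # LLM adjudication step
    return {"query": q, "type": type, "reconciled": reconciled}
```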
Alternatives considered
Python Dedupe library — fast but brittle; poor recall on variant & transliterated forms; requires training, with poor training tooling.
Direct clustering of all embeddings (HDBSCAN / Agglomerative) — Hard to tune globally; lacks contextual disambiguation (dates / initials) that LLM brings on small candidate sets.
Use larger hosted embedding model (e.g. Qwen3-Embedding-8B) — Higher cost, minimal marginal gain for short name strings in pilot phase.
Impact
Benefits:
Improves aggregation & relevance for person-based queries.
Establishes a pattern for hybrid ANN retrieval + LLM constrained reconciliation.
Produces auditable, human-readable artifacts for review.
Provides cross-domain opportunity - requires little prompt engineering to reconcile other types of entities (e.g. Organisations, Places and Subjects/Concepts).
Risks / Challenges:
LLM hallucination risk (mitigated by narrow prompt + deterministic temp=0).
Cutoff heuristic may under-cluster rare names; may need adaptive logic.
Cost growth if scaled to full corpus frequently (LLM invocation per label).
Absence of evaluation ground truth initially delays quantitative assurance.
Mitigations:
Tight JSON parsing (reject non-list outputs and retry with a higher temperature setting on the LLM).
Add logging of similarity distributions for dynamic tuning.
Introduce caching for unchanged candidate sets across runs.
Next steps
Add evaluation metrics script to compute basic cluster cohesion metrics over sample.
Implement a non-sampled full run mode with CLI args (e.g. `--full` / `--limit N`).
Add confidence tier annotation to the output JSON structure.
Integrate authority file (VIAF) bootstrap for a subset and compare overlap.
Build FAISS (re)indexing utility & doc (dimension/version guard).
Add automated test for the JSON schema of reconciliation outputs; see the sketch after this list.
Consider architecture decision record (ADR) referencing this RFC once accepted.
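A possible shape for that test, using the `jsonschema` package with a schema mirroring the artifact contract above:

```python
import json
from glob import glob

from jsonschema import validate

ARTIFACT_SCHEMA = {
    "type": "object",
    "required": ["label", "idx", "reconciled_labels", "candidates"],
    "properties": {
        "label": {"type": "string"},
        "idx": {"type": "string", "pattern": r"^[a-z0-9]+_\d+$"},
        "reconciled_labels": {
            "type": "array",
            "items": {"type": "object", "required": ["label", "idx"]},
        },
        "candidates": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["label", "idx", "similarity"],
                "properties": {"similarity": {"type": "number"}},
            },
        },
    },
}

def test_reconciliation_artifacts_match_schema():
    # Validate every persisted reconciliation artifact against the contract.
    for path in glob("data/name_rec/name_rec_*.json"):
        with open(path) as f:
            validate(json.load(f), ARTIFACT_SCHEMA)
```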
Appendix
Prompts
```python
embed_prompt = "Instruct: Given a search query that contains a person's name, retrieve relevant passages that mention the same person.\nQuery: "

system_instruction = """# INSTRUCTIONS
You will be given a target name of a person and a CSV string with the following format:
name,index
## Goal
Identify all names in the CSV that can be reconciled to the target name and return their indices.
## Rules
- Take into account initials and dates to disambiguate.
- If there is a match for a full name but no date, we take it that it is disambiguated to a medium degree.
- If the match depends on initials it is disambiguated to a low degree unless there is a date match which makes it a medium degree.
- If the name cannot be disambiguated to a medium degree, do not return the index.
- If the target name is too ambiguous, do not return any indices.
## Output Format
Return a JSON list of indices with the following format:
["idx1", "idx2", ...]
"""
```
Data Samples
`dn_labels_dedup_data.csv`

```csv
idx,label,id,type,source
9,"Nelson, Geoffrey B. (Geoffrey Brian)",a223f5a6,Person,contributor
10,"Wolff, G.",a2249bxm,Person,contributor
11,"Jones, John E",a224b9mp,Person,contributor
12,"Jones, John E.",a224b9mp,Person,contributor
13,"Hulverscheidt, Marion, 1970-",a224rx5x,Person,contributor
14,"Zimmermann, Wilhelm.",a224tns9,Person,contributor
```
`name_rec_sample.csv`

```csv
idx,label,id,type,source
179619,"Khoo, F. Y.",gdffqkj6,Person,contributor
141893,"Reid, T. Whitehead.",f2cn9e36,Person,contributor
639736,"Walsh, John.",zscc6auu,Person,contributor
444212,"Pulci, Luigi",stsup32k,Person,contributor
532593,"Swieten, Gottfried van",vyrg7p83,Person,contributor
394589,"Crooks, Robert, 1941-",r2wdmtrk,Person,contributor
```
`data/name_rec/name_rec_*.json`

```json
{
"label": "Talbot, Marianne.",
"idx": "zwfqyva2_643606",
"reconciled_labels": [
{
"label": "Talbot, Marianne.",
"idx": "zwfqyva2_643606"
},
{
"label": "Talbot, Marianne",
"idx": "zwfqyva2_643605"
}
],
"candidates": [
{
"label": "Talbot, Marianne.",
"idx": "zwfqyva2_643606",
"similarity": 0.8748117685317993
},
{
"label": "Talbot, Mary Anne.",
"idx": "rgz5xsey_407445",
"similarity": 0.7381430864334106
},
{
"label": "Talbot, Marianne",
"idx": "zwfqyva2_643605",
"similarity": 0.737197995185852
},
{
"label": "Talbot, Catherine",
"idx": "j9kz5ykj_231954",
"similarity": 0.6984497904777527
},
{
"label": "Talbot, T. G.",
"idx": "gh8cx4xe_183128",
"similarity": 0.691516637802124
},
{
"label": "Talbot, Catherine, 1721-1770",
"idx": "n3t6xc7s_310846",
"similarity": 0.6896426677703857
},
{
"label": "Talbot, I. T.",
"idx": "w79jfb5b_538649",
"similarity": 0.6891906261444092
},
{
"label": "Talbot, Emma, 1969-",
"idx": "ghkmv8ah_183424",
"similarity": 0.6868633031845093
},
{
"label": "Talbot, Sarah",
"idx": "me48quwa_292144",
"similarity": 0.6862633228302002
},
...
]
}
```