What is the difference between salted Bloom filter PPRL and HMAC tokenization?

Salted Bloom filter PPRL encodes field values as bit arrays using q-gram hashing and supports fuzzy matching through Dice or Jaccard similarity, making it tolerant of typographical variation. HMAC tokenization generates an exact keyed hash of a normalized field value and only matches records that are identical after preprocessing. Bloom filters handle data quality variation probabilistically while HMAC linkage requires strict normalization but produces no false positives from similar-but-different values.

Is a Bloom filter encoding of a patient field considered de-identified under HIPAA?

Not automatically. HIPAA Safe Harbor de-identification requires removal of 18 specified identifiers. A Bloom filter encoding derived from those identifiers is not de-identified under Safe Harbor. It may qualify under the Expert Determination pathway if a qualified statistical expert documents that the residual reidentification risk is very small, but that analysis must be performed per deployment and per institution.

What makes secure multiparty computation more privacy-preserving than the token exchange model?

In the token exchange model, a linkage party receives token sets from both institutions and can learn which records are linked even if it cannot read the underlying data. SMC removes this residual exposure by computing the linkage function jointly across parties with no single party ever holding both inputs. The tradeoff is significantly higher computational cost and protocol complexity, which in practice limits SMC to use cases where the linkage party itself cannot be trusted.

How do consent receipts integrate with PPRL audit requirements?

Consent receipts conforming to the Kantara Initiative Consent Receipt Specification encode patient authorization for specific processing events with cryptographic provenance. When linked to a PPRL event through a W3C Verifiable Credential, the receipt creates a tamper-evident audit record proving that a specific linkage was authorized at the time it occurred. Static data use agreements in PDF format cannot provide this level of event-level verifiable accountability.

What is the main operational challenge in HMAC-based record linkage across institutions?

The primary operational challenge is maintaining a shared canonical preprocessing specification for quasi-identifier normalization across all participating institutions. Case normalization, handling of hyphenated names, date format standardization and handling of missing fields must all be specified precisely and enforced consistently. Any divergence in preprocessing between institutions breaks token matching even when the underlying patient data is correct.

Privacy-Preserving Record Linkage: Bloom Filters, HMAC & SMC

The Linkage Problem That Cannot Be Solved With Consent Alone

Two hospitals serve the same patient. A research consortium wants longitudinal data across both. The patient has not been assigned a universal identifier. The hospitals are legally prohibited from sharing raw records with each other or with the consortium. And yet, the research is clinically meaningful, potentially life-saving, and stalled entirely because no one can link the records without exposing the data.

This is not a hypothetical. It is the daily operational reality for federated health data networks, cancer registries, rare disease cohorts and pharmacovigilance systems across every major healthcare system in the world.

The naive solution is to pool the data. Pick a trusted third party, hand them everything, let them run a join. That approach fails on multiple dimensions: regulatory exposure under HIPAA, GDPR Article 9 and equivalent national frameworks; reidentification risk at the third party; and the institutional trust problem, since no hospital system will accept another hospital as a "trusted" custodian of its patient population.

Privacy-preserving record linkage (PPRL) is the technical discipline that solves this. It draws on probabilistic data structures, keyed hash functions, and multiparty cryptographic protocols to allow institutions to determine whether their records refer to the same individual without either party seeing the other's underlying data. The field has matured significantly since Christen and Vatsalan's foundational survey work, and as of 2026 there are production deployments in Australia, Germany, the United Kingdom and the United States running at population scale.

This article explains how the core protocols work, where they fail and what governance architecture has to accompany them for the linkage to be legally defensible and scientifically valid.

Bloom Filters With Salt: Probabilistic Matching Without Raw Exposure

The Bloom filter is a space-efficient probabilistic data structure that encodes set membership. In the record linkage context, a field value such as a patient's surname is tokenized into q-grams, each q-gram is hashed into one or more bit positions in a fixed-length bit array, and the resulting array encodes the field without retaining the original string.

Two institutions can each encode the same field independently. The similarity between two Bloom filter encodings, measured by Dice coefficient or Jaccard similarity over the bit arrays, approximates the string similarity between the underlying values. A surname of "Johnson" and a misspelled "Jonson" will produce similar but not identical bit arrays, because most of their q-grams coincide. "Johnson" and "Smith" share no bigrams, so their arrays overlap only where hash positions collide by chance.
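The encoding and comparison described above can be sketched in a few lines. This is a minimal, unsalted illustration, not a production encoder: the filter length `m`, the number of hash functions `k`, the bigram padding character and the use of SHA-256 with an index prefix are all assumptions chosen for readability.

```python
import hashlib

def qgrams(value: str, q: int = 2) -> set[str]:
    """Split a padded, lowercased string into its set of overlapping q-grams."""
    padded = "_" * (q - 1) + value.lower() + "_" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def bloom_encode(value: str, m: int = 1000, k: int = 20, q: int = 2) -> set[int]:
    """Return the positions set to 1 in an m-bit filter (k hashes per q-gram)."""
    bits = set()
    for gram in qgrams(value, q):
        for i in range(k):
            digest = hashlib.sha256(f"{i}:{gram}".encode("utf-8")).digest()
            bits.add(int.from_bytes(digest[:8], "big") % m)
    return bits

def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient over set bit positions: 2 * |A & B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

johnson = bloom_encode("Johnson")
jonson = bloom_encode("Jonson")
smith = bloom_encode("Smith")

print("Johnson vs Jonson:", round(dice(johnson, jonson), 2))
print("Johnson vs Smith: ", round(dice(johnson, smith), 2))
```

The residual similarity between unrelated names comes from chance bit collisions, which is one reason match thresholds have to be calibrated against the chosen `m` and `k` rather than fixed in advance.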

The critical vulnerability in naive Bloom filter PPRL is frequency analysis. Because the encoding is deterministic per value, an adversary with a frequency table of common surnames can reconstruct the plaintext values from the bit arrays by matching frequency distributions. Cryptographer Dinusha Vatsalan and colleagues documented this attack class extensively in work cited by the IPDLN (International Population Data Linkage Network).

Salting addresses this. Each institution applies a shared secret salt to the q-gram hashing step before constructing the bit array. The salt is established through a separate key agreement protocol, typically Diffie-Hellman over an elliptic curve. Two institutions using the same salt produce comparable encodings. A third party who intercepts the bit arrays but does not hold the salt cannot reconstruct the plaintext values because the frequency distribution of the salted encodings does not match the frequency distribution of unsalted common values.
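One common way to apply a shared secret to the hashing step is to replace the plain hash with a keyed hash. The sketch below assumes HMAC-SHA256 as the keyed function and uses hard-coded placeholder salts; in a real deployment the salt would come from the key agreement step and would never appear in source code. The parameters `m` and `k` are illustrative.

```python
import hashlib
import hmac

def qgrams(value: str, q: int = 2) -> set[str]:
    """Split a padded, lowercased string into its set of overlapping q-grams."""
    padded = "_" * (q - 1) + value.lower() + "_" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def salted_bloom(value: str, salt: bytes, m: int = 1000, k: int = 20) -> frozenset[int]:
    """Encode a value into bit positions using a keyed hash of each q-gram."""
    bits = set()
    for gram in qgrams(value):
        for i in range(k):
            message = f"{i}:{gram}".encode("utf-8")
            digest = hmac.new(salt, message, hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % m)
    return frozenset(bits)

# Placeholder secrets for illustration only.
shared_salt = b"salt-agreed-by-both-institutions"
other_salt = b"salt-held-by-nobody-else"

site_a = salted_bloom("Johnson", shared_salt)
site_b = salted_bloom("Johnson", shared_salt)
outsider = salted_bloom("Johnson", other_salt)

print("same salt, same encoding:", site_a == site_b)
print("different salt, same encoding:", site_a == outsider)
```

Without the salt, an outsider cannot compute the encoding of a guessed surname and so cannot build the lookup table that a frequency attack on unsalted filters relies on.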

The remaining attack vector is the linkage party itself. If a single entity receives both sets of salted Bloom filter encodings and computes the similarities, that entity learns which records are linked even if it cannot read the underlying values. For many research use cases, this residual disclosure is acceptable under a data use agreement. For higher-sensitivity deployments, secure multiparty computation is required to remove even this exposure.

HMAC-Based Tokenization and the Token Exchange Model

HMAC stands for Hash-based Message Authentication Code, specified in IETF RFC 2104 and widely used for keyed data authentication. In the record linkage context, HMAC is used differently: as a pseudonymization mechanism to generate stable tokens from patient quasi-identifiers, comparable across every institution that holds the same key.

The basic protocol works as follows. Each institution normalizes a set of quasi-identifiers (first name, last name, date of birth, sex, postal code) through a shared canonical preprocessing specification. Each normalized value is then passed through HMAC-SHA256 with a shared secret key. The resulting 256-bit digest is the token for that field and that individual at that institution.

Because HMAC is deterministic given the same key and input, two institutions applying HMAC to the same normalized quasi-identifiers with the same shared key will produce identical tokens. Record linkage reduces to an exact match on token vectors rather than a probabilistic similarity computation.

This has an important implication: HMAC-based linkage is exact, not fuzzy. It handles systematic data quality variation through the preprocessing normalization step, not through similarity thresholds. If one institution stores "PATRICK" and another stores "Patrick", the preprocessing specification must enforce case normalization or the tokens will not match. Managing the preprocessing specification is, in practice, one of the hardest operational problems in HMAC-based PPRL.
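The dependence on normalization is easy to demonstrate. In the sketch below, `normalize_name` is a hypothetical canonical form (strip accents, uppercase, drop non-letters), not any network's actual specification. Prefixing the field name to the HMAC input is likewise an assumed design choice, so that identical strings in different fields produce different tokens.

```python
import hashlib
import hmac
import unicodedata

def normalize_name(raw: str) -> str:
    """Hypothetical canonical form: strip accents, uppercase, keep letters only."""
    decomposed = unicodedata.normalize("NFKD", raw)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.upper() if ch.isalpha())

def token(field: str, value: str, key: bytes) -> str:
    """HMAC-SHA256 token for one field, returned as 64 hex characters."""
    message = f"{field}={value}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()

key = b"demo-key-not-for-production"  # placeholder shared secret

# Without normalization, a case difference alone breaks the match.
raw_a = token("first_name", "PATRICK", key)
raw_b = token("first_name", "Patrick", key)

# With the same canonical form applied at both sites, the tokens agree.
norm_a = token("first_name", normalize_name("PATRICK"), key)
norm_b = token("first_name", normalize_name(" Patrick "), key)

print("raw tokens match:       ", raw_a == raw_b)
print("normalized tokens match:", norm_a == norm_b)
```

Every rule inside `normalize_name` is a decision that all sites must make identically, which is why the specification itself becomes a governed artifact.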

The token exchange model used in several national health research networks adds a further architectural constraint: tokens are never retained after the linkage event. The linkage engine receives two token sets, computes the intersection, returns a pseudonymous linkage key to each institution, and discards the tokens. Each institution maps its internal records to the pseudonymous key and shares only the mapped, de-identified research variables with the consortium. Neither institution ever sends raw records anywhere.
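The flow just described can be sketched as follows. The function name `run_linkage`, the token strings and the record identifiers are all hypothetical, and the sketch uses short placeholder tokens rather than real HMAC digests to keep the data readable.

```python
import secrets

def run_linkage(tokens_a: set[str], tokens_b: set[str]) -> dict[str, str]:
    """Linkage engine: issue a fresh random linkage key for each shared token.

    Only tokens present at both sites receive a key. The engine keeps no state
    once this function returns.
    """
    matched = tokens_a & tokens_b
    return {tok: secrets.token_hex(16) for tok in matched}

# Each institution privately holds a token -> local record id mapping.
local_a = {"tok-x": "A-001", "tok-y": "A-002"}
local_b = {"tok-y": "B-917", "tok-z": "B-918"}

# Only the token sets, never the record ids, are sent to the engine.
result = run_linkage(set(local_a), set(local_b))

# Each institution maps its own matched records to the pseudonymous key.
keys_a = {local_a[tok]: key for tok, key in result.items()}
keys_b = {local_b[tok]: key for tok, key in result.items()}

# The token-level result is discarded after the mapping step.
del result

print(keys_a)
print(keys_b)
```

Because the linkage key is random rather than derived from the token, retaining it downstream reveals nothing about the quasi-identifiers that produced the match.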

The Australian SURE (Secure Unified Research Environment) network and the German MIRACUM consortium both operate variants of this architecture. MIRACUM is documented in published work through the German Medical Informatics Initiative and uses HMAC-SHA3 for token generation across university hospital sites.

Secure Multiparty Computation for Record Alignment

Secure multiparty computation (SMC) allows two or more parties to jointly compute a function over their private inputs without any party learning anything about the other parties' inputs beyond what is revealed by the output of the function itself. The theoretical foundation is Yao's garbled circuit construction from the 1980s and its subsequent generalizations including GMW (Goldreich-Micali-Wigderson) and SPDZ protocols.
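The core property, computing on inputs that no single party sees, can be illustrated with additive secret sharing, the building block of SPDZ-style protocols. This toy example computes only a sum of two private counts. It is not a garbled circuit, and it omits the multiplication and authentication machinery that real protocols need; the modulus and the example values are arbitrary.

```python
import secrets

P = 2**61 - 1  # a Mersenne prime used as the modulus for all shares

def share(x: int, n: int = 2) -> list[int]:
    """Split x into n random-looking shares that sum to x modulo P."""
    parts = [secrets.randbelow(P) for _ in range(n - 1)]
    parts.append((x - sum(parts)) % P)
    return parts

def reconstruct(parts: list[int]) -> int:
    """Recombine shares; any strict subset of them reveals nothing about x."""
    return sum(parts) % P

# Each hospital splits its private count and sends one share to each party.
hospital_a_shares = share(1234)
hospital_b_shares = share(4321)

# Each computing party adds the shares it holds, seeing only random values.
party0 = (hospital_a_shares[0] + hospital_b_shares[0]) % P
party1 = (hospital_a_shares[1] + hospital_b_shares[1]) % P

# Only the combined output is revealed.
total = reconstruct([party0, party1])
print(total)
```

Linkage needs comparisons rather than sums, which is where the cost of general-purpose protocols enters.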

Applied to record linkage, SMC allows two hospitals to answer the question "do these two records refer to the same person?" without either hospital sending its record data to the other or to a third party. The computation is performed jointly, typically using protocols built on oblivious transfer, and each party learns only the binary linkage result for each record pair.

The practical constraint with SMC is computational cost. Garbled circuit protocols scale poorly with the number of comparisons required. For two institutions each holding one million records, a naive pairwise comparison requires evaluating one trillion record pairs. Several research groups have addressed this with private set intersection (PSI) protocols that first reduce the candidate pair space using blocking variables before applying the full SMC comparison. The PSI step uses additively homomorphic encryption or oblivious pseudorandom functions (OPRFs) to exchange blocked candidate pairs without revealing non-matching records.
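The effect of blocking on the comparison count can be checked with simple arithmetic. The sketch below counts candidate pairs in the clear, using birth year as an assumed blocking variable and made-up data. It shows only why blocking shrinks the workload, not how the blocked pairs are exchanged privately.

```python
from collections import Counter

def naive_pairs(n_a: int, n_b: int) -> int:
    """Every record at site A compared against every record at site B."""
    return n_a * n_b

def blocked_pairs(blocks_a: list[str], blocks_b: list[str]) -> int:
    """Compare only records that fall in the same block at both sites."""
    count_a, count_b = Counter(blocks_a), Counter(blocks_b)
    return sum(count_a[k] * count_b[k] for k in count_a.keys() & count_b.keys())

# One million records per site gives 10**12 naive comparisons.
print(naive_pairs(1_000_000, 1_000_000))

# Tiny illustrative sample: one blocking key (birth year) per record.
years_a = ["1980", "1980", "1975", "1990"]
years_b = ["1980", "1975", "1975", "2001"]

print("naive:  ", naive_pairs(len(years_a), len(years_b)))
print("blocked:", blocked_pairs(years_a, years_b))
```

The saving is not free: any true match whose blocking key differs between sites, for example through a mistyped birth year, is never compared at all.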

The ENCRYPTO group at TU Darmstadt has published production-grade SMC libraries including ABY (A Framework for Efficient Mixed-Protocol Secure Two-Party Computation) that are used in health data research contexts. The SPIKE project in Germany applied SMC to cancer registry linkage across federal state boundaries where legal constraints prohibited any form of centralized token processing.

SMC-based PPRL provides the strongest privacy guarantees of the three main approaches. It also has the highest protocol complexity, the greatest infrastructure overhead and the most demanding requirements for synchronized cryptographic key management across institutions. For most research deployments as of 2026, salted Bloom filters or HMAC tokenization with strict governance controls offer a more operationally tractable tradeoff.

Real-World Deployments: What the Research Infrastructure Actually Looks Like

The Australian Institute of Health and Welfare (AIHW) runs population-scale record linkage across state health registries using a separation principle: a linkage unit holds only quasi-identifiers, and a data custodian holds only research variables. Neither unit can reconstruct the full record. This is not cryptographic PPRL in the strict sense, but it is the institutional precursor that motivated the cryptographic protocols now being deployed at the state level.

The UK SAIL Databank at Swansea University uses an anonymisation and linkage approach where the linkage key generation is handled by a Trusted Third Party (TTP) that is structurally separated from both the data providers and the researchers. The TTP holds the linkage function but never the research data. This architecture is described in published literature by Jones and Ford and is referenced in the UK Health Data Research Alliance's operational framework.

In the United States, PCORnet (the National Patient-Centered Clinical Research Network) operates a federated model where member institutions never send patient-level data to the coordinating center. Distributed queries execute at each site. Record linkage across sites for longitudinal studies uses a combination of HMAC tokenization and probabilistic matching with site-specific salt management. The PCORnet Common Data Model specification documents the normalization requirements that enable cross-site token comparison.

The NIH National COVID Cohort Collaborative (N3C) took a different approach: a limited dataset under a data use agreement (DUA), with controlled access through a secure enclave. N3C is not PPRL in the cryptographic sense, but it illustrates that even the best-resourced federal initiatives often default to governance controls over cryptographic controls when operating under emergency timelines. The lesson for the field is that cryptographic PPRL and governance-based PPRL are complements, not substitutes.

Attack Surface Analysis and Known Weaknesses

Every PPRL protocol has a residual attack surface. Understanding it is not optional for practitioners deploying these systems in production.

For salted Bloom filter PPRL, the primary residual risk is the salt key itself. If the shared salt is compromised, all encoded records are vulnerable to frequency analysis reconstruction. Salt management therefore requires hardware security module (HSM) storage, key rotation schedules and revocation procedures. The W3C Data Privacy Vocabularies and Controls Community Group (DPVCG) has published conceptual vocabulary for encoding these controls in machine-readable consent and policy documents, relevant for audit trail construction.

For HMAC tokenization, the attack is different. HMAC tokens are deterministic functions of the normalized input. An adversary who holds a token, has obtained the shared key or can query the tokenization step, and can enumerate a finite space of quasi-identifier combinations (for example, all combinations of common first names, common last names and dates of birth in a given year) can recover the plaintext through a dictionary attack. This is the PPRL equivalent of a rainbow table attack. Beyond protecting the key, defense requires either extremely high-entropy inputs or the addition of a per-record nonce, and a nonce breaks the token reuse property that makes linkage possible.
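The enumeration is short enough to write out. The sketch below assumes the adversary already has the key (for example, an insider at a participating site) and that the token covers a concatenated name and date-of-birth string. The field layout, key and candidate lists are invented for illustration.

```python
import hashlib
import hmac
from itertools import product

def token(value: str, key: bytes) -> str:
    """HMAC-SHA256 token over an already-normalized quasi-identifier string."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def dictionary_attack(target, key, first_names, last_names, dobs):
    """Try every candidate combination until one reproduces the target token."""
    for first, last, dob in product(first_names, last_names, dobs):
        candidate = f"{first}|{last}|{dob}"
        if hmac.compare_digest(token(candidate, key), target):
            return candidate
    return None  # the true value lies outside the enumerated space

key = b"compromised-shared-key"  # assumption: the key has leaked
intercepted = token("MARY|SMITH|1980-02-29", key)

first_names = ["JOHN", "MARY", "ANNA"]
last_names = ["SMITH", "JONES", "BROWN"]
dobs = [f"1980-02-{day:02d}" for day in range(1, 30)]

print(dictionary_attack(intercepted, key, first_names, last_names, dobs))
```

The search space here is 3 x 3 x 29 candidates. Realistic name and birth date lists are larger but still far too small to resist enumeration, which is why key custody carries most of the security burden.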

For SMC protocols, implementation vulnerabilities dominate. Side-channel leakage through timing and memory access patterns, incorrect use of oblivious transfer extensions and misimplemented garbled circuit evaluation are the primary failure modes documented in the cryptographic engineering literature. Using audited open-source libraries (ABY, MOTION, MP-SPDZ) rather than custom implementations is mandatory for production health data deployments.

Linkage false positive rates deserve separate attention. A false positive link joins two records that belong to different individuals. In a clinical context this is not a statistical inconvenience: it can introduce erroneous medical history into a patient's longitudinal record. Threshold selection in probabilistic PPRL must account for the downstream clinical consequences of false positives, not just aggregate F1 performance metrics.
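One way to make that asymmetry explicit is to pick the threshold that minimizes a weighted error cost instead of maximizing F1. The sketch below uses invented similarity scores and assumed cost weights. The 10:1 ratio is a placeholder for a judgment that clinical stakeholders would have to make.

```python
def confusion(scores, threshold):
    """Count (tp, fp, fn) for labelled pairs given as (similarity, is_match)."""
    tp = sum(1 for s, match in scores if s >= threshold and match)
    fp = sum(1 for s, match in scores if s >= threshold and not match)
    fn = sum(1 for s, match in scores if s < threshold and match)
    return tp, fp, fn

def weighted_cost(scores, threshold, fp_cost=10.0, fn_cost=1.0):
    """Total cost when a false link is fp_cost/fn_cost times worse than a miss."""
    _, fp, fn = confusion(scores, threshold)
    return fp * fp_cost + fn * fn_cost

def best_threshold(scores, candidates, fp_cost=10.0, fn_cost=1.0):
    """Return the candidate threshold with the lowest weighted cost."""
    return min(candidates, key=lambda t: weighted_cost(scores, t, fp_cost, fn_cost))

# Made-up validation pairs: (Dice similarity, ground-truth match flag).
labelled = [
    (0.95, True), (0.90, True), (0.82, True), (0.80, False),
    (0.78, True), (0.70, False), (0.60, False),
]
candidates = [0.65, 0.75, 0.85]

print("equal costs:     ", best_threshold(labelled, candidates, 1.0, 1.0))
print("false links x10: ", best_threshold(labelled, candidates, 10.0, 1.0))
```

With equal costs the middle threshold wins. Once a false link is weighted ten times heavier than a missed link, the stricter threshold is preferred even though it misses two true matches.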

The Governance Layer That Cryptography Cannot Replace

Cryptographic PPRL is a technical control. It reduces the information exposed during linkage. It does not eliminate the need for legal basis, institutional agreement, patient notification or audit accountability.

Under GDPR Article 9, health data processing requires an explicit legal basis even when the data is pseudonymized. A Bloom filter encoding of a patient's date of birth is still personal data under GDPR Recital 26 if the encoding is re-linkable to the individual with reasonable effort. The pseudonymization reduces risk and may shift the legitimate interest calculation, but it does not remove the data from the scope of the regulation.

HIPAA's Safe Harbor de-identification standard requires removal of 18 specified identifiers. PPRL encodings derived from those identifiers are not de-identified under Safe Harbor. They may qualify under the Expert Determination standard with appropriate statistical justification for the residual reidentification risk, but that determination requires documented expert analysis per institution per use case.

The data fiduciary model, which Dr. Patrick Fisher examines at length in the context of the Personal Data Asset Origination System (PDAOS), offers a governance architecture for PPRL deployments: an independent fiduciary holds the linkage keys and the linkage function, operates under legally binding duty to data subjects, and can be audited by a data protection authority. This is structurally different from both the trusted third party model (where the TTP operates at the discretion of the commissioning institutions) and the pure SMC model (which has no institutional accountability layer).

MyDataKey, the cryptographic identity infrastructure developed alongside the PDAOS framework, implements consent receipts conforming to the Kantara Initiative Consent Receipt Specification and W3C Verifiable Credentials. These receipts can encode patient consent for specific linkage events with cryptographic provenance, allowing auditors to verify that any given linkage was authorized at the time it was performed. This creates a linkage audit trail that survives institutional personnel changes and system migrations, a requirement that static DUAs stored in PDF format simply cannot meet.
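The tamper-evidence property can be illustrated with a hash-chained log that ties each linkage event to a consent receipt identifier. This is a generic sketch, not the Kantara receipt format, the W3C Verifiable Credentials data model or the MyDataKey implementation. All field names and identifiers are hypothetical, and a real system would also sign each entry.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder previous-hash for the first entry

def entry_hash(body: dict) -> str:
    """SHA-256 over a canonical JSON serialization of the entry body."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def append_event(log: list, receipt_id: str, linkage_event_id: str, timestamp: str) -> None:
    """Append an entry that commits to the hash of the previous entry."""
    body = {
        "receipt_id": receipt_id,
        "linkage_event_id": linkage_event_id,
        "timestamp": timestamp,
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    log.append({**body, "hash": entry_hash(body)})

def verify(log: list) -> bool:
    """Recompute every hash and check each link back to the genesis value."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev_hash"] != prev or entry_hash(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

log = []
append_event(log, "receipt-001", "linkage-2026-0001", "2026-01-15T10:00:00Z")
append_event(log, "receipt-002", "linkage-2026-0002", "2026-01-15T10:05:00Z")
print("intact log verifies:", verify(log))

# Rewriting which receipt authorized the first linkage breaks the chain.
log[0]["receipt_id"] = "receipt-999"
print("tampered log verifies:", verify(log))
```

A chain like this shows that an entry was altered after the fact. It does not prove that the original consent was valid, which is what the signed receipt itself has to carry.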

The technical and governance layers are not separable. Deploying strong cryptographic PPRL without a fiduciary-grade governance layer produces systems that are technically sophisticated and institutionally unaccountable. Deploying governance controls without cryptographic controls produces systems that are legally documented but operationally leaky. Both layers are required, and as of 2026 the research infrastructure community is still working toward deployments that take both seriously.
