What is the difference between a salted Bloom filter and HMAC tokenization in record linkage?

A salted Bloom filter encodes character n-grams into a bit array and supports approximate similarity matching, which tolerates typos and name variations. HMAC tokenization applies a keyed hash directly to structured fields like date of birth and produces an exact-match token. Bloom filters handle noisy data better but require larger storage and are more vulnerable to frequency attacks at small filter sizes. Many production PPRL pipelines use HMAC tokenization for blocking and Bloom filters for similarity scoring within candidate blocks.

Does PPRL satisfy HIPAA de-identification requirements?

Regulatory classification of PPRL outputs is not settled in 2026. Some institutional review boards treat salted Bloom filter encodings as de-identified data under the HIPAA Expert Determination standard, while others classify them as limited datasets requiring data use agreements. The ambiguity stems from demonstrated re-identification attacks against small Bloom filters under certain conditions. Engineering teams should obtain formal legal review and apply conservative filter sizes (1000 bits or more) to strengthen the technical argument for de-identification.

What is private set intersection and why is it stronger than Bloom filter approaches?

Private set intersection (PSI) is a secure multiparty computation protocol that allows two parties to discover which records they share without revealing any non-matching records to each other. Unlike Bloom filter methods, PSI requires no shared secret salt or HMAC key, which eliminates the key compromise attack surface. The trade-off is computational cost and implementation complexity. PSI protocols built on oblivious pseudorandom functions (OPRFs), for which constructions are now specified in RFC 9497, are making this approach practical at health registry scale in 2026.

How should a PPRL pipeline handle salt or key management across projects?

Each linkage project should use a unique salt or HMAC key so that encoded outputs from different projects cannot be joined to reconstruct fuller identity profiles. Keys should be generated using a cryptographically secure random number generator, distributed to participating institutions over authenticated encrypted channels and revoked at project conclusion. Emerging decentralized identity infrastructure based on W3C DIDs offers a principled framework for key distribution and revocation in multi-institutional research networks.

Can individuals control whether their records are linked across institutions?

Current PPRL deployments typically operate under institutional consent frameworks approved by ethics boards, not individual per-linkage consent. The technical architecture to support individual control exists: if a patient holds a verifiable credential encoding their quasi-identifiers, they can authorize specific linkages by presenting that credential to participating institutions. This model, being developed within personal data sovereignty frameworks like the Personal Data Asset Origination System, would make individual consent a cryptographic gate rather than a governance assumption.

Privacy-Preserving Record Linkage for Health Data

Why Record Linkage Matters Across Institutional Walls

Imagine a cancer patient who receives chemotherapy at a university hospital, follow-up imaging at a community clinic and primary care at a federally qualified health center. Three institutions. Three electronic health record systems. Three siloed datasets that, taken together, could reveal critical patterns in treatment response. Taken separately, each dataset is blind to the full clinical picture.

Privacy-preserving record linkage, or PPRL, is the technical discipline that solves this problem without requiring any institution to hand over raw patient data. It allows researchers and public health agencies to answer the question: are these records about the same person? And it does so using cryptographic commitments rather than plaintext identifiers.

As of 2026 this remains one of the hardest unsolved problems in health informatics. Not because the math is missing. The math exists. The challenge is deploying that math inside real institutional environments with legacy systems, competing IRB protocols, underfunded IT teams and attorneys who treat any data-sharing agreement like a controlled substance.

This article examines the three dominant cryptographic approaches to PPRL, explains how each one actually works under the hood and points to the real deployments that have stress-tested these methods at scale.

The Raw Data Pooling Trap

The naive approach to cross-institutional record linkage is to build a central registry. Every institution sends patient demographics to a trusted third party, the third party runs deterministic or probabilistic matching, then returns a set of linked record IDs. Simple. Understandable. Deeply problematic.

Raw data pooling introduces a single point of compromise. The trusted third party holds enough information to re-identify individuals, reconstruct care histories and, if breached, expose exactly the kind of sensitive health data that HIPAA and the FTC Act are designed to protect. Calling that party "trusted" is a policy decision, not a technical guarantee.

Beyond breach risk, raw pooling creates jurisdictional friction. Under the EU General Data Protection Regulation, transferring identifiable health data across member state boundaries requires specific legal bases that are difficult to satisfy at research timescales. Under HIPAA, even a business associate agreement does not eliminate the residual risk that a de-identification failure somewhere in the pipeline will expose protected health information.

The goal of PPRL is to replace the trusted third party with a cryptographic protocol. Instead of asking an institution to trust another institution, you ask both institutions to trust mathematics. The protocol guarantees that no single party ever sees plaintext data from the other, and that the output reveals only match decisions, not underlying identifiers.

Bloom Filters with Salt: Probabilistic Matching Without Exposure

The most widely deployed cryptographic primitive in PPRL today is the salted Bloom filter, sometimes called a cryptographic long-term key or CLK in the research literature. The technique was introduced systematically by Schnell, Bachteler and Reiher and has been extended substantially since.

A Bloom filter is a fixed-length bit array. To encode a string, you break it into overlapping character n-grams, then hash each n-gram with k independent hash functions, setting the corresponding bit positions to 1. The result is a compact binary representation that supports approximate set-membership testing. Two Bloom filters encoding similar strings will share a high proportion of set bits. That similarity is measurable using the Dice coefficient or Jaccard index without ever decoding the underlying string.
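As a rough illustration of the encoding and comparison just described, the sketch below builds an unsalted Bloom filter from padded bigrams and compares names with the Dice coefficient. The filter length of 1024, the value k = 10, the underscore padding and the trick of deriving k hash functions from SHA-256 with an index prefix are all assumptions made for the example, not parameters taken from any particular deployment.

```python
import hashlib


def bigrams(text: str) -> set[str]:
    """Overlapping character 2-grams, padded so the first and last letters count."""
    padded = f"_{text.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(text: str, m: int = 1024, k: int = 10) -> set[int]:
    """Return the positions of the bits set to 1 in an m-bit filter."""
    positions = set()
    for gram in bigrams(text):
        for i in range(k):
            # Assumption: the i-th hash function is SHA-256 over "i:gram".
            digest = hashlib.sha256(f"{i}:{gram}".encode("utf-8")).digest()
            positions.add(int.from_bytes(digest[:8], "big") % m)
    return positions


def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient: 2 * |A and B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


close = dice(bloom_encode("katherine"), bloom_encode("catherine"))
far = dice(bloom_encode("katherine"), bloom_encode("mohammed"))
print(f"similar names: {close:.2f}, unrelated names: {far:.2f}")
```

The two spellings share most of their bigrams, so their filters overlap heavily, while the unrelated pair overlaps only through chance collisions.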

The privacy enhancement comes from salting. Before hashing, each n-gram is concatenated with a secret salt that is unique to the institution or to the specific linkage project. An attacker who intercepts a salted Bloom filter cannot reconstruct the original string by precomputing a rainbow table, because the salt changes the hash input unpredictably. For an attacker who does not know the salt, brute-force inversion becomes computationally infeasible for realistic identifier spaces.

Two institutions that agree on a shared salt can still compare their Bloom filters and derive similarity scores. Two institutions that use different salts cannot compare their filters directly. This gives project coordinators a precise cryptographic control lever: salt agreement is the mechanism by which consent to linkage is operationalized.
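The effect of salt agreement can be seen in a small sketch. It assumes the salt is simply prepended to each hash input, as the text describes, and uses hypothetical salt values and the same illustrative parameters (1024 bits, k = 10) as a plain Bloom filter would. It is not the encoding of any specific library.

```python
import hashlib


def salted_bloom(text: str, salt: bytes, m: int = 1024, k: int = 10) -> set[int]:
    """Bit positions for a salted bigram Bloom filter (illustrative only)."""
    padded = f"_{text.lower()}_"
    grams = {padded[i:i + 2] for i in range(len(padded) - 1)}
    positions = set()
    for gram in grams:
        for i in range(k):
            data = salt + b"|" + str(i).encode() + b"|" + gram.encode("utf-8")
            digest = hashlib.sha256(data).digest()
            positions.add(int.from_bytes(digest[:8], "big") % m)
    return positions


def dice(a: set[int], b: set[int]) -> float:
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical salts for two separate linkage projects.
project_salt = b"hypothetical-project-A-salt"
other_salt = b"hypothetical-project-B-salt"

same_salt = dice(salted_bloom("jonathan", project_salt),
                 salted_bloom("jonathon", project_salt))
cross_salt = dice(salted_bloom("jonathan", project_salt),
                  salted_bloom("jonathan", other_salt))
print(f"shared salt, near-identical names: {same_salt:.2f}")
print(f"different salts, identical name:   {cross_salt:.2f}")
```

Under a shared salt the misspelled name still scores high. Under different salts even the identical name scores no better than chance, which is what makes the salt usable as a control lever.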

The threat model for salted Bloom filters is not perfect. Research by Christen, Schnell and colleagues has demonstrated that under certain conditions, graph-based frequency attacks can partially reconstruct encoded values when the filter size is small relative to the identifier space. Mitigations include using larger filter sizes (1024 bits or more), adding random bit flipping at a controlled rate to introduce plausible deniability and combining Bloom filters with additional cryptographic layers described below.
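The random bit flipping mitigation can be sketched in a few lines. The flip rate of 0.05 and the seeded generator are assumptions chosen so the example is reproducible. A real deployment would draw its randomness from a cryptographically secure source and would pick the rate by weighing privacy gain against lost linkage accuracy.

```python
import random


def flip_bits(bits: list[int], rate: float, rng: random.Random) -> list[int]:
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in bits]


# A stand-in 1024-bit filter; in practice this would be an encoded record.
original = [1, 0] * 512
noisy = flip_bits(original, rate=0.05, rng=random.Random(42))
changed = sum(a != b for a, b in zip(original, noisy))
print(f"{changed} of {len(original)} bits flipped")
```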

The Anonlink library, developed by Data61 at CSIRO and maintained as open source, implements salted Bloom filters at scale. It has been used in linkage projects across Australian health registries processing millions of records.

HMAC-Based Tokenization for Quasi-Identifier Hashing

Where Bloom filters operate on n-gram representations to tolerate typographic variation, HMAC-based tokenization takes a different approach. It applies a keyed hash function directly to structured quasi-identifiers: date of birth, postal code, sex assigned at birth or a combination of these fields concatenated in a canonical format.

HMAC, defined in RFC 2104, uses a secret key to produce a message authentication code. The critical property is that HMAC is deterministic under a fixed key, yet its outputs look random to anyone who does not hold that key: two identical inputs under the same key produce identical outputs, but an attacker without the key cannot invert the output to recover the input or test guesses against it. Two institutions that share the same HMAC key can hash the same patient identifiers and compare the resulting tokens to find exact matches.

The limitation of HMAC tokenization relative to Bloom filters is brittleness to data quality issues. A single character difference in a date of birth field, a maiden name versus a married name, or a transposition error in a postal code will produce a completely different HMAC output. Exact-match tokenization fails wherever data entry inconsistency exists, which in health records is everywhere.
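Both the exact-match behavior and the brittleness are easy to see with Python's standard hmac module. In this sketch the canonical format (fields joined with a pipe character), the field choice and the key value are assumptions for illustration, and the helper names are hypothetical.

```python
import hashlib
import hmac


def canonical_record(dob_iso: str, postal_code: str, sex: str) -> bytes:
    """Join normalized quasi-identifiers in a fixed order (assumed format)."""
    fields = [dob_iso.strip(),
              postal_code.replace(" ", "").upper(),
              sex.strip().upper()]
    return "|".join(fields).encode("utf-8")


def tokenize(key: bytes, dob_iso: str, postal_code: str, sex: str) -> str:
    """HMAC-SHA256 token over the canonical record, as a hex string."""
    message = canonical_record(dob_iso, postal_code, sex)
    return hmac.new(key, message, hashlib.sha256).hexdigest()


shared_key = b"hypothetical-shared-project-key"

token_a = tokenize(shared_key, "1985-03-07", "k1a 0b1", "f")     # institution A
token_b = tokenize(shared_key, "1985-03-07", "K1A0B1", "F")      # institution B
token_typo = tokenize(shared_key, "1985-03-08", "K1A0B1", "F")   # one digit off

print(token_a == token_b)     # formatting differences are normalized away
print(token_a == token_typo)  # a single wrong digit breaks the match
```

Canonicalization absorbs differences in case and spacing, but nothing absorbs a genuine data entry error, which is why exact-match tokens are usually reserved for stable fields.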

Practical PPRL pipelines often use HMAC tokenization as a blocking step. First, records are partitioned into candidate blocks using HMAC on high-quality stable fields like full date of birth and biological sex. Then, within each candidate block, salted Bloom filter similarity is computed on noisier fields like name and address. This two-stage architecture cuts the number of Bloom filter comparisons from the full cross product of the two record sets, which grows as O(n squared), to the sum of the pair counts inside each block, which is far smaller whenever the blocking fields split the data into many small blocks.
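A minimal sketch of the blocking stage, assuming each party has already reduced its records to (record ID, block token) pairs. The record IDs and token values below are invented for the example. Only pairs that share a block token go on to the more expensive Bloom filter comparison.

```python
from collections import defaultdict


def candidate_pairs(records_a, records_b):
    """Yield (id_a, id_b) only for records whose block tokens are equal.

    Each input is a list of (record_id, block_token) tuples.
    """
    blocks = defaultdict(list)
    for record_id, token in records_b:
        blocks[token].append(record_id)
    for id_a, token in records_a:
        for id_b in blocks.get(token, []):
            yield id_a, id_b


# Hypothetical HMAC block tokens, shortened for readability.
party_a = [("a1", "tok1"), ("a2", "tok1"), ("a3", "tok2")]
party_b = [("b1", "tok1"), ("b2", "tok3"), ("b3", "tok2")]

pairs = list(candidate_pairs(party_a, party_b))
all_pairs = len(party_a) * len(party_b)
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```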

Key management is the operational challenge for HMAC-based PPRL. The shared key must be distributed to participating institutions securely, rotated between projects to prevent cross-project linkage attacks and revoked when a project concludes. Protocols built on top of W3C Decentralized Identifiers and verifiable credentials are beginning to offer a principled infrastructure for key distribution in multi-institutional research networks, though adoption remains early.
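The rotation requirement can be illustrated with per-project keys drawn from Python's secrets module. The 32-byte key length and the helper names are assumptions for this sketch. Distribution and revocation are organizational processes that the code does not cover.

```python
import hashlib
import hmac
import secrets


def new_project_key(num_bytes: int = 32) -> bytes:
    """Generate a fresh key from the operating system's secure random source."""
    return secrets.token_bytes(num_bytes)


def token(key: bytes, value: str) -> str:
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


key_project_1 = new_project_key()
key_project_2 = new_project_key()

identifier = "1985-03-07|K1A0B1|F"
# The same person yields unrelated tokens in the two projects,
# so the two output files cannot be joined on the token column.
print(token(key_project_1, identifier) == token(key_project_2, identifier))
```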

Secure Multiparty Computation and Private Set Intersection

Salted Bloom filters and HMAC tokenization both require a shared secret. That shared secret, however carefully managed, still represents a point of coordination risk. If the salt or HMAC key is compromised, all historical linkage records derived from that key become vulnerable to dictionary attacks, in which the attacker recomputes encodings for every plausible identifier value and looks for matches.

Secure multiparty computation, or SMC, offers a stronger guarantee. In an SMC protocol, two or more parties jointly compute a function over their private inputs such that each party learns only the output of the function, not the other party's input. No shared secret is required. The cryptographic security comes from the protocol structure itself.

For record linkage, the relevant SMC primitive is private set intersection, or PSI. In a PSI protocol, institution A holds a set of patient identifiers encoded as cryptographic commitments. Institution B holds a similarly encoded set. The protocol runs an oblivious comparison that reveals only which identifiers appear in both sets, not any identifier that is unique to either party.

Modern PSI protocols are built on public-key cryptography, typically elliptic curve Diffie-Hellman or oblivious pseudorandom functions (OPRFs). The IRTF Crypto Forum Research Group has published OPRF constructions, including the verifiable VOPRF variant, as RFC 9497. These protocols are communication-efficient: two parties can execute PSI over millions of records with message complexity that grows linearly rather than quadratically.
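The core idea behind Diffie-Hellman style PSI is commutative blinding: each party raises hashed identifiers to its own secret exponent, and doubly blinded values are equal exactly when the underlying identifiers are equal. The sketch below simulates both parties in one process. It is a toy under loud assumptions: it uses modular exponentiation over the Mersenne prime 2^127 - 1 instead of an elliptic curve, the modulus is far too small to be secure, and it is not an implementation of RFC 9497 or of any production protocol.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # well-known Mersenne prime; toy-sized and NOT secure


def hash_to_group(item: str) -> int:
    """Map an identifier to an integer in [1, P - 1]."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def random_exponent() -> int:
    """Secret exponent coprime to P - 1, so blinding is a one-to-one map."""
    while True:
        e = secrets.randbelow(P - 3) + 2
        if math.gcd(e, P - 1) == 1:
            return e


def blind(items: list[str], secret: int) -> list[int]:
    return [pow(hash_to_group(x), secret, P) for x in items]


def reblind(values: list[int], secret: int) -> list[int]:
    return [pow(v, secret, P) for v in values]


def psi(set_a: set[str], set_b: set[str]) -> set[str]:
    """Return the items of set_a that also appear in set_b."""
    a, b = random_exponent(), random_exponent()
    items_a = sorted(set_a)
    a_once = blind(items_a, a)             # A sends H(x)^a to B
    a_twice = reblind(a_once, b)           # B returns H(x)^(ab), order kept
    b_once = blind(sorted(set_b), b)       # B sends H(y)^b (shuffled in practice)
    b_twice = set(reblind(b_once, a))      # A computes H(y)^(ba)
    return {x for x, v in zip(items_a, a_twice) if v in b_twice}


shared = psi({"tokA", "tokB", "tokC"}, {"tokB", "tokC", "tokD"})
print(sorted(shared))
```

Neither party's secret exponent ever leaves its owner, which is the sense in which no shared key is needed. Everything else about a real deployment, including the group, the message encoding and protection against malicious behavior, belongs to a reviewed cryptographic library.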

The trade-off for SMC and PSI is computational cost and protocol complexity. A basic Bloom filter comparison is implementable in a weekend by a competent data engineer. A full-scale OPRF-based PSI protocol requires careful cryptographic engineering, ideally reviewed by researchers with specific expertise in applied cryptography. Implementation errors in SMC protocols can leak information in subtle ways that are not visible to functional testing.

Research groups at Stanford, Carnegie Mellon and the Alan Turing Institute have demonstrated PSI-based record linkage in health research contexts. The performance numbers from these demonstrations show that PSI at the scale of a national cancer registry, tens of millions of records, is feasible on commodity cloud hardware in 2026 with protocol runtimes measured in minutes rather than hours.

Real-World Deployments and Lessons from the Field

The gap between cryptographic proof-of-concept and production deployment is wide. Several real deployments in 2026 have closed that gap and their design decisions are instructive.

The Australian Longitudinal Study on Women's Health has used Bloom filter-based PPRL to link survey data with Medicare Benefits Schedule records and Pharmaceutical Benefits Scheme data across multiple custodians. The deployment uses the CSIRO Anonlink infrastructure and processes linkages under a data governance framework that treats the encoded Bloom filter vectors as the privacy boundary. Raw identifiers never leave the originating institution.

In Germany, the Medical Informatics Initiative has built a record linkage working group that has standardized on a PPRL approach based on salted Bloom filters with 1024-bit filter length and bigram tokenization. The standard is documented in the MII core dataset specification and is being adopted across university hospital networks in the consortium.

In the United States, the Patient-Centered Outcomes Research Institute has funded PPRL pilots within its PCORnet clinical research network. These pilots have grappled with a specific challenge: IRB variability. Different institutional review boards classify encoded Bloom filter representations differently, some treating them as de-identified data and some treating them as limited datasets under HIPAA. That regulatory ambiguity slows deployment more than any cryptographic limitation.

The lesson from each of these deployments is the same. Cryptographic correctness is necessary but not sufficient. PPRL deployment requires governance frameworks that answer four questions clearly: who controls the salt or key, what is the legal basis for linkage under applicable privacy law, how are match outputs audited and how is linkage consent documented and revocable by the data subject.

Building a PPRL Pipeline: Architectural Decisions That Actually Matter

For engineering teams building PPRL infrastructure in 2026, the following architectural decisions have outsized impact on both security and practical adoption.

Encoding Standardization Before Hashing

Normalize all quasi-identifiers before encoding. Date of birth should be ISO 8601 format. Names should be uppercased, stripped of punctuation and phonetically standardized using Soundex or Double Metaphone before n-gram tokenization. Address fields should be geocoded and standardized. Garbage in, garbage match results out.
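A minimal normalization sketch follows, covering only the date and name rules. The list of accepted input date formats is an assumption (US-style month-first for slashed dates), and partners in a real project would need to agree on it explicitly. Phonetic standardization and address geocoding are left out because they depend on external libraries and services.

```python
import re
from datetime import datetime


def normalize_dob(raw: str, formats=("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")) -> str:
    """Return the date of birth in ISO 8601 form (YYYY-MM-DD)."""
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date of birth format: {raw!r}")


def normalize_name(raw: str) -> str:
    """Uppercase, strip punctuation and collapse runs of whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", raw.upper())
    return " ".join(cleaned.split())


print(normalize_dob("03/07/1985"))
print(normalize_name("  o'brien,  mary-ann "))
```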

Filter Size and Hash Function Count

For Bloom filter implementations, the filter length and the number of hash functions k jointly determine both the false-positive rate and the resistance to frequency-based attacks. Research by Christen and colleagues recommends filter sizes of at least 1000 bits for quasi-identifier combinations derived from typical health record demographics. Using cryptographic hash functions (SHA-256 or BLAKE3) rather than non-cryptographic alternatives (Murmur, FNV) is mandatory when the encoded output will be shared externally.
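The interaction between filter length, k and the number of encoded n-grams can be explored with the standard Bloom filter approximation, which treats hash outputs as independent and uniform. After inserting n items with k hash functions into m bits, the expected fraction of set bits is 1 - (1 - 1/m)^(kn), and the membership false-positive rate is roughly that fraction raised to the power k. The parameter values below are illustrative only.

```python
def expected_fill(m: int, k: int, n: int) -> float:
    """Expected fraction of bits set after inserting n items with k hashes."""
    return 1 - (1 - 1 / m) ** (k * n)


def false_positive_rate(m: int, k: int, n: int) -> float:
    """Approximate chance that an absent item appears to be present."""
    return expected_fill(m, k, n) ** k


# Illustrative case: roughly 20 bigrams per record, k = 10.
for m in (256, 512, 1024, 2048):
    fill = expected_fill(m, k=10, n=20)
    fpr = false_positive_rate(m, k=10, n=20)
    print(f"m={m:5d}  fill={fill:.3f}  false-positive rate={fpr:.2e}")
```

Short filters saturate quickly, which hurts matching accuracy as well as privacy, since a nearly full filter carries little discriminating information.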

Linkage Output Minimization

The output of a PPRL protocol should be a set of matched record ID pairs, not similarity scores. Releasing raw Dice coefficients for all candidate pairs leaks distributional information about the underlying quasi-identifiers. Apply a decision threshold and release only binary match or no-match decisions. If uncertainty quantification is needed for downstream analysis, model that uncertainty through calibrated probabilistic outputs at the aggregate level rather than the individual record level.
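Output minimization amounts to applying the threshold inside the protocol boundary and discarding the scores before anything is released. In this sketch the threshold of 0.85 and the record IDs are hypothetical. An appropriate threshold would be set from validation data for the fields being compared.

```python
def release_matches(scored_pairs, threshold: float = 0.85):
    """Keep only the ID pairs at or above the threshold; drop the scores."""
    return sorted((id_a, id_b)
                  for id_a, id_b, score in scored_pairs
                  if score >= threshold)


# Internal working data: (id_a, id_b, Dice coefficient). Never released.
scored = [("a1", "b7", 0.93), ("a2", "b4", 0.61), ("a3", "b9", 0.85)]

released = release_matches(scored)
print(released)
```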

Audit Logging and Consent Receipts

Every linkage operation should produce an immutable audit log that captures which institutions participated, which project salt or key was used, how many records were submitted by each party and the match rate. The Kantara Initiative Consent Receipt Specification provides a structured vocabulary for documenting the consent basis for each linkage. This matters both for regulatory compliance and for the data subject rights that GDPR and various US state privacy laws now require.
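One common way to approximate an immutable log in software is a hash chain, where each entry commits to the hash of the previous one so that later edits become detectable. The sketch below is one such design under stated assumptions: the field names are invented for illustration, the log records a key identifier and never the key itself, and it does not implement the Kantara Consent Receipt format.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry


def _digest(body: dict) -> str:
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_entry(log: list[dict], entry: dict) -> None:
    """Append an entry that commits to the hash of the previous entry."""
    prev_hash = log[-1]["entry_hash"] if log else GENESIS
    body = dict(entry, prev_hash=prev_hash)
    log.append(dict(body, entry_hash=_digest(body)))


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for record in log:
        body = {k: v for k, v in record.items() if k != "entry_hash"}
        if body.get("prev_hash") != prev_hash or _digest(body) != record["entry_hash"]:
            return False
        prev_hash = record["entry_hash"]
    return True


audit_log: list[dict] = []
append_entry(audit_log, {
    "institutions": ["hospital_a", "clinic_b"],   # hypothetical participants
    "key_id": "project-2026-001",                 # identifier only, not the key
    "records_submitted": {"hospital_a": 120000, "clinic_b": 45000},
    "match_rate": 0.31,
})
append_entry(audit_log, {
    "institutions": ["hospital_a", "registry_c"],
    "key_id": "project-2026-002",
    "records_submitted": {"hospital_a": 120000, "registry_c": 900000},
    "match_rate": 0.12,
})
print(verify_chain(audit_log))
```

A hash chain makes tampering evident rather than impossible, so the head hash still needs to be anchored somewhere the log's operator cannot rewrite.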

Integration with Decentralized Identity Infrastructure

The longer-term architecture for cross-institutional PPRL will likely integrate with W3C Verifiable Credentials and Decentralized Identifiers. An individual who holds a verifiable credential attesting to their date of birth and postal code can authorize a PPRL linkage by presenting that credential to both institutions without exposing the underlying data to either. This is the direction that MyDataKey is building toward as part of the Personal Data Asset Origination System: a model where the individual, not the institution, controls the cryptographic key that enables linkage.

The vision articulated in The Invisible Data, Volume 6 of The Invisible Series, is that personal data should flow under the sovereignty of the person it describes. PPRL is one of the technical architectures that makes that vision operationally real, not just philosophically appealing. When the cryptographic key to linkage lives in the subject's own identity wallet, consent is not a checkbox. It is a gate in the protocol.

That is where health data infrastructure needs to go. The cryptographic tools exist. The governance frameworks are catching up. The engineering work is in connecting them.
