Most people understand, at some level, that their data is being collected. What they do not understand is which data, by whom, at what granularity, and toward what purpose. That gap between what users believe is being recorded and what is actually captured is the core subject of The Invisible Data, Volume 6 of The Invisible Series by Dr. Patrick Fisher. The book argues that the most economically valuable personal data is precisely the data that remains invisible to the person generating it.
In 2026, that argument is no longer theoretical. AI training pipelines have converted invisible personal data from a passive surveillance byproduct into a primary industrial input. Personal data ownership is not a privacy preference. It is a structural economic and ethical problem that demands engineering solutions.
What Invisible Data Actually Means
The term "invisible data" does not refer to data that is hidden in a dramatic sense. It refers to data that users generate continuously without perceiving the act of generation, and without receiving any record of what was captured.
Visible data is easy to identify. A user fills out a registration form. They type a search query. They post a photo. These are deliberate disclosures. The user makes a choice and understands, at least broadly, that information is being transmitted.
Invisible data operates differently. It includes scroll velocity and pause duration on specific content blocks. It includes the sequence and timing of mouse movements before a purchase decision. It includes device sensor readings, network signal patterns, battery consumption logs and ambient audio classifications. It includes inferred emotional states derived from keystroke cadence. It includes the social graph reconstructed from contact list metadata, not from the contacts themselves.
These data types rarely appear in a consent dialog, and they are rarely described in plain-language privacy notices. They are collected because the technical infrastructure permits collection, and because no architectural requirement forces platforms to do otherwise.
Dr. Fisher frames this in The Invisible Data as a fundamental asymmetry of perception. The platform sees everything. The person sees nothing of what the platform sees. That perceptual gap is not accidental. It is the product of deliberate system design.
The Harvest You Never Authorized
Understanding the categories of unauthorized harvest requires moving past the standard privacy discourse, which focuses almost entirely on directly identifying information like names, email addresses and Social Security numbers.
The more valuable harvest operates at the inferential layer. Platforms do not need your name if they have your behavioral fingerprint. They do not need your location history if they can reconstruct your movement patterns from app-usage timestamps and network handoff logs. They do not need your medical history if purchase data, search sequences and sleep-pattern signals allow probabilistic inference of health conditions.
This inferential harvest has three components that are especially underappreciated.
The first is relational metadata. Your data reveals information about people who never consented to share anything with the platform at all. Your communication patterns, contact lists and co-location signals expose the behavioral profiles of third parties. Those third parties are typically unaware that the collection is happening, so whatever rights a data protection framework grants them on paper, they have no practical way to object to it.
The second is temporal correlation. A single data point has limited value. The same data point observed across eighteen months, correlated with purchasing cycles, life events and communication frequency changes, becomes a predictive asset of substantial commercial worth. The value is created by the platform's accumulation infrastructure, not by any additional disclosure from the user.
The third is model-derived attributes. Platforms train internal classifiers on aggregated behavioral data to assign attributes to individual profiles. These attributes, which might include inferred political orientation, estimated creditworthiness, predicted churn probability or assessed psychological vulnerability, are never shown to the individual. They exist as derived records that influence consequential decisions while remaining entirely outside the individual's awareness or control.
How AI Training Intensifies the Asymmetry
The pre-AI data economy was extractive. The AI training economy is extractive on a qualitatively different scale.
When a platform collected behavioral data in 2010, it used that data to serve targeted advertising. The data had a relatively local and bounded use. The economic value returned to the platform was real but constrained by the advertising market's mechanics.
AI foundation model training changes the equation entirely. Personal data contributed to a training corpus does not produce a one-time advertising impression. It produces a permanent parametric encoding that persists inside a model's weights indefinitely. That model is then commercialized across thousands of applications, enterprise deployments and API integrations. The economic value of the original data contribution compounds with every downstream use of the model.
The individual whose behavioral patterns, written expressions, preferences and social interactions were encoded into those weights receives nothing from that compounding value. They receive no attribution. They receive no compensation. They have no mechanism to verify that their data contributed to the model at all. They also have no ability to request removal from a model's parametric encoding, because current model architectures do not support reliable, verifiable unlearning at the level of an individual data point.
W3C's work on data provenance through the PROV ontology (available at w3.org/TR/prov-overview) establishes the conceptual vocabulary for tracking data origin and transformation chains. The problem is that existing AI training pipelines rarely implement provenance tracking. Data typically enters preprocessing pipelines without cryptographic anchoring, without consent receipts and without attribution records. By the time data reaches a training batch, its origin is usually unrecoverable.
The NIST Privacy Framework, maintained at nist.gov/privacy-framework, identifies data processing transparency as a core privacy outcome. Current large-scale AI training practices are structurally incompatible with that outcome. Transparency requires a provenance record. AI training pipelines routinely destroy provenance before training begins.
The Provenance Gap at the Heart of Modern AI
Provenance in data engineering refers to the complete, auditable record of a data item's origin, all transformations applied to it and all systems that have processed it. In manufacturing, provenance is called a chain of custody. In archival science, it is the basis of authenticity verification. In AI, it is largely absent.
The provenance gap creates several interconnected problems that compound the personal data ownership crisis.
Without provenance, consent cannot be verified. A data subject may have consented to their data being used for product improvement in 2021. They did not consent to that data being included in a foundation model training run in 2026. Because no provenance chain exists linking their original consent record to the training dataset, no system can verify whether their consent covered this downstream use. The consent record and the data use exist in completely separate operational silos.
Without provenance, attribution is impossible. If a researcher or regulator wants to determine whether a specific individual's data contributed to a model's training corpus, there is no reliable technical mechanism to answer that question. The EU AI Act, whose obligations phase in between 2025 and 2027, imposes transparency obligations on high-risk AI systems and on providers of general-purpose AI models. Demonstrating compliance with those obligations at the level of individual data subjects is not technically feasible without provenance infrastructure.
Without provenance, economic claims cannot be established. Data trusts and data cooperative models, which have emerged as promising governance structures for collective data negotiation, require provenance records to calculate contribution shares and distribute economic returns. A data trust that cannot verify what each member contributed cannot execute its core function.
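As a rough illustration of why a data trust depends on per-member provenance, the sketch below splits a payout in proportion to verified contribution counts. The function names `contribution_shares` and `distribute`, the member labels and the proportional-by-count rule are all assumptions made for this example; the source does not say how a trust would weight contributions.

```python
from fractions import Fraction


def contribution_shares(verified_counts: dict[str, int]) -> dict[str, Fraction]:
    """Return each member's share of the total verified contributions.

    `verified_counts` maps a member identifier to the number of records whose
    provenance could be verified. Records without provenance never reach this
    function, which is why the calculation depends on provenance existing.
    """
    total = sum(verified_counts.values())
    if total == 0:
        raise ValueError("no verified contributions to allocate")
    return {member: Fraction(count, total) for member, count in verified_counts.items()}


def distribute(revenue_cents: int, shares: dict[str, Fraction]) -> dict[str, int]:
    """Split revenue in whole cents; leftover cents go to the largest shares."""
    payouts = {member: int(revenue_cents * share) for member, share in shares.items()}
    leftover = revenue_cents - sum(payouts.values())
    for member in sorted(shares, key=shares.get, reverse=True)[:leftover]:
        payouts[member] += 1
    return payouts
```

Exact fractions are used so that the shares always sum to one and no cent is lost to floating-point rounding. A real trust would likely weight contributions by something richer than a record count.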
The IETF's work on Structured Syntax Suffixes and media type registration (see RFC 6838 at datatracker.ietf.org) provides foundational vocabulary for data typing. Building consent-aware provenance chains on top of these standards is technically feasible. The barrier is not cryptographic capability. The barrier is that platforms have little economic incentive to implement provenance infrastructure voluntarily, because provenance infrastructure would make the invisible data visible.
What PDAOS Infrastructure Corrects
The Personal Data Asset Origination System, developed as part of the research program at Own Your Data Inc, addresses the provenance gap at the architectural level rather than at the regulatory layer.
PDAOS operates from a foundational premise: data should be cryptographically anchored to its origin at the moment of generation. This means that before any data point enters a platform's processing pipeline, it carries an immutable provenance record that includes the subject's identity credential, the consent scope under which the data was shared, the timestamp and contextual metadata of generation, and a cryptographic hash that makes any downstream modification detectable.
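The anchoring step described above can be sketched with nothing more than a content hash. This is a minimal illustration and not the PDAOS implementation: the field names, the `anchor` and `is_intact` helpers and the `did:example:` identifier in the usage note are hypothetical, and canonical JSON is an assumed serialization.

```python
import hashlib
import json
from datetime import datetime, timezone


def canonical_bytes(obj) -> bytes:
    """Serialize to a stable byte form so equal objects always hash equally."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def anchor(payload: dict, subject_id: str, consent_scope: list[str]) -> dict:
    """Attach a provenance record to a data point at the moment of generation."""
    record = {
        "subject": subject_id,                    # identity credential reference
        "consent_scope": sorted(consent_scope),   # purposes the subject agreed to
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "payload_sha256": hashlib.sha256(canonical_bytes(payload)).hexdigest(),
    }
    # Hash the record itself so edits to the scope or timestamp are also detectable.
    record["record_sha256"] = hashlib.sha256(canonical_bytes(record)).hexdigest()
    return record


def is_intact(payload: dict, record: dict) -> bool:
    """Return True only if neither the payload nor the record has been altered."""
    body = {k: v for k, v in record.items() if k != "record_sha256"}
    return (
        record.get("payload_sha256") == hashlib.sha256(canonical_bytes(payload)).hexdigest()
        and record.get("record_sha256") == hashlib.sha256(canonical_bytes(body)).hexdigest()
    )
```

Calling `anchor({"event": "scroll"}, "did:example:alice", ["product_improvement"])` yields a record that `is_intact` accepts until either the payload or the consent scope is changed. A bare hash only detects changes made by someone who cannot also rewrite the record, so immutability in practice would need the record to be signed or anchored somewhere the platform does not control.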
This architecture draws on Decentralized Identifiers (DIDs) as specified by the W3C DID Core specification at w3.org/TR/did-core. Each data subject holds a DID that serves as the root of their data provenance graph. Every data point they generate is signed against that DID, creating a verifiable chain that persists through preprocessing, dataset assembly and model training.
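One way to picture a chain that survives preprocessing and dataset assembly is a list of signed links, each pointing back at the previous one. The sketch below is an assumption-heavy stand-in: it uses HMAC from the Python standard library in place of the asymmetric signature a real DID verification method would use, and the stage names, the `add_link` and `verify_chain` helpers and the `did:example:` identifier are hypothetical.

```python
import hashlib
import hmac
import json


def digest(obj) -> str:
    """SHA-256 over a canonical JSON encoding."""
    data = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def sign(link_body: dict, key: bytes) -> str:
    """HMAC stand-in for a DID-bound signature (ASSUMPTION, not a real DID proof)."""
    return hmac.new(key, digest(link_body).encode("ascii"), hashlib.sha256).hexdigest()


def add_link(chain: list[dict], stage: str, data, did: str, key: bytes) -> list[dict]:
    """Append one pipeline stage to the chain, pointing back at the previous link."""
    body = {
        "did": did,
        "stage": stage,
        "data_sha256": digest(data),
        "parent": chain[-1]["signature"] if chain else None,
    }
    return chain + [{**body, "signature": sign(body, key)}]


def verify_chain(chain: list[dict], key: bytes) -> bool:
    """Check every signature and every parent pointer, from root to tip."""
    parent = None
    for link in chain:
        body = {k: v for k, v in link.items() if k != "signature"}
        if body.get("parent") != parent:
            return False
        if not hmac.compare_digest(link["signature"], sign(body, key)):
            return False
        parent = link["signature"]
    return True
```

Because each link signs its parent's signature, removing or editing an intermediate stage breaks verification for everything after it. HMAC needs a shared secret, so unlike a DID-based signature it cannot be checked by a third party who holds only public key material.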
PDAOS also implements the W3C Verifiable Credentials specification at w3.org/TR/vc-data-model to encode consent receipts. A consent receipt in the PDAOS model is not a checkbox acknowledgment. It is a cryptographically signed document that specifies the exact processing purposes for which data was shared, the retention window, the permissible downstream uses and the revocation conditions. When a training pipeline ingests data, it can verify the consent receipt programmatically. Data without a valid consent receipt for the intended processing purpose cannot enter the pipeline.
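The ingestion rule in the last two sentences can be sketched as a simple filter. The `ConsentReceipt` and `Record` types below are hypothetical simplifications and do not follow the W3C Verifiable Credentials data model; signature verification is assumed to have happened upstream and is not shown.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ConsentReceipt:
    subject: str
    purposes: frozenset       # processing purposes the subject agreed to
    expires_at: datetime      # end of the retention window
    revoked: bool = False


@dataclass(frozen=True)
class Record:
    payload: str
    receipt: Optional[ConsentReceipt]   # None models data that arrived without a receipt


def admissible(record: Record, purpose: str, now: datetime) -> bool:
    """A record may enter the pipeline only under a live receipt covering `purpose`."""
    r = record.receipt
    return (
        r is not None
        and not r.revoked
        and now < r.expires_at
        and purpose in r.purposes
    )


def ingest(records: list, purpose: str, now: Optional[datetime] = None) -> list:
    """Return only the records whose receipts authorize the intended purpose."""
    now = now or datetime.now(timezone.utc)
    return [rec for rec in records if admissible(rec, purpose, now)]
```

The gate fails closed: a missing, expired, revoked or out-of-scope receipt all lead to the same outcome, which is that the record never reaches the training batch.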
The zero-knowledge proof layer in PDAOS addresses the tension between privacy and auditability. A regulator or researcher can verify that a dataset was assembled in compliance with all constituent consent receipts without seeing the underlying personal data. The proof system, drawing on zk-SNARK constructions, produces a validity proof for the dataset's consent compliance state that is mathematically verifiable without revealing individual records.
MyDataKey, the product implementation of PDAOS infrastructure, is accessible at mydatakey.org for practitioners interested in the technical specifications and reference implementations.
Consent Receipts, Verifiable Credentials and the Path Forward
The Kantara Initiative's Consent Receipt Specification, which predates but aligns with modern verifiable credential architectures, established that a consent transaction should produce a portable, machine-readable record held by the data subject. In 2026, that specification's goals can be fully realized through the W3C Verifiable Credentials infrastructure in ways that were not technically mature when the original Kantara specification was drafted.
A PDAOS-compliant consent receipt carries several properties that distinguish it from current consent mechanisms.
It is portable. The consent record lives in the data subject's credential wallet, not in the platform's database. The platform cannot unilaterally modify, delete or reinterpret the consent record.
It is selective. Using selective disclosure mechanisms derived from BBS+ signature schemes (documented in active drafts of the IRTF Crypto Forum Research Group), a data subject can prove specific attributes of their consent record to a verifier without revealing the full consent document. A researcher can verify that a dataset contributor consented to research use without learning the contributor's identity.
It is revocable. The PDAOS consent model includes a revocation registry anchored to a distributed ledger. When a data subject revokes consent, downstream systems that hold derived data under that consent receipt receive a cryptographic signal that their processing authorization has expired. This does not retroactively erase trained model weights; erasing them remains an open technical problem across the field. It does prevent continued processing under an expired consent scope.
It is economically legible. Because the consent receipt establishes a verifiable record of data contribution, it enables the calculation of attribution shares for data cooperative models. This is the technical prerequisite for any serious implementation of a data fiduciary structure or data trust that distributes economic value back to contributors.
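Of the four properties above, revocation is the easiest to picture in code. The sketch below uses an in-memory dictionary as a stand-in for the ledger-anchored registry; the `RevocationRegistry` class, the receipt identifiers and the `process_batch` helper are hypothetical names chosen for this example.

```python
from datetime import datetime, timezone
from typing import Optional


class RevocationRegistry:
    """In-memory stand-in for a ledger-anchored revocation registry (ASSUMPTION)."""

    def __init__(self) -> None:
        self._revoked_at: dict = {}

    def revoke(self, receipt_id: str, when: datetime) -> None:
        # Keep the earliest revocation time; a later call must not extend authorization.
        current = self._revoked_at.get(receipt_id)
        if current is None or when < current:
            self._revoked_at[receipt_id] = when

    def is_authorized(self, receipt_id: str, at: Optional[datetime] = None) -> bool:
        """Processing at time `at` is allowed only if it precedes any revocation."""
        at = at or datetime.now(timezone.utc)
        revoked_at = self._revoked_at.get(receipt_id)
        return revoked_at is None or at < revoked_at


def process_batch(items: list, registry: RevocationRegistry, at: datetime) -> list:
    """Keep payloads whose receipt is still authorized at time `at`; drop the rest.

    Each item is a (receipt_id, payload) pair. Weights that were already trained
    are out of scope here, matching the open unlearning problem noted in the text.
    """
    return [payload for receipt_id, payload in items if registry.is_authorized(receipt_id, at)]
```

The check is time-aware, so processing that happened before the revocation remains valid while anything attempted afterwards is refused.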
The UK Information Commissioner's Office has published guidance on data sharing agreements that acknowledges the need for machine-readable consent records. That guidance is available at ico.org.uk. The European Data Protection Board's guidelines on consent under the GDPR, available at edpb.europa.eu, restate the GDPR's requirement that consent be as easy to withdraw as to give. PDAOS consent receipt architecture is designed to satisfy both requirements by construction.
From Philosophy to Engineering: Closing Thoughts
The Invisible Data opens with a provocation that Dr. Fisher returns to throughout the text: the data that defines you most accurately is the data you have never seen. Your explicit disclosures are noisy and self-curated. Your behavioral signals are unfiltered. Your inferred attributes are computed by systems that know your patterns better than you can articulate them yourself.
That asymmetry was tolerable, if uncomfortable, in the advertising era. In the AI training era it is not tolerable. The personal data ownership question is no longer about privacy in the narrow sense of keeping information confidential. It is about who holds the economic rights to the parametric encodings derived from your life, who can audit those derivations, and what structural mechanisms exist to rebalance a system that was designed from the start to make the most valuable data invisible.
PDAOS does not solve the philosophical problem by assertion. It addresses the engineering prerequisites for a solution. Cryptographic provenance anchoring, DID-based identity, verifiable consent receipts and zero-knowledge compliance proofs are not abstract ideals. They are implementable standards built on W3C, IETF and NIST specifications that exist today.
The invisible data becomes visible when the infrastructure requires it to be. Building that infrastructure is the work of Own Your Data Inc, the research program documented in The Invisible Series, and the engineering community that recognizes personal data ownership as a solvable systems design problem rather than a permanent condition of digital life.
Practitioners building AI data pipelines, privacy engineers designing consent architectures and policy researchers modeling data governance frameworks can explore the PDAOS technical specifications and the broader research context at mydatakey.org and through The Invisible Data at theinvisible.life.
