The model card is supposed to be the nutrition label of machine learning. It is the document that tells you what a model was trained on, what it can do, where it fails and who might be harmed by those failures. In 2026, after years of public debate about AI transparency, the model card has become nearly universal as a marketing artifact. As a technical disclosure instrument, it remains largely ceremonial.
This piece audits the training data disclosure practices of the major foundation model developers, measures them against the specification that created the model card concept and asks what the gap reveals about incentive structures in AI development. The focus throughout is training data disclosure, because that is precisely where the most consequential gaps appear.
What Model Cards Were Designed to Do
The model card concept emerged from a 2019 paper by Margaret Mitchell, Timnit Gebru and colleagues at Google (Mitchell et al., "Model Cards for Model Reporting," doi:10.1145/3287560.3287596), presented at the ACM conference now known as FAccT. The paper proposed a structured reporting format for trained machine learning models, with explicit sections covering intended use, factors affecting performance, evaluation data, training data and ethical considerations.
The training data section was not optional in that framework. It called for documentation of the datasets used, any preprocessing applied and the composition of the training data with respect to demographic and contextual factors. The purpose was auditability. A researcher, regulator or affected community should be able to read a model card and understand what data shaped the model's behavior.
The original paper was narrow in scope by design. It addressed models deployed in specific, bounded contexts: credit scoring, medical imaging, content moderation. The authors did not anticipate that within five years the same format would be applied to foundation models trained on trillions of tokens drawn from the open web, proprietary corpora, licensed datasets and unknown sources combined.
The Mitchell et al. Specification: A Baseline for Audit
To audit current practice fairly, it is useful to specify exactly what the Mitchell et al. framework required of the training data section. The framework asked for four things.
- Dataset identity: the name and source of datasets used in training
- Preprocessing details: how raw data was cleaned, filtered or transformed before training
- Relationship to evaluation data: whether training and evaluation sets were constructed to avoid leakage
- Demographic and contextual factors: who and what is represented in the training data and at what proportions
None of these asks are unreasonable. For a bounded model trained on a curated dataset, they are straightforward to fulfill. The challenge emerges at foundation model scale, where the answer to "what datasets were used" could run to thousands of entries and where the answer to "what proportions" is often unknown even to the developers themselves.
That difficulty is real. It does not, by itself, explain why most major model cards avoid the training data section almost entirely.
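To keep the baseline concrete, here is a minimal sketch of the structure those four requirements imply, written as Python dataclasses. The field names and example values are illustrative assumptions; Mitchell et al. specify what the section must contain, not a serialization format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetEntry:
    name: str                       # dataset identity
    source: str                     # where the data came from (crawl, vendor, archive)
    license: Optional[str] = None   # license or acquisition terms, if known

@dataclass
class TrainingDataSection:
    datasets: list[DatasetEntry]                # dataset identity and source
    preprocessing: list[str]                    # cleaning, filtering, transformation steps
    eval_overlap_checked: bool                  # relationship to evaluation data (leakage)
    demographic_composition: dict[str, float]   # who/what is represented, at what proportions

# Illustrative example of a filled-in section for a small, bounded model.
card_section = TrainingDataSection(
    datasets=[DatasetEntry("example-news-corpus", "licensed publisher archive", "proprietary")],
    preprocessing=["boilerplate removal", "near-duplicate filtering", "toxicity filtering"],
    eval_overlap_checked=True,
    demographic_composition={"en": 0.82, "de": 0.07, "fr": 0.06, "other": 0.05},
)
```

For a bounded model, every field above can be filled in from the curation process itself; at foundation model scale, the `datasets` list and the composition estimates are exactly what current disclosures omit.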
What Major Foundation Model Developers Actually Disclose
Reviewing the publicly available model documentation from Anthropic, OpenAI, Google and Meta reveals a consistent pattern: extensive documentation of capabilities and safety evaluations, thin documentation of training data provenance.
Anthropic
Anthropic publishes what it calls a model card and a system card for Claude models. The system cards are detailed on safety evaluation methodology. On training data, they acknowledge the use of web data, licensed data and proprietary data generated through Constitutional AI processes. Specific dataset names, data volumes and filtering criteria are not disclosed publicly. The cards reference internal processes without making those processes auditable to external parties.
OpenAI
OpenAI's GPT-4 system card, published in March 2023 and updated since, contains substantial evaluation detail. The training data section states that GPT-4 was trained on data from the internet as well as licensed data, with a knowledge cutoff date. It does not identify specific web crawl sources, does not describe filtering pipelines in technical depth and does not address the proportion of data from different source categories. OpenAI's rationale for limited disclosure has referenced competitive sensitivity and the prevention of adversarial manipulation of the training pipeline.
Google DeepMind
Gemini model documentation follows a similar pattern. Google publishes technical reports that are detailed on architecture and benchmarks. Training data is described in aggregate terms: multimodal data, web text, code repositories. The Gemini technical report (arXiv:2312.11805) includes a data section but explicitly declines to name specific dataset sources, citing a mix of competitive and safety reasons.
Meta
Meta's Llama models come closest to the Mitchell et al. specification among the major developers, partly because Meta has positioned openness as a differentiator. The Llama 3 technical report identifies Common Crawl as a primary source, names several supplementary datasets and describes deduplication and quality filtering steps. Proportions remain approximate and the provenance of licensed versus public domain data is not fully separated. The paper accompanying Llama 2 (arXiv:2307.09288) provided more detail than most comparable publications and still fell short of full auditability.
Across all four developers, the pattern is consistent: evaluation transparency is high, training data transparency is low.
The Structural Gap Between Specification and Practice
The gap between what model cards specify and what developers publish is not accidental. It reflects several structural pressures that all push in the same direction.
The first pressure is legal exposure. Disclosing that a model was trained on a specific dataset invites copyright claims from rights holders in that dataset. The ongoing litigation around training data use in the United States and Europe has made legal teams cautious. Vague disclosures reduce legal surface area.
The second pressure is competitive sensitivity. Training data pipelines represent real investment. The quality, filtering methodology and curation logic behind a training corpus are genuinely proprietary. Developers argue, with some legitimacy, that detailed disclosure is a form of trade secret exposure.
The third pressure is the absence of a regulatory mandate. Without a binding legal requirement to disclose training data at a specific level of granularity, the incentive to disclose fully is weak. The EU AI Act, whose obligations have been entering into force in phases through 2026, includes transparency obligations for general-purpose AI models, but the implementing standards for training data documentation are still being developed under ETSI and CEN-CENELEC mandates. The requirement exists in principle. The technical standard for compliance does not yet exist in enforceable form.
The result is a disclosure environment where model cards fulfill a social expectation of transparency without fulfilling the technical function of transparency.
Training Data Provenance as a Data Rights Problem
The model card gap is not purely a technical documentation problem. It is a data rights problem.
When a foundation model is trained on data scraped from the web, that data includes text, images and code produced by individuals who did not consent to that use. The individuals whose writing shaped the model's behavior have no visibility into whether their data was included, in what proportion or how it influenced specific model outputs. This is the data sovereignty failure that the Personal Data Asset Origination System (PDAOS) framework developed at Own Your Data Inc is designed to address at the infrastructure level.
Consent receipts, as specified in the Kantara Initiative Consent Receipt Specification and carried forward in the W3C Data Privacy Vocabularies and Controls Community Group's work (w3id.org/dpv), provide a technical mechanism for recording what data was used under what terms. If training pipelines were required to generate consent receipts at ingestion time, the information needed to produce a genuinely compliant model card would exist by construction.
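A minimal sketch of what that could look like at the ingestion step. The field names loosely echo Kantara consent receipt concepts (data subject, purposes, timestamp) but are assumptions chosen for illustration, not the normative schema.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone

def make_consent_receipt(document: bytes, data_subject: str,
                         purposes: list[str], lawful_basis: str) -> dict:
    """Emit one consent receipt record for one ingested document (illustrative schema)."""
    return {
        "receipt_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "data_subject": data_subject,        # who the data is about / who granted permission
        "purposes": purposes,                # e.g. ["foundation-model-pretraining"]
        "lawful_basis": lawful_basis,        # consent, contract, legitimate interest, ...
        "content_hash": hashlib.sha256(document).hexdigest(),  # ties the receipt to the item
    }

def ingest(document: bytes, data_subject: str, receipt_log: list[dict]) -> bytes:
    # Record the receipt at ingestion time; a model card's training data section
    # could later be generated by aggregating the receipt log rather than written by hand.
    receipt_log.append(make_consent_receipt(
        document, data_subject, ["foundation-model-pretraining"], "consent"))
    return document

receipts: list[dict] = []
ingest(b"example document text", "author-7d41", receipts)
print(json.dumps(receipts[0], indent=2))
```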
The absence of that infrastructure is a deliberate architectural choice, not a technical limitation. Consent-aware data pipelines are technically feasible. They impose cost. In the current regulatory environment, that cost is avoidable.
Dr. Patrick Fisher's work in The Invisible Data (Volume 6 of The Invisible Series) frames this as the origination problem: data has no traceable origin story inside most AI systems, which means the rights and obligations attached to that data at the point of creation are permanently severed by the time the data reaches a training corpus. The model card, in this framing, cannot document what the pipeline never recorded.
Toward Machine-Readable Data Provenance Standards
The W3C PROV ontology (w3.org/TR/prov-overview) provides a vocabulary for expressing data provenance as linked data. PROV-O allows a dataset to carry a machine-readable record of its origin, transformations and derivations. Applied to training data pipelines, PROV-O could enable a model card to reference a verifiable provenance graph rather than a prose description.
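A minimal sketch, using the rdflib library in Python, of how a single filtering step's derivation could be expressed in PROV-O terms. The namespace, dataset names and activity name are placeholders invented for the example.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import PROV, RDF

EX = Namespace("https://example.org/provenance/")  # placeholder namespace
g = Graph()
g.bind("prov", PROV)
g.bind("ex", EX)

raw = EX["web-crawl-2025-06"]           # placeholder dataset identifiers
filtered = EX["filtered-corpus-v1"]
filtering = EX["quality-filtering-run-17"]

# The raw crawl and the filtered corpus are provenance entities.
g.add((raw, RDF.type, PROV.Entity))
g.add((filtered, RDF.type, PROV.Entity))

# Filtering is an activity that used the raw crawl and generated the filtered corpus.
g.add((filtering, RDF.type, PROV.Activity))
g.add((filtering, PROV.used, raw))
g.add((filtered, PROV.wasGeneratedBy, filtering))
g.add((filtered, PROV.wasDerivedFrom, raw))

print(g.serialize(format="turtle"))
```

A model card could then point at a graph like this, and an auditor's software could walk the derivation chain instead of parsing prose.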
The Croissant metadata format, developed by an MLCommons working group with participation from Google Research and adopted by major dataset repositories including Hugging Face and Kaggle, extends schema.org to describe ML datasets with structured metadata. Croissant records enable programmatic inspection of dataset composition. A training pipeline that consumed only Croissant-described datasets could produce a model card training data section by programmatic query rather than by manual documentation. The Croissant specification is available at github.com/mlcommons/croissant.
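As a rough illustration, the sketch below hand-writes a simplified JSON-LD fragment in the Croissant style and queries it for the fields a model card section would need. Real Croissant records are considerably richer and are normally produced and consumed with the mlcommons/croissant tooling; the dataset name, file names and sizes here are invented for the example.

```python
import json

# Simplified, hand-written fragment in the Croissant style (schema.org Dataset
# plus the Croissant extension); illustrative only.
croissant_doc = {
    "@context": {"@vocab": "https://schema.org/", "cr": "http://mlcommons.org/croissant/"},
    "@type": "Dataset",
    "name": "example-pretraining-subset",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [
        {"@type": "cr:FileObject", "name": "shard-000.jsonl", "contentSize": "2.1 GB"},
        {"@type": "cr:FileObject", "name": "shard-001.jsonl", "contentSize": "1.9 GB"},
    ],
}

def summarize(doc: dict) -> dict:
    """Programmatic inspection: pull the fields a model card training data section would cite."""
    return {
        "name": doc.get("name"),
        "license": doc.get("license"),
        "num_files": len(doc.get("distribution", [])),
    }

print(json.dumps(summarize(croissant_doc), indent=2))
```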
The Data Nutrition Project, which grew out of a collaboration between the MIT Media Lab and Harvard's Berkman Klein Center, has developed the Dataset Nutrition Label as a complementary instrument, providing structured disclosure at the dataset level that can be referenced from model-level documentation.
None of these tools are widely adopted in foundation model development pipelines as of 2026. Their adoption would require either regulatory mandate or industry coordination on a scale that has not materialized. The EU AI Act's Article 53 obligations for general-purpose AI model providers include technical documentation requirements that could drive adoption of structured provenance formats, but the specifics depend on standards that are still in draft.
What Meaningful Disclosure Requires
An audit of current practice leads to a clear conclusion: model cards as practiced by major foundation model developers do not meet the Mitchell et al. specification for training data disclosure. They meet a social norm of having a model card without meeting the technical standard the model card was designed to enforce.
Meaningful training data disclosure at foundation model scale requires several things that current practice lacks.
It requires provenance-aware ingestion pipelines that record source, license and consent status at the point of data collection. This is an engineering requirement, not a documentation requirement. You cannot document what was not recorded.
It requires machine-readable output, not prose. A model trained on petabytes of text cannot have its training data described usefully in three paragraphs. Structured metadata formats like PROV-O and Croissant make the disclosure auditable by software rather than requiring a human to interpret ambiguous language.
It requires separation of data categories in disclosure: publicly licensed data, proprietary data, user-generated data acquired under terms of service, synthetic data generated by prior models. Aggregate descriptions obscure the rights implications of each category.
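A minimal sketch of how the first three requirements could combine at the point of ingestion, assuming an illustrative category taxonomy and field names rather than any established standard:

```python
from dataclasses import dataclass, asdict
from enum import Enum
import json

class DataCategory(str, Enum):
    # Illustrative category split; a real taxonomy would be set by policy or regulation.
    PUBLIC_LICENSED = "public_licensed"
    PROPRIETARY = "proprietary"
    USER_GENERATED = "user_generated_tos"
    SYNTHETIC = "synthetic"

@dataclass
class IngestionRecord:
    source_url: str
    license: str
    consent_status: str          # e.g. "explicit", "terms-of-service", "none-recorded"
    category: DataCategory

def training_data_disclosure(records: list[IngestionRecord]) -> str:
    """Produce a machine-readable training data section directly from ingestion records."""
    by_category: dict[str, int] = {}
    for r in records:
        by_category[r.category.value] = by_category.get(r.category.value, 0) + 1
    return json.dumps({
        "records": [{**asdict(r), "category": r.category.value} for r in records],
        "category_counts": by_category,
    }, indent=2)

records = [
    IngestionRecord("https://example.org/article/1", "CC-BY-4.0",
                    "none-recorded", DataCategory.PUBLIC_LICENSED),
    IngestionRecord("vendor://licensed-archive/item/42", "proprietary",
                    "contract", DataCategory.PROPRIETARY),
]
print(training_data_disclosure(records))
```

The point of the sketch is that the disclosure is a byproduct of recording at collection time, not a document written after training.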
It requires a regulatory backstop. Voluntary disclosure has produced voluntary opacity. The EU AI Act framework is the most credible regulatory instrument currently in force, and its training data transparency provisions for general-purpose AI providers are worth watching closely as implementing standards develop.
The MyDataKey project at mydatakey.org approaches the origination problem from the individual data owner's side: if individuals can assert cryptographic provenance over their own data contributions, the consent receipt infrastructure exists at the source. That approach complements pipeline-level provenance recording but cannot substitute for it when most training data was collected before any such infrastructure existed.
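As a rough sketch of the individual-side idea, and emphatically not a description of MyDataKey's actual protocol, an origination assertion could be as simple as an Ed25519 signature over a content hash, which a downstream ingestion pipeline can verify against the author's public key:

```python
import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Hypothetical illustration: an individual signs the hash of a contribution so that
# any later pipeline can verify who asserted origination over it.
contribution = b"a blog post the author is willing to license for training"
content_hash = hashlib.sha256(contribution).digest()

author_key = Ed25519PrivateKey.generate()
signature = author_key.sign(content_hash)      # the origination assertion

# A downstream ingestion pipeline holding the author's public key can verify the claim;
# verify() raises InvalidSignature if the assertion does not check out.
author_key.public_key().verify(signature, content_hash)
print("origination assertion verified")
```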
The model card was designed as an instrument of accountability. In its current form, for foundation models, it functions as an instrument of plausible deniability. That gap is not a documentation problem. It is a structural feature of an industry that has not yet been required to make the infrastructure investments that genuine transparency demands.
