Model Cards and the Inconsistency of LLM Training Data Disclosure: An Audit of Foundation Model Documentation Practices

Quick Answer
Major foundation model providers including OpenAI, Anthropic, Google DeepMind and Meta do not meet the training data disclosure standard specified by Mitchell et al. in their FAccT 2019 model card framework. Current model cards provide qualitative source descriptions and safety evaluation results while omitting corpus proportions, rights clearance documentation, filtering thresholds and domain breakdowns. The EU AI Act's GPAI provisions now impose regulatory minimums that voluntary practice has consistently failed to reach.

Model cards were proposed as a transparency instrument. The original intent was clear: if you deploy a machine learning model in a consequential setting, you should document what data trained it, who it may harm, and under what conditions it fails. That was the proposal in 2019. As of 2026, the largest and most consequential models in deployment carry documentation that is simultaneously more detailed and less useful than anything the original framework anticipated. This article examines what major foundation model providers actually disclose, how that disclosure compares to the Mitchell et al. specification, and what the gap means for practitioners who rely on model cards to make deployment decisions.

What Model Cards Were Designed to Do

The model card concept emerged from applied machine learning ethics research. The framework proposed that models should ship with structured documentation analogous to nutrition labels on food or drug package inserts in medicine. The analogy is instructive. A nutrition label does not tell you everything about how a product was manufactured. It does tell you the specific facts you need to make an informed consumption decision.

For machine learning models, those facts include the intended use cases, the populations the model was evaluated on, known performance disparities across demographic subgroups, and the nature of the training data. None of these are optional extras. They are the minimum viable information for a deploying organization to discharge its duty of care.

The original framing applied primarily to narrower task-specific models: a toxicity classifier, a medical image segmentation model, a credit scoring algorithm. Foundation models broke the assumption that a model has a bounded intended use. When a model can be applied to arbitrary tasks, the model card framework faces a structural challenge. That challenge has not been resolved. It has been absorbed into marketing materials.

The Mitchell et al. Standard and What It Requires

The Mitchell et al. paper "Model Cards for Model Reporting" was published in 2019 at the ACM Conference on Fairness, Accountability, and Transparency, then abbreviated FAT* and since renamed FAccT. The paper is specific. It identifies nine documentation categories that a complete model card should address: model details, intended use, factors (relevant demographic and contextual variables), metrics, evaluation data, training data, quantitative analyses, ethical considerations, and caveats and recommendations.
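As a rough illustration, the nine categories can be treated as a completeness checklist. The Python sketch below makes some assumptions: the snake_case section keys, the `missing_sections` helper and the example card are all invented for this example. The paper specifies the categories in prose and does not define a machine-readable format.

```python
# The nine model card categories from Mitchell et al. The snake_case keys are
# hypothetical names chosen for this sketch, not identifiers from the paper.
MODEL_CARD_SECTIONS = (
    "model_details",
    "intended_use",
    "factors",
    "metrics",
    "evaluation_data",
    "training_data",
    "quantitative_analyses",
    "ethical_considerations",
    "caveats_and_recommendations",
)


def missing_sections(card: dict) -> list:
    """Return the categories that are absent or left empty in a model card."""
    return [
        name
        for name in MODEL_CARD_SECTIONS
        if not str(card.get(name) or "").strip()
    ]


# An invented card that documents evaluation but leaves training data blank.
example_card = {
    "model_details": "Decoder-only language model, version 1.0",
    "intended_use": "General-purpose text generation via API",
    "metrics": "Benchmark accuracy, refusal rate",
    "evaluation_data": "Public benchmarks and internal red-team prompts",
    "training_data": "",
    "ethical_considerations": "See safety evaluation summary",
}

print(missing_sections(example_card))
```

For the example card, the helper reports `factors`, `training_data`, `quantitative_analyses` and `caveats_and_recommendations` as missing. A check like this only detects that a section is empty. It says nothing about whether a populated section is specific enough to be useful, which is the harder question this audit is concerned with.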

The training data section is not a checkbox. The specification asks for a description of the datasets used, the processing applied, and any known gaps or limitations in the data itself. That means answering questions like: what domains are represented, what time ranges are covered, how was the data collected, were rights-holders notified or compensated, what filtering was applied to remove harmful content, and what was explicitly excluded.

The intended use section requires documentation of both intended uses and out-of-scope uses. It also requires identification of the primary intended users. For foundation models deployed through APIs, this creates a documentation obligation that most providers quietly sidestep: if your model will be used by enterprise clients to make decisions about real people, the people subject to those decisions are relevant stakeholders in the model card, not just the developers who call your API.

The Mitchell et al. framework has since been extended and operationalized. Datasheets for Datasets (Gebru et al.) addresses the data artifact side. The Hugging Face community has built templated model card infrastructure. The EU AI Act, which entered into force in August 2024 and applies on a phased schedule, with enforcement under way as of 2026, requires technical documentation for high-risk AI systems that overlaps substantially with model card requirements. The documentation norm is not controversial in principle. The gap is entirely in practice.

Auditing Major Foundation Model Disclosures

Looking at what Anthropic, OpenAI, Google DeepMind and Meta AI actually publish about their foundation models reveals a consistent pattern: qualitative descriptions of training philosophy, broad gestures toward data source categories, and minimal quantitative specificity about the composition of training corpora.

OpenAI and the GPT series: OpenAI has published system cards for its multimodal models that focus heavily on safety evaluations and red-teaming results. The training data sections describe the use of web data filtered for quality, licensed data from third-party providers, and data created through contractor annotation programs. The web crawl sources, the licensed data contracts, the specific filtering thresholds and the proportion of each source type in the final training mix are not disclosed. The GPT-4 technical report, published in 2023, explicitly states that it gives no further details about architecture, hardware, training compute, dataset construction or training method, citing the competitive landscape and the safety implications of large-scale models. This is a legitimate business and safety argument. It is not a transparency argument.

Anthropic and the Claude series: Anthropic's documentation tends to emphasize Constitutional AI methodology and safety benchmarking. Training data sourcing receives general treatment: large-scale web data plus curated datasets plus human feedback data. The specific corpora, their proportions and the rights clearance process for third-party data are not published. Anthropic has been notably forthcoming about evaluation methodology. The upstream data provenance is not covered at equivalent depth.

Google DeepMind and Gemini: The Gemini technical report documents multimodal training across text, image, audio and video. Data sourcing descriptions are more granular than some competitors, acknowledging the use of web documents, books, code repositories and specialized datasets. Specific dataset names and proportional breakdowns remain undisclosed. The filtering and quality processing pipeline is described at a high level without the methodological specificity that would allow independent replication or auditing.

Meta AI and the Llama series: Meta's Llama models have attracted particular attention because their weights are openly downloadable under Meta's own license terms, making independent analysis possible in ways it is not for closed models. The Llama technical reports document training data more specifically than their closed-model counterparts. CommonCrawl, Wikipedia, GitHub, ArXiv and other named sources appear in the documentation. Token counts and rough proportional weights are disclosed for some versions. This is meaningfully better than the industry norm. It is still not a complete provenance record. Rights clearance documentation for each corpus, the specific crawl dates, and the deduplication methodology are not published at a level that would satisfy a GDPR data protection impact assessment, for example.

The Training Data Provenance Gap

The gap between what Mitchell et al. specified and what foundation model providers publish is not primarily a gap in effort. These are well-resourced organizations with sophisticated technical documentation capabilities. The gap reflects a set of structural tensions that documentation requirements surface without resolving.

The first tension is between transparency and competitive advantage. Training data curation is genuinely a source of differentiation. A complete corpus description is a partial recipe for replication. This is a real commercial pressure. It does not justify silence on the question of whether training data was lawfully obtained. Those are separate issues that get merged in practice.

The second tension is between transparency and legal exposure. Several foundation model providers are defendants in litigation concerning the use of copyrighted material in training data. Detailed corpus disclosure in a model card creates discoverable admissions. Legal counsel does not typically advise voluntary disclosure while litigation is active. This produces a situation where users of models deployed in consequential settings cannot access the information they need to assess legal risk in their own deployment context.

The third tension is scale. A model trained on trillions of tokens sourced from hundreds of distinct pipelines does not have a simple training data section. The logistics of complete provenance documentation at that scale are genuinely hard. They are not harder than the logistics of building the training infrastructure itself. Organizations that can coordinate petabyte-scale distributed training runs can maintain provenance records if they choose to invest in that infrastructure. The NIST AI Risk Management Framework, in its 2023 publication (AI RMF 1.0), identifies data provenance as a core component of AI transparency and recommends maintaining documentation of data sources, processing steps and lineage throughout the model development lifecycle.
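To make the point about lineage concrete, here is a minimal Python sketch of a provenance record kept per processing step, with each record linked to the previous one by a hash so that later edits are detectable. The `LineageRecord` schema, the helper names, the source identifier and the document counts are all assumptions for illustration. Neither NIST AI RMF 1.0 nor any provider discussed here prescribes this format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LineageRecord:
    """One processing step applied to one data source (hypothetical schema)."""
    source: str          # e.g. a crawl snapshot or licensed corpus identifier
    step: str            # e.g. "deduplication" or "language identification"
    documents_in: int
    documents_out: int
    parent_digest: str   # digest of the previous record, "" for the first step

    def digest(self) -> str:
        # Hash a canonical JSON form so equal records always give equal digests.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def append_step(chain, source, step, documents_in, documents_out):
    """Return a new chain with one more record linked to the previous one."""
    parent = chain[-1].digest() if chain else ""
    record = LineageRecord(source, step, documents_in, documents_out, parent)
    return chain + [record]


def chain_is_intact(chain) -> bool:
    """Check that every record points at the digest of the record before it."""
    expected = ""
    for record in chain:
        if record.parent_digest != expected:
            return False
        expected = record.digest()
    return True


# Invented numbers, for illustration only.
chain = []
chain = append_step(chain, "web-crawl-snapshot-A", "language identification", 1_000_000, 640_000)
chain = append_step(chain, "web-crawl-snapshot-A", "deduplication", 640_000, 410_000)
print(chain_is_intact(chain))
```

The design choice is to record lineage as a side effect of running the pipeline rather than reconstructing it afterward. A real system would also need storage, access control and a way to publish summaries without exposing the raw records.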

Training data disclosure is not only a technical documentation problem. It is a consent architecture problem. The data governance frameworks that apply to personal data under GDPR, CCPA and emerging state-level AI laws require that data subjects have rights over how their personal information is used. Web-scraped training corpora contain personal data. They contain the written expressions, images and other contributions of individuals who have not consented to their use as AI training material.

Model cards as currently practiced do not address this. The Mitchell et al. framework did not specifically anticipate it either, given that the original context was narrower task-specific models rather than foundation models trained on the open web. The gap is now significant enough that it requires deliberate extension of the model card specification.

The W3C Data Privacy Vocabularies and Controls Community Group (DPVCG) has developed vocabulary specifications for expressing consent, purposes and processing activities in machine-readable form. The IETF has published RFC 8259 (JSON) and related standards that support structured data exchange. These technical building blocks exist for machine-readable consent documentation. They are not being applied systematically to model card training data sections.
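As an illustration of what machine-readable consent documentation for a single training source might look like, the Python sketch below builds a record, serializes it as JSON and runs a basic check. The field names, the `validate_consent_record` helper and the example source are hypothetical. In particular, the field names are not DPV terms. A production record would map them onto the published vocabulary.

```python
import json

# Hypothetical record describing the consent status of one training source.
# The field names are illustrative and are NOT taken from the DPV vocabulary.
consent_record = {
    "source_id": "example-forum-archive",
    "contains_personal_data": True,
    "legal_basis": "consent",
    "purposes": ["model-training"],
    "withdrawal_supported": True,
}

REQUIRED_FIELDS = ("source_id", "contains_personal_data", "legal_basis", "purposes")


def validate_consent_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record
    ]
    if record.get("contains_personal_data") and not record.get("legal_basis"):
        problems.append("personal data present but no legal basis stated")
    return problems


# Serialize to JSON (RFC 8259) so the record can be published next to a model card.
serialized = json.dumps(consent_record, indent=2, sort_keys=True)
round_tripped = json.loads(serialized)
print(validate_consent_record(round_tripped))
```

A record like this documents a claim, not a fact. Its value depends on whether the stated legal basis can be traced back to the individuals whose data is in the source.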

The concept of a data trust or data fiduciary, where a neutral party holds data on behalf of contributors and enforces terms of use, provides one architectural model for addressing this at scale. Volume 6 of The Invisible Series, "The Invisible Data," examines these consent architecture questions in depth, including how Personal Data Asset Origination Systems could provide the kind of traceable provenance that model card training data sections currently lack. The personal data sovereignty problem and the AI training data problem share the same underlying architecture.

Toward Accountable Documentation Practices

The path toward meaningful training data disclosure does not require foundation model providers to publish complete corpus dumps or expose proprietary filtering pipelines. It requires a more honest separation between what is disclosed and what is withheld, with explicit justification for the latter.

A minimally accountable model card training data section should answer the following questions. What categories of data sources were used, with rough proportional breakdowns? What were the cutoff dates for each source? What rights clearance process, if any, was applied to third-party data? What filtering criteria were used to exclude harmful, private or legally encumbered content? What was the language distribution? What was the domain distribution, at least at a coarse level?
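The questions above can be sketched as a structured section in which every item is either disclosed or explicitly withheld with a stated reason. The Python below is a minimal sketch under assumptions: the question keys, the `Answer` type, the `audit_section` helper and every value in the example are invented, and no provider publishes its documentation in this form.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical keys, one per question in the paragraph above.
QUESTIONS = (
    "source_categories",
    "cutoff_dates",
    "rights_clearance",
    "filtering_criteria",
    "language_distribution",
    "domain_distribution",
)


@dataclass
class Answer:
    """Either a disclosed value or an explicit, justified refusal."""
    value: Optional[object] = None
    withheld_reason: Optional[str] = None


def audit_section(section: dict, tolerance: float = 0.01) -> list:
    """List questions left unanswered and proportions that do not sum to 1."""
    findings = []
    for question in QUESTIONS:
        answer = section.get(question)
        if answer is None or (answer.value is None and not answer.withheld_reason):
            findings.append(f"{question}: neither disclosed nor justified")
    sources = section.get("source_categories")
    if sources is not None and isinstance(sources.value, dict):
        total = sum(sources.value.values())
        if abs(total - 1.0) > tolerance:
            findings.append(
                f"source_categories: proportions sum to {total:.2f}, expected 1.00"
            )
    return findings


# Invented values, for illustration only.
section = {
    "source_categories": Answer(value={"web": 0.70, "code": 0.15, "books": 0.10}),
    "cutoff_dates": Answer(value={"web": "2024-06", "code": "2024-03"}),
    "rights_clearance": Answer(withheld_reason="Subject to pending litigation"),
    "filtering_criteria": Answer(),
    "language_distribution": Answer(value={"en": 0.8, "other": 0.2}),
}

for finding in audit_section(section):
    print(finding)
```

In this example the rights clearance question passes the audit even though nothing is disclosed, because the refusal is explicit and carries a reason. That is the separation between disclosed and withheld that the section argues for. Silence on filtering criteria and domain distribution is flagged, as is a source breakdown that leaves 5% of the corpus unaccounted for.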

None of these require disclosure of proprietary techniques. All of them would materially help deployers assess fitness for purpose, legal risk and potential failure modes. The EU AI Act's transparency requirements for general-purpose AI models, as enforced by national competent authorities and the AI Office, are pushing in this direction. The Act's provisions on training data summaries for GPAI models give regulatory backing to a norm that has clearly not taken hold on a voluntary basis.

Practitioners evaluating foundation models for deployment in 2026 should treat the absence of a training data provenance section as a risk signal, not as neutral. If a model card cannot tell you what domains a model was trained on, it cannot tell you what failure modes to anticipate in out-of-distribution inputs. If a model card cannot tell you whether training data was rights-cleared, your legal team cannot assess deployment risk in content generation contexts.

The documentation infrastructure exists. The W3C Verifiable Credentials specification, the NIST AI RMF and the EU AI Act's technical documentation annexes all provide frameworks that serious organizations can build toward. The question is whether the incentive structure shifts enough to make honest provenance documentation the default rather than the exception. Regulatory enforcement and procurement requirements from enterprise customers are the most likely forcing functions. Voluntary commitment to the spirit of what Mitchell et al. proposed in 2019 has not been sufficient to close the gap that this audit consistently reveals.

Frequently Asked Questions

What did Mitchell et al. actually require in the training data section of a model card?
The Mitchell et al. FAccT 2019 specification asks for a description of the datasets used, the preprocessing steps applied, and any known limitations or gaps in the training data. This includes information about data domains, collection methods and known biases. It is a structured disclosure requirement, not a general statement that training data was used.
Why do foundation model providers withhold training data details from model cards?
Three structural tensions drive this. First, detailed corpus descriptions partially reveal proprietary data curation strategies that represent competitive advantage. Second, active copyright litigation makes voluntary corpus disclosure legally risky as potential discoverable admissions. Third, documenting provenance at trillion-token scale requires infrastructure investment that has not been prioritized. None of these tensions justify the absence of rights clearance documentation or domain distribution data.
Does the EU AI Act require training data disclosure for large language models?
Yes, in summary form. The EU AI Act's provisions on general-purpose AI models require providers to draw up and maintain technical documentation and to publish a sufficiently detailed summary of the content used for training, following a template provided by the AI Office. Providers must also put in place a policy to comply with EU copyright law. These requirements are enforced by the EU AI Office and national competent authorities, giving regulatory backing to what had previously been a voluntary standard.
Is the Llama model series more transparent about training data than closed models?
Relatively, yes. Meta's Llama technical reports name specific training corpora including CommonCrawl, Wikipedia, GitHub and ArXiv and provide rough token count distributions across sources. This exceeds industry norms for closed foundation models. It still does not constitute a complete provenance record that would satisfy GDPR data protection impact assessment requirements or the rights clearance documentation that downstream deployers need.
What should a software engineer or enterprise deployer do when a model card lacks training data provenance?
Treat the absence as a risk signal requiring explicit mitigation. Document the gap in your own deployment risk assessment. Query the provider directly for any supplementary data disclosure documentation. Assess whether the use case involves domains where out-of-distribution failures from unknown training data composition would create legal or safety exposure. In regulated industries, consult legal counsel on whether deploying a model with undocumented training provenance meets your organization's duty of care obligations.