Hot on the heels of the General Purpose AI (GPAI) guidelines and Code of Practice (see our commentary here, here and here), the European Commission’s AI Office has recently published its long-awaited template for publicly summarising the content used to train GPAI models, together with an explanatory notice and related FAQs page.
With the GPAI provisions of the EU AI Act having come into effect over the weekend, GPAI model providers must ensure they are familiar with the final version of the template and what it requires them to disclose.
Background
As the global AI race continues to intensify, developers of GPAI models have become increasingly reluctant to disclose the data used to train their models. At the same time, however, rights holders have long been advocating for greater transparency, to enable them to identify where their works have been used in training datasets.
The EU AI Act seeks to balance these concerns by requiring GPAI model providers who place their models on the EU market to publish a “sufficiently detailed” summary of the content used to train those models, in the form of a template provided by the AI Office. One of the key aims behind this obligation is to enable parties with legitimate interests, including copyright holders, to exercise their rights while giving due consideration to developers’ needs to protect their trade secrets and confidential business information.
Whilst the Recitals to the AI Act provide some limited guidance on the scope of this obligation and the objectives behind it, and further detail was provided in the AI Office’s preliminary proposals that were unveiled at the start of this year (see our blog), stakeholders have been patiently waiting to see the full template.
What do the template and explanatory notice say?
The template itself largely follows the structure of the AI Office’s preliminary proposals and is broken down into three key sections.
1. General information – this covers basic information about the model, including the identity of the model and its provider, the model’s modality (text, images etc.), a high-level indication of the amount of training data used, and more general characteristics of that training data.
2. List of data sources - this section requires GPAI model providers to:
- disclose the main datasets that were used to train the model, such as large private or public datasets;
- provide a narrative description of data that has been crawled and scraped from online sources (either by the provider or on its behalf). This includes details of the crawlers used, their purpose and behaviour, the period of collection and a summary of the top 10% of all domain names scraped, determined by the size of the content scraped (although there are some relaxations for SMEs); and
- provide a narrative description of all other data sources used, including user data and synthetic data.
3. Other data processing aspects – this includes measures implemented by the provider to comply with rights holders’ opt outs from the TDM exception and to remove illegal content.
The explanatory notice clarifies that the information to be provided should cover data used in all stages of model training, from pre-training to post-training. However, data used during operation of the model (such as through retrieval-augmented generation) is not covered, unless the model actively learns from that data.
Recognising the need to balance transparency with the protection of model providers’ trade secrets, different levels of detail are required based on the source of the data in question, with greater detail required about the use of publicly available datasets and more limited disclosure required for things like licensed data.
The notice also makes it clear that the template is intended to provide a common minimum baseline for summarising this information, but notes that providers can voluntarily go beyond it. Indeed, in some places, the notice actively encourages them to do so.
Once complete, the summary has to be published on the model provider’s official website, as well as on all of the model’s public distribution channels. And that has to be done, at the latest, when the model is placed on the EU market.
When will this come into effect?
The obligation on GPAI model providers to publish these training data summaries came into effect on Saturday, 2 August 2025, with the AI Office able to enforce it from 2 August 2026.
For models already placed on the market before 2 August 2025, however, there is a two-year grace period, giving providers of those models until 2 August 2027 to publish their summaries.
It’s also worth noting that, unlike compliance with the Code of Practice (which is voluntary), completing this template is mandatory. Non-compliance could lead to fines of up to 3% of the provider’s total annual worldwide turnover or €15m, whichever is higher. So, there are significant teeth behind this.
Comment
It’s fair to say that the Commission left it pretty late to publish this template, but at least it was released before the provisions came into force.
Whilst it’s clear that the AI Office has tried to strike a balance between transparency on the one hand and protecting GPAI model providers’ trade secrets and confidential business information on the other, initial reactions suggest that neither model providers nor rights holders are particularly enamoured with the outcome. Some model providers remain concerned that it may force them to disclose confidential information and trade secrets, whilst many rights holders argue that it doesn’t go far enough and won’t help them enforce their rights (see, for example, this statement from the European Writers’ Council).
Either way, with publication in this format mandatory, the rules having just come into force and the potential for substantial fines for non-compliance, providers need to be thinking about this now.