The European AI Office has recently unveiled its preliminary proposals on the template for the summary of training data that general-purpose AI (“GPAI”) model providers will be required to publish under the EU AI Act.
While at first glance the template may appear to be a mere administrative formality, the AI Office’s proposals are highly anticipated as they set the stage for another key battleground in the hotly contested debate on how best to balance the rights of GPAI developers against the legitimate interests of rightsholders.
Background
As the global AI race intensifies, developers of GPAI models have become increasingly reluctant to disclose the data used to train their models. At the same time, however, rightsholders have long been advocating for greater transparency, to enable them to identify where their works have been used in training datasets.
The AI Act seeks to address these concerns by requiring GPAI model providers to publish a “sufficiently detailed summary” of the content used to train GPAI models placed on the EU market (see our earlier blog). As stated in the AI Act, the aim of the summary is to enable parties with legitimate interests, including copyright holders, to exercise their rights while giving due consideration to developers’ needs to protect their trade secrets.
The AI Act doesn’t define “sufficiently detailed”, but Recital 107 provides some guidance as to what is expected: the summary doesn’t need to specify each copyrighted work used to train the GPAI model or be technically detailed. Instead, it should be “generally comprehensive”, for example by listing the main datasets used (e.g., large publicly available datasets), with a narrative explanation of other sources. However, the AI Act acknowledges the need for a “simple” and “effective” template to be published by the AI Office to iron out the specifics of what is required of GPAI model providers.
AI Office’s preliminary structure and elements of the template
The AI Office has now published details of the preliminary structure and elements of its template. Broadly, the AI Office proposes that the template will comprise three sections:
1. General information: the first section will include general information regarding the GPAI model. This includes information about the provider and the date of placement of the model on the EU market as well as information on the overall training data size, modalities (e.g., text, image, video or audio) and characteristics of each modality. As such, providers will not only be required to set out the volumes of data for each modality but also provide a breakdown of the types of data within each modality – for instance, for text data, whether that text includes fictional texts, scientific texts and/or news publications.
2. List of data sources: under the second section, GPAI model providers should list the data sources used to train the GPAI model, including publicly accessible datasets, private datasets of third parties (such as data licensed by rightsholders), data crawled and scraped from online sources and provider-sourced data (including via prompts). Different requirements apply to different datasets – for example, for the main publicly accessible datasets used to train the model, providers will need to detail unique identifiers, links and the period of collection. The AI Office also proposes a proportionate approach to disclosure: for data scraped from online sources, for instance, providers will need to list the top 10% of all domain names per data modality; however, that requirement is reduced for SMEs to the top 5% (or 1,000 domains, whichever is lower) regardless of modality.
3. Other relevant data processing aspects: the final section addresses any other relevant data processing aspects, such as measures implemented during data collection to respect rightsholder opt-outs from the EU’s broad text and data mining exception (see further here), as well as steps taken after data collection to remove data that is subject to such rights reservations.
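To make the proposed disclosure thresholds for scraped data in section 2 more concrete, the arithmetic can be sketched in a few lines of Python. This is an illustration only, not an official calculation method: the function name `domains_to_disclose`, the ranking of domains by amount of scraped content and the rounding up of fractional results are all our own assumptions, as the preliminary proposals do not specify how the "top" domains are to be measured.

```python
def domains_to_disclose(domain_sizes, is_sme=False):
    """Return the domains a provider might need to list, largest first.

    domain_sizes: dict mapping a domain name to the amount of content
    scraped from it (the unit and the ranking metric are assumptions).
    For non-SMEs, call this once per data modality; for SMEs, pass all
    scraped domains together, since the proposal applies regardless of
    modality.
    """
    # Rank domains from largest to smallest contribution (assumed metric).
    ranked = sorted(domain_sizes, key=domain_sizes.get, reverse=True)
    total = len(ranked)

    if is_sme:
        # Top 5% or 1,000 domains, whichever is lower.
        # Integer ceiling division; rounding up is an assumption.
        limit = min((total * 5 + 99) // 100, 1000)
    else:
        # Top 10% of all domain names.
        limit = (total * 10 + 99) // 100

    return ranked[:limit]
```

On these assumptions, a provider that scraped 50,000 domains for a given modality would list 5,000 of them, whereas an SME with the same 50,000 domains would list 1,000, because the cap is lower than 5% (2,500).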
Next steps
The drafting of the template is closely linked to the AI Office’s ongoing work on its GPAI Code of Practice (the “Code”) (see our blog post here). As such, the AI Office has presented its preliminary ideas for the template to participants of the Code working group for transparency and copyright, with feedback from that group due by 31 January.
We understand that the full draft template will be further discussed in dedicated working groups over the coming weeks, with the goal of adopting a finalised template in Q2 2025 in advance of the AI Act’s rules on GPAI models (including the summary requirement) coming into force on 2 August 2025.
As the drafting process continues, we can expect the structure and content of the template to be subject to further lobbying by both sides of the debate.
Comment
It’s great to finally see the EU AI Office’s initial proposals on this template, which are broadly in line with expectations. However, once the template has been finalised and adopted, the real test of its efficacy will be in the approach leading providers take to populating it and, even more critically, how strictly compliance will be enforced (including how the template’s disclosure requirements will be interpreted).
In any case, given that the requirements will come into force this summer and in light of the potential for substantial fines for non-compliance, providers should start thinking now about gathering the required information based on the preliminary structure released by the AI Office. While many leading providers may have much of this information readily available as part of their broader AI data governance efforts, populating the template will still require collaboration between legal and technical teams, so early preparation is key.
With some of the first provisions of the EU AI Act having come into force a couple of days ago (on which we’ll shortly publish a separate blog post) and the third draft of the Code due to be published in a couple of weeks, there is plenty in this space to keep an eye on.