Digital developments in focus

Are the Floodgates Opening? The Future of Large Datasets in AI

The use of large third-party datasets in the development of AI models has been a contentious issue for some time now.  In the latest development, Getty Images are bringing a copyright claim in the UK against AI developer Stability AI, in relation to Stability AI’s “Stable Diffusion” AI art tool.

What did Stability AI do?

According to Getty Images, “Stability AI unlawfully copied and processed millions of images protected by copyright and the associated metadata owned or represented by Getty Images absent a license to benefit Stability AI’s commercial interests and to the detriment of the content creators.”

Stability AI used an open-source dataset scraped from the web by an organisation called LAION (a non-profit, but heavily funded by Stability AI), whose “LAION-5B” dataset includes images from Getty Images.  Getty Images alleges that Stability AI’s use of the LAION-5B dataset when training the Stable Diffusion model infringed the copyright in its images.

Unlike some other AI providers (for example OpenAI, who have not disclosed the underlying dataset for their CLIP model), Stability AI's use of an open-source dataset has arguably put them in a more vulnerable position, as this transparency gives owners of copyright in the underlying works clear evidence of (potentially) infringing use.

One of many cases? 

Following on the heels of similar lawsuits brought in the US against Microsoft's Copilot and by a small group of artists against three AI art developers (including Stability AI), this seems to herald a difficult new phase for developers of AI models trained on large datasets.

A win for Getty Images could embolden other content creators and copyright holders whose images were included in the LAION dataset to take similar legal action against AI developers that have not obtained licences from them.

Copyright owners want to be compensated for the benefit that AI developers are getting from using their works, and the AI industry may find that it needs to change how it operates, to avoid being bogged down in endless lawsuits. Indeed, Getty Images specifically noted in their press release that they offer licences for training AI systems, but Stability AI chose not to seek such a licence.

What is the future of monetising datasets?

For AI developers, these lawsuits will feed into their legal risk assessments. They will affect, for example, the approach developers take to using a particular dataset, and any decision on whether to seek (and pay for) a licence from some or all of the owners of the underlying works.

However, seeking individual licences from each copyright holder whose works are used in these massive datasets will be expensive and administratively burdensome.  The LAION-5B dataset, for example, includes over five billion image-text pairs.  Perhaps in the future we will see royalty collection agencies for AI model developers and users, similar to those in the music industry, to try to reduce this administrative burden.

This is not without its own pitfalls – the difficulty of establishing what usage is being made of an underlying copyrighted work in any given use of an AI tool (particularly in the deployment rather than training phase) could make this kind of royalty-bearing usage very difficult to identify and calculate.

Overall, this case serves as a reminder that the use of large datasets in AI is a complex issue that requires careful consideration and protection. As the field of AI continues to evolve, we can expect to see more legal action and discussions around the proper use and monetisation of these datasets.

"Stability AI did not seek any such license from Getty Images and instead...chose to ignore viable licensing options and long‑standing legal protections in pursuit of their stand‑alone commercial interests."

