Brussels / 1 & 2 February 2025

Community Insights: Best Practices for Open Datasets for LLM training


The landscape of AI training data stands at a critical crossroads. As large language models increasingly shape our digital ecosystem, the methods of data collection and curation have become a complex battleground of legal, ethical, and technical challenges.

Concerns from creators have spurred lawsuits and prompted AI companies to limit transparency about training datasets, undermining accountability, innovation, and research. While using open access or public domain data could address these issues, at the time of writing, no large-scale competitive models trained on such data exist yet due to challenges like unreliable metadata, digitization costs, the “consent crisis,” and the need for legal and technical expertise.

In this talk, we will discuss pioneering community efforts to create open and responsible AI training datasets. These efforts challenge the opaque practices of major AI companies and chart a path toward open datasets as part of a larger public AI ecosystem, one that can address humanity's most pressing needs while distributing control among many stakeholders.

In June 2024, Mozilla and EleutherAI convened 30 open dataset builders from across the field—organizations such as Hugging Face, Pleias, Cohere4AI, LLM360, TogetherAI, and many more—to address these critical issues. Based on the insights from this gathering, we co-created a research paper titled "Towards Best Practices for Open Datasets for LLM Training". This paper outlines the challenges of navigating the production of open datasets and provides practical recommendations for sourcing, processing, governing, and releasing such datasets. These recommendations are rooted in on-the-ground experience and paired with examples of what is already being done. While the paper references OSI's Open Source AI definition, it goes further by outlining possible tiers of openness and offering avenues for more ethical data governance in AI datasets.

This session will explore the current landscape and its main players in depth, unpack the legal ambiguities surrounding AI training data, and highlight the critical importance of transparency and governance. We will share a practical roadmap for developing datasets that promote healthy openness, respect the broadly defined rights of people and communities, and advance the field of artificial intelligence from a digital public good perspective. We will also give an overview of the concrete policy and technology investments that would unlock the ecosystem.

Building a future where AI systems can be trained on openly licensed data that is responsibly curated and governed requires collaboration across legal, technical, and policy domains, as well as a culture of openness. With this session, we hope to engage you in a discussion of the insights from this work and invite you to add your voice to the growing community of responsible open-source AI developers and advocates.

Speakers

Kasia Odrozek