Proposal: Defining Mandatory Components for AI Systems Seeking Digital Public Good (DPG) Recognition #193

Open
ricardomiron opened this issue Nov 5, 2024 · 5 comments

Comments

@ricardomiron
Collaborator

💡 This proposal is now open for Community Discussion for 4 weeks and will close on Dec 4th, 2024. We encourage you to review the proposal below and share your reactions, questions and comments directly in this issue.

How We Arrived at This Proposal

In 2023, the Digital Public Goods Alliance (DPGA) Secretariat, in collaboration with UNICEF, convened a dedicated Community of Practice (COP) on AI systems to draft expert recommendations on updating the DPG Standard. The objective was to determine how best to assess AI systems for DPG recognition. After extensive deliberation and consultation with stakeholders, the AI COP delivered a set of recommendations in August 2024 for review by the DPG Standard Council.

The DPG Standard Council has carefully considered the AI COP’s recommendations and, as part of a series of key upcoming updates to the DPG Standard, has decided to begin by explicitly defining which components, including the source data, code, and model, are mandatory for AI systems aspiring to be recognized as Digital Public Goods. This decision aims to strengthen the transparency and accountability of AI systems, ensuring that they meet consistent requirements across DPG categories.

Proposal

DPGs are defined as more than just open source solutions; they include open software, open data, open content, and open AI systems, all of which are required to be accessible, adaptable, compliant with best practices, and aligned with the UN’s Sustainable Development Goals. To be recognized as a DPG, a solution must demonstrate a commitment to do no harm by design, and adhere to principles of openness and transparency. Through these criteria, the DPG Standard aims to support digital tools that are genuinely open, impactful, and safe for communities everywhere.

We propose that to qualify as a Digital Public Good, open source AI systems must provide the following components:

These components are based on the Model Openness Framework (MOF) and the Open Source AI Definition Checklist for evaluating machine learning systems. Other components listed in these frameworks, such as research papers, evaluation results, and sample model outputs, are optional unless specified in other indicators of the DPG Standard.

Fundamental Change on Open Training Data Requirement for AI Systems

This proposal, which includes the requirement to open training data, is a fundamental change to the Standard that directly affects other indicators such as Platform Independence. Although it may initially restrict the number of AI solutions that meet the DPG Standard, we believe it reinforces the DPG’s core values, including transparency, equity, and ethical development. This requirement ensures that AI systems contributing to the public good do so with a commitment to openness that supports accountability, safety, and meaningful societal benefits.

Community Comment Period

This proposal is now open on GitHub for a 4-week public comment period, as per the Standard governance process, starting today, 4th November 2024, until 4th December 2024, to gather community insights on this proposed update. Your feedback will play an essential role in refining and finalising this change. If the list of required components and the update to open training data are confirmed, the DPG Standard Council will proceed to review other specific DPG indicators for additional minor and major revisions that build on this foundational shift (documented here).

For further insights into this direction and the rationale for these updates, please refer to the recent blog post by our DPGA Secretariat CEO. We look forward to your engagement and collaboration in shaping this important step forward for AI DPGs.

@pdelboca

pdelboca commented Nov 6, 2024

This is great @ricardomiron !

And I'm glad to see that there is a requirement for the training data to be open. As the DPGA pointed out in their blog post, maintaining a high bar on training data could result in fewer AI systems meeting the DPG Standard criteria, but I think there is no other way to ensure or prove that a) the system has been built ethically and b) the model can be interpreted correctly.

Enforcing open data as a requirement is like requiring a snack's ingredients to be disclosed before certifying it as healthy: you cannot guarantee the second without the first. I know it is a stretched analogy, but it helps to show how vital the data is to understanding the AI system as a whole.

Open Data and Ethical AI Systems

There is plenty of research and many publications on how unethical the process of building the datasets used for training AI models is, from illegal data scraping and copyrighted material to cheap labor for labeling, extractivism, etc. Just as traceability of the ingredients of candies allows us to tackle ethical challenges around the oil palm industry, traceability of where the data comes from will allow us to understand the production chain of AI.

Open Data and explainability

We need access to the training data to be able to assess biases in the model. Access to the model itself or its parameters is not enough. Going back to the food industry analogy, knowing the molecules in your candy bar is not enough to understand whether the ingredients used are healthy.

@tverbeke

Hi @ricardomiron

This is a solid proposal and deserves support.

If one uses the term 'open source AI', the four freedoms that underlie the idea of FOSS need to be guaranteed. The freedoms to study and modify the AI system can only be fully realized if the training data are open, so the decision to require the training data derives directly from the very nature of FOSS when extended to AI.

I would remove the reference to the Open Source AI Definition Checklist, though, since

  1. it refers to a draft version (v 0.0.9),
  2. the OSAID itself erodes the meaning of open source by not requiring the training data (which could be confusing here), and
  3. for this reason it is highly contested within the open source community (see e.g. https://sfconservancy.org/blog/2024/oct/31/open-source-ai-definition-osaid-erodes-foss/, https://www.schneier.com/blog/archives/2024/11/ai-industry-is-trying-to-subvert-the-definition-of-open-source-ai.html, and many more)

@samj

samj commented Nov 27, 2024

Good morning @ricardomiron,

Thank you for the opportunity to comment on this important standard.

We would like to add our voices, particularly in strong support of the requirement for data, which we consider to be the 'source' of AI systems.

We note that the original author of the Open Source Definition (OSD), and Debian Free Software Guidelines (DFSG) on which it was based, also takes this position, arguing that the OSD can be used in its current form to assess the openness of AI systems. The data is required to assess and address security and ethical issues including fairness and bias, as well as to enable these models to function as the foundation for future generations. This is what our industry and those dependent on it have come to expect of Open Source over the past quarter century.

We further note that this is necessary but not sufficient to avoid the perpetuation of the status quo where "open AI is highly dependent on the resources of a few large corporate actors [and their paid agents, including the Open Source Initiative (OSI)], who effectively control the AI industry and the research ecology beyond" (Nature: Why ‘open’ AI systems are actually closed, and why this matters).

Given the OSI's willingness to release the OSAID 1.0 without achieving community consensus, and indeed in the face of sustained opposition from ourselves and others (several of whom have already been noted above), and given the presence of conflicting standards including the OSD itself, as well as work being done by the Free Software Foundation, Debian, and of course DPGA, which do all require the data, we recommend against referencing the OSI or OSAID in DPG standards.

For safety and stability, we propose that industry actors instead refer directly to the Open Source Definition, ideally v1.9 specifically or at least prefer terminology like "OSD-compliant terms" over "OSI-approved license", unless and until the community achieves clear consensus on a future version, and we have recently launched the Open Source Declaration to that effect. The OSD covers all software, while the OSAID conflicts with it for any software that "infers, from the input it receives, how to generate outputs" (i.e., almost all software). It is therefore likely that OSI's leadership will attempt to "harmonise" the two definitions when they return from post-launch vacation, no doubt in a fashion that also "differs in significant ways from the views of most software freedom advocates".

We further note that respected industry experts argue that it is not feasible to apply Open Source to AI, likening the OSI's OSAID to the failed Tacoma Narrows bridge, and that it is in any case too soon to do so, the OSD itself having been the culmination of decades of work predating the OSI's incorporation. For example, at the time of writing large models have "hit a wall", the performance of smaller models is rapidly improving, open datasets are being regularly released, and copyright questions are working their way through the courts, all of which trend towards the requirement for openness in data.

While the OSD has proven itself strong on openness over the past quarter century, it is not explicit in the completeness dimension, which has caused problems dating back to id Software's release of Quake without the data the year after it was launched. The Linux Foundation's Model Openness Framework (MOF) does achieve a higher standard in terms of completeness for AI only (which is yet to be tested), but it fails on openness, with the highest Class I accepting data under "any license or unlicensed", suggesting the need for a Class 0. The OSAID accepts any data or no data, failing in both dimensions. I have discussed this in more detail in Openness vs Completeness: Data Dependencies and Open Source AI.

A better long-term solution may be to bugfix the OSD itself by making completeness explicit for all data, which has been proposed for a potential future version, but our community has no stronger claim to consensus than the OSI (except for the absence of objections).

[Image: openness-vs-completeness-wip]

In any case, this demonstrates that the MOF is no more suitable for referencing than the OSAID. Furthermore, the role of a checklist or framework may replace that of an OSD-compliant license, allowing for a spectrum of openness from a minimum acceptable standard (ideally defined by that single existing and proven document) to radically open options. Referencing the specific proposals of select organisations in this context may deprive us of this critical flexibility, which is a point we would have hoped had emerged from the OSI's multi-year process (and which suggests they may have plans to offer a single checklist and possibly a centralised certification program instead of the self-service status quo).

We trust this input will help bolster the case for data being a critical component of any such definition, being the "source" or symbolic instructions for AI models. It may be too soon to commit to any checklist for completeness, in which case it may be better to opt for generic tests for reproducibility (which is an implicit requirement for Open Source despite claims to the contrary).

Sincerely,

Sam Johnston (LinkedIn)
Developer, Debian
Lead Developer, Personal Artificial Intelligence Operating System (pAI-OS)
Board Member, Kwaai Open Source AI Lab

@tarkowski

The proposed approach, together with the Open Training Data requirement, correctly applies, in my opinion, the DPG standard to AI systems.
I would like to offer several detailed comments and suggestions:

  • openness of datasets can be achieved not just through open licensing, but also by using data in the Public Domain - this should be made explicit
  • I suggest making it clear that data includes not just pre-training data (which is the focus of many debates about data sharing for AI training) but also other types of datasets, for example instruction datasets for fine-tuning. Technically, the current language is correct (the term datasets is a general one and covers all datasets), but it might help clarify this
  • There are cases where a distinction is made between the dataset and its data, with only the dataset being shared openly. It might make sense to include a clarification that when speaking about datasets, the DPG standard means both
  • I notice the use of the term "open source AI system" - which by now means a specific type of open AI system that is compliant with the OSAID definition. Therefore it might make sense to use the more general term "open AI system" or "openly shared AI system". In this specific case, I would refer more broadly to "AI systems".

@tverbeke

tverbeke commented Dec 3, 2024

@tarkowski The OSAID is not (and has never been) compliant with the basic tenets of open source, given that it respects neither the freedom to study nor the freedom to modify. This severely flawed definition was also released on October 28, 2024, so I disagree that a good month later ('by now') we should start accepting that an 'open source AI system' is a non-open-source system as described by the OSAID. I agree, of course, that - by its newspeak - this definition does not make things easier and causes a lot of harm.
