Does CADA open source cover data and AI training datasets?

Summary

No, the proposed Cloud and AI Development Act (CADA) does not mandate that data or AI training datasets be released under open-source licenses. Article 41 explicitly limits its "open source first" obligation to "open standards and components released under an open source licence" when building the "cloud and AI ecosystem or stack." This scope is strictly software-centric. While CADA encourages the sharing and reuse of training data and AI models, these provisions are located in Title II (Cloud and AI Leadership Initiatives) and are governed by existing frameworks like the GDPR and the Data Act, not by the open-source licensing regime of Chapter V.

Detail

Reading the proposed CADA correctly requires separating two things: how it treats software artifacts and how it treats data assets. Its open-source provisions apply to technical components, not to the underlying data used to train AI models.

The Scope of Article 41: Software and Standards Only

Article 41, titled "Promoting open source solutions and open source first," is the main open-source provision affecting public sector bodies. Note that the duty it creates falls on the Union and Member States, which must encourage uptake. The text states:

"The Union and Member States shall take the necessary measures to encourage Union entities and public sector bodies to use and facilitate the reuse of open standards and components released under an open source licence when building their cloud and AI ecosystem or stack."

The critical limitation lies in the definition of the objects being regulated. The term "open source licence" is defined in Article 2(25) by reference to the Interoperable Europe Act (Regulation (EU) 2024/903). In EU law and technical architecture, an open-source license grants rights to use, study, modify, and distribute software code. It does not confer rights over data.

Therefore, Article 41 does not compel public bodies to release their training datasets, nor does it mandate that AI models be trained exclusively on open-data repositories. Instead, it drives the adoption of open-source software (OSS) foundations—such as operating systems, containerization tools, middleware, and AI frameworks—to reduce vendor lock-in and enhance technological sovereignty. The "cloud and AI ecosystem or stack" refers to the technical architecture (infrastructure-as-code, management layers, inference engines), not the data flowing through it.

Data Reuse and AI Training: A Separate Regulatory Track

While Article 41 governs the software layer, CADA addresses data and AI training through distinct mechanisms, primarily under Title II (Research, Development and Deployment Activities) and the Cloud and AI Leadership Initiatives.

Recital 22 highlights that the Union should promote the "sharing and reuse of training data and AI models across the Union public sector." However, this is framed as an operational objective of the Leadership Initiatives rather than a direct procurement mandate akin to Article 41. The proposal recognizes that high-quality data is critical for AI development but treats it as a resource to be pooled and shared through specific initiatives (such as the Centres for AI and the EuroCloud Federation) rather than through a blanket "open source" licensing requirement.

Furthermore, CADA explicitly acknowledges that data governance falls under existing EU data protection and data access laws. Recital 63 clarifies that where cloud computing services process personal data, the General Data Protection Regulation (GDPR) applies. Similarly, Recital 41 notes consistency with the Data Act, which facilitates data switching and access but does not mandate open-source licensing for data.

Consequently, data reuse is handled by the interplay of CADA's demand-side measures (encouraging public sector adoption) and the existing data acquis. This ensures that data sharing respects privacy, intellectual property, and security constraints, which often preclude the "open" release of datasets that might contain sensitive or classified information.

The "AI Stack" Distinction and Boundary Clarification

The phrase "cloud and AI ecosystem or stack" in Article 41 is a technical term referring to the software and infrastructure layers. This includes:

Infrastructure-as-code (IaC) templates.
Container orchestration tools (e.g., Kubernetes).
Middleware and API gateways.
AI development frameworks and libraries.

It does not extend to the datasets used to train the models that run on that stack. A Chief Technology Officer (CTO) evaluating compliance should therefore treat software and data as two separate workstreams:

Software Compliance: Ensure public sector contracts prioritize open-source software and open standards as per Article 41.
Data Compliance: Ensure data usage complies with the GDPR, the Data Act, and specific data-sharing agreements established under Title II.
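
The two workstreams above can be pictured as a simple inventory classification. The following is a minimal Python sketch, not a compliance tool: the AssetKind categories, the COMPLIANCE_TRACK mapping, and the example asset names are hypothetical and only restate the software/data split described in this article.

```python
from dataclasses import dataclass
from enum import Enum


class AssetKind(Enum):
    # Hypothetical categories mirroring the article's two workstreams.
    SOFTWARE = "software"  # IaC templates, orchestration, middleware, AI frameworks
    DATA = "data"          # training datasets and other data assets


# Assumed mapping, taken from the article's own summary rather than from
# the legal text: software follows Article 41, data follows the data acquis.
COMPLIANCE_TRACK = {
    AssetKind.SOFTWARE: ["CADA Article 41 (open source first)"],
    AssetKind.DATA: ["GDPR", "Data Act", "Title II data-sharing agreements"],
}


@dataclass(frozen=True)
class Asset:
    name: str
    kind: AssetKind


def compliance_track(asset: Asset) -> list[str]:
    """Return the frameworks this article associates with the asset's kind."""
    return COMPLIANCE_TRACK[asset.kind]


# Illustrative inventory; the names are placeholders.
inventory = [
    Asset("iac-templates", AssetKind.SOFTWARE),
    Asset("api-gateway", AssetKind.SOFTWARE),
    Asset("model-training-dataset", AssetKind.DATA),
]

for asset in inventory:
    print(f"{asset.name}: {', '.join(compliance_track(asset))}")
```

Keeping the mapping in one table makes the point of this section visible at a glance: no data asset is ever routed to Article 41, and no software component is routed to the data frameworks by virtue of Article 41 alone.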

Why the Distinction Matters: Sovereignty vs. Openness

A critical nuance in CADA is the distinction between "open source" (software) and "open data" (publicly accessible datasets). While the proposal encourages open source to boost technological sovereignty, it does not impose a similar "open data" mandate on all public sector data.

Title IV (Autonomy) establishes a sovereignty framework that emphasizes protecting data confidentiality, particularly for activities contributing to "public order" (e.g., law enforcement, defense). Recital 62 notes that risk assessments must determine the appropriate assurance level for data, which may require strict containment. Mandating open access to all training data would conflict with these security and privacy objectives. Therefore, CADA supports the reuse of software components to build sovereign AI capabilities, while treating data as a strategic asset that must be managed with appropriate safeguards, potentially remaining closed or restricted to specific authorized users.

What this means for you

For CTOs, architects, and SMEs evaluating the practical impact of the proposed CADA, the distinction between software and data is vital for compliance strategy and product development.

For Public Sector Buyers and Their Suppliers

If you supply cloud or AI services to Union entities or public sector bodies, expect buyers to favor offerings aligned with Article 41's "open source first" principle. Because Article 41 is framed as encouragement, this is a strong preference rather than a hard requirement on suppliers: your solutions should rely on open standards and open-source software components where feasible. Proprietary software may still be used if justified by functionality, security, or total cost, but the default expectation is openness. You are not required to open-source your proprietary AI models or the datasets used to train them, unless specific data-sharing agreements under the Leadership Initiatives apply.

For AI Developers and Model Providers

You can continue to use proprietary or licensed datasets for training AI models. CADA does not force you to release your training data into the public domain. However, to compete in the public sector market, your underlying software stack (e.g., the inference engine, the management platform) should ideally be built on open-source foundations. This reduces barriers to entry and aligns with the EU's goal of reducing dependence on single-vendor proprietary ecosystems.

For Data Strategists

Focus on compliance with the GDPR and the Data Act for data handling. Article 41 does not add a new layer of data licensing requirements. Instead, it encourages the technical infrastructure that enables secure data processing. Ensure that your data pipelines and storage solutions are compatible with open-source tools, as this will facilitate integration with public sector systems that are increasingly adopting open standards.

Strategic Implication

The proposal signals a shift towards sovereign, interoperable technical infrastructure. Your data can remain a competitive asset, but public sector buyers will increasingly expect software architecture that is transparent and interoperable. Investing in open-source compatibility for your cloud and AI platforms positions you favorably in the evolving EU public procurement landscape.

Common misconceptions

Misconception 1: CADA mandates open-sourcing all AI models. Reality: Article 41 applies to "components released under an open source licence," which refers to software code. It does not require AI models (which are often considered intellectual property or trade secrets) to be open-sourced. While the proposal encourages the reuse of AI models across the public sector (Recital 22), this is facilitated through sharing mechanisms, not a blanket open-source licensing mandate.

Misconception 2: "Open source" in CADA includes open data. Reality: No. Open source refers to software licensing. Data is governed by separate legal frameworks. CADA encourages data sharing and reuse for AI training (Title II), but this is distinct from the open-source software requirements in Article 41. Data reuse is subject to privacy, security, and intellectual property laws.

Misconception 3: Public sector bodies can no longer use proprietary software. Reality: Article 41 encourages the use of open standards and open-source components but allows for exceptions based on functionality, security, total cost, and other justified criteria. Proprietary software is not banned, but it must be justified against the preference for open alternatives.

Misconception 4: CADA replaces the GDPR for data handling. Reality: CADA complements existing laws. Recital 63 confirms that the GDPR applies where cloud computing services process personal data. Data protection obligations remain unchanged. CADA focuses on the sovereignty and procurement of cloud services, not the fundamental rights aspects of data processing.

This is general information about a draft EU regulation, not legal advice.