
Open Problems in AI Data Economics
Abstract
In our new paper, we introduce data economics as a coherent field and define open problems that have not yet been formalized. Most AI economics research focuses on downstream effects like productivity and labor displacement, not production. We argue that understanding AI's economic impact requires studying how data, compute, and labor interact to create AI systems.
To date, the field lacks shared nomenclature and conceptual foundations. This work establishes those foundations.
Introduction
Foundation models have consumed much of the public internet, yet they draw on only a small fraction of the world’s available data. The majority remains locked in proprietary databases, behind login walls, and in specialized corpora (inaccessible due to intellectual property restrictions, privacy regulations, access controls, among other reasons). As AI labs exhaust public sources 26 and firms build specialized models for their operations, understanding data economics has become essential. Deals for proprietary training data now reach hundreds of millions of dollars, yet markets remain fragmented and ad-hoc, lacking standardized pricing or clear valuation frameworks.
26. Open commons: Public or collaborative datasets like LAION or Common Crawl that provide free baselines and help keep private prices in check.
Historical precedents show that heterogeneous resources can become measurable and tradable when economic pressure demands it. Corporate equity, grain, and oil all began as differentiated goods before developing standardization: listing requirements for equities, USDA grading for grain, reference prices for oil. Data markets, as an emerging asset class, need analogous infrastructure.
Our contributions
Our contributions include the following.
- We establish why data resists standard economic treatment through its distinctive properties: nonrivalry (multiple users can access the same data simultaneously without depletion), partial excludability (access can be restricted, but copies are easily made), context-dependence (value varies by buyer holdings and application), and emergent rivalry through contamination (benchmark leakage, adversarial poisoning, and dataset aging reduce future utility). We also document the verification paradox: buyers cannot assess data quality without examining it, but examination enables copying, creating severe adverse selection.
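The verification paradox is a data-market instance of Akerlof's "market for lemons." A toy simulation can make the unraveling concrete; the setup below (uniform quality, buyers offering the expected quality of remaining sellers) is our illustrative assumption, not a model from the paper:

```python
import random

random.seed(0)

def simulate_market(n_sellers=1000, rounds=20):
    """Toy adverse-selection dynamic for an unverifiable data market.

    Sellers hold datasets of quality q ~ U(0, 1). Buyers cannot verify
    quality before purchase (the verification paradox), so they offer a
    price equal to the expected quality of datasets still on the market.
    Sellers whose quality exceeds the offered price withdraw, and the
    market unravels toward the low end.
    """
    qualities = [random.random() for _ in range(n_sellers)]
    price = sum(qualities) / len(qualities)
    for _ in range(rounds):
        price = sum(qualities) / len(qualities)  # buyers pay E[quality]
        # High-quality sellers exit: their data is worth more than the price.
        qualities = [q for q in qualities if q <= price]
    return price, len(qualities)

price, remaining = simulate_market()
print(f"final price: {price:.4f}, sellers remaining: {remaining}")
```

Because the price is always the mean of the remaining pool, roughly the upper half of sellers exits each round, and the price collapses toward the quality of the worst datasets, which is the adverse-selection outcome the bullet above describes.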
Why this matters
As labs exhaust public data and firms turn to proprietary datasets, data economics will determine who captures value, whether markets concentrate, and who owns the means of intelligence itself. This paper does not resolve how data should be valued or allocated. Instead, it establishes foundations by documenting current practices, developing preliminary frameworks for production modeling, and defining open problems for the field of data economics.
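One preliminary way to express "frameworks for production modeling" is a production function over the three inputs named in the abstract. The functional form and symbols below are our own illustrative sketch, not notation from the paper:

```latex
% Illustrative Cobb-Douglas-style sketch (assumed form, not the paper's):
% model capability Y as a function of data D, compute C, and labor L,
% with algorithmic efficiency A and output elasticities alpha, beta, gamma.
\[
  Y = A \, D^{\alpha} C^{\beta} L^{\gamma},
  \qquad \alpha, \beta, \gamma > 0
\]
% Data's distinctive properties (nonrivalry, context-dependence,
% contamination) mean D cannot be treated as a homogeneous input
% the way C often is -- one of the open problems the paper defines.
```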
Read the full paper: PDF