The new markets for AI data

The writer is the global co-head of investment banking at Goldman Sachs.

Data is the foundation of the artificial intelligence revolution, but AI is also revolutionising the market for data. Developers are racing to invest billions of dollars to build the infrastructure to power vast AI systems. That rapid expansion has led to a surge in demand for data, creating the potential for companies to generate significant economic value.

AI systems are typically described as having three main components — power, compute and data. These refer to the electricity required to power data centres, the chips needed to conduct computations at mind-boggling speeds, and the data necessary to train AI models. Of these critical components, it is data that is least discussed, perhaps because data centres and semiconductors are physical things you can see and touch. (It’s admittedly difficult to hold up a data packet during an onstage keynote.)

But sourcing data is an essential aspect of the rapidly expanding AI ecosystem. According to some estimates, the world is running out of “organic” data, with model developers reaching the limits of publicly available data — essentially copies of the entire internet — to pre-train ever-bigger models.

After AI models are constructed and pre-trained on huge data sets, they still require additional “test time compute” where a model is asked to answer specific questions or solve problems. This requires the right kind of data, which is sometimes lacking.

There is too little training data that captures humans “showing their work” as they step through complex problems. This is where companies with focused, well-organised, or highly logical data sets can become newly relevant. Imagine how a textbook company might use its archives of technical manuals and coursework to train an AI system to carry out complex scientific processes.

Recent data licensing deals show how different companies are selling access to their data to AI companies. Expect this trend to accelerate as companies get even more creative in doing so. So far, these deals have been negotiated individually with special terms, but you can imagine a marketplace — or multiple markets — for training data emerging.

Synthetic data, or data created at least in part by AI systems, is a critical part of the development of large language models and has emerged as one path for expanding the set of options for developers looking for new data sets.

For example, as robotic technology becomes more sophisticated, AI systems can increasingly create maps of our physical environment. Synthetic data for self-driving might involve setting up a “digital twin” of Los Angeles and having millions of “mock” vehicles navigate the city in a virtual space as training data.

And types of data that were previously difficult to analyse or use may become newly accessible and valuable with the incredible computational power of AI systems. Think about the data we’ve collected about complex systems such as weather, quantum mechanics or viral mutations. Because robots can perceive entire categories of data that are imperceptible to humans, collections of video and spatial data may also suddenly have newfound value.

Tesla uses the data collected by its fleet of autonomous driving vehicles to train the AI models that power its underlying self-driving technology. And Nvidia recently announced an expansion of its robot simulation environment, where it trains its robots in a virtual, digital representation of the physical world.

One of the most valuable repositories of data is human-generated data that remains locked away — proprietary research behind corporate and government firewalls. Today, the holders of this data are reluctant to make it accessible without knowing the implications. But the right structures and incentives can invite more deals.

In practical terms, different companies will devise different strategies. Some will treat data as a core business asset, not a byproduct, and work to monetise it through licensing or subscriptions. Others will need to upgrade their data infrastructure to make the best use of future AI capabilities.

How different jurisdictions decide to regulate AI and further regulate data usage will have profound implications for how those markets evolve, and where. Data privacy and security, along with questions about data provenance, ownership and authentication, are all potential areas for new legislation.

This period of incredible innovation and upheaval offers opportunities for the companies that get their data strategy right.
