AI Training Data, In-Depth. Part 1: Dataset Types, Market Overview, and Leading Dataset Providers

The dataset market is both expansive and intricate, encompassing a variety of data types and applications that drive the growth and evolution of AI across multiple industries. This article aims to answer key questions concerning the state of the AI dataset market: What is the current market potential for GenAI datasets? Where can one buy datasets that meet legal standards? Who are the leading providers of high-quality datasets for AI training?

Dataset Types and Their Applications

Understanding the types of datasets and where they are used offers insight into the market dynamics and the forces that drive data collection and use.

Datasets can be broadly categorized based on their content and application:

Image and Video Datasets: These are crucial for computer vision applications, including facial recognition, autonomous driving, and augmented reality. They consist of annotated images and videos that help train models to recognize and interpret visual information.
Text Datasets: Used in natural language processing (NLP) applications, these datasets include collections of documents, social media posts, books, and other text forms. They train AI to generate text, translate languages, or provide customer service through chatbots.
Audio Datasets: Essential for voice recognition systems, these datasets contain recordings of speech in various languages and accents, useful for developing voice-activated assistants, transcription services, and more.
Numerical and Time-Series Datasets: These datasets are important in financial services for algorithmic trading, risk assessment, and economic forecasting. They consist of sequential data points indexed in time order.
Sensor Datasets: These datasets collect readings from various sensors and are useful for monitoring and predictive maintenance of industrial equipment, for smart home devices, and for environmental monitoring.

Datasets find their applications across different industries:

Healthcare: Companies such as IBM Watson Health use datasets comprising medical images, patient records, and clinical trial data to train AI for diagnostic assistance, personalized medicine, and treatment outcome prediction.
Retail: Retail giants like Amazon utilize consumer behavior data, product images, and transaction records to optimize inventory, personalize marketing strategies, and enhance customer experiences.
Automotive: Tesla leverages data from sensors, cameras, and GPS in developing autonomous vehicle technologies, enhancing navigation systems, and improving safety features.
Finance: Financial institutions such as JPMorgan Chase use transaction data, market trends, and customer profiles to aid in fraud detection, credit scoring, and personalized banking services.
Entertainment: Streaming services like Netflix and Spotify analyze preferences and behavioral data to drive content recommendation engines, improving user engagement and content curation.

Market Overview: Is There a Market for GenAI at All?

The AI dataset market is indeed experiencing rapid growth, driven largely by the increasing application of AI across a variety of industries and the subsequent demand for diverse and specialized datasets.

In 2024, the global AI market was valued at approximately USD 214.6 billion and is projected to grow at a compound annual growth rate of 35.7% through 2030, potentially reaching around USD 1,339.1 billion.

According to a report by Straits Research, the AI Training Dataset Market is also projected to grow and reach USD 7.23 billion by 2030, increasing at a compound annual growth rate of 20.8% from 2022 to 2030. This growth reflects the critical need for high-quality data to train AI models.
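The arithmetic behind these projections can be checked with a short sketch. It assumes a constant compound annual growth rate applied once per year, which is the usual reading of such figures but is not spelled out in the cited reports. The function names are illustrative, and the 2022 base value for the dataset market is derived here from the stated rate and endpoint rather than quoted from the report.

```python
def project_value(start_value: float, cagr: float, years: int) -> float:
    """Compound start_value forward at a constant annual growth rate."""
    return start_value * (1 + cagr) ** years


def implied_start_value(end_value: float, cagr: float, years: int) -> float:
    """Discount end_value back to the base year at the same constant rate."""
    return end_value / (1 + cagr) ** years


# Global AI market: USD 214.6 billion in 2024, 35.7% CAGR through 2030.
ai_market_2030 = project_value(214.6, 0.357, 2030 - 2024)

# AI training dataset market: USD 7.23 billion in 2030, 20.8% CAGR from 2022.
# Assumption: the rate compounds over the full eight years from a 2022 base.
dataset_market_2022 = implied_start_value(7.23, 0.208, 2030 - 2022)

print(f"Projected AI market in 2030: USD {ai_market_2030:,.1f} billion")
print(f"Implied dataset market in 2022: USD {dataset_market_2022:.2f} billion")
```

Under these assumptions the first figure lands within rounding distance of the USD 1,339.1 billion quoted above, and the second suggests a 2022 dataset market of roughly USD 1.6 billion. In other words, the dataset market is projected to stay a small fraction of the overall AI market.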

For instance, Shutterstock, a marketplace of images and videos long used by content creators, has found a new growth engine: licensing the visual media in its library to train AI models. Shutterstock’s AI-licensing business generated $104 million last year, showcasing the significant demand for high-quality datasets. This indicates potential for other content providers and authors to enter and benefit from this growing market.

However, Shutterstock is a giant in the industry, and AI companies find it more convenient to buy a large volume of content at once from a major provider than to deal with multiple small studios or individual authors. This makes it harder to defend the interests of professional communities, since smaller content creators may struggle to compete in this space.

In one of our previous articles, we explored key trends in the Generative AI market and identified the market for datasets used to train GenAI models as an emerging sector of the industry. This is underscored by numerous recent deals between AI companies and content providers, which reflect anticipation of future court decisions on copyright claims. Many stakeholders believe that these legal decisions will favor content creators, bolstering the market for GenAI datasets. However, the optimism is not universal.

Dmitry Shironosov, CEO of Everypixel, comments on this situation:

It is great that under public pressure, the model training market is moving towards licensing datasets.

Unfortunately, this is still not enough to consider the interests of the creators who produce this content. In fact, photo banks are trying to sell the content they acquired in the pre-AI era and under a different economic model, with authors receiving a portion of the sales revenue under copyright law. Some may see this as a signal for the emergence of a new market for data sets.

However, let’s take a look at what would happen if creators were only paid the current one-time payments that stock photo agencies make for selling their content to tech companies. My guess is that the market would collapse in less than a year—no production company in the world would be able to pay its bills. Therefore, the current situation is still far from sustainable.

As long as the authors themselves are not involved in the process of creating datasets, a sustainable market development model won’t emerge. It is understandable that tech companies may find this fact irritating and troublesome, and they may just want to brush off the problem and keep growing. The question is whether they will succeed.

Dmitry Shironosov, CEO of Everypixel

Mark Milstein, Co-Founder and Director of Business Development at vAIsual, is also cautious about the emergence of the GenAI dataset market. According to Mark, the potential market for GenAI datasets, especially for text-to-image synthesis, is relatively limited compared with that for the more established LLMs. He states that while LLMs are advancing rapidly, the development of text-to-image synthesis platforms is “a poor sister” that lacks comparable interest:

I’m hopeful for a market where companies license their data, but I need to see it to believe it. The dataset market predictions heavily favor text data. The market for visual data, though, is quite small and may even be shrinking, thanks to freely available resources like Google Images. The stock image industry was slow to react to the impact of GenAI, and now it might be too late to counter the consequences.

Mark Milstein, Co-Founder, Director of Business Development at vAIsual

In our talk with Mark, he stressed that despite the mindset challenges and the mixed approach to dealing with this new technology, the stock imagery industry is technically equipped to take full advantage of GenAI. He mentioned:

The licensed stock media industry was always prepared for this moment. The content was ready, easily sourced, and available through convenient licensing schemes. What has been, and still is, missing is a different approach to metadata. Stock image keywords are not the metadata required for datasets.

Mark Milstein, Co-Founder, Director of Business Development at vAIsual

Leading Dataset Providers in the AI Industry

Below is a list of companies that have become key suppliers of high-quality, legally compliant datasets tailored for AI training. While there are not many of them, they are important in shaping the AI industry with their innovative and responsible data solutions:

vAIsual

vAIsual has established itself as a leader in the dataset marketplace by offering legally clean datasets specifically designed for AI training. From the beginning, vAIsual has focused on delivering ready-made datasets that not only save time but also mitigate legal risks for AI innovators. Their services are particularly valuable for global companies seeking high-quality data to train their AI models.

Appen

Appen provides high-quality training data to improve machine learning at scale. Known for its advanced data collection and annotation services, Appen caters to a variety of industries, including technology, automotive, and government sectors. Their expertise in creating precise training datasets makes them a preferred choice for companies aiming to enhance their AI applications.

Scale AI

Scale AI excels in providing access to a diverse set of data types, including sensor fusion, natural language processing, and visual data annotation. Their platform enables AI teams to work more efficiently by providing the tools necessary to accelerate the production of machine learning datasets. Scale AI is known for its rigorous quality control and scalable annotation services, which are crucial for training dependable AI systems.

Examples of Ethically Sourced Datasets in AI Development

As the demand for transparency and ethical sourcing grows, several initiatives have demonstrated how responsibly sourced data can drive AI innovation without compromising on quality. Here are some notable examples of ethically sourced datasets and the models trained on them:

CommonCanvas-XL-C by Common Canvas on Hugging Face

A prime example of ethically sourced data is the CommonCanvas-XL-C model hosted on Hugging Face. This model is trained exclusively on images licensed under Creative Commons Attribution (CC-BY) and Creative Commons Attribution-ShareAlike (CC-BY-SA) licenses. Both licenses allow the images to be used freely, provided that the original creators are credited. CC-BY-SA additionally requires that any derivative works be shared under the same terms.
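As a rough illustration of what license-based selection can look like in practice, the sketch below filters a small image manifest down to CC-BY and CC-BY-SA entries and builds a credit line for each. Everything in it is hypothetical: the record fields, the license tag spellings, and the example URLs are assumptions made for illustration, not the pipeline used for the model described above.

```python
from dataclasses import dataclass

# Hypothetical license tags; a real manifest may spell these differently.
ALLOWED_LICENSES = frozenset({"CC-BY", "CC-BY-SA"})


@dataclass(frozen=True)
class ImageRecord:
    url: str
    creator: str
    license: str


def filter_by_license(records, allowed=ALLOWED_LICENSES):
    """Keep only records whose license tag is in the allowed set."""
    return [r for r in records if r.license.upper() in allowed]


def attribution_line(record):
    """Build a credit line, since both licenses require attribution."""
    return f"{record.creator} ({record.license}), {record.url}"


# Made-up manifest entries for illustration only.
manifest = [
    ImageRecord("https://example.com/a.jpg", "Alice", "CC-BY"),
    ImageRecord("https://example.com/b.jpg", "Bob", "All rights reserved"),
    ImageRecord("https://example.com/c.jpg", "Carol", "cc-by-sa"),
]

kept = filter_by_license(manifest)
credits = [attribution_line(r) for r in kept]
for line in credits:
    print(line)
```

Keeping the creator and license alongside each retained image makes it easier to honor the attribution requirement later, which a plain list of URLs would not.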

BRIA AI’s Model

Contrasting with the open-source approach, BRIA AI offers a model that, while also ethically sourced, is accessible through a subscription or purchase. BRIA AI emphasizes the quality and exclusivity of its data, which is meticulously curated to ensure both legal compliance and high performance.

The shift towards ethically sourced datasets reflects a growing awareness of the need for transparency, fairness, and accountability in AI development. As the market continues to expand, the emphasis on ethical data practices will play a crucial role in shaping the future of AI, ensuring that it is not only innovative but also equitable and responsible.
