AI Training Data, In-Depth. Part 2: From Diverse Inputs to Ethical Sourcing and Oversight. Industry Standards of Image Datasets

Building on the insights from the first part, which covered the market dynamics of AI datasets and leading dataset providers, this second part delves into the critical aspects of GenAI datasets. We address key questions about what makes a high-quality dataset: What are the essential aspects to consider in GenAI datasets? Why is it important to regularly update AI datasets? Which datasets can be considered ethical, and what standards do they need to meet?

GenAI Datasets: Five Important Aspects

The emerging market underscores the importance of curating robust datasets. Without diverse and comprehensive data at the input stage, AI systems are fundamentally limited: they cannot produce or understand anything beyond what they were trained on. Here, we look at five aspects of GenAI datasets (diverse inputs, technical diversity, regular updates, community collaboration, and oversight) and discuss them with experts in the field.

The Importance of Diverse Inputs

A simple Google search for “person with a disability” primarily shows people in wheelchairs, although disabilities vary widely and affect many different abilities. This indicates a significant gap in how different aspects of diversity are depicted and represented in digital content.

The root of biased AI often lies in a lack of diverse and inclusive data. AI algorithms cannot think abstractly or conjure something wholly new; they generate outputs based solely on the patterns and data they were exposed to during training.

Today’s AI market therefore faces a significant challenge: datasets often do not reflect the rich diversity of the real world, particularly with respect to minority populations. This oversight leads to biased algorithms that fail to serve the whole of society fairly.

Denis Sorokin, Creative Director of Everypixel, highlighted the ongoing challenges in sourcing high-quality DEI&B content:

To train a neural network to generate diverse and inclusive content, you need a high-quality dataset, which is currently scarce. Thousands of creators, photographers, and artists around the world are working on it, but the lack of quality content is still a real problem.

Denis Sorokin, Creative Director of Everypixel

The realization that diverse and inclusive content is needed is relatively recent. One pivotal moment in the industry’s acknowledgment of this need was the Gucci sweater scandal of 2019. Gucci faced backlash for releasing a black turtleneck sweater that resembled blackface, which sparked accusations of cultural insensitivity. The scandal led to public apologies and to commitments from Gucci to improve cultural awareness and diversity within the company, and it exposed the lack of sensitivity and representation in fashion and media.

So @gucci puts out a sweater that looks like blackface……
On Black History Month….
And then issues an apology because they didn't know that blackface images are racist.

🤦🏿‍♂️ pic.twitter.com/G3HjPTIuuQ
— Tariq Nasheed 🇺🇸 (@tariqnasheed) February 7, 2019

This incident, among others, brought diversity, equity, inclusion, and belonging (DEI&B) issues to the forefront, underscoring the necessity for change in content creation and representation. Before these revelations, the content industry had spent decades producing material that lacked diversity and inclusivity. Only in the past 5-6 years has there been a concerted effort to address these shortcomings. However, the challenge remains immense, as the AI and media industries work to overcome the legacy of insufficient representation.

The current landscape of DEI&B content reveals significant gaps that must be addressed to meet the evolving needs of marketing, advertising, and datasets. The industry expert Steve Jones, Founder & CEO of pocstock, an agency dedicated to addressing the lack of diverse and inclusive content in the stock photography and advertising market, explains their role in the market:

Generally speaking, some of the gaps in content and data are related to a long history of the media placing a high priority on content featuring White people, while not focusing on people of color who make up most of the world’s population. As a result, there are massive content gaps limiting the accuracy, relevance, and completeness of the data now needed for training and accurate predictions.

Pocstock entered the market when brands and agencies sought more authentic content and less of the traditional ‘stocky’ images. This shift reflects a broader move towards personalization in marketing, especially for Gen Z. We developed The Diversity Lens standards to ensure inclusivity in all our content creation and curation.

Steve Jones, Founder & CEO of pocstock

To give more context, The Diversity Lens serves as a guide for inclusive imagery selection and creation. The Diversity Lens recognizes that simply incorporating people from different races, genders, and ethnicities into images does not adequately represent diversity. Instead, it advocates for a deeper understanding of the various ways individuals identify themselves within society. It highlights a spectrum of inclusion that encompasses racial diversity, representation of LGBTQ+ individuals, persons with disabilities, neurodivergent individuals, senior citizens, families, and diverse groups, among others. The guide emphasizes the importance of inclusivity, diversity, equity, justice, belonging, action, equality, and accessibility in visual content creation.

The Diversity Lens standard. Source: Pocstock

The main challenge is producing enough content to address the scarcity proportionally. The data needs to reflect the real world proportionally without biases or over-indexing one group over another. Pocstock is trying to fix a longstanding problem at a scale that makes a meaningful impact, matching the pace of AI development. New AI platforms emerge daily, but data integrity isn’t always reliable.

Steve Jones, Founder & CEO of pocstock
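
One way to make "over-indexing" measurable is to compare each group's share of a dataset with its share of a reference population. The sketch below is only an illustration of that idea: the group names, counts, reference shares, and the 25% tolerance are all placeholder assumptions, not figures from pocstock or any real dataset.

```python
def representation_ratios(dataset_counts, reference_shares):
    """Ratio of each group's dataset share to its reference share.

    A ratio of 1.0 means proportional representation, above 1.0 means the
    group is over-indexed, below 1.0 means it is under-represented.
    """
    total = sum(dataset_counts.values())
    return {
        group: (dataset_counts.get(group, 0) / total) / share
        for group, share in reference_shares.items()
    }


def flag_imbalances(ratios, tolerance=0.25):
    """Split groups into over- and under-represented lists (hypothetical tolerance)."""
    over = sorted(g for g, r in ratios.items() if r > 1 + tolerance)
    under = sorted(g for g, r in ratios.items() if r < 1 - tolerance)
    return over, under


# Placeholder numbers, invented purely for illustration.
dataset_counts = {"group_a": 700, "group_b": 200, "group_c": 100}
reference_shares = {"group_a": 0.50, "group_b": 0.25, "group_c": 0.25}

ratios = representation_ratios(dataset_counts, reference_shares)
over, under = flag_imbalances(ratios)
print(ratios)
print("over-indexed:", over, "under-represented:", under)
```

In practice the hard part is choosing a defensible reference population and labeling scheme; the arithmetic itself is the easy step.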

These expert insights underscore the critical need for diverse datasets in AI development and the ongoing efforts to bridge the representation gaps in the content industry. Other stock platforms and production companies are also moving towards removing DEI&B content gaps, recognizing the necessity for more inclusive representation.

However, training neural networks requires even more diverse content, and the problem extends beyond race and ethnicity. Specific subjects, such as hands, devices, or interactions between multiple people in an image, may also be poorly represented. Addressing these challenges therefore requires a multifaceted approach, including increased production of diverse content and ongoing efforts to improve data integrity and representation.

Diversity in Data: Balancing Technical Parameters

In addition to diversity in content, there should be technical diversity in the data. AI systems thrive on rich datasets that include various sizes, resolutions, and formats, which enable them to operate effectively across different scenarios.

Denis Koreshkov, Computer Vision Engineer at Everypixel, identified several key parameters along which datasets should vary. In theory, combinations of these parameters should cover most existing photo styles:

Temperature: cold, warm, neutral
Shot Size: extreme close-up, close-up, medium close-up, medium, medium full, full
Angles: depersonalised, over the shoulder
Light: high-key, low-key, natural
Orientation: horizontal, vertical

When analyzing one of our datasets containing photos from 2020 to 2024, Denis Koreshkov found that there were significantly more neutral-tone photos than low- or high-tone photos. Additionally, horizontal photos vastly outnumbered vertical ones. Such imbalances in a dataset can affect the AI’s learning and output.
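
A check of this kind can be sketched in a few lines. The example below is a minimal illustration, not the analysis Everypixel actually ran: the records, the field names, and the rule "flag any value whose share is less than half of an even split" are all assumptions made for the sake of the example.

```python
from collections import Counter

# Hypothetical metadata records; field names and values are illustrative,
# borrowed from the parameter list above rather than from a real dataset.
photos = [
    {"temperature": "neutral", "orientation": "horizontal"},
    {"temperature": "neutral", "orientation": "horizontal"},
    {"temperature": "neutral", "orientation": "horizontal"},
    {"temperature": "neutral", "orientation": "horizontal"},
    {"temperature": "neutral", "orientation": "horizontal"},
    {"temperature": "neutral", "orientation": "horizontal"},
    {"temperature": "warm", "orientation": "horizontal"},
    {"temperature": "cold", "orientation": "vertical"},
]


def value_shares(records, field):
    """Return the share of records that carry each value of `field`."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def underrepresented(records, field, expected_values, tolerance=0.5):
    """List values whose share is below `tolerance` times an even split."""
    shares = value_shares(records, field)
    even_share = 1 / len(expected_values)
    return sorted(
        value for value in expected_values
        if shares.get(value, 0.0) < tolerance * even_share
    )


orientation_gaps = underrepresented(photos, "orientation", ["horizontal", "vertical"])
temperature_gaps = underrepresented(photos, "temperature", ["cold", "warm", "neutral"])
print("orientation gaps:", orientation_gaps)
print("temperature gaps:", temperature_gaps)
```

An even split is only one possible target; a team might instead compare against the distribution of formats its model is expected to generate.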

For example, AI algorithms still struggle with proportions when trained on horizontal images and then tasked with generating vertical images. This isn’t merely an odd mistake but a technical limitation that can lead to unusual distortions, as in the example below with an elderly woman and her unusually elongated body.

Created in Stable Diffusion. Source: Easter Eggs of AI

Moreover, datasets should be described qualitatively. Descriptions should include not just what is depicted but also details about the frame, lighting, tone, etc. This level of detail ensures that the AI is exposed to a wide range of scenarios and conditions, helping it to generate more accurate and representative outputs.
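
As a rough illustration of what such a description could look like in structured form, the sketch below defines a small record type and renders it as a caption. The schema, the field names, and the caption format are hypothetical; the allowed values are simply the ones from the parameter list earlier in this section.

```python
from dataclasses import dataclass

# Controlled vocabularies taken from the parameter list earlier in this
# section; a real schema would likely need more fields and values.
ALLOWED = {
    "temperature": {"cold", "warm", "neutral"},
    "light": {"high-key", "low-key", "natural"},
    "orientation": {"horizontal", "vertical"},
}


@dataclass(frozen=True)
class ImageDescription:
    """Hypothetical description record: what is shown plus how it was shot."""
    subject: str
    shot_size: str
    angle: str
    light: str
    temperature: str
    orientation: str

    def __post_init__(self):
        # Reject values outside the controlled vocabularies.
        for field_name, allowed in ALLOWED.items():
            value = getattr(self, field_name)
            if value not in allowed:
                raise ValueError(
                    f"{field_name}={value!r} is not one of {sorted(allowed)}"
                )

    def to_caption(self) -> str:
        """Render the record as a single training caption."""
        return (
            f"{self.subject}; {self.shot_size} shot, {self.angle} angle, "
            f"{self.light} light, {self.temperature} tone, "
            f"{self.orientation} orientation"
        )


example = ImageDescription(
    subject="elderly woman reading in a park",
    shot_size="medium",
    angle="over the shoulder",
    light="natural",
    temperature="warm",
    orientation="vertical",
)
caption = example.to_caption()
print(caption)
```

Keeping the technical attributes as separate fields, rather than only as free text, also makes it possible to count them and spot the imbalances discussed above.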

The Need for Ongoing Learning

The world is not static, and neither can the datasets used to train AI models. Regular updates are essential to keep algorithms relevant and effective. As societal norms, cultural contexts, and technological landscapes evolve, so too must the data that train AI, ensuring it remains fit for purpose and reflects current realities.

For instance, consider the evolution of concerts. In the 80s and 90s, audiences typically just watched performances. However, in today’s digital age, concert-goers predominantly capture every moment on their smartphones. This shift represents a significant change that AI models must adapt to in order to accurately interpret and depict scenes of modern life.

Live concert in 198. Source: Udiscovermusic

Community and Collaboration

Ensuring diversity, maintaining high-quality data, and continuously updating datasets are pivotal in the AI training process and crucial for developing unbiased and effective AI systems. However, strained relations between creators (photographers and other content authors) and technology companies make progress harder. Creators play a significant role in building these datasets, yet they often don’t feel safe or appreciated in this dynamic, even though their cooperation and trust are essential. Mark Milstein of vAIsual comments on the community’s role:

This can be achieved when everyone is aligned. All services depend on a solid base of creatives producing the required content. If the community isn’t on board, the agency has strategies and plans but can’t deliver because photographers and sources either aren’t cooperating, don’t care, or see the technology as a threat to traditional studio business.

We need a mindset that includes readiness and creativity, with personnel and creatives ready to engage with this new industry and meet its needs.

Mark Milstein, Co-Founder, Director of Business Development at vAIsual

Oversight and Governance

Another critical aspect is the need for oversight and governance in the development and management of AI datasets. Jones underscores this need:

We also need oversight, governance, and universal standards. In advertising, we have the IAB; in web development, we have the World Wide Web Consortium. We need a diverse governing body to set standards for AI product development and data modeling. At pocstock, we used a diverse team for our AI product, contributing to a more well-rounded product. Our inclusive approach is what’s needed to solve major issues in the content and data markets.

Steve Jones, Founder & CEO of pocstock

Which Datasets Can Be Considered Ethical

We believe that ethical datasets are those that adhere to high standards of fairness, transparency, and respect for the rights of those whose data is being used. Drawing on various sources and expert responses, we compiled the key points that make a dataset ethical:

Consent and Ownership: Ethical datasets should respect creators’ rights and be composed of data for which explicit consent has been obtained from the data providers. This includes ensuring that data has not been scraped without permission and that there is clear ownership or rights for usage established. Organizations such as Fairly Trained certify AI models that do not use any copyrighted work without a license, ensuring that any training data used has been rightfully obtained and that the rights holders have consented to their data being used for training purposes.
Transparency and Due Diligence: Ethical dataset practices include thorough due diligence on the data’s origins and ensuring transparency about how data is collected, used, and shared. Companies should maintain detailed records of data usage and conduct regular audits to ensure compliance with ethical standards.
Diversity and Inclusivity: An ethical dataset must also be diverse and inclusive, accurately reflecting the variety of the global population to prevent biases in AI behavior. This includes having robust measures in place to ensure that minority groups are adequately represented and not stereotyped.
Addressing Historical Biases: It is important to recognize and correct for historical biases that may be present in datasets. This involves not only correcting inaccuracies but also ensuring that attempts to address biases do not distort historical facts. A balanced approach must be maintained to reflect true diversity without perpetuating stereotypes.
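
The consent and due-diligence points above lend themselves to a simple automated check. The following is a minimal sketch under stated assumptions: the `Asset` record, its fields, and the rejection rules are hypothetical and do not describe Fairly Trained's certification process or any company's actual audit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Asset:
    """Hypothetical provenance record for one item in a training dataset."""
    asset_id: str
    source: Optional[str]        # where the file came from, if known
    license: Optional[str]       # license on record, if any
    consent_for_training: bool   # explicit consent from the rights holder


def rejection_reasons(asset):
    """Return every reason the asset fails the (illustrative) checks."""
    reasons = []
    if not asset.source:
        reasons.append("unknown origin")
    if not asset.license:
        reasons.append("no license on record")
    if not asset.consent_for_training:
        reasons.append("no consent for training use")
    return reasons


def audit(assets):
    """Split assets into accepted IDs and a log of rejected IDs with reasons."""
    accepted, rejected = [], {}
    for asset in assets:
        reasons = rejection_reasons(asset)
        if reasons:
            rejected[asset.asset_id] = reasons
        else:
            accepted.append(asset.asset_id)
    return accepted, rejected


# Invented sample records for illustration only.
assets = [
    Asset("img-001", "contributor upload", "commercial-license", True),
    Asset("img-002", None, None, False),
    Asset("img-003", "contributor upload", "commercial-license", False),
]

accepted, rejected = audit(assets)
print("accepted:", accepted)
print("rejected:", rejected)
```

Keeping the rejection log, not just the accepted list, is what makes the process auditable later. Diversity and historical-bias checks need separate, content-level review and are not covered by a filter like this.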

By adhering to these principles, AI companies can ensure that their training datasets do not just fuel powerful generative models but also promote a fairer and more equitable digital future.
