The Comprehensive Guide to Text-to-Image Models

For the past few years, text-to-image neural networks have been disrupting the world of AI by generating images based on text prompts. This technology has allowed designers and developers to create unique visual content with minimal effort. In this comprehensive guide, we will explore how they work, provide a brief historical overview, and take a look at the most common text-to-image models.

A Brief History of Text-to-Image

Early research in the field of automated image generation dates back to 2007, when the University of Wisconsin published a paper on a text-to-picture synthesis system that aimed to generate images from text. The system used machine learning algorithms to analyze a given text, find corresponding images for its keyphrases, and then compose a picture that represented the meaning of the text. You can think of this as creating a collage of existing images rather than creating a new one. Here is an example of what this system was able to create. The prompt given was “First the farmer gives hay to the goat. Then the farmer gets milk from the cow.”


In 2015, neural networks could only produce blurry images — more an arrangement of pixels than a meaningful picture — but they were pioneering in creating something that had never existed before. By that time, networks had also been trained to transfer style from one image to another. For example, a photo of buildings in a small German town could be rendered in the style of Vincent van Gogh’s “The Starry Night” or Edvard Munch’s “The Scream”. At the time, the Prisma app took the Internet by storm as users constantly posted photos that had been turned into artworks.


As generative neural networks became more sophisticated, the quality of the content they produced improved dramatically. Depending on the datasets used for training, some of them focused on artworks, while others focused on portraits, cats, or landscapes. In 2018, the AI-generated portrait “Edmond de Belamy” was sold at Christie’s for more than $400,000. This was followed by websites that used AI to create stunningly realistic synthetic images of human faces or cats.

In 2021, the AI company OpenAI was catapulted into the spotlight with the release of DALL-E, a model that created images from text captions. While the images it produced were still a bit glitchy, it took the technology far beyond where it had been before.

Source: OpenAI

In 2022 (and you are here), DALL-E was followed by DALL-E 2 along with a wave of other models, and today you can ask them to create almost anything and get realistic, consistent images back. While they still have some limitations, in many cases you won’t be able to tell the difference between their output and a photo taken by a human.

Source: OpenAI

It’s been a six-year journey, and the progress has been impressive. Meanwhile, there is still a long road ahead, and how generative AI might play out in different domains is still subject to change.

Neural Networks Involved In AI Image Generation

Creating an image from a text prompt with AI looks like magic. In fact, it’s math with a touch of magic. Text-to-image algorithms rely on a combination of different networks, each responsible for a specific part of the computation. Broadly, they build on the principles of Convolutional Neural Networks (CNNs), while the core of text-to-image generation is grounded in diffusion modeling frameworks that make use of Variational Autoencoders (VAEs) and Autoencoders (AEs).

The key aspect of image generation algorithms is their ability to recognize objects in images and understand the relationship between text and visuals. A CNN is a type of neural network used primarily for these tasks. One of the earliest iterations of these networks was developed by the French-born scientist Yann LeCun, and it was used by banks to recognize handwriting on checks.

A CNN breaks the image down into smaller pieces and analyzes them individually. It can also understand the relationships between objects, such as which parts are closer together or which are connected. This allows the network to identify the structure of the image, as well as patterns and objects within it. It then puts all these pieces back together to make a prediction about what the image is.

An illustration of Convolutional Neural Networks. Source:
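The patch-by-patch analysis a CNN performs can be illustrated with a minimal convolution written in plain Python. The 3×3 vertical-edge kernel and the tiny 4×4 image below are invented for illustration; real networks learn their kernels from data:

```python
def convolve2d(image, kernel):
    """Slide a small kernel over the image and compute a weighted sum
    of each patch -- the core operation of a convolutional layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the patch currently under the kernel
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        output.append(row)
    return output

# A vertical-edge detector applied to a tiny image whose right half is bright
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = convolve2d(image, kernel)  # strong responses along the edge
```

Stacking many such filters, interleaved with downsampling, is what lets a CNN move from raw pixels to patterns and, eventually, whole objects.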

VAEs and AEs are responsible for encoding and decoding data. The encoding step maps the input data into a lower-dimensional space, and the decoding step maps the encoded data back into the original data space, with some random variation in the case of a VAE.

An illustration of Autoencoders and Variational Autoencoders. Source:
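The encode/decode round trip can be sketched in a few lines of Python. The pair-averaging “encoder” and duplicating “decoder” below are hand-written stand-ins for learned networks, chosen only to keep the example readable:

```python
import random

def encode(x):
    # Compress a 4-dimensional input to a 2-dimensional latent code
    # by averaging adjacent pairs (a stand-in for a learned encoder).
    return [(x[0] + x[1]) / 2, (x[2] + x[3]) / 2]

def decode(z):
    # Expand the latent code back to 4 dimensions
    # (a stand-in for a learned decoder).
    return [z[0], z[0], z[1], z[1]]

def vae_decode(z, sigma=0.1):
    # A VAE samples from a distribution around the latent code,
    # so reconstructions vary slightly from run to run.
    noisy = [v + random.gauss(0, sigma) for v in z]
    return decode(noisy)

x = [1.0, 1.0, 4.0, 4.0]
z = encode(x)                  # lower-dimensional representation
reconstruction = decode(z)     # deterministic, like a plain AE
variation = vae_decode(z)      # slightly different every time
```

The random variation is the important difference: it is what lets a VAE generate new samples near the data it has seen, rather than only reproducing inputs.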

And at the heart of AI image creation is the diffusion model. During training, random noise is gradually added to images until they become unrecognizable, and the model learns to reverse this corruption. To generate an image, the model then starts from pure random noise and removes a little of it at each step. This process is repeated many times, with each step making the image more detailed and coherent.

An illustration of diffusion models. Source:
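The step-by-step refinement can be sketched with a toy loop. Here a hand-written `toy_denoiser` stands in for the trained network, and the “image” is just four grayscale values; real diffusion models use learned noise predictors and carefully tuned noise schedules, so everything below is an illustrative assumption:

```python
import random

def toy_denoiser(x, target):
    # Stand-in for a trained network: it "predicts" how to nudge the
    # current noisy image toward a plausible clean image.
    return [t - v for v, t in zip(x, target)]

def generate(target, steps=50, seed=0):
    rng = random.Random(seed)
    # Start from pure random noise...
    x = [rng.gauss(0, 1) for _ in target]
    for _ in range(steps):
        # ...and remove a small fraction of the remaining noise each step.
        correction = toy_denoiser(x, target)
        x = [v + 0.2 * c for v, c in zip(x, correction)]
    return x

# "Generate" a tiny 1x4 grayscale image
clean = [0.0, 0.5, 1.0, 0.5]
sample = generate(clean)  # converges toward the clean values
```

Each pass shrinks the remaining noise by a constant factor, which is why the output only becomes recognizable after many iterations rather than in one jump.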

The process can be likened to heat conduction. In conduction, heat energy is transferred from hot areas to cool areas until the temperature evens out throughout the medium. In both cases, the process is characterized by the gradual spreading and smoothing of information through the system until a state of equilibrium is reached.
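The analogy can even be simulated. A few lines of Python modeling 1-D heat conduction in an insulated rod (the cell values and step count are arbitrary choices for illustration) show the same gradual smoothing toward equilibrium:

```python
def conduct(temps, alpha=0.25, steps=500):
    """Simple 1-D conduction: each cell moves toward the average of its
    neighbours, smoothing the temperature profile step by step."""
    t = list(temps)
    n = len(t)
    for _ in range(steps):
        new = []
        for i in range(n):
            left = t[i - 1] if i > 0 else t[i]       # insulated ends:
            right = t[i + 1] if i < n - 1 else t[i]  # no heat escapes
            new.append(t[i] + alpha * (left - 2 * t[i] + right))
        t = new
    return t

# A hot spot at one end of a cool rod
rod = [100.0, 0.0, 0.0, 0.0, 0.0]
final = conduct(rod)  # the heat spreads until every cell reads ~20
```

Because the rod is insulated, the total heat is conserved while the differences between neighbouring cells shrink, exactly the "spreading until equilibrium" behaviour described above.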

Now that we’ve seen what’s under the hood of image generators, let’s take a closer look at how these parts work together.

How Text-to-Image Models Work

Text-to-image models have a complicated architecture, with a combination of different networks involved in performing their tasks. In addition, each model has its own architecture, which may differ from others. However, we will try to provide a general explanation of the basic mechanisms of the process.

For a model to master the task, it must be trained to recognize objects in images using a large dataset of images paired with text descriptions. As a result, the model not only recognizes objects in images, but also understands the text descriptions that correspond to them, and how the objects and concepts relate to each other. This understanding of the relationship between language and images is a key aspect of text-to-image algorithms.

The machine converts the data into mathematical representations and places them in a multi-dimensional space organized by meaning, color, shape, texture, style, and so on. This space, the “black box” where the magic happens, is called the “latent space”. When a user types a prompt, the algorithm finds the mathematical representation of what they wrote and retrieves it from the latent space. Based on images the machine has already “seen” in the dataset, it translates the math back into pixels and delivers the output — a generated image that never existed before. In fact, the machine does not actually “see” or “understand” the image the way a human would. Instead, it uses statistical patterns in the data to generate images that match the text prompt.
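A heavily simplified sketch of what “retrieving from the latent space” means: below, a hand-made two-dimensional space holds a few concepts, and a prompt is matched to the nearest one by cosine similarity. The concepts, coordinates, and prompt vector are all invented for illustration; real models learn embeddings with hundreds of dimensions from millions of image-text pairs:

```python
import math

# A hypothetical hand-made "latent space": each concept is a point whose
# coordinates loosely encode (furriness, has_wheels).
latent_space = {
    "cat":     [0.9, 0.0],
    "dog":     [0.8, 0.1],
    "bicycle": [0.0, 0.9],
    "car":     [0.1, 1.0],
}

def cosine(a, b):
    # Similarity of two vectors by the angle between them
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest_concept(query):
    # Find the concept whose latent vector best matches the query,
    # the way a model maps a prompt into its latent space.
    return max(latent_space, key=lambda k: cosine(latent_space[k], query))

prompt_vector = [0.85, 0.05]  # embedding of a hypothetical prompt, e.g. "kitten"
```

In a real text-to-image model the matched point is not an existing image but a region of the space, which the decoder then turns back into pixels.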

Let’s look at how infants explore the world to get a better idea of how this works. They spend countless hours listening to their parents speak, constantly observing the physical objects around them, and making associations. Later, with a wealth of experience from seeing numerous people, things, and objects, hearing thousands of words, and learning colors, shapes, sizes, and textures, the infant’s brain comes to know what a person is, what a cat is, and what a dog is. They can accurately distinguish between these objects and tap the correct option while looking at the picture.

As a child gets older, they learn to draw, and if you ask them to draw a picture of a purple cat riding a bicycle, they will probably be able to complete the task. This is a bit of an oversimplification, but it basically explains how neural networks create images.

The Most Common Text-to-Image Models

DALL-E and DALL-E 2 

DALL-E 2 is an AI system built by OpenAI, one of the most significant companies in content generation to date. It is the successor to DALL-E, which OpenAI released in January 2021. Today, its algorithm creates even more realistic images and art from text descriptions, extends images beyond the original canvas by inferring what lies outside it, and removes objects while accounting for the shadows, lighting, and textures in the original image. As a result of the partnership between OpenAI and Shutterstock, DALL-E 2 has been integrated into one of the largest stock image websites.

DALL-E 2 algorithms integrated into the Shutterstock interface. Source:

OpenAI states that both paid and free users own the images they create with the algorithm, including the right to reprint, sell, and merchandise them. Creators are asked not to mislead the public about AI involvement and to proactively disclose it, for example by adding wording such as “This image was created with the assistance of DALL-E 2” or “This image was generated with the assistance of AI.” OpenAI does, however, allow the DALL-E signature to be removed.


Free: limited credits.

Pay-as-you-go model starts at $0.016 per image.

Stable Diffusion 

Stable Diffusion is an open-source AI model that can draw images given text descriptions. Stability AI, the company behind Stable Diffusion, released the model in August 2022 and licensed it under a CreativeML OpenRAIL++ license. This means that users own the rights to the generated content and are free to use it as long as they don’t violate any laws. 

Because it is more accessible than other models, it has skyrocketed to more than 10 million daily users. The open-source nature of the algorithm and the relatively high visual quality of the output have led many developers to build their own services and applications on top of Stable Diffusion. In fact, at least three of the tools mentioned in our previous post about AI-powered image creation and editing tools are based on it. However, this popularity has also led to lawsuits against Stability AI for scraping copyrighted content without the authors’ consent to build the model’s training dataset.

If you want to test the model, go to a public demonstration on Hugging Face, or join DreamStudio, Stability AI’s creative platform. 


Free: limited credits.

Pay-as-you-go model starts at $10 per 1000 credits.


Midjourney

Midjourney is an AI model run by an independent research lab. A web version is rumored to be coming in a few months, but currently it is only accessible through the official Discord server. Users type in a prompt and send it to the bot, which returns an image. Some have speculated that Midjourney’s algorithms are rooted in Stable Diffusion. Meanwhile, both AI models face legal headwinds from the artistic community.


Free: limited options.

Monthly subscription starts at $8 per month.


Imagen

Imagen is a text-to-image model that Google announced shortly after OpenAI released DALL-E 2. Google is quite cautious about pushing its models to the wider public due to safety concerns, so Imagen is only available as a stripped-down demo on its research paper website and in Google’s AI Test Kitchen app, where a group of users can test beta versions of AI services being developed by the search giant. According to The Verge, there are two options to play with in the AI Test Kitchen: creating a personalized city or a monster.

An image created with Imagen. The prompt was “A photo of a raccoon wearing a cowboy hat and black leather jacket playing a guitar in a garden.” Source:


Free, invite-only.


Parti

Parti is another text-to-image model from Google. While most popular algorithms, including Google’s Imagen and DALL-E, use diffusion models, Parti is based on an autoregressive architecture. It treats text-to-image generation as a sequence-to-sequence modeling problem, much like machine translation: to translate a phrase, the algorithm predicts each new word based on the previous words while taking context into account. For images, the algorithm breaks an image down into image tokens and reconstructs them one by one.
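The autoregressive idea can be sketched with a toy decoding loop. A hand-written lookup table stands in for the trained transformer, and the “image tokens” are purely hypothetical names:

```python
# A hand-written next-token table standing in for a trained transformer.
# Keys are the most recent token; values are the predicted next token.
next_token = {
    "<start>": "sky_patch",
    "sky_patch": "tree_patch",
    "tree_patch": "grass_patch",
    "grass_patch": "<end>",
}

def generate_tokens(max_len=10):
    # Autoregressive decoding: each step conditions on what came before,
    # just as machine translation predicts the next word from context.
    tokens = ["<start>"]
    while len(tokens) < max_len:
        nxt = next_token[tokens[-1]]
        tokens.append(nxt)
        if nxt == "<end>":
            break
    # Drop the markers; in a real model these tokens would then be
    # decoded back into image patches.
    return tokens[1:-1]

image_tokens = generate_tokens()
```

A real model conditions on the entire text prompt and all previously emitted tokens, but the one-token-at-a-time structure of the loop is the same.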

An image generated with Parti. The prompt was “A raccoon wearing formal clothes, wearing a top hat and holding a cane. The raccoon is holding a garbage bag. Oil painting in the style of Rembrandt.” Source:

To date, the model is not available to the public.


Make-a-Scene

Make-a-Scene is an AI model created by Meta. While the company hasn’t made it widely available, it is intended to be a tool for creating artwork from text prompts combined with users’ free-form sketches. This approach puts more control in people’s hands, letting them shape the content the system generates.

CogView and CogView2

CogView2 is an open-source AI model developed by Beijing-based scientists. A 6-billion-parameter model, it also supports interactive text-guided editing of images.


Unfortunately, we haven’t had a chance to test the waters with CogView2, as its web demo was not available at the time of writing. Meanwhile, its predecessor, CogView, a 4-billion-parameter model released in 2021, is available to play around with. It only supports Chinese text prompts, but you can paste a translation of your English text into the input field. According to the Terms of Use, images created with CogView may not be used for commercial purposes, whether direct, indirect, or potential.

An image created with CogView. The prompt was “A racoon playing a guitar.” Source: CogView

Craiyon

Craiyon (formerly DALL-E mini, renamed at OpenAI’s request as a product unrelated to the DALL-E model built by OpenAI) is an image generator that has shocked the public with its strange and glitchy images. It has taken over social media and become a meme machine: Weirddalle has gained 1.2 million followers on Twitter and 171k members on Reddit. Its limited artistic scope comes down to the smaller dataset (about 30 million images) that Boris Dayma, the machine learning engineer who built Craiyon, used to train the model.

An image created with Craiyon. The prompt was “A photo of a raccoon wearing a cowboy hat and black leather jacket playing a guitar in a garden.” Source:

However, you may use images created with Craiyon for personal, academic, or commercial projects, as long as you follow the Terms of Use. In addition, the creators ask free subscribers to give credit to Craiyon for their content.


Free: typical wait time of 1-2 minutes.

Monthly subscription starts at $5 per month.

The featured image for this post was created using the Stable Diffusion algorithm. The prompt was “The magic box of latent space”.
