In the realm of AI, image recognition stands as a gateway for machines to see and comprehend visual data, much like the way humans do. In this guide, we explore how it works, provide a brief historical overview, and take a look at use cases:
Key Terms
The landscape of image recognition is rich with terms that are often confused and used interchangeably. To bring clarity to the topic, here’s a quick guide explaining the key terms and highlighting their distinct differences:
Computer Vision vs. Image Recognition vs. Visual Recognition
The terms computer vision, image recognition, and visual recognition are approached in many different ways, which is why they cause so much confusion. We stick to the idea that Computer Vision (CV) is a field of AI that focuses on enabling computers to identify and comprehend information from visual inputs and to take actions based on that information. CV encompasses a number of tasks that define how computer vision programs process images and return information. Here are some of them:
- Image Classification: assigning an image to one of a set of predefined classes, such as "cat" or "dog" (a minimal code sketch follows this list).
- Image Keywording (or Image Tagging): assigning keywords or tags to images.
- Object Detection: identifying a particular object in a photo or video frame and framing it with a bounding box, a rectangle that tightly outlines the object.
- Optical Character Recognition (OCR): identifying letters and numbers in images and converting them into machine-encoded text.
- Image Segmentation: splitting an image into smaller parts (segments) for a more detailed understanding of the image. The result of this task is an image mask, representing the specific boundary and shape of each identified class.
- Object Tracking: following the location of a moving object across the frames of a video over time.
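To make the first of these tasks, image classification, more concrete, here is a minimal sketch using a pretrained ResNet-18 from torchvision. The model choice and the file name photo.jpg are illustrative only, and the snippet assumes a recent torchvision version along with PyTorch and Pillow installed:

```python
# A minimal image classification sketch with a pretrained ResNet-18 from torchvision.
# Assumes torch, torchvision (0.13+), and Pillow are installed; "photo.jpg" is a placeholder.
import torch
from PIL import Image
from torchvision import models, transforms

# Standard ImageNet preprocessing: resize, crop, convert to tensor, normalize.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights)
model.eval()  # inference mode: no dropout, fixed batch-norm statistics

image = Image.open("photo.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)  # add a batch dimension: (1, 3, 224, 224)

with torch.no_grad():
    probabilities = model(batch).softmax(dim=1)

top_prob, top_class = probabilities.max(dim=1)
labels = weights.meta["categories"]  # the 1,000 ImageNet class names
print(f"Predicted class: {labels[top_class.item()]} ({top_prob.item():.1%})")
```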
Image Recognition (IR), then, is an application of CV that can combine these tasks in different ways for different purposes. Visual Recognition, in turn, is a broader term than Image Recognition, encompassing a wider range of visual data types and extending to videos and live streams.
In scientific articles, you are more likely to come across the term CV, which is also more favored by developers. Meanwhile, IR tends to find its niche in business and practical applications. A quick comparison of search volumes for CV and IR reinforces this trend. However, it is worth noting that, in general, these terms are sometimes used interchangeably.
A Brief History of Image Recognition
In the late 1950s, the cat experiments of neurophysiologists Hubel and Wiesel laid the foundation for understanding visual processing and image recognition. In these experiments they accidentally discovered 'simple cells', neurons that respond to edges of particular orientations. Later mapped into orientation columns, these simple cells had a profound impact on computer vision algorithms, underscoring that image recognition starts with the processing of simple shapes like straight edges.
Around the same period, the first computer image scanning technology emerged, allowing computers to digitize and capture images.
In the 1960s, AI emerged as an academic field, initiating a quest to address the human vision problem.
In 1974, OCR and the more advanced intelligent character recognition (ICR) emerged, later giving rise to the OCR application FineReader developed by ABBYY.
In the 1980s, scientists introduced algorithms for machines to detect edges, corners, curves, and similar basic shapes.
In the early 2000s, the research focus shifted towards object recognition, and the first real-time face recognition applications appeared by 2001. The 2000s also saw the standardization of how visual datasets are tagged and annotated.
In 2010, the ImageNet dataset became available. The ImageNet project, involving researchers such as Olga Russakovsky and Jia Deng, contained millions of hand-tagged images across a thousand object classes. Serving as a foundation for today's models, ImageNet not only enabled the comparison of detection progress across a wider variety of objects but also made it possible to measure the progress of large-scale image indexing for retrieval and annotation in computer vision.
In 2012, a team from the University of Toronto developed the AlexNet model, which significantly reduced the error rate for image recognition.
As the field progressed, one of the most interesting developments came from Google. With the acquisition of reCAPTCHA, a technology designed to distinguish between human and automated access to websites, Google significantly enhanced its OCR algorithms and AI training methods. Using image- and text-based CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart), Google began accumulating a large dataset of labeled examples. Each time users solve a CAPTCHA to access a website, they essentially label an image or a piece of text for Google's dataset. This dataset allows Google to train its AI using supervised machine learning, a method where AI learns from a large set of categorized examples. In essence, every Google user now plays a role in developing a more intelligent and capable AI, particularly in the field of computer vision.
How Image Recognition Works
It is worth clarifying that there are different approaches to image recognition, and one of the most common is the convolutional neural network (CNN), a neural network architecture built around the convolution operation. CNNs are employed, for example, in the already mentioned AlexNet, as well as in our own projects as image recognition algorithms.
First, an image is made up of pixels, each represented by a single number or by a set of numbers that describe its color, such as red, green, and blue values.
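You can see this for yourself with a few lines of Python. This quick illustration assumes Pillow and NumPy are installed, and photo.jpg is just a placeholder file name:

```python
# Illustration: an image is just an array of numbers.
# Assumes Pillow and NumPy are installed; "photo.jpg" is a placeholder file.
import numpy as np
from PIL import Image

image = Image.open("photo.jpg").convert("RGB")
pixels = np.array(image)          # shape: (height, width, 3)

print(pixels.shape)               # e.g. (480, 640, 3)
print(pixels[0, 0])               # the top-left pixel as three color values, e.g. [r g b]
print(pixels[0, 0, 0])            # just its red channel, a number from 0 to 255
```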
In short, before CNNs can recognize objects and their relationships, patterns, and overall image structures, they undergo a training process. During training, CNNs are exposed to a vast dataset of labeled images, learning to associate specific patterns and features with corresponding object labels, such as “dog” or “mouse”.
Once trained, CNNs internally compress input images into hidden representations, discarding insignificant information. These hidden representations encapsulate patterns, relationships between objects, significant pixels, etc. In essence, the original image undergoes a transformation, evolving from a picture into a map of distinctive features — a set of numerical values. Leveraging the knowledge gained from the training process, CNNs utilize these representations to make predictions about the content of new images.
To make the operation of a CNN easier to understand, here we provide a gentle and simple explanation of each stage. CNN layers are typically broken down into four distinct types:
Convolutional Layer
In the convolutional layer, CNNs transform images into small pieces called features. These features capture information about specific shapes, edges, textures, colors, or other visual patterns essential for further scanning. The aim is to find matches between these features and the patterns already learned during training. This process, called filtering, assigns high scores to close matches and low or zero scores to less accurate ones. It involves attempting every possible match by scanning across the image, an operation known as convolution. The output of this convolutional layer is a map of features.
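Here is a small sketch of that scanning step, using PyTorch. The toy image and the hand-crafted vertical-edge filter are illustrative assumptions; in a real CNN the filter weights are learned from data rather than set by hand:

```python
# A minimal convolution sketch: one hand-crafted "feature" (a vertical-edge filter)
# is slid across a grayscale image to produce a feature map.
import torch
import torch.nn.functional as F

# A toy 6x6 grayscale "image": dark on the left, bright on the right.
image = torch.tensor([
    [0., 0., 0., 1., 1., 1.],
    [0., 0., 0., 1., 1., 1.],
    [0., 0., 0., 1., 1., 1.],
    [0., 0., 0., 1., 1., 1.],
    [0., 0., 0., 1., 1., 1.],
    [0., 0., 0., 1., 1., 1.],
]).reshape(1, 1, 6, 6)            # (batch, channels, height, width)

# A 3x3 vertical-edge filter: responds strongly where brightness rises from left to right.
kernel = torch.tensor([
    [-1., 0., 1.],
    [-1., 0., 1.],
    [-1., 0., 1.],
]).reshape(1, 1, 3, 3)            # (out_channels, in_channels, height, width)

feature_map = F.conv2d(image, kernel)   # scan the filter across every position of the image
print(feature_map.squeeze())            # high scores in the middle columns, where the edge sits; zeros elsewhere
```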
Activation Function Layer
After the convolutional layer transforms the image into features and filters it to identify patterns, the resulting information moves to the activation function layer. It evaluates whether a specific pattern is present or not. This layer is crucial for introducing non-linearity, allowing the network to capture more complex dependencies and patterns in the data. This activation step helps the network focus on essential visual elements while filtering out less relevant information. The most commonly used activation function in CNNs is the rectified linear unit (ReLU).
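As a tiny continuation of the sketch above, this is what ReLU does to a feature map; the numbers are made up purely for illustration:

```python
# The activation step: ReLU keeps positive evidence for a pattern and zeroes out the rest.
import torch

feature_map = torch.tensor([
    [ 2.0, -1.5,  0.0],
    [-0.5,  3.0, -2.0],
    [ 1.0, -0.3,  0.5],
])

activated = torch.relu(feature_map)   # negatives become 0, positives pass through unchanged
print(activated)
# tensor([[2.0000, 0.0000, 0.0000],
#         [0.0000, 3.0000, 0.0000],
#         [1.0000, 0.0000, 0.5000]])
```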
Pooling Layer
There are various types of pooling, including MaxPool, AvgPool, and MinPool, with MaxPool being the most commonly used. The pooling layer takes the features identified so far and reduces the size of the feature map. The aim of pooling, particularly Max Pooling, is to discard some information so that the neural network can search for patterns more effectively.
The primary function of Max Pooling is to reduce the dimensions of the feature map. For example, a 2×2 window (or 3×3, or another size) scans across each filtered feature, picks the largest value it comes across, and places it into a single cell of a new, smaller map. Keeping the largest value matters because it often encapsulates key details, such as distinct colors or object boundaries, that are essential for effective pattern recognition. As a result, Max Pooling with a 2×2 window halves the width and height of the feature map.
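A short sketch of exactly that 2×2 case, again with made-up numbers:

```python
# Max pooling sketch: a 2x2 window slides over the feature map and keeps only
# the largest value in each window, halving the width and height.
import torch
import torch.nn.functional as F

feature_map = torch.tensor([
    [1., 3., 2., 0.],
    [5., 6., 1., 2.],
    [7., 2., 9., 4.],
    [1., 0., 3., 8.],
]).reshape(1, 1, 4, 4)                              # (batch, channels, height, width)

pooled = F.max_pool2d(feature_map, kernel_size=2)   # 4x4 -> 2x2
print(pooled.squeeze())
# tensor([[6., 2.],
#         [7., 9.]])
```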
In simpler terms, the pooling layer helps keep the essential information while making the data more manageable.
Fully Connected Layer
In the fully connected layer, the smaller filtered images are flattened and stacked into a single list of values (a vector). Each value in this list plays a role in predicting the probability of different outcomes.
To put it simply, imagine taking the key features highlighted by the pooling layer and organizing them in a way that helps the network make its final decision. For instance, if the network is deciding whether an image has a mouse, the fully connected layer will look at features like tails, whiskers, or a rounded body. It assigns specific weights to these features, sort of like saying, “How important is the tail in determining if it’s a mouse?”
By calculating the products of these weights and the information from the previous layer, the fully connected layer helps the model generate accurate probabilities. In the end, it’s like the network’s way of saying, “Based on these features, I’m pretty confident this is or isn’t a mouse.” This step ensures the network makes precise predictions about what’s in the image.
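To tie the four layer types together, here is a minimal, untrained CNN classifier sketch in PyTorch. The layer sizes, the input resolution, and the two classes ("mouse" vs. "not a mouse") are illustrative assumptions; in practice the weights would be learned from a large labeled dataset, as described above:

```python
# A minimal (untrained) CNN classifier: convolution -> activation -> pooling -> fully connected.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 2):        # e.g. "mouse" vs. "not a mouse"
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),   # convolutional layer: 8 feature maps
            nn.ReLU(),                                   # activation layer: keep positive evidence
            nn.MaxPool2d(2),                             # pooling layer: 64x64 -> 32x32
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 32x32 -> 16x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                # stack the feature maps into one long vector
            nn.Linear(16 * 16 * 16, num_classes),        # fully connected layer: features -> class scores
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = TinyCNN()
fake_image = torch.rand(1, 3, 64, 64)                    # a random stand-in for one 64x64 RGB image
probabilities = model(fake_image).softmax(dim=1)
print(probabilities)                                     # roughly 50/50 before any training
```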
While CNNs are widely adopted, there is also the Vision Transformer (ViT), a more recent architecture that is more effective in some cases.
Instead of handling the entire image directly (as CNNs do), a ViT breaks it down into smaller "image patches" and treats them as a structured sequence. A pure transformer applied directly to sequences of image patches can perform very well, particularly in image classification tasks. When pre-trained on large datasets and transferred to various image recognition benchmarks, such as ImageNet, ViT achieves excellent results compared to state-of-the-art convolutional networks while demanding significantly fewer computational resources for training. However, a deeper exploration of these subtleties and nuances would require a broader discussion.
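The patch-splitting idea can be sketched in a few lines of PyTorch. This only shows the first step of a ViT (turning an image into a sequence of patch embeddings), not the transformer itself, and the patch size and embedding dimension are illustrative:

```python
# The core ViT idea in brief: split an image into fixed-size patches, flatten each patch,
# and project it into an embedding, so the image becomes a sequence of "tokens".
import torch
import torch.nn as nn

patch_size = 16
embed_dim = 64

image = torch.rand(1, 3, 224, 224)                        # a random stand-in for one RGB image

# Cut the image into non-overlapping 16x16 patches and flatten each one.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.contiguous().view(1, 3, -1, patch_size * patch_size)  # (1, 3, 196, 256)
patches = patches.permute(0, 2, 1, 3).flatten(2)          # (1, 196, 768): 196 flattened patches

# Project each flattened patch into an embedding; a transformer encoder would then
# process this sequence much like a sequence of words.
patch_embedding = nn.Linear(3 * patch_size * patch_size, embed_dim)
tokens = patch_embedding(patches)                          # (1, 196, 64)
print(tokens.shape)
```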
Image Recognition Software
To simplify the creation and implementation of image recognition software, developers can leverage image recognition APIs. These APIs, available through cloud-based services, provide an efficient way to perform image recognition. By sending data to cloud servers, developers can quickly build and launch image recognition solutions, obtaining valuable information about an image or the objects it contains.
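The request/response pattern is usually the same across such services, including the APIs described below. Here is a generic sketch in Python; the endpoint URL, authentication scheme, parameter names, and response fields are hypothetical placeholders, so consult your provider's documentation for the real ones:

```python
# A generic sketch of calling a cloud image recognition API with the requests library.
# The endpoint, credential, and response shape below are hypothetical placeholders.
import requests

API_URL = "https://api.example.com/v1/keywords"   # hypothetical endpoint
API_KEY = "YOUR_API_KEY"                          # hypothetical credential

def recognize(image_path: str) -> list[str]:
    """Send an image to the (hypothetical) recognition endpoint and return its keywords."""
    with open(image_path, "rb") as f:
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"image": f},
            timeout=30,
        )
    response.raise_for_status()
    data = response.json()
    # Assumed response shape: {"keywords": [{"keyword": "dog", "score": 0.98}, ...]}
    return [item["keyword"] for item in data.get("keywords", [])]

print(recognize("photo.jpg"))
```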
Image recognition is a powerful capability that integrates effortlessly into diverse applications and devices, enabling a range of practical uses. As Everypixel Labs, the minds behind Everypixel Journal, we develop APIs equipped with image recognition technology, so we decided to delve into their functionalities and shed light on how they can contribute to the development and enhancement of projects and businesses.
Image Keywording (or Image Tagging)
The image keywording API provides methods to recognize objects, people, places, and actions in images and turn them into keywords.
Functionality: Generating a set of at least 20 keywords (up to 50) for each image.
Best for:
- Image categorization for websites and applications.
- Digital asset management for efficient content organization.
- Image moderation.
Age Recognition
The age recognition API provides methods to extract facial features from photos to accurately estimate a person’s age.
Functionality: Estimating a person’s age based on an analysis of patterns in an image of their face.
Best for:
- Sales and marketing: it tailors advertising and product offerings based on age demographics.
- Entertainment and gaming: it enhances user experiences with personalized content and adaptive settings.
- Access control and verification: it automatically controls access to services and products based on age.
- Personalized UI: it facilitates the creation of adaptive user interfaces based on the user’s age.
User Generated Content (UGC) Photo Scoring
The UGC photo scoring API provides methods to give an aesthetic score to user-generated photographs based on technical parameters and attractiveness.
Functionality: Scoring photos based on such parameters as clarity, composition, exposure, framing, and so on.
Best for:
- Recommendation services: it helps users choose the best photo for improved user satisfaction.
- Image moderation: it filters out poor-quality photos in user-generated content.
- Image quality control system: it ensures platform success by preventing the upload of poor-quality photos.
- Search results ranking: it enhances visual search results on platforms with the UGC photo scoring algorithm.
Here we have showcased just a glimpse of the vast capabilities of the technology. While there is considerable hype surrounding image generation, it is crucial to recognize the profound importance of image recognition, especially in domains such as medicine and accessibility for individuals with disabilities. These applications not only demonstrate the versatility of image recognition but also underscore its potential to significantly impact and improve various aspects of our lives.