In this monthly roundup, we spotlight the top AI news stories from May:
OpenAI’s GPT-4o Rollout
In line with its mission to advance AI accessibility, OpenAI introduced its latest model, GPT-4o. GPT-4o offers GPT-4-level intelligence with enhanced speed and capabilities across text, voice, and vision. Users can now interact with GPT-4o in real-time conversations about images, enabling tasks such as translating menus and getting recommendations. Language support has been expanded to over 50 languages, making AI more accessible globally. Additionally, ChatGPT Free users gained access to features such as data analysis, photo discussions, file uploads for assistance, and more.
Between the lines: During the presentation, OpenAI also demonstrated a voice assistant whose voice, named Sky, sparked controversy over its similarity to Scarlett Johansson’s. The incident prompted mixed reactions, with some supporting Johansson’s stance on the importance of consent and others defending OpenAI’s legal rights.
Financial Times Signs Licensing Deal with OpenAI
The Financial Times has entered into a licensing agreement with OpenAI, allowing ChatGPT users to access summaries, quotes, and links to its articles, all attributed to The Financial Times. This partnership includes collaboration on developing new AI tools, building on The Financial Times’s existing use of OpenAI’s ChatGPT Enterprise.
The context behind: This deal is also part of OpenAI’s broader strategy of licensing content from various news organizations, despite some legal challenges from others like The New York Times over copyright issues.
Mysterious gpt2-chatbot
A strikingly capable AI model, named gpt2-chatbot, briefly appeared on the LMSYS Org website, drawing significant attention before being swiftly taken offline. Users noted that its performance rivaled, and at times exceeded, that of OpenAI’s GPT-4, placing it among the most advanced AI systems available. However, the model’s origin remains unknown, fueling speculation that it could be an early release from OpenAI. LMSYS Org cited “unexpectedly high traffic & capacity limit” as the reason for taking it down and hinted at a broader release in the future. The fleeting appearance has intensified curiosity and rumors about the next breakthroughs in AI technology.
Between the lines: The rumors about OpenAI’s involvement intensified after the company’s CEO, Sam Altman, mentioned he had a soft spot for “gpt2” in a post on X, which quickly gained over 2 million views.
Apple to Unveil AI-Enabled Safari Browser Alongside New Operating Systems
Apple is set to revolutionize its Safari web browser with AI-powered features in the upcoming release of iOS 18 and macOS 15. The new Safari 18 will introduce “Intelligent Search,” an advanced tool leveraging AI to provide text summarization and enhance browsing by identifying key topics and phrases within web pages. Additionally, a “Web Eraser” feature will allow users to remove unwanted content from web pages, enhancing user control and privacy.
Between the lines: Apple has also reached an agreement with OpenAI to incorporate ChatGPT features into its forthcoming iOS 18 operating system for the iPhone. The AI enhancements, part of a broader update expected at Apple’s Worldwide Developers Conference in June, signify a major step in the company’s commitment to advancing AI technology.
OpenAI’s Media Manager
In a bid to address concerns around content ownership, OpenAI announced the ongoing development of Media Manager, a tool that will enable creators and content owners to tell OpenAI what they own and specify how they want their works included in or excluded from machine learning research and training. Media Manager aims to establish a new standard of transparency and accountability in the AI industry.
Why it matters: With Media Manager expected to be released by 2025, OpenAI seeks to set a precedent for ethical content usage in AI systems, fostering a collaborative environment that benefits all stakeholders involved.
Microsoft Bans U.S. Police from Using Azure OpenAI Service for Facial Recognition
Microsoft has reinforced its prohibition on U.S. police departments using its Azure OpenAI Service for facial recognition. The updated terms of service now explicitly prevent integrations from being used by or for police departments in the U.S. for real-time facial recognition on mobile devices such as body cameras and dashcams. This policy adjustment follows the recent release of a product by Axon, which utilizes OpenAI’s GPT-4 model to summarize body camera audio, raising concerns about potential AI hallucinations and racial biases.
Between the lines: While the ban applies to U.S. police, it allows for some flexibility internationally and does not cover facial recognition in controlled environments.
Alexander Shironosov, our colleague and R&D Team Lead at Everypixel, has deepened our exploration of recent AI model releases:
VLM:
OpenAI and Google have announced major advancements in their AI models, with OpenAI’s multimodal GPT-4o and Google’s Gemini 1.5 Flash and Pro achieving significant milestones. GPT-4o has taken the top position in the text-based LMSYS Chatbot Arena, while Gemini 1.5 Pro holds second place and Gemini 1.5 Flash sits in the top ten.
Non-proprietary VLM:
SigLIP’s visual encoder continues to dominate the field of non-proprietary VLMs, where it is frequently paired with LLMs. This approach is highlighted in two significant guides on VLM creation from Meta and Hugging Face. Google’s open-source PaliGemma family also uses this combination, employing SigLIP together with a Gemma language model in variants with varying parameters. Furthermore, the Llama 3-V model, which combines SigLIP with Llama 3 8B, demonstrates impressive performance, rivaling the metrics of Gemini 1.5 Pro on various vision benchmarks.
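To make that recipe concrete, here is a minimal, hedged sketch of the encoder half of such a pairing: extracting image embeddings with SigLIP via Hugging Face transformers, which a projection layer would then map into an LLM’s token space. The checkpoint name and the projection size are illustrative assumptions, not details from any of the models mentioned above.

```python
# Sketch: the SigLIP-encoder half of a typical open VLM recipe.
# "google/siglip-base-patch16-224" is a public checkpoint; the projection
# size (4096, a Llama-class hidden dim) is an illustrative choice.
import torch
from PIL import Image
from transformers import AutoImageProcessor, SiglipVisionModel

checkpoint = "google/siglip-base-patch16-224"
processor = AutoImageProcessor.from_pretrained(checkpoint)
encoder = SiglipVisionModel.from_pretrained(checkpoint)

image = Image.open("example.jpg").convert("RGB")  # any RGB image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    # (1, num_patches, hidden_size) patch embeddings from the vision tower
    patch_embeddings = encoder(**inputs).last_hidden_state

# In a full VLM, a small projection maps patch embeddings into the LLM's
# embedding space so they can be prepended to the text tokens.
projection = torch.nn.Linear(patch_embeddings.shape[-1], 4096)
visual_tokens = projection(patch_embeddings)
print(visual_tokens.shape)
```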
CV:
- A new version of the popular object detection model, YOLOv10, has been released, featuring significant enhancements. The authors have abandoned non-maximum suppression and implemented several other optimizations, resulting in faster inference without compromising accuracy (a minimal sketch of the NMS step that was dropped appears after this list).
- A joint study by FAIR, Google, and INRIA introduces a novel method for automatic clustering of data to address data imbalance in training, diverging from the traditional k-means approach. This new technique effectively accounts for data from the long tails of distributions, enhancing the performance of algorithms in Self-Supervised Learning. The study demonstrates significant improvements in managing data diversity and boosting algorithmic accuracy.
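For readers less familiar with what YOLOv10 drops, below is a small, self-contained sketch of classic non-maximum suppression, the post-processing step the new model is designed to make unnecessary. It is illustrative only and not taken from the YOLOv10 codebase.

```python
# Illustrative NumPy implementation of classic non-maximum suppression (NMS),
# the post-processing step YOLOv10 aims to eliminate.
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_threshold: float = 0.5) -> list:
    """boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,). Returns kept indices."""
    order = scores.argsort()[::-1]  # highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the chosen box with every remaining box
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # Discard boxes that overlap the chosen box too much
        order = rest[iou <= iou_threshold]
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # -> [0, 2]: the duplicate box is suppressed
```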
LLM:
- Intel researchers have unveiled a leaderboard of quantized language models on Hugging Face, designed to help users select the most suitable models and guide researchers in choosing optimal quantization methods. The leaderboard aims to balance efficiency and performance, giving the AI community a valuable resource for model deployment and development (a minimal 4-bit loading sketch follows this list).
- Recent developments in language models also include Mistral’s new code generation model, Codestral, which has 22 billion parameters yet outperforms both the 33-billion-parameter DeepSeek Coder and the 70-billion-parameter CodeLlama. Additionally, a new version of DeepSeek, DeepSeek V2, has been released, sparking anticipation for a potential new iteration of DeepSeek Coder.
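As context for what such a leaderboard measures, here is a hedged sketch of loading a causal LM in 4-bit precision with Hugging Face transformers and bitsandbytes, the kind of weight quantization whose efficiency/quality trade-off these rankings compare. The model name is a placeholder and the flags shown are one common configuration, not a recommendation taken from the leaderboard itself.

```python
# Sketch: loading a causal LM with 4-bit weight quantization via bitsandbytes.
# Requires a GPU plus the bitsandbytes and accelerate packages; the model id
# is a placeholder, any causal LM repository works.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 to preserve quality
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Quantization trades memory for", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```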
Image Generation:
- While the AI community eagerly awaits the public release of Stable Diffusion 3, new text-to-image models built on the DiT (Diffusion Transformer) architecture have emerged. Lumina-T2I and Hunyuan-DiT, Tencent’s DiT model, are noteworthy additions. The authors of Lumina-T2I provide detailed insights into training such models in their paper, and checkpoints for both models are already available, so users can explore their capabilities now.
- An intriguing development in the AI community is a project by the independent developer Cloneofsimo, who is building a model akin to Stable Diffusion 3 from scratch. The developer documents progress through regular Twitter updates and codebase revisions on GitHub, a grassroots effort to replicate and innovate upon cutting-edge text-to-image architectures.
- Recent advances in distilling text-to-image models have produced several promising approaches for generating images in fewer steps. Notable among these are Hyper-SD, which integrates Consistency Distillation, the Consistency Trajectory Model, and human feedback; the Phased Consistency Model; and SDXL-Diffusion2GAN, which introduces a one-step generator. These models, detailed in their respective papers, outperform earlier methods such as LCM and SDXL-Turbo, showing significant improvements in efficiency and quality (a short few-step generation sketch follows this list).
- A recent study also explores the use of text-to-image models in a specialized domain: the generation of 2D and 3D medical data. By training a diffusion model to produce high-quality medical images, this approach aims to enhance the accuracy of anomaly detection models, ultimately aiding physicians in their diagnostic processes and improving overall medical outcomes.
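To illustrate the few-step generation pattern these distilled models target, here is a hedged sketch using the earlier LCM-LoRA workflow in diffusers, one of the baselines the newer methods compare against. Hyper-SD and the Phased Consistency Model follow a broadly similar load-adapter-and-reduce-steps recipe, but their exact repositories and file names are not reproduced here.

```python
# Sketch: few-step text-to-image generation with a distillation adapter,
# shown with the LCM-LoRA baseline; newer distilled models follow a
# similar pattern. Assumes a CUDA GPU and the diffusers package.
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# Swap in the consistency-distillation scheduler and load the distilled LoRA.
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")

# 4 steps and low guidance instead of the usual 25-50 steps.
image = pipe(
    "a watercolor fox in a misty forest",
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
image.save("fox.png")
```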
Miscellaneous
Sony Music Confronts Tech Giants Over Unauthorized Use of Artists’ Songs
Sony Music has taken a bold stance against tech giants, including Google, Microsoft, and OpenAI, accusing them of potentially exploiting its songs in the development of AI systems without proper authorization. In a series of letters to over 700 firms, Sony Music demands clarification on whether its music was used in AI training, warning of legal action if copyright infringement is confirmed.
Why it matters: This move underscores a broader debate surrounding AI data usage and copyright laws, with implications for the future of AI development and regulation.
Stability AI Explores Sale
Stability AI is reportedly exploring a sale amid financial difficulties, with discussions held with potential buyers in recent weeks. Facing a cash crunch, the company generated less than $5 million in revenue in Q1 2024 while sustaining losses exceeding $30 million. With debts nearing $100 million to cloud computing providers and others, Stability AI’s financial strain is evident.
The context behind: This development follows a recent restructuring that included staff layoffs and the resignation of founder Emad Mostaque as CEO. Despite having nearly 200 employees worldwide and releasing AI models for audio and video generation, the company’s future remains uncertain amid its financial woes. Mostaque, for his part, remarked on the situation with a blend of irony and resignation.
Microsoft Secures Largest Renewable Energy Deal Ever to Power AI Ambitions
Microsoft has signed the largest renewable energy agreement in history, committing to develop 10.5 gigawatts of new renewable energy capacity globally to fuel its AI ambitions. This record-breaking deal with Brookfield Asset Management, worth an estimated $11.5 to $17 billion, is critical for supporting Microsoft’s AI-driven initiatives and data centers, which are known for their high energy consumption. The new renewable energy projects, coming online between 2026 and 2030, will bolster Microsoft’s efforts to match 100% of its electricity use with carbon-free energy and reduce its reliance on fossil fuels.
Why it matters: With AI expected to consume ten times more electricity by 2026 than in 2023, this agreement will help Microsoft mitigate its carbon footprint while advancing its goal of achieving carbon negativity by 2030.