🚨OpenAI’s Viral Image Gen Model Just Dropped for Devs
Along with: Microsoft 365 Copilot Got a Major Upgrade
Hi there 👋
Sam Altman recently mentioned that saying “please” to ChatGPT has cost OpenAI tens of millions. Turns out, even being polite adds to your token bill.
This week’s updates are all about giving you more control over how much your AI thinks, and how much it costs.
Google’s Gemini 2.5 Flash now lets you set a “thinking budget” to balance speed, intelligence, and spend. OpenAI has opened up gpt-image-1, the model behind those Ghibli-style images, for developers to build with. And Gemma 3 QAT is making it possible to run powerful models locally, right from your laptop.
Meanwhile in Beijing, robots tried running a half-marathon. Four finished. One crashed into a fence. We’re getting there.
Let’s take a closer look at each of these and more.
What’s the format? Every week, we break the newsletter into the following sections:
The Input - All about recent developments in AI
The Tools - Interesting finds and launches
The Algorithm - Resources for learning
The Output - Our reflection
The model behind those trending Studio Ghibli-style AI images is now available to developers. OpenAI has launched gpt-image-1, its newest image generation model, through the API. It’s the same tool used in ChatGPT to turn text into pictures.
What’s New:
Multimodal Image Generation: gpt-image-1 is built to understand detailed text prompts and generate images in a wide range of styles. It can follow complex instructions, render accurate text inside visuals, and even reflect real-world knowledge.
Works with Design Apps: Tools like Adobe and Figma are starting to use gpt-image-1, so you can generate and edit images without leaving your design workflow.
Built-In Safety: The model includes filters to block harmful or unsafe images, and you can control how strict those filters are.
How Pricing Works: It’s token-based: $5 per 1M text input tokens, $10 per 1M image input tokens, and $40 per 1M image output tokens. Most images cost somewhere between $0.02 and $0.19, depending on quality and complexity.
This makes OpenAI’s image generation tech available to anyone, not just ChatGPT users. Developers can now add high-quality, prompt-based visuals to their own apps, a big step toward making visual creation as accessible as text generation. (source)
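If you want to try it from code, here’s a minimal sketch using the official OpenAI Python SDK. The prompt, output file name, and quality setting are placeholder choices of ours, not anything prescribed by OpenAI, and per-image cost scales with the quality you pick.

```python
import base64

from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Request a single 1024x1024 image; gpt-image-1 returns it base64-encoded.
result = client.images.generate(
    model="gpt-image-1",
    prompt="A watercolor illustration of a lighthouse at dawn",  # placeholder prompt
    size="1024x1024",
    quality="medium",  # "low" / "medium" / "high" trade cost for detail
)

# Decode the base64 payload and save it to disk.
with open("lighthouse.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```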
Image gen is now available in the API!
We’re launching gpt-image-1, making ChatGPT’s powerful image generation capabilities available to developers worldwide starting today.
✅ More accurate, high fidelity images
🎨 Diverse visual styles
✏️ Precise image editing
🌎 Rich world…
— OpenAI Developers (@OpenAIDevs)
5:34 PM • Apr 23, 2025
Google has introduced Gemini 2.5 Flash, a hybrid reasoning AI model now available in preview. This model offers developers the ability to control the depth of the AI's reasoning through a configurable "thinking budget," optimizing for cost, speed, and output quality.
What’s New:
Hybrid Reasoning Control: 2.5 Flash shows significant reasoning boosts over its predecessor (2.0 Flash), with a controllable thinking process that can be toggled on or off depending on task complexity.
Thinking Budget: Developers can allocate up to 24,576 tokens for deeper reasoning. For example, a 0-token budget works well for simple tasks like translating phrases or answering trivia, while a higher budget lets the model handle complex tasks like generating engineering calculations or multi-step scheduling logic (see the sketch after this list).
Performance Benchmarks: The model demonstrates strong performance across reasoning, STEM, and visual reasoning benchmarks, outperforming Claude 3.7 Sonnet and DeepSeek R1, while coming close to OpenAI's o4-mini.
Cost Efficiency: Gemini 2.5 Flash is designed to be cost-effective, with pricing as low as $0.60 per million tokens when reasoning is disabled, and $3.50 per million tokens when reasoning is enabled.
Availability: The model is accessible via API through Google AI Studio and Vertex AI, and is also available as an experimental option within the Gemini app. (source)
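For a feel of how the budget is set in practice, here’s a minimal sketch using the google-genai Python SDK. The model id and the 1,024-token budget are assumptions based on the preview announcement, so check the current docs before copying.

```python
from google import genai
from google.genai import types  # pip install google-genai

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

# Cap internal reasoning at 1,024 "thinking" tokens; set thinking_budget=0
# to switch thinking off entirely for simple lookups or translations.
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",  # preview model id at the time of writing
    contents="Plan three 30-minute meetings tomorrow without overlapping a 12-1pm lunch.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=1024)
    ),
)
print(response.text)
```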
Microsoft has rolled out new features to 365 Copilot aimed at improving productivity, research, and content workflows. While it’s clear Microsoft is continuing to invest heavily in its AI workspace, here’s what’s changed:
Smart Agents: New built-in agents, Researcher and Analyst, can help synthesize information and analyze data. Users can also explore the new Agent Store or create their own agents via Copilot Studio.
Notebooks: A way to organize Copilot’s responses alongside your own materials (docs, meetings, pages), and even generate quick audio summaries. Designed for deeper collaboration and ideation.
Universal Search: Copilot can now surface answers across tools like ServiceNow and Slack, returning both the answer and the source.
Create & Customize: Some handy creative updates: turn slides into videos, generate images from text, and use a refreshed interface built to make human–AI interaction smoother. (source)
1/ Big day for Microsoft 365 Copilot: I’m really excited about our latest update.
Copilot has truly become the UI for AI – and for me, it’s the scaffolding for my workday.
Here are four new features I’ve especially been enjoying.
— Satya Nadella (@satyanadella)
5:40 PM • Apr 23, 2025
Big news from the Gemmaverse! This week, Google introduced Gemma 3 QAT (Quantization-Aware Training) models, enabling powerful AI to run on everyday hardware. These models drastically cut memory needs without sacrificing accuracy, making advanced AI much more accessible.
What’s New:
Quantization-Aware Training (QAT): Normally, AI models use high-precision math that eats up a lot of memory. With QAT, the model is trained to work well even with low-precision math (like using 4-bit numbers instead of 16- or 32-bit). It’s like training someone to do math on a calculator with fewer buttons: they get just as good, but use way less power and memory.
Run Big AI on Small Devices: These models need far less VRAM (your GPU’s memory), which means you can run large models on everyday devices:
The 27B model used to need 54GB of VRAM; now it needs just 14.1GB, which fits a desktop RTX 3090.
The 12B model drops from 24GB to 6.6GB and works on laptop GPUs like the RTX 4060.
The 4B and 1B models are even lighter and can run on compact devices, including some phones and edge computers.
Broad Compatibility: These models work with popular tools like Ollama, LM Studio, and MLX, and with libraries like llama.cpp and gemma.cpp, so you can plug them into existing workflows easily.
Available Now: You can download and use Gemma 3 QAT models today from Hugging Face and Kaggle, with full support and documentation. (source)
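If you’d rather try one locally, here’s a minimal sketch using the Ollama Python client. The “gemma3:12b-it-qat” tag is an assumption on our part, so verify the exact QAT tag in the Ollama model library before pulling.

```python
import ollama  # pip install ollama; assumes the Ollama daemon is running locally

# Pull a Gemma 3 QAT build and chat with it. The exact tag ("gemma3:12b-it-qat")
# is an assumption - check the Ollama model library for the current QAT tags.
ollama.pull("gemma3:12b-it-qat")

reply = ollama.chat(
    model="gemma3:12b-it-qat",
    messages=[{"role": "user", "content": "Explain quantization-aware training in two sentences."}],
)
print(reply["message"]["content"])
```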
Just announced new versions of Gemma 3 – the most capable model to run just one H100 GPU – can now run on just one *desktop* GPU!
Our Quantization-Aware Training (QAT) method drastically brings down memory use while maintaining high quality. Excited to make Gemma 3 even more
— Sundar Pichai (@sundarpichai)
3:56 PM • Apr 18, 2025
ByteDance has launched UI-TARS-1.5, a new open-source AI agent that combines vision and language to understand and interact with digital screens - kind of like an AI that can see your app and click around like a human.
What’s New:
Understands Screens Visually: UI-TARS-1.5 can read what's on a screen, understand it, and respond, whether that means clicking a button, typing in a form, or following instructions.
Trained on Real Interfaces: It’s built on examples from real-world apps and websites, so it knows how to navigate common layouts and tasks.
Open-Source and Free to Use: ByteDance is making the model available to everyone, so developers can build smarter UI tools, automation systems, or accessibility features.
This brings us closer to AI that doesn’t just generate content, but actually uses software. With UI-TARS-1.5, developers can build agents that interact with apps visually, like a human user, unlocking a whole new layer of automation.
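As a rough sketch of how you might poke at it, the snippet below loads the released checkpoint through the Hugging Face transformers pipeline. The repo id, pipeline task, and prompt format are all assumptions on our part; treat this as a starting point and follow the model card for the real action-space prompting.

```python
from transformers import pipeline  # pip install transformers accelerate

# Repo id, task name, and prompt format are assumptions - check the model
# card on Hugging Face for the exact action/grounding prompt it expects.
pipe = pipeline("image-text-to-text", model="ByteDance-Seed/UI-TARS-1.5-7B")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "screenshot.png"},               # a UI screenshot
        {"type": "text", "text": "Click the 'Sign in' button."},  # the instruction
    ],
}]

out = pipe(text=messages, max_new_tokens=128)
print(out[0]["generated_text"])  # the generated output, in the model's own schema
```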
It may sound like your neighbor’s WiFi password, but Wan2.1-FLF2V-14B is actually a new open-source AI model from Alibaba. It can generate smooth, HD video using just two images: one to start and one to finish.
What’s New:
Two Images In, Full Video Out: Give it a starting frame and an ending frame, and the model fills in the rest, creating a smooth 5-second video. Perfect for concept art, animation, or just impressing your friends.
Multiple Modes: Besides first-to-last-frame video, it can also handle text-to-video, image-to-video, and even video-to-audio tasks.
HD Output That Looks Real: It produces 720p video that’s visually consistent and realistic, thanks to training on a massive dataset of 150M videos and 1B images.
Runs on Consumer GPUs: No data center required. You can run it on a single RTX 4090, using about 8GB of VRAM. A 5-second clip takes roughly 4 minutes to render.
Open-Source and Ready to Go: It’s available on Hugging Face and GitHub, with demos and code that work with ComfyUI and Diffusers. (source)
In a one-of-a-kind event in Beijing, 21 humanoid robots joined 12,000 human runners in a half-marathon. It was a bold test of what robots can do outside the lab and also a bit of a comedy show.
Why It Matters:
A Real-World Challenge: The robots had to run 13.1 miles and finish within 3.5 hours. That means keeping a steady pace of at least 3.7 mph, not easy for a robot!
Some Made It, Some… Didn’t: Only four robots crossed the finish line. The fastest, Tien Kung Ultra, took 2 hours and 40 minutes but needed three battery changes along the way.
Plenty of Wipeouts: One robot crashed into a fence. Another stopped moving halfway through. A few overheated. It was rough and also pretty entertaining to watch.
Why China Did This: This race is part of China’s big push to become a world leader in humanoid robotics by 2027. The government is backing robot companies with money, tax breaks, and lots of support.
This wasn’t just for fun. It’s a way to see how robots handle real environments. Before they’re helping in factories or hospitals, they need to learn to handle chaos like running next to thousands of humans.
Meta’s Fundamental AI Research (FAIR) team has released a wave of new tools aimed at helping AI see, understand, and reason more like humans. These updates target some of the biggest challenges in AI today, like understanding complex visuals, locating objects in 3D space, and collaborating on reasoning tasks.
What’s New:
Meta Perception Encoder: A new vision model that helps AI better understand images and videos. It handles tough tasks like spotting hidden objects (think stingrays in sand or wildlife in the distance) and works well with language models for visual reasoning. Read More.
Perception Language Model: An open-source model trained on millions of video Q&As and captions. It’s designed to understand both what’s happening and when, making it ideal for recognizing fine details in video clips. Read More.
Meta Locate 3D: A tool that lets AI identify and locate objects in 3D using natural language (e.g., “the vase near the couch”), not just coordinates. It’s powered by point cloud data from depth sensors. Read More.
Dynamic Byte Latent Transformer: A new language model that skips traditional tokens and uses raw bytes instead. This makes it more robust, especially in weird or noisy text scenarios. Read More.
Collaborative Reasoner: A framework where AI agents can talk, disagree, and solve problems together, a big step toward multi-agent systems that reason more like teams of people. Read More.
All of these tools are open or reproducible and mark Meta’s ongoing push toward more intelligent, flexible, and human-like AI.
I used Gamma recently while working on a deck for the upcoming DataHack Summit and thought it might be worth sharing how it works, especially if you're experimenting with AI tools for content structuring or design.
What It Does
Gamma helps turn your content into presentations, webpages, documents, or social posts. It's less about heavy design control and more about speeding up the initial structure- useful if you have content but not a lot of time to format it.
How I Used It
Input: I dropped in all the rough notes and ideas around the event- no formatting, just raw text.
Format: Gamma lets you pick what you're making. I chose the Presentation output.
Theme Selection: There are a bunch of preset themes. I picked one that felt minimal and aligned with the GenAI vibe of the event.
Generation: Once I hit generate, it gave me a first draft of the deck.
Editing: From there, I could rearrange slides, rewrite sections, or export it in the format I needed.
Not every output was perfect, but for getting from messy notes to something visual quickly, it helped streamline the process. Might be worth exploring depending on the use case.
You can check out my presentation here.
A deep dive by Ahrefs found that AI Overviews drop top-ranking CTR by 34.5%. A must-read for anyone tracking SEO and zero-click search trends.
In this episode of Google DeepMind: The Podcast, RL pioneer David Silver explores how AI can surpass human-level performance without human data- using AlphaGo and AlphaZero as case studies. A powerful listen for anyone interested in the future of self-learning AI and artificial superintelligence.
If you're curious how reinforcement learning (RL) is powering the next leap in large language model reasoning, this article by Sebastian Raschka is a must-read. He dives into how RL (PPO, GRPO, RLVR) powers smarter LLMs like DeepSeek-R1 and OpenAI o3. A goldmine for devs and AI researchers.
Anthropic’s new command-line tool, Claude Code, is a powerful way to integrate agentic AI into your development workflow. This guide shows how to optimize workflows, automate tasks, and run multi-agent sessions.
Last week, we launched more free AI & ML courses to help you dive deeper into real-world applications and creative AI use cases-
Building Intelligent Chatbots using AI – Learn to build multimodal chatbots (text, voice, image) using OpenAI and LangChain. Design smart conversation flows, embed PDFs, and deploy bots- while following responsible AI practices from day one.
How to Build an Image Generator Web App with Zero Coding – Create and deploy your own AI image-generation app with zero coding. This beginner-friendly course covers the basics and walks you through building with no-code tools, step by step.
This week wasn’t just about faster or smarter AI - it was about making that power more accessible. From models that can run on your laptop to APIs that turn text into visuals, AI is becoming easier to build with, and easier to build on.
As the barriers keep dropping, the real question is shifting- from who gets to shape the future, to what kind of future we build when AI is in everyone’s hands.