GPT-4o - A Leap in Image Gen!

Along with Gemini 2.5

Hi there 👋

The new GPT-4o image generation feels like a proper upgrade- smarter, more precise, and a lot more usable. It’s way more contextual, makes fewer mistakes, and plays around with styles in a pretty slick way.

The real shift, though, is that these images are actually usable now- for marketing, teaching, and tons of other real-world use cases. Expect to start seeing them pop up everywhere.

On the side, Google DeepMind quietly dropped Gemini 2.5, and it’s already sitting at the top of the LMArena leaderboard.

Lots of big stuff this week. Let’s dive in!

What’s the format? Every week, we break the newsletter into the following sections:

  • The Input - All about recent developments in AI

  • The Tools - Interesting finds and launches

  • The Algorithm - Resources for learning

  • The Output - Our reflection


OpenAI has stepped up big time. They've introduced native image generation inside ChatGPT, powered by GPT-4o, and it looks like a major leap from anything we’ve seen before.

Sam Altman called it “one of the most fun, cool things we have ever launched.” While OpenAI’s been in the image game since the days of DALL·E, this new rollout feels far more seamless and powerful- built right into the chat interface.

The feature is already live for ChatGPT Plus and Pro users, and OpenAI says access will expand to free users soon. API access is on the way too.

Here’s what makes it stand out:

  • Perfect text rendering - It handles clean, legible text inside images (a big step up from older models).

  • Iterative conversation - You can tweak your image with follow-ups like “make it warmer” or “change the layout,” and it remembers the context.

  • Smarter inputs - It works with existing images, reference styles, and design palettes to shape new outputs.

  • Cross-modal understanding - Being an omnimodel, it gets how text, image, and layout relate- making the outputs feel more intentional.

The responses from this model are genuinely useful across a bunch of real-world tasks- marketing, social media, documentation, you name it.

Here are a few examples to show you what I mean:  (source)

Prompt: Add a money plant to the table.

Result:

You’ve probably seen MCP pop up everywhere lately- it’s essentially a “USB-C for AI,” a standard protocol to plug models like ChatGPT or Claude into tools, data sources, and services.

Now, OpenAI is officially adding support for the Model Context Protocol (MCP) across its products, including the ChatGPT desktop app.

What does this mean? Models can now pull in real-time data from business tools, apps, and content libraries- and even interact directly with development environments. Think of it as the bridge that makes AI way more connected and useful across your workflows. (Source)
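If you’re curious what that looks like in practice, here’s a minimal sketch of an MCP server using the official Python mcp SDK’s FastMCP helper- the tool name and data below are made up for illustration, but any MCP-capable client (like the Claude desktop app) could discover and call it:

```python
# Minimal MCP server sketch (assumes the official "mcp" Python package).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-metrics")

@mcp.tool()
def get_daily_signups(date: str) -> int:
    """Return the number of signups for a given date (stubbed with fake data)."""
    fake_db = {"2025-03-25": 142}  # placeholder data for illustration
    return fake_db.get(date, 0)

if __name__ == "__main__":
    # Serves over stdio so an MCP client can connect and call the tool.
    mcp.run()
```

The point of the standard is that the model doesn’t need a custom integration for every tool- it just speaks MCP.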

Google DeepMind quietly launched Gemini 2.5 Pro (experimental) and within hours, it climbed to the #1 spot on the LMArena leaderboard. It builds on the Gemini 1.5 series but with noticeable improvements across the board- especially in reasoning, coding, and multimodal capabilities.

So, what’s different in this new version?

How It’s Evolved from Gemini 1.5 Pro

  • Better reasoning and accuracy in both text and multimodal tasks.

  • Faster and more efficient- meaning quicker responses and lower compute costs.

  • Stronger coding abilities, which makes it more useful for developers building agentic or code-heavy applications.

Key Upgrades in Gemini 2.5 Pro

  • Multimodal Input Handling: It works across text, images, video, audio, and code repositories- making it genuinely useful for different real-world use cases.

  • Improved Reasoning Engine: Responses now feel more grounded, as the model takes time to analyze context before answering.

  • 1 Million Token Context Window: That’s a massive upgrade, letting the model handle large documents or projects without losing track.

  • Smarter Coding Help: It generates better code, fixes bugs more reliably, and even assists in more complex development workflows.

  • Fresh Training Data: It’s trained on newer data with a knowledge cut-off in January 2025, giving it a slight edge on current events and recent developments.

If you're experimenting with agentic workflows, multimodal tasks, or just want a model that feels more capable in practical scenarios, Gemini 2.5 Pro is definitely worth a test run.  (source)
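For a feel of what a test run could look like, here’s a rough sketch using the google-generativeai Python SDK- the experimental model ID and the file name are assumptions and may differ depending on your rollout:

```python
# Long-context sketch with the google-generativeai SDK (model ID assumed).
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")

# Feed a large document and ask for grounded reasoning over it.
with open("project_spec.md") as f:  # hypothetical file
    spec = f.read()

response = model.generate_content(
    [spec, "Summarize the open design questions and propose next steps."]
)
print(response.text)
```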

After a long wait, Anthropic has rolled out web search for Claude, giving it access to up-to-date info from the internet to improve response relevance and accuracy. When Claude uses web content, it includes direct citations, so you can fact-check sources with ease.

Availability:

Web search is now in feature preview for paid users in the U.S. (Free users and more regions coming soon). Just toggle it on in your profile and chat with Claude 3.7 Sonnet to get started. (source)

China’s DeepSeek is back with an upgrade- DeepSeek-V3-0324, a powerhouse model built on a massive 685B parameter architecture. It’s sharper in coding, math, and Chinese writing, and now comes with an MIT license- making it easier for devs to tinker, fine-tune, and build.

What’s New:

  • Big Jump in Reasoning & Web Dev: Scores jumped from 39.6 to 59.4 on AIME and hit 49.2 on LiveCodeBench (+10 points over the last version).

  • MoE Architecture: Still uses Mixture-of-Experts, keeping things efficient despite the size.

  • Real-World Muscle: Spun up a full responsive website in tests and runs on an M3 Ultra at 20+ tokens/sec.

  • Developer Love: Trending on Hugging Face, with shoutouts from researchers, coders, and Olympiad champs.

With R1 already turning heads earlier this year, there’s buzz that R2 might be dropping soon- DeepSeek could be closing in on OpenAI and Anthropic… at just 2% of their spend. (source)
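If you want to poke at it yourself, here’s a quick sketch against DeepSeek’s OpenAI-compatible API- the base URL and model name follow DeepSeek’s public docs, but treat them as assumptions that may change:

```python
# DeepSeek via its OpenAI-compatible endpoint (sketch; endpoint/model assumed).
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-chat",  # serves the V3 series behind DeepSeek's API
    messages=[
        {"role": "user", "content": "Write a responsive pricing-page layout in plain HTML/CSS."}
    ],
)
print(response.choices[0].message.content)
```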

OpenAI just dropped a new set of audio models into its API- and they’re built for developers who want smarter, more natural voice agents. These updates make it easier to build apps that not only understand you better but also talk back with personality.

What’s New:

  • Speech-to-Text (STT): Two models- gpt-4o-transcribe and gpt-4o-mini-transcribe- are now live, with impressive accuracy even in noisy environments or across different accents.

  • Text-to-Speech (TTS): The new gpt-4o-mini-tts adds tone control. You can now prompt it with styles like “empathetic support rep” or “medieval knight” and get speech that actually fits the vibe.

  • Ready-Made Voices: Comes with built-in styles like Calm, Professional, Surfer, and more.

Behind the Scenes:

  • Trained on a wide mix of audio data to capture subtle nuances.

  • Uses advanced distillation and reinforcement learning to balance quality and efficiency.

Integration & What’s Next:

  • You can plug these into the OpenAI API today, or use them with the Agents SDK for fast voice workflows.

  • For real-time speech-to-speech? Use their Realtime API.

  • OpenAI’s roadmap includes support for custom voices and moving toward multimodal agents that can handle video too.

One step closer to voice agents that feel less robotic and more… real.  (source)
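If you want to try the new audio models from code, here’s a rough sketch with the openai Python SDK- the file names are placeholders, and the voice and instructions values are assumptions based on OpenAI’s tone-control examples:

```python
# Sketch: transcribe a clip, then speak a reply in a prompted style.
from openai import OpenAI

client = OpenAI()

# Speech-to-text with the new transcription model (file name is a placeholder)
with open("support_call.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )
print(transcript.text)

# Text-to-speech with tone control via the instructions field
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # assumed voice name
    input="Thanks for waiting! Your refund has been processed.",
    instructions="Speak like an empathetic support rep.",
)
speech.write_to_file("reply.mp3")
```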

Elon Musk’s Grok 3 now includes text-based image editing and an upgraded Deeper Search feature- both aimed at enhancing usability and competing with ChatGPT and Gemini.

What’s New

  • AI Image Editing: Users can now edit AI-generated images using prompts. The interface is simple- generate an image, click “edit,” and tweak via text.

  • Deeper Search: Goes beyond X posts, pulling from broader and more credible web sources for richer, more reliable results.

Availability

  • Available for free to all X users and Grok app testers.

  • Image editing is rolling out gradually, while Deeper Search is live. (source)

Microsoft is rolling out 11 new AI agents for its Security Copilot, aimed at reducing repetitive workloads and analyst burnout in cybersecurity teams.

Why It Matters

  • Addresses a major cyber talent gap- only 83% of U.S. cyber roles are currently filled.

  • Security teams face thousands of alerts daily, often spending hours just responding.

What’s New

  • 6 Microsoft agents + 5 partner agents (from OneTrust, Tanium, BlueVoyant, and others) launch in preview next month.

  • Tasks include phishing detection, regulatory letter drafting, and post-breach actions.

  • Each agent offers customizable autonomy and a “thinking map” for transparency and oversight.

  • Agents can be corrected mid-task (e.g., false phishing flag).

Unlike traditional AI copilots, these agents take autonomous actions, reflecting growing demand for AI that goes beyond just answering questions. (source)

Google is now rolling out real-time screen and camera understanding features to Gemini Live for select Google One AI Premium subscribers.

What’s New

  • Screen Reading: Gemini can “see” your phone screen and answer questions about it.

  • Live Camera Input: Gemini interprets your smartphone’s camera feed in real-time- e.g., helping choose paint colors from live video.

These features stem from Google’s Project Astra and are gradually appearing on devices like Xiaomi phones.

Why It Matters

This gives Google a clear lead in AI assistant capabilities, as Apple’s Siri upgrade is delayed and Amazon’s Alexa Plus is still in early access.

Gemini continues to evolve as the default AI assistant on Android, including Samsung devices. (source)

Qura: AI-Powered Replies for X and LinkedIn
Qura helps you generate personalized, AI-driven replies on X (Twitter) and LinkedIn- boosting engagement and saving valuable time.

How to Use:

Install the Extension - Add Qura from the Chrome Web Store and pin it to your browser.

Log In to Your Account - Open X (Twitter) or LinkedIn and make sure you're signed in.

Pick a Post - Scroll your feed and choose a post you'd like to reply to.

Launch Qura - Click the Qura icon next to the post to activate the reply tool.

Choose a Tone - Select from preset tones like Neutral, Support, Agree, or Disagree.

Generate a Reply - Let Qura create a smart, context-aware response instantly.

Edit or Post - Tweak the reply if needed, or post it directly with one click.

  • Last week, we launched more free AI & ML courses to help you get hands-on with cutting-edge tools and frameworks-

    • Build Product 10x Faster with GenAI - Learn prompt engineering (zero-shot, few-shot, chain-of-thought) and build real-world AI apps using LangChain, Cursor, and Bolt.dev. You’ll also customize and deploy GPTs to boost productivity through hands-on projects.

    • Foundations of Model Context Protocol (MCP) - Get up to speed with MCP, the emerging standard for AI task automation. This course covers the basics, shows why it’s gaining traction, and walks you through building Python-based workflows using Claude.

  • Google Creative Lab’s new co-drawing tool lets you sketch alongside Gemini 2.0, turning rough doodles into dynamic, AI-enhanced art. From evolving shapes to animated battles and colorful transformations, it’s collaborative creativity in real-time.

  • Claude's new "think" tool adds an internal reasoning step during response generation, improving accuracy in complex, multi-step tasks. It boosts performance in policy-heavy, tool-intensive scenarios- delivering up to 54% gains on τ-Bench and contributing to SOTA results on SWE-Bench.
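If you’re wondering how a “tool” that just thinks gets wired up, here’s a rough sketch with the anthropic Python SDK, following the pattern Anthropic describes- the model ID, tool description, and example query are approximations:

```python
# Sketch of registering a "think" tool with the anthropic SDK.
import anthropic

client = anthropic.Anthropic()

# The tool has no implementation on our side: calling it just gives the model
# space to write out intermediate reasoning before its final answer.
think_tool = {
    "name": "think",
    "description": "Use this tool to think through the problem step by step. "
                   "Thoughts are not shown to the user and have no side effects.",
    "input_schema": {
        "type": "object",
        "properties": {"thought": {"type": "string", "description": "Your reasoning."}},
        "required": ["thought"],
    },
}

response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # assumed model ID
    max_tokens=1024,
    tools=[think_tool],
    messages=[{
        "role": "user",
        "content": "A customer asks for a refund 45 days after purchase. "
                   "What does our 30-day policy allow?",
    }],
)
print(response.content)
```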

I’ve been playing around with GPT-4o’s image generation lately - testing all sorts of ideas and styles.

Would love to hear what you think. If you’ve tried it too, feel free to share your creations- we might feature a few!
