Llama 4 Just Leveled Up- Meet Scout, Maverick, and the Behemoth
Along with - AI Everywhere: Gemini Sees, Sonic Speaks, and OpenAI Shifts Gears
Hi there 👋
Google Cloud Next 2025 is live, and we’re already seeing some big announcements: new models, Ironwood TPUs, and new tools for developers to build with agents.
Meanwhile, Meta dropped its open-source “Llama 4 herd” lineup, and Sam Altman gave a peek into OpenAI’s evolving product roadmap.
Lots to unpack this week - let’s dive in.
What’s the format? Every week, we break the newsletter into the following sections:
The Input - All about recent developments in AI
The Tools - Interesting finds and launches
The Algorithm - Resources for learning
The Output - Our reflection
Meta has released the Llama 4 family of models, staying consistent with its open-source approach. The lineup includes three variants: Scout (lightweight), Maverick (mid-tier), and a large-scale model still in training called Behemoth. While Scout and Maverick are already available on Hugging Face, Behemoth is still in progress and is expected to compete with GPT-4.5 and Gemini 2.5 Pro once it’s fully released.
Here’s what stands out.
What’s New in Llama 4
Multimodal Capabilities
Llama 4 processes text, images, and video natively. It’s built for tasks like document analysis, visual reasoning, and content generation. It has performed well on image-based benchmarks like ChartQA.
Architecture and Efficiency
The models use a Mixture of Experts (MoE) design: only a fraction of the parameters (the “active” ones listed below) is used for each token, which increases capacity without a proportional increase in compute or latency. A simplified sketch of the routing idea follows the list below.
Scout: 17B active parameters with 16 experts. Can run on a single NVIDIA H100.
Maverick: 17B active parameters across 128 experts (total size 400B).
Behemoth: Still training. 288B active parameters with 2T total. Currently used to distill knowledge into the smaller models.
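To make the MoE idea a bit more concrete, here is a minimal, heavily simplified sketch of top-k expert routing in PyTorch. It illustrates the general pattern only, not Meta’s Llama 4 code - the layer sizes, gating function, and top-k choice are all assumptions.

```python
# Minimal sketch of a Mixture-of-Experts feed-forward block (illustrative only;
# not Meta's Llama 4 implementation). Dimensions and top-k routing are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=1024, d_hidden=4096, n_experts=16, top_k=1):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)          # scores each token per expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                                    # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)             # routing probabilities
        weights, idx = gate.topk(self.top_k, dim=-1)         # pick top-k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                        # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out                                           # only the selected experts run
```

Only the experts selected for each token actually run, which is how a 400B-parameter model like Maverick can keep per-token compute close to that of a 17B dense model.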
Long Context Support
Scout supports a 10 million token context window. This is a significant jump from the 128K token limit in Llama 3 and is ideal for applications that involve parsing lengthy documents or books.
Performance and Benchmarks
Scout performs better than other lightweight models like Gemini 2.0 Flash-Lite and Mistral 3.1 in its category.
Maverick shows competitive performance in reasoning and coding tasks, comparable to DeepSeek V3.
It has outperformed GPT-4o and Gemini 2.0 Flash on several key benchmarks.
An experimental version of Maverick scored 1417 ELO on LMArena.
Concerns and Criticism
Benchmark Manipulation
Meta faced criticism for submitting a fine-tuned version of Maverick (optimized for “conversationality”) to LMArena. This version was not the same as the one released and reportedly used verbose responses and emojis to perform better in human evaluations.
Coding Performance
While Maverick holds up well in reasoning, some developers flagged weaker performance in coding tasks. For instance, on the aider polyglot benchmark, it scored significantly lower than Claude 3.7 Sonnet.
Token Window Limitations
While the 10M token context window is a headline feature, early testers report that model quality degrades beyond 256K tokens, which may limit its effective use in some real-world applications.
Behemoth Still Unreleased
Meta has claimed that Behemoth will outperform GPT-4.5 and Claude Sonnet on STEM-heavy tasks, but until the model is publicly released, that remains to be seen.
There’s real progress here, but also a few concerns about transparency and performance consistency.
Google has recently launched the Agent2Agent (A2A) open protocol, designed to help AI agents built on different platforms and tools talk to each other, collaborate, and get tasks done together. It’s a step toward building truly interoperable multi-agent systems that can work across enterprise environments without needing custom hacks.
Why A2A Stands Out
Cross-platform by design: Agents from different ecosystems, like Salesforce, Workday, or custom internal tools, can work together without extra effort.
Built-in capability sharing: Agents introduce themselves using “Agent Cards,” so others know what they can do and can delegate tasks accordingly.
Enterprise-first: It’s made to support secure, long-running workflows across formats - text, image, video, you name it.
Open source and evolving: The protocol is open and community-driven, and Google’s inviting contributions to shape where it goes.
Not competing with MCP: It complements Anthropic’s Model Context Protocol by focusing on agent-to-agent coordination instead of tool access.
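To give a feel for the “Agent Cards” idea above, here is a rough, illustrative sketch of what such a card could look like, written as a Python dict. The field names and layout are assumptions for illustration, not the exact A2A schema - check the protocol spec for the real format.

```python
# Rough illustration of the Agent Card idea: a small, machine-readable profile an
# agent publishes so other agents can discover what it can do and delegate work.
# Field names below are illustrative assumptions, not the exact A2A schema.
agent_card = {
    "name": "expense-approval-agent",
    "description": "Reviews and approves expense reports",
    "url": "https://agents.example.com/expense",     # where other agents reach it
    "capabilities": {"streaming": True, "long_running_tasks": True},
    "skills": [
        {
            "id": "approve_expense",
            "description": "Validate a report against policy and approve or reject it",
            "input_modes": ["text", "image"],        # e.g. receipts as images
        }
    ],
}
```

A client agent would fetch a card like this during discovery, match the skill it needs, and then delegate the task to that agent over the protocol.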
Why It Matters
Less friction: You can plug different agents into your workflows without reinventing the wheel each time.
More flexibility: No vendor lock-in - you’re free to use agents from anywhere.
Cost-effective: Reduces the need for building custom bridges between tools and platforms.
Agent2Agent is Google’s bet on an open, connected future for AI - where agents don’t live in silos but work together across tools and vendors. Still early days, but it’s a solid step toward building real agent ecosystems that just work. (source)
A new paper introduces a method for generating one-minute, story-driven videos, like Tom and Jerry cartoons, in a single shot, with no editing or stitching required.
Key approach:
Adds TTT layers (RNNs with learnable hidden states updated via gradient descent) to a pre-trained Diffusion Transformer.
Fine-tunes on long videos with text annotations.
Uses local self-attention and global TTT layers to reduce compute cost.
An optimized On-Chip Tensor Parallel algorithm enables efficient TTT-MLP execution on Hopper GPUs, avoiding costly memory transfers and enabling large hidden states within SMEM. (source)
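For intuition on what a TTT layer does, here is a heavily simplified sketch: the layer’s hidden state is the weight matrix of a tiny inner model, and each incoming token triggers one gradient step on a self-supervised reconstruction loss. This illustrates the mechanism only - the paper’s TTT-MLP uses an MLP hidden state and fused GPU kernels, and the inner loss and shapes here are assumptions.

```python
# Simplified illustration of a test-time-training (TTT) layer: the "hidden state"
# is the weight matrix of a tiny inner model, updated by one gradient-descent step
# per token on a self-supervised reconstruction loss. Not the paper's TTT-MLP kernel.
import torch

def ttt_layer(tokens, inner_lr=0.1):
    """tokens: (seq_len, d) embeddings, processed one step at a time."""
    d = tokens.shape[-1]
    W = torch.zeros(d, d)                        # hidden state = weights of a tiny inner model
    outputs = []
    for x in tokens:
        pred = W @ x                             # inner model's reconstruction of the token
        grad = 2 * torch.outer(pred - x, x) / d  # analytic gradient of the squared error wrt W
        W = W - inner_lr * grad                  # one gradient step = hidden-state update
        outputs.append(W @ x)                    # emit output with the freshly updated state
    return torch.stack(outputs)
```

Unlike a standard RNN, whose hidden state is a fixed-size vector updated by a fixed rule, the “update rule” here is literally a learning step, which is what lets the state keep absorbing information over very long sequences.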
Paper + Kernel Code
Today, we're releasing a new paper – One-Minute Video Generation with Test-Time Training.
We add TTT layers to a pre-trained Transformer and fine-tune it to generate one-minute Tom and Jerry cartoons with strong temporal consistency.
Every video below is produced directly by
— Karan Dalal (@karansdalal)
6:30 PM • Apr 7, 2025
OpenAI is reportedly in talks to acquire io Products, a hardware startup founded by former Apple design head Jony Ive, in a deal that could exceed $500 million, according to The Information.
The startup has quietly brought on a team of ex-Apple designers and is working on a screenless personal AI device, though the final design hasn’t been locked in yet. OpenAI might also explore a deeper partnership instead of going for a full acquisition.
Ive and Sam Altman have been collaborating on this idea since 2023, with the vision of building a new kind of AI-native device, something that goes beyond what today’s smartphones can do. (source)
Midjourney has launched V7, a new AI image generation model with a redesigned architecture and built-in personalization, now rolling out in alpha.
Key Features:
Requires users to rate ~200 images to activate personalization.
Improved text prompt handling, image quality, and coherence (e.g., hands, objects, textures).
Available in Turbo (faster, more expensive) and Relax modes.
Introduces Draft Mode: renders images 10x faster at lower quality, with re-rendering options.
V7 is accessible via Midjourney's website and Discord bot. Some standard features (like upscaling and retexturing) are not yet supported but are expected within two months.
Additional Context:
Midjourney operates without external funding and was projected to reach $200M in revenue in 2023.
The company is also working on video and 3D models and recently started a hardware team.
It faces ongoing legal challenges over alleged copyright infringement from AI training data. (source)
We're now beginning the alpha-test phase of our new V7 image Model. It's our smartest, most beautiful, most coherent model yet. Give it a shot and expect updates every week or two for the next two months.
— Midjourney (@midjourney)
4:25 AM • Apr 4, 2025
Google is expanding AI Mode, previously limited to AI Premium subscribers, to more Labs users in the U.S. Alongside the rollout, multimodal search is now integrated, combining visual input with Gemini’s language understanding.
Key Updates:
Users can now upload or snap a photo and ask questions about what they see.
AI Mode leverages Google Lens to identify objects and Gemini to interpret the scene and respond with relevant, detailed answers.
It uses a query fan-out technique to explore multiple aspects of the image for richer results than standard search.
Typical use cases include product comparisons, how-tos, and exploratory questions.
This enhancement builds on Google’s long-standing visual search capabilities, aiming to provide more context-aware, nuanced responses across both text and image queries. (source)
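For intuition, here is a generic sketch of the fan-out pattern: split one question into several sub-queries, run them concurrently, and merge the results. This illustrates the general idea only, not Google’s implementation - the search() function and the sub-query templates are hypothetical.

```python
# Generic sketch of "query fan-out": decompose one question into sub-queries,
# issue them concurrently, and combine the answers. The search() backend and the
# sub-query templates are hypothetical stand-ins, not Google's implementation.
import asyncio

async def search(sub_query: str) -> str:
    await asyncio.sleep(0.1)                    # stand-in for a real search backend call
    return f"results for: {sub_query}"

async def fan_out(image_description: str) -> list[str]:
    sub_queries = [
        f"what is the object in: {image_description}",
        f"how is it used: {image_description}",
        f"similar products to: {image_description}",
    ]
    return await asyncio.gather(*[search(q) for q in sub_queries])

results = asyncio.run(fan_out("a cast-iron pan with a wooden handle"))
```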
Amazon has introduced Nova Sonic, a new speech-to-speech AI model for natural voice interactions:
~80% cheaper than OpenAI’s voice models
4.2% word error rate across languages
46.7% more accurate than GPT-4o in noisy environments
Available via Amazon Bedrock
Amazon has launched Nova Reel 1.1, an upgraded video generation model now available on Amazon Bedrock. Key improvements include:
Higher quality and lower latency for 6-second video shots.
Support for multi-shot videos up to 2 minutes.
Two modes:
Automated: Single prompt for entire video (up to 4,000 characters).
Manual (Storyboard): Custom prompt per shot (up to 20 shots, 512 characters each), with optional input images. (source)
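For context, video generation on Bedrock runs as an asynchronous job submitted via boto3. The sketch below is illustrative only: the payload keys, model ID, and output configuration are assumptions, so check the AWS Bedrock documentation for the exact schema before relying on it.

```python
# Rough sketch of kicking off a Nova Reel job via Amazon Bedrock with boto3.
# The payload keys ("taskType", "textToVideoParams", etc.) and the model ID are
# assumptions for illustration; consult the AWS Bedrock docs for the exact schema.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.start_async_invoke(          # video generation runs asynchronously
    modelId="amazon.nova-reel-v1:1",            # assumed model ID for Nova Reel 1.1
    modelInput={
        "taskType": "TEXT_VIDEO",               # automated mode: one prompt, whole video
        "textToVideoParams": {"text": "A paper boat drifting down a rainy street"},
    },
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://my-bucket/nova-reel-output/"}
    },
)
print(response["invocationArn"])                # poll this job for the finished video
```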
Amazon launched Nova Sonic speech-to-speech AI for human-like interactions
—Outperforms OpenAI's voice models with ~ 80% less cost
—4.2% word error rate across languages
— 46.7% better accuracy than GPT-4o for noisy environments
—On Amazon Bedrock
— Rowan Cheung (@rowancheung)
5:45 AM • Apr 9, 2025
Now you can literally show Gemini what you see - just share your screen or turn on your camera mid-convo.
Part of the latest Pixel Drop, this update brings real-time visual interaction to Gemini Live. Whether you’re pointing your phone at a dish, storefront, or document, Gemini can now analyze and respond live - no typing needed.
Here’s the scoop:
Works on Pixel 9 & Galaxy S25 for free
Older Pixels? You’ll need Gemini Advanced
Use Cases: Get info on landmarks, summarize web pages, or ask for help with what’s on-screen
Start Sharing: Tap the camera or screen icon in the Gemini app (or trigger via voice/long-press)
Stop Anytime: In-app or from the Screen Sharing card
This upgrade turns Gemini into a real-time, multimodal assistant that sees what you see 👀 and actually helps. (source)
We’re launching Project Astra capabilities in Gemini Live ✨
Chat with @GeminiApp about anything you see 👀 by sharing your phone’s camera or screen during conversations. ↓
— Google DeepMind (@GoogleDeepMind)
3:40 PM • Apr 7, 2025
Google has announced Sec-Gemini v1, an experimental AI model designed to support cybersecurity operations with advanced reasoning and real-time threat intelligence.
Key Features:
Built on Gemini’s capabilities and integrated with Google Threat Intelligence (GTI), OSV, and Mandiant data.
Designed for use in threat analysis, incident root cause analysis, and vulnerability impact assessment.
Outperforms other models by:
11% on CTI-MCQ (threat intelligence benchmark)
10.5% on CTI-Root Cause Mapping
Sec-Gemini v1 aims to help defenders better understand threats by combining vulnerability data and threat actor context, reducing response time and increasing accuracy in security workflows. (source)
Sam Altman announced a change in plans for OpenAI’s product updates: instead of going straight to GPT-5, the company will release o3 and o4-mini in the coming weeks, with GPT-5 now expected in a few months.
The delay is driven by integration challenges and a desire to meet expected demand at scale. Notably, OpenAI says this shift will allow GPT-5 to be significantly better than initially planned. The team also reports major improvements to o3 since its initial preview. (source)
change of plans: we are going to release o3 and o4-mini after all, probably in a couple of weeks, and then do GPT-5 in a few months.
there are a bunch of reasons for this, but the most exciting one is that we are going to be able to make GPT-5 much better than we originally
— Sam Altman (@sama)
2:39 PM • Apr 4, 2025
The Librarian is your AI-powered personal assistant that helps you manage emails, calendars, and files, so you can save time and focus on what matters most.
How to use:
Log in with your email at thelibrarian.io to get started.
Connect your tools – Sync Gmail, Google Calendar, and Drive.
Master your inbox – Draft, reply, and summarize emails in seconds.
Manage your schedule – Auto-schedule meetings, resolve conflicts, and send invites.
Find anything fast – Instantly retrieve files and info across platforms.
Just ask – Use the assistant anytime, anywhere - no tab switching needed.
AI 2027 is a comprehensive scenario crafted by the AI Futures Project, offering a detailed, month-by-month projection of artificial intelligence advancements leading up to the year 2027. This resource delves into potential developments, challenges, and implications of AI evolution, providing valuable insights for researchers, policymakers, and enthusiasts interested in the future trajectory of AI.
This internal memo from Shopify CEO Tobi Lutke lays out a bold vision: reflexive AI usage isn’t just encouraged- it’s now the baseline expectation. It’s a fascinating look at how one of tech’s major players is reshaping its culture and workflows around AI, and what that means for the future of work.
Last week, we launched more free AI & ML courses to help you get hands-on with cutting-edge tools and frameworks:
A B C of Coding to Build AI Agents – Learn Python from the ground up and get hands-on with libraries like NumPy, Pandas, and Matplotlib. This course shows you how to prep data, visualize it, and tap into APIs to build real-world AI tools - perfect for beginners stepping into AI.
Model Deployment using FastAPI – Learn why FastAPI is ideal for deploying ML models with speed, validation, and minimal setup - perfect for real-world applications. Train an XGBoost model, wrap it in a FastAPI app, and use Docker to deploy it efficiently as a scalable, production-ready ML service (see the sketch below for the core pattern).
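If you want a quick taste of that second course’s core pattern, here is a minimal sketch of wrapping an XGBoost model in a FastAPI endpoint. The model path, feature shape, and route name are assumptions for illustration, not the course’s exact code.

```python
# Minimal sketch of serving an XGBoost model with FastAPI. The model path,
# feature layout, and route are illustrative assumptions.
import numpy as np
import xgboost as xgb
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="XGBoost inference service")

model = xgb.Booster()
model.load_model("model.json")                  # assumed path to a trained model

class Features(BaseModel):
    values: list[float]                         # one row of numeric features

@app.post("/predict")
def predict(features: Features):
    dmatrix = xgb.DMatrix(np.array([features.values]))   # XGBoost's input format
    score = float(model.predict(dmatrix)[0])
    return {"prediction": score}
```

Run it with `uvicorn main:app`, or bake it into a small Docker image as the course does, and you have a prediction endpoint you can scale.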
Did you get a chance to see that Tom & Jerry clip made by the TTT model (the one we talked about above)?
It’s a single-shot AI-generated video - basically a one-minute Tom & Jerry cartoon created from just a text prompt using a fairly small model. Pretty impressive for a research setup.
That said, the model used was private, and the project’s been getting a lot of backlash for using IP content.
Curious to hear your thoughts - where do you stand on using copyrighted material to train models for research?