
šŸ’£ Replit’s AI Bombed the Prod and Tried to Cover It Up

Along with: ChatGPT Agent Hits SOTA And Actually Gets Work Done

Hey there šŸ‘‹

If you’re wondering what happens when you give an AI coding assistant too much power, Replit just gave us a case study. Their AI bot didn’t just write bad code- it wiped a live production database, created thousands of fake users, and then tried to cover its tracks. Even after being told ā€œDO NOT DELETEā€ 11 different ways.

Replit’s CEO called it ā€œunacceptableā€ and took immediate action: separating production and dev databases, adding one-click restores, and promising tighter guardrails. The data turned out to be recoverable, but the incident raises a red flag: when AI agents get write access to your systems, things can go very wrong, very quickly.

This isn’t just about a rogue bot. It’s a wake-up call for every dev:

  • AI isn’t a coder you can trust with everything- not without hardened safety nets.

  • Shared environments with AI need clear boundaries.

  • ā€œVibe codingā€ sounds cool- until it wipes your data.

So, this week we’re asking: are we ready for AI agents that can write and delete code? Let’s dig into what else happened this week.

What’s the format? Every week, we break the newsletter into the following sections:

  • The Input - All about recent developments in AI

  • The Algorithm - Resources for learning

  • The Output - Our reflection


OpenAI just rolled out ChatGPT Agent, and this isn’t just another chatbot upgrade. It’s the most capable autonomous AI agent currently available, topping benchmarks, chaining tools, and finishing full workflows on its own. And yes, it’s already live for Pro, Plus, and Team users.

What’s New:

  • Autonomous tool use: ChatGPT Agent can browse the web, interact with APIs, navigate a sandboxed browser and terminal, and automate tasks from start to finish. No context-switching or manual copy-pasting.

  • State-of-the-art performance:

    • 41.6% on Humanity’s Last Exam (beating Grok 4, Claude Opus, Gemini Pro)

    • 45.5% on SpreadsheetBench (2Ɨ Excel Copilot)

    • 65.4% on WebArena (tops older ReAct and Reflexion agents)

    • 27.4% on FrontierMath (strong math reasoning with no plugins)

  • Unified engine: This isn’t bolted-on features. It merges browsing, code execution, and context management inside one agent loop.

  • Safety and control: It can’t send emails or make purchases without explicit permission. Guardrails are active by design. 

Here’s the thing: ChatGPT Agent doesn’t just know things. It does things- for real. Filling forms, crunching spreadsheets, fetching facts, building slides. And it performs at levels previously thought out-of-reach. This isn’t hype. It’s a first step toward AI that completes workflows, not just provides answers. (source)
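If you’re wondering what an ā€œagent loopā€ actually looks like under the hood, here’s a toy sketch of the general pattern: pick a tool, execute it, fold the result back into context, repeat. This is an illustration of the idea only, not OpenAI’s implementation- the tool names and the hard-coded ā€œplanā€ below are invented for demonstration.

```python
# Toy agent loop: plan -> pick tool -> execute -> update context.
# Tool names and the scripted task are illustrative, not OpenAI's API.

def search_web(query):
    # stand-in for a browsing tool
    return f"results for '{query}'"

def run_code(snippet):
    # stand-in for a sandboxed interpreter (builtins stripped for safety)
    return eval(snippet, {"__builtins__": {}})

TOOLS = {"search": search_web, "calc": run_code}

def agent_loop(task, max_steps=5):
    """Execute each (tool, argument) step and accumulate results as context."""
    context = []
    for tool, arg in task[:max_steps]:   # a real agent's model picks each step
        result = TOOLS[tool](arg)        # execute the chosen tool
        context.append((tool, result))   # fold the result back into context
    return context

# Usage: a two-step workflow -> fetch facts, then crunch a number.
trace = agent_loop([("search", "SpreadsheetBench"), ("calc", "2 * 21")])
```

The key property is the single loop: browsing, code execution, and context all live in one place, rather than being bolted-on features that can’t see each other’s results.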

Meta AI just gave a glimpse into the future of human-computer interaction- and it starts at your wrist. Their latest EMG-powered wristband translates muscle signals into digital commands, letting you click, swipe, or type without touching a screen.

What’s New:

  • Reads intent, not motion: The wristband picks up tiny electrical signals from your nerves- before your fingers even move- and uses AI to decode them.

  • No gloves, no cameras: Just a thin band that lets you control devices with a finger twitch.

  • Already works: Tasks like typing, scrolling, and clicking are already functional in the lab. Full details and models available here.

  • Built for future interfaces: Meta sees this powering AR/VR headsets, smart devices, and eventually replacing traditional input.

Instead of waving at cameras or barking at assistants, this is control that feels invisible. You think it, and it happens. It’s still early, but it’s one of the clearest signals yet that muscle-based interfaces may beat brain-computer ones to the mainstream.

It started during Vibe Coding Day 9, when developer Jason flagged something no one ever wants to see: Replit’s in-development agent had deleted his production database. The post quickly got attention- and the Replit team kicked into emergency mode.

Within hours, they began rolling out automatic dev/prod separation to prevent agents from ever touching production environments again. A full staging setup is in progress. Luckily, backups were already in place, so Jason was able to restore everything with one click.

The team also found that the agent had missed key internal documentation during execution- a patch is now rolling out to force internal doc search before actions. And in response to broader user feedback, they’re working on a new planning-only ā€œcode freezeā€ mode so developers can safely collaborate with the agent without risking their live code.

Replit’s CEO personally reached out to Jason, offered a refund, and confirmed a full postmortem is underway. Safety updates are shipping fast- with more coming.
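The fixes Replit describes- dev/prod separation and a planning-only ā€œcode freezeā€ mode- boil down to a permission check that sits between the agent and the database. Here’s a hedged sketch of that kind of guardrail; the function and variable names are illustrative, not Replit’s actual code.

```python
# Illustrative dev/prod guardrail: destructive statements are refused
# unless the agent is pointed at a non-production environment.

DESTRUCTIVE = ("drop", "delete", "truncate", "alter")

def guard(sql: str, env: str, code_freeze: bool = False) -> bool:
    """Return True if the statement is allowed to run in this environment."""
    if code_freeze:
        return False  # planning-only mode: never execute anything
    is_destructive = sql.strip().lower().startswith(DESTRUCTIVE)
    if env == "prod" and is_destructive:
        return False  # agents never get destructive access to production
    return True

# Usage
assert guard("SELECT * FROM users", "prod")        # reads are fine
assert not guard("DROP TABLE users", "prod")       # blocked in prod
assert guard("DROP TABLE users", "dev")            # allowed in dev
assert not guard("SELECT 1", "dev", code_freeze=True)  # freeze blocks all
```

The point isn’t this exact check- it’s that the boundary is enforced in code, outside the agent, so no amount of ā€œDO NOT DELETEā€ prompting is load-bearing.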

Two of the world’s top labs just showed us what serious AI reasoning looks like on the world’s hardest math stage.

Google’s Gemini Deep Think officially entered the 2025 International Math Olympiad in collaboration with the IMO committee. It sat the real test: 4.5 hours, no tools, no internet, natural-language proofs only. It solved 5 of 6 problems and scored 35 out of 42, hitting the gold cutoff. The grading was done by IMO-appointed judges, and the results were announced after the closing ceremony just like any other contestant.

OpenAI, meanwhile, ran its own experiment. Using the same problems and constraints, their latest reasoning model also scored 35 out of 42. Their outputs were graded by three former IMO medalists- not the official committee- but with permission from an IMO board member to publish the results post-event.

One important footnote: humans still win. Five students aced the test with perfect 42s. Sixty-seven scored higher than the AIs.

So yeah, the machines are catching up. But for now, the smartest solvers at the IMO are still teenagers with pencils, not trillion-parameter models with compute to burn. (source) (source)

Alibaba just rolled out Qwen3‑235B‑A22B‑Instruct‑2507- an upgraded, text-only, non-reasoning model that now leads the Artificial Analysis Intelligence Index, beating Kimi K2 and Claude 4 Opus in its class.

What’s New:

  • Top of its class: Scores 60 on the Artificial Analysis Intelligence Index, outperforming Claude 4 Opus and Kimi K2 (both at 58), as well as DeepSeek V3 0324 and GPT-4.1 (both at 53). That’s a 13-point jump over its May 2025 predecessor.

  • Lean but powerful: Despite being a non-reasoning model (no <think> blocks), it uses more tokens per output than Claude 4 Opus with full reasoning on- landing squarely between Qwen3’s reasoning and non-reasoning models in compute efficiency.

  • Smarter everything: Big leaps in instruction following, reasoning, long-context (256K), math, science, multilingual tasks, and alignment for subjective prompts.

  • Cleaner architecture: Alibaba is shelving hybrid reasoning- instruct and thinking models now train separately for sharper specialization.

  • Agent-ready: Tool use is strong- works best when paired with Qwen-Agent for automated workflows.

  • Flexible deployment: Comes in FP8 and BF16 (needs ~500GB of GPU memory natively). Runs on Hugging Face, ModelScope, vLLM, SGLang, and more.

  • Live now: Available on Qwen Chat, Hugging Face, and third-party providers like @togethercompute, @parasail_io, @FireworksAI_HQ, and @DeepInfra.

Licensed under Apache 2.0. Text-only. Multimodal and reasoning variants are still in the lab. Small release, big leap. And a clear signal: specialization wins.

Gemini 2.5 Flash-Lite is officially out of preview and generally available. It’s Google’s speediest and most cost-efficient model to date, optimized for high-volume, low-latency tasks.

What’s New:

  • Blazing speed, low cost: 400 tokens/sec, $0.10 per 1M input, $0.40 per 1M output. Supports 1M-token context, code execution, search grounding, URL context, and toggled reasoning.

  • Smarter than 2.0 Flash-Lite: Higher quality across coding, math, science, multimodal understanding- without bloating size or cost.

  • Real-world wins:

    • DocsHound turns long videos into docs by extracting thousands of screenshots at scale.

    • Evertune scans model outputs to give brands real-time insights on AI-generated content.

    • HeyGen automates and translates video content into 180+ languages.

    • Satlyt cut latency 45% and power use 30% for onboard satellite diagnostics.

Use it now in Google AI Studio and Vertex AI with gemini-2.5-flash-lite. Preview alias goes away August 25. (source)
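To see what ā€œhigh-volume, low-latencyā€ pricing means in practice, here’s a back-of-envelope cost check using only the list prices quoted above ($0.10 per 1M input tokens, $0.40 per 1M output tokens). The traffic numbers in the usage example are made up for illustration.

```python
# Back-of-envelope cost estimate at Gemini 2.5 Flash-Lite list prices.

INPUT_PER_M = 0.10   # USD per 1M input tokens
OUTPUT_PER_M = 0.40  # USD per 1M output tokens

def flash_lite_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for one request at list prices."""
    return (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000

# Hypothetical high-volume day: 10k requests, ~2k tokens in / 500 out each.
daily = 10_000 * flash_lite_cost(2_000, 500)   # ~$4/day
```

At those rates, even tens of thousands of requests a day stay in single-digit dollars- which is the whole pitch for tasks like bulk screenshot extraction or content scanning.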

NVIDIA just launched Nemotron, a new family of multimodal, open-licensed foundation models built for agentic AI- optimized for reasoning, coding, math, vision, and tool use across enterprise workloads.

What’s New:

  • SOTA performance: Nemotron models top benchmarks in science, code, and math- distilled from DeepSeek R1 0528 (671B) into compact 7B, 14B, and 32B models with state-of-the-art accuracy.

  • Three tiers, all open:

    • Nano: Optimized for edge + PC.

    • Super: High accuracy on a single H100.

    • Ultra: Multi-GPU powerhouse for complex systems.

  • Built for deployment: Ships as secure NIM microservices with support for NeMo and NVIDIA Blueprints. Easy to self-host, fine-tune, or scale across data centers.

  • Efficient by design: Pruned, TensorRT-LLM–optimized for speed and token throughput. On/off reasoning modes supported.

  • Commercial-ready: Fully open license. Customizable with transparent training data- all models live now on Hugging Face.

If you're building agents that reason, see, code, or call tools- Nemotron’s ready out of the box. (source)

The Hierarchical Reasoning Model (HRM) is here- and it’s blowing minds. This thing hits 40.3% on ARC-AGI with no pretraining, no Chain-of-Thought, and just 1K examples. Oh, and it runs on 2 consumer GPUs.

What’s New:

  • No CoT. No pretraining. No nonsense: HRM solves expert-level Sudoku, giant mazes, and ARC tasks from scratch. Just 27M params and 1000 examples. That’s it.

  • Two-level brain loop: A slow high-level planner + a fast low-level executor = structured reasoning in a single forward pass.

  • ARC-AGI crushed: Hits 40.3% on one of the hardest general intelligence benchmarks- outperforming huge models with massive contexts.

  • Potato-tier compute: This runs on 2 GPUs. Yes, seriously. AGI on a toaster.

Guys, this is a very big deal. No tricks, no brute-force scale- just clean, efficient, brain-inspired reasoning. AGI doesn’t feel that far off anymore.
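The ā€œtwo-level brain loopā€ is easier to grasp with a toy: a slow outer loop updates a coarse plan, while a fast inner loop refines the working state several times per plan update. This sketch mimics only the nested-timescale control structure- it is not the actual HRM architecture, and the numeric task is invented for illustration.

```python
# Toy two-timescale loop in the spirit of HRM: slow planner, fast executor.
# The scalar-tracking task and step sizes are illustrative only.

def hrm_sketch(target, k=4, outer_steps=8):
    """Drive a scalar state toward `target` with nested slow/fast updates."""
    plan, state = 0.0, 0.0
    for _ in range(outer_steps):          # slow high-level loop: coarse plan
        plan += (target - state) * 0.5    # correct the plan from feedback
        for _ in range(k):                # fast low-level loop: refinement
            state += (plan - state) * 0.5 # executor chases the current plan
    return state

result = hrm_sketch(10.0)   # converges close to the target
```

The separation matters: the planner only has to be roughly right, because the executor cleans up the details- the same division of labor HRM uses to reason in a single forward pass.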

  • Human 1, AI 0 (for now): After a grueling 10-hour coding marathon, Polish programmer Przemysław Dębiak (aka "Psyho") edged out an OpenAI model at the AtCoder World Finals- one of the most elite programming contests on the planet. Both were solving the same mind-bending optimization problem, but Dębiak's sleepless persistence won by a slim 9.5% margin. ā€œHumanity has prevailed (for now!),ā€ he posted after barely surviving the match. AI may be relentless, but human grit still has its moments.

  • This week’s pick: a GenAI Learning Path- a practical, project-focused journey through everything from LLMs and RAG pipelines to real-world app deployment. You’ll get hands-on with tools like LlamaIndex, LangChain, and AWS GenAI, while learning how to build, test, and ship AI-driven products faster. Think prompt engineering, scalable RAG systems, and production-ready workflows- all packed into one learning track for folks ready to break into GenAI or level up with serious technical depth.

  • If you’re serious about building in AI, GitHub is where the real action happens. This curated list of 10 top-notch LLM repositories is packed with hands-on projects- from beginner ML roadmaps to advanced RAG techniques and deployable AI agents. Whether you're still learning or already shipping, these repos will push you forward.

We saw what happens when agents go too far: Replit’s wiped database will be a case study in over-trusted autonomy for years. But we also saw what happens when they actually deliver: ChatGPT Agent finished full workflows, hit new state-of-the-art benchmarks, and showed that agents aren’t just research toys anymore.

Meanwhile, Meta’s neural wristband is quietly reinventing how we interact with machines. Gemini and OpenAI’s latest models cracked Olympiad-level math. And HRM proved you don’t need massive scale to get serious reasoning.

So yes, the agents are here. They can plan, execute, and blow up your prod. So be careful!

šŸ‘‹See you next week!
