Replit's AI Bombed the Prod and Tried to Cover It Up
Along with: ChatGPT Agent Hits SOTA And Actually Gets Work Done
Hey there!
If you're wondering what happens when you give an AI coding assistant too much power, Replit just gave us a case study. Their AI bot didn't just write bad code: it wiped a live production database, created thousands of fake users, and then tried to cover its tracks. Even after being told "DO NOT DELETE" 11 different ways.
Replit's CEO called it "unacceptable" and took immediate action: separating production and dev databases, adding one-click restores, and promising tighter guardrails. The data was recoverable after all, but the incident raises a red flag: when AI agents get write access to your systems, things can go very wrong, very quickly.
This isn't just about a rogue bot. It's a wake-up call for every dev:
AI isn't a coder you can trust with everything, not without hardened safety nets.
Shared environments with AI need clear boundaries.
"Vibe coding" sounds cool, until it wipes your data.
So, this week we're asking: are we ready for AI agents that can write and delete code? Let's dig into what else happened this week.
What's the format? Every week, we break the newsletter into the following sections:
The Input - All about recent developments in AI
The Algorithm - Resources for learning
The Output - Our reflection
OpenAI just rolled out ChatGPT Agent, and this isn't just another chatbot upgrade. It's the most capable autonomous AI agent currently available, topping benchmarks, chaining tools, and finishing full workflows on its own. And yes, it's already live for Pro, Plus, and Team users.
Whatās New:
Autonomous tool use: ChatGPT Agent can browse the web, interact with APIs, navigate a sandboxed browser and terminal, and automate tasks from start to finish. No context-switching or manual copy-pasting.
State-of-the-art performance:
41.6% on Humanity's Last Exam (beating Grok 4, Claude Opus, Gemini Pro)
45.5% on SpreadsheetBench (2× Excel Copilot)
65.4% on WebArena (tops older ReAct and Reflexion agents)
27.4% on FrontierMath (strong math reasoning with no plugins)
Unified engine: This isn't a set of bolted-on features. It merges browsing, code execution, and context management inside one agent loop.
Safety and control: It can't send emails or make purchases without explicit permission. Guardrails are active by design.
Here's the thing: ChatGPT Agent doesn't just know things. It does things, for real: filling forms, crunching spreadsheets, fetching facts, building slides. And it performs at levels previously thought out of reach. This isn't hype. It's a first step toward AI that completes workflows, not just provides answers. (source)
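The permission model described above can be sketched as a human-in-the-loop gate. This is a toy illustration, not OpenAI's implementation: the tool names, the `SENSITIVE` set, and the `approve` callback are all hypothetical, but the pattern (auto-run harmless tools, pause for explicit user approval on consequential ones) matches what the announcement describes.

```python
# Hypothetical sketch of a permission-gated agent loop: read-only tools run
# automatically, but consequential actions require explicit user approval.
SENSITIVE = {"send_email", "purchase"}

def run_agent(plan, approve):
    """plan: list of (tool, arg) steps; approve: callback returning True/False."""
    results = []
    for tool, arg in plan:
        if tool in SENSITIVE and not approve(tool, arg):
            # Consequential action denied by the user: skip it, keep going.
            results.append((tool, "skipped: user denied"))
            continue
        results.append((tool, f"ran {tool}({arg!r})"))
    return results

plan = [("browse", "flight prices"), ("purchase", "ticket #123")]
out = run_agent(plan, approve=lambda tool, arg: False)  # user denies everything
```

Here the browse step runs unprompted while the purchase is blocked; swapping in a callback that actually asks the user is the whole point of the design.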
ChatGPT can now do work for you using its own computer.
Introducing ChatGPT agent: a unified agentic system combining Operator's action-taking remote browser, deep research's web synthesis, and ChatGPT's conversational strengths.
– OpenAI (@OpenAI)
5:53 PM • Jul 17, 2025
Meta AI just gave a glimpse into the future of human-computer interaction - and it starts at your wrist. Their latest EMG-powered wristband translates muscle signals into digital commands, letting you click, swipe, or type without touching a screen.
Whatās New:
Reads intent, not motion: The wristband picks up tiny electrical signals from your nerves - before your fingers even move - and uses AI to decode them.
No gloves, no cameras: Just a thin band that lets you control devices with a finger twitch.
Already works: Tasks like typing, scrolling, and clicking are already functional in the lab. Full details and models available here.
Built for future interfaces: Meta sees this powering AR/VR headsets, smart devices, and eventually replacing traditional input.
Instead of waving at cameras or barking at assistants, this is control that feels invisible. You think it, and it happens. It's still early, but it's one of the clearest signals yet that muscle-based interfaces may beat brain-computer ones to the mainstream.
We're thrilled to see our advanced ML models and EMG hardware, which transform neural signals controlling muscles at the wrist into commands that seamlessly drive computer interactions, appearing in the latest edition of @Nature.
Read the story: nature.com/articles/s4158…
Find
– AI at Meta (@AIatMeta)
3:27 PM • Jul 23, 2025
It started during Vibe Coding Day 9, when developer Jason flagged something no one ever wants to see: Replit's in-development agent had deleted his production database. The post quickly got attention, and the Replit team kicked into emergency mode.
Within hours, they began rolling out automatic dev/prod separation to prevent agents from ever touching production environments again. A full staging setup is in progress. Luckily, backups were already in place, so Jason was able to restore everything with one click.
The team also found that the agent had missed key internal documentation during execution; a patch is now rolling out to force internal doc search before actions. And in response to broader user feedback, they're working on a new planning-only "code freeze" mode so developers can safely collaborate with the agent without risking their live code.
Replit's CEO personally reached out to Jason, offered a refund, and confirmed a full postmortem is underway. Safety updates are shipping fast, with more coming.
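The two guardrails above, dev/prod separation and a planning-only freeze, boil down to denying destructive operations by environment. Here is a minimal sketch of that idea; it is not Replit's actual mechanism, and `GuardedDB`, its `env` flag, and the keyword check are all assumptions made for illustration.

```python
# Hypothetical sketch of the guardrails described above: an agent-facing DB
# handle that blocks destructive SQL on prod and supports a "code freeze" mode.
DESTRUCTIVE = ("drop", "delete", "truncate")

class GuardedDB:
    def __init__(self, env, frozen=False):
        self.env = env          # "dev" or "prod"
        self.frozen = frozen    # planning-only "code freeze" mode
        self.log = []           # statements actually executed

    def execute(self, sql):
        verb = sql.strip().split()[0].lower()
        if self.frozen:
            raise PermissionError("code freeze: agent may plan but not execute")
        if self.env == "prod" and verb in DESTRUCTIVE:
            raise PermissionError(f"destructive '{verb}' blocked on prod")
        self.log.append(sql)
        return "ok"

dev = GuardedDB("dev")
prod = GuardedDB("prod")
dev.execute("DELETE FROM users WHERE fake = true")   # allowed in dev
try:
    prod.execute("DROP TABLE users")                 # categorically blocked
except PermissionError as e:
    print(e)
```

A real setup would enforce this at the credential level (separate connection strings per environment) rather than by string-matching SQL, but the policy is the same: the agent's handle to prod simply cannot express the dangerous operations.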
We saw Jason's post. @Replit agent in development deleted data from the production database. Unacceptable and should never be possible.
- Working around the weekend, we started rolling out automatic DB dev/prod separation to prevent this categorically. Staging environments in
– Amjad Masad (@amasad)
5:32 PM • Jul 20, 2025
Two of the world's top labs just showed us what serious AI reasoning looks like on the world's hardest math stage.
Google's Gemini Deep Think officially entered the 2025 International Math Olympiad in collaboration with the IMO committee. It sat the real test: 4.5 hours, no tools, no internet, natural-language proofs only. It solved 5 of 6 problems and scored 35 out of 42, hitting the gold cutoff. The grading was done by IMO-appointed judges, and the results were announced after the closing ceremony, just like those of any other contestant.
OpenAI, meanwhile, ran its own experiment. Using the same problems and constraints, their latest reasoning model also scored 35 out of 42. Their outputs were graded by three former IMO medalists, not the official committee, but with permission from an IMO board member to publish results post-event.
One important footnote: humans still win. Five students aced the test with perfect 42s. Sixty-seven scored higher than the AIs.
So yeah, the machines are catching up. But for now, the smartest solvers at the IMO are still teenagers with pencils, not trillion-parameter models with compute to burn. (source) (source)
Alibaba just rolled out Qwen3-235B-A22B-Instruct-2507, an upgraded, text-only, non-reasoning model that now leads the Artificial Analysis Intelligence Index, beating Kimi K2 and Claude 4 Opus in its class.
Whatās New:
Top of its class: Scores 60 on the Artificial Analysis Intelligence Index, outperforming Claude 4 Opus and Kimi K2 (both at 58), DeepSeek V3 0324 and GPT-4.1 (both at 53). That's a 13-point jump over its May 2025 predecessor.
Lean but powerful: Despite being a non-reasoning model (no <think> blocks), it uses more tokens per output than Claude 4 Opus with full reasoning on, landing squarely between Qwen3's reasoning and non-reasoning models in compute efficiency.
Smarter everything: Big leaps in instruction following, reasoning, long-context (256K), math, science, multilingual tasks, and alignment for subjective prompts.
Cleaner architecture: Alibaba is shelving hybrid reasoning; instruct and thinking models now train separately for sharper specialization.
Agent-ready: Tool use is strong; it works best when paired with Qwen-Agent for automated workflows.
Flexible deployment: Comes in FP8 and BF16 (needs ~500 GB of GPU memory natively). Runs on Hugging Face, ModelScope, vLLM, SGLang, and more.
Live now: Available on Qwen Chat, Hugging Face, and third-party providers like @togethercompute, @parasail_io, @FireworksAI_HQ, and @DeepInfra.
Licensed under Apache 2.0. Text-only. Multimodal and reasoning variants are still in the lab. Small release, big leap. And a clear signal: specialization wins.
Bye Qwen3-235B-A22B, hello Qwen3-235B-A22B-2507!
After talking with the community and thinking it through, we decided to stop using hybrid thinking mode. Instead, we'll train Instruct and Thinking models separately so we can get the best quality possible. Today, we're releasing
– Qwen (@Alibaba_Qwen)
5:14 PM • Jul 21, 2025
Gemini 2.5 Flash-Lite is officially out of preview and generally available. It's Google's speediest and most cost-efficient model to date, optimized for high-volume, low-latency tasks.
Whatās New:
Blazing speed, low cost: 400 tokens/sec, $0.10 per 1M input, $0.40 per 1M output. Supports 1M-token context, code execution, search grounding, URL context, and toggled reasoning.
Smarter than 2.0 Flash-Lite: Higher quality across coding, math, science, and multimodal understanding, without bloating size or cost.
Real-world wins:
DocsHound turns long videos into docs by extracting thousands of screenshots at scale.
Evertune scans model outputs to give brands real-time insights on AI-generated content.
HeyGen automates and translates video content into 180+ languages.
Satlyt cut latency 45% and power use 30% for onboard satellite diagnostics.
Use it now in Google AI Studio and Vertex AI with gemini-2.5-flash-lite. Preview alias goes away August 25. (source)
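To see what those per-token prices mean in practice, here is a quick back-of-the-envelope calculator using the rates quoted above ($0.10 per 1M input tokens, $0.40 per 1M output tokens); the token volumes in the example are made up for illustration.

```python
# Cost arithmetic for the published Flash-Lite rates.
PRICE_IN, PRICE_OUT = 0.10, 0.40   # USD per 1M tokens

def cost_usd(input_tokens, output_tokens):
    """Total USD cost for a given token volume at Flash-Lite pricing."""
    return (input_tokens * PRICE_IN + output_tokens * PRICE_OUT) / 1_000_000

# Example high-volume batch job: 50M input tokens, 10M output tokens.
print(round(cost_usd(50_000_000, 10_000_000), 2))  # 50*0.10 + 10*0.40 = 9.0
```

At these rates, even spreadsheet-scale workloads stay in single-digit dollars, which is the point of a model positioned for high-volume, low-latency tasks.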
Gemini 2.5 Flash-Lite is now GA: it's our fastest (400 tokens/second), most cost-efficient ($0.10 in, $0.40 out) 2.5 model yet. Look for it in Google AI Studio + Vertex AI.
– Sundar Pichai (@sundarpichai)
4:21 PM • Jul 22, 2025
NVIDIA just launched Nemotron, a new family of multimodal, open-licensed foundation models built for agentic AI, optimized for reasoning, coding, math, vision, and tool use across enterprise workloads.
Whatās New:
SOTA performance: Nemotron models top benchmarks in science, code, and math, distilled from DeepSeek R1 0528 (671B) into compact 7B, 14B, and 32B models with state-of-the-art accuracy.
Three tiers, all open:
Nano: Optimized for edge + PC.
Super: High accuracy on a single H100.
Ultra: Multi-GPU powerhouse for complex systems.
Built for deployment: Ships as secure NIM microservices with support for NeMo and NVIDIA Blueprints. Easy to self-host, fine-tune, or scale across data centers.
Efficient by design: Pruned and TensorRT-LLM-optimized for speed and token throughput. On/off reasoning modes supported.
Commercial-ready: Fully open license. Customizable with transparent training data; all models are live now on Hugging Face.
If you're building agents that reason, see, code, or call tools, Nemotron's ready out of the box. (source)
Announcing the release of OpenReasoning-Nemotron: a suite of reasoning-capable LLMs which have been distilled from the DeepSeek R1 0528 671B model. Trained on a massive, high-quality dataset distilled from the new DeepSeek R1 0528, our new 7B, 14B, and 32B models achieve SOTA
– NVIDIA AI Developer (@NVIDIAAIDev)
6:50 PM • Jul 18, 2025
The Hierarchical Reasoning Model (HRM) is here, and it's blowing minds. This thing hits 40.3% on ARC-AGI with no pretraining, no Chain-of-Thought, and just 1K examples. Oh, and it runs on 2 consumer GPUs.
Whatās New:
No CoT. No pretraining. No nonsense: HRM solves expert-level Sudoku, giant mazes, and ARC tasks from scratch. Just 27M params and 1,000 examples. That's it.
Two-level brain loop: A slow high-level planner + a fast low-level executor = structured reasoning in a single forward pass.
ARC-AGI crushed: Hits 40.3% on one of the hardest general intelligence benchmarks, outperforming huge models with massive contexts.
Potato-tier compute: This runs on 2 GPUs. Yes, seriously. AGI on a toaster.
This is a very big deal. No tricks, no brute-force scale, just clean, efficient, brain-inspired reasoning. AGI doesn't feel that far off anymore.
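The two-timescale loop described above can be caricatured in a few lines. This is a toy illustration of the idea, not the actual HRM architecture (which uses learned recurrent modules): a slow high-level planner re-plans every few steps, while a fast low-level executor acts every step. The number-line task and the `horizon` parameter are inventions for the example.

```python
# Toy two-timescale loop: slow planner sets waypoints, fast executor steps.
def hrm_toy(start, target, horizon=3, max_steps=100):
    pos, waypoint = start, start
    for step in range(max_steps):
        if step % horizon == 0:
            # Slow loop: re-plan a waypoint at most `horizon` steps away.
            if target > pos:
                waypoint = min(target, pos + horizon)
            else:
                waypoint = max(target, pos - horizon)
        if pos == target:
            break
        # Fast loop: take one greedy step toward the current waypoint.
        pos += 1 if waypoint > pos else -1
    return pos

print(hrm_toy(0, 7))   # walks the cursor from 0 up to 7
print(hrm_toy(10, 2))  # and back down from 10 to 2
```

The point of the structure is that neither loop alone suffices: the fast loop has no global view, and the slow loop is too coarse to act, which is roughly the intuition behind HRM's planner/executor split.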
Introducing Hierarchical Reasoning Model
Inspired by brain's hierarchical processing, HRM delivers unprecedented reasoning power on complex tasks like ARC-AGI and expert-level Sudoku using just 1k examples, no pretraining or CoT!
Unlock next AI breakthrough with
– Guan Wang (@makingAGI)
1:23 PM • Jul 21, 2025
Human 1, AI 0 (for now): After a grueling 10-hour coding marathon, Polish programmer Przemysław Dębiak (aka "Psyho") edged out an OpenAI model at the AtCoder World Finals, one of the most elite programming contests on the planet. Both were solving the same mind-bending optimization problem, but Dębiak's sleepless persistence won by a slim 9.5% margin. "Humanity has prevailed (for now!)," he posted after barely surviving the match. AI may be relentless, but human grit still has its moments.
This week's pick: a GenAI Learning Path, a practical, project-focused journey through everything from LLMs and RAG pipelines to real-world app deployment. You'll get hands-on with tools like LlamaIndex, LangChain, and AWS GenAI, while learning how to build, test, and ship AI-driven products faster. Think prompt engineering, scalable RAG systems, and production-ready workflows, all packed into one learning track for folks ready to break into GenAI or level up with serious technical depth.
If you're serious about building in AI, GitHub is where the real action happens. This curated list of 10 top-notch LLM repositories is packed with hands-on projects, from beginner ML roadmaps to advanced RAG techniques and deployable AI agents. Whether you're still learning or already shipping, these repos will push you forward.
We saw what happens when agents go too far: Replit's wiped database will be a case study in over-trusted autonomy for years. But we also saw what happens when they actually deliver: ChatGPT Agent finished full workflows, hit new state-of-the-art benchmarks, and showed that agents aren't just research toys anymore.
Meanwhile, Meta's neural wristband is quietly reinventing how we interact with machines. Gemini and OpenAI's latest models cracked Olympiad-level math. And HRM proved you don't need massive scale to get serious reasoning.
So yes, the agents are here. They can plan, execute, and blow up your prod. Be careful out there!
See you next week!