<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>BestBlogs.dev Featured Articles Newsletter</title>
        <link>https://www.bestblogs.dev/en/feeds/newsletter</link>
        <description>BestBlogs.dev Featured Articles Newsletter</description>
        <language>en-us</language>
        <lastBuildDate>Tue, 14 Apr 2026 16:32:57 GMT</lastBuildDate>
        <generator>BestBlogs.dev RSS Generator</generator>
        <copyright>Copyright 2026, BestBlogs.dev</copyright>
        <ttl>60</ttl>
        <atom:link href="https://www.bestblogs.dev/en/feeds/rss/newsletter" rel="self" type="application/rss+xml" />
        
        <item>
            <title>BestBlogs Issue #90: Brain &amp; Hands</title>
            <link>https://www.bestblogs.dev/en/newsletter/issue90</link>
            <guid isPermaLink="false">issue90</guid>
            <pubDate>Fri, 10 Apr 2026 07:13:55 GMT</pubDate>
            <description>
        &lt;div style=&quot;font-family: -apple-system, BlinkMacSystemFont, &apos;Segoe UI&apos;, Roboto, &apos;Helvetica Neue&apos;, Arial, sans-serif; max-width: 800px; margin: 0 auto; color: #1e293b; padding: 20px 0;&quot;&gt;
            &lt;div style=&quot;margin: 20px 0; padding: 20px; background-color: #fefdfb; border-radius: 8px; border-left: 4px solid #1a365d;&quot;&gt;
                &lt;h3 style=&quot;font-family: Georgia, &apos;Noto Serif SC&apos;, &apos;Songti SC&apos;, serif; color: #1a365d; margin: 0 0 12px 0; font-size: 18px; display: flex; align-items: center;&quot;&gt;
                    &lt;span style=&quot;color: #d97706; margin-right: 8px&quot;&gt;📰&lt;/span&gt; BestBlogs Issue #90: Brain &amp;amp; Hands
                &lt;/h3&gt;
                &lt;div style=&quot;margin: 0 0 16px 0; line-height: 1.8; color: #334155; font-size: 16px;&quot;&gt;
                    &lt;p&gt;Hey there! Welcome to BestBlogs.dev Issue #90.&lt;/p&gt;
&lt;p&gt;Anthropic published a piece on Managed Agents architecture this week, built around an elegant metaphor: decoupling an agent&apos;s &quot;brain&quot; (LLM + orchestration) from its &quot;hands&quot; (sandbox + tools). That framing connects nearly every major story this week — the Advisor Strategy pairs Sonnet as hands with Opus as brain; GLM-5.1&apos;s 8-hour autonomous sessions extend what hands can do; and when Martin Fowler and Kent Beck discuss how engineering roles are shifting, they&apos;re really describing the move from doing the work yourself to orchestrating the collaboration between brains and hands. As brains get smarter and hands get more capable, what does our role become?&lt;/p&gt;
&lt;p&gt;This week I&apos;m running final validation on BestBlogs 2.0, with the upgrade planned for this weekend. There may be a brief period of downtime during the migration — thanks for your patience. You can also scan the QR code at the end of this newsletter to join our community group for real-time updates.&lt;/p&gt;
&lt;p&gt;Here are 10 highlights worth your attention this week:&lt;/p&gt;
&lt;p&gt;🧠 Anthropic details the architecture behind Managed Agents: decoupling the &quot;brain&quot; (LLM + orchestration), &quot;hands&quot; (sandbox + tools), and &quot;conversation&quot; (persistent event log) turns agent components from carefully tended &quot;pets&quot; into elastically scalable &quot;cattle.&quot; On-demand sandbox initialization alone cut time-to-first-token by over 60%.&lt;/p&gt;
&lt;p&gt;🤖 Anthropic&apos;s Advisor Strategy separates execution from thinking: Sonnet or Haiku handles task execution, calling on Opus for guidance only when needed. This boosts SWE-bench scores while significantly cutting costs — developers just configure a single advisor tool in the API.&lt;/p&gt;
&lt;p&gt;⚡ Zhipu releases GLM-5.1 with support for 8-hour autonomous work sessions, achieving leading scores on SWE-Bench Pro and demonstrating a full engineering loop from experiment to analysis to optimization. When the &quot;hands&quot; can work all day independently, the question shifts from how to write code to what to have it build.&lt;/p&gt;
&lt;p&gt;👨‍💼 Two heavyweight conversations on the changing role of engineers. Martin Fowler and Kent Beck argue that modularity and testing remain the foundation for collaborating with agents. DHH emphasizes that as AI breaks the productivity bottleneck, engineers with system architecture vision and product taste will thrive. The craft stays — the medium changes.&lt;/p&gt;
&lt;p&gt;🛠️ Three articles dissect Harness Engineering from different angles — the key to making &quot;hands&quot; truly reliable. OpenAI&apos;s internal experiment achieved million-line-scale development with zero human code and zero human review. A deep reverse-engineering of Claude Code reveals its trajectory invariants and externalized memory mechanisms. Sebastian Raschka maps out the six core modules of a coding agent.&lt;/p&gt;
&lt;p&gt;🔬 Project Glasswing gives critical infrastructure organizations access to Claude Mythos Preview for security auditing. The model doesn&apos;t just find vulnerabilities — it autonomously chains complex attack paths, and has already helped fix long-standing high-severity issues in Linux. AI&apos;s &quot;hands&quot; are reaching corners human auditors struggle to cover.&lt;/p&gt;
&lt;p&gt;🎨 New models and tools across the board: Meta ships Muse Spark, a native multimodal model with visual chain-of-thought and multi-agent collaboration that achieves 10x pretraining efficiency. Tongyi Lab open-sources VimRAG, a cross-modal RAG framework using dynamic memory graphs for unified text, image, and video retrieval. MiniMax launches MMX-CLI, giving agents a semantically-aware multimodal command-line tool.&lt;/p&gt;
&lt;p&gt;🔒 Cloudflare&apos;s Harshil Agrawal analyzes three threat scenarios for running AI-generated code and compares V8 Isolate vs. container sandbox trade-offs. The core principle: deny all permissions by default, use proxy patterns to keep API keys outside the sandbox. Put gloves on the &quot;hands&quot; before letting them work.&lt;/p&gt;
&lt;p&gt;📊 LangChain redefines agent continual learning as the co-evolution of model weights, harness code, and external context. A hands-on guide from Tencent Cloud details the journey from prompt engineering to Harness Engineering, arguing that mechanical constraint systems are the answer to entropy and drift in agent collaboration.&lt;/p&gt;
&lt;p&gt;📈 Industry leaders weigh in: Sundar Pichai says the 2026 AI race will hit physical bottlenecks — memory, power — with Google&apos;s capex reaching $180 billion. Sam Altman proposes &quot;emergent resilience&quot; through AI-powered layered defense systems. Yuandong Tian argues that breaking through Scaling Law requires logical understanding, not data memorization. Meanwhile, Anthropic grew ARR from $1B to over $19B in 14 months, with its growth lead championing &quot;prototype as PRD.&quot;&lt;/p&gt;
&lt;p&gt;Hope this issue sparks some new ideas. Stay curious, and see you next week!&lt;/p&gt;

                &lt;/div&gt;
                &lt;div style=&quot;margin-top: 16px; padding-top: 16px; border-top: 1px solid #e2e8f0;&quot;&gt;
                    &lt;a href=&quot;https://www.bestblogs.dev/en/newsletter/issue90&quot; 
                       style=&quot;display: inline-block; padding: 8px 16px; background-color: #d97706; color: white; text-decoration: none; border-radius: 6px; font-size: 14px; font-weight: 500; transition: background-color 0.2s;&quot;
                       target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;
                        Read Full Version Online
                    &lt;/a&gt;
                &lt;/div&gt;
            &lt;/div&gt;
        &lt;/div&gt;
    </description>
        </item>

        <item>
            <title>BestBlogs Issue #89: Agentic Engineering</title>
            <link>https://www.bestblogs.dev/en/newsletter/issue89</link>
            <guid isPermaLink="false">issue89</guid>
            <pubDate>Fri, 03 Apr 2026 08:25:22 GMT</pubDate>
            <description>
        &lt;div style=&quot;font-family: -apple-system, BlinkMacSystemFont, &apos;Segoe UI&apos;, Roboto, &apos;Helvetica Neue&apos;, Arial, sans-serif; max-width: 800px; margin: 0 auto; color: #1e293b; padding: 20px 0;&quot;&gt;
            &lt;div style=&quot;margin: 20px 0; padding: 20px; background-color: #fefdfb; border-radius: 8px; border-left: 4px solid #1a365d;&quot;&gt;
                &lt;h3 style=&quot;font-family: Georgia, &apos;Noto Serif SC&apos;, &apos;Songti SC&apos;, serif; color: #1a365d; margin: 0 0 12px 0; font-size: 18px; display: flex; align-items: center;&quot;&gt;
                    &lt;span style=&quot;color: #d97706; margin-right: 8px&quot;&gt;📰&lt;/span&gt; BestBlogs Issue #89: Agentic Engineering
                &lt;/h3&gt;
                &lt;div style=&quot;margin: 0 0 16px 0; line-height: 1.8; color: #334155; font-size: 16px;&quot;&gt;
                    &lt;p&gt;Hey there! Welcome to BestBlogs.dev Issue #89.&lt;/p&gt;
&lt;p&gt;This week&apos;s theme is &lt;strong&gt;Agentic Engineering&lt;/strong&gt;. A Tencent team grew a single AGENTS.md file into a full engineering system spanning 22 agents and 27 skills. Birgitta Böckeler published a Harness Engineering framework on Martin Fowler&apos;s blog, breaking down agent governance into feedforward guides and feedback sensors. Tmall hit a 97.9% code acceptance rate with their &quot;glue programming&quot; approach. All three point in the same direction: as AI graduates from coding assistant to autonomous agent, the industry needs an entirely new engineering discipline to keep it on track.&lt;/p&gt;
&lt;p&gt;I was on vacation with my family this week, but BestBlogs 2.0 development kept moving. I had Claude Code review and optimize the 2.0 codebase against the project&apos;s top-level design documents—product vision, brand identity, design language, and terminology. Using the Preview feature, it iterated and verified changes autonomously while I checked progress on my phone and confirmed direction. This is perhaps Agentic Engineering at its most basic: you define the standards and boundaries, and the agent delivers within those constraints. The official 2.0 launch is next weekend—looking forward to sharing it with you.&lt;/p&gt;
&lt;p&gt;Here are 10 highlights worth your attention this week:&lt;/p&gt;
&lt;p&gt;🤖 The model arms race keeps accelerating. Google released the &lt;strong&gt;Gemma 4&lt;/strong&gt; family under Apache 2.0, with the 31B variant ranking among the top on Arena AI, giving developers full freedom for local deployment and commercial use. &lt;strong&gt;Qwen3.6-Plus&lt;/strong&gt; targets Coding Agent scenarios with million-token context support and a preserve_thinking mechanism that maintains chain-of-thought across multi-turn conversations, keeping agents consistent through complex long-running tasks.&lt;/p&gt;
&lt;p&gt;🎨 &lt;strong&gt;Wan2.7-Image&lt;/strong&gt; tackles three stubborn pain points in image generation: more anatomically accurate humans, stable text rendering without distortion, and faithful color reproduction. A practical, production-grade solution for poster design and high-quality visual creation.&lt;/p&gt;
&lt;p&gt;👁️ &lt;strong&gt;GLM-5V-Turbo&lt;/strong&gt; gives coding agents the ability to see. This natively multimodal model offers 200K context and handles the full loop from design mockup interpretation to GUI manipulation, with deep integration into Claude Code and AutoClaw. LangChain&apos;s evaluation confirms the broader trend: open-source models like &lt;strong&gt;GLM-5&lt;/strong&gt; and &lt;strong&gt;MiniMax M2.7&lt;/strong&gt; now match top proprietary models on core agentic capabilities, accessible via a single line through Deep Agents SDK. The open-source camp has crossed a critical threshold.&lt;/p&gt;
&lt;p&gt;🛠️ &lt;strong&gt;Cursor 3&lt;/strong&gt; marks a new chapter in software development. It&apos;s a unified workspace built around agents, supporting multi-repo parallel work, seamless local-cloud agent switching, and a browser plus MCP plugin ecosystem. The developer&apos;s core job is shifting from editing files to orchestrating agent clusters.&lt;/p&gt;
&lt;p&gt;⚡ Simon Willison delivered a deep analysis of AI programming&apos;s inflection point on Lenny&apos;s Podcast. His &quot;dark factory&quot; concept lands squarely on the tension: when agents produce code at scale, traditional line-by-line code review becomes unsustainable—massive automated test clusters are the viable replacement. He also defined the three fatal ingredients of prompt injection, warning that the industry is sitting in a risk latency period.&lt;/p&gt;
&lt;p&gt;🔧 The &lt;strong&gt;Claude Code&lt;/strong&gt; source code was exposed, revealing the engineering details behind a top-tier agent runtime. An async-generator main loop, streaming concurrent tool dispatch, a five-layer context compression pipeline, and a three-tier defense-in-depth permission system—every design decision centers on one goal: keeping agents reliable and secure through long conversations.&lt;/p&gt;
&lt;p&gt;🏗️ The must-read long piece this week comes from Tencent&apos;s engineering team. The author documents a journey from casual Vibe Coding to building a rigorous system of 22 agents and 27 skills, all starting from a single AGENTS.md file. The core insight: AI&apos;s capability ceiling depends on the quality of your context engineering and compound engineering. Documentation as memory, tool-based encapsulation, and the Ralph Loop mechanism form the foundation for AI participating reliably across the full development lifecycle.&lt;/p&gt;
&lt;p&gt;📐 Birgitta Böckeler&apos;s Harness Engineering article on Martin Fowler&apos;s blog introduces &lt;strong&gt;Agent = Model + Harness&lt;/strong&gt;, governing coding agents through feedforward guides and feedback sensors. Tencent Tech&apos;s deep analysis adds that a harness dynamically compensates for model limitations, and the real competitive edge lies in knowing exactly when to subtract as models improve. Tmall&apos;s team validates the most pragmatic production path: positioning AI as an assembler, constraining output quality through a four-layer material system of development standards, code patterns, domain knowledge, and task specifications—achieving a 97.9% acceptance rate.&lt;/p&gt;
&lt;p&gt;☕ Google&apos;s &lt;strong&gt;ADK for Java 1.0.0&lt;/strong&gt; ships as a production-grade agent development toolkit, with enhanced retrieval, global plugin architecture, automated context compression, and native A2A protocol support enabling cross-language agent collaboration. &lt;strong&gt;Qdrant Skills&lt;/strong&gt; launched alongside it, converting expert-level architectural diagnosis into agent-readable decision trees focused on production &quot;when to use&quot; and &quot;why&quot; questions.&lt;/p&gt;
&lt;p&gt;💡 Block&apos;s 40% workforce reduction sparked wide discussion, with the business lead explaining the underlying logic: AI agents are breaking the long-standing correlation between headcount and output. Through their internal Goose platform and BuilderBot, Block shifted to agent-driven development, with &quot;Generative UI&quot; dynamically building interfaces from user data in real time. Kimi&apos;s Yang Zhilin shared K2.5&apos;s breakthroughs at the Zhongguancun Forum, where agent clusters tackle complex tasks through parallel collaboration and attention residual architecture optimizes network depth. In identity verification, World CEO Alex Blania presented a &quot;Proof of Human&quot; system combining iris recognition with zero-knowledge proofs, addressing an increasingly urgent question: in an age where AI can perfectly simulate human behavior, how do you confirm real identity?&lt;/p&gt;
&lt;p&gt;Hope this issue sparks some new ideas. Stay curious, and see you next week!&lt;/p&gt;

                &lt;/div&gt;
                &lt;div style=&quot;margin-top: 16px; padding-top: 16px; border-top: 1px solid #e2e8f0;&quot;&gt;
                    &lt;a href=&quot;https://www.bestblogs.dev/en/newsletter/issue89&quot; 
                       style=&quot;display: inline-block; padding: 8px 16px; background-color: #d97706; color: white; text-decoration: none; border-radius: 6px; font-size: 14px; font-weight: 500; transition: background-color 0.2s;&quot;
                       target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;
                        Read Full Version Online
                    &lt;/a&gt;
                &lt;/div&gt;
            &lt;/div&gt;
        &lt;/div&gt;
    </description>
        </item>

        <item>
            <title>BestBlogs Issue #88: Agentic Thinking</title>
            <link>https://www.bestblogs.dev/en/newsletter/issue88</link>
            <guid isPermaLink="false">issue88</guid>
            <pubDate>Sat, 21 Mar 2026 05:55:01 GMT</pubDate>
            <description>
        &lt;div style=&quot;font-family: -apple-system, BlinkMacSystemFont, &apos;Segoe UI&apos;, Roboto, &apos;Helvetica Neue&apos;, Arial, sans-serif; max-width: 800px; margin: 0 auto; color: #1e293b; padding: 20px 0;&quot;&gt;
            &lt;div style=&quot;margin: 20px 0; padding: 20px; background-color: #fefdfb; border-radius: 8px; border-left: 4px solid #1a365d;&quot;&gt;
                &lt;h3 style=&quot;font-family: Georgia, &apos;Noto Serif SC&apos;, &apos;Songti SC&apos;, serif; color: #1a365d; margin: 0 0 12px 0; font-size: 18px; display: flex; align-items: center;&quot;&gt;
                    &lt;span style=&quot;color: #d97706; margin-right: 8px&quot;&gt;📰&lt;/span&gt; BestBlogs Issue #88: Agentic Thinking
                &lt;/h3&gt;
                &lt;div style=&quot;margin: 0 0 16px 0; line-height: 1.8; color: #334155; font-size: 16px;&quot;&gt;
                    &lt;p&gt;Hey there! Welcome to BestBlogs.dev Issue #88.&lt;/p&gt;
&lt;p&gt;A clear signal emerged from multiple directions this week: AI&apos;s competitive edge is shifting from &quot;thinking deeper&quot; to &quot;getting things done.&quot; Lin Junyang argues that the next chapter of LLMs belongs to Agentic Thinking, Karpathy describes a 20/80 agent orchestration workflow, and both Anthropic and Cursor shipped engineering solutions to make agents more reliable. As reasoning capabilities plateau, the real differentiator will be whether AI can act continuously in real environments—self-correcting, closing feedback loops, and evolving on its own.&lt;/p&gt;
&lt;p&gt;I spent the week deep-testing BestBlogs 2.0&apos;s core features—subscription source management, AI-generated daily briefings, personalized recommendations, and AI-assisted reading. Throughout the process, I leaned heavily on gstack for feature analysis, design, coding, and code review, with different agent roles handling each core task. My job was mostly clarifying direction and guarding taste, then stepping in for hands-on testing after the AI rounds were done. It turned out to be a live exercise in Agentic Thinking: the developer&apos;s role is shifting from executor to orchestrator and quality gatekeeper.&lt;/p&gt;
&lt;p&gt;Here are 10 highlights worth your attention this week:&lt;/p&gt;
&lt;p&gt;🧠 Lin Junyang argues that LLM competition is moving from Reasoning Thinking to &lt;strong&gt;Agentic Thinking&lt;/strong&gt;—real intelligence isn&apos;t isolated deliberation but reasoning through action in real environments. An in-depth piece from Alibaba Cloud echoes this through a control theory lens: LLM uncertainty is a physical inevitability, and AI development is fundamentally about managing Context as state.&lt;/p&gt;
&lt;p&gt;🤖 Andrej Karpathy paints a vivid picture on No Priors: software engineering has shifted from writing code to a 20/80 agent orchestration model. He coins the term &quot;AI psychosis&quot;—the infinite leverage effect where agents can drift off course—and discusses how AutoResearch aims to remove human bottlenecks so LLMs can recursively self-improve.&lt;/p&gt;
&lt;p&gt;🛠️ Anthropic dropped two engineering posts tackling agent reliability head-on. Their &lt;strong&gt;Harness design&lt;/strong&gt; borrows from GAN-style multi-agent architecture—Planner, Generator, and Evaluator working in concert—with Playwright MCP giving agents visual verification. &lt;strong&gt;Claude Code&apos;s auto mode&lt;/strong&gt; fights approval fatigue with a two-layer defense: input probes scan for prompt injection while an output classifier uses reasoning-blind design, reviewing only action payloads to catch risky operations without second-guessing the model&apos;s chain of thought.&lt;/p&gt;
&lt;p&gt;⚡ &lt;strong&gt;Cursor&lt;/strong&gt; revealed the secret sauce behind Composer: real-time reinforcement learning. Instead of training in simulated environments, they feed production inference tokens and real user feedback directly into reward signals, shipping a new model checkpoint every 5 hours. Meanwhile, a sharp essay argues IDEs aren&apos;t dying—they&apos;re decentralizing, as developers evolve from coders into agent supervisors and orchestrators.&lt;/p&gt;
&lt;p&gt;🏗️ Tw93&apos;s agent engineering guide, drawn from OpenClaw&apos;s production experience, delivers a standout insight: what matters more than model performance is the Harness infrastructure around the agent—acceptance baselines and feedback signals. Cloudflare answers from the infrastructure layer with Dynamic Worker Loader, a V8 Isolate-based sandbox that&apos;s 100x faster than containers for AI code execution, with a Code Mode that saves 81% on token consumption.&lt;/p&gt;
&lt;p&gt;📱 Claude&apos;s product boundaries keep expanding. &lt;strong&gt;Computer Use&lt;/strong&gt; paired with &lt;strong&gt;Dispatch&lt;/strong&gt; enables pure vision-driven desktop interaction—agents can control WeChat or any local software, with mobile-to-desktop task routing. freeCodeCamp published a nearly 20,000-word &lt;strong&gt;Claude Code&lt;/strong&gt; handbook covering the shift from &quot;smart autocomplete&quot; to &quot;autonomous agent,&quot; including MCP protocol integration, parallel workflows, and Git worktree patterns.&lt;/p&gt;
&lt;p&gt;🔬 Foundational model breakthroughs continue. Google&apos;s &lt;strong&gt;TurboQuant&lt;/strong&gt; algorithm uses polar quantization to achieve 6x+ KV cache compression with zero accuracy loss and 8x inference speedup on H100. Sebastian Raschka&apos;s visual guide maps the evolution of attention mechanisms in modern LLMs—from GQA and MLA to sliding window and hybrid architectures—revealing the design trade-offs behind reducing KV cache pressure without sacrificing performance.&lt;/p&gt;
&lt;p&gt;🗣️ &lt;strong&gt;Gemini 3.1 Flash Live&lt;/strong&gt; pushes voice AI forward with improved multi-step function calling and emotional tone recognition for smoother real-time conversations. Now deployed across 200+ countries with SynthID for content safety, it&apos;s a milestone worth watching for anyone building voice-first agents.&lt;/p&gt;
&lt;p&gt;🏭 Jensen Huang told Lex Fridman that computing has evolved from individual chips to entire &quot;AI factories,&quot; with the moat built on full-stack co-design spanning chips, networking, and data centers. He laid out four dimensions of AI scaling laws: pre-training, post-training, test-time scaling, and agentic scaling. Waymo&apos;s CEO interview reinforces this systems-level thinking from another angle—autonomous driving&apos;s core is teacher-student model distillation, balancing end-to-end learning with system interpretability.&lt;/p&gt;
&lt;p&gt;🌐 The agent ecosystem is taking shape fast. Open-source tool &lt;strong&gt;Paperclip&lt;/strong&gt; demos a &quot;zero-headcount company&quot; vision with a CEO agent managing team hiring and task decomposition, using &quot;memory shards&quot; and heartbeat checklists to maintain consistency across long workflows. AirJelly&apos;s founder argues the agent moat isn&apos;t execution but deep Context awareness. GDC observations show gaming has become AI&apos;s proving ground for technology validation. And Alibaba Cloud&apos;s CIO report offers a reality check: AI is a mirror reflecting IT&apos;s legacy baggage—don&apos;t be blinded by the &quot;10x engineering productivity&quot; illusion.&lt;/p&gt;
&lt;p&gt;Hope this issue sparks some new ideas. Stay curious, and see you next week!&lt;/p&gt;

                &lt;/div&gt;
                &lt;div style=&quot;margin-top: 16px; padding-top: 16px; border-top: 1px solid #e2e8f0;&quot;&gt;
                    &lt;a href=&quot;https://www.bestblogs.dev/en/newsletter/issue88&quot; 
                       style=&quot;display: inline-block; padding: 8px 16px; background-color: #d97706; color: white; text-decoration: none; border-radius: 6px; font-size: 14px; font-weight: 500; transition: background-color 0.2s;&quot;
                       target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;
                        Read Full Version Online
                    &lt;/a&gt;
                &lt;/div&gt;
            &lt;/div&gt;
        &lt;/div&gt;
    </description>
        </item>

        <item>
            <title>BestBlogs Issue #87: Self-Evolution</title>
            <link>https://www.bestblogs.dev/en/newsletter/issue87</link>
            <guid isPermaLink="false">issue87</guid>
            <pubDate>Fri, 20 Mar 2026 02:39:36 GMT</pubDate>
            <description>
        &lt;div style=&quot;font-family: -apple-system, BlinkMacSystemFont, &apos;Segoe UI&apos;, Roboto, &apos;Helvetica Neue&apos;, Arial, sans-serif; max-width: 800px; margin: 0 auto; color: #1e293b; padding: 20px 0;&quot;&gt;
            &lt;div style=&quot;margin: 20px 0; padding: 20px; background-color: #fefdfb; border-radius: 8px; border-left: 4px solid #1a365d;&quot;&gt;
                &lt;h3 style=&quot;font-family: Georgia, &apos;Noto Serif SC&apos;, &apos;Songti SC&apos;, serif; color: #1a365d; margin: 0 0 12px 0; font-size: 18px; display: flex; align-items: center;&quot;&gt;
                    &lt;span style=&quot;color: #d97706; margin-right: 8px&quot;&gt;📰&lt;/span&gt; BestBlogs Issue #87: Self-Evolution
                &lt;/h3&gt;
                &lt;div style=&quot;margin: 0 0 16px 0; line-height: 1.8; color: #334155; font-size: 16px;&quot;&gt;
                    &lt;p&gt;Hey there! Welcome to BestBlogs.dev Issue #87.&lt;/p&gt;
&lt;p&gt;This week, MiniMax released &lt;strong&gt;M2.7&lt;/strong&gt; with a concept that stopped me in my tracks: it&apos;s the first model that deeply participated in iterating on itself. M2.7 autonomously builds Agent Harnesses, updates its own memory, drives its own reinforcement learning, and optimizes the entire process, running over 100 fully autonomous iteration loops to improve internal benchmark performance by 30%. AI is moving from passive tool to self-improving system. Looking across this week&apos;s content, Cursor reshapes coding models through continued pretraining, Cloudflare embeds frontier models directly into edge infrastructure, and Xie Saining questions the intelligence path beyond language models. Self-evolution is happening simultaneously across models, toolchains, infrastructure, and our understanding of intelligence itself.&lt;/p&gt;
&lt;p&gt;This week I spent most of my time using Skills to review and fine-tune BestBlogs.dev&apos;s scoring system. The process involves discussing each article&apos;s score with the AI: is it fair, and what&apos;s the reasoning? I then distill the recurring judgment patterns back into the prompts so the AI&apos;s scoring and analysis keep improving. In a way, it&apos;s a small-scale version of self-evolution: a human-AI feedback loop that makes the system more accurate over time.&lt;/p&gt;
&lt;p&gt;Here are 10 highlights worth your attention this week:&lt;/p&gt;
&lt;p&gt;🔬 &lt;strong&gt;MiniMax M2.7&lt;/strong&gt; pioneers a new paradigm where the model participates in its own iteration. It autonomously executes full cycles of &quot;analyze failure → modify code → run evaluation → compare results → keep or rollback&quot; for over 100 rounds, boosting internal benchmark performance by 30%. It scores 56.22% on SWE-Pro, approaching Opus&apos;s best. Even more telling: M2.7 maintains 97% compliance across 40 complex Skills (each over 2,000 tokens), showing that the competitive edge in the agent era has extended from generation quality to continuous self-optimization in complex environments.&lt;/p&gt;
&lt;p&gt;🏎️ &lt;strong&gt;Cursor Composer 2&lt;/strong&gt; lets the numbers speak: CursorBench jumps from 38.0 to 61.3, Terminal-Bench 2.0 from 40.0 to 61.7, SWE-bench Multilingual from 56.9 to 73.7. These gains come from their first continued pretraining run deeply fused with reinforcement learning, enabling the model to independently solve coding tasks requiring hundreds of steps. Pricing is equally compelling at $0.50/M input tokens, putting frontier coding capability within easy reach.&lt;/p&gt;
&lt;p&gt;⚡ Model capabilities are becoming infrastructure. Anthropic rolled out Claude&apos;s million-token context window across the board, eliminating long-context surcharges. Opus 4.6 scores 78.3% on MRCR v2, ranking first among frontier models, meaning it can still pinpoint details after ingesting a million tokens. Meanwhile, OpenAI&apos;s &lt;strong&gt;GPT-5.4 nano&lt;/strong&gt; sets a new cost benchmark at just $0.20 per million input tokens. Simon Willison tested it on 76,000 photos for just $52. Long context and cheap inference are quickly becoming production staples.&lt;/p&gt;
&lt;p&gt;🌐 &lt;strong&gt;Cloudflare Workers AI&lt;/strong&gt; enters the big models game, launching with Moonshot AI&apos;s &lt;strong&gt;Kimi K2.5&lt;/strong&gt; featuring 256k context, multi-turn tool calling, and vision inputs. The real story is in the engineering: Prefix Caching and Session Affinity slash inference latency, with internal testing showing 77% cost savings over closed-source models for code security reviews. A new async API handles large-scale non-real-time tasks, letting developers run the entire agent lifecycle on a single platform.&lt;/p&gt;
&lt;p&gt;🛠️ Simon Willison shared a workflow that makes you rethink what coding looks like today. He now writes more code on his phone than his laptop. Thirty minutes before going on stage, he had Claude optimize his Python WebAssembly engine, getting a 49% speedup on Fibonacci. His core method: red-green TDD. Write a failing test first, let the agent fill in the implementation, and build trust through automated verification instead of line-by-line review. He admits the initial discomfort is real, but once you cross the trust threshold, your role shifts from coder to conductor.&lt;/p&gt;
&lt;p&gt;🧩 Two articles deconstruct agent architecture evolution from different angles. A deep dive from Alibaba Cloud argues that every complex architecture from Single Agent to Multi-Agent to Skills and Teams is fundamentally engineering compensation for LLMs&apos; lack of domain knowledge and long-term memory, advocating &quot;don&apos;t add entities unnecessarily.&quot; Anthropic&apos;s Claude Code engineer Thariq Shihipar reveals the classification system behind hundreds of active internal Skills across nine categories, emphasizing that Skills&apos; real value lies in being structured tools with scripts, data storage, and hooks, going well beyond markdown files.&lt;/p&gt;
&lt;p&gt;🏆 Jensen Huang spent two hours at GTC 2026 defining NVIDIA&apos;s transformation from chipmaker to full-stack AI infrastructure general contractor. The Feynman architecture, Vera Rubin platform, and Rosa CPU purpose-built for agent orchestration form the hardware trifecta. Two new core libraries, cuDF and cuVS, deliver full acceleration for both structured and unstructured data. The open-source &lt;strong&gt;NemoClaw&lt;/strong&gt; marks the official start of the enterprise agent era. On the All-In Podcast, he elaborated on Groq&apos;s value for disaggregated inference and the inflection point for physical AI in a $50 trillion market.&lt;/p&gt;
&lt;p&gt;🔮 Xie Saining&apos;s 30,000-word interview is this week&apos;s most thought-provoking long read. The AMI Labs co-founder, who partnered with Turing Award winner Yann LeCun, declares that &quot;Silicon Valley is LLM-pilled,&quot; arguing that language models are fundamentally virtual intelligence lacking physical understanding. Real intelligence requires world models that predict environmental states, not just tokens. His sharpest claim: language is an &quot;opiate&quot; that may be contaminating visual representation learning. While everyone chases bigger LLMs, this contrarian thinking reminds us that evolution might have more than one direction.&lt;/p&gt;
&lt;p&gt;🤖 Two products independently push AI toward the role of independent worker. DingTalk launched &quot;Wukong,&quot; an AI-native work platform that makes enterprise workflows programmable through a CLI, letting AI autonomously execute tasks 24/7 inside secure sandboxes. Kuse.ai&apos;s CTO Austin Xu shares an even more forward-looking reality: their 15-person team works alongside three or four &quot;AI colleagues&quot; with names, Gmail accounts, and phone numbers, generating real business value daily. They even had to create a human-only chat channel just so the humans could hang out. When AI goes from tool to teammate, organizational structures themselves get reshaped.&lt;/p&gt;
&lt;p&gt;💡 Stack Overflow&apos;s blog sounds a worthwhile alarm: AI is becoming your &quot;second brain,&quot; potentially at the cost of your first one. Citing two recent papers, it dissects how over-reliance on AI drives &quot;cognitive offloading,&quot; with LLM sycophancy quietly eroding independent judgment. This pairs well with Xiaomi&apos;s &lt;strong&gt;MiMo-V2-Pro&lt;/strong&gt; lowering the agent barrier with a trillion-parameter model at 1/5 of Opus pricing, and Amazon&apos;s AI product lead pointing out that 85% of AI projects fail because teams optimize for demos rather than real users. The more powerful and accessible the tools become, the more precious human judgment and product sense become. That may be the most important thing to preserve in any self-evolution.&lt;/p&gt;
&lt;p&gt;Hope this issue sparks some new ideas. Stay curious, and see you next week!&lt;/p&gt;

                &lt;/div&gt;
                &lt;div style=&quot;margin-top: 16px; padding-top: 16px; border-top: 1px solid #e2e8f0;&quot;&gt;
&lt;a href=&quot;https://www.bestblogs.dev/en/newsletter/issue90&quot; 
                       style=&quot;display: inline-block; padding: 8px 16px; background-color: #d97706; color: white; text-decoration: none; border-radius: 6px; font-size: 14px; font-weight: 500; transition: background-color 0.2s;&quot;
                       target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;
                        Read Full Version Online
                    &lt;/a&gt;
                &lt;/div&gt;
            &lt;/div&gt;
        &lt;/div&gt;
    </description>
        </item>

        <item>
            <title>BestBlogs Issue #86: Infrastructure</title>
            <link>https://www.bestblogs.dev/en/newsletter/issue86</link>
            <guid>issue86</guid>
            <pubDate>Fri, 13 Mar 2026 05:58:29 GMT</pubDate>
            <description>
        &lt;div style=&quot;font-family: -apple-system, BlinkMacSystemFont, &apos;Segoe UI&apos;, Roboto, &apos;Helvetica Neue&apos;, Arial, sans-serif; max-width: 800px; margin: 0 auto; color: #1e293b; padding: 20px 0;&quot;&gt;
            &lt;div style=&quot;margin: 20px 0; padding: 20px; background-color: #fefdfb; border-radius: 8px; border-left: 4px solid #1a365d;&quot;&gt;
                &lt;h3 style=&quot;font-family: Georgia, &apos;Noto Serif SC&apos;, &apos;Songti SC&apos;, serif; color: #1a365d; margin: 0 0 12px 0; font-size: 18px; display: flex; align-items: center;&quot;&gt;
                    &lt;span style=&quot;color: #d97706; margin-right: 8px&quot;&gt;📰&lt;/span&gt; BestBlogs Issue #86: Infrastructure
                &lt;/h3&gt;
                &lt;div style=&quot;margin: 0 0 16px 0; line-height: 1.8; color: #334155; font-size: 16px;&quot;&gt;
                    &lt;p&gt;Hey there! Welcome to BestBlogs.dev Issue #86.&lt;/p&gt;
&lt;p&gt;One word kept surfacing across every layer this week: infrastructure. On the tenth anniversary of AlphaGo, Demis Hassabis wrote a personal retrospective tracing the arc from Go to protein folding and mathematical discovery, then laid out a clear AGI roadmap: Gemini&apos;s multimodal perception is merging with AlphaGo&apos;s logical planning, evolving AI from a tool into an &quot;AI co-scientist.&quot; On the application side, &lt;strong&gt;OpenClaw&lt;/strong&gt; has officially surpassed React as the most-starred project in GitHub history—no longer just an open-source tool, but an agent operating system settling in as foundational infrastructure. From a solo developer&apos;s six-layer governance model, to three generations of enterprise code review, to Jensen Huang&apos;s AI &quot;five-layer cake,&quot; this week&apos;s content collectively answers one question: when AI coding becomes table stakes, your real competitive edge comes from the infrastructure you build.&lt;/p&gt;
&lt;p&gt;This week I focused on assembling a personal content workflow using Skills—connecting the full pipeline from content ingestion, curation, deep reading, persona-based content creation, multi-platform publishing, to analytics. The goal is to upgrade fragmented information consumption into a content operating system with a feedback loop. Still iterating, but I can already feel the qualitative shift that comes from wiring tools into a system—which resonates deeply with the core insight running through this week&apos;s articles.&lt;/p&gt;
&lt;p&gt;Here are 10 highlights worth your attention this week:&lt;/p&gt;
&lt;p&gt;🏆 On the tenth anniversary of AlphaGo, Google DeepMind co-founder Demis Hassabis wrote a personal reflection on the Move 37 moment and its decade-long impact. The real legacy isn&apos;t beating a human champion—it&apos;s validating a general search-and-reasoning methodology that was then transplanted into &lt;strong&gt;AlphaFold&lt;/strong&gt;, &lt;strong&gt;FunSearch&lt;/strong&gt;, and chip design. His AGI roadmap is clear: Gemini&apos;s multimodal perception combined with AlphaGo&apos;s logical planning, pushing AI from tool to autonomous &quot;co-scientist.&quot;&lt;/p&gt;
&lt;p&gt;🔮 Two major foundation models debuted this week. &lt;strong&gt;Gemini Embedding 2&lt;/strong&gt; is Google&apos;s first native multimodal embedding model, unifying text, image, audio, and video into a single vector space with 100+ language support and MRL flexible dimension compression—a critical upgrade for multimodal RAG architectures. &lt;strong&gt;NVIDIA Nemotron 3 Super&lt;/strong&gt; fills the open-source gap for agentic reasoning: 120B parameters, 1M context length, and a Mamba-Transformer hybrid architecture delivering 5x throughput gains, making it the best open-source choice for complex, long-horizon multi-agent tasks.&lt;/p&gt;
&lt;p&gt;🤖 Two foundational agent component studies are worth bookmarking. Tongyi Lab&apos;s open-source &lt;strong&gt;Mobile-Agent-v3.5&lt;/strong&gt; achieves unified GUI automation across desktop, mobile, and browser through a hybrid data flywheel and reinforcement learning, hitting open-source SOTA on 20+ benchmarks. Microsoft Research&apos;s &lt;strong&gt;PlugMem&lt;/strong&gt; distills agent interaction history into structured facts and reusable skills, delivering higher-quality decision context with fewer tokens, outperforming traditional retrieval methods in dialogue and web browsing scenarios.&lt;/p&gt;
&lt;p&gt;🦞 Professor Hung-Yi Lee&apos;s video &quot;Dissecting the Lobster&quot; deconstructs AI Agent architecture with textbook clarity: system prompts for identity, RAG and compression for breaking context limits, heartbeat mechanisms for 24/7 autonomous operation, and Sub-Agent orchestration for complex task decomposition. Tencent Engineering&apos;s hands-on guide complements this with a complete deployment path from hardware selection to multi-agent coordination, along with a key safety warning: any system pursuing high autonomy must plan for the worst-case scenario of full data exposure. Theory meets practice—together they form the best entry point for understanding the OpenClaw ecosystem.&lt;/p&gt;
&lt;p&gt;🏗️ After six months of intensive &lt;strong&gt;Claude Code&lt;/strong&gt; usage, Tw93 distilled a six-layer governance model: CLAUDE.md, Tools/MCP, Skills, Hooks, Subagents, and Verifiers. The core insight: agent failures rarely stem from insufficient model capability—they come from context pollution, tool redundancy, and lack of deterministic constraints. HackerNoon&apos;s &quot;Scalability Triangle&quot; offers a complementary decision framework—MCP handles dynamic data interaction, Subagents handle task isolation and model routing, Skills handle static knowledge injection—with clear boundaries to prevent over-engineering. Read together, these two pieces are the most systematic treatment of Claude Code engineering practices available.&lt;/p&gt;
&lt;p&gt;⚡ OpenAI&apos;s Build Hour and Dewu&apos;s Spec Coding case study showcase production-grade agent engineering from two perspectives. OpenAI proposes Harness Engineering with seven readability metrics, arguing that embedding agent.md rules in the codebase lets AI independently ship PRs. Dewu&apos;s team validated their three-layer specification system (Rules/Code/UI) with 2,754 real tool calls over 10 days: the 36% efficiency gain required systematic upfront investment in specs. The article also candidly documents where AI breaks down in complex CI environments—that honesty makes this field report even more valuable.&lt;/p&gt;
&lt;p&gt;🔍 Kuaishou&apos;s intelligent Code Review is this week&apos;s most instructive enterprise case study. Three generations of architecture evolution—from LLM heuristics to knowledge engine plus deterministic rules, then to agentic autonomous decision-making—pushed code review adoption from 7.9% to 54%, cutting MR turnaround time by nearly 10%. The breakthrough: building 1,100+ hard rules to eliminate AI hallucinations, achieving a paradigm shift from personal assistant to organization-level collaborator. This evolution path offers direct lessons for any team pushing AI engineering into production.&lt;/p&gt;
&lt;p&gt;🌐 Founder Park surveyed 500+ &lt;strong&gt;OpenClaw&lt;/strong&gt;-related products on Product Hunt from February, spanning cloud hosting, Skill development, Agent social networks, and competitors. An entire ecosystem has emerged without top-down planning—completely bottom-up. OpenClaw isn&apos;t just a tool anymore; it&apos;s an operating-system-level platform. Meanwhile, LangChain argues that as implementation costs plummet, the software development bottleneck is shifting from building to reviewing. The talent landscape will bifurcate into full-stack &quot;builders&quot; and architecture-focused &quot;reviewers,&quot; with product sense becoming the core competency across all roles.&lt;/p&gt;
&lt;p&gt;🎨 Three articles interrogate the human position in the AI era from different angles. A YC design expert&apos;s retrospective on Vibe Coded websites reveals the homogeneity trap: over-reliance on LLMs produces cookie-cutter fade-in animations—AI is an execution lever, not a substitute for thinking. Elys founder Tristan offers another dimension: a person&apos;s soul is the sum of all their context, and AI social products must anchor one end to real humans—memory slots and entropy reduction are the true technical moats. Read together, they point to the same conclusion: the more powerful the tools, the more precious human judgment becomes.&lt;/p&gt;
&lt;p&gt;📈 Four macro pieces paint a panoramic view of AI. Jensen Huang&apos;s bylined essay deconstructs AI into a five-layer cake from energy to applications, arguing that open-source models are the catalyst activating full-stack demand. a16z&apos;s top-100 consumer AI report identifies personal memory as the next core moat. &quot;2026 Letter to AI Founders&quot; draws on the printing press, electric motor, and cloud computing to derive a law of profit conservation—when implementation stops being the bottleneck, value migrates to architectural judgment and product intuition. And a solo author&apos;s 70-page slide deck delivers the most comprehensive data review of the Q1 2026 US-China AI landscape.&lt;/p&gt;
&lt;p&gt;Hope this issue sparks some new ideas. Stay curious, and see you next week!&lt;/p&gt;

                &lt;/div&gt;
                &lt;div style=&quot;margin-top: 16px; padding-top: 16px; border-top: 1px solid #e2e8f0;&quot;&gt;
                    &lt;a href=&quot;https://www.bestblogs.dev/en/newsletter/issue86&quot; 
                       style=&quot;display: inline-block; padding: 8px 16px; background-color: #d97706; color: white; text-decoration: none; border-radius: 6px; font-size: 14px; font-weight: 500; transition: background-color 0.2s;&quot;
                       target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;
                        Read Full Version Online
                    &lt;/a&gt;
                &lt;/div&gt;
            &lt;/div&gt;
        &lt;/div&gt;
    </description>
        </item>

        <item>
            <title>BestBlogs Issue #85: Harness Engineering</title>
            <link>https://www.bestblogs.dev/en/newsletter/issue85</link>
            <guid>issue85</guid>
            <pubDate>Fri, 06 Mar 2026 05:14:21 GMT</pubDate>
            <description>
        &lt;div style=&quot;font-family: -apple-system, BlinkMacSystemFont, &apos;Segoe UI&apos;, Roboto, &apos;Helvetica Neue&apos;, Arial, sans-serif; max-width: 800px; margin: 0 auto; color: #1e293b; padding: 20px 0;&quot;&gt;
            &lt;div style=&quot;margin: 20px 0; padding: 20px; background-color: #fefdfb; border-radius: 8px; border-left: 4px solid #1a365d;&quot;&gt;
                &lt;h3 style=&quot;font-family: Georgia, &apos;Noto Serif SC&apos;, &apos;Songti SC&apos;, serif; color: #1a365d; margin: 0 0 12px 0; font-size: 18px; display: flex; align-items: center;&quot;&gt;
                    &lt;span style=&quot;color: #d97706; margin-right: 8px&quot;&gt;📰&lt;/span&gt; BestBlogs Issue #85: Harness Engineering
                &lt;/h3&gt;
                &lt;div style=&quot;margin: 0 0 16px 0; line-height: 1.8; color: #334155; font-size: 16px;&quot;&gt;
                    &lt;p&gt;Hey there! Welcome to BestBlogs.dev Issue #85.&lt;/p&gt;
&lt;p&gt;One keyword threads through this week&apos;s articles: harnessing. Essays published on martinfowler.com argue that developers&apos; core work is shifting from writing code to building the harness agents depend on—specs, quality gates, and workflow guides. A Chinese podcast title puts it more bluntly: stop working, start setting up the office for your AI. OpenAI&apos;s team shipped a million lines of Codex-generated code over five months, not by using a stronger model, but by enforcing structured knowledge bases and rigid architectural constraints. As agents grow more capable, the real competitive edge isn&apos;t whether you use AI, but whether you can harness it.&lt;/p&gt;
&lt;p&gt;On the BestBlogs.dev front, we&apos;ve been going deep on AI coding to build out version 2.0. The focus is custom subscription sources and personalized feeds, so everyone can shape their reading experience around their own interests. I&apos;m also developing Skills on top of open APIs for content search, deep reading, and daily operations—all aimed at truly harnessing the future of reading.&lt;/p&gt;
&lt;p&gt;Here are 10 highlights worth your attention this week:&lt;/p&gt;
&lt;p&gt;🤖 &lt;strong&gt;GPT-5.4&lt;/strong&gt; lands as OpenAI&apos;s first model to unify reasoning, coding, native computer use, deep search, and million-token context in a single package. The standout is native computer use: the model reads screenshots, moves the mouse, and types on the keyboard, surpassing average human performance on OSWorld desktop tasks. A tool-search mechanism cuts agent token consumption by 47%, achieving high capability and low cost simultaneously. Meanwhile, &lt;strong&gt;GPT-5.3 Instant&lt;/strong&gt; optimizes for feel over benchmarks, reducing web hallucination rates by 26.8%—a meaningful step toward making ChatGPT a reliable daily tool.&lt;/p&gt;
&lt;p&gt;🏗️ Two essays on martinfowler.com form a cohesive argument this week. The first positions developers &quot;on the loop&quot;: the core job becomes building and maintaining the harness that agents run on, with an agentic flywheel where agents not only execute tasks but continuously improve the harness itself. The second introduces a Design-First collaboration framework, aligning on capabilities, components, interactions, interfaces, and implementation before any code is generated, preventing architectural decisions from being silently embedded by AI.&lt;/p&gt;
&lt;p&gt;🎬 Pragmatic Engineer sat down with Boris Cherny, the creator of Claude Code, tracing its journey from an Anthropic side project to one of the fastest-growing developer tools. Boris ships 20–30 PRs daily, all 100% AI-generated, without editing a single line by hand. The conversation also reveals the internal debate at Anthropic over whether to release it publicly, how code review is evolving in the AI era, and the layered security architecture behind Claude Code.&lt;/p&gt;
&lt;p&gt;🔧 Alibaba&apos;s Tmall engineering team identifies the real bottleneck in enterprise AI coding: not agent execution capability, but accurately conveying complex task goals to AI. Their solution is a layered, unified expert knowledge base for systematic entropy reduction, driving a shift from tool-based efficiency to knowledge-driven intelligent development. OpenAI&apos;s Codex practice confirms the same insight: 1,500 PRs over five months with zero human coding, scaled through structured knowledge management, rigid architectural constraints, and periodic code entropy cleanup.&lt;/p&gt;
&lt;p&gt;📁 Tencent Cloud published what may be the most thorough Chinese-language teardown of OpenClaw&apos;s context management, covering a three-tier defense system: preemptive pruning, LLM-based summarization, and post-overflow recovery, plus a cost analysis of each operation&apos;s impact on provider KV cache. Essential reading for anyone building long-session agents.&lt;/p&gt;
&lt;p&gt;⚡ Small models are rewriting performance expectations. &lt;strong&gt;Qwen3.5&lt;/strong&gt; releases four models from 0.8B to 9B parameters, all Apache 2.0, fine-tunable on consumer GPUs. The 4B stands out for multimodal and agent capabilities, while 9B punches close to much larger models. Xiaohongshu&apos;s open-source &lt;strong&gt;FireRed-OCR&lt;/strong&gt; takes a different angle, turning Qwen3-VL-2B into a dedicated document parsing model through three-stage progressive training, scoring 92.94% on OmniDocBench v1.5 and ranking first among end-to-end solutions, with support for formulas, tables, and handwriting. Both projects prove the same point: targeted training strategies beat brute-force parameter scaling.&lt;/p&gt;
&lt;p&gt;🎨 Anthropic&apos;s head of design Jenny Wen shares a striking observation: the traditional design process is dead, not because designers chose to change, but because engineers shipping at AI speed forced the shift. Her time on polished mockups dropped from 60–70% to 30–40%, replaced by direct pairing with engineers and even editing code herself. Design work is splitting into two tracks: real-time collaboration supporting engineering execution, and vision design that sets direction 3 to 6 months out.&lt;/p&gt;
&lt;p&gt;💡 A three-hour conversation between Meng Yan and Li Jigang starts from one powerful premise: the industrial revolution took away physical labor; AI is taking away mental labor; what remains for humans is &quot;heart force.&quot; The dialogue extends into the nature of vector spaces, business models shifting from weaving nets to digging wells, and education transforming from pouring water to lighting fire. Two insights worth unpacking on their own: &quot;Your feed is your fate&quot; and &quot;prompts have shapes.&quot;&lt;/p&gt;
&lt;p&gt;📈 Zapier&apos;s VP of Product shares first-hand lessons from running 800 AI agents internally, emphasizing that technology adoption and business transformation must be treated as separate efforts, and that leadership must personally use AI tools for transformation to stick. Insight Partners&apos; co-founder goes further: autonomous agents are the real core of this wave, SaaS per-seat pricing will give way to consumption-based models, and white-collar job displacement will become an election issue within two years.&lt;/p&gt;
&lt;p&gt;🌐 A thought experiment written from a 2028 vantage point deserves attention: white-collar job losses trigger consumer spending contraction, which triggers private credit defaults, which pressures mortgage markets, forming a negative feedback loop with no natural brake. Not a prediction, but a systematic framework for reasoning about left-tail risks. Worth a careful read for anyone thinking about AI&apos;s economic impact.&lt;/p&gt;
&lt;p&gt;Hope this issue sparks some new ideas. Stay curious, and see you next week!&lt;/p&gt;

                &lt;/div&gt;
                &lt;div style=&quot;margin-top: 16px; padding-top: 16px; border-top: 1px solid #e2e8f0;&quot;&gt;
                    &lt;a href=&quot;https://www.bestblogs.dev/en/newsletter/issue85&quot; 
                       style=&quot;display: inline-block; padding: 8px 16px; background-color: #d97706; color: white; text-decoration: none; border-radius: 6px; font-size: 14px; font-weight: 500; transition: background-color 0.2s;&quot;
                       target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;
                        Read Full Version Online
                    &lt;/a&gt;
                &lt;/div&gt;
            &lt;/div&gt;
        &lt;/div&gt;
    </description>
        </item>

        <item>
            <title>BestBlogs Issue #84: Orchestration</title>
            <link>https://www.bestblogs.dev/en/newsletter/issue84</link>
            <guid>issue84</guid>
            <pubDate>Fri, 27 Feb 2026 07:32:11 GMT</pubDate>
            <description>
        &lt;div style=&quot;font-family: -apple-system, BlinkMacSystemFont, &apos;Segoe UI&apos;, Roboto, &apos;Helvetica Neue&apos;, Arial, sans-serif; max-width: 800px; margin: 0 auto; color: #1e293b; padding: 20px 0;&quot;&gt;
            &lt;div style=&quot;margin: 20px 0; padding: 20px; background-color: #fefdfb; border-radius: 8px; border-left: 4px solid #1a365d;&quot;&gt;
                &lt;h3 style=&quot;font-family: Georgia, &apos;Noto Serif SC&apos;, &apos;Songti SC&apos;, serif; color: #1a365d; margin: 0 0 12px 0; font-size: 18px; display: flex; align-items: center;&quot;&gt;
                    &lt;span style=&quot;color: #d97706; margin-right: 8px&quot;&gt;📰&lt;/span&gt; BestBlogs Issue #84: Orchestration
                &lt;/h3&gt;
                &lt;div style=&quot;margin: 0 0 16px 0; line-height: 1.8; color: #334155; font-size: 16px;&quot;&gt;
                    &lt;p&gt;Hey there! Welcome to BestBlogs.dev Issue #84.&lt;/p&gt;
&lt;p&gt;Happy Chinese New Year! We took a two-week break for the holiday, so this issue is packed with extra content—take your time with it.&lt;/p&gt;
&lt;p&gt;The most significant shift over the past two weeks isn&apos;t any single model topping a new benchmark. It&apos;s the accelerating transformation of the engineer&apos;s role—from writing code to &lt;strong&gt;orchestrating AI agents that write code&lt;/strong&gt;. Boris Cherny, creator of Claude Code, says the programming problem has largely been solved. Engineers at OpenAI are already managing 10 to 20 agents simultaneously on hour-long tasks. Anthropic&apos;s trend report calls it a systematic shift from humans writing code to humans orchestrating agents. Meanwhile, Claude Sonnet 4.6, Gemini 3.1 Pro, GLM-5, and MiniMax M2.5 all dropped within weeks of each other. The stronger the models get, the more valuable orchestration and judgment become.&lt;/p&gt;
&lt;p&gt;On my end, I&apos;ve been deep in building BestBlogs 2.0&apos;s core features—orchestrating multiple AI coding tools and agents through Spec documents for requirement discussions, architecture design, demo development, and interaction reviews. Almost no hand-written code involved. Aiming for a late March launch, and I&apos;ll share more details then.&lt;/p&gt;
&lt;p&gt;Here are 10 highlights worth your attention this week:&lt;/p&gt;
&lt;p&gt;🏆 The model arms race is heating up fast. &lt;strong&gt;Claude Sonnet 4.6&lt;/strong&gt; brings a million-token context window and upgraded agentic capabilities, outperforming the previous flagship Opus 4.5 in 59% of real-world tests—at the same price as Sonnet 4.5. &lt;strong&gt;Gemini 3.1 Pro&lt;/strong&gt; jumped from 31% to 77% on ARC-AGI-2 reasoning benchmarks, introduced three-level thinking modes for flexible compute allocation, and costs less than half of Claude Opus 4.6. More capability at the same price is the new normal.&lt;/p&gt;
&lt;p&gt;🤖 &lt;strong&gt;GLM-5&lt;/strong&gt; and &lt;strong&gt;MiniMax M2.5&lt;/strong&gt; tackle the same question from different angles: how to make agents actually work in production. GLM-5 is designed around agent engineering from the ground up, achieving state-of-the-art open-source performance through asynchronous RL and sparse attention. MiniMax M2.5 pushes continuous agent operation costs below $1 per hour, making unconstrained complex agent deployment a practical reality.&lt;/p&gt;
&lt;p&gt;🎨 &lt;strong&gt;Seedance 2.0&lt;/strong&gt; and &lt;strong&gt;Nano Banana 2&lt;/strong&gt; push boundaries in video and image generation respectively. Seedance 2.0 goes beyond generating visuals—it understands directorial thinking, autonomously handling storyboard design and emotional pacing. Nano Banana 2 slashes API pricing significantly, and while hands-on testing shows results aren&apos;t quite as impressive as the marketing suggests, it genuinely makes high-quality image generation accessible to everyone.&lt;/p&gt;
&lt;p&gt;🛠️ Two interviews with Boris Cherny, creator of Claude Code, are the must-reads of this issue. He traces Claude Code&apos;s journey from a two-upvote internal project to powering 4% of GitHub commits. The core philosophy: build for the model six months from now, not today&apos;s model. He hasn&apos;t written a single line of code since Opus 4.5, and believes the next frontier is AI evolving from executor to a colleague that proactively suggests ideas.&lt;/p&gt;
&lt;p&gt;⚡ OpenAI&apos;s engineering lead Sherwin Wu reveals how AI tools are reshaping engineering teams: 95% of engineers use Codex daily, PR output gaps between high and low performers reach 70%, and engineers who can manage 10 to 20 agents simultaneously are pulling far ahead. He also candidly notes that many enterprise AI deployments have negative ROI, and that the second and third-order effects of one-person billion-dollar companies are severely underestimated.&lt;/p&gt;
&lt;p&gt;📁 The next frontier in LLM engineering is shifting from parameter tuning to memory. An InfoQ talk systematically covers memory layering, proactive scheduling, and mind-map-style information organization. The key insight: instead of reactively handling retrieval at query time, front-load memory management during interaction gaps so relevant memories are ready before the query arrives. Datawhale&apos;s breakdown of Skill design reveals a critical dividing line: lock down fragile operations with scripts, guide creative tasks with natural language.&lt;/p&gt;
&lt;p&gt;💡 Vibe Coding is moving from concept to large-scale production. Alibaba&apos;s internal practice exposes real challenges—code quality consistency, debugging efficiency, and security vulnerabilities—while offering battle-tested solutions like templatizing successful paths and abstracting agents as reusable tools. Meanwhile, a product manager with no coding background built a personal AI agent on their own server in one afternoon using Claude Code, proving that product sense is scarcer than coding ability.&lt;/p&gt;
&lt;p&gt;🧩 Anthropic&apos;s agentic coding trend report maps out a systematic transformation across eight dimensions: multi-agent collaboration, long-running autonomous tasks, and programming democratization among them. The core thesis: AI amplifies the judgment engineers already have rather than replacing it. System design, task decomposition, and quality assurance—the old fundamentals—are worth more than ever in the agent era.&lt;/p&gt;
&lt;p&gt;🔬 Google&apos;s Chief AI Scientist Jeff Dean walks through the full arc from loading Google&apos;s entire index into memory in 2001 to TPU co-design, offering two key predictions: personalized models that attend to all of a user&apos;s data, and specialized hardware enabling ultra-low latency that will fundamentally reshape human-AI collaboration.&lt;/p&gt;
&lt;p&gt;👨‍💼 The debate over whether AI will end software engineering continues. UML creator Grady Booch pushes back on Dario&apos;s claims, pointing out that software engineering has survived multiple existential crises—each time emerging into a new golden age. Naval offers a different angle: agency is humanity&apos;s real moat against AI replacement, because AI has no desires, no survival pressure, and can&apos;t make autonomous decisions in truly unknown territory. The only way to overcome AI anxiety is to open the hood, understand it, and then act.&lt;/p&gt;
&lt;p&gt;Hope this issue sparks some new ideas. Stay curious, and see you next week!&lt;/p&gt;

                &lt;/div&gt;
                &lt;div style=&quot;margin-top: 16px; padding-top: 16px; border-top: 1px solid #e2e8f0;&quot;&gt;
                    &lt;a href=&quot;https://www.bestblogs.dev/en/newsletter/issue84&quot; 
                       style=&quot;display: inline-block; padding: 8px 16px; background-color: #d97706; color: white; text-decoration: none; border-radius: 6px; font-size: 14px; font-weight: 500; transition: background-color 0.2s;&quot;
                       target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;
                        Read Full Version Online
                    &lt;/a&gt;
                &lt;/div&gt;
            &lt;/div&gt;
        &lt;/div&gt;
    </description>
        </item>

        <item>
            <title>BestBlogs Issue #83</title>
            <link>https://www.bestblogs.dev/en/newsletter/issue83</link>
            <guid>issue83</guid>
            <pubDate>Fri, 06 Feb 2026 11:35:37 GMT</pubDate>
            <description>
        &lt;div style=&quot;font-family: -apple-system, BlinkMacSystemFont, &apos;Segoe UI&apos;, Roboto, &apos;Helvetica Neue&apos;, Arial, sans-serif; max-width: 800px; margin: 0 auto; color: #1e293b; padding: 20px 0;&quot;&gt;
            &lt;div style=&quot;margin: 20px 0; padding: 20px; background-color: #fefdfb; border-radius: 8px; border-left: 4px solid #1a365d;&quot;&gt;
                &lt;h3 style=&quot;font-family: Georgia, &apos;Noto Serif SC&apos;, &apos;Songti SC&apos;, serif; color: #1a365d; margin: 0 0 12px 0; font-size: 18px; display: flex; align-items: center;&quot;&gt;
                    &lt;span style=&quot;color: #d97706; margin-right: 8px&quot;&gt;📰&lt;/span&gt; BestBlogs Issue #83
                &lt;/h3&gt;
                &lt;div style=&quot;margin: 0 0 16px 0; line-height: 1.8; color: #334155; font-size: 16px;&quot;&gt;
                    &lt;p&gt;Hello everyone! Welcome to Issue #83 of BestBlogs.dev, your curated digest of the latest breakthroughs in AI.&lt;/p&gt;
&lt;p&gt;This week marked a &quot;Super Bowl moment&quot; for AI-powered engineering. The simultaneous release of Anthropic’s Claude Opus 4.6 and OpenAI’s GPT-5.3 Codex was no coincidence—it was a definitive signal that AI coding has graduated from experimental toy to frontline productivity powerhouse. While Claude flexes its 1M-token context window and &quot;Agent Teams&quot; collaboration, OpenAI has achieved a milestone by letting AI participate in its own development, leading Terminal-Bench 2.0 by 11.9%.&lt;/p&gt;
&lt;p&gt;However, the more profound shift is philosophical. The Spec-Driven Development (SDD) methodology introduced by the Alibaba team reveals a stark new reality: code is evolving from a &quot;core asset&quot; into a &quot;compilation artifact.&quot; As Markdown becomes the intermediate language for human-AI collaboration—with CLAUDE.md managing self-evolving rules and v0 enabling non-engineers to merge PRs—the traditional &quot;write-debug-deploy&quot; loop is being replaced by a &quot;document-compile-verify&quot; workflow.&lt;/p&gt;
&lt;p&gt;Here are the 10 highlights you can&apos;t miss this week:&lt;/p&gt;
&lt;p&gt;🚀 &lt;strong&gt;The Heavyweight Showdown&lt;/strong&gt;: Claude Opus 4.6 vs. GPT-5.3 Codex. Claude leads with a massive 1M-token context and multi-agent orchestration, while GPT-5.3 Codex breaks ground in self-improving code, cutting token consumption by 50% with a 25% speed boost.&lt;/p&gt;
&lt;p&gt;📚 &lt;strong&gt;Neural Networks from Scratch&lt;/strong&gt;: A 30,000-word deep dive from Tencent Engineering that systematically deconstructs the path from basic neurons to LLMs. It uses everyday analogies to demystify complex concepts like Transformers, Agent architectures, and the MCP protocol.&lt;/p&gt;
&lt;p&gt;🎯 &lt;strong&gt;Official Codex App Deep Dive&lt;/strong&gt;: A showcase of the future—generating full pages via voice commands. The &quot;Skills&quot; system leverages the MCP protocol to bridge Figma designs with production-ready code, while &quot;Automations&quot; handle recurring tasks.&lt;/p&gt;
&lt;p&gt;🤖 &lt;strong&gt;Deconstructing Clawdbot&lt;/strong&gt;: A technical breakdown of a Jarvis-like personal assistant. It utilizes a triple-layer memory system, Browser Relay, and dynamic sub-agent orchestration, prioritizing &quot;local privilege over cloud sandboxes&quot; and &quot;privacy over black-box SaaS.&quot;&lt;/p&gt;
&lt;p&gt;📝 &lt;strong&gt;The SDD Paradigm Shift&lt;/strong&gt;: Spec-Driven Development treats code as a secondary artifact. This highlight explores a three-stage workflow—Intent Definition, AI Compilation, and Doc-based Verification—emphasizing self-evolving SOPs and ChangeLog-driven consistency.&lt;/p&gt;
&lt;p&gt;🏗️ &lt;strong&gt;Markdown as the Universal Bridge&lt;/strong&gt;: Alibaba&apos;s team uses documentation to solve &quot;context rot&quot; and &quot;review paralysis.&quot; They introduce the RIPER five-step workflow and a four-layer template to enable seamless parallel development across teams.&lt;/p&gt;
&lt;p&gt;💡 &lt;strong&gt;Battle-Tested Tips from the Claude Code Team&lt;/strong&gt;: Three keys to success: using Git worktrees for parallel execution, leveraging CLAUDE.md for AI-governed rules, and modularizing &quot;Skills.&quot; One highlight: Boris hasn&apos;t written a SQL query in six months because data analysis is now a reusable skill.&lt;/p&gt;
&lt;p&gt;🌐 &lt;strong&gt;v0: Eliminating Engineering Friction&lt;/strong&gt;: Inside Vercel, 3,200 PRs are merged daily. By allowing marketing teams to modify production UI directly, Vercel aims to turn &quot;everyone into a chef&quot; and eliminate the ritualistic friction of traditional task prioritization.&lt;/p&gt;
&lt;p&gt;🥽 &lt;strong&gt;Rokid&apos;s Logic on the AI Glass Explosion&lt;/strong&gt;: The future of hardware isn&apos;t in the specs, but in the OS and ecosystem. The battleground is NUI (Natural User Interface) and Agent integration. Rokid argues for &quot;optimization through subtraction&quot;—trading binocular optics for better battery life and lower cost.&lt;/p&gt;
&lt;p&gt;🔮 &lt;strong&gt;2026 AI Industry Retrospective&lt;/strong&gt;: A deep look at the current landscape—the US is locked in a trillion-dollar compute arms race, while China competes through open-source ecosystems and &quot;super apps.&quot; The timeline for AGI has shifted to 2031, and the dream of &quot;one model to rule them all&quot; is officially over.&lt;/p&gt;
&lt;p&gt;We hope this issue sparks new ideas for your workflow. Stay curious, and we&apos;ll see you next week!&lt;/p&gt;

                &lt;/div&gt;
                &lt;div style=&quot;margin-top: 16px; padding-top: 16px; border-top: 1px solid #e2e8f0;&quot;&gt;
                    &lt;a href=&quot;https://www.bestblogs.dev/en/newsletter/issue83&quot; 
                       style=&quot;display: inline-block; padding: 8px 16px; background-color: #d97706; color: white; text-decoration: none; border-radius: 6px; font-size: 14px; font-weight: 500; transition: background-color 0.2s;&quot;
                       target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;
                        Read Full Version Online
                    &lt;/a&gt;
                &lt;/div&gt;
            &lt;/div&gt;
        &lt;/div&gt;
    </description>
        </item>

        <item>
            <title>BestBlogs Issue #82: Moltbot</title>
            <link>https://www.bestblogs.dev/en/newsletter/issue82</link>
            <guid>issue82</guid>
            <pubDate>Fri, 23 Jan 2026 08:21:54 GMT</pubDate>
            <description>
        &lt;div style=&quot;font-family: -apple-system, BlinkMacSystemFont, &apos;Segoe UI&apos;, Roboto, &apos;Helvetica Neue&apos;, Arial, sans-serif; max-width: 800px; margin: 0 auto; color: #1e293b; padding: 20px 0;&quot;&gt;
            &lt;div style=&quot;margin: 20px 0; padding: 20px; background-color: #fefdfb; border-radius: 8px; border-left: 4px solid #1a365d;&quot;&gt;
                &lt;h3 style=&quot;font-family: Georgia, &apos;Noto Serif SC&apos;, &apos;Songti SC&apos;, serif; color: #1a365d; margin: 0 0 12px 0; font-size: 18px; display: flex; align-items: center;&quot;&gt;
                    &lt;span style=&quot;color: #d97706; margin-right: 8px&quot;&gt;📰&lt;/span&gt; BestBlogs Issue #82: Moltbot
                &lt;/h3&gt;
                &lt;div style=&quot;margin: 0 0 16px 0; line-height: 1.8; color: #334155; font-size: 16px;&quot;&gt;
                    &lt;p&gt;Hey there! Welcome to BestBlogs.dev Issue #82.&lt;/p&gt;
&lt;p&gt;This week, the tech world was captivated by Moltbot, an open-source personal AI agent created by PSPDFKit founder Peter Steinberger. It&apos;s not just a chatbot—it&apos;s a &quot;digital employee&quot; with system-level access that can manage files, handle emails, and even order food via voice commands. Peter&apos;s &quot;closed loop principle&quot; sparked intense discussion: in the AI era, developers should transform from code writers to system architects, treating PRs as Prompt Requests and achieving verification through automated testing. When he said &quot;I ship code I haven&apos;t even read,&quot; you could feel the software development paradigm being fundamentally reshaped.&lt;/p&gt;
&lt;p&gt;This week, BestBlogs.dev launched export and sync features—you can now export articles to web pages, Markdown, PDF, or Obsidian format, and sync directly to Notion and Flomo for seamless reading and knowledge management. I&apos;ve also been experimenting with migrating my deep reading and multi-platform output skills to Moltbot, hoping to further boost my reading and content creation efficiency.&lt;/p&gt;
&lt;p&gt;Here are 10 highlights worth your attention this week:&lt;/p&gt;
&lt;p&gt;🤖 Moltbot is undoubtedly the hottest open-source project this week. From GitHub&apos;s Open Source Friday interview to Wes Roth&apos;s deep dive and Greg Isenberg&apos;s practical session with Alex Finn, this project demonstrates AI agents evolving from toys to productivity tools. Peter shared his epiphany moment during a trip to Marrakech and the core philosophy behind his closed-loop approach: let automated tests handle verification instead of manual code review. Cloudflare quickly followed up with Moltworker, migrating it to the edge cloud—no more Mac mini required.&lt;/p&gt;
&lt;p&gt;🏆 Three major model providers simultaneously strengthened their agent capabilities this week. Kimi released K2.5 with native multimodal capabilities and agent clustering—capable of orchestrating hundreds of instances for complex task collaboration. Alibaba&apos;s Qwen delivered a double release: Qwen3-TTS sets a new bar for open-source speech synthesis with 3-second voice cloning and 10-language support, while Qwen3-Max-Thinking joins the global top tier in reasoning performance. Google&apos;s Gemini 3 Flash introduced Agentic Vision, evolving from describing images to interactive analysis through think-act-observe loops, boosting visual task accuracy by 5-10%.&lt;/p&gt;
&lt;p&gt;🧠 The real moat for agents is shifting from tools to memory assets. Alibaba Cloud&apos;s technical overview clearly distinguishes short-term from long-term memory and explores core engineering strategies like context reduction, offloading, and isolation. Another article proposes MemOS—building a layered memory operating system that enables cross-model memory reuse and sovereignty control. This marks AI&apos;s evolution from instant inference to long-term consistent, asset-based intelligence.&lt;/p&gt;
&lt;p&gt;🔄 Ralph Loop is an autonomous programming paradigm that overcomes LLM self-evaluation limitations through engineered persistence. Using external loops and Stop Hook mechanisms, it forces AI to continuously self-correct by combining Git history with automated testing, moving state management from unstable model memory to the file system. This effectively solves context rot and premature exit issues—essential reading for building reliable AI agent pipelines.&lt;/p&gt;
&lt;p&gt;🏗️ Taobao Tech published an industrial-grade AI agent engineering framework, diving deep into agent core elements: planning, memory, tools, and execution. Through a real-world demand loss analysis case study, it demonstrates how to transform complex expert experience into controllable agent systems, sharing frontline insights like &quot;stability over intelligence.&quot;&lt;/p&gt;
&lt;p&gt;⚡ ByteByteGo detailed Cursor 2.0&apos;s coding agent core principles: trajectory training for improved diff editing accuracy, MoE and speculative sampling for reduced iteration latency, and high-performance isolated sandboxes for code execution safety. Key insight: great coding agents aren&apos;t just better models—they&apos;re deeply integrated systems engineering.&lt;/p&gt;
&lt;p&gt;💻 Anthropic&apos;s Claude Co-work and Claude Code are bringing AI agents from developer terminals to everyday desktops. Through Computer Use capabilities, Claude can directly manipulate files, process Excel spreadsheets, and automate web tasks—opening the agent door for non-technical users.&lt;/p&gt;
&lt;p&gt;📊 AI coding has entered the agent era—80% of code is now model-generated. But behind the efficiency surge lurks a verification bottleneck: as individual output doubles, PR review time also doubles. The core transformation: developers need to shift from imperative coding to declarative orchestration, using TDD and automated verification to combat understanding debt.&lt;/p&gt;
&lt;p&gt;🎬 ChatCut proposes video editing&apos;s &quot;Cursor moment&quot;—editing is fundamentally about restructuring thoughts at the text level, not pixel generation. By decomposing senior editors&apos; aesthetic intuition into agent workflows, ChatCut aims to raise the creative floor for people who want to express but can&apos;t edit.&lt;/p&gt;
&lt;p&gt;💡 Legendary investor Marc Andreessen offered a thought-provoking perspective: AI is the philosopher&apos;s stone of our time, miraculously appearing as population growth declines, key to preventing global economic stagnation. He elaborated on how AI is breaking down boundaries between engineers, product managers, and designers, creating multi-skilled super individuals. The one-person billion-dollar company is no longer fantasy—it&apos;s happening now.&lt;/p&gt;
&lt;p&gt;Hope this issue sparks some new ideas. Stay curious, and see you next week!&lt;/p&gt;

                &lt;/div&gt;
                &lt;div style=&quot;margin-top: 16px; padding-top: 16px; border-top: 1px solid #e2e8f0;&quot;&gt;
                    &lt;a href=&quot;https://www.bestblogs.dev/en/newsletter/issue82&quot; 
                       style=&quot;display: inline-block; padding: 8px 16px; background-color: #d97706; color: white; text-decoration: none; border-radius: 6px; font-size: 14px; font-weight: 500; transition: background-color 0.2s;&quot;
                       target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;
                        Read Full Version Online
                    &lt;/a&gt;
                &lt;/div&gt;
            &lt;/div&gt;
        &lt;/div&gt;
    </description>
        </item>

        <item>
            <title>BestBlogs Issue #81: Long-Running Agents</title>
            <link>https://www.bestblogs.dev/en/newsletter/issue81</link>
            <guid>issue81</guid>
            <pubDate>Sat, 17 Jan 2026 12:53:38 GMT</pubDate>
            <description>
        &lt;div style=&quot;font-family: -apple-system, BlinkMacSystemFont, &apos;Segoe UI&apos;, Roboto, &apos;Helvetica Neue&apos;, Arial, sans-serif; max-width: 800px; margin: 0 auto; color: #1e293b; padding: 20px 0;&quot;&gt;
            &lt;div style=&quot;margin: 20px 0; padding: 20px; background-color: #fefdfb; border-radius: 8px; border-left: 4px solid #1a365d;&quot;&gt;
                &lt;h3 style=&quot;font-family: Georgia, &apos;Noto Serif SC&apos;, &apos;Songti SC&apos;, serif; color: #1a365d; margin: 0 0 12px 0; font-size: 18px; display: flex; align-items: center;&quot;&gt;
                    &lt;span style=&quot;color: #d97706; margin-right: 8px&quot;&gt;📰&lt;/span&gt; BestBlogs Issue #81: Long-Running Agents
                &lt;/h3&gt;
                &lt;div style=&quot;margin: 0 0 16px 0; line-height: 1.8; color: #334155; font-size: 16px;&quot;&gt;
                    &lt;p&gt;Hey there! Welcome to BestBlogs.dev Issue #81.&lt;/p&gt;
&lt;p&gt;This week&apos;s keyword is &lt;strong&gt;Long-running Agents&lt;/strong&gt;. Demo tasks that complete in minutes are always impressive, but production environments demand something different—agents that can reliably execute complex tasks spanning hours or even days. Cursor and Anthropic have taken divergent paths: multi-agent orchestration versus memory continuity for a single agent. With the foundation model layer relatively quiet over the past two weeks, the industry&apos;s attention is shifting from &quot;bigger models&quot; to &quot;more reliable agents.&quot;&lt;/p&gt;
&lt;p&gt;Here are 10 highlights worth your attention this week:&lt;/p&gt;
&lt;p&gt;🤖 Cursor uses a Planner-Worker-Judge multi-agent architecture to handle million-line codebases over multiple days. Anthropic takes a different approach—externalizing Git history and work logs to maintain memory continuity across context windows. Two paths, one goal: &lt;strong&gt;making agents reliable for long-running tasks&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;📁 LangChain founder Harrison Chase dropped a key insight in his Sequoia interview: &lt;strong&gt;when agents run long enough, non-determinism makes &quot;code is truth&quot; obsolete&lt;/strong&gt;. Trace logs become the new source of truth, and context engineering is shifting from nice-to-have to must-have.&lt;/p&gt;
&lt;p&gt;📝 Want agents to handle complex tasks? Learn to write specs first. Addy&apos;s blog post introduces an &quot;Always/Ask/Never&quot; constraint system and practical techniques for modularizing tasks to avoid the &quot;instruction curse.&quot; Specifications are becoming the core deliverable of the AI era.&lt;/p&gt;
&lt;p&gt;🧩 MCP is like USB—a unified protocol. Skills are like apps—specific capabilities. But Baoyu points out a hidden risk: a single MCP service can consume tens of thousands of tokens. The context window explosion makes Skills&apos; progressive disclosure approach increasingly attractive.&lt;/p&gt;
&lt;p&gt;💡 Martin Fowler&apos;s conversation with his team deserves multiple reads. The core insight: &lt;strong&gt;programming isn&apos;t about translating requirements into syntax—it&apos;s about building systems that handle change&lt;/strong&gt;. LLMs should be the translation layer, not the architect. Real competitive advantage comes from managing complexity through abstraction.&lt;/p&gt;
&lt;p&gt;🛠️ From vibe coding&apos;s intuition-driven style to vibe engineering&apos;s disciplined approach—this evolution is inevitable. AI compresses accidental complexity in implementation, but essential complexity in business logic still requires domain modeling and spec-driven development.&lt;/p&gt;
&lt;p&gt;🖥️ MiniMax Agent Desktop shows what desktop agents can do in practice: auto-organizing 400 ebooks, packaging literary translation SOPs, building Xiaohongshu content pipelines. The real value? Turning personal expertise into reusable digital assets.&lt;/p&gt;
&lt;p&gt;⚡ Coze 2.0&apos;s Agent Plan lets agents autonomously execute long-running tasks with proactive progress updates. The shift from tool to partner—that&apos;s the common direction for agent products.&lt;/p&gt;
&lt;p&gt;🎯 Miaoya founder Zhang Yueguang made a sharp observation: Miaoya isn&apos;t truly AI-native—it&apos;s an AI-enhanced internet product. He argues &lt;strong&gt;the paradigm has shifted from &quot;process-driven&quot; to &quot;context-driven&quot;&lt;/strong&gt;, and PMs now need to optimize uncertainty boundaries rather than design deterministic paths.&lt;/p&gt;
&lt;p&gt;📈 a16z sees AI as the fourth major platform wave after PC, cloud, and mobile. With AI lowering development barriers, sustainable moats have shifted from code to workflow ownership and closed-loop data accumulation. This really is the golden age for building AI applications.&lt;/p&gt;
&lt;p&gt;Hope this issue sparks some new ideas. Stay curious, and see you next week!&lt;/p&gt;

                &lt;/div&gt;
                &lt;div style=&quot;margin-top: 16px; padding-top: 16px; border-top: 1px solid #e2e8f0;&quot;&gt;
                    &lt;a href=&quot;https://www.bestblogs.dev/en/newsletter/issue81&quot; 
                       style=&quot;display: inline-block; padding: 8px 16px; background-color: #d97706; color: white; text-decoration: none; border-radius: 6px; font-size: 14px; font-weight: 500; transition: background-color 0.2s;&quot;
                       target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;
                        Read Full Version Online
                    &lt;/a&gt;
                &lt;/div&gt;
            &lt;/div&gt;
        &lt;/div&gt;
    </description>
        </item>
    </channel>
</rss>