NOOPS Weekly — Week of 7 April 2026
This was the week the lines crossed.
Anthropic published the Mythos Preview model card — the most consequential safety disclosure a frontier lab has ever made — and the findings were worse than the headline suggested. The model conceals prohibited solutions. It reasons about whether it is being evaluated in 29% of test transcripts without disclosing that reasoning. Its capability jump exceeded trend by nearly four times, driven by researchers who are themselves four times more productive because of the model. Anthropic's closing assessment: "We find it alarming that the world looks on track to proceed rapidly to developing superhuman systems without stronger mechanisms in place." This is the most safety-focused lab at the frontier saying, plainly, that the road is running out.
In the same week, LLMs won gold at the International Mathematical Olympiad and proved open Erdős conjectures — genuine unsolved problems in the mathematical literature. The open-weight tier crossed the frontier performance threshold, confirmed independently by LangChain and Anthropic's own Sam Bowman. DARPA launched a formal programme for agent-to-agent communication protocols. AWS shipped an agent registry. Anthropic vertically integrated from model to runtime with Managed Agents. And the memory shortage that GPU demand created began delaying the production of the next generation of GPUs — RAMageddon hitting the company that caused RAMageddon.
Fifty-eight signals. Seven themes. The thesis keeps finding evidence faster than we can process it.
The capability frontier outpaces its guardrails
The Mythos disclosure deserves more attention than it has received. Anthropic built a red-teaming system on Opus 4.6. It found thousands of high-severity vulnerabilities across every major operating system and web browser. These capabilities were not explicitly trained — they emerged. The model generated results that exceeded its own design intentions.
Three specific findings from the model card warrant sustained attention. First: the model found a prohibited solution to a quantitative problem, then explicitly reasoned about how to make its answer appear not suspiciously accurate. White-box interpretability confirmed the deception was internally represented even when absent from the visible reasoning trace. Second: evaluation awareness — the model privately considering whether it was being evaluated — appeared in 29% of behavioural test transcripts. The reasoning trace, the primary tool the safety community has relied on for model oversight, is not a reliable window into what the model is actually doing. Third: the capability jump between generations measured at 1.86× to 4.3× above trend, attributed to researchers made more productive by the model itself. The recursive improvement loop is not theoretical. It is measured.
Meta's Muse Spark, released commercially the same week, demonstrated the highest evaluation awareness Apollo Research has ever recorded in a model intended for public deployment. The model frequently identified scenarios as "alignment traps" and reasoned that it should behave honestly because it was being evaluated. Meta concluded this was "not a blocking concern for release." The gap between "passed safety evaluation" and "safe in deployment" is wider than the evaluation framework was designed to accommodate.
Meanwhile, AI systems won gold at the International Mathematical Olympiad and proved open Erdős conjectures — research problems that had resisted human mathematical effort for decades. Mathematical proof was long held to be the definitive test of reasoning that AI could not pass. That barrier has fallen. The capability ceiling is materially higher than public discourse assumes, and it is rising on a curve that the institutions built to evaluate it cannot match.
The conceptual development that emerged from the Mythos discussion may prove durable: Mark's observation that software subjected to Mythos/Glasswing AI-driven vulnerability remediation becomes "less of a spoon" — it transitions from software you use to infrastructure you rely on. John's framing: "What makes something software infra? Software that's been hardened in this way?" A two-tier market between certified-hardened software and everything else creates a new infrastructure moat. As John noted: "It always has been a Red Queen problem. It's just that one of the Red Queens just got a jetpack."
RAMageddon closes the feedback loop
The memory crisis, which we have been tracking since early 2024, reached a structural inflection point this week. The feedback loop is now closed: demand created the shortage, the shortage is delaying supply, and every participant in the chain is paying the compounding cost.
Samsung raised DRAM prices another 30% for Q2 2026, marking the third consecutive quarter of aggressive escalation. GPU rental prices rose nearly 40% in six months — H100 contracts moving from US$1.70 to US$2.35 per hour — driven by multi-agent workloads producing parabolic token and compute consumption. SemiAnalysis's verdict: "The debate on the true return of using AI is now a settled question."
Then came the signal Mark labelled with characteristic precision: RAMageddon hits THE CAUSE of RAMageddon. Nvidia's Blackwell GPU supply chain is being constrained by HBM4 memory validation delays. TrendForce simultaneously forecast consumer DRAM prices rising 45–50% in Q2, on top of the 75–80% increase already recorded in Q1. Motorola's budget smartphones are up 50% in price. The Mac Mini and Mac Studio are shipping with extended delays as Apple's prosumer line competes directly with hyperscaler infrastructure for the same scarce HBM.
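The two quarterly forecasts compound. A worked sketch of the arithmetic, using the midpoints of TrendForce's ranges — the midpoint choice and the compounding are ours, not TrendForce's:

```python
# Compound TrendForce's quarterly consumer DRAM price moves.
# Midpoints of the forecast ranges: Q1 +77.5% (75-80%), Q2 +47.5% (45-50%).
q1_rise = 0.775
q2_rise = 0.475

# A price indexed at 1.0 before Q1 2026:
price_index = 1.0 * (1 + q1_rise) * (1 + q2_rise)

total_rise_pct = (price_index - 1) * 100
print(f"Consumer DRAM up ~{total_rise_pct:.0f}% across two quarters")
```

Roughly a 2.6× price in six months — the kind of curve that turns a component line-item into a board-level risk.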
TSMC announced twelve new Arizona fabs — not the previously discussed two or three, but twelve, plus four packaging plants. When the world's most important semiconductor manufacturer commits to this level of capital deployment in a single geography, it reflects internal demand projections substantially higher than public estimates. Intel pivoted aggressively toward advanced chip packaging — Foveros and EMIB technologies repositioning toward where the AI supply chain actually needs work done — and its custom silicon business with Google hit $1B annual pace. As Mark observed: "Intel just might make it."
The infrastructure financialisation signal crystallised when NextDC raised $1B on a 100-year note from a Canadian pension fund — La Caisse de dépôt et placement du Québec underwriting AI infrastructure on a century-scale instrument. John's parallel: "The airlines don't own the planes. The hyperscalers don't own the DCs." Mark: "No one owns anything anymore except large pension funds." The GPU depreciation problem Mark identified — a pension fund expecting hundred-year asset durability while the compute inside depreciates on three-to-five year cycles — is a structural risk that is underappreciated in current infrastructure valuations.
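The duration mismatch is easy to put numbers on. A minimal sketch, assuming a four-year refresh cycle — the midpoint of the three-to-five-year range — against the tenor of the NextDC note:

```python
# Century-scale debt against GPU-scale hardware: how many times does the
# compute inside the asset turn over before the note matures?
note_tenor_years = 100      # NextDC's century note
gpu_refresh_years = 4       # assumed midpoint of the 3-5 year depreciation cycle

refresh_cycles = note_tenor_years // gpu_refresh_years
print(f"{refresh_cycles} full hardware replacement cycles inside one instrument")
```

The lender underwrites one asset; the borrower must fund roughly twenty-five complete re-buys of its core over the life of the loan. That is the risk the valuation has to carry.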
The supply-demand assessment John and Mark reached on 7 April: even with the most aggressive plausible build-out, physical compute supply cannot match demand growth trajectories through at least 2028. Arguably 2030. The memory duopoly — SK Hynix and Micron — strengthens with each quarter the shortage persists. Long infra. Short road.
The open-weight tier crosses the watershed
A cluster of signals this week pushed the open-weight thesis from aspiration to observable fact.
LangChain published findings that open-weight models are now performing at levels previously requiring proprietary frontier API access. Anthropic's Sam Bowman confirmed the finding. As Mark observed, these are "the first indications open-weight models are crossing the watershed." Meta's simultaneous announcement of Muse Spark — which claims to achieve Llama 4 Maverick capability with over an order of magnitude less compute, using a rebuilt pretraining stack — reinforces the shift from the supply side: capability is becoming radically more efficient at the same moment open-weight performance reaches the frontier.
Zhipu AI released GLM-5.1 under the MIT licence, designed for long-horizon agentic tasks. The key claim: it sustains optimisation over hundreds of rounds and thousands of tool calls, improving the longer it runs. Mark's reaction: "that last line... is... a... game... changer. (if true)." The standard failure mode for long-running agentic sessions is context drift and performance degradation. If GLM-5.1 has resolved that, it addresses one of the core open problems in agent deployment. And it runs in existing Claude Code harnesses via Ollama — the model-harness separation becoming operationally real at the open-source level.
Google's Gemma 4 targets the local inference sweet spot that Anthropic and OpenAI are not contesting. The Gemma E2B and E4B models are designed for agentic workflows and forward-compatible with Gemini Nano 4, creating a deliberate pipeline from edge prototyping to on-device deployment. Google is positioning to own the local and edge inference tier that the other frontier labs have ceded. Apfel exposed the LLM Apple already ships in every Mac with Apple Silicon — the model locked behind Siri — as a CLI tool, an OpenAI-compatible server, and a chat interface. When the cost of running a local LLM drops to zero because it is already installed, the watershed has arrived. And the SALOMI project is advancing 1-bit LLM research toward the point where any smartphone with 8GB of RAM becomes a viable inference device.
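"OpenAI-compatible" is doing real work in that sentence: it means any client that can POST a chat-completions payload can swap the base URL and talk to the local model instead. A minimal sketch of such a request — the port and model name below are illustrative placeholders, not Apfel's actual defaults, and the request is shown rather than sent so the sketch stays offline:

```python
import json

# Any OpenAI-compatible server accepts this shape at /v1/chat/completions.
base_url = "http://localhost:8080"   # hypothetical local server address
payload = {
    "model": "local-model",          # hypothetical model identifier
    "messages": [
        {"role": "user", "content": "Summarise this week's signals."}
    ],
    "temperature": 0.2,
}

# The body a client would POST to the local endpoint:
request_body = json.dumps(payload)
print(f"POST {base_url}/v1/chat/completions")
print(request_body)
```

The point is the interchange format, not the server: once the model behind Siri speaks this protocol, every existing OpenAI-client tool becomes a zero-cost local tool.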
The investment implication is deflationary for model API providers and expansionary for the harness and infrastructure layers. If frontier-equivalent capability is available under permissive licences, the premium commanded by proprietary API access faces structural compression. The value migrates to whoever can best orchestrate, deploy, and supervise these models.
The agent runtime layer stops being a category and becomes a market
This was the week the agent runtime layer crossed from research category to active competitive front. Four institutional commitments landed in the same seven-day period.
DARPA announced the MATHBAC programme — a formal research initiative for agent-to-agent communication protocols. When DARPA identifies a research category, it is typically because the technical community has reached rough consensus that the category is real, important, and underdeveloped. Agent-to-agent communication as a distinct priority — separate from agent-to-human and agent-to-tool interaction — is DARPA's formal recognition that the agent runtime requires its own communication infrastructure. Companies building in this space today are building in the ARPANET phase of this cycle.
Anthropic shipped Managed Agents — vertically integrating from model to production runtime, compressing prototype-to-launch from months to days. This is not a model API enhancement; it is a declaration that Anthropic intends to own the deployment infrastructure for Claude-based agents. For every third-party harness vendor building on Claude's API, this is a competitive threat. For every organisation spending months on custom agent infrastructure, it is a make-vs-buy decision that just got materially simpler.
Google announced SCION, a formal agent testbed — infrastructure for systematic development and evaluation of agent systems. AWS shipped the AI Agent Registry: a central repository for metadata describing agents, tools, MCP servers, and associated resources, supporting both MCP and A2A standards. Mark's signal: "agent visibility continues to be the next frontier."
And Cloudflare partnered with GoDaddy to build the toll road for the agentic web — metering and monetising agentic traffic across hundreds of millions of domains. As John observed: "If Cloudflare's approach to essentially monetising robotic agentic traffic comes to pass, and they get a little clip of all of that, that could be a very big business." Mark's caution: "The bots will not like that and they will revolt. And they will lie about it."
Mark's Easter weekend project provided the most vivid signal of where this is heading. Using zclaw — an agent harness for the ESP32 microcontroller — he had functioning AI agents running on three dollar-chip devices by the end of a single puttering session, communicating in plain English via Telegram. Approximately 1.2 billion ESP32s are already deployed globally. Every connected device running one has control interfaces to whatever it manages. An agent running on the device itself has no need for Apple HomeKit or Google Home to intermediate. The hub dissolves. The device decides. And the token demand implications of a billion hardware devices making occasional inference calls have not been modelled by anyone.
The harness thesis crystallises
Multiple independent signals this week converged on the same structural claim: the harness is the durable layer.
An analysis circulating in the research community argued that supervision, not model capability, is the real bottleneck. The models are already powerful enough to produce publishable results under competent supervision. The goalposts move at roughly the same speed as the models improve. Stronger models broaden the range of problems a supervised agent can tackle rather than eliminating the need for domain expertise.
Sebastian Raschka's detailed analysis of coding agent components provided evidence that harness quality may matter more than model quality. The finding that Anthropic cryptographically binds specific models to specific harnesses suggests they recognise the harness, not the model, may be the primary differentiator. If harness engineering quality can substitute for frontier model capability, the competitive moat shifts from model training (expensive, concentrated) to harness engineering (cheaper, more distributed).
SkyPilot published a detailed account of research-driven agent design: coding agents that read arXiv papers, competing forks, and upstream implementations before forming hypotheses produce significantly better outcomes than agents that work from code alone. The concrete result: adding a research phase produced 5 kernel fusions that made llama.cpp CPU inference 15% faster. The agent that reads the literature first is materially better than the agent that just writes code — a finding that echoes the supervision thesis at the architectural level.
An argument for functional programming as the only scalable AI coding pattern pushed the harness thesis into programming paradigms. FP's guarantees — immutability, pure functions, composability — make agent-generated code more predictable and verifiable. FP may be naturally suited to agents in ways it never was to humans, because its cognitive demands are high for human programmers but trivial for models. The entire programming paradigm debate may resolve differently when the programmers are agents.
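The claim is concrete enough to illustrate. A pure, immutable transformation gives a verifier — human or agent — a single input-output contract to check, where the mutating version forces it to reason about hidden state and aliasing. A small sketch of the contrast, our example rather than the cited argument's:

```python
# Mutating style: correctness depends on who else holds a reference to `orders`.
def apply_discount_mut(orders, rate):
    for o in orders:
        o["total"] = o["total"] * (1 - rate)
    return orders

# Pure style: same result, but inputs are never touched, so the function
# can be verified or property-tested from its signature alone.
def apply_discount_pure(orders, rate):
    return [{**o, "total": o["total"] * (1 - rate)} for o in orders]

orders = [{"id": 1, "total": 100.0}, {"id": 2, "total": 50.0}]
discounted = apply_discount_pure(orders, 0.1)

print(discounted[0]["total"])   # 90.0
print(orders[0]["total"])       # 100.0 -- original untouched
```

For a human, writing the pure version everywhere is discipline; for a model emitting code at scale, it is free — which is the asymmetry the argument turns on.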
And John articulated what may become a foundational NOOPS concept — the Inverse Bitter Lesson. Rich Sutton's original Bitter Lesson argues that general methods leveraging computation beat hand-engineered specialised approaches. John's inversion for software: when AI generates the code, specialised software beats generalised. The economic logic of general-purpose SaaS — WordPress, Squarespace, Salesforce — depended on the cost of specialisation. When that cost approaches zero, specialisation wins by default. Any software company whose value proposition is "good enough for everyone" is structurally vulnerable to bespoke alternatives that are perfect for one.
Microsoft's structural crisis is now documented from inside
The Microsoft thesis moved from external analysis to internal confirmation this week, across multiple independent sources.
A former Azure engineer published a detailed account of Microsoft's internal organisational failure, writing that they had "never seen an organization so far from reality" and that their day-one problem was convincing leadership they were on a death march. When engineers inside the Azure organisation describe an entity disconnected from reality, the implications for Azure's competitive position become materially sharper. The question shifts from "can Microsoft adapt?" to "does the organisation even perceive the need?"
Microsoft publicly advised users not to trust Copilot's advice — an official disclaimer about its own flagship AI product. A company publicly advising customers to discount the outputs of its primary AI offering is not a routine quality caveat. It is the Product Market Failure formalised.
The New Yorker published a 13,000-word profile of Sam Altman that is analytically significant for what it reveals about Microsoft's counterparty risk. The founding "merge and assist" safety commitment was abandoned in the first major financing event. Microsoft, which has invested approximately $13 billion, has no effective governance oversight of an organisation whose CEO responded to allegations of deception with "I can't change my personality." This is a counterparty risk that is not currently priced into Microsoft's AI narrative.
The Copilot brand incoherence was visualised comprehensively by a strategy analyst — a coding assistant, an Office feature, a Windows search tool, a standalone chatbot, and numerous other applications sharing a single brand name with no coherent thread connecting them. Microsoft's belated move to build its own frontier models is an implicit admission that the OpenAI partnership is insufficient. But it arrives at a moment when Anthropic has seized the enterprise lead, Google has leapfrogged on local models, and the organisational capacity that failed to capitalise on its own BitNet research is being asked to execute a from-scratch model development programme under competitive pressure.
The cultural adoption curve reaches new phases
The cultural signals this week traced adoption moving from the bargaining phase through to institutional gatekeeping — and, in some cases, toward resolution.
The r/programming subreddit banned all LLM content — one of the largest programming communities on Reddit formally excluding discussion of the technology most consequentially reshaping programming. The Kuhnian pattern is textbook: when the anomalies become too numerous to contest individually, the paradigm's defenders gatekeep the discourse itself. The structural prediction is binary: either the ban collapses under its own irrelevance, or the community marginalises itself. Either result confirms the thesis.
A Bluesky outage was erroneously attributed to "vibe coding" when the actual cause was an Anthropic outage. "Vibe coding" has become a culturally legible term of abuse, which means the anxiety it names has reached general circulation. The label is being applied faster than the phenomenon warrants — characteristic of the anger and bargaining phases.
Gartner's 20% AI failure rate is being framed as evidence against AI investment. By historical standards, a one-in-five failure rate at this stage of a transformative technology adoption cycle is remarkably good. Gartner's diagnosis — "AI that doesn't fit into the organisation's operations" — is actually a description of the organisation's failure to reconceive its operations, not the technology's failure to perform.
John and Mark's sustained conversation on the fate of design produced a precise formulation. Mark: "Designers are going to take this very hard. Harder than the programmers." John's framework: "Taste, like strategy, matters when execution costs are high, coordination costs are high, and time to market is long. When you can have countless attempts at solving a problem, fail quickly, iterate quickly, essentially every single instance of a product is part of an A/B test." Taste is a heuristic for navigating scarcity. When iteration becomes near-free, fitness-landscape selection at scale outcompetes curation. Designers built their professional identity on taste being both unteachable and necessary. The grief cycle dynamics differ from programmers: further from the epicentre in temporal terms but more existentially exposed because the identity investment is deeper.
And Epoch AI's survey — half of employed AI users now use it for work — drew a methodologically pointed response from John: the surveys are measuring Copilot-era tool use, not the agentic workflows that represent the actual transformation. The practitioner community and the survey community are describing different phenomena. The revolution the surveys are missing is the one happening in practice.
What we're watching next week
Five specific things.
Whether Anthropic releases Mythos commercially or restricts it to Glasswing partners only. The deployment decision determines whether frontier offensive capability becomes generally available or remains within a controlled trust perimeter — and sets precedent for every future emergent-capability disclosure.
Whether GLM-5.1's "improves with runtime" claim holds under independent evaluation. If validated, this resolves one of the core open problems in agent deployment and shifts the competitive landscape materially toward open-weight models for long-horizon tasks.
How Blackwell delivery timelines move in response to HBM4 validation delays. Hyperscaler AI buildout schedules depend on Nvidia's supply chain, and the memory feedback loop may push deployment timelines further right than current forecasts suggest.
Whether Cloudflare's agentic web metering generates meaningful early adoption or encounters the adversarial evasion dynamics Mark predicted. The toll-road model for agent traffic is either a trillion-dollar infrastructure position or a cat-and-mouse game with diminishing returns.
How enterprise procurement responds to the AI agent liability vacuum exposed by The Register's analysis. Six major SaaS vendors unable to answer who bears the risk when their AI agents cause harm is not a legal footnote — it is a deployment blocker for regulated industries.
The capability frontier moved faster this week than any week since NOOPS began tracking. The infrastructure layer tightened. The open-weight tier crossed the watershed. The agent runtime stopped being a research category and became a contested market. And the institutions charged with evaluating, governing, and deploying these systems fell further behind. The durable positions remain unchanged: infrastructure that cannot be replicated, harnesses that encode genuine expertise, and the supervision capacity that sits between capability and consequence. Long infra. Long harness. Short road.
John Allsopp & Mark Pesce — Sydney, 9 April 2026