The Math Says Jack Clark Might Be Right
A Technocrat’s Discernment
Jack Clark said something in September 2025 that most people either celebrated or dismissed, but very few actually sat down and did the math on. He said he continues to believe that the sort of powerful AI system described in Dario Amodei’s “Machines of Loving Grace” essay will be buildable by the end of 2026, with many copies running in 2027. He doubled down on this at a Congressional hearing, too. End of 2026. Truly transformative technology.
I have spent years building GPU and generative AI infrastructure at Oracle Cloud Infrastructure. I have watched training clusters scale from curiosity projects to multi-billion dollar deployments. I have seen what happens when you give smart people enough FLOPS and enough will. And honestly, when I first read Clark’s prediction, my instinct was skepticism. Not because the goal seemed impossible in some abstract sense, but because the gap between “buildable” and “deployed at scale” is exactly the kind of gap that enterprise infrastructure people like me know intimately. It is the gap where ambition goes to negotiate with thermodynamics.
But then I did the math. And the math is interesting.
What “Powerful AI” Actually Means Here
Let us be precise about what Amodei described in that October 2024 essay, because precision matters. He was not talking about chatbots that write better emails. He was describing AI systems with intellectual capabilities matching or exceeding those of Nobel Prize winners across most disciplines. Biology, computer science, mathematics, engineering. Anthropic’s internal shorthand for this is a “country of geniuses in a datacenter.” That is a vivid phrase but also a measurably specific one, and that specificity is what makes it amenable to analysis rather than speculation.
In their March 2025 submission to the OSTP, Anthropic formalized it further. These powerful AI systems would have the ability to work autonomously on tasks spanning hours, days, or weeks. They would not merely answer questions. They would execute research programs. The distinction is critical. We are talking about systems that can hold coherent intent across extended time horizons, decompose complex problems, and iterate toward solutions without constant human intervention.
Now here is where it gets interesting from a measurement standpoint.
The Time Horizon Exponential
METR, the Model Evaluation and Threat Research nonprofit in Berkeley, has been tracking something they call the “task-completion time horizon” of frontier AI agents. The definition is elegantly simple. You take a set of software engineering, ML engineering, and cybersecurity tasks. You measure how long each task takes a skilled human professional to complete. Then you test whether the AI agent can complete the task with at least 50% reliability. The time horizon is the human-equivalent duration at which the agent hits that 50% success threshold.
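For readers who want the mechanics, here is a minimal sketch of that estimation, assuming the fit is a logistic model of success probability against log task duration (which is the general shape of METR's methodology, though the invented data and fitting details below are mine):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative (invented) data: human-equivalent task lengths in minutes,
# and whether the agent completed each task with success (1) or failure (0).
task_minutes = np.array([2, 5, 10, 30, 60, 120, 240, 480, 960])
agent_success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0])

# Fit success probability as a logistic function of log task duration.
X = np.log(task_minutes).reshape(-1, 1)
model = LogisticRegression().fit(X, agent_success)

# The 50% time horizon is where the logistic crosses p = 0.5,
# i.e. where coef * log(t) + intercept = 0.
log_t50 = -model.intercept_[0] / model.coef_[0][0]
print(f"50% time horizon ≈ {np.exp(log_t50):.0f} minutes")
```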
The original METR paper from March 2025 showed a doubling time of roughly 7 months over the period from 2019 to 2025. By the time they updated to Time Horizon 1.1 in January 2026 with a larger task suite (228 tasks, up from 170), the picture had sharpened. The doubling time from 2024 onward had compressed to approximately 89 days.
Let me put numbers on this. Claude 3.7 Sonnet had a time horizon of about 1 hour. Claude Opus 4.5, released late November 2025, jumped to roughly 5 hours. Claude Opus 4.6, released in February 2026, registered a 50% time horizon of approximately 12 to 14.5 hours depending on your modeling assumptions. METR themselves note that the confidence intervals remain wide, and that the estimate is sensitive to task composition, but the trend line is doing something that cannot be ignored.
If we model this as a straightforward exponential with the 89-day doubling time, the math becomes almost unsettlingly clean.
Starting from a 12-hour time horizon in February 2026, the projection for end of 2026 (roughly 10 months later, call it 300 days) gives us:
T(end of 2026) = 12 × 2^(300/89)
That exponent is 300/89 ≈ 3.37, so:
T ≈ 12 × 2^3.37 ≈ 12 × 10.3 ≈ 124 hours
One hundred and twenty-four hours. That is a little over five full days of continuous human-equivalent work. A workweek.
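The same extrapolation in a few lines of Python, using only the figures quoted above:

```python
# Inputs from the discussion above.
horizon_hours = 12.0   # Opus 4.6's 50% time horizon, February 2026
doubling_days = 89.0   # METR's estimated doubling time from 2024 onward
elapsed_days = 300.0   # roughly February to end of December 2026

projected = horizon_hours * 2 ** (elapsed_days / doubling_days)
print(f"Projected 50% time horizon: {projected:.0f} hours "
      f"(~{projected / 24:.1f} days of continuous work)")
# Projected 50% time horizon: 124 hours (~5.2 days of continuous work)
```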
Ajeya Cotra, who now works at METR and is one of the most respected AI forecasters alive, published a correction in March 2026 acknowledging that she had underestimated the pace of progress. Her January prediction for end-of-year 2026 was a 50% time horizon of about 24 hours. Two months later, Opus 4.6 was already at 12 hours. She noted something important, though, and this is where the analysis gets philosophically interesting rather than just arithmetically satisfying. As time horizons push past several dozen hours, the nature of the tasks changes qualitatively. Long tasks are inherently more decomposable than short ones. A one-hour debugging task is monolithic. You cannot meaningfully parallelize it. A one-week development project is naturally modular. You can assign subtasks to parallel workers.
This means the effective capability of a system operating at 100+ hour time horizons may actually grow super-exponentially in practice. A management-layer AI that can decompose work and assign it to execution-layer AI instances can, in theory, tackle projects of arbitrary scale. The bottleneck shifts from raw capability to coordination overhead.
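A toy model of my own making (not anything METR publishes) illustrates why coordination becomes the binding constraint. Suppose a manager instance decomposes a project across n worker instances, but each additional worker imposes a fixed coordination tax on the whole project; effective throughput then saturates rather than scaling linearly:

```python
def effective_throughput(n_workers: int, coordination_tax: float = 0.05) -> float:
    """Toy model: n parallel workers, where each additional worker costs a
    fraction of total output in handoffs and context-sharing. Throughput
    saturates near 1/coordination_tax as overhead swamps parallelism."""
    return n_workers / (1.0 + coordination_tax * (n_workers - 1))

for n in (1, 4, 16, 64, 256):
    print(f"{n:4d} workers -> {effective_throughput(n):5.1f}x effective output")
```

The interesting lever is the tax itself: if better management-layer models drive the coordination cost down, the saturation ceiling rises, which is one way the super-exponential story could play out in practice.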
The Compute Story
Let me tell you what I see from the infrastructure side, because the demand signals are as telling as the benchmark results.
Anthropic announced a cloud partnership with Google in October 2025, giving it access to up to one million of Google’s custom TPUs, projected to bring more than one gigawatt of compute capacity online by 2026. One month later, Nvidia and Microsoft announced they were investing up to $15 billion in Anthropic, with Anthropic committing to purchase $30 billion in Azure compute running on Nvidia hardware. These are not speculative bets. These are purchase orders. When someone places $30 billion in compute orders, they have a roadmap they believe in.
The effective compute available for frontier training runs has been growing at roughly 4x per year over the past several years, driven by a combination of hardware improvements (going from A100 to H100 to B200), larger cluster sizes, and improved training infrastructure. But compute alone does not tell the story. Algorithmic efficiency has been improving at a rate that is harder to quantify precisely but is conservatively estimated at 2x per year in terms of achieving equivalent performance with less compute. The two factors multiply.
So the effective capability per dollar of training compute is compounding at something like 8x per year. This is faster than Moore’s Law ever was, and it is being sustained by an industry pouring hundreds of billions of dollars into the problem simultaneously.
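The compounding is easy to check, treating the 4x and 2x figures above as given:

```python
hardware_growth = 4.0   # effective training compute growth per year (estimate above)
algorithmic_gain = 2.0  # equivalent-performance efficiency gain per year (estimate above)

annual = hardware_growth * algorithmic_gain  # ~8x effective capability per year
for years in (1, 2, 3):
    print(f"After {years} year(s): {annual ** years:,.0f}x")
# After 1 year(s): 8x; after 2: 64x; after 3: 512x
```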
The Scaling Law Argument
Here is where I want to bring in some actual mathematics, because I think the scaling law framework provides the most rigorous basis for Clark’s confidence.
The Chinchilla scaling results from Hoffmann et al. (2022) established that optimal training requires data to scale linearly with model parameters. The loss L of a language model scales approximately as:
L(N, D) ≈ A/N^α + B/D^β + L₀
Where N is the number of parameters, D is the number of training tokens, and the exponents α and β are empirically around 0.34 and 0.28 respectively, with A, B, and L₀ as fitted constants. The key insight is that this relationship has held remarkably consistently across several orders of magnitude of scale.
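Expressed as code, with the exponents quoted above and constants close to the Hoffmann et al. fitted values (which I am using illustratively; the precise numbers depend on which of the paper's fitting approaches you take):

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss L(N, D) = A/N^alpha + B/D^beta + L0.
    Constants approximate the Hoffmann et al. (2022) fits."""
    A, B, L0 = 406.4, 410.7, 1.69   # fitted constants (approximate)
    alpha, beta = 0.34, 0.28        # empirical scaling exponents
    return A / n_params**alpha + B / n_tokens**beta + L0

# Chinchilla itself: ~70B parameters trained on ~1.4T tokens.
print(f"L ≈ {chinchilla_loss(70e9, 1.4e12):.3f}")
```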
Now, loss is a somewhat abstract quantity. What matters for Clark’s prediction is not loss per se but downstream capability. And here the evidence suggests that capabilities emerge in relatively sharp phase transitions as loss decreases past critical thresholds. A model at loss level X might be utterly incapable of a task. A model at loss level X minus some small delta might succeed reliably. This is the phenomenon of emergent abilities, which has been both celebrated and criticized, but which remains empirically robust when you measure against the right benchmarks.
The implication is that you don’t need to predict the exact shape of the loss curve with infinite precision. You need to predict whether the loss will cross certain thresholds. And when those thresholds are close, even modest improvements in compute, data, and algorithmic efficiency can trip multiple capability gains in rapid succession.
If current frontier models are at a training compute level of roughly 10²⁶ FLOPS (a reasonable estimate for something like Opus 4.6 or GPT-5 class models), and if the next generation of training runs pushes to 10²⁷ or 10²⁸ FLOPS, the scaling laws predict a loss reduction of:
ΔL ≈ A × [(1/N₁^α) − (1/N₂^α)]
For a 10x increase in effective compute (split optimally between parameters and data per Chinchilla), this translates to roughly a 15–20% reduction in loss. That sounds small until you remember the phase transition point. We are very likely in a regime where multiple important capability thresholds are clustered within a narrow band of loss values. This is not wishful thinking. It is what the empirical record of the past three years has shown, where each successive 2–3x increase in effective training compute has unlocked what feels like a qualitative jump in agent behavior.
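Here is that calculation carried through under the Chinchilla-optimal split, where a 10x compute step raises N and D by roughly √10 each. The parameter and token counts below are hypothetical, and the resulting percentage is sensitive to the fitted constants and to whether you include the irreducible floor L₀, so treat it as order-of-magnitude:

```python
def reducible_loss(n_params: float, n_tokens: float) -> float:
    """The compute-dependent part of L(N, D), excluding the floor L0.
    Constants approximate the Hoffmann et al. (2022) fits."""
    A, B = 406.4, 410.7
    alpha, beta = 0.34, 0.28
    return A / n_params**alpha + B / n_tokens**beta

# Hypothetical frontier run; under Chinchilla, a 10x compute step
# scales parameters and tokens by ~sqrt(10) each.
n1, d1 = 1e12, 2e13          # illustrative parameter and token counts
scale = 10 ** 0.5
before = reducible_loss(n1, d1)
after = reducible_loss(n1 * scale, d1 * scale)
print(f"Reducible loss falls {1 - after / before:.0%} for a 10x compute step")
```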
The Engineering Multiplier
There is another factor that I think is underappreciated, and it is the one that most directly connects to my daily work. AI systems are already being used to accelerate AI development itself. Anthropic’s own system card for Opus 4.6 acknowledged that their models are beginning to accelerate the pace of internal research and engineering.
Let us model this conservatively. Suppose current AI tools provide a 1.5x engineering productivity multiplier for the researchers and engineers building the next generation of models. This is probably conservative given that several Anthropic employees have reported 20–40% speedups, and Claude is writing a significant fraction of committed code at major AI labs.
If each generation of AI provides a 1.5x multiplier to the development of the next generation, and generation cycles are roughly 6 months, then over two generations (one year) you get:
Effective engineering output = E₀ × 1.5 × 1.5 = E₀ × 2.25
That is more than a doubling of effective engineering capacity directed at AI development, on top of all the compute and algorithmic improvements. This creates a feedback loop that is mild by the standards of what some people worry about (true recursive self-improvement), but it is nevertheless real and measurable.
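Carried further under the same 1.5x-per-generation assumption, the loop compounds quickly:

```python
def engineering_multiplier(generations: int, per_gen_speedup: float = 1.5) -> float:
    """Cumulative engineering-output multiplier after `generations` cycles,
    if each generation speeds up development of the next."""
    return per_gen_speedup ** generations

# Roughly two generation cycles per year at ~6-month cadence.
for years in (1, 2, 3):
    print(f"Year {years}: {engineering_multiplier(2 * years):.2f}x effective output")
# Year 1: 2.25x, Year 2: 5.06x, Year 3: 11.39x
```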
The Counterarguments, Honestly Considered
I would be dishonest if I presented this as a certainty. There are legitimate reasons to doubt the timeline.
First, the METR time horizon metric is measured primarily on software engineering tasks. Software engineering is a domain where AI has shown the most dramatic improvements, partly because the reward signal is exceptionally clear (does the code pass the test suite or doesn’t it?) and partly because the training data is extraordinarily rich. Other domains relevant to “Nobel Prize winner” level performance, like experimental biology or materials science, involve physical world interaction, long feedback loops, and tacit knowledge that resists easy formalization. A 100-hour time horizon on software tasks does not automatically imply a 100-hour time horizon on wet lab biology.
Second, the METR plot is frequently misunderstood. The time horizon numbers represent how long a task takes a human, not how long the AI spends on it. Just because Opus 4.6 can handle a 12-hour human task does not mean it replaces 12 hours of human labor in practice. Real world work involves ambiguity, coordination, and judgment calls that benchmarks cannot fully capture. As METR’s own Thomas Kwa has emphasized, one should be cautious about inferring real world impact from benchmark performance alone.
Third, and this is the one that I find most compelling philosophically, there is a difference between “buildable” and “useful at the civilizational level Amodei describes.” Clark’s specific claim is about buildability, not deployment. A system that could in principle match Nobel laureate level reasoning across disciplines might still require enormous inference costs, might hallucinate in domain-specific ways that are dangerous in practice, might fail at tasks requiring genuine novelty as opposed to recombination of existing knowledge, and might not integrate into the physical world infrastructure needed to actually accelerate biology or materials science.
Why I Still Think Clark Is Probably Right
Despite all of that, I come down closer to Clark’s position than to the skeptics. Here is my reasoning, stated as plainly as I can.
The time horizon trend is not a single data point. It is a trend spanning seven years and dozens of model releases from multiple independent organizations. The doubling time has, if anything, accelerated rather than slowed. The compute investments are locked in and massive. The algorithmic improvements show no sign of plateauing. And the engineering feedback loop, where AI accelerates its own development, is already operational even if mild.
Clark’s claim is specifically about “buildable.” He is not saying that by December 2026, AI will have cured cancer and solved climate change. He is saying that the technical capability, the raw cognitive horsepower, will exist in a form that can be trained and deployed. The gap between “buildable” and “transformative at civilizational scale” might be years or decades. But the gap between today and “buildable” might genuinely be months.
I find it instructive that Ajeya Cotra, who is professionally incentivized toward careful calibration and who works at an organization (METR) specifically designed to provide sober assessments of AI capability, has now twice publicly admitted to underestimating the pace of progress. Her January 2026 predictions were already obsolete by March. Her earlier assessment of a 10% probability of full AI R&D automation by the end of 2026 struck many forecasters as high at the time, but after seeing Opus 4.6’s performance, she says it feels about right again.
When the most careful forecasters keep having to revise upward, that is itself a data point.
What This Means For Those of Us Building the Infrastructure
I will end with a note from my own domain. At OCI, I have watched the demand for GPU inference infrastructure grow from an interesting side business to a multi-billion dollar revenue stream in under three years. The customers I work with are not speculating about powerful AI arriving someday. They are building for it now. They are placing orders for inference capacity that only makes economic sense if the models running on that hardware are dramatically more capable than what exists today.
The infrastructure is being built. The compute is being purchased. The algorithms are improving. The engineering talent is focused. The feedback loops are engaged.
Whether Clark is right about the exact month or quarter is almost beside the point. The trajectory is clear enough that anyone in the infrastructure business who is not planning for it is making a serious strategic error. The math does not lie. It just tells you roughly where the curve goes. And right now, the curve says Jack Clark has the better end of this argument.
I have learned, over two decades in this industry, that exponentials are easy to dismiss right up until they are impossible to ignore. We may be closer to that transition point than most people realize.
Sanjay Basu is a computational scientist, technologist, and philosopher. He serves as Senior Director of GPU & Gen AI Solutions at Oracle Cloud Infrastructure and is the founder of Cloud Floaters Inc. He writes “A Technocrat’s Discernment” and can be found at https://sanjaysays.com
