The Agents Are Waking Up
The Intelligence Revolution that swept through the software industry this past winter is coming to knowledge-work next.
Over the past few months, I’ve found it’s getting harder and harder to write about AI in an intelligible yet useful way. There’s always been a knowledge overhang between the median experience of most historians and the AI frontier, but it’s become a chasm and I don’t think most people know what to believe or think anymore. Nevertheless, a consensus has formed in the AI community that we’ve crossed an important threshold beyond which everything will change. My sense is that if people have heard this, they’ve probably dismissed it as hype.
What worries me, though, is that the question of whether the hype is real or not is becoming all the more inscrutable at exactly the moment it’s becoming most consequential. So in this piece, I want to talk about why that knowledge gap exists, why it’s growing, and why a lack of access to and experience with genuine frontier AI capabilities is going to make it all the harder for people to form intuitions about the future: not a distant, hypothetical future any more, but one that unfolds in knowledge-work over the coming months. My intention is then to build on this over the coming weeks and months to talk more regularly about how this is likely to play out for historians and the humanities more broadly.
A Very Different Kind of Tech
To start, I think it’s important to acknowledge that there’s good reason for smart people to be skeptical. AI has historically been associated with grandiose claims, most of which have not panned out. At the same time, recent AI developments run against the grain of the intuitions most people form about how new technologies evolve and change. As a society, we just aren’t used to things moving this rapidly. For example, it took me the better part of two decades to go from my first camera phone to not having a separate digital SLR camera. But in only two years, LLMs have gone from being incapable of solving grade school math problems to providing original proofs for genuinely unsolved frontier math problems that only professional mathematicians can understand. In all fairness, that’s not something an adult should intuit from their experience with other technologies.
But as the pace of change has accelerated, the frontier is also becoming less accessible, both in terms of the ability of non-experts to experience the effects of those changes and ease of access. The former is simply an issue of use-case misfit: unless you are a mathematician, you probably can’t tell if a novel math proof is correct or not. The latter is a UI and training problem. This may surprise some people in the tech community who feel that they’ve done a lot of hard work to counteract accessibility issues, so let me explain.
In early 2023, when I started this blog, the main issue was that many people hadn’t tried ChatGPT, so the solution was relatively simple: open a free account and try it. It got more difficult as the best models were paywalled and more difficult still with the introduction of reasoning models, then model switchers, and then tiered use plans. All of this made it harder to get people using the same models and setup. The effect has been that most people I know are still forming their intuitions about LLMs based on far less capable versions of the technology than what is accessible at the frontier. But what they read on X, Substack, or in the media describes something that sounds like the same product when it’s actually based on something fundamentally different. What do you mean ChatGPT solved an Erdős problem that’s stumped the best mathematicians for sixty years? Just this morning I asked it to give me ten sources on Canadian confederation and one of them didn’t exist! It just doesn’t make sense. Both might involve a product called ChatGPT, but the first requires a $200 USD subscription and access to Pro mode (not to be confused with a Pro subscription) while the second came from the free version. They simply have nothing to do with one another even if the packaging looks the same.
In the face of these sorts of criticisms, people in the tech industry tend to point to things like Google’s AI Studio, which makes building apps from scratch about as easy as it can possibly get. But here the point of comparison for the median user really matters. AI Studio might have revolutionized accessibility in that you don’t need to know how to write code or even install Python to use it to build an app, but my own sense is that most people outside of tech don’t know where to start. When you’re an expert, it’s easy to lose perspective about the median level of experience and knowledge about your field.
Agents and Harnesses
Even here it gets tricky to explain, because there is a subtle but key shift implied above: to experience LLMs at the frontier, you now need to not only be using a really expensive model but must also have it strapped into a specialized agentic harness.
An AI agent is an LLM capable of taking action in the world, working autonomously to accomplish a user-set goal using a set of bespoke tools. In essence, while a chatbot answers questions one at a time in a back and forth conversation about a topic, users delegate actual tasks to an agent. The agent plans a sequence of steps, takes actions, looks at the results, revises its approach based on new information, and then takes some more actions. The loop continues—sometimes for hours on end at a cost of hundreds of dollars in tokens—until the agent decides it’s done and reports back to the user. If you’ve ever tried the Deep Research functions in ChatGPT or Gemini or NotebookLM, you’ve encountered an agentic workflow.
Agents don’t work in the abstract, though. They need what software developers have started to call a harness, that is, the scaffolding that actually lets the model see files, run code, browse the web, or retrieve information from a database. This is something that has to be purpose built for an LLM and that is a new and far from solved problem. It is the harness that turns a model into an agent.
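For readers who want to see the shape of this, here is a minimal sketch of the loop a harness runs, written in Python. Everything in it is an assumption made for illustration: `scripted_model` is a hard-coded stand-in for a real LLM call, and the two tools and the in-memory `FILES` dictionary are invented. It is not how Cursor, Claude Code, or any other product is actually built; it only shows the plan, act, observe, repeat cycle described above.

```python
# Hypothetical in-memory "folder" the agent is allowed to see.
FILES = {"notes.txt": "Confederation, 1867", "todo.txt": "mark essays"}


def list_files(_: str) -> str:
    """Tool: return the visible file names as a comma-separated string."""
    return ",".join(sorted(FILES))


def read_file(name: str) -> str:
    """Tool: return a file's contents, or an error string the model can react to."""
    return FILES.get(name, f"error: no such file {name}")


TOOLS = {"list_files": list_files, "read_file": read_file}


def scripted_model(goal: str, history: list) -> dict:
    """Stand-in for an LLM (assumption): looks at what has happened so far and
    picks the next action. A real harness would call a model API here."""
    if not history:
        return {"tool": "list_files", "arg": ""}
    last_tool, last_result = history[-1]
    if last_tool == "list_files":
        return {"tool": "read_file", "arg": last_result.split(",")[0]}
    return {"tool": "done", "arg": f"Summary of notes: {last_result}"}


def run_agent(goal: str, model, tools: dict, max_steps: int = 10) -> str:
    """The harness loop: ask the model for an action, run it, record the
    result, and repeat until the model says it is done or the budget runs out."""
    history = []
    for _ in range(max_steps):
        action = model(goal, history)
        if action["tool"] == "done":
            return action["arg"]
        tool = tools.get(action["tool"])
        result = tool(action["arg"]) if tool else f"error: unknown tool {action['tool']}"
        history.append((action["tool"], result))
    return "stopped: step budget exhausted"


print(run_agent("summarize my notes", scripted_model, TOOLS))
```

The step budget (`max_steps`) is one simple way to keep a loop like this from running for hours; real harnesses presumably use more sophisticated limits.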
Cursor and Claude Code are the best-known harnesses for software development: the first is a code editor, the second a terminal application (one of those scary looking black DOS screens). Both of them give highly capable frontier models carefully controlled access to a codebase—that is, all the various files one creates to build a program as simple as tic-tac-toe or as complicated as a web browser—so that it can make changes, run tests, and debug its own work.
Tool calling is a strange and seemingly magical thing to watch. Think of tools as small software programs that let the model do things you would normally do with your mouse and keyboard. The key point is that it’s the LLM itself that decides—on its own—what tools to use, in what order, and how to use them. Tools can be prewritten or, as is increasingly the case, they can be created by the LLM itself on the fly to solve problems. If that sounds like science fiction, it once was, but it’s here now and it actually works.
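To make the distinction between prewritten and on-the-fly tools concrete, here is a small Python sketch. The JSON message format, the `handle` function, and both tools are hypothetical, made up for this example rather than taken from any real product. The two JSON strings at the bottom stand in for what a model might emit; in a real harness they would come from the model rather than being typed out by hand.

```python
import json


def word_count(text: str) -> int:
    """A prewritten tool shipped with the (hypothetical) harness."""
    return len(text.split())


# Registry of available tools: name -> (description shown to the model, function).
registry = {"word_count": ("Count the words in a text", word_count)}


def describe_tools(registry: dict) -> str:
    """The menu a harness might show the model so it knows what it may call."""
    return "\n".join(f"{name}: {desc}" for name, (desc, _) in sorted(registry.items()))


def handle(message: str, registry: dict):
    """Dispatch one model message (JSON in a format invented for this sketch)."""
    msg = json.loads(message)
    if msg["type"] == "define_tool":
        # The model has written a new tool as source code. Running model-written
        # code is risky: limiting the builtins like this is NOT a real sandbox,
        # and an actual harness would need proper isolation.
        env = {"__builtins__": {"len": len, "sorted": sorted, "set": set}}
        exec(msg["source"], env)
        registry[msg["name"]] = (msg["description"], env[msg["name"]])
        return f"registered {msg['name']}"
    if msg["type"] == "call_tool":
        _, fn = registry[msg["name"]]
        return fn(*msg["args"])
    raise ValueError(f"unknown message type: {msg['type']}")


# Stand-ins for model output: first define a new tool, then call it.
define = json.dumps({
    "type": "define_tool",
    "name": "unique_words",
    "description": "List the distinct words in a text",
    "source": "def unique_words(text):\n    return sorted(set(text.lower().split()))",
})
call = json.dumps({"type": "call_tool", "name": "unique_words", "args": ["The cat the hat"]})

print(handle(define, registry))
print(handle(call, registry))
print(describe_tools(registry))
```

The point of the registry is that the harness, not the model, decides what actually gets executed: the model only ever produces text, and the harness chooses whether to act on it.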
If you want to try this for yourself, go to Google’s AI Studio (it’s free to try) and click the build link on the left. Then give the agent an app to build. If you don’t know what to ask it to do, try asking it to build a playable chess game with a computer opponent. Basically, you can ask it to build anything and you aren’t going to hurt or break anything. And if something doesn’t work, explain the problem from a user’s point of view (“when I click a pawn and try to move it to another square, nothing happens”) and it will fix it. Sometimes this takes a couple of tries. Either way, you’ll watch an LLM write code to accomplish your task. That’s an easy way to watch an agent at work.
A Sudden Shift: Winter 2025-26
Cursor, Claude Code, Windsurf and OpenAI’s Codex are all special-purpose harnesses, built to largely automate one kind of knowledge-work: software development. Using them is very similar to giving written instructions to an unimaginative but technically proficient research assistant. These can be as simple as “build me an app that allows me to edit a PDF in my web browser” or “make me a program that will allow me to search all my PDFs by keyword at once.” The agent just figures it out.
These coding agents have been around for a while, but until December 2025 they were much better at augmenting existing expertise than they were at replacing it. This meant that amateur programmers like myself initially found them really useful, but often hit a wall where the model couldn’t do more complex things, at least not reliably or safely. Professional programmers, on the other hand, found them good for boilerplate things, but typically noted that they were unable to do the novel, secure programming they were paid to do.
Between December and February, the change came suddenly and was the result of several converging improvements. First, the newest models released before Christmas (Claude Opus 4.5 and GPT 5.3) were trained to be really good at knowing what tools to call and when and this made them much more efficient than their predecessors. Second, they were also much better at holding state, meaning they could work on the same task for longer without any degradation in performance. Third, these models were also much “smarter,” meaning that they were also better able to process large amounts of code and arrive at a working solution to a problem on the first try—even novel problems for really large and complex codebases that don’t appear to be well represented in the training data. When these things were put together with the right harness—meaning they were strapped into specially built scaffolding with all the right tools—the agents just woke up.
What followed has been called the SaaSpocalypse: since the New Year, most of the major Software as a Service (SaaS) companies have lost much of their value as investors fled. Why this happened—and whether it was rational—is debatable, but the most plausible answer (at least to me) is that investors began to question whether these companies had long term value in a world where anyone could build anything in a web browser with little expertise. In essence: if I can build a program in AI Studio that edits PDFs, do I need to buy an annual subscription to Adobe Acrobat?
How this all plays out still remains to be seen, but a few things are clear. First, as layoffs mount and hiring slows, software development is now a much more competitive field than it’s been in a long time. Whether AI is the cause or the excuse for the downturn is hotly debated, but in either case applications to computer science are now dropping quickly for the first time in decades at the best schools.
My own sense is that the downturn is real and is being caused both by panicked investors and by a genuine shift in assumptions about the probable future value of programming expertise. I have formed my own intuitions about this firsthand, as I’ve started to build my own highly complex SaaS apps (a soon to be released web version of Archive Studio for example). Quite simply, the walls I’d hit in the past crumbled this winter. My reasoning is simple: if a historian can learn to do this in his mid 40s…
Perhaps more significant are the capabilities I saw students wield in a third year Digital Humanities course I taught this past semester. In that class, students came from a range of backgrounds (history, business, computer science, communications, psychology) and then worked together in groups to build SaaS apps from scratch to solve a range of problems. The results were, quite simply, stunning. For context, they were only tasked with working on the project during class time, for roughly 1 to 1.5 hours per week for 12 weeks. By the first week of April, we had a working social networking site, two apps that could calculate calories and recipes from photographs of receipts and ingredients, a program that gamified everyday tasks with full Google Calendar and email integrations, and a student organizer that extracted to-do lists from unstructured data like course syllabi. None of these students had built anything like this before. That is great for productivity and software development in general, but I don’t see a way that the industry won’t change.
Towards General Agents for Knowledge-Work
General knowledge-work harnesses are, perhaps not surprisingly, further behind than those deployed in the software industry. But my sense is that this is going to play out in a similar way for knowledge work more broadly as purpose-built harnesses start to appear for various industries. One of the first general-purpose harnesses is Claude Cowork, which Anthropic began testing earlier this year. Cowork is a desktop application available for Mac and now PC that gives perhaps the best frontier model (Opus 4.6 and now 4.7) supervised access to a folder on your computer, a sandboxed shell, a web browser, and a library of pre-written and highly customizable “skills” for particular kinds of tasks. As with coding, you give it a goal in plain English and it goes off and does the work.
Claude Cowork can attempt pretty much anything you can do on a computer yourself. Whether you can or should (ethically or legally) do all these things with Cowork is a different question: the capability is there. This means that Cowork can open PowerPoint and create a presentation based on a research paper you provide. You can give it one of your existing presentations as a model and it can match your style exactly. It can search the web for open-source images and include citations. Unlike previous models, it does it well and consistently.
You can also use it to download hundreds of images from a website (picture the repetitive click-next-download-save tasks you used to do by hand, now automated) but more than that, you can ask it to “go through this digitized microfilm and save only the documents that mention X, using the archival citation as the filename.”
At the moment, it is far from perfect. It’s expensive and sometimes gets sidetracked into weird loops. On long tasks, it also tends to stop before the task is fully complete. But having experienced the agentic coding revolution firsthand, I can tell you that this is really familiar ground. In fact, the similarities are uncanny: we’re seeing the exact same issues of trust, reliability, and depth of capabilities play out again.
Deterministic vs Non-Deterministic Work
There are lots of questions about whether it will be as easy to optimize models for general knowledge-work as it was in coding. The basic difference is that coding is deterministic: for the most part, the code either runs or it doesn’t and you can use that as a signal in training to make new models iteratively better. Most knowledge-work is non-deterministic, meaning there may not be a single correct answer or the correct answer is unknowable without expert-human analysis.
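As a rough illustration of why "the code either runs or it doesn’t" makes such a convenient training signal, here is a Python sketch of an automatic pass/fail check. It is a toy under stated assumptions: the `reward` function, the `solve` naming convention, and the three candidate programs are all invented for the example, and this is not a description of how any lab actually trains its models.

```python
def reward(candidate_source: str, tests: list, func_name: str = "solve") -> float:
    """Return 1.0 if the candidate program defines `func_name` and passes every
    test case, else 0.0. No human judgment is needed at any point."""
    env = {}
    try:
        exec(candidate_source, env)  # a real pipeline would run this in a sandbox
        fn = env[func_name]
        passed = all(fn(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, crashes, and missing functions all count as failure.
        return 0.0
    return 1.0 if passed else 0.0


# Test cases for a trivial task: add two numbers.
TESTS = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

# Three hypothetical model attempts at the task.
correct = "def solve(a, b):\n    return a + b"
wrong = "def solve(a, b):\n    return a - b"
broken = "def solve(a, b) return a + b"  # does not even parse

for name, source in [("correct", correct), ("wrong", wrong), ("broken", broken)]:
    print(name, reward(source, TESTS))
```

Nothing comparable exists out of the box for, say, judging whether a historical argument is sound, which is the gap the rest of this section is about.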
My intuition is that it will be much easier than it might seem. First, given the trend lines, it just seems inevitable at this point. The models have all been improving steadily and predictably on all tasks over time; even if their abilities remain jagged, the curves are the same. More importantly, though, math probably provides a good analogy. Math is, of course, deterministic in that a proof either works or it doesn’t. But in practice, when we use models to create new proofs for unsolved math problems, as is regularly the case now, the system behaves like a non-deterministic system because only a human expert can actually vet the result to determine whether it is correct or not. It can’t self-verify.
My assumption here is that there are a lot of common, non-deterministic tasks in knowledge-work where it will actually prove to be a lot easier and cheaper to generate an effective signal than it is with frontier math.
Conclusion
To my mind, it is almost certain that over the next year, maybe slightly longer, we are going to see specialized harnesses developed for most areas of knowledge-work, some of them probably using the same agentic coding programs discussed above. These programs will let subject matter experts create apps that do exactly the things they need.
At the same time, models are also starting to effectively train themselves, what is called recursive self-improvement in the AI industry. The latest and most powerful Anthropic model, Claude Mythos, for example, was mostly coded by other Anthropic models and this, in turn, is not only speeding up the process of developing new models but making the training process more efficient. This will only accelerate from here.
As this starts to unfold, we need to have real conversations about whether we want to put hard and specific limits on some LLM use-cases and whether we want to reserve some forms of decision-making for humans alone. In other words, we need to decide what we are willing to automate, when we are willing to use AI to augment human intelligence, and when we don’t want to use AI at all. If the reports and rumours about Mythos are to be believed, these timelines may even need to be sped up. In either case, at some point soon the agents are going to wake up in knowledge-work too.



Mark, the chasm is real, and from an institutional vantage I would only sharpen one part of it. The chasm is growing fastest inside the organizations that most need to interpret what is happening. University leaders, and I say this as one of them, are making governance decisions calibrated to a version of the technology they encountered two model generations ago (if that!). The lag is now long enough to constitute a structural risk in its own right.
Your deterministic versus non-deterministic question is where I may push back a little. The task categories where agents have already crossed into knowledge work (LMS submissions, legal document review, sales outreach, customer support triage) share a condition that is easy to miss in the story about capability. Each had been templated, quietly, over years of interface design and workflow compression, to the point where an agent can now complete it because the deterministic-enough signal was already there. The template preceded the agent. What the automation did was just make visible a prior condition.
This has an implication for your prediction about the next year. If the pattern truly holds, the tasks most immediately vulnerable to agentic automation are the ones we had already rendered measurable and repeatable for reasons unrelated to AI, often to make them easier to manage, scale, or grade. What remains, and what now becomes precious, is the layer of knowledge work that resisted templating: situated judgment, institutional memory, the reading of texture that Polanyi called tacit and that most organizations discover they need only once the templated layer is gone. I have written about one version of this.
Oh, and on Mythos, a small note. It is plausible and indeed widely reported that Anthropic uses its own models heavily in internal engineering. Whether that constitutes recursive self-improvement (RSI) in the technical sense is a separate question, and Anthropic's own public position on Mythos Preview, as of earlier this month, is that they are less confident than they used to be that junior-researcher work is safe from automation but that the answer to the direct RSI question is still probably not. The difference between 'yes' and 'less confident than before' might change how one should read the tempo of what comes next.
Thank you for writing this. The chasm framing is one I had hoped to write next, as I see what my students can do with lower-tier models versus the $200/month ones. Interesting times ahead all in all.
Thanks, Mark. It will be incredibly useful to have this post in my pocket when explaining these changes to normie colleagues.