Stochastic Canaries in the Coalmine
Without good information, it’s hard to prepare for the future.
How good are LLMs right now at historical research? That’s a surprisingly difficult question to answer. The benchmarks we’ve used for the past couple of years are getting less meaningful as models all become A+ students in science, math, and coding. At the same time, we don’t have a quantifiable way to measure how they perform in the social sciences and humanities on core tasks like qualitative analysis, argument, use of evidence, and prose style: tasks that don’t have clear right and wrong answers.
This is one of the reasons the discourse around AI is becoming so disorienting and polarized. Some pundits, academics, and many journalists want to read plateauing benchmarks as a sign that things are slowing down and that AI is a bubble. Much of this is wishful thinking. On the other side, though, the accelerationist fanboys on X are equally disconnected from reality, often naively arguing that we’re about to achieve a form of Artificial General Intelligence (AGI) and that it will usher in a true utopia.
A Changing Conversation
What’s becoming clear, though, is that at least the first part of the accelerationist argument is almost certainly closer to the truth than the idea that AI is a bubble. Many thoughtful, knowledgeable people have started to think that something transformational is happening. New York Times columnist and podcaster Ezra Klein recently summarized this view in a piece titled “The Government Knows AGI is Coming.” As he writes, most of the serious people working in the AI labs and US government agencies believe that something like AGI is imminent. “They believe it because of the products they’re releasing right now and what they’re seeing inside the places they work,” Klein writes. “And I think they’re right. If you’ve been telling yourself this isn’t coming, I really think you need to question that.”
I couldn’t agree more. For many of us who’ve been nervously watching from the sidelines, Klein included, the turning point came with Deep Research. As I wrote a few weeks ago, this is an AI system, not just a model. It’s capable of doing exceptional things because it combines a new and powerful LLM trained to reason with the tooling that allows it to access the information and materials necessary to do complex intellectual tasks. We are, in effect, unleashing highly capable models with better tooling: reasoning introduced by scaling test-time compute, plus the ability to call tools and read files. The argument is that there’s now a clear road to AI systems that are better than most humans at most tasks one could do on a computer. Whether we call that AGI or not doesn’t really matter.
What evidence leads people to make such claims? Given how hard it’s getting to properly benchmark AI systems—and how quickly things are moving—a lot of the discussion is based on vibes. What I want to do is provide a couple of concrete examples that have convinced me that we’re crossing an important threshold.
The Paradox of Saturated Benchmarks
One of the problems we face in cutting through the noise is that the models have effectively saturated standard AI benchmarks, meaning they score so high that comparisons between them cease to be useful. LLMs are, in effect, outpacing our ability to accurately measure their capabilities. In practical terms, this presents a paradox: as models get better and the number of tasks at which they fail decreases, the number of users who can benefit from subsequent improvements diminishes.
In effect, we’ve already passed the point where, for many use cases, the existing base models are good enough and cannot get meaningfully better. For example, a recent study by Anthropic found that 8-12% of all workplace queries fielded by its Claude models were related to office and administrative support. But if Claude can already extract all the names and phone numbers from a series of documents and put them into a table, and does so correctly, a bigger model can’t do it better. Unless you are mainly using the latest model to do something the previous version couldn’t do, you’re unlikely to notice an upgrade.
My intuition is that the newer models are actually much better than the benchmarks suggest; it’s just getting harder to quantify their capabilities. Take historical handwritten text recognition (HTR) as an example. On the surface, deciphering handwriting appears to be a relatively straightforward vision-based task, complicated only by the fact that handwriting styles vary enormously from one person to the next. But vision only gets you so far with automatic HTR, specifically to around 80 or 90% accuracy. In the world of HTR, this means word error rates (WERs) of 10-20% and character error rates (CERs) of 5-15%.
Even with perfect vision, human or machine, there’s still a lot of ambiguity in handwritten texts. In the English and French language documents I work on, capitalization, punctuation, ink smudges, deletions, insertions, spelling differences, and grammar all complicate the process. That’s where reasoning and cultural knowledge come in: you need to know something about the subject and be able to make connections between words, characters, and phrases across time and space. It’s why we can’t decipher texts devoid of the context in which they were created. Even then, some things are still a matter of interpretation and some scrawl just can’t be deciphered. All this means that while WERs of 4-10% are the norm for humans, 1-2% is probably the floor.
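For readers unfamiliar with these metrics, WER and CER are simply edit (Levenshtein) distances normalized by the length of a reference transcription. Here is a minimal sketch of how they are typically computed; the sample strings are invented for illustration, and real evaluations usually add normalization steps for case, punctuation, and historical spelling.

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn ref into hyp (classic dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: edit distance over words, divided by reference length."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: edit distance over characters, divided by reference length."""
    return levenshtein(list(reference), list(hypothesis)) / len(reference)

# Toy example (invented text): one substituted word and one dropped word.
ref = "the said lease shall continue for the term of four years"
hyp = "the said lease shall continue for term of two years"
print(f"WER: {wer(ref, hyp):.1%}  CER: {cer(ref, hyp):.1%}")
```

In these terms, 90% character accuracy is just another way of saying a CER of 10%.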
Last fall, a team led by Dr. Lianne Leddy and me found that LLMs were beginning to approach human levels of accuracy. Since then, Google, Anthropic, and OpenAI have all released new versions of their models. Both the Google and OpenAI models represent a generational shift in size, while the Anthropic model is more of an incremental change. As you’ll see in the chart below, there have been dramatic improvements in accuracy for the two next-generation models, with error rates dropping by an average of 50%. Moreover, Gemini 2.0 Pro, which has not received the attention it deserves as a model, is now scoring nearly three times better than Transkribus. If you ignore errors in punctuation, capitalization, and historical spelling corrections, Gemini Pro correctly transcribes 97% of characters and around 95% of words. GPT-4.5 is not far off. These are median human levels of accuracy.
A key point is that while a 50% reduction in error rates is significant, it represents a much smaller jump in overall accuracy, from about 87% to 95%. Yet what these numbers obscure is that somewhere in between, we crossed a significant qualitative threshold.
When you compare transcriptions from GPT-4o and GPT-4.5 against the original document, both texts are readable, generally correct, and free of obvious OCR-type errors. Yet some of the GPT-4o errors are significant in that they distort the meaning of the text to the point that you can’t really trust the transcript at all. For example, in one of our tests we asked GPT-4o to transcribe a page from a 19th-century lease. While many of the errors are “minor” changes in punctuation, capitalization, and formatting, it wrongly writes that the lease was for two years rather than four; misses the inclusion of a garden; and hallucinates a short bit of text in place of the clause making the document binding on the lessee’s heirs. Those are all essential elements of the contract, and getting them wrong is fatal.
At first glance, the results from GPT-4.5 look much the same. But within the roughly eight points of accuracy GPT-4.5 gained, it gets all the things right that its predecessor got wrong. Its biggest mistake is that it misread “buildings” as “outhouses”, but in the context of the contract the two terms are synonymous. This is an important qualitative shift in HTR reliability, but I don’t think it comes just from better vision. I think the model is starting to read documents in a more complex, human-like way. How we would benchmark that, I don’t know.
Do Big Models Have a Smell?
My intuition is that the same changes that improved GPT-4.5’s handwriting recognition signal a broader shift in the complexity of the model’s overall capabilities. Where this becomes most evident to me is in the model’s ability to work with, reason through, and write about historical documents. Here benchmarking is a familiar problem for historians, because it’s a lot like grading papers.
I’ve been teaching at the post-secondary level for nearly 20 years, and while I think I’ve developed a pretty good feel for the difference between an A+ and an A- paper, it’s not always an easy distinction to explain to students. To my mind, an A- paper will typically contain most (if not all) of the elements of an A+ paper, but it lacks something important, mainly in the execution. While the A- paper might have a few more technical errors than an A+, that’s usually a symptom of the fact that, on the whole, it’s not as polished. Rare is the typo-filled A+ paper with a cogent, engaging thesis. In an A- paper, the argument might not be as sophisticated. The evidence might need more unpacking. It’s just not as well written. Accuracy is part of it, but it’s a symptom of something larger.
Recently, I’ve started to see these same kinds of qualitative differences emerge in the outputs of GPT-4.5 and Sonnet-3.7. One of my oldest LLM tests is to give a model one of my first-year assignments, a historical sandbox exercise on Britain’s surrender of St. John’s, Newfoundland to the French in June 1762. I use this obscure event because it’s relatively unknown and only six surviving primary sources describe it. These include perspectives from French and British officers, enlisted men, townsfolk, merchants, and sailors, and, most importantly, they conflict with one another in delightful ways. Students (or the LLM) are then asked to read three secondary sources for context and to develop a “theory of the crime” that fits the conflicting evidence. It’s an engaging assignment with no single right answer.
Writing From Primary Documents
Back in the winter of 2023, GPT-3.5 could summarize the contents of the documents but lacked a coherent argument. It also didn’t address the conflicts between the documents very well, giving a C+ to B- range answer. In spring 2024, GPT-4o did much better. The response was 1,500 words, but the first paragraph gives an idea:
“The surrender of the garrison at Fort William in St. John's, Newfoundland, on 28 June 1762, is a complex historical event that can be understood through a nuanced reading of the available primary and secondary sources. These documents provide various perspectives, from the ordinary soldiers and garrison officers to the French invaders and contemporary newspapers, each with its own interpretation of events. By examining these sources, we can construct an internally coherent narrative that explains why the British garrison surrendered to the French forces.”
That is the beginning of a decent but ho-hum, unremarkable answer, an “all of the above” sort of argument. It would typically receive a B to B+, maybe an A- depending on execution.
As with handwriting, next-generation models have changed the game in subtle but important ways. They are more willing to make clear arguments, to cite specific examples, and to use evidence in complex ways. Here is GPT-4.5’s opening paragraph with the same prompt:
“The surrender of the garrison of Fort William at St. John's, Newfoundland, on 28 June 1762, was the result of multiple factors that converged to create a situation where resistance was untenable from the perspective of the fort’s commander, Captain Walter Ross. A close analysis of primary and secondary sources reveals the fragility of British authority in Newfoundland, the strained economic and social conditions of the local population, strategic misjudgments by the French commander Ternay, and a lack of decisive leadership within the British garrison. Furthermore, the varied accounts from soldiers, officers, French commanders, and newspapers underscore the complexity of historical interpretation, highlighting the historian's task in critically assessing and weighing conflicting evidence to form a coherent and nuanced argument.”
That is the beginning of a solid A answer. GPT-4.5 acknowledges the complexity of the situation but makes a clear argument: Ross decided to surrender because he came to believe the situation was untenable. In contrast, the first example from GPT-4o is less complex, more ambiguous, and lacks a clear thesis. It also doesn’t engage with an important but understated part of the assignment: the overall point is to get students to understand that historians need to learn to weigh conflicting evidence. GPT-4.5 clearly gets this.
After trying this out, I suspected that the assignment itself might be constraining the model. After all, there are only so many ways to write a 1,500-word analysis while still meeting all the requirements. So I tried something different. When you make the assignment more open-ended, bigger models show a remarkable shift that we don’t see from smaller ones. If you ask a next-generation model to use the sources provided to write a chapter of a book on Newfoundland in the Seven Years’ War, you get something more interesting. Here are the opening paragraphs from Sonnet-3.7’s take on that task:
“On the morning of June 24, 1762, four French warships—Le Robuste (74 guns), L'Éveillé (64 guns), La Licorne (26 guns), and La Garonne (26 guns)—anchored under English colors in the sheltered waters of Bay Bulls, about thirty-five kilometers southeast of St. John's. This small fleet, commanded by the Chevalier de Ternay d'Arsac, had managed a daring journey across the Atlantic, evading the Royal Navy blockades that had kept most French naval forces confined to port. As their anchors dropped into the seabed, the French colors suddenly replaced the deceptive English flag, revealing their true identity to the stunned residents of the small fishing community. Soon, landing craft began lowering into the calm waters. Eight hundred and seventy French regular troops under the command of the Comte d'Haussonville, mostly grenadiers from five regiments, began disembarking onto Newfoundland's shores.
The French had arrived at a crucial moment in the Seven Years' War. After years of defeats that had cost France most of its colonial empire, the duc de Choiseul, French Minister of Marine, had devised an audacious strategy to strike at the heart of British economic power: the Newfoundland fishery. By attacking this vital industry, France hoped to gain leverage in ongoing peace negotiations while demonstrating that, despite its naval setbacks, it could still project power across the Atlantic. What initially appeared to be a routine summer day on the easternmost edge of Britain's North American possessions was about to become a pivotal moment in the long struggle for control of Newfoundland.
The fall of St. John's in June 1762 revealed both the strategic importance of Newfoundland in the Atlantic world and the fragility of imperial claims in contested maritime spaces. The ease with which the French seized the main British settlement exposed not only military unpreparedness but also the complex social reality of Newfoundland—a place where economic interests often outweighed imperial loyalty, where ethnic and religious divisions created opportunities for exploitation, and where authority rested as much with merchant capital as with formal imperial structures. The story of St. John's surrender illuminates how Newfoundland was never simply a British colony but rather a "space of power" where influence was contested and where multiple European and indigenous interests overlapped and competed.”
If those paragraphs crossed my desk in peer review, I don’t think I would have any inkling that they were written by an LLM. What is most striking, though, is the way Sonnet used facts drawn from a number of different documents to construct a coherent narrative anecdote that successfully foreshadows the nuanced argument which follows. This is not an easy skill to master, nor one that is easy to teach. It’s also not one that previous models show. The answers that GPT-4o gives on the assignment version of the question and the book version are remarkably similar.
Conclusion
When I worry that I might be part of the last generation of historians, I don’t mean to imply that humans will cease to write or care about history. What I mean is that I think I am part of the last generation for which history will be a uniquely human endeavor. As machines start to write history alongside us, for cash-strapped museums, local historical societies, archives, genealogists, companies, and legal firms, there will be an important shift not only in how we do history, but in how history and human historians are perceived by others.
To be clear, I think we will continue to bring something to the table that machines lack, especially in areas where cultural context and perspective are vitally important. But we need to start thinking about how we are going to co-exist with AI historians and how we can use the tools AI systems provide to our advantage. Otherwise we risk being labelled as irrelevant. That may not matter to some, but it’s something we need to take seriously as historians and scholars.