Is this the Last Generation of Historians?
OpenAI’s Deep Research is a strange and shockingly powerful agentic AI research assistant that offers a clear glimpse of the future. But where do we fit in?
Last weekend, OpenAI released its latest reasoning model, Deep Research. Reasoners like o1, DeepSeek-R1, and o3-mini are different from conventional chatbots in that they spend time planning, exploring possible answers, and then refining their “thinking” before responding. The idea is to make their outputs more comprehensive and reliable.
Deep Research is strange and new. It’s agentic, meaning it can autonomously conduct its own research to answer high-level questions, and the results are often awe-inspiring. Most would pass muster in any historical research firm or PhD-level course. But right now, under specific circumstances, it can also hallucinate. My sense, though, is that its shortcomings are related to the way the model was launched, rather than any inherent limitations of the model itself. More on that in a bit.
Historians—and academics broadly—need to pay attention: Deep Research is the beginning of something new. For the last couple of years, people have been talking about a future in which AI will begin to do the things we value at an expert level. My intuition is that this is the actual beginning of that future. Until now, many of us have tried to defer the really difficult questions by ignoring AI or pretending the obvious isn’t happening. I don’t think that’s going to be possible any longer.
Deep Research
The most important thing to know about Deep Research is that it is one of the first agents, meaning it can autonomously go out and find reliable sources to solve a given research problem. OpenAI describes Deep Research as “powered by a version of the upcoming OpenAI o3 model that’s optimized for web browsing and data analysis, it leverages reasoning to search, interpret, and analyze massive amounts of text, images, and PDFs on the internet, pivoting as needed in reaction to information it encounters.” As Ethan Mollick writes, it’s “the end of search and the beginning of research.”
Another key difference from conventional models is that Deep Research is not designed for back-and-forth chat: it is a fire-and-forget model. Users begin by posing a research question or problem; the model then asks a few clarifying questions of its own before going off to do the actual research for anywhere between 5 and 30 minutes. When it’s done, it notifies the user that the answer is ready.
OpenAI says that Deep Research is trained to select solid, reputable sources, to evaluate them thoroughly, and to cite specific text (not just provide general links as other models do). In the near future, they also say they plan to announce deals that will give the model access to paywalled sources, presumably including academic journal repositories and other data. Beyond this, its specific capabilities and limitations remain murky. It's currently available only to users with a Pro account, although OpenAI plans to bring it to Plus tier users at some point in the future.

First Impressions
Using Deep Research is unsettling, pure and simple. There is something distinctly non-LLM—human, in fact—about its prose. Its use of language is more natural, nuanced, and sophisticated than conventional models, and it can be disarmingly informal while still professional. It’s a mature style: gone are the hamburger essays and purple prose that GPT-4o typically produces. Deep Research genuinely writes like a good PhD student—or a newly minted scholar. It’s a noticeable change, and you should click the links below to read some examples.
One of my first tests was to ask Deep Research for a historiographical analysis of the evolution of fur trade historiography, focusing on comparing and contrasting Canadian and American approaches. Its response is very good. But consider that right now it can only access full-text sources that aren’t paywalled, which is why the bibliography is so limited. Even so, it included lots of scholars it could not actually read…and did a good job with their works. Think forward to what this will look like once OpenAI actually allows users to upload files and Deep Research can access journal repositories and e-library resources on its own.
An AI Research Assistant
Deep Research is designed to do internet-based research—it is literally a research assistant that goes off and does a task while you do something else. So next I asked it to compile a list of archival primary sources, specifically of unpublished letters written by Alexander Henry. This is not an easy task for a computer as it involves identifying likely archives, navigating often archaic websites, and then successfully using a catalogue search. I watched in amazement as it worked its way through Library and Archives Canada’s catalogue all on its own—not an easy feat for a human these days—and identified several relevant collections. It did the same across 21 archives, including ArchiveGrid, and came back with a reasonably good list that mirrored my own initial survey from a few years ago. Not perfect, but exactly what I would expect from a human RA. Three years ago, most people thought this type of AI assistance was decades off.
It is telling that this is the first model I’ve accidentally started to anthropomorphize. And it’s not just the writing that’s different. Its analysis has a depth of insight that’s not there with other models. I’ve found myself reading its outputs with actual interest, not just because I’m amazed at the technical marvel in front of me, but because I start to find the analysis compelling and revealing.
Limitations and More Weirdness
Deep Research has some odd quirks as well as some significant issues. Some of this is related to the fact that it plans and carries out its own research. If you mess up the prompt in any way, or inadvertently send it off in the wrong direction with your follow-up answers, you can’t do anything about it. This can be frustrating as you watch its reasoning process develop. In one test, I watched its reasoning devolve into a string of random characters. It seemed to recover, but still, what happened?
The weirdest result, though, was also the most interesting. A few years ago, I showed that Alexander Henry’s 1809 Travels and Adventures in Canada and the Indian Territories was actually written by English children’s author and grifter Edward Augustus Kendall, partly from papers he stole from Henry in Montreal and partly from material he lifted from earlier travelogues. So I wondered whether Deep Research could help me compare Henry’s text to other travelogues to find any overlaps.
The results were truly surprising. Deep Research provided a 6,165-word answer which it claimed was based on “a combination of computational textual analysis (keyword and n-gram comparisons across digitized texts) and close reading [to] identify instances where Henry’s language or narrative motifs parallel those of his predecessors.” What? Really?
Is Deep Research Hallucinating Abilities it Doesn’t Have?
Running an n-gram analysis would require Deep Research to have access to tools or the ability to write and run Python code. OpenAI’s announcement indeed says: “Deep research independently discovers, reasons about, and consolidates insights from across the web. To accomplish this, it was trained on real-world tasks requiring browser and Python tool use, using the same reinforcement learning methods behind OpenAI o1, our first reasoning model.”
To probe the issue, I tried again and asked the model to show its work, outputting any relevant code at the end. This time it proposed not only to conduct an n-gram analysis, but also to measure cosine similarity using TF-IDF vectors and to calculate Jaccard similarity coefficients. In its response, it provided some Python code, but when I ran it, although the code executed, it didn’t reproduce the model’s results. In fact, the numbers were wildly off. I tried several other examples with similar results (same with the other functions). Another serious problem is that many of the quotes in the response were either entirely made up or poorly paraphrased.
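To make concrete what the model claimed to be doing, here is a minimal sketch of that kind of comparison: word n-gram overlap measured with a Jaccard coefficient, plus cosine similarity over TF-IDF vectors. To be clear, this is my own illustration, not the code Deep Research produced, and the two passages are placeholder strings standing in for the full digitized texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def word_ngrams(text: str, n: int = 3) -> set:
    """Return the set of word n-grams in a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_similarity(a: str, b: str, n: int = 3) -> float:
    """Share of n-grams common to both texts, out of all n-grams in either."""
    set_a, set_b = word_ngrams(a, n), word_ngrams(b, n)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Placeholder passages: a real analysis would load the full digitized texts of
# Henry's 1809 Travels and the earlier travelogues being compared against it.
henry_passage = "we proceeded up the river to the portage where the rapids begin"
earlier_passage = "they proceeded up the river toward the portage where the rapids begin"

# Cosine similarity between TF-IDF vectors built over word unigrams and bigrams.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform([henry_passage, earlier_passage])
cosine = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

print(f"Trigram Jaccard similarity: {jaccard_similarity(henry_passage, earlier_passage):.3f}")
print(f"TF-IDF cosine similarity:   {cosine:.3f}")
```

Even this toy version makes the practical requirement obvious: none of these numbers mean anything unless the model actually has the full texts in hand, which, as I discuss below, is precisely what I suspect it lacked.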
So how do we square this with the other results above? First, it is a good reminder to treat LLMs critically. But don’t breathe a sigh of relief either because I don’t think these failures stem from problems with the model itself.
I would love to hear from OpenAI on this, but my intuition is that Deep Research went off the rails because it tried to do something that it couldn’t actually do. I suspect that although OpenAI trained Deep Research on Python, it did not give it the ability to download and store the actual text files required to make its code work. When things like that happen to LLMs, they tend to break down pretty quickly because they don’t have enough information to understand what is happening and they start to spiral. I am speculating here, but when the code failed, Deep Research may have found that it did not have the actual text in its context and tried to recall the quotations as best it could. And when models do that, like people, they don’t do a great job and start to confabulate.
My theory is supported at least in part by a final test in which I gave Deep Research the same problem but told it not to use any code and to conduct a purely qualitative analysis instead. This time, the citations all worked and the quotations were accurate. The analysis was also, perhaps not surprisingly, much better, although I will admit that I haven’t yet clicked through all the links it provided.
Assuming this is the case, these types of errors suggest Deep Research has a number of latent abilities that OpenAI is likely to release at a later date. As these start to converge with access to paywalled sources, LLMs are going to do some really interesting and strange things.
Conclusions
My sense is that we are going to look back and see Deep Research as the start of a new era when AI agents started to work. This has enormous implications in general. As Logan Kilpatrick of Google DeepMind recently observed, people simply aren’t preparing for a world where intelligence is effectively free. But let’s focus on what it means for us as historians and academics specifically.
In an excellent blog post, Joshua Gans recently argued that these new reasoning models fundamentally undermine our existing research model because they allow us to answer many research questions on demand rather than through months or even years of laborious work. What is the point, he asks, of conventional academic publishing in such a world? This raises an obvious question: are we the last generation of human historians?
While I’ve heard lots of counter-arguments over the last couple of years—ranging from the ethical to the aesthetic—I have yet to hear one that convincingly parries the core issue: our work has value because it is time-consuming and our expertise is specific and comparatively rare. As LLMs challenge that basic equation, things will inevitably change for us, just as they have for every economic group faced with automation throughout history. To be clear, I don’t like this either, and I have real moral, ethical, and methodological concerns about machine-generated histories.
This is why we need to stop talking about LLMs in the abstract and start having serious conversations about the future of our discipline. My intuition is still that humans are going to need to be in the loop and that people will prefer human-generated histories to machine-made ones, but am I right? Even if I am, what is history going to look like in this brave new world? Are we willing to harness these tools and work alongside them? What exactly do we bring to the table that LLMs do not? If we keep avoiding these hard conversations, the rest of the world may move on without us.