Most mainstream media attention has focused on ChatGPT’s writing abilities. Unfortunately, this has obscured its much more interesting ability to perform customized and adaptable task-based reasoning. What I mean by this is that the underlying engines behind things like ChatGPT can do a range of monotonous jobs. I think it would be amazing to have a research assistant transcribe documents, translate them, store them in a database, and help you organize and then find them. It would be even better if this AI assistant could compile data and visualize it. These are all tasks that require some level of “thinking”.
Thinking Machines
I’m fascinated by the debate about whether LLMs can truly “reason” and have been reading and listening to everything I can on inscrutable matrices, “black boxes”, and emergent capabilities. It’s fascinating and scary stuff. Few people realize that this is more alchemy than science at present. Think about this: the people who built these things don’t know exactly how they work. In essence, researchers fortuitously discovered that when you train a neural network containing a very large number of parameters to predict the next word using an extremely large corpus of text, it starts, all by itself, to do interesting things that its creators did not expect and did not train it to do. This is the reason some very smart people in the AI field are getting worried: humans have built a powerful technology, and we can’t actually explain how it works.
To be clear: I am using the term “reasoning” to describe the fact that when you ask GPT-4 to do something quite complex, something that involves what we would normally call “thinking”, like parsing a series of difficult-to-follow instructions, it usually “gets” what you mean and performs the task quite well. While I think it is important to understand what this actually involves, especially for high-risk applications like medicine, it is less crucial when all you want it to do is create a table from data in a text file. In “low-stakes” cases like this, it either works or it doesn’t. You can tell right away. And for these practical purposes, it doesn’t much matter whether it actually thinks or is just really, really good at predicting the next word.
The ability to follow instructions and solve problems is what makes LLMs so powerful. And we can harness this power to do useful things for us.
Building an AI Research Assistant
For the past few months I have been working on creating an AI-powered research assistant. Like most social historians, I am intrigued by the stories of ordinary people and am probably something like a microhistorian at heart: someone who closely reads mundane, inglorious, and overlooked sources to generate detailed pictures of ordinary lives lived in historical obscurity. It is an arduous and inefficient process. For example, I spent years generating detailed biographies of hundreds of ordinary shell-shocked soldiers from medical records, war diaries, personnel files, newspapers, and veterans’ records to help me understand what happened to them and why. More recently, I did something similar for around 300 voyageurs from St-Benoit parish using voyageur contracts, parish records, legal documents, and fur trade records. In both cases, I had to sift through hundreds of different sources, looking at one individual at a time, one source at a time, in order to generate mini-biographies or life histories. To me, the value of this approach is that it often reveals hidden trends and patterns that are rarely written down in the memoirs, diaries, and letters of elites. The trade-off is that it is time-consuming and repetitive, to say the least.
I’ve been working on developing an AI research assistant that can speed up this process. What I envision is an AI-enhanced search and retrieval agent that can be used to query a large body of documents and make the process of retrieving information across disparate datasets faster and more efficient. In effect, I want to automate the searches, the clicking of links, and the scrolling through documents.
In later posts, I will explain the process in more detail, but in a nutshell the program I am building starts with filling a special kind of database, called a vector database, with transcribed historical documents. Vector databases attach something called “embeddings” to chunks of text or images. These long lists of floating-point numbers look and act a bit like GPS coordinates, representing the meaning or content of blocks of text and images in an imaginary, high-dimensional semantic space. Just as GPS coordinates tell us how near we are to a given point on the Earth, embeddings can be used to retrieve the data from the database that is nearest, or most relevant, to a given query. This data might be several pages of text, a few sentences, or even single points like a birth year and name. When the user generates a query, the GPT-4 API answers it from this database rather than from its training data alone, which makes it more accurate and useful.
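To make this concrete, here is a minimal sketch of that loop, assuming the openai Python library (pre-1.0 interface) and an API key already configured. The documents, chunking, and query are placeholders, and a real project would swap the in-memory list for a proper vector database.

```python
# A toy version of the pipeline: embed transcribed documents, store the
# vectors, then answer a question using only the retrieved passages.
# Assumes the openai library (pre-1.0 interface) and an API key in the
# OPENAI_API_KEY environment variable; the document chunks are placeholders.
import numpy as np
import openai

def embed(texts):
    """Get an embedding vector for each chunk of text."""
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return [np.array(item["embedding"]) for item in resp["data"]]

# 1. Load transcribed documents, split into chunks, and embed them.
chunks = ["...transcribed letter, page 1...", "...parish register entry..."]
vectors = embed(chunks)  # this is the "vector database", in miniature

# 2. Embed the query and find the closest chunks (cosine similarity).
query = "How did soldiers describe their nerves before an attack?"
q_vec = embed([query])[0]
scores = [np.dot(q_vec, v) / (np.linalg.norm(q_vec) * np.linalg.norm(v))
          for v in vectors]
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:3]]

# 3. Ask GPT-4 to answer using only the retrieved passages.
answer = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "Answer using only the supplied documents and cite them."},
        {"role": "user",
         "content": "Documents:\n" + "\n---\n".join(top_chunks) +
                    "\n\nQuestion: " + query},
    ],
)
print(answer["choices"][0]["message"]["content"])
```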
Historical Agents
This sort of program is nothing new and is very similar to the types of agents and “Chat-with-my-PDF” plugins that people have been developing since the GPT-3.5 API was released last winter. Yet historians have some unique needs that those off-the-shelf solutions don’t address. In sifting through documents, we need to pay attention to chronology and dates and ensure that citations are accurate. This is all possible with properly structured vector databases and search functions, but they need to be customized for the ways historians actually ask questions and the types of results we expect to see. In one sense, this is just another type of search function, but it’s also more than that. Here is a clear use case where search often fails: how many times have you spent hours looking for a specific document that you know you’ve seen in your files but can’t quite remember the wording of, or which folder it was in? It would save a lot of time and effort to be able to describe its contents in plain language to an AI assistant that could then go find it based on a somewhat vague description. Semantic searching using vector databases allows you to find things that are similar in concept, but not necessarily in wording.
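As a hypothetical sketch of what that customization could look like, here is how a date and a citation might travel with each chunk in an off-the-shelf vector database (Chroma, in this example), so that a semantic query can be restricted to a date range and every hit comes back with its citation. The collection, records, and metadata fields are all invented for illustration.

```python
# Hypothetical sketch: storing chronology and citations alongside each chunk
# so that semantic search respects the way historians actually ask questions.
# Assumes the chromadb library, which embeds documents with a default model.
import chromadb

client = chromadb.Client()
letters = client.create_collection(name="soldier_letters")

# Every chunk carries the metadata a historian needs: a year and a citation.
letters.add(
    ids=["letter-001", "letter-002", "letter-003"],
    documents=[
        "I will not pretend my nerves were steady when the barrage began...",
        "The rations arrived late again and the bread was mouldy...",
        "Every man has his wind up before the whistle blows, whatever he says...",
    ],
    metadatas=[
        {"year": 1915, "citation": "LAC, RG 9, vol. 1234, letter of 3 July 1915"},
        {"year": 1916, "citation": "LAC, RG 9, vol. 2345, letter of 9 May 1916"},
        {"year": 1918, "citation": "LAC, RG 9, vol. 3456, letter of 12 Oct 1918"},
    ],
)

# A conceptual query, restricted by chronology, with the citation returned.
results = letters.query(
    query_texts=["fear before going into battle"],
    n_results=2,
    where={"year": {"$lte": 1916}},  # only letters written in 1916 or earlier
)
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(meta["citation"], "->", doc[:50], "...")
```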
Here’s a more expansive example. Imagine you are looking for information on how soldiers responded to fear in battle by going through a database of their letters. You’d quickly find they didn’t always use words like “scared” or “afraid” and often employed euphemisms instead: not always, but often enough to make keyword searching pretty unreliable. While the obvious solution is to just “read all the letters” manually, this isn’t practical if you are faced with tens of thousands of individual documents. To solve this sort of problem, we might normally construct a sample of some sort, but this would mean going through an awful lot of irrelevant data in order to find a few precious examples that may or may not be representative (depending on how you chose your sample). No matter how you slice it, it’s a very time-consuming and labour-intensive task. The unpleasant truth is that all too often, all that work results in only a few lines in whatever article or book chapter you’re working on.
With embedding-based search, paired with an AI agent, you can quickly find all the relevant examples in a data set without having to categorize or code them first. It’s a technical process, but surprisingly easy to accomplish. In a nutshell, your query is given an embedding, via an API like OpenAI’s Ada-002, that represents its conceptual and semantic content. Provided you have the information you need loaded into a vector database, the AI will then match your query to the embeddings of letters that are close to it in that imagined semantic space. Done properly, the results will be topically and conceptually relevant, and you can iterate through all the examples in a very large database quickly.
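Under the hood, the matching step is just a comparison between vectors. Here is a rough, recall-oriented sketch of that sweep, again assuming the openai library; the letters and the similarity cutoff are placeholders you would tune against your own material.

```python
# Sketch of a recall-oriented sweep: instead of a top-3 answer, surface every
# letter that is conceptually close to the query, however it is worded.
# Assumes the openai library (pre-1.0 interface); letters and the 0.80 cutoff
# are placeholders.
import numpy as np
import openai

letters = {
    "letter-014": "I don't mind telling you my knees were none too steady...",
    "letter-231": "The rations arrived late again and the bread was mouldy...",
    "letter-302": "Every man has his wind up before the whistle blows...",
}

resp = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=["fear before going into battle"] + list(letters.values()),
)
vecs = [np.array(d["embedding"]) for d in resp["data"]]
query_vec, letter_vecs = vecs[0], vecs[1:]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

hits = sorted(
    ((cosine(query_vec, v), lid) for lid, v in zip(letters, letter_vecs)),
    reverse=True,
)
for score, lid in hits:
    if score >= 0.80:  # arbitrary cutoff; inspect the scores and adjust
        print(f"{lid}: similarity {score:.2f}")
```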
But let’s expand this out a bit more. With enough tweaking and database building, and access to an LLM with state-of-the-art visualization abilities, you could also have it do some additional lateral processing: pulling up biographical information on the author of a letter from their personnel file and census records, for instance. An AI agent could also go through unit war diaries and find out exactly what a soldier’s unit had been doing in the days before the letter was written. It could even go through newspapers to find out what the recipient may have been reading when the letter arrived in the mail. The possibilities for this type of lateral, recursive research are limited only by the availability of digitized documents, or by the historian’s ability to digitize the documents themselves.
The Future of Historical Search
Imagine harnessing this AI reasoning power to do something that would be impractical for a single human, or even a team of humans, to do in a lifetime. Here I think of the remarkable Programme de recherche en démographie historique (PRDH) database, which documents the lives of everyone who lived in New France/Lower Canada from the 1620s to the 1860s. This project has been ongoing for decades and is a truly invaluable resource that has enabled countless studies. As I keep experimenting with this technology, it’s clear that it will soon become feasible to construct a similar database for Canada as a whole, covering any combination of records from 1842 through the 1930s. When you think about it, this has been theoretically possible for years; the only limiting factors were the enormous number of human research assistants and the amount of time required. Yet AI dramatically reduces these costs by promising to automate processes that are time-consuming and repetitive, compressing days of work into seconds or minutes. Rest assured, genealogy websites have been experimenting with this sort of technology for a while and will soon make it a core aspect of their business. They are already using it to transcribe and index the 1931 census in partnership with Library and Archives Canada.
Over the past few months, I have been experimenting with a small data set and am now in the process of scaling it up, at this point just to see what I can get it to do. It is still a giant experiment. Right now, the possibilities seem pretty endless, so long as you can feed the LLM the relevant data and tell it what you want it to do via processes called “few-shot learning” or “fine-tuning”. The main limiting factors are the context window and the need to OCR and “clean” your data: you can send GPT-4 up to about 25k words of text to work with, while Anthropic’s Claude takes around 75k words. And these context windows will only keep growing (remember when a single megabyte of RAM was state of the art?); multi-modal models will soon be able to work from images without OCR, including images of handwritten text. You get the picture.
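To give a sense of what “telling it what you want” looks like in practice, here is a small few-shot sketch, again assuming the openai library: two worked examples in the prompt show GPT-4 the output format, and it is then asked to do the same for a new record. The records and field names are invented for illustration.

```python
# A few-shot prompt: two worked examples teach the model the output format,
# then it is asked to do the same for a new record. Assumes the openai
# library (pre-1.0 interface); the records themselves are invented.
import openai

examples = [
    ("Bapteme de Jean Bte fils de Pierre Larocque voyageur et de M. Anne "
     "Sabourin, le 3 mars 1791",
     '{"name": "Jean Baptiste Larocque", "event": "baptism", '
     '"date": "1791-03-03", "father": "Pierre Larocque", '
     '"mother": "Marie Anne Sabourin"}'),
    ("Sepulture de Marguerite Chenier veuve de feu Jos. Leduc agee de 67 ans, "
     "le 12 janvier 1803",
     '{"name": "Marguerite Chenier", "event": "burial", '
     '"date": "1803-01-12", "age": 67, "spouse": "Joseph Leduc"}'),
]

new_record = ("Mariage de Francois Xavier Dubois et de Josephte Lalonde, "
              "le 9 fevrier 1797")

messages = [{"role": "system",
             "content": "Extract the genealogical facts from each record as JSON."}]
for record, answer in examples:            # the few-shot examples
    messages.append({"role": "user", "content": record})
    messages.append({"role": "assistant", "content": answer})
messages.append({"role": "user", "content": new_record})

resp = openai.ChatCompletion.create(model="gpt-4", messages=messages)
print(resp["choices"][0]["message"]["content"])
```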
I suspect that in the long run, the types of programs I am working on will become as commonplace as library search engines and archival databases. I am sure the next generation of historians won’t think much about them either, but right now they seem pretty revolutionary to me.