Why OpenAI’s New Model Might Change Everything for Researchers
LLMs don’t need to be perfect; they need to be reliable. O1 is a huge step in that direction.
OpenAI announced a new model yesterday called o1. It reasons in a way that reduces hallucinations and allows it to tackle more complex problems. It’s not perfect, but in this post I want to explore why I think it’s such a significant development for applying LLMs to academic research, and how the approach may open up new use cases for LLMs.
Introducing o1
O1 is not a new foundation model (i.e., it is not GPT-5/Orion) but part of the GPT-4 class of LLMs. The base model is not yet available, but OpenAI made an early preview version available to Tier-5 OpenAI developers via the API (and fortunately this gave me early access). For context, the base o1 model is said to score 25–50% higher on a range of metrics than the preview version. OpenAI says that they are continuing to train and scale the base version and will provide frequent updates.
This model seems to be the infamous “Strawberry” model long rumoured to be in development. The model was apparently trained from the ground up on heavily curated open-source and licensed datasets, and OpenAI used a new reinforcement learning approach to teach it to “think” through its answer before writing it out. This process (Q*?) involved two main things: training the model to approach problems through Chain of Thought (CoT) reasoning while also giving it the ability to develop various possible answers and then choose the best one.
How it Works
CoT has been around for a while now as a prompting technique. It involves some variation of telling the model to “think step by step” before answering, which improves the quality of its answers. Here we need to remember that LLMs produce a response one word at a time by calculating the most probable next word. Each time the model adds a word, it considers the entire prompt as well as each previous word in the response.
By telling the model to think step by step, you force it to tackle problems one stage at a time, which changes the probabilities used to calculate the next word. This is why it can improve the quality of the final answer, especially for complex problems: it bakes essential intermediate information into the model’s calculations. But to this point it has been a one-shot, sequential process, because the LLM can’t actually “stop” and consider the problem, correct itself, or check itself for hallucinations. With o1, OpenAI used reinforcement learning to train the model to do this automatically, which, unlike prompting, fundamentally alters how the model calculates the probabilities for the next token.
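To make the prompting version of this concrete, here is a minimal sketch using the OpenAI Python library. The model name, question, and prompt wording are placeholders of my own, not anything from OpenAI’s announcement:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = (
    "A trader leaves the fort with 240 pelts and trades away a third of them. "
    "How many does he have left?"
)

# Conventional prompt: the model answers directly.
direct = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": question}],
)

# Chain-of-thought prompt: the same question, but the model is told to reason
# step by step first, which changes the token probabilities that shape the
# rest of the response.
cot = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": question + "\n\nThink step by step, then state your final answer.",
    }],
)

print(direct.choices[0].message.content)
print(cot.choices[0].message.content)
```

The only difference between the two calls is the instruction appended to the prompt, yet that instruction alone shifts the probability distribution over every subsequent token.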
As exciting as that is, the really revolutionary thing is that OpenAI also introduced a new inference architecture that allows the model to actually work step by step, pause, reassess, backtrack, and start again. Its responses are not one-shot, stream-of-consciousness affairs, but generated through a deliberative process of step-by-step “reasoning”. We can imagine this as looking something like a tree growing out of the question, with each new blossoming branch representing a different exploration of a potential answer. This is revolutionary because it represents a major shift in the inference process, giving the model the ability to “consider” a problem from different perspectives, test possible answers, backtrack to correct itself, and select better courses of action. While OpenAI has not shared details of how the process actually works, their suggestion that it can be scaled implies that the model is not exploring one tree at a time but many different trees in parallel, choosing the best answers from a large number of candidates. How many, we don’t know. Dozens? Hundreds? Thousands? That is probably one of the things that is scalable.
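OpenAI has not published how this search actually works, so the sketch below is purely illustrative: a crude best-of-N loop that samples several independent reasoning paths and keeps the one a second pass scores highest. Every function name, model name, and parameter here is my own stand-in, not o1’s real machinery:

```python
from openai import OpenAI

client = OpenAI()

def generate_candidates(question: str, n: int = 8) -> list[str]:
    """Sample several independent reasoning paths for the same question."""
    response = client.chat.completions.create(
        model="gpt-4o",      # placeholder; o1's internal search is not exposed like this
        n=n,                 # ask for n completions in a single call
        temperature=1.0,     # keep some diversity between candidate "branches"
        messages=[{
            "role": "user",
            "content": question + "\n\nReason step by step, then give a final answer.",
        }],
    )
    return [choice.message.content for choice in response.choices]

def score_candidate(question: str, candidate: str) -> float:
    """Grade a candidate with a second pass (a crude stand-in for a learned verifier)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Question: {question}\n\nProposed answer:\n{candidate}\n\n"
                       "Rate the answer's correctness from 0 to 10. Reply with the number only.",
        }],
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 0.0

def best_of_n(question: str, n: int = 8) -> str:
    """Generate n candidate answers and keep the highest-scoring one."""
    candidates = generate_candidates(question, n)
    return max(candidates, key=lambda c: score_candidate(question, c))
```

A learned verifier and a genuine tree search, with backtracking inside a single reasoning path, would be far more sophisticated than this, but the basic shape is the same: generate many candidate branches, score them, and keep the best.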
Scaling
Now, with an eye to the future, let’s talk about why the mention of scaling is so significant here. Up until now, when we’ve talked about scaling LLMs we’ve meant increasing the size of the training data and the number of parameters (neurons) in the model architecture. Thus far, as a model’s training data and architecture have grown, its capabilities have grown with them, so consistently that some argue the relationship amounts to a predictable “scaling law”. Whether that will (or can) continue to hold true, though, is fiercely debated. When the next generation of scaled-up foundation models appears later this year or early next, we’ll have a better idea of whether scaling is starting to bring diminishing returns.
But with the release of O1, OpenAI has introduced two new properties of LLMs that it says are also subject to similar scaling effects. The O1 team writes:
“We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.”
Train-time compute refers to the computational resources and time invested in teaching the model during training: essentially, the model’s “learning phase.” Unlike pretraining, which requires vast amounts of data, reinforcement learning and fine-tuning are more data-efficient because the model learns from interactions and feedback, often needing far less raw data to see significant improvements. So as models get bigger and bigger, finding such efficiencies will become more important.
Test-time compute, on the other hand, is the computational effort the model uses when it’s deployed to solve problems—the “thinking phase” when it actually comes up with an answer. What OpenAI is essentially claiming here is that by allowing the model to spend more time processing and reasoning, it will arrive at better and more accurate solutions.
In theory, test-time compute could increase exponentially, and OpenAI envisions putting this model to work on hard scientific problems for days or even weeks. That sounds compute-intensive, but it is nothing compared to the requirements of pretraining a GPT-4 level model. The flip side is probably also true: it will be possible to dynamically scale compute up or down at test time based on the complexity of the user’s question. This has important cost implications. With a conventional LLM, the compute expended on an answer is proportional to its length: the longer the answer, the more expensive it is to generate. The model works just as hard whether it is regurgitating a recipe for banana bread or trying to explain quantum mechanics. The o1 architecture implies that it will be possible to scale compute to the complexity of the question.
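As a back-of-the-envelope illustration of that cost model, here is a tiny sketch. The prices are hypothetical placeholders, not OpenAI’s actual rates; the point is only that cost now tracks how much hidden “reasoning” a question is allowed to consume rather than just the length of the visible answer:

```python
# Hypothetical per-token prices (assumed $15 / $60 per million tokens for this sketch);
# substitute the real rates for whichever model you actually use.
INPUT_PRICE = 15.00 / 1_000_000
OUTPUT_PRICE = 60.00 / 1_000_000

def answer_cost(prompt_tokens: int, answer_tokens: int, reasoning_tokens: int = 0) -> float:
    """Estimate the cost of one response; hidden reasoning tokens are treated
    like output tokens here (an assumption for this sketch)."""
    return prompt_tokens * INPUT_PRICE + (answer_tokens + reasoning_tokens) * OUTPUT_PRICE

# A simple lookup question might need almost no reasoning...
print(f"banana bread recipe:     ${answer_cost(200, 400, reasoning_tokens=100):.4f}")
# ...while a genuinely hard question could be allowed to 'think' at length.
print(f"hard research question:  ${answer_cost(2_000, 800, reasoning_tokens=20_000):.4f}")
```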
The key takeaway is that enhancing capabilities through train-time and test-time compute doesn’t rely on scaling up data to enormous sizes, but rather on optimizing how the model learns and thinks with the data it has. This approach sidesteps the bottleneck of finite data and opens up new avenues for improving AI performance without the prohibitive costs associated with scaling traditional pretraining methods. It also makes it possible to envision creating a whole series of models finely tailored to specific tasks or areas of scientific research.
Accuracy, Reliability, and Trustworthiness as Emerging Capabilities
After using the model extensively for the last twenty-four hours, my intuition is that we can increasingly think of accuracy, reliability, and trustworthiness as emergent properties that may well scale with the o1 approach. By accuracy, I mean the model’s ability to produce correct and precise answers with fewer or more predictable hallucinations. Reliability, on the other hand, refers not to consistency in the output itself but to consistency in the “thinking” process. The implication of the approach described by OpenAI is that it should eventually become highly customizable, allowing organizations to train models to follow a specific series of protocols or thought processes that they can test and vet.
Both accuracy and reliability contribute to improved trustworthiness, but so too does the interpretability of the process itself. At the moment, OpenAI has opted to keep the “thought process” private and hidden from users, although in ChatGPT you can see a summary of what I assume are the successful steps in the CoT flow (but this is not yet available in the API). But as OpenAI notes in the blog post, this will eventually give users the ability to monitor the development of LLM responses for bias or unsafe actions as well as to quickly evaluate them by examining the underlying steps in the process.

This is all extremely important for software developers. I would argue these properties have been the major barriers to moving from the prototype phase to deployment up to now. Developers can “wow” audiences at the prototype stage, but the models have just not been accurate or reliable enough to generate the trust necessary to actually put them out there.
The Case of PearlBot
Let me use my own work to provide some concrete examples about how o1 is likely to change the game. Over the past year, I’ve led the development of PearlBot (named after my smartest cat), a prototype AI research assistant designed to answer queries using a database of open-source 18th-century fur trade records. Building it was a learning experience that involved creating both relational and vector databases for effective keyword and semantic searches. The system employs fine-tuned language models to parse queries, retrieve relevant documents, and generate answers with in-text citations for verification. The prototype appears really impressive when I show it to an audience, but the reality is that we haven’t made it available yet because our team has a series of verification tests that the system always fails.
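For readers curious about what a system like this looks like under the hood, here is a heavily simplified sketch of the retrieve-then-generate loop. The function names, hard-coded records, and prompt wording are illustrative stand-ins, not PearlBot’s actual code:

```python
from openai import OpenAI

client = OpenAI()

def retrieve(query: str, k: int = 15) -> list[dict]:
    """Stand-in for the hybrid retrieval step: the real system combines keyword
    search over a relational database with semantic search over a vector
    database and merges the top k records. Hard-coded here for illustration."""
    return [
        {"id": "2327", "text": "Jan. 31, 1805: Wentzel records the birth of a son..."},
        {"id": "6099", "text": "Oct. 17, 1807: Croup de Chien sends four beaver skins to Wentzel's wife..."},
    ][:k]

def answer(query: str) -> str:
    """Retrieve records, pack them into the prompt, and generate a cited answer."""
    docs = retrieve(query)
    context = "\n\n".join(f"[{d['id']}] {d['text']}" for d in docs)
    response = client.chat.completions.create(
        model="o1-preview",  # any chat-capable model would work here
        messages=[{
            "role": "user",
            "content": (
                "Answer the question using all of the relevant documents below, "
                "cite each document you use with its number in square brackets, "
                "and be as thorough and detailed as possible.\n\n"
                f"Documents:\n{context}\n\nQuestion: {query}"
            ),
        }],
    )
    return response.choices[0].message.content

print(answer("Tell me about Wentzel's family"))
```

The interesting failures all happen in the final step: retrieval usually surfaces the right records, and the question is whether the model actually uses them.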
If you ask PearlBot the deliberately vague, open-ended question “Tell me about Wentzel’s family” (a sentence I’ve now written thousands of times), it searches a database of 7,500 records and retrieves around 15 results. These are then passed to an LLM which is given the question and a system message instructing it to use all the relevant documents, cite them in square brackets, and be as thorough and detailed as possible. Our team has designated nine of the documents as essential, which means that we’ve decided that in order for a model to pass the test, it needs to use and cite the information from all of those documents. This is because it is what we would expect of a graduate student or a fellow historian given the same task and body of sources. Across hundreds of tests, GPT-4, GPT-4o, Claude Opus, and Sonnet-3.5 all used an average of about 3.5 of the nine required documents in their answers. Sometimes this was as low as 1 but it’s never exceeded 6.
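That pass/fail criterion is easy to automate: extract the bracketed citations from the answer and compare them against the nine essential document IDs. A sketch of the check (the IDs below are made up for illustration) might look like this:

```python
import re

# Placeholder IDs standing in for the nine documents our team marked as essential.
ESSENTIAL_DOCS = {"2327", "6099", "4410", "4822", "5103", "5560", "5891", "6230", "6412"}

def cited_ids(answer: str) -> set[str]:
    """Pull every [1234]-style citation out of the model's answer."""
    return set(re.findall(r"\[(\d+)\]", answer))

def passes_test(answer: str) -> bool:
    """The answer passes only if every essential document is cited."""
    used = cited_ids(answer) & ESSENTIAL_DOCS
    print(f"used {len(used)} of {len(ESSENTIAL_DOCS)} essential documents")
    return used == ESSENTIAL_DOCS
```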
Using nine of nine documents is only a minimum requirement; it also matters how the model actually interprets, organizes, and uses the texts. Again, even the best models don’t do well with this. For example, in his diary, Wentzel consistently referred to his wife as “my girl”, which sometimes confuses models into thinking he was referring to his servant or daughter. They also frequently misunderstand references to the families of other people as references to Wentzel’s own family: when Wentzel tells a friend to give his best to “Mrs. McKenzie”, models often assume she is one of Wentzel’s relatives.
The models sometimes use correct information but then attribute it to the wrong sources, and they have a hard time arranging events into a chronological, logical sequence. They especially have trouble connecting related information and events separated by time and space. For example, one document describes the birth of Wentzel’s child in the winter of 1805 while another mentions him travelling with his wife and three children in the summer of 1805. The models frequently assume that this means he had four children, rather than drawing the logical inference that the child born in the winter of 1805 is one of the three children seen in the canoe that summer. One of Wentzel’s “young boys” also died in 1808, and although the models always note this fact, they sometimes assume this must have been a fifth child. They also frequently overlook the significance of a subsequent but oblique reference to the tragedy, which betrays that Wentzel remained preoccupied by his son’s death for some time, to the point of neglecting his job, an important insight into his personality.
Although we have varied the system message and prompt in many different ways, nothing consistently improves performance beyond the limits outlined above. Some prompts get better answers, but no prompt eliminates all of the problems. So as impressive as it is to watch PearlBot write out an answer with verifiable citations, the program has never been close to being ready for primetime. The answers are simply incomplete and too often inaccurate, unreliable, and untrustworthy. Up until now, it has been a good proof of concept but little more.
PearlBot meets o1-Preview
O1 is the first model that can pass our tests. We’ve only had a day to try it out, but after integrating it into PearlBot, we ran the question ten times. Across those runs, it used an average of 8 of the nine required documents (low of 6, high of 9), passing the minimum requirement 5 times. That is a massive improvement on its own, turning our former maximum into the new minimum, but its responses have also been much more consistent, not necessarily in form but certainly in content. O1 repeatedly draws the same inferences, connects the same information and events, and makes correct judgements. It has fewer problems arranging events into a coherent chronological narrative where one event connects to the next. I have yet to see an outright hallucination, although it sometimes cites the wrong source number (less often than the other models, but it happens).

Here is why it feels like a different beast. In one typical run, it wrote:
Wentzel's family appears to have been intimately connected with the Indigenous communities among whom he lived and worked. On January 31, 1805, Wentzel joyfully recorded the birth of a son, noting with a mix of pride and astonishment the child's size and resemblance to himself. He wrote that his "girl" had "happily made me a great boy," and marveled at "how a Lilliputian could make such a big male child, nearly as big as myself, and more than 20 inches long."[2327] This passage not only underscores his paternal pride but also hints at his own modest stature, referring to himself humorously as a "Lilliputian."
This is an important example because the model had to not only correctly parse that Wentzel’s “girl” was his wife, but also use his joke about the baby’s size to infer something about Wentzel’s own appearance. It now correctly infers that the child born in January 1805 was likely one of the three children seen in the canoe that summer, and correctly notes that the loss of his son in 1808 “deeply affected him as evidenced by a subsequent entry on March 7, 1808.” This is not a one-off; it does both things consistently.
O1 also uses the documents to make original and insightful observations. For example, it writes:
The participation of Wentzel's wife in the trading community is further evidenced on October 17th, 1807. Wentzel recorded that "while my wife was across the river, an old man named Croup de Chien sent her four beaver skins and one meat item, though he did not owe anything" ([6099]). This gesture suggests that Wentzel's wife was respected and perhaps held a significant social standing within the local community, receiving gifts even in the absence of debt or obligation. It also illustrates her mobility and active engagement outside the immediate confines of the trading post.
The model’s observation that the gift by Croup de Chien to Wentzel’s wife speaks to her status in the broader community is reasonable but not one we’ve seen from other models. So too with its observation about her mobility and engagement in the fur trade outside the fort itself.
Conclusions
The model just feels more mature. It’s a boring writer, but it does what you ask of it more consistently and ably. While a lot more testing is needed, it feels like a big step towards a more accurate, reliable, and trustworthy model. To my mind, PearlBot and similar applications will be deployable when they meet or exceed human performance on similar tasks. We are getting very close to that here.
However, it is far from perfect (and that is OK). It does not follow instructions all that well, and if you bury the nine relevant documents in a much larger pool of 120 irrelevant documents, its performance degrades significantly. Maybe scaling can improve this in future versions.
To look forward just a bit, we need to remember that this is an early preview model which, according to OpenAI, performs less ably than the base o1 model. It is also a GPT-4 level model, not a next-gen model like GPT-5. When the next frontier models appear, it is entirely possible that the doubters will be proved right and the limitations of scaling will mean that the exponential progress in LLM capabilities will begin to level off. But if this isn’t the case, when the o1 regime is applied to the new models we might find ourselves in wholly uncharted waters.