Understanding the Fuzzy Accuracy of Generative AI
If you find yourself focusing solely on ChatGPT’s errors, you need to think about it in a different way
Have you heard of the book The Birth of Canada: A Cultural History of the Confederation Era by GFG Stanley? ChatGPT says it is the single most important book on Canadian Confederation, but of course a quick Google reveals that it doesn’t exist!
Erroneous results like this seem to be leading many scholars and pundits to dismiss ChatGPT as useless or even dangerous. At first glance, that might make sense, but only if we see it as just another type of search engine. Generative AI actually offers something quite novel, and that is the point. If you find yourself focusing solely on its errors, you need to think about it in a different way.
Google vs ChatGPT
For this post, I am going to focus on ChatGPT as it is the best-known tool at the moment. If we want to understand why it is still useful despite its occasional tendency to hallucinate, we need to understand what it is doing when this happens and why. So let’s begin by looking at what chatbots were actually designed to do, because although they seem similar to search engines on the surface, they are a very different thing. Google was created to find websites based on queries, and its success is measured by the relevance of its results. Chatbots, on the other hand, are made to have conversations with users about a wide range of topics. They have historically been deemed successful when a user feels like they are talking to a real human.
The Large Language Models (LLMs) underpinning AI chatbots are trained to respond to prompts by reading tens of billions of pages of text. Every time ChatGPT generates part of a word (called a token), it runs a calculation across its roughly 175 billion parameters to determine a probable way to continue the sentence. But that does not necessarily mean it always chooses the most probable (or indeed factually correct) thing to say next. This is a crucial point, because if it did that, its answers would always be more or less identical. Instead, because it is meant to be dynamic and conversational, it is programmed to introduce an element of randomness into the equation, controlled through a parameter called “temperature.” This makes the LLM sometimes choose less probable outputs. The higher the temperature setting, the more “original” the response will feel to a human, but the greater the chance of factual or contextual errors. LLMs use their training and the context provided by a user’s prompts (and in the case of ChatGPT, earlier parts of the conversation, up to about 3,000 words) to determine the range of what should be probable. Elaborate, detailed prompts with clear instructions and parameters almost always elicit the best responses. This is why some AI scholars have described LLMs as “stochastic parrots,” or likened interacting with them to a version of the mirror test.
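To make the role of temperature concrete, here is a minimal sketch in Python of how a model might turn its raw scores for a handful of candidate next tokens into probabilities and then sample one. The candidate tokens and their scores are invented for illustration; a real model scores a vocabulary of tens of thousands of tokens at every step.

```python
import math
import random

# Invented scores ("logits") a model might assign to candidate next tokens
# for a prompt like "The most important book on Canadian Confederation is..."
logits = {"widely": 2.1, "arguably": 1.7, "The": 1.2, "probably": 0.4}

def sample_next_token(logits, temperature=1.0):
    """Convert scores to probabilities (softmax) and sample one token.

    Lower temperature sharpens the distribution, so the most probable token
    wins almost every time; higher temperature flattens it, so less probable
    (more "original", more error-prone) continuations appear more often.
    """
    scaled = {tok: score / temperature for tok, score in logits.items()}
    total = sum(math.exp(s) for s in scaled.values())
    probs = {tok: math.exp(s) / total for tok, s in scaled.items()}
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Count which token gets picked over 1,000 draws at three temperature settings.
for t in (0.2, 1.0, 1.8):
    picks = [sample_next_token(logits, temperature=t) for _ in range(1000)]
    print(t, {tok: picks.count(tok) for tok in logits})
```

At a low setting the output is nearly deterministic; at a high setting the “losing” tokens show up regularly, which is exactly the trade-off between predictability and originality described above.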
Consider how a “raw” base LLM, one that has not yet been tuned into a chat assistant, works in comparison to a fully developed app like ChatGPT. In basic models (like Meta’s new LLaMA model, just released to researchers), the prompts must be especially leading to elicit a lucid response. For example, the documentation on the LLaMA GitHub page tells users to avoid asking LLaMA to “Explain the theory of relativity”, suggesting they prompt it instead with something like: “Simply put, the theory of relativity states that…” This invites LLaMA to infer a completion, which is why an LLM’s responses are technically called “inference”. In effect, it is a typical drama-class exercise, albeit one based on billions of pages of text. Although you don’t see what is happening under the hood when you use ChatGPT, it too is trying to infer a response from your lead.
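This completion behaviour is easy to reproduce with a small open model. Below is a minimal sketch using the Hugging Face transformers library; it uses GPT-2 rather than LLaMA purely because GPT-2 is small, freely downloadable, and runs on an ordinary laptop, so treat the outputs as illustrative of a base model’s behaviour rather than of LLaMA specifically.

```python
# pip install transformers torch
from transformers import pipeline

# Load a small base (non-chat) language model.
generator = pipeline("text-generation", model="gpt2")

# A bare instruction gives a base model little to complete from...
print(generator("Explain the theory of relativity.",
                max_new_tokens=40, do_sample=True, temperature=0.8)[0]["generated_text"])

# ...whereas a leading sentence invites it to infer a continuation.
print(generator("Simply put, the theory of relativity states that",
                max_new_tokens=40, do_sample=True, temperature=0.8)[0]["generated_text"])
```

Run it a few times and the continuations change with every call, another reminder that the model is sampling plausible text rather than retrieving a stored answer.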
Because it is so focused on conversational flow and language, ChatGPT is especially bad with sources. It does not really have the capacity to understand who said what, and when. Think for a second about how it would relate to sources during training. Most writing does not contain citations, and so bibliographies, footnotes, and especially short-form endnotes must be interpreted as unusual, confusing, disembodied non sequiturs, devoid of most of the context that LLMs require to understand how words relate to one another. In effect, what it learns, if anything, is that certain authors’ names are associated with certain topics. The titles it invents, like Stanley’s fictitious work mentioned above, are cobbled together from these bits and pieces to sound plausible.
Understanding this process helps to explain why results from ChatGPT can also be so uneven. Despite making up fictitious sources on Canadian Confederation, it can paradoxically generate pages and pages of relatively good analysis of the old “act or pact” debate. This is not the type of thing we are used to seeing in academia: students who make up random books rarely have much of substance to say about their subjects. Not so with generative AI. It is most accurate where the source material is voluminous and consistent, which is also why tools like ChatGPT were never intended to be used as search engines.
Finding Use Cases
Colleagues have rightly asked me: given these limitations, how can generative AI like ChatGPT possibly be useful? Simply put, if you expect it to be infallible and plan to use it solely on the input side of the research equation, that is, to retrieve sourced information, you will not find it useful at all. What makes generative AI so revolutionary, though, is that these tools are incredibly efficient at making use of information on the output side, including in ways that are beyond human capacity. They excel at the skilled tasks normally reserved for humans, like writing, editing, synthesis, summary, and analysis. They shine here precisely because of their myopic focus on language, text, and context, which, of course, is also what makes them so bad at citing their sources.
We are at the beginning of a new era in how we relate to information. Just as digitization and electronic search once revolutionized how we find information, generative AI is now overhauling what we are able to do with the electronic information we’ve spent the past few decades compiling. But we are only at the start of this process, which is why tech companies are so interested in exploring new “use cases” for the technology: even OpenAI did not expect ChatGPT to go viral, nor did it anticipate many of the uses that people found for its flagship app.
To understand this moment, we also need to acknowledge that, despite everything above, this new generation of LLMs is actually quite accurate, surprisingly so. This is why they are able to consistently pass legal and medical exams, do well on the SAT, and write A-grade undergraduate essays or book reviews. They get things wrong, but so do humans. And for many applications, that’s OK.
I think this unsettles many people because most of us are used to relating to computer programs in a binary way: either they are right or they are wrong; either they work or they don’t. But LLMs are a different beast, and this will take some getting used to. They have a fuzzy relationship with accuracy, and their results are more human-like in that they are inconsistent and variable, for all the reasons I outlined above. Even so, in dwelling on their errors, we risk missing the bigger picture.
What sets generative AI apart from us “fallible meat bags” (as Sydney once called us) is its incredible speed and ease of use. As but one example, ChatGPT can write passable copy in 20 seconds that might take a skilled human a day or so to produce. But it can also do things that people cannot: it can, for example, translate texts between several languages very accurately and almost instantaneously. It can code in a variety of programming languages and train users in difficult and unfamiliar tasks, step by step. It also remains patient and courteous no matter how angry a user gets, so long as it is trained to do so. As you can imagine, these use cases make it valuable right now in any number of professional settings, which is why companies are racing to adopt it.
But its most important virtue is that it is also incredibly easy to use. Because programs like ChatGPT use natural language to execute commands, you only need to be able to express yourself clearly in order to get it to do what you want it to do. Ask it to write a program in C++ that allows you to open a JPEG and highlight text with a cursor, and it will do so. Nearly instantly. Not sure what to do with the code? You can ask it that too, and it will tell you how to compile it. Need a copy of Visual Studio to do that? It can point you to the website and walk you through the installation.
Why All the Disruption?
Automation is hardly a new phenomenon, but it is not one with which the middle classes are especially familiar. In this sense, generative AI is especially disruptive not just because it will do some of the things humans once did much faster, but because it is also reversing the trend towards intellectual specialization. AI is generalizing the skills necessary to produce highly refined outputs, not in a single area but across the board. Going forward, as the technology evolves, this gap between what the technology can do and what any one specialist can do will only widen. That is why it would be a big mistake to discount something like ChatGPT just because it does not retrieve the results you want. That is not what it was designed to do.