A First Look at GPT-4 and the Humanities
GPT-4 is out and it solves many of the “problems” with ChatGPT
How quickly things change. Today OpenAI released GPT-4, the much-awaited successor to the GPT-3.5 engine that has powered ChatGPT since November. Even at this early stage, it is evident that it can do many things better than its predecessor, even though its most intriguing features have yet to be released. In this post, I want to share my initial impressions with some quick side-by-side comparisons.
First, though, just so everyone is on the same page: what is GPT-4? GPT-4 is a multi-modal Large Language Model (LLM), which means that it can accept both text and images as input. Its “vision” capabilities have not been fully released yet, but the demos are impressive. It is also significantly better on accuracy and reasoning: whereas GPT-3.5 scored in the 10th percentile on the Uniform Bar Examination (UBE), GPT-4 scores in the 90th percentile, and it obtains similarly high results on many other standardized tests.
To me, though, one of the most important changes for historians is GPT-4’s capacity to “remember,” process, and generate roughly ten times as much text. Its previous limit of about 3,000 combined words restricted its ability to analyze documents and solve historical research problems. More data opens the possibility of using the model to do some really interesting things. More on this later.
For now, here is my initial impression of what GPT-4 can do better.
1. GPT-4 is Much Improved on Sources
With ChatGPT, one of the biggest problems (or saving graces, if you are worried about plagiarism) has been its inability to handle sources. As an example, I asked the old model of ChatGPT to provide a list of the five most important books on the social history of Canada in the First World War and, not surprisingly, it responded with complete nonsense: the books either do not exist or are wildly off point, sometimes both. When I told it that it had made up the titles, ChatGPT apologized and provided five more made-up or off-base suggestions. Hence the standard advice to distrust ChatGPT’s sources.
But no more: GPT-4 is a world of difference. With the same prompt, it provided a list of five real and on-point titles: Jonathan Vance’s Death So Noble, Sarah Glassford and Amy Shaw’s A Sisterhood of Suffering and Service, J.L. Granatstein and J.M. Hitsman’s Broken Promises, Jeff Keshen’s Propaganda and Censorship, and Ian Miller’s Our Glory and Our Grief, all with full and correct citations.
I tried this a few times with different topics, and these results are typical. That said, it still occasionally hallucinated, but only on one of the five titles. Interestingly, this time, when I told it that the list was inaccurate, it apologized, identified the erroneous title, and removed it. It can now fact-check itself.
For those hoping to deploy ChatGPT in their work, this is a major advance. If you are primarily worried about plagiarism, the end may really be nigh.
2. It Is Really Improved on Citations
GPT-3.5 Turbo did not handle citations very well. It could not reliably convert between Chicago style and MLA, nor could it provide a full citation based on a partial note.
GPT-4 is, again, wholly different: it can reformat citations—instantly. I gave it several block notes written in the Chicago style and asked it to convert them to in-text MLA format and it did, perfectly. I then had it convert some other in-text citations to Chicago style footnotes and then generate a Chicago style bibliography from those references.
Finally, you can give it partial references (last name, short title) and it generates the full citation, perfect each time, at least with the examples I tried. It also advises you to double-check the publication information. This will be especially useful.
3. It Writes Better, Longer Texts
Just on Monday, I explained that ChatGPT is not a real threat to academic integrity in most contexts, given its tendency to generate vague C- and B-level papers (albeit without citations) that are far too short to satisfy the length requirements of most university essay assignments. This was largely due to a lack of context: as I outlined at the start, through the API the Turbo model could only “remember” and generate about 3,000 words at a time, including the initial prompt and answer. In ChatGPT, it could only return around 800 words of text.
GPT-4 is capable of “remembering” and generating more than ten times as much context, although the ChatGPT deployment of the model does not yet allow you to use that full amount. Even so, its writing powers have improved dramatically. It generated a 1,600-word paper on fur trade marriages in three blocks of text, including footnotes and a bibliography. Yes, it now generates citations for its own work when requested. Its book reviews, too, are much better. Earlier versions had a tendency to hallucinate, especially with lesser-known books, but I don’t see that same tendency in GPT-4’s responses.
These examples obviously have implications for academic integrity, but let’s set that aside for a minute. In theory, the new model’s API will be able to process and write 25,000 words at a time. That is enough to provide the model with 15,000 words of notes and ask it to generate a chapter in response. For context, the cost of using GPT-4’s API to generate a 100,000-word book will be under $8.00. I am not suggesting that anyone would want to read an AI-generated book. The point is: writing several years’ worth of human-generated text now costs the equivalent of just over half an hour of minimum-wage work in Ontario.
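For the curious, that figure is easy to sanity-check. The sketch below is a rough back-of-the-envelope calculation, not anything from OpenAI’s own documentation: it assumes the launch price of $0.06 per 1,000 completion tokens for the standard GPT-4 model and the common rule of thumb that one token is about 0.75 English words.

```python
# Back-of-the-envelope cost of generating a 100,000-word book with GPT-4.
# Assumptions (mine, not the post's): $0.06 per 1,000 completion tokens
# (GPT-4 launch pricing for the 8K-context model) and ~0.75 words per token.

BOOK_WORDS = 100_000
WORDS_PER_TOKEN = 0.75                # rough rule of thumb for English text
USD_PER_1K_COMPLETION_TOKENS = 0.06   # assumed launch pricing

tokens_needed = BOOK_WORDS / WORDS_PER_TOKEN
cost_usd = tokens_needed / 1_000 * USD_PER_1K_COMPLETION_TOKENS

print(f"{tokens_needed:,.0f} tokens -> ${cost_usd:.2f}")  # about $8.00
```

At Ontario’s 2023 minimum wage of $15.50 an hour, $8.00 works out to roughly half an hour of work, which is where the comparison above comes from; prompt tokens and retries would push the real cost somewhat higher.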
What this means is beyond me. Whereas it was relatively easy to see how we might integrate a GPT-3.5-level technology into the classroom, as this technology continues to evolve, I am less sure of the implications. Neither is OpenAI: they are actively looking for researchers to use the API to figure out what they can do with it.
“Writing several years’ worth of human-generated text is now equivalent to just over half an hour of minimum-wage work.”
4. It Is More Factual
As impressive as it is, ChatGPT Turbo does best on well-known topics, even if that makes its answers a bit derivative and vague. When it wanders too far from the well-beaten path, it can start to hallucinate, sometimes badly.
For example, when I asked the Turbo engine to tell me about Robert Henry, an 18th-century Albany merchant, I got five paragraphs of complete fabrication. That is not surprising, as Robert Henry is an obscure figure, a minor businessman of interest to me only because he was the uncle of Alexander Henry the Elder, the subject of a biography I’m writing. When the 3.5 Turbo version of ChatGPT doesn’t know the answer, it makes one up.
Ask GPT-4 the same question and it will rephrase your question back to you, tell you a bit about the importance of Albany in the 18th-century fur trade, and, when pressed, admit that Robert Henry is an obscure figure about whom it has no other information. That’s not disappointing; it’s exactly what it should do.
Conclusion
I have to admit that GPT-4 has left me a bit shaken. The technology is advancing at breakneck speed, and things that I thought were years away are possible now. I was skeptical that the next-generation model would be exponentially more capable in important areas, but it seems that it is. The improvements in context size alone are revolutionary, and that is without even considering its multi-modal capabilities. GPT-4 will be able to read text without OCR. Let that sink in.
Many of the advances with GPT-4 are subtle, but in the same way that the differences between a B and an A paper are subtle. And this is what makes generative AI so disruptive: it is advancing faster than most people can comprehend.