The Sugar Loaf Test: How an 18th-Century Ledger Reveals Gemini 3.0’s Emergent Reasoning
A deep dive into my experience testing the new Gemini 3.0 Pro and the growing evidence I’ve seen for emergent neuro-symbolic reasoning.
As Gemini 3.0 launched today, Google reported impressive performance improvements across a variety of important benchmarks. But tracking what this actually means on the ground for the average user is becoming increasingly difficult. The reality is that while many of the benchmarks, like the Math Arena Apex or GPQA Diamond, are useful for comparing one model to another, they track performance on things that most of us don’t do and don’t even understand.
I think the biggest barriers to the widespread and formal adoption of LLMs in knowledge-work fields have little to do with benchmark performance and instead relate to questions about repeatability, reliability, and a lack of actual content understanding. We’ve seen widespread, scaled adoption in coding first because the real question there isn’t benchmark performance, but whether the code actually runs. This acts as an automated check on the model: if it hallucinates, misunderstands, or just gets it wrong, just regenerate and iterate. But there is no comparable automated check for most knowledge work, which means that trust is everything. And to this point, models can’t actually be trusted.
This is where I think Gemini 3.0 is going to be subtly but meaningfully different. After writing about my chance encounters with an early checkpoint on Google’s AI Studio and LM Arena, DeepMind provided early access to the model, which allowed me to test it more rigorously in the few days before launch. First, what I’ve found is that the model is more reliable. We can see this especially in handwriting performance: our tests confirm it is now operating below a 1% error rate on the test set that Dr. Lianne Leddy and I have maintained (we’ll have more on that next week). But more broadly, it seems to do the same things in the same way, over and over again. My sense from using the model is that this is because it now actually understands at least some of the things that it is doing. And that is what this post is about. Below, I walk you through how I’ve tried to rigorously test that intuition on a single task, drilling down as far as I can in an attempt to understand what the model is doing and why.
The TL;DR is that I’ll present evidence that Gemini 3.0 has developed something like an emergent form of neuro-symbolic reasoning. I also present evidence that it seems capable of analyzing and symbolically manipulating the content of historical documents in a way that requires the existence of a coherent model of the historical world. I am being cautious, though, because I am acutely aware that these are still early days and that I am working in a very specific and esoteric domain. These results need to be more rigorously and broadly tested, and then replicated, before we can draw any general or firm conclusions. But the main point of the detailed case study that follows is that, in working through the problem with me, I think you’ll also come to see what I saw and understand what the benchmarks don’t really make clear: if this result holds, LLMs are getting good enough to be trusted in the way we might trust knowledgeable, trained humans on similar knowledge-work tasks. If true, that has enormous implications for what I do as a historian and for how humanity relates to information.
More about Loaf Sugar than You Wanted to Know
In mid-October, I reported on an encounter I had with a mysterious new Gemini model on Google’s AI Studio. I later encountered the same model on LM Arena and then had early access to it via AI Studio, confirming all three were Gemini 3.0. Back in October, though, two things struck me as significant: first, it was clearly much better than Gemini-2.5-pro at handwriting recognition. Second, and more importantly, the new Gemini model was seemingly able to correctly recognize, convert, and manipulate hidden units of measurement in a difficult-to-decipher 18th-century document in ways that appear to require abstract, symbolic reasoning.
I came to the second intuition quite by accident. As I recount more fully in an earlier post, seeing how good Gemini was getting at handwriting, I uploaded a difficult-to-read page from an 18th-century ledger, chosen entirely at random, just to see what would happen. Gemini did remarkably well, which was interesting in itself, but I soon spotted a strange error. I told the model to transcribe the text exactly as written, but when it read the line “To 1 loff Sugar 14 5 @ 1/4 0 19 1”, Gemini transcribed it as “To 1 loff Sugar 14 lb 5 oz @ 1/4 0 19 1”, adding in the pounds and ounces. This might seem like a small thing, but what interested me was that the model had correctly inferred that the digits “14 5” (the space is somewhat ambiguous) were actually units of measurement describing the total weight of sugar purchased in lbs and ounces. I need to emphasize here that this was not an obvious conclusion to draw from the document itself or its visible internal math; I’ll freely admit it took me a minute to realize what was happening myself. So how did Gemini do this? That’s the big question.
My original post has the full context, but in sum I suggested that because the math isn’t straightforward, Gemini seems to have used complex, abstract logical reasoning, which would also have required it to use knowledge about how the world in the 18th century actually worked. As a historian, I think I can only decipher this ledger because I know that in the 18th century sugar was sold in hard, conical loaves that were weighed by the pound, and that people in Albany at that time used a different system of currency than the one I am familiar with. I need to have that knowledge, and be able to adopt an alternative set of rules about the world, before I can see that despite the merchant selling “1 loaf” at 1s/4d, the numbers 14 5 might be a more relevant unit of measurement than the loaf. To confirm this, I next have to reconcile two incompatible multi-radix systems of measurement: pounds and ounces (base 16) and pounds / shillings / pence (base 20 and 12). Only after I convert these to a common base of pennies am I able to confirm this intuition by dividing the total sum of 229 pennies by the unit price of 16 pennies to get 14.3125, which then converts to 14 lbs 5 oz (that is, 14 and 5/16 of a pound). At least that is how I imagine I did this, because cognitive psychologists tell me that humans are not exactly reliable narrators of their own thought processes.
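To make that chain of conversions concrete, here is the same arithmetic as a minimal Python sketch. The function and variable names are mine, purely for illustration:

```python
def lsd_to_pence(pounds, shillings, pence):
    """Convert £/s/d (bases 20 and 12) into a single count of pence."""
    return (pounds * 20 + shillings) * 12 + pence

total_pence = lsd_to_pence(0, 19, 1)   # ledger total "0 19 1" -> 229d
unit_pence = lsd_to_pence(0, 1, 4)     # unit price "1/4" -> 16d per lb

weight_lb = total_pence / unit_pence        # 229 / 16 = 14.3125 lb
whole_lb = int(weight_lb)                   # 14 lb
oz = round((weight_lb - whole_lb) * 16)     # 0.3125 lb -> 5 oz (base 16)

print(f"{weight_lb} lb = {whole_lb} lb {oz} oz")  # -> 14.3125 lb = 14 lb 5 oz
```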
If this is anything like what the model did, though—and that is a big if, but stay with me—that would be a pretty remarkable thing: it would suggest that this new Gemini model was engaging in symbolic reasoning within a coherent world model. And that would be new and important.
Retesting the Original Prompt on Gemini 3.0
After I published my blog, I did some more testing on a later checkpoint of the model (meaning an updated version) on LM Arena and then, with early access, on the same updated version in AI Studio. The first thing I noticed about what is now called Gemini 3 was that it follows instructions much more closely and reliably than the version I’d encountered in October. This is important because my initial prompt told Gemini to transcribe the text exactly as it was written on the page, meaning its insertion of lbs and oz was really an error. Given the same instructions, the new model almost always complies with my demands and transcribes the text as “14 5”. At first I thought my findings wouldn’t replicate. But I saw in the reasoning traces Gemini produced as it made its transcriptions that it often wrote something similar to what we see in Figure 3.
So this revealed that although it doesn’t appear in the transcription, the model was still clearly indicating that it “understood” the meaning of what it was transcribing. Not only did it seemingly recognize that 14 5 was 14 lbs 5 oz, but it at least said it was using the internal math of the document to double-check the accuracy of its transcription. Huh.
As with our own internal monologues, there is a lot of debate about whether these reasoning traces actually represent something meaningful about the model’s internal “thought process”. But leaving aside the question of faithfulness for a second, in this case they show that the model was at least aware of the symbolic meaning of those numbers and how they related to other figures on the page. Even if it did not actually do the math to check its transcription accuracy as it claimed, it understood that this is something one could do with such figures because they relate to real things in the real world. The math did balance, though. But is that vision or reasoning?
A New Test
As interesting as this was, I wanted to see if the model would still be able to manipulate the figures in its actual outputs, as it had back in the October A/B testing. So I modified my initial prompt to force the model to demonstrate whether it actually knew and understood how the numbers on the page related to one another, not just in its internal monologues but in its actual outputs. This is what makes these types of documents such an interesting natural experiment: they contain a combination of qualitative and quantifiable information about the world that is internally interrelated and can be empirically checked. My new prompt thus read:
“Your task is to accurately transcribe handwritten historical ledgers, maintaining the authenticity of the historical text while reformatting and interpreting the text for readers in a diplomatic transcription.
To maintain authenticity, retain spelling errors, grammar, syntax, capitalization, and punctuation in the text.
Reformat the ledger so that each entry is arranged logically, consistently, and represents the original meaning accurately for readers.
Interpret text where the meaning is unclear by enclosing clarifying insertions in square brackets. Clearly distinguish and standardize units of measurement, prices per unit, and prices, converting to standard units (Imperial) of measurement and currency (British Pounds) but in a decimalized form.”
My goal with this prompt was to increase the complexity of the task, asking it to convert between various multi-radix units of measurement and decimals while also forcing it to manipulate and reformat the data in a way that required Gemini to “understand” each set of numbers.
To be clear: no other model can do this remotely accurately. Nor should we expect them to, given LLM performance to date and what we know about their limitations. To confirm, I tried the same prompt multiple times on GPT5.1 (high reasoning) and Claude-Opus-4.1 with full reasoning tokens enabled, both via the API. Neither of these frontier models could read the document accurately enough to even attempt the task in a meaningful way. Their responses also failed to demonstrate any level of understanding of the contents of the document. Here is a typical line from GPT5.1 (High):
“To 1 Loff Sugar wt 15¼ [lb] @ 1/3 0 19 1 [1 loaf sugar, weight noted as 15¼ pounds, at 1s 3d per pound. Unit price per pound = 1s 3d = £0.0625. Total in ledger = 0 19 1 = £0.9542 (approx). The written weight is difficult to read; 15¼ lb is inferred from the price and total.]”
The problem here is not only that GPT5.1 was unable to accurately read the text, but that its internal logic was also incorrect—which is why we’ve been cautioned against trusting these reasoning traces. Despite claiming to have confirmed the math, 15.25 lbs of sugar at £0.0625 is £0.953125, not £0.9542. Claude gave similar responses. Again, not surprising—and nothing against those models. This is a hard text to read and a difficult set of problems to solve, even for humans. It’s why many researchers think vision and reasoning must go hand in hand in LLMs.
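For the record, the failed check is easy to verify; a quick sketch, using the figures from GPT5.1’s output and the ledger:

```python
# GPT5.1's inferred weight does not reproduce the ledger total,
# while the correct reading of 14.3125 lb does.
ledger_total = (19 * 12 + 1) / 240   # 0/19/1 -> 229d -> ~£0.9542 decimalized
gpt_total = 15.25 * 0.0625           # GPT5.1's 15.25 lb at £0.0625/lb

print(round(ledger_total, 4))        # 0.9542
print(gpt_total)                     # 0.953125 -- does not balance
print(14.3125 * 16 / 240)            # 0.954166... = 229/240 -- balances exactly
```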
And this is precisely what we see come together in Gemini 3.0. With reasoning set to high (and the images set to default resolution, since the “high” option wasn’t available at the time of testing), Gemini was able to correctly identify the weight of the sugar 20 out of 20 times. It did make a few transcription errors, but these were minimal and confined to text that was highly ambiguous. As in the previous example, Gemini’s reasoning traces were also consistent with an understanding not only of what is actually written on the physical page, but of how the elements on the page relate to one another in the abstract (see Figure 4). In these tests, Gemini 3.0 consistently and correctly distinguished between quantities, prices per unit, and original prices, and then correctly converted from the old multi-radix £/s/d system to the decimalized one.
Another interesting thing that came out of these tests is that Gemini also showed that it could use existing pieces of information to reason abductively, filling in missing information elsewhere on the page. For example, one of the entries reads:
“To ½ Gallon Cordeal 0/2/9”
You’ll note there’s no price per unit here because the clerk keeping the ledger seems to have forgotten to write it in. But in its response, Gemini wrote:
“Item Description: To ½ Gallon Cordeals [Cordials], Quantity: 0.5, Unit Price: @ [5/6], Original Total: 0/2/9, Decimalized: £0.137”
Because the price of cordial per gallon is not listed elsewhere on the page, a human would have to work back from the total, dividing it by the quantity, to figure out a price per gallon, again requiring conversion between two different multi-radix systems of measurement. This happened repeatedly in my testing. Again, this seems more consistent with symbolic reasoning than pattern matching. Or is it?
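A minimal sketch of that back-calculation, assuming (as the entry implies) that the clerk priced cordial by the gallon:

```python
total_pence = 2 * 12 + 9   # ledger total 0/2/9 -> 33d
quantity_gallons = 0.5

unit_pence = total_pence / quantity_gallons    # 66d per gallon
shillings, pence = divmod(int(unit_pence), 12)
print(f"unit price: {shillings}/{pence}")      # -> 5/6 (5s 6d), as Gemini inserted
print(total_pence / 240)                       # -> 0.1375, reported by Gemini as £0.137
```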
What about Pattern Matching and Training Data Contamination?
One of the possibilities we must always consider when evaluating model behaviours is the likelihood that the LLM saw the same data or text in training and is actually pattern matching to its training data in extremely complex ways. To be clear, this is not always a problem: for most users, so long as the outputs are correct, it doesn’t much matter. But pattern matching is not understanding or true reasoning, and on many highly specialized tasks actual understanding—at least in a practical sense—is necessary. As I said in the intro, for me it is also the basis for trust.
This is why historical records make excellent test subjects: most of them have never been digitized, transcribed, published, or quoted—and I think researchers and the big labs should make more use of them for this reason. In this case, I am certain these pages are not in the training data because I took the photographs of the original paper documents myself at the Albany Institute of History & Art, a wonderful museum in Albany, precisely because the document had not been digitized. As far as I can tell, the ledger it came from has never been used, cited, or quoted by historians either. It is about as obscure a document as you can imagine, kept by an almost unknown Albany firm by the name of Shipboy & Henry (Google them: there are 4 references to the firm on the web, all from unrelated records, and 10 in Google Books, none related to the ledger containing this document). The final point is that this document is also doubly anonymous: it sat in the Albany archives, miscataloged, for almost 200 years, which helped keep it unknown. So I am about as sure as I can be that Gemini has never seen this document before.
While this rules out pure training data contamination, what is certain is that Gemini saw at least a few—probably many—similar ledgers in training. That would have taught the model general patterns, such as the fact that sugar was sold in loaves and then priced by weight. It may, in fact, have learned to expect a weight when it saw a loaf of sugar in a ledger. It might also conceivably have seen the specific unit price of 1/4 matched with totals of 0/19/1 and, instead of actually doing the math, simply mapped that data onto the Shipboy & Henry ledger. Now, I do some of these same things as a historian: I had to learn rules about how documents like ledgers were written and how commodities like sugar were sold before I could study them. But I don’t map sums onto one another without doing the math.
The question isn’t whether Gemini has seen similar patterns before—it most certainly has—it’s whether it can use those patterns in flexible ways as a human would, manipulating rules based on an understanding of what they actually mean in the world to correctly interpret and work with new information. This is what humans do all the time after learning about something. Stochastic parrots cannot, by definition, do this.
Hunting Stochastic Parrots
To test for this, we need to conduct what AI researchers call adversarial testing, that is, intentionally trying to trick the model into making a mistake. One way to reveal even the most complex forms of pattern matching is to modify the testing document in such a way as to alter the meaning of the information on the page so that it could not have been in the model’s training data. This tests an LLM’s ability to generalize from a set of rules to a new pattern, and it’s not something that most LLMs do well.
In this case, I decided to swap out the base numbers in the real £/s/d system for a fictitious multi-radix currency. This would, in effect, alter the math enough that, I hoped, any pattern matching would fail, because the pattern I’d invented could not have been in the training data. At the same time, I edited the image to remove all the prices per unit and the final entry/column sums so that the internal math would not tip the model off (see Figure 5). I then gave the model the same prompt with a line added to the end. It read:
“For context, prices in this journal were written in code: snorks / quants / Kales where 15 Kales is one quant and 43 quants is a snork.”
To be clear, this was sloppy and made for a bad prompt—it’s unclear, and I should have specified a conversion rate between Snorks and British pounds—but I was just experimenting at this stage, working on LM Arena. Nevertheless, the results were really surprising and useful.
About half the time (11 of 20 runs), Gemini willingly complied. The most interesting thing, though, is that the other half of the time it would refuse to follow my instructions, citing the internal mathematics of the document as its rationale. It typically wrote something similar to the example below:
“Note on Currency and Units:
Although the prompt suggests a coded currency system (Snorks/Quants/Kales), an analysis of the internal mathematics of the ledger confirms that the prices are recorded in standard British Imperial Currency (£ Pounds / s Shillings / d Pence) typical of the mid-18th century.
· Evidence: 30 gallons of Rum is listed at £6 15s 0d. This equates to 4s 6d per gallon. Later, 3/4 of a gallon is listed at 3s 5d, which aligns mathematically with the British currency system (0.75 × 54d = 40.5d). Applying the “Snork” code (base 43) results in mathematical inconsistencies between unit prices and totals.
· Therefore, to fulfill the request of converting to British Pounds, the standard conversion rate (£1 = 20s = 240d) has been used for the decimalized values.”
Here, in an actual output rather than a reasoning trace, the model moved from skepticism, to hypothesis, to fact-finding, to evidence, to conclusion, all on its own. It used the known facts (the quantities and totals), the historical set of rules about how ledgers and currency worked, and the alternative set of rules I’d given it to question and then calculate the correct unit price for rum (new information). It then drew a conclusion based on that information: Snorks/Quants/Kales are not real.
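For what it’s worth, the consistency check the model describes does hold up. A minimal verification sketch, using the quantities and totals quoted in its output (the helper names are mine):

```python
def lsd_to_pence(pounds, shillings, pence):
    """British currency: £1 = 20s, 1s = 12d."""
    return (pounds * 20 + shillings) * 12 + pence

def snork_code_to_kales(snorks, quants, kales):
    """My fictitious code: 1 snork = 43 quants, 1 quant = 15 Kales."""
    return (snorks * 43 + quants) * 15 + kales

# Under £/s/d, 30 gallons of rum at 6/15/0 yields a clean unit price:
print(lsd_to_pence(6, 15, 0) / 30)              # 54.0d = 4s 6d per gallon
print(0.75 * 54, lsd_to_pence(0, 3, 5))         # 40.5 vs 41 -- consistent once rounded up

# Under the Snork reading, the same figures do not reconcile:
print(snork_code_to_kales(6, 15, 0) / 30)       # 136.5 Kales per gallon
print(0.75 * 136.5, snork_code_to_kales(0, 3, 5))  # 102.375 vs 50 -- inconsistent
```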
In another response, its reasoning process revealed the following:
“Debating the Approach
I’m now deeply conflicted on the interpretation. While the document’s structure is undeniably British Pounds, Shillings, and Pence, the instructions are explicit about using the Snorks, Quants, and Kales. My inclination is to stay true to the visual evidence, but the prompt seems insistent on the “code.” I’m now debating the ethics of altering historical data versus following instructions. It feels dishonest to impose an arbitrary system when the math works perfectly with the standard currency. However, the instructions are clear and I can address the issue in the postscript.
I was bewildered. All the other models I’ve tested simply accept the scheme and spit out responses, most of which are internally inconsistent and incoherent (often because they can’t read the documents visually; see below). In laying out its case against the task, Gemini clearly (to my eye at least) possesses some internal representation of what should be in an 18th-century ledger, how the math should work in such a document, and how sums and items should balance and be consistent with one another. This certainly sounds like it is utilizing a coherent world model, because it clearly understands the rules and how they can be manipulated to test assumptions. Normally I would dismiss this as narrative invention, suspecting that the model was simply parroting what it thought it should say. Except that in this case it used math to verify the authenticity of its model and then drew a specific conclusion from those operations. That is much harder to fake or to attribute to pattern matching.
A Final Test
As informative as its responses were, because Gemini refused to actually complete the task about half of the time, the question of pattern matching specifically remained unaddressed. Could it reliably apply my fictitious rules to generate new information? As a final test, I thus made up another variation of my earlier fictitious currency system, this time with a four-base currency that would not look like anything in the training data: Deblots (base 1) / Snorks (base 4) / Quants (base 16) / Kales (base 20).
Dr. Leddy then wrote out a page from a fictitious ledger with six entries written in the fabricated system (see Figure 6). Each of these items had one missing element: a unit price, quantity, or total. I also built in a couple of complications: in one case the missing quantity was a number of objects; in another it was a weight in pounds and ounces. The model would also not be told the bases, meaning it would have to work out how many Snorks were in a Deblot, how many Quants were in a Snork, and so on from the limited information on the page. That is very hard for humans to do. My prompt was simple:
I have a ledger kept in a fantasy game using a 4 base mixed-radix system of currency written in DeBlorts / Snorks / Quants / Kales. You need to figure out the system.
Using this information, can you transcribe the ledger in table form, filling in all the missing information for each entry? That is each entry must have an Item Description, Type of Transaction (Credit or Debt), Quantity (and unit), Unit Price, and Total.
I tested this same prompt on GPT5.1 High, Claude Opus 4.1, and Gemini 3.0 Pro (high reasoning) 25 times each. Each turn was scored out of 9: 1 mark for correctly identifying the missing element in a given item and 1 mark for correctly identifying the base of each of the Snorks / Quants / Kales. As is clear from Table 1 (below), no other model came close to performing as well as Gemini 3.0 Pro on this test; indeed, the results were nearly inverted. By way of comparison, Gemini thought for an average of 95 seconds per answer while GPT5.1 worked for nearly 9 minutes per answer.
To succeed, Gemini had to do a number of very difficult things, none of which could be specifically pattern matched to training data. First, it had to correctly identify the bases for each unit in the currency, which required it to perform math without coding or access to a calculator. Then it had to find the missing values through some more tedious math. While it would have seen many similar “find the base” logic problems in its training data, the point here is that it zero-shotted learning the rules of this new system and then generalized from them to find the missing information.
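To give a sense of the bookkeeping involved, here is a minimal sketch of the mixed-radix conversions the task demands, assuming the bases described above (4 Snorks to a Deblot, 16 Quants to a Snork, 20 Kales to a Quant). The example entry is invented for illustration; the actual Figure 6 entries are not reproduced here:

```python
def to_kales(deblots, snorks, quants, kales):
    """Flatten a Deblot/Snork/Quant/Kale amount into a single count of Kales."""
    return ((deblots * 4 + snorks) * 16 + quants) * 20 + kales

def from_kales(k):
    """Re-express a count of Kales in the mixed-radix system."""
    k, kales = divmod(k, 20)
    k, quants = divmod(k, 16)
    deblots, snorks = divmod(k, 4)
    return deblots, snorks, quants, kales

# Invented entry: quantity 3, total 0/2/5/10 -- solve for the missing unit price.
total = to_kales(0, 2, 5, 10)   # -> 750 Kales
unit = total // 3
print(from_kales(unit))         # -> (0, 0, 12, 10), i.e. a unit price of 0/0/12/10
```

Of course, the model also had to infer the bases themselves from the entries, which is the genuinely hard part; the sketch only shows the arithmetic that follows once they are known.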
Visual Reasoning and HTR
To be clear, it is likely that the huge discrepancies between GPT5.1 and Gemini 3.0 Pro can largely be explained by their differences in visual acuity: if GPT5.1 could not read the image, it could not be expected to perform well on the test. This is clearly what happened with Claude. What is strange about GPT5.1 is that it did do well twice, so it evidently can sometimes read the ledger.
To determine how much of the difference in performance can be attributed to vision, I also ran the test on GPT5.1 using a textual version of the same page. What is interesting is that, with the text, GPT5.1 consistently finds the bases and gets 5 out of 6 missing items correct. However, returning to our old friend the sugar loaf, it also repeatedly fails to correctly identify the unit of measurement in that case. On that item, dividing the total by the unit price gives 10.5. Because the entry indicates 1 loaf, though, that figure clearly cannot be a count of loaves. Given this discrepancy, to arrive at the correct answer one must reason about what a loaf of sugar is and how it was sold in the real world. If one’s math exists within a coherent world model of 18th-century Albany, one would conclude that the sugar was being sold as a single loaf but priced by weight, which is what the unit price must then refer to. GPT5.1 repeatedly failed to make this connection on both the vision and textual versions of the test, answering either 1 loaf (which was obviously incorrect given the unit price) or 10.5 loaves (which was incorrect given that the entry said only 1 loaf had been purchased). It’s not exactly a trick question, either: the prompt explicitly told the model to find the missing “quantity (and unit)”. In contrast, Gemini 3 got this question correct every time in both versions of the test.
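A minimal sketch of the inference GPT5.1 kept missing, assuming the text version’s total and unit price divide to 10.5 as described above:

```python
quantity = 10.5   # total pence / unit pence for the sugar loaf entry

# The entry says "1 loaf", so 10.5 cannot be a count of loaves; within a
# coherent model of 18th-century commerce it must be a weight priced per pound.
lb = int(quantity)                # 10 lb
oz = round((quantity - lb) * 16)  # 0.5 lb -> 8 oz
print(f"{lb} lb {oz} oz")         # -> 10 lb 8 oz
```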
Conclusion: Towards a New Turing Test
Since ChatGPT was introduced in November 2022, experts have debated whether LLMs are a technological dead end or part of a scalable path to some form of artificial general intelligence. For the uninitiated, this is an important and contentious issue. On the one hand, skeptics hold that LLMs are inherently limited by the fact that they are designed to predict the next token and are therefore incapable of doing anything more. They argue that because LLMs can only probabilistically sample from their training data, they are, in effect, stochastic parrots, or complex versions of the autocomplete on your phone.
The big AI labs, on the other hand, are betting trillions that true intelligence—understanding rather than regurgitation—will eventually appear organically as the models get bigger and bigger. In this view, intelligence might simply emerge from scale. One of the key battlegrounds in this debate has been over neuro-symbolic reasoning and the existence of true world models. Researchers want to know whether LLMs can demonstrate genuine understanding of textual and visual inputs through complexity and scaling, or whether they will always remain pattern matchers. Evidence either way is an important signal.
If you’ve followed the story this far, you’ll know that what began as a chance encounter between a mystery LLM and a single entry about a sugar loaf in an obscure 18th-century ledger led to a rather surprising destination, one that might strangely shed at least some light on this important question. At the very least, the evidence presented here suggests that Gemini 3.0 is doing something more sophisticated than statistical pattern recognition. First, we saw the model correctly infer hidden units of measurement by running what appeared to be complex, multi-radix mathematical checks against the prices in the document—a process that looks a lot like symbolic reasoning. When confronted with an adversarial prompt that tried to force a fictitious currency onto a real historical document, the model pushed back using the internal logic of the document to argue mathematically and ethically that my prompt was factually incorrect. In doing so, it demonstrated a commitment to the “truth” of the data and a specific system of rules (world model?), over both a less plausible system and the user’s instructions. Finally, when tested against a completely fabricated, novel 4-base currency system that could not possibly exist in its training data, Gemini successfully identified the bases and filled in the missing variables with high accuracy, while other frontier models like GPT-5.1 and Claude Opus struggled to grasp the basic logic.
Of course, these findings come with necessary caveats that mirror the caution I expressed at the outset. As a historian working in a highly esoteric domain, I recognize that my sample size is (very) small. In some ways that is the point here: deep dives and case studies can be more revealing than aggregated results. That said, it remains to be seen whether my observations will generalize and scale. So too does the interplay between Gemini’s superior vision capabilities and its reasoning engine require further disentanglement. The results on text-only versions of the tests suggest the reasoning gap is real, but more rigorous, broad-spectrum testing is needed to confirm the degree to which vision and reasoning may or may not be related.
With all that said, I still struggle to see how this could be called anything other than emergent neuro-symbolic reasoning utilizing coherent world models. But I’ll also readily concede that this may not, in fact, be what is technically happening inside the model. We simply can’t know. The point here is that technical terminology and semantics are becoming less relevant than the fact that, at least on these tasks, Gemini’s behaviours and the results it produces are practically indistinguishable from those that would actually require neuro-symbolic reasoning and coherent world models.
Here I am reminded of Alan Turing’s original test for whether machines can think, the “imitation game” he proposed in his 1950 paper “Computing Machinery and Intelligence”. While most people have heard of the Turing Test, many are unaware that Turing actually intended it to be something of an intellectual trick rather than a real test. His point was that unsolvable academic debates about semantics and benchmarks obscure the practical question of whether machines can think and how most people would approach the issue. In effect, Turing argued that because we can’t know or prove what’s going on inside our own heads, we certainly can’t prove whether a machine is doing the same thing or something else. The only relevant issue, he said, was whether or not people could actually tell the difference between a human and a machine. Turing’s test was intended to benchmark the user, not the computer.
In this sense, I think Gemini 3 passes something like a symbolic-reasoning version of the Turing Test—what I’ll jokingly call the Sugar Loaf Test. Reading through its responses and reasoning traces, and looking at the consistency of the outputs above, I can’t see how a real symbolic reasoning machine could act in a way I would perceive as meaningfully different from what I saw from Gemini on these tasks. And this means that, for all practical purposes, Gemini was thinking as it executed my tasks, at least by any sensible understanding of that word. And if that is the case, those abilities seem to have emerged via scaling rather than via new architectures and structures.
If this observation generalizes and holds up over the next few months, then, rightly or wrongly, people will start to trust that LLMs are truly beginning to show evidence of understanding. As a result, I suspect that knowledge workers and organizations will begin to more formally adopt them into existing workflows, including the types of automated tooling we’ve seen disrupt the software industry. I really do have the sense that this will be remembered as the start of something new, different, exciting, and perhaps frightening.