40 Comments
Niels

Thank you for this detailed and interesting article. As a sceptic, I have to chip in a possible, much less exciting alternative: To my untrained eye (maybe I imagine this now after the explanation though) it looks like there is slightly more kerning between 4 and 5 than between 1 and 4. I would expect the next generation of Gemini models to improve on visual accuracy, so it might be among the few models to be able to pick this subtlety up. However, if it picks it up, it would directly read "14 5" and require much less reasoning to decide for the 14lb 5oz interpretation. Either way, this is remarkable!

Mark Humphries

I think vision improvements are certainly a big part of what we are seeing happen here. That said, the question is really whether you can predict 14 lbs 5 oz from 0/19/1 any more than you can predict 145 from 0/19/1. The two are different mixed-radix systems. To realize that 14 5 or 145 means lbs and ozs seems to require an understanding of the ledger contents: that 0/19/1 is a total for goods purchased, whether this is divisible by a unit price, and so on. That is why it's such an interesting natural experiment.
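
To make the two systems concrete, a minimal sketch (the code and function names are illustrative; only 0/19/1, 14 lbs 5 oz, and 145 come from the discussion above):

```python
# Illustrative only: the two mixed-radix systems in play.
# Currency: 1 pound (£) = 20 shillings (s), 1 shilling = 12 pence (d).
# Weight:   1 pound (lb) = 16 ounces (oz).

def lsd_to_pence(pounds: int, shillings: int, pence: int) -> int:
    """Flatten a £/s/d amount into pence."""
    return pounds * 240 + shillings * 12 + pence

def weight_to_oz(lbs: int, oz: int) -> int:
    """Flatten a lbs/oz weight into ounces."""
    return lbs * 16 + oz

print(lsd_to_pence(0, 19, 1))   # 229 pence: the written total 0/19/1
print(weight_to_oz(14, 5))      # 229 oz: the model's reading, 14 lbs 5 oz
print(weight_to_oz(145, 0))     # 2320 oz: the naive reading, 145 lbs
```

That the first two figures coincide (229 pence, 229 oz) is what a unit price of a penny an ounce (1s 4d per lb) would produce; that price is a back-calculated assumption, not something quoted in this thread.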

Mark Humphries

I totally get the concern, but I don't think that is happening here, and I would say it is not the same type of problem either. This is a bit different from claiming that the model is solving a novel math problem for which a solution exists. Any historian or person with knowledge of 18th-century accounting would quickly reach the same conclusion; it's not difficult to figure out...but it does require a significant amount of contextual knowledge and the ability to reason across several layers of abstraction. Easy for humans with some training, hard for LLMs. But not the same as solving a novel math problem.

That said, all the requisite knowledge and similar examples would be in the training data from other sources, but I am certain that this particular ledger is not online because I took the photographs in the archives myself from the paper originals precisely because it had not been digitized. Nor have I ever found a reference to it in print or online. It is a very obscure document. If you are interested it is from the Albany Institute for Art and History in Albany, New York. It is officially catalogued as the Harmen van Heusen Daybook but (to add a layer of complexity to this), it was actually mis-catalogued and is really the Shipboy & Henry Daybook. These are all extremely obscure merchants from Albany in the 18th century and almost nothing is written about any of them. I am interested in them for some esoteric historical reasons and simply uploaded this image at random as it was at hand.

TBK
Nov 12

There is a space between the 4 and 5 in 145, so 14 5 is a natural grouping.

Mark Humphries

Thanks for the comment, and yes, I think there is a space between them. But to leap from 14 5 to 14 lbs 5 ozs is the interesting part to me. To this point, most LLMs have been unable to read the actual text. When they do, or even come close to it, they don't infer that those are lbs and ozs without prompting. Obviously a lot more testing is required once the model is released, but at this stage it is an interesting question to pose: how did the model infer these were lbs and ozs?

Thiago

I tried half a dozen documents in AI Studio, but in no instance was I presented with an A/B choice, alas. I'm very curious about it. I still find Gemini and ChatGPT a lot more unreliable than Transkribus and Leo, to the point of being unusable on most of my sources, which are usually much harder to read than your examples. Keep us posted!

Mark Humphries

I haven’t been getting the A/B test either and I wonder if it is done now. In terms of accuracy in general, would you mind sharing your prompts, temperature settings, etc? Send me an example by email and I’d be happy to try.

Daron

Isn’t it possible that the LLM simply hallucinated the *correct* answer from those three digits and the surrounding units of measure?

Since your sample size is so small, I'm not sure how you can infer that all the steps it took with these numbers were actually demonstrated. If you had ten similar results I might think otherwise, but just one makes coincidental correctness seem more likely to me.

Mark Humphries

Thanks for the comment! It’s entirely possible this is a hallucination…and it was certainly a hallucination in the context of my instructions, as I told it to transcribe the text exactly as it appears on the page. At present, what this shows is that this new model seems to be much better at reading handwriting (other models really struggle with pages like this) and that it did something interesting, at least once, that warrants further investigation.

Leo C

Gemini is well known for its leading multimodal (vision, audio, text) capabilities. I would not be surprised if the model was good at recognizing the space in 14 5, hence it did not say it was 145 lbs, etc.

Mark Humphries

Thanks for this. But I think the point is that it recognized the space (which is itself not exactly clear) AND identified 14 as lbs and 5 as ozs. Remember that those amounts exist in a mixed-radix system that is different from the mixed-radix system of the currency being used. So how did it come to add lbs and ozs in the right place beside those numbers?

Connor MacLeod

Very cool. Can't wait until they fully release it.

akash

Are we certain that there is no information online detailing some or all of the details about the ledger referenced in figure 4?

I ask because I recall a few instances on twitter where an LLM was claimed to have solved a problem, but later, someone found a complete solution online. Pretty sure this happened for one of the math olympiad questions; stackoverflow contained a similar question with the solution, so the model likely retrieved the answer from its training data.

(Also, figure 4 is missing.)

Mark Humphries

I am very confident it is not online. But it’s also a different type of problem than a novel math problem. Any human familiar with the context would recognize that this was a unit of measurement; it’s just that LLMs don’t (in my experience). That recognition requires actual reasoning. Now, in this case, how this model got to that answer is unclear. It could be a fluke (as it’s probabilistic), it could be that it has seen enough 18th-century ledgers that this was a probable completion, or it could be an emergent skill. Given that this model was also significantly better at transcription overall, and did something similar elsewhere on the page, it seems significant, but we’ll have to see. Models also change a lot from early checkpoints to release as they are safety-tested and fine-tuned.

akash

Definitely not a fluke and I doubt it has seen that many 18th century ledgers.

I'd guess additional OCR pre- or post-training was a thing, because that is an economically useful task, and this capability is a downstream result of that.

If the Gemini team has a new secret sauce that improved reasoning without domain-specific training, that'd be huge and would change my mind on near-term AI progress!

Either way, very impressive! At the very least this shows that models can now accelerate research in another domain.

laudanum

Thanks for this very interesting account. In my experience, earlier versions of Gemini were already able to do some basic transformations on recognized text. For example, when you presented it with commercial invoices and asked it to output a structured schema, it would sometimes fill in fields not explicitly mentioned on the document by calculating them, e.g. dividing the total of a line item by the number of pieces of that item. However, I have not seen anything close to the level of interpretation you noticed in your example.

Mark Humphries

I agree, and thanks for the comment. The truly weird part here is the unprompted transformation through several layers of abstraction. Models can already convert successfully between £/s/d and lbs and ozs given a unit price, but they have to be prompted to do so. They also can’t do it reliably from images of handwritten text. We’ll have to see if this replicates, though. That it is better on handwriting seems pretty clear to me. The reasoning bit is interesting at present, but only in the sense that it warrants further investigation.

Vivienne Cuff

So what you have is pieces of work in different models to organise and collate together to produce an output, say the transcript of a diary or an account book. You do need something to manage the work, create and run models, and tag format, structure and content. What is needed is a workspace, and it should have tools to allow for analysis and visualisation.

Another issue is that IT managers in organisations will not let anyone take up bandwidth or do anything that would compromise security.

Re the 14 5 example, I suspect that it had been trained in some way to know what pounds and shillings were - you were working with financial information and accounts. I doubt it could infer anything more about what it relates to.

Again it’s horses for courses - some models work and others don’t, because cursive handwriting is not always readable, for different reasons.

It will, imho, always need a human eye and knowledge - to judge, decide and improve the processes and outcomes. There will always be new approaches to try. It is worth experimenting, but for the sake of doing the work you need some stability - a tool that isn’t changing every five minutes.

I might be convinced when it can deal with tabular data - when you can really model, in the case of financial information, the accounting rules and the economic/historical context as well, and have those be inputs into the training.

Mark Humphries

What I find interesting about the 14 lbs 5 oz is not that the model was correct, but that the process by which it arrived at that answer points to a type of reasoning that these models have not really shown unambiguously in my experience.

Yes, Gemini “knows” what £/s/d are because it was trained on pretty much everything ever written and available digitally. That is not surprising. What is surprising is that to do what it did would normally require what we call understanding and symbolic reasoning. That is, to see the text not as a series of meaningless tokens but as representations of things that exist in the world.

So, instead of just transcribing 145, which was the literal text on the page, it seems to have inferred that this was a measurement of sugar. That is a pretty basic meaning to infer. But it then showed understanding in concluding that this number actually represented something unstated: 14 lbs 5 oz and not, say, 145 lbs. To draw that conclusion, because it’s correct but inferred and not written on the page, one needs to take another discrete piece of information (the cost per pound, which is again inferred, as this is expressed only as a fraction) and divide the sum on the page by this number.

The only way to see that 14 lbs 5 oz is, in fact, the result of dividing the sum at the end by the unit price is to recognize that these are symbolic representations of real things that exist in the world. It is unremarkable when a person does this, as it is the type of symbolic reasoning at which we as humans excel. But here, because none of the measurements share a common denominator, the recognition that they are mathematically related is abstracted symbolically.
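
A minimal sketch of that chain, assuming a unit price of 1s 4d (16d) per lb, a figure chosen so the quoted numbers reconcile (the ledger’s own price, written as a fraction, is not reproduced here):

```python
# Hedged sketch of the inference chain described above, not the model's actual procedure.
ASSUMED_PENCE_PER_LB = 16                         # 1s 4d per lb; an assumption, not a quoted entry

def lsd_to_pence(pounds: int, shillings: int, pence: int) -> int:
    """£/s/d is mixed radix: 20 shillings to the pound, 12 pence to the shilling."""
    return pounds * 240 + shillings * 12 + pence

total_pence = lsd_to_pence(0, 19, 1)              # the written total 0/19/1 -> 229d
weight_in_lbs = total_pence / ASSUMED_PENCE_PER_LB    # 14.3125 lbs
lbs, oz = divmod(round(weight_in_lbs * 16), 16)       # back into the base-16 weight system
print(lbs, oz)                                         # 14 5 -> 14 lbs 5 oz, matching the written "14 5"
```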

That seems to be something new.

Hilary

I think it's also useful to note that even if the model did make an inference on the basis of a space between 14 and 5, it did not hallucinate a decimal there for "14.5 lbs" (which would be 14 lbs 8 oz). Given how little context the LLM had with that obscure document, 14.5 lbs would be just as plausible a hallucination as 14 lbs 5 oz, which does suggest that something more is happening under the hood. Definitely going to be interested to see how this holds up with other documents after the model is fully released.
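
Spelling that arithmetic out (purely illustrative):

```python
# The two candidate readings of the written "14 5", expressed in ounces (16 oz to the lb).
print(int(14.5 * 16))    # 232 oz -> 14 lbs 8 oz, the decimal reading ruled out above
print(14 * 16 + 5)       # 229 oz -> 14 lbs 5 oz, the reading the model gave
```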

Donna Dickerson

Great article! I have read in many places that Claude and Gemini are best at transcribing old documents, but found both, as well as Transkribus and ChatGPT, to be utter failures. Today, I used your prompt (slightly edited) with Claude (no subscription) to transcribe a will that I had already transcribed “manually” and the result was garbage. Sentences were jumbled and moved to odd places in the document. The original document is in a secretary hand and relatively easy to read. What am I doing wrong? I would love to get an error rate of less than 50%!

Mark Humphries

If all of those are failing that badly, it might be something like the image resolution or format. You can email me the image and I would be happy to take a look.

Ryan Dore

Great read Mark. The speed of progress is incredible.

Daniel Popescu / ⧉ Pluralisk

Thanks for writing this; your analysis here, resonating with previous work, is truly insightful. Do you foresee this specific reasoning capability becoming broadly accessible, or will it remain a more controlled feature?

Mark Humphries

I expect it will be generally available when the new model is actually released. Also: although this is important, it is an incremental change. The general reasoning ability of the models is already very good.

Nicole Dyer

How exciting. Thank you for sharing the details of your experiment. I have some images from a store ledger in Tennessee in the early 1800s I'd like to try. It's incredible to see Gemini reasoning about the numbers and coming to accurate interpretations!

Peter Olsen-Harbich

Extremely interesting, Mark, as always. I wonder if, when the new model debuts and further testing is possible, it can be convinced to simultaneously reason AND strictly follow the prompt. Impressive and useful as the result is, the model still deviated from its prompted instructions "to maintain the authenticity of the historical text" by transcribing "14 5" as "14 lb 5 oz." Do you think the prompt might be successfully modified to introduce and explain the concept of bracketing in diplomatic transcription? Once the model can do everything you present here, while outputting "14 [lb] 5 [oz]", I think we will have truly crossed the threshold. Thanks!

Mark Humphries

Yes, fully agree. That is the strange thing here, as it is, actually, an error, but one that clarifies an ambiguity in the document. So it is not a correct response. Now that this is possible, it will be interesting to see if we can require amendments and additions in square brackets. In the past, the models have been unable to do this consistently. We’ll see…

edpapenfuse@gmail.com

I wish this had been around in 1976 .... Your work and efforts to keep us informed about advances in AI have been most helpful to me. Keep up the good work. May the good of AI outweigh the evil consequences of misuse and abuse.

Mark Humphries

Thanks! And yes…every time a new use case appears it also has a dark side. The model got smarter but also failed to follow instructions and did its own thing.
