Cross-LLM verification catches 76% of transcription errors, but it only works if we build systems that augment human abilities rather than replace them.
I noticed in this work you still relied on Gemini 3 Pro. Have you evaluated whether the latest models (3.1 Pro or the presumably forthcoming 3.5 API) reduce the CER/WER rates, or whether 3 Pro Preview is some kind of sweet spot that isn't too smart and isn't too dumb? The need to use low thinking levels and low temperature to achieve the best results sort of implies that this might be the case. At the very least it's worth investigating, since as of my comment 3.0 Pro Preview has been shut down.
Edit: Nvm. I see on closer reading that you only kept 3 Pro Preview as a baseline and used 3.5 Flash in this follow-up. One thing I don't think you mentioned was whether 3.5 Flash alone (not overlaid with Opus) produced a meaningfully better CER/WER rate. Was that something you investigated?
Thanks for this. Lots of good data to dig into here. So Gemini 3.1 Pro is slightly worse in terms of CER/WER than 3 Pro (which isn’t available anymore, as you point out). I used the 3 Pro transcriptions/numbers here to keep the numbers consistent with my earlier posts. These regressions are pretty common in iterative model releases: when the labs tweak one thing, it affects others. Opus 4.7 was significantly better than 4.6 and now 4.8 is worse. They obviously aren’t prioritizing historical handwriting, so they aren’t optimizing for it.
On our strictest measurement, Gemini 3.1 Pro scores a CER of 2.30% and a WER of 5.59%. On the modified measure it is 0.85% CER and 1.91% WER. That is with temperature at 0.3 and thinking set to 128 (the lowest possible). So a bit worse than 3 Pro.
Gemini 3.5 Flash is very good, with a strict CER of 2.99% and WER of 6.92%. When you exclude ambiguous capitalization and punctuation errors, it scores 1.28% CER and 2.58% WER, so better than Opus.
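For readers who want to see how figures like these are typically computed, here is a minimal sketch of CER and WER as edit distance divided by reference length. The sample strings are invented, and the `normalize` helper (lowercasing and stripping punctuation) is only an assumed stand-in for the "modified" measure; the actual rules used for ambiguous capitalization and punctuation are not given in this thread.

```python
import string


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits per reference character."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate: word edits per reference word."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def normalize(text):
    # Assumption: approximate the "modified" measure by ignoring
    # case and punctuation entirely.
    return text.lower().translate(str.maketrans("", "", string.punctuation))


# Invented example: ground truth versus a model transcription.
reference = "Received of Mr. Smith, ten pounds."
hypothesis = "received of Mr Smith ten pound."

print(f"strict   CER {cer(reference, hypothesis):.2%}  "
      f"WER {wer(reference, hypothesis):.2%}")
print(f"modified CER {cer(normalize(reference), normalize(hypothesis)):.2%}  "
      f"WER {wer(normalize(reference), normalize(hypothesis)):.2%}")
```

In this toy case the strict scores are dominated by case and punctuation differences, which is why the two measures can diverge as much as they do in the numbers above.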
If we just use Flash to check Pro, we catch 74 of the 139 errors in the baseline, so 53%, with 211 words flagged and thus a specificity of 35%. With Opus 4.7 alone we catch 89, so 64% of all errors, with a specificity of 32%.
You can definitely use only one to good effect to reduce costs. The tradeoff in using both models is also lower specificity.
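The percentages in this reply can be reproduced from the raw counts. The sketch below assumes that "specificity" here means the share of flagged words that turn out to be real errors (often called precision), which is what the Flash figures work out to. The function names are illustrative only.

```python
def catch_rate(caught, total_errors):
    """Share of baseline errors that the checking model flags."""
    return caught / total_errors


def flag_precision(caught, flagged):
    """Share of flagged words that are real errors."""
    return caught / flagged


BASELINE_ERRORS = 139  # errors in the Gemini 3 Pro baseline

# Flash checking Pro: 74 errors caught among 211 flagged words.
print(f"Flash catch rate: {catch_rate(74, BASELINE_ERRORS):.0%}")
print(f"Flash precision:  {flag_precision(74, 211):.0%}")

# Opus checking Pro: 89 errors caught. The flagged-word count is not
# stated, so only the catch rate can be recomputed here.
print(f"Opus catch rate:  {catch_rate(89, BASELINE_ERRORS):.0%}")
```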
Excellent questions!
This method works well for handwritten text in English. Unfortunately, my tests show that no model can match Gemini when it comes to recognising handwriting in Polish (and probably other languages outside the most popular group as well). So, all that remains for me to do in my app is to look for differences in the transcription between Gemini 3 Pro and Gemini 3.5 Flash.
One risk is the reliance on a closed commercial solution over which we have no control. The volatility of this new technology means that what works today for Gemini 3 Pro may not work for Gemini 4. And when Google discontinues a particular model, we are left with nothing. Is the ability to read old manuscripts valuable enough to the company that it will continue to develop and maintain this feature in its models?
That is why I am also looking into open-source models designed for historical document transcription, such as Churro. I am also exploring the option of fine-tuning models such as Qwen-VL.
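A minimal sketch of the "look for differences between two transcriptions" idea, using Python's standard `difflib` on word lists. The function name and the sample Polish strings are invented for illustration, and this is not the commenter's actual app: it simply reports every span where the two models disagree so that a human can review it.

```python
from difflib import SequenceMatcher


def flag_disagreements(baseline, checker):
    """Return (word_index, baseline_words, checker_words) for each span
    where the two transcriptions differ."""
    base_words = baseline.split()
    check_words = checker.split()
    matcher = SequenceMatcher(a=base_words, b=check_words, autojunk=False)
    flags = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # replace, delete, or insert
            flags.append((i1, base_words[i1:i2], check_words[j1:j2]))
    return flags


# Invented example: the second model drops the diacritics in one word.
pro_text = "Zapłacono panu Kowalskiemu dziesięć złotych"
flash_text = "Zapłacono panu Kowalskiemu dziesiec złotych"

for index, base, check in flag_disagreements(pro_text, flash_text):
    print(f"word {index}: {base} vs {check}")
```

A disagreement only says that at least one model is wrong, not which one, and errors that both models share produce no flag at all.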
These are good points. I think it’s interesting, though, that improvements in handwriting recognition seem to be an emergent property of scaling, meaning that while DeepMind has targeted handwriting and is therefore better, Opus and GPT are also getting better as the models get bigger. That said, I’ve experimented a lot with open source models, as well as fine-tuning open source models, and thus far the gap with Gemini is so wide that it doesn’t seem worth it, especially when the resources and cost necessary to run a full-sized open source model are quite significant. The commercial labs are clearly better right now, but I suspect in a few years we’ll have small open source models that are at current Gemini levels.
Hey Mark
Once again a really interesting post from your side. I am wondering how you treat false positives, i.e. results that are wrong but where all of your LLMs make the same transcription mistake? This is a problem that I struggle with in my own project atm.
Also looking forward to seeing more about Transcription Pearl in the future - in your screenshots it looks like you have implemented coordinates for the text or baselines and bounding boxes into the pipeline?
Thanks and good questions. So of the 139 errors Gemini-3-Pro makes in the baseline, 106 of them are caught by overlaying the Flash and Opus transcriptions. That leaves 33 errors where all three models make the same mistakes. Basically those are undetectable errors using the method above and, as I explain in the post, they are mostly spelling modernizations and abbreviations. To catch them, I’ve tried adding a fourth model (GPT-5.5), but it also nearly doubles the number of false positives, that is, correct words that get flagged for review. So to catch 15 more of the 33 real errors, you have to review ~800 words instead of ~350. So not really worth it, given that the errors themselves are pretty minor. I expect that as the models improve, those remaining errors will gradually fall away.
And yes, in Archive Pearl we have implemented visual flagging using bounding boxes to highlight the flagged words in the original document. Users can choose to highlight all the words at once that are flagged in the transcription, or just the selected words, whatever works best for them. You can read more or sign up for the beta at http://archivepearl.com.
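The tradeoff described in this reply can be made concrete with a little arithmetic on its own numbers. This is only a sketch: the ~350 and ~800 flagged-word counts are approximate in the reply, so the results are approximate too, and the variable names are illustrative.

```python
BASELINE_ERRORS = 139

# Three-model overlay (Pro checked by Flash and Opus).
three_caught, three_flagged = 106, 350      # flagged count is approximate

# Adding a fourth model catches 15 of the remaining 33 errors.
four_caught, four_flagged = 106 + 15, 800   # flagged count is approximate


def words_per_error(flagged, caught):
    """Average number of flagged words reviewed per real error found."""
    return flagged / caught


print(f"3 models: {three_caught / BASELINE_ERRORS:.0%} of errors caught, "
      f"{words_per_error(three_flagged, three_caught):.1f} words reviewed per error")
print(f"4 models: {four_caught / BASELINE_ERRORS:.0%} of errors caught, "
      f"{words_per_error(four_flagged, four_caught):.1f} words reviewed per error")

# Marginal cost of the fourth model: extra words reviewed per extra error.
marginal = (four_flagged - three_flagged) / (four_caught - three_caught)
print(f"Fourth model: about {marginal:.0f} extra words reviewed per extra error")
```

On these figures, each additional error found by the fourth model costs roughly nine times as much review effort as an error found by the three-model overlay.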
Hi, I am a volunteer researcher at Historic Christ Church and Museum in Virginia. My work involves transcribing historic tax returns on a large scale. About a month ago I read your essay about the possible Gemini breakthrough in transcribing historic texts. I found it wonderfully clear - and thrilling in the prospects. Now this! I immediately went to sign up for ArchivePearl and only at that point learned that it was in the beta phase. If my work matches your needs for the testing, I will certainly put it to use straight away. Thanks to all there for both your work and for sharing in such a helpful and engaging way. The images alone are a treat! (I thought I sent this 7 hours ago. Absolute novice on Substack.)
Thanks for this. We’ll definitely get you signed up. Will probably be in about a week and I’ll look forward to getting your feedback.
Great post as usual, Mark! Very good for skeptics. And ArchivePearl sounds absolutely amazing! However, I'd say you're too optimistic about accuracy: Gemini (both Flash 3.5 and Pro 3.1) still struggles mightily with hard Dutch or Portuguese hands, for instance, per my recent tests. But I'm looking forward to the day when that will no longer be true!
Thanks Thiago! As usual I try to caveat my comments by indicating I am only talking about English 18th and 19th century texts. I wish I had your linguistic skills!
You convinced me that it is less a matter of language and more a question of handwriting! Unfortunately, there seem to be a lot more Dutch and Portuguese records with bad handwriting… That may also be because I also work with the seventeenth century, although nowadays LLMs often do fine with English secretary hand.
Totally on the handwriting, but it’s also hard to proof transcriptions if you can’t actually read them! :)
Thanks Mark - at the end of the day, it’s ‘horses for courses’ - there will be a number of AI and other programming tools that will be needed in any project, as well as humans with different expertise in the mix for each stage of a transcription project.
No model will ever be perfect and give you 100% accuracy. Every output needs checking; for small and mid-sized projects 80 to 90% accuracy is fine. For large projects you will either sample for errors, transcribe manually, or have bespoke AI programming to cope with the bulk of text and the handwriting characteristics - see the Bentham Project.
Another issue is the archival item itself. For example, Outwards Letterbooks using iron gall inks are difficult to image and transcribe using AI - the iron in the gall ink rusts and the distinctiveness of the handwriting is sometimes lost because you get a kind of ‘splotchy’ handwriting. The paper is tissue thin and this adds to the difficulties.
I would also like to know, for the work you have done with Transkribus, whether you trained unique models for the texts you are working with. Pre-trained models are pretty good, but the texts they have been trained on will not be exactly like the texts you were working with.
Just to reiterate I like Transkribus - it gives a workbench of tools, it is probably acceptable in an institutional environment and within its computing and security infrastructure.
Moreover, I am not a programmer and don’t want to be one, or to get into the nuts and bolts of AI, token weights, etc. All a user wants is to transcribe a text and then work with it as a dataset.
I’m not sure if you’ve read the post, but the whole point is that no programming is required with the website and that error rates are now so low with the verification process I’ve described that they round to 0. Accuracy of 99.78%.