12 Comments
Piotr Jaskulski's avatar

When it comes to HTR/OCR tools,

The problem is that ABBYY FineReader does not cope well with 19th-century printed books from Eastern Europe, even though it is Russian software. And Transkribus is a commercial system and by no means cheap if you need to process large volumes of material. It is probably wiser to invest in eScriptorium/Kraken. But not in every case; if we have several thousand pages of manuscripts from dozens of people, this may lead to the need to retrain many models, which is time-consuming. Gemini, however, reads 19th- and 20th-century handwriting well enough to significantly speed up the work. Of course, LLMs make different kinds of errors than traditional HTR models; one must pay particular attention to proper names and areas with less legible handwriting, where the likelihood of hallucinations is greater.

MASAKO | NewForever's avatar

As a traveler of vibecoding, this was truly fascinating to me (thank you!).

Thiago's avatar

Also, do you think Gemini 3.5 has better computer vision than 3.1/3? Its paleography capabilities seem to be largely unchanged, but I'd say it seems less prone to hallucinations in the couple of tests I ran, in an admittedly haphazard fashion.

Mark Humphries's avatar

I ran Gemini 3.5 Flash on our test set and it has a strict character error rate (CER) of 2.99% and a word error rate (WER) of 6.92%. When you exclude ambiguous capitalization and punctuation errors, it scores 1.28% CER and 2.58% WER, so better than Opus and just barely behind Gemini 3 Pro. This is a small but meaningful improvement over Gemini 3 Flash.
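
For readers unfamiliar with these metrics, here is a minimal sketch of how strict and lenient CER/WER can be computed from edit distance. It is not the scoring script behind the numbers above: the `normalize` rule (lowercasing and dropping ASCII punctuation) is only an assumed stand-in for "excluding ambiguous capitalization and punctuation errors", and the sample sentences are invented.

```python
import string


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(text):
    """ASSUMED lenient rule: lowercase and strip ASCII punctuation."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


# Invented example: one case slip, one dropped comma, one real misreading.
reference = "Dear Sir, I arrived at Fort William on Monday."
hypothesis = "Dear sir I arrived at Fort Willam on Monday."

print(f"strict  CER={cer(reference, hypothesis):.2%} "
      f"WER={wer(reference, hypothesis):.2%}")
print(f"lenient CER={cer(normalize(reference), normalize(hypothesis)):.2%} "
      f"WER={wer(normalize(reference), normalize(hypothesis)):.2%}")
```

In the lenient pass only the misread name still counts as an error, which is why the two sets of figures can differ so much on the same transcription.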

Thiago's avatar

And what about costs? Is it much more expensive than Gemini 3 Flash, much cheaper than Pro, etc.?

Mark Humphries's avatar

Flash is $1.50 per million input tokens and $9.00 per million output tokens. Pro is 25% higher. For context, an image and short prompt is about 1750 tokens. If you are transcribing a page, that is about $0.005 per page.
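
The per-page figure can be checked with a few lines of arithmetic. This is a rough sketch using the prices quoted above; the output length of 300 tokens per page is an assumption (the comment gives only the input size), and applying "25% higher" to both the input and output rates for Pro is also an assumption.

```python
# Prices and input size are taken from the comment above.
FLASH_INPUT_USD_PER_MTOK = 1.50
FLASH_OUTPUT_USD_PER_MTOK = 9.00
INPUT_TOKENS_PER_PAGE = 1750   # one image plus a short prompt
OUTPUT_TOKENS_PER_PAGE = 300   # ASSUMPTION: tokens in one transcribed page
PRO_MARKUP = 1.25              # ASSUMPTION: 25% applies to both rates


def cost_per_page(input_tokens, output_tokens, input_rate, output_rate):
    """Cost in USD for one request, with rates given per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


flash = cost_per_page(INPUT_TOKENS_PER_PAGE, OUTPUT_TOKENS_PER_PAGE,
                      FLASH_INPUT_USD_PER_MTOK, FLASH_OUTPUT_USD_PER_MTOK)
pro = cost_per_page(INPUT_TOKENS_PER_PAGE, OUTPUT_TOKENS_PER_PAGE,
                    FLASH_INPUT_USD_PER_MTOK * PRO_MARKUP,
                    FLASH_OUTPUT_USD_PER_MTOK * PRO_MARKUP)

print(f"Flash: ${flash:.4f} per page, ${flash * 100_000:,.2f} per 100,000 pages")
print(f"Pro:   ${pro:.4f} per page, ${pro * 100_000:,.2f} per 100,000 pages")
```

Under these assumptions the input and output halves of the bill are about equal, so dense pages with long transcriptions will cost noticeably more than the average.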

Thiago's avatar

I was looking forward to your post, Mark! Very interesting, as usual, and I agree with most of your points. I've been doing what I call "visual triage" for a couple of months now, having Codex (on two different machines) go through hundreds of thousands of manuscript images looking for the needles in these huge haystacks, but you wrote that "Practically speaking, I could have used the Antigravity Agent to do my microfilm example in a fraction of the time, spinning up dozens of agents to do in minutes what took Cowork days—and what would have taken me many weeks."

How would that work? After all, you would have to upload TBs of data to the cloud. Or can Antigravity somehow be connected to my Dropbox folder? Also, are Antigravity's limits similar to Codex's? Codex had much higher limits than Cowork when I tried it, but that was before their deal with xAI. I'd rather use Antigravity, mostly because its computer vision is still better than ChatGPT's.

Mark Humphries's avatar

Hi Thiago! Remember that although the Codex app lives on your computer, it still does all its work via OpenAI’s servers. So if you had it read through thousands of images on your computer, each image was sent to OpenAI’s servers to be analyzed. You can hook up an Antigravity Agent to Google Drive or any source (within reason) using code. I am not clear on exactly how this will work, but it looks like it would significantly cut time. But yes, this would all involve uploading terabytes of data, as does using Codex. The major speedup comes from parallelization: you can have 50 agents working on the same task at the same time, so it completes up to 50x faster.

Thiago's avatar

Thanks! I get your point, but you can more or less mimic that by having multiple agents running on your machines. I usually have 5-6 tasks running at the same time on two different machines (and at some point had 10 on four). I do agree that parallelization would take that to the next level (and burn a LOT of tokens), though. Please post when you figure out how to do that!

Mark Humphries's avatar

So the post talks a bit about doing just that, but maybe I wasn’t explaining it very well… Using the API, you can simply make multiple calls, each of which instantiates its own agent. The Antigravity Agent can also spin up its own subagents automatically. Yes, it burns lots of tokens, but on the whole it is highly efficient in terms of ROI.
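
The general pattern of "make multiple calls at once" can be sketched with Python's standard library. This is not the Antigravity API: `transcribe_page` is a hypothetical placeholder that returns a dummy string where a real version would send the image to whichever model API you use, and the file paths are invented.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe_page(image_path):
    """HYPOTHETICAL placeholder: a real version would call a model API here."""
    return f"[transcription of {image_path}]"


def transcribe_all(image_paths, max_workers=8):
    """Run transcribe_page over many pages concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as image_paths.
        return list(pool.map(transcribe_page, image_paths))


# Invented file names standing in for microfilm frames.
pages = [f"reel_01/frame_{n:04d}.jpg" for n in range(1, 21)]
results = transcribe_all(pages, max_workers=5)
print(len(results), results[0])
```

Threads suit this kind of work because each call spends most of its time waiting on the network. In practice the speedup is capped by the provider's rate limits and by upload bandwidth, so the number of workers is something to tune rather than simply maximize.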

Thiago's avatar

I understand the principle! Just want to know if it works in practice.

Vivienne Cuff's avatar

You seem to be going down rabbit hole(s)? There are easier products to use for transcription of printed material, like ABBYY FineReader PDF (Corporate & Enterprise).

Sorry - please work with Transkribus - https://www.transkribus.org - and contribute to its development.

Transkribus uses AI for transcription of printed and handwritten texts, and it concentrates on providing the tools needed for transcription and other tasks. It provides them to relieve the user of the technicalities, and it can be used in an organisational context: a solution that is easier to implement and to explain to security and IT staff. It provides workflow, API, etc. It is also easier for non-IT staff to train and use.

One example of the difficulty with archival materials is veterans' service files. They often contain printed and handwritten texts, tables and other formats. They contain abbreviations and other data where it is more important to understand the who, the why and the legislation relating to their creation; this knowledge contributes directly to creating the model. Moreover, their relationship with related records, like Unit Diaries or printed gazette notices, can be included.

As well, there are many hands in most archival materials, so you have to create specific models for the type of document. Mostly they are not like diary entries with a paragraph of consistent handwriting. Examples of this are 19th-century Lunatic Asylum or Hospital Medical Casebooks.

Transkribus and other tools can extract metadata and also there are tools to visualise and migrate data.

The key issue is to put together a suite of tools that enables the work to be done quickly and efficiently: proven, usable, and easy to use. I don’t have to be a programmer. I can understand enough to follow what is happening but don’t have to get into the nuts and bolts of it.