Discussion about this post

John O’Connor

I noticed that this work still relies on Gemini 3 Pro. Have you evaluated whether the latest models (3.1 Pro or the presumably forthcoming 3.5 API) reduce CER/WER, or whether 3 Pro Preview is some kind of sweet spot that is neither too smart nor too dumb? The need to use low thinking levels and low temperature to get the best results suggests this might be the case. At the very least it's worth investigating, since as of my comment 3.0 Pro Preview has been shut down.

Edit: Nvm. On closer reading I see that you only kept 3 Pro Preview as a baseline and used 3.5 Flash in this follow-up. One thing I don't think you mentioned is whether 3.5 Flash alone (not overlaid with Opus) produced a meaningfully better CER/WER. Was that something you investigated?

Piotr Jaskulski

This method works well for handwritten text in English. Unfortunately, my tests show that no model can match Gemini when it comes to recognising handwriting in Polish (and probably other languages outside the most popular group as well). So, all that remains for me to do in my app is to look for differences in the transcription between Gemini 3 Pro and Gemini 3.5 Flash.

One risk is the reliance on a closed commercial solution over which we have no control. The volatility of this new technology means that what works today for Gemini 3 Pro may not work for Gemini 4. And when Google discontinues a particular model, we are left with nothing. Is the ability to read old manuscripts valuable enough to the company that it will continue to develop and maintain this feature in its models?

That is why I am also looking into open-source models designed for historical document transcription, such as Churro. I am also exploring the option of fine-tuning models such as Qwen-VL.
