Introducing Transcription Pearl

Mark Humphries

Nov 1, 2024

A Practical AI Tool for the Automated Transcription of Historical Handwritten Documents with State-of-the-Art Accuracy

11 Comments

Mark Humphries

Nov 15

Wow, that is really interesting as I’ve wondered how it would do on other languages. Great to hear!

Mark Humphries

Nov 8

Hi Jim, it's always interesting! So you are importing a PDF which has text and images and then correcting the text with Sonnet-3.5, but it looks like the corrections are not going well. Can you send me the PDF?

Lori Olson White

Feb 9 (edited)

Way above my head in terms of technology, but I’m handing it off to my “tech assistant” aka husband, and am excited to try it on various genealogy projects I’m working on. Thanks for sharing this info and leading the edge on practical AI.

Nicole Dyer

Nov 6

This is wonderful! As a genealogist, I'm particularly excited to try Transcription Pearl.

Jon Bang Ploug

Apr 11

I've successfully modified the Python code to work on macOS.

The main issue was the "Too many open files" error that occurs due to macOS having lower default file handle limits than Windows. With the help of Claude 3.7 Sonnet, I was able to make these specific changes to resolve the issues:

Step 1: Limit the number of concurrent tasks

Find line 2268 in the file:

with ThreadPoolExecutor(max_workers=batch_size) as executor:

Change it to:

with ThreadPoolExecutor(max_workers=min(5, batch_size)) as executor:

This ensures no more than 5 pages are processed at the same time, regardless of how many are selected.
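A minimal sketch of the capping idea in Step 1, using only the standard library. The names `process_page`, `run_batch`, and `MAX_WORKERS` are hypothetical and chosen for illustration; they are not taken from the program's code.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 5  # assumed cap, matching the value used in Step 1


def process_page(page_number):
    """Hypothetical stand-in for the real per-page work."""
    return f"page {page_number} done"


def run_batch(pages, batch_size):
    # Never start more than MAX_WORKERS threads, however large the batch is.
    # max(1, ...) guards against a batch_size of 0, which ThreadPoolExecutor rejects.
    workers = max(1, min(MAX_WORKERS, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map returns results in the same order as the input pages.
        return list(executor.map(process_page, pages))


results = run_batch(range(1, 21), batch_size=20)
```

All 20 pages are still processed; the cap only limits how many are in flight at once, which in turn limits how many files can be open at the same moment.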

Step 2: Make sure image files are properly closed

I needed to add close() calls after the resources were completely used:

At line 256:

original_image = Image.open(image_path)
# (code that uses original_image)
original_image.close()  # Added this line

At line 719:

self.original_image = Image.open(image_path)
# (code that uses self.original_image)
self.original_image.close()  # Added this line

At line 1393:

img = Image.open(file_path)
# (code that uses img)
img.close()  # Added this line

At line 1215 for the PDF document:

pdf_document = fitz.open(pdf_file)
# (code that uses pdf_document)
pdf_document.close()  # Added this line

The critical part was making sure these close() calls were placed AFTER all operations with the resources were completed, not in the middle of using them. This required careful reading of the surrounding code to find the right placement.

I'm not entirely sure if all these changes were strictly necessary, but combined they resolved the file handle limitation issues on macOS.
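One way to avoid hunting for the right spot for each close() call is a `with` block, which closes the resource when the block exits, even if an error is raised partway through. Objects returned by Pillow's Image.open and PyMuPDF's fitz.open can both be used in a `with` statement. The sketch below uses a hypothetical `FakeImage` class in place of those libraries so that it runs with the standard library alone; the class, the `read_size` helper, and the file name are assumptions made for illustration.

```python
from contextlib import closing


class FakeImage:
    """Hypothetical stand-in for an object returned by Image.open or fitz.open."""

    def __init__(self, path):
        self.path = path
        self.closed = False

    def size(self):
        if self.closed:
            raise ValueError("operation on closed resource")
        return (100, 200)  # made-up dimensions for the demo

    def close(self):
        self.closed = True


def read_size(path):
    # closing() calls .close() when the block exits, even on an exception,
    # so the handle cannot be left open by an early return or an error.
    with closing(FakeImage(path)) as img:
        return img.size(), img


size, img = read_size("page_001.jpg")
```

With the real libraries the same shape would be `with Image.open(image_path) as img:`. This only fits places where the image is not needed after the block ends; an attribute such as self.original_image that is used later would still need an explicit close() at the point where it is finished with.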

Best regards, Jon

Denyse Allen

Feb 10

Absolutely amazing! Do you know what FamilySearch is using to transcribe their record images? I have hundreds of pages of Union Army pension files and this tool could be the trick to transcribe those.

Vivienne Cuff

Jan 27

Great work - would like to try Pearl

Florindo Palladino

Nov 14

Congratulations on the outstanding work! I applied the tool to 19th- and 20th-century Italian documents and managed to achieve CER and WER values very close to yours.

Jim Clifford

Nov 8

I think I've broken it. I'm working with a PDF from Transkribus that did a good job, but with lots of common errors:

Having now made the Exporiments

by the test of tanning Leather as desired, necessa-

ry to decide on the best sort of Terra Japonica which I find is the kind N.3. I stated on the 23rd

August last as most worthy of attention, and having

Using Claude 3.5, Transcription Pearl "corrects" the OCR to the following. It seems to be taking some big liberties:

Having now made the Experiments

of the art of tanning Leather as desired, except-

-ing a small trial with bark that I have yet to make, as

I shall, please God, this Week. No. 2, 3, 4, and 5, are

Here are the results when I had Claude do the first pass on the OCR:

Having now made the Experiments

of the art of tanning Leather as directed, respecting which I made verbal report of having performed on

3d ulto. (Jany. 3d Inst. No. 2, dated 20th Decr)

request that a short interval of attention, and having

In both cases, Claude misses the key term on line 3, Terra Japonica, which is fairly easy to read in the document.

I was getting more promising results with the OCR on more difficult images yesterday. I'm not sure what is causing the problem in this example.

Jim Clifford

Nov 8

(I should have started by saying I'm super excited and hope to get this to work on some of my projects.)

Mark Humphries

Nov 8 (edited)

Ok, so thanks so much for sending me the PDF; it revealed a huge error.

First, the prompts file in the util folder was missing some key information, specifically the {text_to_process} placeholder, which sends the text of the original transcription to the LLM. That has been updated on GitHub now. If you don't want to re-download the prompts.csv, you can edit the "Specific_Instructions" entry for the Main Function so that it reads:

"Your task is to use the handwritten page image to correct the following transcription, retaining the spelling, syntax, punctuation, line breaks, catchwords, etc of the original.

{text_to_process}"

Otherwise, you can just re-download the prompts.csv from the GitHub repo and overwrite the old file in the util folder.
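To illustrate why the missing placeholder mattered, here is a minimal sketch of how a template like this might be filled in. The `build_prompt` helper and the use of str.replace are assumptions made for illustration; the program's actual substitution code may differ.

```python
PLACEHOLDER = "{text_to_process}"

instructions = (
    "Your task is to use the handwritten page image to correct the following "
    "transcription, retaining the spelling, syntax, punctuation, line breaks, "
    "catchwords, etc of the original.\n\n"
    "{text_to_process}"
)


def build_prompt(template, text):
    # Fail loudly if the placeholder is missing; otherwise the model would
    # receive the instructions without any transcription to correct.
    if PLACEHOLDER not in template:
        raise ValueError("template is missing " + PLACEHOLDER)
    # str.replace is used instead of str.format so that other braces in the
    # template or the transcription are left untouched.
    return template.replace(PLACEHOLDER, text)


prompt = build_prompt(instructions, "Having now made the Exporiments")
```

Without the placeholder, the model is told to correct a transcription it never sees and has only the page image to work from.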

Also, I was mainly working with JPGs and not importing from PDFs in testing. It turns out I had the program importing images from PDFs at 72 DPI, which is far too low a resolution. I updated the code to import at 300 DPI. If you want to make this change manually, it is line 1261, which should read:

pix = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72))
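The 300/72 factor comes from PDF page sizes being measured in points, 72 to the inch, so an unscaled render produces one pixel per point. A small sketch of the arithmetic, with hypothetical helper names and a US Letter page assumed as the example:

```python
PDF_POINTS_PER_INCH = 72  # PDF page sizes are measured in points, 72 per inch


def zoom_for_dpi(dpi):
    # Scale factor to apply on each axis of the transformation matrix.
    return dpi / PDF_POINTS_PER_INCH


def rendered_size(width_pt, height_pt, dpi):
    # Pixel dimensions of a page of the given size in points at the given DPI.
    zoom = zoom_for_dpi(dpi)
    return round(width_pt * zoom), round(height_pt * zoom)


# US Letter page: 8.5 x 11 inches = 612 x 792 points
low = rendered_size(612, 792, 72)    # (612, 792)
high = rendered_size(612, 792, 300)  # (2550, 3300)
```

At 300 DPI the same page has roughly 17 times as many pixels as at 72 DPI, which gives the model far more detail in each pen stroke.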

When I reran your text through after the updates I got:

Having now made the Experiments

by the test of tanning Leather as desired, necessa-

ry to decide on the best sort of Terra Japonica &c,

which I find is the kind No.3. I stated on the 23rd

August last as most worthy of attention, and having

This is much closer to the original!

Generative History
