Wow, that is really interesting as I’ve wondered how it would do on other languages. Great to hear!
Hi Jim, it's always interesting! So you are importing a PDF which has text and images and then correcting the text with Sonnet-3.5...but it looks like the corrections are not going well. Can you send me the PDF?
Way above my head in terms of technology, but I’m handing it off to my “tech assistant” aka husband, and am excited to try it on various genealogy projects I’m working on. Thanks for sharing this info and leading the edge on practical AI.
This is wonderful! As a genealogist, I'm particularly excited to try Transcription Pearl.
I've successfully modified the Python code to work on macOS.
The main issue was the "Too many open files" error that occurs due to macOS having lower default file handle limits than Windows. With the help of Claude 3.7 Sonnet, I was able to make these specific changes to resolve the issues:
Step 1: Limit the number of concurrent tasks
Find line 2268 in the file:
with ThreadPoolExecutor(max_workers=batch_size) as executor:
Change it to:
with ThreadPoolExecutor(max_workers=min(5, batch_size)) as executor:
This ensures no more than 5 pages are processed at the same time, regardless of how many are selected.
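To illustrate the Step 1 change in isolation, here is a minimal sketch. The names process_page and process_batch are hypothetical stand-ins, not functions from the actual program; only the min(5, batch_size) cap mirrors the edit described above. The counter simply records how many tasks ran at once.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 5  # the cap introduced in Step 1

_lock = threading.Lock()
_active = 0
peak_active = 0  # highest number of tasks seen running at the same time


def process_page(page_number):
    """Hypothetical stand-in for the real per-page work."""
    global _active, peak_active
    with _lock:
        _active += 1
        peak_active = max(peak_active, _active)
    time.sleep(0.01)  # simulate I/O such as opening an image
    with _lock:
        _active -= 1
    return page_number


def process_batch(pages, batch_size):
    """Run process_page over pages with at most MAX_WORKERS threads."""
    # batch_size must be at least 1, since max_workers=0 raises ValueError.
    with ThreadPoolExecutor(max_workers=min(MAX_WORKERS, batch_size)) as executor:
        return list(executor.map(process_page, pages))


results = process_batch(range(20), batch_size=20)
```

Even with 20 pages selected, no more than 5 tasks hold resources at once, which keeps the number of simultaneously open files bounded.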
Step 2: Make sure image files are properly closed
I needed to add close() calls after the resources were completely used:
At line 256:
original_image = Image.open(image_path)
# (code that uses original_image)
original_image.close() # Added this line
At line 719:
self.original_image = Image.open(image_path)
# (code that uses self.original_image)
self.original_image.close() # Added this line
At line 1393:
img = Image.open(file_path)
# (code that uses img)
img.close() # Added this line
At line 1215 for the PDF document:
pdf_document = fitz.open(pdf_file)
# (code that uses pdf_document)
pdf_document.close() # Added this line
The critical part was making sure these close() calls were placed AFTER all operations with the resources were completed, not in the middle of using them. This required careful reading of the surrounding code to find the right placement.
I'm not entirely sure if all these changes were strictly necessary, but combined they resolved the file handle limitation issues on macOS.
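One way to avoid hunting for the right spot for each close() call is a with block, which closes the handle when the block ends, even if an error occurs partway through. The sketch below uses Python's built-in open() and a hypothetical read_header helper so it runs without extra libraries. To my knowledge, Pillow's Image.open and PyMuPDF's fitz.open can be used in a with statement the same way, but that is not taken from the program's code. It would not suit a case like self.original_image if that attribute is still needed after the method returns.

```python
import os
import tempfile


def read_header(path, size=8):
    """Hypothetical helper: read the first bytes of a file."""
    with open(path, "rb") as handle:
        header = handle.read(size)
    # The with block has ended, so the file handle is already closed here.
    return header, handle.closed


# Create a small temporary file to read back.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"\x89PNG\r\n\x1a\n rest of file")
    tmp_path = tmp.name

header, was_closed = read_header(tmp_path)
os.remove(tmp_path)
```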
Best regards, Jon
Absolutely amazing! Do you know what FamilySearch is using to transcribe their record images? I have hundreds of pages of Union Army pension files and this tool could be the trick to transcribe those.
Great work - would like to try Pearl
Congratulations on the outstanding work! I applied the tool to 19th and 20th-century Italian documents and managed to achieve CER and WER values very close to yours.
I think I've broken it. I'm working with a PDF from Transkribus that did a good job, but with lots of common errors:
Having now made the Exporiments
by the test of tanning Leather as desired, necessa-
ry to decide on the best sort of Terra Japonica which I find is the kind N.3. I stated on the 23rd
August last as most worthy of attention, and having
Using Claude 3.5, Transcription Pearl "corrects" the OCR to the following. It seems to be taking some big liberties:
Having now made the Experiments
of the art of tanning Leather as desired, except-
-ing a small trial with bark that I have yet to make, as
I shall, please God, this Week. No. 2, 3, 4, and 5, are
Here are the results of getting Claude to do the first pass on the OCR:
Having now made the Experiments
of the art of tanning Leather as directed, respecting which I made verbal report of having performed on
3d ulto. (Jany. 3d Inst. No. 2, dated 20th Decr)
request that a short interval of attention, and having
In both cases, Claude misses the key term on line 3, Terra Japonica, which is fairly easy to read in the document.
I was getting more promising results with the OCR on more difficult images yesterday. I'm not sure what is causing the problem in this example.
(I should have started by saying I'm super excited and hope to get this to work on some of my projects.)
Ok, so thanks so much for sending me the PDF, it revealed a huge error.
First, the prompts file in the util folder was missing some key information, specifically the {text_to_process} placeholder, which sends the text of the original transcription to the LLM. That has been updated on GitHub now. If you don't want to re-download the prompts.csv, you can edit the "Specific_Instructions" entry for the Main Function so that it reads:
"Your task is to use the handwritten page image to correct the following transcription, retaining the spelling, syntax, punctuation, line breaks, catchwords, etc of the original.
{text_to_process}"
Otherwise, you can just re-download the prompts.csv from the GitHub repo and overwrite the old file in the util folder.
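To see why the missing placeholder matters, here is a small sketch. It assumes the program fills the prompt with something like Python's str.format; build_prompt is a hypothetical helper, not the program's actual function, and the shortened second prompt is only for illustration. Without {text_to_process}, the transcription never reaches the model, so it has nothing to correct and works from the image alone.

```python
PROMPT_WITH_PLACEHOLDER = (
    "Your task is to use the handwritten page image to correct the following "
    "transcription, retaining the spelling, syntax, punctuation, line breaks, "
    "catchwords, etc of the original.\n"
    "{text_to_process}"
)

# Illustrative prompt with the placeholder left out.
PROMPT_WITHOUT_PLACEHOLDER = (
    "Your task is to use the handwritten page image to correct the following "
    "transcription."
)


def build_prompt(template, text_to_process):
    """Hypothetical helper: substitute the transcription into the template."""
    # str.format silently ignores keyword arguments the template does not use,
    # so a missing placeholder produces no error, just a prompt without the text.
    return template.format(text_to_process=text_to_process)


ocr_text = "Having now made the Exporiments"
good_prompt = build_prompt(PROMPT_WITH_PLACEHOLDER, ocr_text)
bad_prompt = build_prompt(PROMPT_WITHOUT_PLACEHOLDER, ocr_text)
```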
Also, I was mainly working with JPGs and not importing from PDFs in testing. Turns out I had the program importing images from PDFs at 72 DPI which is way too low a resolution. I updated the code to import at 300 DPI. If you want to do this manually it is line 1261 which should read:
pix = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72))
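The 300/72 factor comes from PDF page sizes being measured in points, at 72 points per inch, so a zoom of 1 renders one pixel per point. The sketch below works through the arithmetic for a US Letter page (612 x 792 points). The helper rendered_size is hypothetical and only approximates the image dimensions; the exact pixel size PyMuPDF produces may differ slightly by rounding.

```python
def rendered_size(width_pt, height_pt, dpi, base_dpi=72):
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    # Multiply before dividing to keep the arithmetic exact for common sizes.
    width_px = round(width_pt * dpi / base_dpi)
    height_px = round(height_pt * dpi / base_dpi)
    return width_px, height_px


LETTER_PT = (612, 792)  # US Letter, 8.5 x 11 inches at 72 points per inch

size_at_72 = rendered_size(*LETTER_PT, dpi=72)    # one pixel per point
size_at_300 = rendered_size(*LETTER_PT, dpi=300)  # about 4.17x in each direction
```

At 72 DPI the page image is only 612 pixels wide, which leaves very few pixels per handwritten character. At 300 DPI it is 2550 pixels wide.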
When I reran your text through after the updates I got:
Having now made the Experiments
by the test of tanning Leather as desired, necessa-
ry to decide on the best sort of Terra Japonica &c,
which I find is the kind No.3. I stated on the 23rd
August last as most worthy of attention, and having
This is much closer to the original!