As the data is in a preprinted, structured form and in many hands, you probably need the ability to recognise the structure of the page. It is basically a complex table, and there is an assumption that the same group of hands is found over the period of war service of the individual soldier. There are two issues. First, you need a reproduction of the layout that is as close as possible to the way the data is laid out in the service record, because this is how you can visually see the relationships between pieces of data. The ability to tag the metadata (the field labels in the table: name, rank, location…) would be good. Secondly, you need an interface: a workbench so the transcriber and project team can manage the work, test and run models, etc., without manually running programs. Most institutions would not allow users to run programs, so you need a user interface. IT staff get really nervous, mainly about security issues, and they need to be able to see what the application is and does. Overall great work, but given you are working with cursive in mostly historical documents, your CERs will mostly be less than 10% no matter what LLM you use. I don’t think you can match the CERs seen in printed materials: there is too much variability in historical and archival documents. Single hands with a regular style are easy, but archival items with many hands, page structures and cursive styles are more difficult.
Fascinating, Mark! I'm in the process of HTR'ing thousands of pages of transcriptions of notarial records and your post inspired me to try to see if it could create a simple spreadsheet with the data. It worked for the first few documents, but not for the whole file (it just doesn't produce any output - its status remains at "analyzing" forever). Any tips? I have zero knowledge of programming, alas.
It is totally possible, but it is complicated and unfortunately involves a lot of coding. I have some graduate students testing software that does this automatically for you, and I hope to get it out in the next couple of weeks. Stay tuned!
Thanks, I'm looking forward to it!