In the humanities and cultural studies, OCR (Optical Character Recognition) and HTR (Handwritten Text Recognition) remain difficult tasks. OCR4all gives all users a free, easy-to-use tool for carrying out their own OCR workflows. This workshop covers the fundamental concepts of OCR and gives a brief overview of OCR4all, addressing questions such as:
- What kinds of files and data are necessary for OCR?
- How does the OCR or HTR workflow integrated in OCR4all vary depending on the source material and the expected (human) effort?
- How much of the workflow can be automated for the material at hand?
- What is an OCR model, and how can one train a specific text recognition model?
- What level of recognition accuracy can be expected?
- How much effort should go into producing the texts if they are to be reused later?
These and other questions will be discussed and explained so that, by the end of the session, all participants can work independently on challenging OCR tasks.
Participants may work with the sample texts provided or with their own materials. The training has no prerequisites and is open to all skill levels.
Speaker: Florian Langhanki (JMU)
The number of participants is limited to 15, so please register at forschungsdienste@sub.uni-hamburg.de.
This event is part of the series "Digital Humanities – How does it work?" of the Department for Digital Scholarship Services.
Examining historical sources in the form of printed and manuscript material is a crucial part of the work of scholars in the humanities, as well as in the cultural and human sciences. These sources are frequently accessible only as scans, which severely limits their usefulness: automatic indexing techniques such as full-text search and quantitative analysis methods cannot be applied to images. For these purposes, machine-readable full text must first be extracted from the digitized material, and methods for the automatic text recognition of prints (OCR) and manuscripts (HTR) are becoming increasingly important in this process. Old prints and manuscripts in particular can still be exceedingly difficult to process, for a variety of reasons. Fortunately, historical OCR/HTR has made significant strides in recent years, leading to the development of some high-performance solutions.
OCR4all, a freely available open-source program developed by the Center for Philology and Digitization (ZPD) at the University of Würzburg, aims to enable users of all skill levels to independently and accurately index complex printed materials and manuscripts. OCR4all is a single application that covers the whole text recognition workflow and includes all necessary tools. Thanks to its user-friendly graphical interface, it is simple to install and use.
The lecture covers the fundamentals of automatic text recognition and introduces OCR4all and its features in a live demonstration. It also shows how OCR4all is applied to various materials and how well it performs on them, and it gives an overview of recent work as well as an outlook on future developments.
Speaker: Christian Reul
This event is part of the series "Digital Humanities – How does it work?" of the Department for Digital Scholarship Services.
Universität Hamburg
Adeline Scharfenberg