Towards Big Data Infrastructure for Historic Handwritten Document Transcription

ECU Author/Contributor (non-ECU co-authors, if there are any, appear on document): David Richard Hoffman (Creator)
Institution: East Carolina University (ECU); Web Site: http://www.ecu.edu/lib/

Abstract: Historical archival documents are vast and contain many thousands of pages of unstructured data, often not easily searchable. The current state of the art is searchable meta-tags, which help a reader land on the right page, but the data within the documents is not fully searchable. By contrast, if this unstructured handwritten data is transcribed and stored in a database, search becomes much more straightforward. Digitization of documents allows visually impaired persons to stay connected with history, as the transcribed text can be sent through assistive technology. Furthermore, digitized text can be translated into other languages, allowing for greater accessibility. The problem is apparent and the end-goal solution is clear, but the steps to achieve the solution are not. Some museums and libraries have resorted to crowdsourced, volunteer transcription via the web. A more logical means would be to use computer-based handwriting recognition to help achieve the goal. In this research we present Big Data approaches towards handwriting recognition. High-resolution document scans consume a large amount of disk space, and processing the images with advanced algorithms requires a large number of operations. Sharing the load of data storage and parallel processing over a scalable cluster of machines is the logical next step, and has not been reported on in depth in this domain. This research does not seek to completely solve the problem of a full-fledged handwriting recognition system. Rather, it describes and demonstrates how a cluster in a Hadoop environment can assist the research by processing in parallel. Although the algorithms used in this research already exist, the framework introduced may be extended to accommodate more advanced methods in order to speed up processing.
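
To illustrate the kind of parallel processing the abstract describes, the following is a minimal sketch of a Hadoop Streaming style mapper in Python. It is not taken from the thesis. It assumes that each input line names one scanned page image and that the mapper emits one tab-separated key/value record per page. The names `placeholder_recognizer`, `map_records`, and `run` are hypothetical, and the recognizer is a stand-in that performs no actual image processing.

```python
import sys
from typing import Callable, Iterable, Iterator, TextIO


def placeholder_recognizer(image_path: str) -> str:
    # Hypothetical stand-in: a real job would load the scan and run a
    # recognition algorithm here. This sketch returns a fixed marker.
    return "UNPROCESSED"


def map_records(
    lines: Iterable[str],
    recognize: Callable[[str], str] = placeholder_recognizer,
) -> Iterator[str]:
    # Assumption: each non-empty input line is the path of one page image.
    for line in lines:
        image_path = line.strip()
        if not image_path:
            continue  # skip blank input lines
        # Streaming mappers conventionally emit "key<TAB>value" text lines.
        yield f"{image_path}\t{recognize(image_path)}"


def run(stream_in: TextIO = sys.stdin, stream_out: TextIO = sys.stdout) -> None:
    # Streaming passes input records on stdin and collects stdout.
    for record in map_records(stream_in):
        stream_out.write(record + "\n")
```

In a streaming job, a script like this would call `run()` at the bottom and be supplied as the mapper, so that each node in the cluster handles its own share of the page list independently. Swapping `placeholder_recognizer` for a real recognition routine would leave the surrounding structure unchanged.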

Additional Information

Publication: Thesis; Language: English; Date: 2017
Keywords: handwriting recognition, manuscript, hadoop streaming api

This item references:

Title	Location & Link	Type of Relationship
Towards Big Data Infrastructure for Historic Handwritten Document Transcription	http://hdl.handle.net/10342/6180	The described resource references, cites, or otherwise points to the related resource.
