Towards Big Data Infrastructure for Historic Handwritten Document Transcription
- ECU Author/Contributor (non-ECU co-authors, if there are any, appear on document)
- David Richard Hoffman (Creator)
- Institution
- East Carolina University (ECU)
- Web Site: http://www.ecu.edu/lib/
Abstract: Historical archival documents are vast, often containing many thousands of pages of unstructured data that are not easily searchable. The current state of the art is searchable meta-tags, which help a reader land on the page, but the data within the documents is not fully searchable. By contrast, if this unstructured handwritten data is transcribed and stored in a database, search becomes much more straightforward. Digitization also allows visually impaired persons to stay connected with history, as the transcribed text can be sent through assistive technology. Furthermore, digitized text can be translated into other languages, allowing for greater accessibility. The problem is apparent and the end goal is clear, but the steps to achieve it are not. Some museums and libraries have resorted to crowdsourced, volunteer transcription via the web. A more logical means would be to use computer-based handwriting recognition to help achieve the goal. In this research we present Big Data approaches towards handwriting recognition. High-resolution document scans consume a large amount of disk space, and processing the images with advanced algorithms requires a large number of operations. Sharing the load of data storage and parallel processing over a scalable cluster of machines is the logical next step, and has not been reported on in depth in this domain. This research does not seek to completely solve the problem of a full-fledged handwriting recognition system. Rather, it describes and demonstrates how using a cluster in a Hadoop environment can assist the research by processing in parallel. Although the algorithms utilized in this research already exist, the framework introduced may be extended to accommodate more advanced methods in order to speed up the processing time.
Additional Information
- Publication
- Thesis
- Language: English
- Date: 2017
- Keywords
- handwriting recognition, manuscript, Hadoop Streaming API
- Subjects
Title | Location & Link | Type of Relationship |
Towards Big Data Infrastructure for Historic Handwritten Document Transcription | http://hdl.handle.net/10342/6180 | The described resource references, cites, or otherwise points to the related resource. |