Text Mining for Historical Documents

Numerous projects are currently digitising large collections of sources held in archives, libraries and museums. Digitising these sources is only the first step in a larger process: developing tools to extract relevant information from them is as important as the digitisation itself.

Two seminars, held in spring 2009 and 2010, taught students how to analyse sources using linguistic methods. Important text mining tasks were presented, such as entity recognition and disambiguation, relation extraction and template filling, segmentation of semi-structured text, automatic link detection between documents, and error detection and correction. The text mining techniques were implemented and tested on real-world examples from the cultural heritage domain, such as historical documents. The cultural heritage domain is a good testbed for NLP methods because a wealth of information in this domain is contained in raw, unprocessed and often relatively unstructured texts (in contrast to the biomedical domain, where a lot of data is already in a fairly structured form). Text mining can make such documents more accessible to researchers and laypersons alike. Moreover, language change over time, unorthodox orthography, and errors introduced during digitisation (e.g. OCR errors) make this domain particularly challenging (and thus interesting!) for natural language processing.
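To make the error detection and correction task more concrete, here is a minimal sketch of lexicon-based correction of OCR errors in Python. It is not taken from the seminar material: the word list `LEXICON`, the helper names `correct_token` and `correct_text`, and the similarity cutoff of 0.75 are all assumptions chosen for illustration. A real system for historical documents would need a period-appropriate lexicon and would have to cope with spelling variation that is not an error at all.

```python
import difflib

# Hypothetical mini-lexicon, for illustration only. A real project would
# use a large word list suited to the period and language of the sources.
LEXICON = ["the", "house", "of", "parliament", "was", "founded", "in", "london"]


def correct_token(token, lexicon=LEXICON, cutoff=0.75):
    """Return the closest lexicon word for an unknown token, else the token itself."""
    lower = token.lower()
    # Known words and non-alphabetic tokens (numbers, punctuation) are left alone.
    if lower in lexicon or not lower.isalpha():
        return token
    # difflib ranks candidates by string similarity; keep only the best one
    # and only if it is at least as similar as the (assumed) cutoff.
    matches = difflib.get_close_matches(lower, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(text):
    """Correct a whitespace-tokenised string token by token."""
    return " ".join(correct_token(t) for t in text.split())


if __name__ == "__main__":
    # "c" for "e" and "rn" for "m" are typical OCR confusions.
    print(correct_text("the housc of parliarnent"))
```

This approach only detects errors that produce non-words and replaces them with the most similar known word, so it can just as easily "correct" a legitimate historical spelling. That limitation is one reason the domain is challenging for natural language processing.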





