Approximate pattern matching for OCR texts


Speaker: Dr. Manolis Christodoulakis
Affiliation: University of East London, UK
Category: Colloquium
Location: Room 148, Faculty of Pure and Applied Sciences (FST-01), 1 University Avenue, 2109 Nicosia, Cyprus
Date: Tuesday, March 10th, 2009
Time: 15:00 - 16:00 EET
Digitising old books and manuscripts is of immense importance to librarians, academics, publishers and many others. The task is carried out by scanning the documents and then applying Optical Character Recognition (OCR) to obtain text that can be stored, searched and indexed. Quite often the original paper copies are of poor print quality, so the resulting digital texts contain errors. Consequently, any attempt at exact pattern matching will often fail, and algorithms for approximate pattern matching must be used instead, which identify matches that are similar (rather than identical) to the pattern. There are several ways of defining text similarity, but they fail to capture the specific nature of the errors that occur in OCR texts. In this talk I will present a recently developed similarity measure that is tailored specifically to this purpose. In particular, it takes into account the optical similarity of characters, as well as matches between combinations of characters, to yield better approximate matching. Early implementations suggest that the method is promising, and there are a number of variants worth exploring in the future.
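
To give a flavour of the idea, the following is a minimal sketch of approximate matching with OCR-aware costs. It is not the similarity measure presented in the talk, whose definition is not given in this abstract. It is a standard dynamic-programming search (edit distance with a free starting position in the text), extended under two assumptions: substituting visually similar characters is cheaper than substituting arbitrary ones, and certain character groups may replace one another (for example "rn" read as "m"). The cost tables, their values and the function names are all hypothetical and chosen only for illustration.

```python
# Hypothetical costs: visually similar single characters are cheap to confuse.
SIMILAR_CHARS = {("l", "1"): 0.2, ("o", "0"): 0.2, ("e", "c"): 0.4}

# Hypothetical costs for confusions between character groups.
SIMILAR_GROUPS = {("m", "rn"): 0.3, ("w", "vv"): 0.3, ("d", "cl"): 0.3}

# Make the group table symmetric: (pattern side, text side) -> cost.
GROUP_COSTS = {}
for (a, b), cost in SIMILAR_GROUPS.items():
    GROUP_COSTS[(a, b)] = cost
    GROUP_COSTS[(b, a)] = cost


def sub_cost(a, b):
    """Cost of reading character a as character b (1.0 if unrelated)."""
    if a == b:
        return 0.0
    return SIMILAR_CHARS.get((a, b), SIMILAR_CHARS.get((b, a), 1.0))


def find_matches(pattern, text, max_cost):
    """Return (end, cost) pairs: some substring of text ending just before
    index `end` matches the pattern with total cost <= max_cost."""
    m, n = len(pattern), len(text)
    # dist[i][j]: cheapest alignment of pattern[:i] with a substring of
    # text that ends at position j. Row 0 is all zeros, so a match may
    # start anywhere in the text.
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = float(i)  # pattern characters with nothing to match
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(
                dist[i - 1][j - 1] + sub_cost(pattern[i - 1], text[j - 1]),
                dist[i - 1][j] + 1.0,  # pattern character missing in text
                dist[i][j - 1] + 1.0,  # spurious character in text
            )
            # Group confusions: pattern ends with p where text ends with t.
            for (p, t), cost in GROUP_COSTS.items():
                if (i >= len(p) and j >= len(t)
                        and pattern.startswith(p, i - len(p))
                        and text.startswith(t, j - len(t))):
                    best = min(best, dist[i - len(p)][j - len(t)] + cost)
            dist[i][j] = best
    return [(j, dist[m][j]) for j in range(1, n + 1) if dist[m][j] <= max_cost]


if __name__ == "__main__":
    # "modern" misread as "rnodem": two group confusions (m/rn and rn/m).
    print(find_matches("modern", "the rnodem age", 1.0))
```

With unit costs alone, "rnodem" would be several edits away from "modern"; in this sketch the two group confusions together cost 0.6, so the occurrence is reported under a threshold of 1.0. In practice the costs would have to be estimated from real OCR output rather than fixed by hand.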

Short Bio:
Dr. Manolis Christodoulakis received his BSc from the Department of Computer Engineering and Informatics, University of Patras, and his PhD from the Department of Computer Science at King's College London. He has previously worked as a Research Associate and later as an External Lecturer at King's College. Since September 2007, he has served as a Lecturer in the Secure Systems and Software Development (SD) field, in the School of Computing, Information Technology & Engineering at the University of East London. His research interests include the design and analysis of combinatorial algorithms, sequence analysis (pattern matching, repetition finding, etc.), computational biology/bioinformatics, and computational music analysis.

