To OCR or Not to OCR . . . ?
Optical Character Recognition, or OCR for short, is a type of software designed to extract text from images (for example, digitized images of your rollfilm) and output it to a file such as a PDF or text file. Creekside Digital often runs OCR on digitized rollfilm and creates searchable PDF files. But will it work with YOUR film, and is it worth the extra cost?
While OCR works very well with typewritten and printed text, the technology is currently very limited in its ability to recognize handwritten (cursive) script, particularly on older documents of dubious quality. The one exception is engineering drawings and architectural diagrams: we’ve found that the consistent “block” handwriting on such documents often OCRs very well. Other commonly OCR’d documents include newspapers and parts catalogs on microfiche.
Most commonly, Creekside will deliver PDF files with an invisible layer of text underneath the document image. That text may be copied and pasted, searched, indexed, etc., just like the text in any other office document. We can also provide other formats such as text files and spreadsheets for custom applications; just ask us. Whenever possible, we scan film to be OCR’d in grayscale at 300 dpi, as recommended by the publisher of the OCR software.
How about printed documents in other languages? Sure. Our server-based OCR engine recognizes a total of 184 languages, so newspapers and other printed documents in non-English languages are fair game.
How accurate is the OCR process? That depends on the quality of your source microfilm. If the document images are very clean and set in a modern typeface that the OCR engine recognizes easily, it’s not uncommon to see accuracy close to 100%. Accuracy may drop depending on several factors: film sharpness; the reduction ratio and resolution of the scans (the smaller the frames, the less image information there is to recognize); and how “clean” the documents are (folds, lines, and shadows across letters can confuse the OCR engine). Having said that, Creekside Digital has successfully OCR’d microfilm of newspapers that are more than 150 years old, with excellent results.
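If you’d like to put a number on accuracy for a sample of your own film, one common approach is to hand-type a short passage from a frame and compare it with the OCR output character by character. The sketch below is only an illustration, not a description of how Creekside or any particular OCR engine measures accuracy. The function names (edit_distance, character_accuracy) and the sample strings are hypothetical, and the formula assumed here is one minus the edit distance divided by the length of the hand-typed reference.

```python
def edit_distance(reference: str, recognized: str) -> int:
    """Levenshtein distance: the fewest single-character insertions,
    deletions, and substitutions that turn `reference` into `recognized`."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the recognized text.
    previous = list(range(len(recognized) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, rec_char in enumerate(recognized, start=1):
            cost = 0 if ref_char == rec_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from OCR output
                current[j - 1] + 1,      # extra character in OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, recognized: str) -> float:
    """Assumed metric: 1 - (edit distance / reference length), floored at 0."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # Hypothetical sample: hand-typed text vs. what an OCR engine might return
    # when a fold or shadow crosses a few letters.
    typed = "The quick brown fox"
    ocr_output = "The qu1ck brovvn fox"
    print(f"Character accuracy: {character_accuracy(typed, ocr_output):.1%}")
```

A check like this on a few representative frames (a clean one, a faded one, one with folds) gives a rough sense of how searchable the finished PDF will be before you commit to OCR on an entire roll.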
So while 11th-century manuscripts and handwritten meeting minutes may not OCR very well, if you need to quickly find numbers, names, and other data within printed documents, optical character recognition is the way to go. It’s as easy as checking a box on our Small Order Form. (And if you need handwritten documents converted to searchable electronic text, ask us about our data entry and manual indexing services.)