To OCR or Not to OCR . . . ?

Optical Character Recognition, or OCR for short, is a type of software designed to extract text from images (for example, digitized images of your rollfilm) and output it to a file such as a PDF or text file. Creekside Digital often runs OCR on digitized rollfilm and creates searchable PDF files. But will it work with YOUR film, and is it worth the extra cost?

While OCR works very well with typewritten and printed text, the technology is currently very limited in its ability to recognize handwritten (cursive) script, particularly on older documents of dubious quality. The one exception is engineering drawings and architectural diagrams; we’ve found that the consistent “block” handwriting on such documents quite often OCRs very well. Other commonly OCR’d documents include newspapers and parts catalogs on microfiche.

Most commonly, Creekside will deliver PDF files with an invisible layer of text underneath the document image, which may be copied and pasted, searched, indexed, etc., just like any other office document. We can also provide other formats such as text files and spreadsheets for custom applications; just ask us. We scan all film to be OCR’d in grayscale at 300 dpi if possible (as recommended by the publisher of the OCR software).

How about printed documents in other languages? Sure. Our server-based OCR engine recognizes a total of 184 languages, so newspapers and other printed documents in non-English languages are fair game.

How accurate is the OCR process? That depends on the quality of your source microfilm. If the document images are very clean and set in a more modern font or typeface that’s very easy for the OCR engine to recognize, it’s not uncommon to see accuracy close to 100%. The accuracy may drop depending on several factors, such as film sharpness, the reduction ratio and resolution of the scans (the smaller the frames, the less image information there is to recognize), and how “clean” the documents are (folds, lines, and shadows across letters can confuse the OCR engine). Having said that, Creekside Digital has successfully OCR’d microfilm of newspapers that are more than 150 years old, with excellent results.
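If you would like to gauge accuracy on your own film before committing to a full order, one rough approach is to hand-check the text of a sample page and compare it against the OCR output. The Python sketch below is only an illustration of that idea, not a description of how Creekside or its OCR software measures accuracy. The function name `character_accuracy` is hypothetical, and the score it returns is a general text-similarity ratio from Python's standard `difflib` module rather than a formal character error rate.

```python
from difflib import SequenceMatcher


def character_accuracy(ocr_text: str, reference_text: str) -> float:
    """Return a rough 0-to-1 similarity score between OCR output and a
    hand-checked transcription of the same page (hypothetical helper)."""
    # Collapse runs of whitespace so line-wrapping differences between
    # the two texts are not counted as recognition errors.
    ocr_norm = " ".join(ocr_text.split())
    ref_norm = " ".join(reference_text.split())

    # Two empty texts are treated as a perfect match.
    if not ocr_norm and not ref_norm:
        return 1.0

    # autojunk=False turns off difflib's heuristic that ignores very
    # frequent characters in long texts, which would skew the score.
    matcher = SequenceMatcher(None, ref_norm, ocr_norm, autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    # Made-up sample strings, for illustration only.
    reference = "Baltimore County, Maryland"
    ocr_output = "Baltirnore County, Maryland"  # "m" misread as "rn"
    print(f"Similarity: {character_accuracy(ocr_output, reference):.3f}")
```

A score near 1.0 on a few representative frames suggests the film is a good OCR candidate; noticeably lower scores usually point back to the factors above, such as soft focus, high reduction ratios, or marks across the text.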

So while 11th-century manuscripts and handwritten meeting minutes may not OCR very well, if you need to quickly find numbers, names, and other data within printed docs, optical character recognition is the way to go. It’s as easy as checking a box on our Small Order Form. (And if you need handwritten documents converted to electronic, searchable text, ask us about our data entry and manual indexing services.)
