To OCR or Not to OCR . . . ?

Optical Character Recognition, or OCR for short, is a type of software designed to extract text from images (for example, digitized images of your rollfilm) and output it to a file such as a PDF or text file. Creekside Digital often runs OCR on digitized rollfilm and creates searchable PDF files. But will it work with YOUR film, and is it worth the extra cost?

While OCR works very well with typewritten and printed text, the technology is currently very limited in its ability to recognize handwritten (cursive) script, particularly on older documents of dubious quality. One exception is engineering drawings and architectural diagrams; we’ve found that the consistent “block” handwriting on such documents often OCRs very well. Other commonly OCR’d documents include newspapers and parts catalogs on microfiche. Most commonly, Creekside will deliver PDF files with an invisible layer of text underneath the document image, which may be copied and pasted, searched, indexed, etc., just like any other office document. We can also provide other formats such as text files and spreadsheets for custom applications; just ask us. We scan all film to be OCR’d in grayscale at 300 dpi if possible (as recommended by the publisher of the OCR software).

How about printed documents in other languages? Sure. Our server-based OCR engine recognizes a total of 184 languages, so newspapers and other printed documents in non-English languages are fair game.

How accurate is the OCR process? That depends on the quality of your source microfilm. If the document images are very clean and set in a modern typeface that the OCR engine recognizes easily, it’s not uncommon to see accuracy close to 100%. Accuracy may drop depending on several factors: film sharpness, the reduction ratio and resolution of the scans (the smaller the frames, the less image information there is to recognize), and how “clean” the documents are (folds, lines, and shadows across letters can confuse the OCR engine). Having said that, Creekside Digital has successfully OCR’d microfilm of newspapers that are more than 150 years old, with excellent results.
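If you'd like to put a number on accuracy for a sample of your own film, one common approach is to hand-type a short passage and compare it against the OCR output character by character. The Python sketch below is only an illustration of that idea, not a description of how Creekside or any particular OCR engine measures accuracy; the function names and the sample strings are made up for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Fraction of reference characters recovered correctly,
    defined here as 1 - (edit distance / reference length)."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # Hypothetical sample: a hand-typed reference and what an OCR pass returned.
    typed_by_hand = "Creekside Digital"
    from_ocr = "Creeksid3 Digital"
    print(f"{character_accuracy(typed_by_hand, from_ocr):.1%}")
```

A few representative frames checked this way (a clean one, an average one, and a poor one) usually give a fair picture of what to expect from the rest of the roll.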

So while 11th-century manuscripts and handwritten meeting minutes may not OCR very well, if you need to quickly find numbers, names, and other data within printed documents, optical character recognition is the way to go. It’s as easy as checking a box on our Small Order Form. (And if you need handwritten documents converted to searchable electronic text, ask us about our data entry and manual indexing services.)
