To OCR or Not to OCR . . . ?

Optical Character Recognition, or OCR for short, is a type of software designed to extract text from images (for example, digitized images of your rollfilm) and output it to a file such as a PDF or text file. Creekside Digital often runs OCR on digitized rollfilm and creates searchable PDF files. But will it work with YOUR film, and is it worth the extra cost?

While OCR works very well with typewritten and printed text, the technology is currently very limited in its ability to recognize handwritten (cursive) script, particularly on older documents of dubious quality. The one exception is engineering drawings and architectural diagrams; we’ve found that quite often, the consistent “block” handwriting on such documents OCRs very well. Other commonly OCR’d documents include newspapers and parts catalogs on microfiche. Most commonly, Creekside will deliver PDF files with an invisible layer of text underneath the document image, which may be copied and pasted, searched, indexed, etc., just like any other office document. We can also provide other formats such as text files and spreadsheets for custom applications; just ask us. We scan all film to be OCR’d in grayscale at 300 dpi if possible (as recommended by the publisher of the OCR software).

How about printed documents in other languages? Sure. Our server-based OCR engine recognizes a total of 184 languages, so newspapers and other printed documents in non-English languages are fair game.

How accurate is the OCR process? That depends on the quality of your source microfilm. If the document images are very clean and in a more modern font or typeface that’s very easy for the OCR engine to recognize, it’s not uncommon to see accuracy close to 100%. Accuracy may drop depending on several factors: film sharpness, the reduction ratio and resolution of the scans (the smaller the frames, the less image information there is to recognize), and how “clean” the documents are (folds, lines, and shadows across letters can confuse the OCR engine). Having said that, Creekside Digital has successfully OCR’d microfilm of newspapers that are more than 150 years old, with excellent results.
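For readers who want to put a number on "accuracy" for their own sample frames, one common approach is to compare the OCR output against a hand-typed transcription of the same text and count character-level errors. The sketch below is only an illustration of that idea, not a description of how Creekside Digital or any particular OCR engine measures accuracy; the function names `edit_distance` and `character_accuracy` and the sample strings are made up for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Fraction of the reference text recovered correctly (0.0 to 1.0).
    'reference' is a hand-typed transcription (an assumption of this sketch)."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # Hypothetical sample: the OCR engine misread two letters as digits.
    truth = "Creekside Digital"
    ocr = "Creeks1de Digita1"
    print(f"{character_accuracy(truth, ocr):.1%}")
```

Checking a handful of representative frames this way (a clean one, a faded one, a folded one) gives a rough sense of how much the factors above matter for a particular roll of film.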

So while 11th-century manuscripts and handwritten meeting minutes may not OCR very well, if you need to quickly find numbers, names, and other data within printed documents, optical character recognition is the way to go. It’s as easy as checking a box on our Small Order Form. (And if you need handwritten documents converted to electronic, searchable text, ask us about our data entry and manual indexing services.)
