The ABCs of OCR Scanning


OCR is going mainstream in Google Docs, but it’s still a tricky process.

A picture is worth a thousand words.  It’s worth a thousand more to an Optical Character Recognition (OCR) program.

With news that search engine giant Google has quietly flipped the switch on OCR file format outputs for Google Docs, we figured casual readers and computer users alike could use a crash course on what OCR is, what it isn’t, and how it works.

Once used primarily by litigation services companies, commercial publishers and companies that generate massive volumes of paper documents, OCR has gone mainstream.  With the introduction of Amazon’s Kindle and other so-called ebook devices that rely on the technology, OCR scanning is faster and more affordable than ever before.


Think of OCR as a scanned PDF file broken down into tiny little pieces, analyzed and processed letter by letter.  Whereas a scanned PDF is sort of a snapshot photo of an entire document saved in a file format we’re all familiar with, an OCR scanning system reads the actual text, individual symbols and numbers inside your document and turns them into characters the software recognizes and then duplicates.  Hence the name, Optical Character Recognition.
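To make the "letter by letter" idea concrete, here is a toy sketch in Python.  It assumes each character has already been cut out of the page as a tiny 3x5 grid of ink and blank pixels, and it picks the known letter whose shape differs in the fewest pixels.  The TEMPLATES table and the function names are made up for illustration; real OCR engines use far more sophisticated models than this.

```python
# Hypothetical 3x5 bitmaps: '#' is ink, '.' is blank paper.
TEMPLATES = {
    "O": ["###", "#.#", "#.#", "#.#", "###"],
    "C": ["###", "#..", "#..", "#..", "###"],
    "R": ["##.", "#.#", "##.", "#.#", "#.#"],
}


def mismatches(glyph, template):
    """Count the pixels where the scanned glyph and the template differ."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the known character whose template is closest to the glyph."""
    return min(TEMPLATES, key=lambda ch: mismatches(glyph, TEMPLATES[ch]))


def read_line(glyphs):
    """Recognize a sequence of glyphs and join them into text."""
    return "".join(recognize(g) for g in glyphs)


# A "C" with one stray speck of ink, as a slightly dirty scan might produce.
noisy_c = ["###", "#..", "#.#", "#..", "###"]
scanned = [TEMPLATES["O"], noisy_c, TEMPLATES["R"]]

print(read_line(scanned))
```

Even with the stray speck, the middle glyph is still closer to "C" than to any other letter, so the sketch reads the three shapes as the text OCR.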

OCR lets you search through stacks of paper using any search criteria you choose.  Attorneys and medical providers who keep racks of court and patient files find OCR particularly useful since it eliminates the need to physically flip through pages to find, say, a patient who fits a certain medical profile.  Just punch “heart disease” into the computer that hosts the OCR files, click, and all the heart disease patients a doctor is treating appear there on the screen.  PDFs aren’t as intelligent.  Sure, you can search for particular words or phrases in a single PDF document and find them quickly, but entire warehouses of similar documents don’t play well with the simple PDF search box.
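Searching a whole file room at once might look something like the Python sketch below.  It assumes the OCR step has already turned each scanned file into plain text; the file names, the sample text and the helper functions are all hypothetical.  The idea is a simple word index that maps every word to the files containing it, so a query only has to look up its words instead of rereading every page.

```python
from collections import defaultdict

# Hypothetical OCR output: file name -> recognized text.
ocr_text = {
    "patient_001.txt": "History of heart disease. Prescribed beta blockers.",
    "patient_002.txt": "Seasonal allergies. No heart complaints.",
    "patient_003.txt": "Family history of heart disease and diabetes.",
}


def words(text):
    """Split text into lowercase words with surrounding punctuation removed."""
    stripped = (w.strip(".,;:!?").lower() for w in text.split())
    return [w for w in stripped if w]


def build_index(docs):
    """Map each word to the set of files whose text contains it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in words(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the files that contain every word in the query."""
    terms = words(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


index = build_index(ocr_text)
print(search(index, "heart disease"))
```

Note that this sketch matches files containing all of the query words anywhere in the text, not the exact phrase, which is a simplification.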

Digestible Data

Aside from the format and search differences, OCR works best for text-only documents or those with high-resolution images.  OCR literally looks through your file to identify letters it recognizes.  If you have a page you’ve scanned or housed in Google Docs that contains low-resolution photos taken with a disposable camera or a lot of illegible handwriting scrawled across it, the OCR system could generate a bunch of unintelligible words.  But there’s another important benefit OCR has over PDFs.


You can’t go in and edit a PDF file unless you have a PDF editor such as Adobe Acrobat installed on your computer and the document is unlocked.  With OCR, you’re creating an entirely editable file.  When you run scanned printed pages through OCR, you end up with a text file, and good OCR software tries to match the font the document was created in.  So you can edit away, making corrections and improvements whenever you like.

Garbage In, Garbage Out

OCR is an image recognition challenge, and creating an accurate text file is much, much harder than saving a simple scan.  The software has to identify the shapes of letters and the separation between words, and then try to strike a balance between words it can’t recognize and misspellings.  So if you’re scanning something official or important, OCR scanning and indexing is best left to the professionals who have quality control measures in place to create accurate files every time.
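That balancing act between misspellings and unrecognizable words can be sketched in a few lines of Python.  The example below assumes a small made-up vocabulary and uses the standard library’s difflib module to snap a misread word to its closest known word; anything with no close match is flagged for a human to review rather than guessed at.  The VOCABULARY list, the function names and the 0.75 cutoff are illustrative assumptions, not how any particular OCR product works.

```python
import difflib

# Hypothetical vocabulary; a real system would use a much larger word list.
VOCABULARY = ["patient", "heart", "disease", "history", "of", "the", "has"]


def clean_word(word, cutoff=0.75):
    """Return (best_guess, needs_review) for one OCR-produced word."""
    lower = word.lower()
    if lower in VOCABULARY:
        return lower, False
    # Look for the single closest vocabulary word above the similarity cutoff.
    close = difflib.get_close_matches(lower, VOCABULARY, n=1, cutoff=cutoff)
    if close:
        return close[0], False
    # Nothing similar enough: keep the raw text and flag it for a person.
    return word, True


def clean_line(raw):
    """Correct each word in a line and collect the words needing review."""
    cleaned, flagged = [], []
    for word in raw.split():
        guess, needs_review = clean_word(word)
        cleaned.append(guess)
        if needs_review:
            flagged.append(word)
    return " ".join(cleaned), flagged


text, review = clean_line("Patient has hcart diseasc x9#q")
print(text)
print(review)
```

Raising the cutoff makes the sketch more cautious, so more words get flagged for review; lowering it corrects more aggressively at the risk of "fixing" a word into the wrong one.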

To find out more about OCR Scanning, contact CopyScan Technologies today!
