Acrobat Training Guide – Text Recognition

Editor’s note: this is an update on the Acrobat Training Videos – Text Recognition video post. A related post is Three types of PDFs.

Introduction

This is a brief guide to the text recognition feature in Adobe Acrobat [1]. OCR, which stands for Optical Character Recognition, is a process that adds an invisible text layer to scanned paper documents or screenshots to make them text searchable. While OCR can be very helpful for search, it is not perfect. The computer is interpreting pictures of letters and characters in documents and attempting to turn them into text. Sometimes those interpretations are incorrect (Figures 1 and 2).

Figure 1.
Figure 2.

The quality of the OCR text depends on many factors, including the accuracy of the source document, its complexity and structure, font and language variations, and the sharpness of the scan. For example, a document with clear, large print (Figure 3) will generally OCR better than a fax copy with blurry text or handwriting (Figure 4).

Figure 3.
Figure 4.

With newer iterations of Adobe Acrobat, the OCR text accuracy has improved. When working with sets of scanned paper documents that were processed with older OCR engines, some people will spot check the accuracy of the OCR by running simple searches. Time permitting, they may then choose to re-OCR the documents. This can lead to more accurate searchable text.
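The spot check described above can also be sketched in code. This is a minimal illustration, not a feature of Acrobat: it assumes the OCR text has already been exported to a plain string by some other means, and the function name `spot_check` and the sample text are hypothetical.

```python
import re


def spot_check(ocr_text, expected_terms):
    """Count case-insensitive whole-word hits for each expected term."""
    counts = {}
    for term in expected_terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, ocr_text, flags=re.IGNORECASE))
    return counts


# Hypothetical sample: "rnerno" is the kind of misreading OCR can produce for "memo".
sample = "MEMO to all staff. This rnerno replaces the memo dated May 1."
print(spot_check(sample, ["memo", "staff", "invoice"]))
```

A term that we can see on the page but that returns zero or very few hits is a hint that the document may be worth re-OCRing.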

A good practice when working with scanned paper PDF documents is to first make a copy of them. For example, if we received a flash drive or a download from USAfx of scanned paper PDF files, it’s a good idea to first copy the files to a location on a computer or a network drive. This way we can work with the documents and add OCR text when needed, while still maintaining a set of the original files.
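For those who prefer to script the copy step, here is a minimal sketch using Python's standard library. The function name `make_working_copy` and the example paths are assumptions for illustration only; copying the folder by hand in the file browser works just as well.

```python
import shutil
from pathlib import Path


def make_working_copy(source_dir, working_dir):
    """Copy source_dir to working_dir, refusing to overwrite an existing copy."""
    source = Path(source_dir)
    target = Path(working_dir)
    if target.exists():
        # Never write over an earlier working set by accident.
        raise FileExistsError(f"{target} already exists; choose a new location")
    shutil.copytree(source, target)  # copies all files and subfolders
    return target


# Hypothetical usage (paths are placeholders):
# make_working_copy("E:/production_001", "C:/cases/working/production_001")
```

The originals are only read, never modified, so the source folder remains an untouched reference set.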

With a copy of one document open, the next step is to see whether the document is already searchable. When we open a PDF file, we are looking at an image of the document. Since the OCR text layer is invisible, we will not know whether it is searchable just by looking at it. There are a few things we can do to see if OCR is present.

If we go to the ‘Edit’ menu and choose ‘Select All’, or use the keyboard shortcut ‘Control A’ (Figure 5), and we get a no text characters warning message (Figure 6), this indicates that there is no searchable text. Alternatively, if we single click with the mouse in a blank area of the page and the entire page turns blue, it also means there is no searchable text layer.

Figure 5.
Figure 6.

We can also try to find a word on the page using one of the search features in Acrobat. For example, when we run a find for the word ‘memo’, we get the same no text characters warning that we got by going to the ‘Edit’ menu and choosing ‘Select All’ (Figure 7).

Figure 7.

Starting the Text Recognition Process

To add an OCR text layer to a document, go to the Tools menu and click on the ‘Scan & OCR’ button (Figure 8). When you activate this tool in Acrobat, an additional menu bar will appear at the top of the page. Choose the ‘In This File’ option (Figure 9). In most circumstances we will go with the ‘All pages’ default. Click on the blue ‘Recognize Text’ button to begin the process (Figure 10). A progress indicator will appear on the bottom right-hand side as it processes each page. Acrobat will also automatically rotate pages, based on the optimal rotation for the text on that page (Figure 11).

Figure 8.
Figure 9.
Figure 10.
Figure 11.

While the speed at which Acrobat can OCR documents can vary depending on the complexity of the documents and the type of computer being used, a good general estimate is about 1000 pages per hour. With particularly large OCR jobs, you might want to wait until the end of the day to begin the process. Some offices have also set up a spare computer, dedicated to running various processes such as OCR, so nobody’s computer is tied up.
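As a rough planning aid, the estimate above can be turned into simple arithmetic. The sketch below assumes the guide's ballpark of about 1000 pages per hour; the function name `estimate_ocr_hours` is hypothetical, and the actual speed will vary with the documents and the computer.

```python
PAGES_PER_HOUR = 1000  # rough estimate from this guide; actual speed varies


def estimate_ocr_hours(page_count, pages_per_hour=PAGES_PER_HOUR):
    """Return the approximate number of hours needed to OCR page_count pages."""
    if pages_per_hour <= 0:
        raise ValueError("pages_per_hour must be positive")
    return page_count / pages_per_hour


# Example page volumes (made up for illustration).
for pages in (250, 1000, 12000):
    print(f"{pages} pages: about {estimate_ocr_hours(pages):.1f} hours")
```

A job estimated at more than a few hours is a good candidate for an overnight run or for a spare, dedicated computer.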

When the Text Recognition Process is Complete

When the OCR process is complete, we can go back to the first page to make sure the document is now searchable. If we go back to the ‘Edit’ menu and choose ‘Select All’, the text on the document will now be highlighted in blue while the blank areas surrounding the text will remain white (Figure 12). A single click in a blank area no longer turns the whole page blue. If we search for the word ‘memo’ again, using the find option, we will get a set of search results with the first hit on the first page of the document highlighted in blue (Figure 13).

Figure 12.
Figure 13.

Since we have now changed the document by adding an OCR layer to it, save the file so we lose none of the work we have just done.

Text Recognition in Multiple Files

We can also run the OCR process across multiple documents, by going to our OCR tool menu (Figure 14) and selecting ‘Or recognize text in multiple files’ (Figure 15). This is a handy option, as we often receive batches of documents that might need to be OCR’d.

Figure 14.
Figure 15.

You can choose to OCR an entire set of PDF files in a folder by selecting ‘Add Folder’ (Figure 16) and then navigating to where that folder is on your computer or on the server. By default, Acrobat will include all PDFs and subfolders within the selected folder (Figure 17).

Figure 16.
Figure 17.
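Because Acrobat includes all PDFs and subfolders within the selected folder by default, it can help to preview what a folder contains before starting a batch. The following is a minimal sketch using Python's standard library; it is independent of Acrobat, and the function name `find_pdfs` is hypothetical.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return every PDF under folder, including subfolders, sorted by path."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Hypothetical usage (the path is a placeholder):
# for pdf in find_pdfs("C:/cases/working/production_001"):
#     print(pdf)
```

Matching on the lowercased suffix means files named `.PDF` are listed too, which is common with scanner output.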

When running the OCR process on multiple files, we are prompted to choose where to save the files before the OCR runs. Most users choose to save the files in the same folder selected at the start, with the original file names (Figure 18). Acrobat will also launch a progress bar for this process (Figure 19).

Figure 18.
Figure 19.

Estimate the page volume and, if it is a large job, run the process at a break or at the end of the day. The Acrobat help guide (https://helpx.adobe.com/acrobat/user-guide.html) is a great resource if you are interested in learning more about the OCR process.

  1. The free Adobe Acrobat Reader software does not include the ability to OCR documents.