Adobe Acrobat: “Renderable Text”

When working with PDF documents you may encounter a “renderable text” error message.  This message will sometimes occur when trying to make a scanned paper PDF file text searchable (also known as adding OCR to a document).

Depending on the version of Acrobat you have, the message may read something like:

[Image: error message]

“Renderable text” is typically text that has been added to a scanned paper image (like a header, footer or Bates number) through a non-Acrobat program.  The way this text is encoded into the page can cause Acrobat to disallow additional searchable text (OCR text).

This message can certainly be annoying and it can also be significant as it can limit your ability to run searches.  In Acrobat, you will be unable to add new searchable OCR text, or improve the quality of the existing OCR, until the error is fixed.

If you’ve seen this message before, and have tried to fix the document without success, you are not alone!  We have spoken with a number of people over the years who have come up with some creative solutions.  Though we have yet to find “one solution” that will always fix this particular error, here are a number of possible solutions (results will vary depending on the cause of the error):

Solution 1: Obtain a version of the document with OCR.

  • It may seem simplistic, but if you receive documents without searchable OCR, ask for it.  Often the person or organization that gave it to you will want to search the files themselves and may already have a copy that has been OCR’ed.  Even if the documents they give you generate “renderable text” error messages, you will still be able to search any of the existing OCR text within the files.

Solution 2: If the files are from PACER / ECF, download a new copy.

  • The default download settings in PACER / ECF will add “purple” headers with the case number (which will cause a “renderable text” error message).  If you can find the document again in PACER / ECF, download it with the header option turned off.

Solution 3: Run “Add Tags to Document” (available in Acrobat Pro).

  • If you have Acrobat Pro installed, there is a special “Accessibility” menu where you can run “Add Tags to Document”.  For certain PDFs, running this option will clear up the issue and allow the document OCR to be run.

Solution 4: Print the document to PDF (available in Acrobat Standard and Acrobat Pro).

  • If you have Acrobat installed (Standard or Pro) you’ll probably also have access to an “Acrobat PDF” virtual printer.  By printing the document to this virtual printer, the new PDF that is created will often avoid having the renderable text issue.

Solution 5: “Sanitize” the document then rerun OCR (available in Acrobat Pro).

  • From the “Protection” menu run “Sanitize Document”.  This will remove all of the document metadata including some of the rendered text that might be causing the error.
  • Re-run the OCR process.

Solution 6: Convert to TIFF files and back, and then re-run OCR (available in Acrobat Standard and Acrobat Pro).

  • Open the PDF document in Acrobat and choose “File > Save As”.
  • In the “Save As” dialog box, choose TIFF (*.tif, *.tiff) from the Save As Type (Windows) or Format (Mac OS) pop-up menu. Specify a location, and then click Save.  Acrobat saves each page of the PDF document as a separate, sequentially numbered TIFF file.
  • Combine the single pages back into a multipage document and re-run the OCR process.
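One thing to watch when recombining the pages is their order: a plain alphabetical sort can place page 10 ahead of page 2.  The sketch below shows one way to put the exported files back into page order.  The file names are hypothetical (the exact naming pattern Acrobat uses is not assumed here), and the only assumption is that the page number is the last run of digits in each name.

```python
import re

def page_number(filename):
    """Return the last run of digits in the file name as an integer."""
    stem = filename.rsplit(".", 1)[0]  # drop the extension
    runs = re.findall(r"\d+", stem)
    if not runs:
        raise ValueError(f"no page number found in {filename!r}")
    return int(runs[-1])

def in_page_order(filenames):
    """Sort exported page files by page number rather than alphabetically."""
    return sorted(filenames, key=page_number)

# Hypothetical file names, listed out of order on purpose.
pages = ["brief_Page_10.tiff", "brief_Page_2.tiff", "brief_Page_1.tiff"]
print(in_page_order(pages))
```

The sorted list can then be handed, in that order, to whatever tool you use to combine the pages.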

Solution 7: Convert to XPS file format and back, and then re-run OCR.

  • If your computer has the “XPS” virtual printer installed (it comes with many versions of MS Office) then print the file using the “Microsoft XPS Document Writer” printer.
    • The XPS printer will ask you to save the file.
    • Convert the saved XPS file to PDF.
    • Re-run the OCR process on the new PDF.

Solution 8: Try running the OCR using a different program.

Scanned Paper (part IV)

Objective Coding:

“Who, What, When” will help you figure out “Where and How”

One definition of the word objective I found online is:

“uninfluenced by emotions or personal prejudices; presented factually”. 

This is accurate when it comes to the objective coding of documents in the not-so-objective world of litigation.  While you may believe that having OCR for .pdfs or associated text for .tiffs gives you the searching capabilities that you need, having documents objectively coded will really allow you to refine those searches and home in on the specific documents you are looking for. 

OCR and associated text allow you to search for keywords through the entire text of the document, whereas objective coding allows you to choose specific fields where the name, word or date you are looking for exists.  You can also create a list of document types (e.g., Email, Financial Record, Police Report, Memo, etc.) specific to your case so that you can identify a specific subset of documents for review.  You can also combine information found in different fields to refine your search even further. 

The standard objective coding fields are:

  1. Author
  2. Recipient
  3. Copyee
  4. Date
  5. Title
  6. Document Type

________________________________________________________________________

For example:

Your search results for a document authored by “John Smith” on “January 1, 2010” would differ tremendously depending on whether you used OCR or objective coding to run your search. 

If you only used OCR for your search, you would find not only every document authored by “John Smith” but every document in which his name appears.  You would also retrieve every document in which “January 1, 2010” appears, even if it was simply mentioned in the body of the text.  This could result in an unwieldy set of results and not really help you identify the particular subset of documents you are looking for. 

However, if your documents were objectively coded, you could simply search in the Author field for “John Smith” and in the Date field for “January 1, 2010” and find any documents that specifically fit those criteria. 

________________________________________________________________________
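The John Smith example above can be sketched in a few lines of code.  This is only an illustration: the three documents, their IDs and the field names are made up, and a real review platform would store and search coded fields in its own way.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CodedDocument:
    doc_id: str
    author: str      # objective coding field
    doc_date: date   # objective coding field
    doc_type: str    # objective coding field
    text: str        # OCR or associated text

# Hypothetical documents for illustration only.
DOCUMENTS = [
    CodedDocument("DOC-001", "John Smith", date(2010, 1, 1), "Memo",
                  "From: John Smith. Date: January 1, 2010. Please see the budget."),
    CodedDocument("DOC-002", "Mary Jones", date(2010, 3, 15), "Email",
                  "John Smith asked about the budget on January 1, 2010."),
    CodedDocument("DOC-003", "John Smith", date(2010, 2, 2), "Email",
                  "From: John Smith. Following up on my memo."),
]

def full_text_search(documents, phrase):
    """Return IDs of documents whose text contains the phrase anywhere."""
    needle = phrase.lower()
    return [d.doc_id for d in documents if needle in d.text.lower()]

def coded_search(documents, author, doc_date):
    """Return IDs of documents whose Author and Date fields both match."""
    return [d.doc_id for d in documents
            if d.author == author and d.doc_date == doc_date]

print(full_text_search(DOCUMENTS, "John Smith"))
print(coded_search(DOCUMENTS, "John Smith", date(2010, 1, 1)))
```

The text search returns all three documents because the name appears somewhere in each of them, while the coded search returns only DOC-001, the one document John Smith authored on that date.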

If you don’t get objective coding with your discovery, ask for it.  Objective coding contains objective information – facts, not opinions or ideas about a document.  Opposing counsel would not be revealing any information about their case, about their case strategy or about the strengths and weaknesses of the discovery by sharing any objective coding they have done.  No privilege would be breached and no attorney work product would be turned over.  Rather, a win-win situation is created when the cost of capturing factual information that will equally help both parties organize and review the discovery is shared. 

Up next:

Part V: Running a Document Inventory

Scanned Paper (part III)

Unitization:

Where One Document Ends, Another Always Begins

When you receive a set of scanned documents as part of your discovery, you should be able to visualize how those documents were kept in the original custodian’s desk drawer.  You should be able to identify which documents were kept together within a file folder or binder and where one document ends and the next begins.  Being able to recognize the order and organization of your discovery means that the documents were properly unitized.  

Having properly unitized documents is key to being able to effectively review scanned discovery.  You can efficiently move from one document to the next, as well as get a sense of how the documents relate to each other.  It is almost impossible to work with just loose pages, so you should always ask for discovery to be produced to you with its proper unitization. 

Keep in mind there are two types of unitization: 

1. Physical Breaks:

A document can simply be defined by its physical breaks.  This includes staples, paper clips, binders, folders, etc.  Unitization by physical breaks is usually done at the time of scanning, as the scanning operator is able to see where the breaks exist.  If you choose to unitize documents by their physical breaks, no relationships between documents are captured but it will be clear where one document ends and the next begins. 

2. Logical Document Determination (LDD):

What is logical about a stack of paper that has sat in somebody’s desk for years, you might ask?  Whether we want to admit it or not, the way those documents were kept is often a major part of the story a litigation team is trying to tell.  A common way to describe documents that are related is to say they are part of a family of documents. 

For example:

If you know that a spreadsheet was clipped to a memo, even though the memo made no mention of any attached spreadsheet, you have learned a telling piece of information about the relationship between those documents.

If the documents are given to you with a load file, the load file will act as your roadmap.  Typically, a production of single-page .tiffs, which would look like a huge stack of loose paper if printed, is accompanied by a load file that lays out where the document breaks are.  If the documents have been logically unitized, the load file will also identify the parent-child attachments.
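To make the roadmap idea concrete, here is a minimal sketch of how a load file can turn loose pages back into documents and families.  The three-column layout (image_id, doc_break, parent_id) is a simplified, hypothetical format invented for this illustration; real load files vary by vendor and review platform.

```python
import csv
import io

# Hypothetical load file: one row per page image.
# doc_break = "Y" marks the first page of a document;
# parent_id names the document this one was attached to, if any.
SAMPLE_LOAD_FILE = """\
image_id,doc_break,parent_id
ABC0001,Y,
ABC0002,,
ABC0003,Y,ABC0001
ABC0004,,
ABC0005,Y,
"""

def read_documents(load_file_text):
    """Group page rows into documents using the document-break flags."""
    documents = []
    for row in csv.DictReader(io.StringIO(load_file_text)):
        if row["doc_break"] == "Y" or not documents:
            # A break flag starts a new document at this page.
            documents.append({"first_page": row["image_id"],
                              "pages": [],
                              "parent": row["parent_id"] or None})
        documents[-1]["pages"].append(row["image_id"])
    return documents

for doc in read_documents(SAMPLE_LOAD_FILE):
    print(doc["first_page"], len(doc["pages"]), "page(s), parent:", doc["parent"])
```

In this sample, the two-page document starting at ABC0003 is recorded as an attachment to the document starting at ABC0001, much like the spreadsheet clipped to the memo.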

If the documents are not unitized when you receive them, you may want to contact the source and ask them for a unitized set.  If the source does not have a unitized set, the best option is typically to contact a litigation vendor who is familiar with the process of unitization.  They usually have teams of people trained in using software specifically designed to create document breaks as well as identify document families.

Up next:

Part IV: Objective Coding – “Who, What, When” will help you figure out “Where and How”

Scanned Paper (part II)

Searchable Scanned Documents
Can you Really Find that Needle in the Haystack?

Searchable PDFs

PDFs made from scanned paper are typically just images. Hard to believe, when we think about all the things we can do with a PDF, including using it to create forms that can be filled out electronically and having it act as a container to hold a variety of other document formats.  Scanned paper only becomes searchable if it has been OCRed.

What is OCR, you might ask?

OCR stands for Optical Character Recognition.  I know that’s not of much help, but two of my goals are to provide you with interesting dinner conversation and obscure words to answer the latest Geek version of Trivial Pursuit.  The third, and most practical, goal is to allow you to communicate amongst your team and with vendors about getting your PDFs into a searchable format.

The process of having a document OCRed is actually quite miraculous and the technical aspects can get quite complex.  What is most important to note is that OCRing a PDF allows the text on the PDF to be captured and layered underneath the image in such a way that the text on the image is now searchable.  However, there is always a chance that the OCR within a document can be of poor quality, making the searching of the text inaccurate.

Why might you get poor quality OCR, you ask? 

Most likely, it has to do with the quality of the document being scanned into PDF, but here are some additional reasons to consider:

  • handwriting
  • the original paper is of poor quality (e.g., a photocopy of a photocopy of a fax)
  • scanned in grayscale instead of black and white
  • scanned at a low resolution
  • black and white documents scanned as color
  • foreign language
  • graphics or lines on the page
  • size of font and type of font on the original document
  • …there are always more…

 

TIFFs with Associated Text Files

TIFFs are a lot like pictures or static images.  Unlike PDFs, the text seen on a TIFF can only be captured in a separate file called an Associated Text File, and the image and the text can only be married together to make the TIFF searchable in applications we like to call Evidence Review Platforms (ERPs). 

TIFFs can come as single page documents or multi-page documents.  If they come as single page documents, the only way you would know where one document ends and the next begins is by looking at a load file that uses document ID numbers to identify those document breaks.  Again, load files are designed to be interpreted in an ERP alongside your set of single page TIFFs, and when all of those pieces of information are put together, you are able to move, within the ERP, from one document to the next as well as search through the captured text. 

However, just like the OCR of a PDF, Associated Text Files can vary in quality depending on the quality of the TIFF, the quality of the program used to create the Associated Text File and many other factors.  With all these moving parts, it is hard to really review TIFFs efficiently without some sort of specialized application – like an ERP.  However, once TIFFs are placed into an ERP, the load file and the Associated Text File fall right into place and allow you to perform text searches and to move fairly easily from one document to the next.
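As a rough picture of what an ERP does when it marries an image to its text, the sketch below searches a folder of single-page TIFFs by reading the text file that sits next to each one.  It assumes, purely for illustration, that each associated text file shares its image's base name (ABC0001.tif and ABC0001.txt); actual productions may link the two through the load file instead.

```python
from pathlib import Path

def search_associated_text(folder, phrase):
    """Return names of TIFF files whose same-named .txt file contains the phrase."""
    needle = phrase.lower()
    hits = []
    for tiff in sorted(Path(folder).glob("*.tif")):
        text_file = tiff.with_suffix(".txt")
        if not text_file.exists():
            continue  # no associated text, so this page cannot be searched
        text = text_file.read_text(encoding="utf-8", errors="replace")
        if needle in text.lower():
            hits.append(tiff.name)
    return hits
```

A page with no text file is simply skipped, which mirrors the point above: without associated text, a TIFF is just a picture and a search will never find it.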

Up next:

Part III: Unitization – Where One Document Ends, Another Begins…

Scanned Paper (part I)

Does a Picture Really Speak a Thousand Words?
(part of an ongoing series about scanned paper)

Whether we like it or not, technology is not only becoming part of the legal world, but oftentimes taking it by storm.  Where we once received paper, we now receive .tiffs, .pdfs and native files.  Where we once could organize the paper in binders or boxes, we now use a combination of tools to view, search, organize and review our much more voluminous and complex set of discovery on our computers because there is simply too much material to print out. 

In adapting to the influx of electronic discovery, we have to realize that not all electronic discovery is created equal.  Native files such as Word documents and Excel spreadsheets come with a host of information about the file as part of their metadata (a topic we will cover later in our series), while paper that has been scanned and turned into electronic discovery in formats such as .tiff and .pdf is really just a set of pictures of the pieces of paper we once clipped, stapled and three-hole punched.  We can now do more to those pieces of paper once they have been scanned, but keep in mind that they are just pictures of the real thing and, unlike the native files we get, these pictures don’t really say as much as we want them to.

 Things to consider: 

  • Is the document searchable?  Is there associated text with the .tiffs and have the .pdfs been OCRed?  If not, should you consider having the documents OCRed?
  • Are the documents unitized?  Do you know where one document ends and the next one begins?  Is there a load file that shows you the document breaks?  If not, should you consider unitization?
  • Was there objective coding done?  Is there a load file that provides you with the objective coding?  If not, should you consider objective coding? 
  • Should you run a document inventory to get a better handle on the various file formats that may be included in the discovery?  Are there color images?  Will you need to take that into consideration when you need to print the documents?  Are there formats that require you to have the associated application in order for you to view the document or database? Are there load files included that may contain objective coding?  
  • Do you already have programs that you are currently using that can handle the viewing, organization and review of scanned paper?  Can they handle one format and not another (e.g., Adobe can handle .pdfs but not .tiffs)?  Do you need to convert your scanned paper into one format that you can handle?  If you don’t already have a program, then what types of programs should you consider?
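For the document inventory question above, a first pass can be as simple as counting the files in the discovery folder by extension.  The sketch below is one minimal way to do that; the folder path in the usage line is a placeholder, and a fuller inventory would look at much more than file extensions.

```python
from collections import Counter
from pathlib import Path

def inventory_by_extension(folder):
    """Count the files under a folder (including subfolders) by file extension."""
    counts = Counter()
    for path in Path(folder).rglob("*"):
        if path.is_file():
            # Lowercase so that .TIF and .tif are counted together.
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts
```

Calling inventory_by_extension("path/to/discovery") on your own folder returns a tally for each extension it finds, which gives a quick sense of which formats you will need to be able to open.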

Up next:

Part II: Searchable Scanned Documents
Can you Really Find that Needle in the Haystack?