Three Types of PDFs


PDFs (portable document format files) are a common file format in federal criminal discovery. But are all PDFs created equal? As you all have experienced, the answer is no, they are not.

Think about PDFs in three distinct categories:

  1. True PDFs;
  2. Image-based PDFs; and
  3. Made-searchable PDFs.

For discovery review, these distinctions are important because they affect whether a PDF is searchable and how accurate your text searches within it will be. With voluminous discovery, the ability to search PDFs is critical for organizing and reviewing the material.

  • True PDFs (also known as text-based or digitally created PDFs). These PDFs are created from software such as Microsoft Word or Excel, either by saving as PDF or by using the “print to PDF” function in those programs. They consist of both text and images. Think of these PDFs as having two layers: one layer is the image and a second layer is the text. The image layer shows what the document will look like if it is printed to paper. The text layer is searchable text that is carried over from the original Word file into the new PDF file (the technical term for this layer is “extracted text”). There is no need to make it searchable, and the new PDF will have the same text as the original Word file. An example of True PDFs that federal defenders and CJA panel attorneys will be familiar with is the pleadings filed in CM/ECF. The pleading is originally created in Word, and the attorney then either saves it as a PDF or prints it to PDF and files that PDF with the court. Either process creates a PDF file with an image layer plus a text layer. In terms of usability, this is the best type of PDF to receive in discovery, as its text searchability will be the closest to that of the original file.
  • Image-based PDFs (also known as image-only PDFs). Image-based PDFs are typically created by scanning paper in a copier, taking photographs or taking screenshots. To a computer, they are images. Though we humans can see text in the image, the file consists only of the image layer and lacks the searchable text layer that True PDFs contain. As a result, we cannot use a computer to search the text we see in the image. Discovery is sometimes produced in image-based PDF format. When you come across image-based PDFs, first ask the U.S. Attorney’s Office what format the file was in originally. Second, ask whether they have it in a searchable format, and specifically whether they have it as a digitally created, text-based True PDF. They may not, as they often receive PDFs from other sources before they provide them to you, but you will want to know the format in which they have it and the original format of the file (as far as they know).
  • Made-searchable PDFs (also known as “OCRed” PDFs). Image-based PDFs can be made text searchable by applying optical character recognition (OCR). CJA panel attorneys frequently use Adobe Acrobat Pro (or other PDF editor software) to make image-based PDFs searchable. During the OCR process, the software program interprets each character on the image as text and adds a text layer to the image layer. Made-searchable PDFs are like True PDFs, but the searchability of the OCRed document will depend on the quality of the image and the recognizability of the writing. Keyword searches of the text are often not 100% accurate.

The ESI Protocol (formally known as the Recommendations for Electronically Stored Information (ESI) Discovery Production in Federal Criminal Cases) noted the limitations of the OCR process on scanned paper:

“Generally speaking, OCR does not handle handwritten text or text in graphics well. OCR conversion rates can range from 50 to 98% accuracy depending on the underlying document. A full page of text is estimated to contain 2,000 characters, so OCR software with even 90% accuracy would create a page of text with approximately 200 errors.”

People often ask how accurate OCR software is. That matters, but the biggest factor in how searchable your OCRed PDF will be is the underlying quality of the scanned image. A clean copy of a pleading will have high accuracy; a twice-photocopied paper school record from the 1950s will be less accurate.
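To make the arithmetic in the ESI Protocol quote concrete, here is a minimal Python sketch. The 2,000-characters-per-page figure comes from the quote; the function names are our own, and the second function assumes that character errors are independent and evenly spread, which real scans will not follow exactly.

```python
def expected_ocr_errors(chars_per_page: int, accuracy: float) -> int:
    """Estimate the number of misread characters on one page.

    accuracy is a fraction between 0.0 and 1.0 (0.90 means 90%).
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0.0 and 1.0")
    return round(chars_per_page * (1.0 - accuracy))


def chance_word_intact(word_length: int, accuracy: float) -> float:
    """Rough chance that every character of a keyword was read correctly.

    Assumption (ours, not the ESI Protocol's): each character is
    misread independently with the same probability.
    """
    return accuracy ** word_length


# Page size taken from the ESI Protocol quote: about 2,000 characters.
for accuracy in (0.50, 0.90, 0.98):
    errors = expected_ocr_errors(2000, accuracy)
    print(f"{accuracy:.0%} accuracy: about {errors} errors per page")

# Under the independence assumption, a 10-letter keyword survives
# 90% character accuracy only about a third of the time.
print(f"{chance_word_intact(10, 0.90):.0%}")
```

The second figure is why a keyword search can miss a name that is plainly visible on the page: one misread character is enough to break the match.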

A quick way to compare the quality of the text layer against the image is to select the text in question in the PDF file (Control + A in Windows or Command + A on a Mac selects all the text on a page), and then copy and paste it into a Word document. Put the two files side by side and visually compare them.
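If you would rather have a computer do part of that comparison, the sketch below uses Python's standard difflib module. It assumes you have pasted the PDF's text into one string and typed (or otherwise obtained) a trusted version of the same passage into another; the function names and sample sentences are made up for illustration.

```python
import difflib


def text_similarity(reference: str, ocr_text: str) -> float:
    """Return a 0.0 to 1.0 similarity ratio between two passages.

    Whitespace is collapsed first so that line wrapping in the
    pasted text does not count as a difference.
    """
    ref = " ".join(reference.split())
    ocr = " ".join(ocr_text.split())
    return difflib.SequenceMatcher(None, ref, ocr, autojunk=False).ratio()


def word_differences(reference: str, ocr_text: str) -> list[tuple[str, str]]:
    """List (reference words, OCR words) pairs wherever the two disagree."""
    ref_words = reference.split()
    ocr_words = ocr_text.split()
    matcher = difflib.SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (" ".join(ref_words[i1:i2]), " ".join(ocr_words[j1:j2]))
            )
    return differences


# Hypothetical passage: what the page says versus what the text layer holds.
reference = "The defendant appeared before the court on January 1, 2010."
pasted = "The defendant appeared before the c0urt on Janvary 1, 2010."

print(f"similarity: {text_similarity(reference, pasted):.2f}")
for expected, found in word_differences(reference, pasted):
    print(f"expected {expected!r} but the text layer has {found!r}")
```

A score near 1.0 suggests the text layer is dependable for keyword searching; the list of mismatches shows which words a search would miss.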


Adobe Acrobat: “Renderable Text”

When working with PDF documents you may encounter a “renderable text” error message.  This message will sometimes occur when trying to make a scanned paper PDF file text searchable (also known as adding OCR to a document).

Depending on the version of Acrobat you have, the exact wording of the message may vary.

“Renderable text” is typically text that has been added to a scanned paper image (like a header, footer or Bates number) through a non-Acrobat program.  The way this text is encoded into the page can cause Acrobat to disallow additional searchable text (OCR text).

This message can certainly be annoying and it can also be significant as it can limit your ability to run searches.  In Acrobat, you will be unable to add new searchable OCR text, or improve the quality of the existing OCR, until the error is fixed.

If you’ve seen this message before, and have tried to fix the document without success, you are not alone!  We have spoken with a number of people over the years who have come up with some creative solutions.  Though we have yet to find “one solution” that will always fix this particular error, here are a number of possible solutions (results will vary depending on the cause of the error):

Solution 1: Obtain a version of the document with OCR.

  • It may seem simplistic, but if you receive documents without searchable OCR, ask for it.  Often the person or organization that gave it to you will want to search the files themselves and may already have a copy that has been OCR’ed.  Even if the documents they give you generate “renderable text” error messages, you will still be able to search any of the existing OCR text within the files.

Solution 2: If the files are from PACER / ECF, download a new copy.

  • The default download settings in PACER / ECF will add “purple” headers with the case number (which will cause a “renderable text” error message).  If you can find the document again in PACER / ECF, download it with the header option turned off.

Solution 3: Run “Add Tags to Document” (available in Acrobat Pro).

  • If you have Acrobat Pro installed, there is a special “Accessibility” menu where you can run “Add Tags to Document”.  For certain PDFs, running this option will clear up the issue and allow OCR to be run on the document.

Solution 4: Print the document to PDF (available in Acrobat Standard and Acrobat Pro).

  • If you have Acrobat installed (Standard or Pro) you’ll probably also have access to an “Acrobat PDF” virtual printer.  By printing the document to this virtual printer, the new PDF that is created will often avoid having the renderable text issue.

Solution 5: “Sanitize” the document then rerun OCR (available in Acrobat Pro).

  • From the “Protection” menu run “Sanitize Document”.  This will remove all of the document metadata including some of the rendered text that might be causing the error.
  • Re-run the OCR process.

Solution 6: Convert to TIFF files and back, and then re-run OCR (available in Acrobat Standard and Acrobat Pro).

  • Open the PDF document in Acrobat and choose “File > Save As”.
  • In the “Save As” dialog box, choose TIFF (*.tif, *.tiff) from the Save As Type (Windows) or Format (Mac OS) pop-up menu. Specify a location, and then click Save.  Acrobat saves each page of the PDF document as a separate, sequentially numbered TIFF file.
  • Combine the single pages back into a multipage document and re-run the OCR process.

Solution 7: Convert to XPS file format and back, and then re-run OCR.

  • If your computer has the “XPS” virtual printer installed (it comes with many versions of MS Office), then print the file using the “Microsoft XPS Document Writer” printer.
    • The XPS printer will ask you to save the file.
    • Convert the saved XPS file to PDF.
    • Re-run the OCR process on the new PDF.

Solution 8: Try running the OCR using a different program.

Scanned Paper (part IV)

Objective Coding:

“Who, What, When” will help you figure out “Where and How”

One definition of the word objective I found online is:

“uninfluenced by emotions or personal prejudices; presented factually”. 

This is accurate when it comes to the objective coding of documents in the not-so-objective world of litigation.  While you may believe that having OCR for .pdfs or associated text for .tiffs gives you the searching capabilities that you need, having documents objectively coded will really allow you to refine those searches and home in on the specific documents you are looking for. 

OCR and associated text allow you to search for keywords through the entire text of the document, whereas objective coding allows you to choose specific fields where the name, word or date you are looking for exists.  You can also create a list of document types (e.g., Email, Financial Record, Police Report, Memo) specific to your case so that you can identify a specific subset of documents for review.  You can also combine information found in different fields to refine your search even further. 

The standard objective coding fields are:

  1. Author
  2. Recipient
  3. Copyee
  4. Date
  5. Title
  6. Document Type

________________________________________________________________________

For example:

Your search results for a document authored by “John Smith” on “January 1, 2010” would differ tremendously depending on whether you used OCR or objective coding to run your search. 

If you only used OCR for your search, you would find not only every document authored by “John Smith” but every document in which his name appears.  You would also retrieve every document in which “January 1, 2010” appears, even if it was simply mentioned in the body of the text.  This could result in an unwieldy set of results and not really help you identify the particular subset of documents you are looking for. 

However, if your documents were objectively coded, you could simply search in the Author field for “John Smith” and in the Date field for “January 1, 2010” and find only the documents that specifically fit those criteria. 

________________________________________________________________________
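The John Smith example can be sketched in a few lines of Python. Everything here is hypothetical: the three sample documents, the field names (modeled on the standard coding fields listed above) and the two search functions. A real review platform does this for you; the sketch only shows why the two kinds of search return different results.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CodedDocument:
    """One document with its objective coding plus its OCR text."""
    doc_id: str
    author: str
    recipient: str
    copyee: str
    doc_date: date
    title: str
    doc_type: str
    text: str  # OCR or associated text for the whole document


def full_text_search(docs: list[CodedDocument], phrase: str) -> list[str]:
    """Return IDs of documents whose text mentions the phrase anywhere."""
    needle = phrase.lower()
    return [d.doc_id for d in docs if needle in d.text.lower()]


def fielded_search(docs: list[CodedDocument], author: str, doc_date: date) -> list[str]:
    """Return IDs of documents coded with this author AND this date."""
    return [
        d.doc_id
        for d in docs
        if d.author.lower() == author.lower() and d.doc_date == doc_date
    ]


# Hypothetical sample set.
DOCS = [
    CodedDocument("DOC-001", "John Smith", "Jane Doe", "", date(2010, 1, 1),
                  "Audit memo", "Memo",
                  "Memo from John Smith dated January 1, 2010 regarding the audit."),
    CodedDocument("DOC-002", "Jane Doe", "Sam Lee", "", date(2010, 3, 15),
                  "Re: files", "Email",
                  "As John Smith said on January 1, 2010, the files were moved."),
    CodedDocument("DOC-003", "John Smith", "Sam Lee", "", date(2011, 6, 30),
                  "Quarterly report", "Financial Record",
                  "Report prepared by John Smith."),
]

print(full_text_search(DOCS, "John Smith"))                  # every mention
print(fielded_search(DOCS, "John Smith", date(2010, 1, 1)))  # coded author and date only
```

The text search returns all three documents because the name appears in each one, while the fielded search returns only the memo John Smith actually authored on that date.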

If you don’t get objective coding with your discovery, ask for it.  Objective coding contains objective information – facts, not opinions or ideas about a document.  Opposing counsel would not be revealing any information about their case, about their case strategy or about the strengths and weaknesses of the discovery by sharing any objective coding they have done.  No privilege would be breached and no attorney work product would be turned over.  Rather, a win-win situation is created when the cost of capturing factual information that will equally help both parties organize and review the discovery is shared. 

Up next:

Part V: Running a Document Inventory

Scanned Paper (part III)

Unitization:

Where One Document Ends, Another Always Begins

When you receive a set of scanned documents as part of your discovery, you should be able to visualize how those documents were kept in the original custodian’s desk drawer.  You should be able to identify which documents were kept together within a file folder or binder and where one document ends and the next begins.  Being able to recognize the order and organization of your discovery means that the documents were properly unitized.  

Having properly unitized documents is key to being able to effectively review scanned discovery.  You can efficiently move from one document to the next, as well as get a sense of how the documents relate to each other.  It is almost impossible to work with just loose pages, so you should always ask for discovery to be produced to you with its proper unitization. 

Keep in mind there are two types of unitization: 

1. Physical Breaks:

A document can simply be defined by its physical breaks.  This includes staples, paper clips, binders, folders, etc.  Unitization by physical breaks is usually done at the time of scanning, as the scanning operator is able to see where the breaks exist.  If you choose to unitize documents by their physical breaks, no relationships between documents are captured but it will be clear where one document ends and the next begins. 

2. Logical Document Determination (LDD):

What is logical about a stack of paper that has sat in somebody’s desk for years, you might ask?  Whether we want to admit it or not, the way those documents were kept is often a major part of the story a litigation team is trying to tell.  A common way to describe documents that are related is to say they are part of a family of documents. 

For example:

If you know that a spreadsheet was clipped to a memo, even though the memo made no mention of any attached spreadsheet, you have learned a telling piece of information about the relationship between those documents.

 

If the documents are given to you with a load file, the load file will act as your roadmap.  Typically, a production of single page .tiffs, which would look like a huge stack of loose paper if printed, is accompanied by a load file that lays out where the document breaks are.  If the documents have been logically unitized, the load file will also identify the parent-child attachments.
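To show what that roadmap does, here is a small Python sketch that reads a load file and rebuilds the document breaks and families. The three-column CSV layout (page_image, doc_id, parent_id) is invented for this illustration; real load files come in several vendor formats with their own delimiters and field names, so treat this as the idea rather than a parser for any particular product.

```python
import csv
import io
from collections import defaultdict

# Hypothetical load file: one row per single page .tiff.
SAMPLE_LOAD_FILE = """page_image,doc_id,parent_id
IMG0001.tif,DOC-001,
IMG0002.tif,DOC-001,
IMG0003.tif,DOC-002,DOC-001
IMG0004.tif,DOC-003,
"""


def read_load_file(handle):
    """Group page images into documents and record parent-child links.

    Returns (pages_by_doc, parent_of):
      pages_by_doc maps a document ID to its page images, in order;
      parent_of maps an attachment's document ID to its parent's ID.
    """
    pages_by_doc = defaultdict(list)
    parent_of = {}
    for row in csv.DictReader(handle):
        pages_by_doc[row["doc_id"]].append(row["page_image"])
        if row["parent_id"]:
            parent_of[row["doc_id"]] = row["parent_id"]
    return dict(pages_by_doc), parent_of


pages, parents = read_load_file(io.StringIO(SAMPLE_LOAD_FILE))
for doc_id, images in pages.items():
    note = f" (attached to {parents[doc_id]})" if doc_id in parents else ""
    print(f"{doc_id}: {len(images)} page(s){note}")
```

In this made-up sample, four loose page images turn out to be three documents, and the one-page DOC-002 is the spreadsheet that was clipped to the two-page memo DOC-001.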

If the documents are not unitized when you receive them, you may want to contact the source and ask them for a unitized set.  If the source does not have a unitized set, the best option is typically to contact a litigation vendor who is familiar with the process of unitization.  They usually have teams of people trained in using software specifically designed to create document breaks as well as identify document families.

Up next:

Part IV: Objective Coding – “Who, What, When” will help you figure out “Where and How”

Scanned Paper (part II)

Searchable Scanned Documents
Can you Really Find that Needle in the Haystack?

Searchable PDFs

PDFs created by scanning paper are typically just images. Hard to believe, when we think about all the things we can do with a PDF, including using it to create forms that can be filled out electronically and having it act as a container to hold a variety of other document formats.  Scanned paper only becomes searchable if it has been OCRed.

What is OCR, you might ask?

OCR stands for Optical Character Recognition.  I know that’s not of much help, but two of my goals are to provide you with interesting dinner conversation and obscure words to answer the latest Geek version of Trivial Pursuit.  The third, and most practical, goal is to allow you to communicate among your team and with vendors about getting your PDFs into a searchable format.

The process of having a document OCRed is actually quite miraculous and the technical aspects can get quite complex.  What is most important to note is that OCRing a PDF allows the text on the PDF to be captured and layered underneath the image in such a way that the text on the image is now searchable.  However, there is always a chance that the OCR within a document can be of poor quality, making the searching of the text inaccurate.

Why might you get poor quality OCR, you ask? 

Most likely, it has to do with the quality of the document being scanned into PDF, but here are some additional reasons to consider:

  • handwriting
  • the original paper is of poor quality (e.g., a photocopy of a photocopy of a fax)
  • scanned in grayscale instead of black and white
  • scanned at a low resolution
  • black and white documents scanned as color
  • foreign language
  • graphics or lines on the page
  • size of font and type of font on the original document
  • …there are always more…

 

TIFFs with Associated Text Files

TIFFs are a lot like pictures or static images.  Unlike PDFs, the text seen on a TIFF can only be captured in a separate file called an Associated Text File, and the image and the text can only be married together to make the TIFF searchable in applications we like to call Evidence Review Platforms (ERPs). 

TIFFs can come as single page documents or multi-page documents.  If they come as single page documents, the only way you would know where one document ends and the next begins is by looking at a load file that uses document ID numbers to identify those document breaks.  Again, load files are designed to be interpreted in an ERP alongside your set of single page TIFFs, and when all of those pieces of information are put together, you are able to move, within the ERP, from one document to the next as well as search through the captured text. 

However, just like the OCR of a PDF, Associated Text Files can vary in quality depending on the quality of the TIFF, the quality of the program used to create the Associated Text File and many other factors.  With all these moving parts, it is hard to really review TIFFs efficiently without some sort of specialized application – like an ERP.  However, once TIFFs are placed into an ERP, the load file and the Associated Text File fall right into place and allow you to perform text searches and to move fairly easily from one document to the next.
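For a feel of what an ERP does behind the scenes with associated text, here is a rough Python sketch. It assumes, purely for illustration, that each .tif file sits in the same folder as a text file with the same base name (IMG0001.tif and IMG0001.txt); actual productions may organize and name these files differently, and the function names are our own.

```python
from pathlib import Path


def pair_images_with_text(folder):
    """Map each TIFF file name to its associated text file, or None if missing."""
    pairs = {}
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() in {".tif", ".tiff"}:
            text_file = image.with_suffix(".txt")
            pairs[image.name] = text_file if text_file.exists() else None
    return pairs


def search_associated_text(folder, keyword):
    """Return the TIFF file names whose associated text contains the keyword."""
    needle = keyword.lower()
    hits = []
    for image_name, text_file in pair_images_with_text(folder).items():
        if text_file is None:
            continue  # no associated text, so this page cannot be searched
        text = text_file.read_text(encoding="utf-8", errors="replace")
        if needle in text.lower():
            hits.append(image_name)
    return hits
```

Calling pair_images_with_text on a production folder and looking for entries with no text file is a quick way to spot pages that were never made searchable.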

Up next:

Part III: Unitization – Where One Document Ends, Another Begins…

Scanned Paper (part I)

Does a Picture Really Speak a Thousand Words?
(part of an ongoing series about scanned paper)

Whether we like it or not, technology is not only becoming part of the legal world, but oftentimes taking it over by storm.  Where we once received paper, we now receive .tiffs, .pdfs and native files.  Where we once could organize the paper in binders or boxes, we now use a combination of tools to view, search, organize and review our much more voluminous and complex set of discovery on our computers because there is just simply too much material to print out. 

In adapting to the influx of electronic discovery, we have to realize that not all electronic discovery is created equal.  Native files such as Word documents and Excel spreadsheets come with a host of information about the file as part of their metadata (a topic we will cover later in our series).  Paper that has been scanned and turned into electronic discovery in formats such as .tiff and .pdf, on the other hand, is really just a set of pictures of the pieces of paper we once clipped, stapled and three-hole punched.  We can now do more with those pieces of paper once they have been scanned, but keep in mind that they are just pictures of the real thing and, unlike the native files we get, these pictures don’t really say as much as we want them to.

 Things to consider: 

  • Is the document searchable?  Is there associated text with the .tiffs and have the .pdfs been OCRed?  If not, should you consider having the documents OCRed?
  • Are the documents unitized?  Do you know where one document ends and the next one begins?  Is there a load file that shows you the document breaks?  If not, should you consider unitization?
  • Was there objective coding done?  Is there a load file that provides you with the objective coding?  If not, should you consider objective coding? 
  • Should you run a document inventory to get a better handle on the various file formats that may be included in the discovery?  Are there color images?  Will you need to take that into consideration when you need to print the documents?  Are there formats that require you to have the associated application in order for you to view the document or database? Are there load files included that may contain objective coding?  
  • Do you already have programs that can handle the viewing, organization and review of scanned paper?  Can they handle one format and not another (e.g., Adobe can handle .pdfs but not .tiffs)?  Do you need to convert your scanned paper into one format that you can handle?  If you don’t already have a program, then what types of programs should you consider?
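As a starting point for the document inventory mentioned in the list above, the following Python sketch counts the files in a discovery folder by file extension. The function name is our own, and a file extension is only a hint about the real format, so treat the counts as a first pass rather than a full inventory.

```python
from collections import Counter
from pathlib import Path


def inventory_by_extension(root):
    """Count every file under root (including subfolders) by extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts


def print_inventory(root):
    """Print the extensions found, most common first."""
    for extension, count in inventory_by_extension(root).most_common():
        print(f"{extension:>16}  {count}")
```

Running print_inventory on the top-level folder of a production gives a quick picture of how many .pdf, .tif and native files you have, and whether any load files or text files came along with them.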

Up next:

Part II: Searchable Scanned Documents
Can you Really Find that Needle in the Haystack?