Three Types of PDFs

PDFs (Portable Document Format files) are a common file format in federal criminal discovery. But are all PDFs created equal? As you have all experienced, the answer is no.

Think about PDFs in three distinct categories:

  1. True PDFs;
  2. Image-based PDFs; and
  3. Made-searchable PDFs.

For discovery review, these distinctions are important because they determine whether a PDF is searchable and how accurate your text searches within it will be. With voluminous discovery, the ability to search PDFs is critical for organizing and reviewing the material.

  • True PDFs (also known as text-based or digitally created PDFs). These PDFs are created from within software such as Microsoft Word or Excel, either by saving the file as a PDF or by using the “print to PDF” function. They consist of both text and images. It helps to think of these PDFs as having two layers: one layer is the image and the second layer is the text. The image layer shows what the document will look like if it is printed to paper. The text layer is searchable text carried over from the original Word file into the new PDF file (the technical term for this layer is “extracted text”). There is no need to make the PDF searchable, and it will have the same text as the original Word file. An example of True PDFs that federal defenders and CJA panel attorneys will be familiar with is the pleadings filed in CM/ECF. A pleading is originally created in Word, and the attorney then either saves it as a PDF or prints it to PDF and files that PDF with the court. Either process creates a PDF file with an image layer plus a text layer. In terms of usability, this is the best type of PDF to receive in discovery because its text searchability will be the closest to that of the original file. Click here to see an example of a True PDF.
  • Image-based PDFs (also known as image-only PDFs). Image-based PDFs are typically created by scanning paper in a copier, taking photographs or taking screenshots. To a computer, they are images. Though we humans can see text in the image, the file consists only of the image layer and lacks the searchable text layer that True PDFs contain. As a result, we cannot use a computer to search the text we see in the image, because that text layer is missing. Discovery is sometimes produced in an image-based PDF format. When you come across image-based PDFs, first ask the U.S. Attorney’s Office what format the file was in originally. Second, ask if they have it in a searchable format, and specifically if they have it as a digitally created, text-based True PDF. They may not, as they often receive PDFs from other sources before they provide them to you, but you will want to know the format in which they have it and the original format of the file (as far as they know). Click here to see an example of an Image-based PDF.
  • Made-searchable PDFs (also known as “OCRed” PDFs). Image-based PDFs can be made text searchable by applying optical character recognition (OCR). CJA panel attorneys frequently use Adobe Acrobat Pro (or other PDF editor software) to make image-based PDFs searchable. During the OCR process, the software interprets each character in the image as text and adds a text layer to the image layer. Made-searchable PDFs resemble True PDFs in that they have both layers, but how well an OCRed document can be searched depends on the quality of the image and how recognizable the writing is. Keyword searches of the text are often not 100% accurate. Click here to see an example of a Made-searchable PDF.
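The three categories above can be expressed as a rough decision rule. The sketch below is only an illustration: the `PageSummary` type and the `classify_pdf` function are hypothetical names, and the rule itself (text on top of a full-page image suggests OCR) is an assumed heuristic, not something Acrobat reports. It also assumes you have already gathered the per-page facts with a PDF tool of your choice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageSummary:
    """Hypothetical per-page facts gathered beforehand with any PDF tool."""
    text_chars: int        # number of characters found in the page's text layer
    full_page_image: bool  # True if a single image covers (nearly) the whole page


def classify_pdf(pages: List[PageSummary]) -> str:
    """Guess which of the three categories a PDF falls into (heuristic only)."""
    if not pages:
        return "unknown"
    pages_with_text = [p for p in pages if p.text_chars > 0]
    if not pages_with_text:
        # Only an image layer: nothing for a text search to find.
        return "image-based"
    if all(p.full_page_image for p in pages_with_text):
        # A text layer sitting on a scanned page image suggests OCR was applied.
        return "made-searchable"
    # Text without a full-page scan behind it suggests a digitally created file.
    return "true"


if __name__ == "__main__":
    scanned = [PageSummary(text_chars=0, full_page_image=True)]
    ocred = [PageSummary(text_chars=1800, full_page_image=True)]
    pleading = [PageSummary(text_chars=1800, full_page_image=False)]
    for name, pages in [("scanned", scanned), ("ocred", ocred), ("pleading", pleading)]:
        print(name, "->", classify_pdf(pages))
```

A result from a rule like this is a starting point for the questions to ask the producing party, not a substitute for them.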

The ESI Protocol (formally known as the Recommendations for Electronically Stored Information (ESI) Discovery Production in Federal Criminal Cases) notes the limitations of the OCR process on scanned paper:

“Generally speaking, OCR does not handle handwritten text or text in graphics well. OCR conversion rates can range from 50 to 98% accuracy depending on the underlying document. A full page of text is estimated to contain 2,000 characters, so OCR software with even 90% accuracy would create a page of text with approximately 200 errors.”
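The arithmetic in the quotation can be checked with a few lines of code. This is a minimal sketch: the function name `expected_ocr_errors` is made up for illustration, and it assumes errors are counted per character, as the quotation does.

```python
def expected_ocr_errors(chars_per_page: int, accuracy: float) -> int:
    """Estimate misread characters on a page for a given OCR accuracy (0.0 to 1.0)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0.0 and 1.0")
    return round(chars_per_page * (1.0 - accuracy))


if __name__ == "__main__":
    # 2,000 characters per page is the estimate used in the ESI Protocol quotation.
    for accuracy in (0.50, 0.90, 0.98):
        errors = expected_ocr_errors(2000, accuracy)
        print(f"{accuracy:.0%} accuracy: about {errors} errors per page")
```

Even at the top of the quoted range (98%), the same estimate leaves about 40 misread characters on a full page, any one of which can cause a keyword search to miss.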

People ask how accurate software programs are at OCR conversion. That matters, but the biggest factor in how searchable your OCRed PDF will be is the underlying quality of the scanned image. A clean copy of a pleading will have high accuracy; a twice-photocopied paper school record from the 1950s will be less accurate.

A quick way to compare the quality of the text with the image is to select the text in question in the PDF file (you can use Control + A in Windows or Command + A on a Mac to select all the text), copy it, and paste it into a Word document. Put the two files side by side and visually compare them.
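If you can type out (or otherwise obtain) a short passage that you know is correct, the visual comparison can be supplemented with a rough automated one. The sketch below uses Python's standard `difflib` module; the function names and the sample sentences are invented for illustration, and the similarity score is only an indicator, not a formal accuracy measurement.

```python
import difflib
from typing import List, Tuple


def similarity(ocr_text: str, reference_text: str) -> float:
    """Return a 0.0 to 1.0 similarity score, ignoring differences in spacing."""
    a = " ".join(reference_text.split())
    b = " ".join(ocr_text.split())
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def differing_words(ocr_text: str, reference_text: str) -> List[Tuple[str, str]]:
    """List (reference words, OCR words) pairs wherever the two texts disagree."""
    ref_words = reference_text.split()
    ocr_words = ocr_text.split()
    matcher = difflib.SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(ref_words[i1:i2]), " ".join(ocr_words[j1:j2])))
    return diffs


if __name__ == "__main__":
    # Invented sample text: what the page says versus what was pasted from the PDF.
    reference = "The defendant was arrested on March 3"
    pasted = "The defendant was arrested on Match 3"
    print(f"similarity: {similarity(pasted, reference):.3f}")
    for expected, found in differing_words(pasted, reference):
        print(f"expected '{expected}' but OCR text has '{found}'")
```

Words that show up in the difference list are the ones a keyword search would miss.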

Side by Side

Acrobat DC New Features

All of you use Adobe Acrobat on a daily basis. Whether it is Adobe Acrobat Reader, Standard or Pro, it is an excellent tool for legal professionals for everything from saving pleadings to file with the court’s case management/electronic case file system to reviewing discovery. Some of you have been using Acrobat for a while and know that Adobe comes out with new versions every couple of years. The latest version of Acrobat no longer uses a release number to distinguish itself (like Adobe Acrobat XI); instead it is called DC, which stands for Document Cloud, and is labeled by the year of release (Adobe Acrobat DC 2016 is the most recent version). Like many other software companies, Adobe is moving to a cloud-based service, giving users the option of working on multiple devices seamlessly if they choose to store their files online. Though Acrobat DC is designed for cloud use, users do not have to store their documents remotely, and they can continue using it as a desktop program as they always have.

Acrobat DC has a new look compared to previous versions, has been designed to be tablet and cell phone friendly, and lets users work on a document from different devices seamlessly. The addition of a user-friendly tabbed toolbar makes switching from one document to another that much easier.

The “Home” tab shows the most recent files you have worked with. You can also search for a file in the search bar, or open a file by clicking “My Computer” and navigating to it or by going to the File menu and selecting Open.

Once you open a document, the “Document” tab appears at the top of the screen, allowing you to easily navigate from the Document to the Tool Center to the Home page.

The “Tools” tab, otherwise known as the DC Tool Center, centralizes all the features of Acrobat in one place for easy access. Now you can quickly find the tool you need without having to remember which menu in the tools section to navigate to.

The “Search Tools” option in DC is intuitive and easy to use. If you want to OCR a document, type OCR in the “Search Tools” section of the Tool Center and all the toolsets related to recognizing text will appear.

The tool pane that users see when looking at a document can be customized. You can add a tool to the tool pane by selecting “Add Shortcut” from the Tool Center or by right-clicking in the Tool Pane when searching for a tool and adding it there.

When Tool Groups are opened, they are automatically pinned to the top of the screen. The Tool Group stays open until you close it or open another tool.

DC gives you multiple ways of accessing the tools you are looking for and then quickly going back to working with your documents.

The new tabbed toolbar is just one feature of Acrobat DC that makes upgrading worthwhile. More features will be highlighted in upcoming posts, so stay tuned.

Adobe Acrobat: “Renderable Text”

When working with PDF documents you may encounter a “renderable text” error message. This message will sometimes occur when trying to make a scanned paper PDF file text searchable (also known as adding OCR to a document).

The exact wording of the message may vary depending on the version of Acrobat you have.

“Renderable text” is typically text that has been added to a scanned paper image (like a header, footer or Bates number) through a non-Acrobat program. The way this text is encoded into the page can cause Acrobat to disallow additional searchable text (OCR text).

This message can certainly be annoying and it can also be significant as it can limit your ability to run searches.  In Acrobat, you will be unable to add new searchable OCR text, or improve the quality of the existing OCR, until the error is fixed.

If you’ve seen this message before, and have tried to fix the document without success, you are not alone! We have spoken with a number of people over the years who have come up with some creative solutions. Though we have yet to find “one solution” that will always fix this particular error, here are a number of possible solutions (results will vary depending on the cause of the error):

Solution 1: Obtain a version of the document with OCR.

  • It may seem simplistic, but if you receive documents without searchable OCR, ask for it.  Often the person or organization that gave it to you will want to search the files themselves and may already have a copy that has been OCR’ed.  Even if the documents they give you generate “renderable text” error messages, you will still be able to search any of the existing OCR text within the files.

Solution 2: If the files are from PACER / ECF, download a new copy.

  • The default download settings in PACER / ECF will add “purple” headers with the case number (which will cause a “renderable text” error message).  If you can find the document again in PACER / ECF, download it with the header option turned off.

Solution 3: Run “Add Tags to Document” (available in Acrobat Pro).

  • If you have Acrobat Pro installed, there is a special “Accessibility” menu where you can run “Add Tags to Document”. For certain PDFs, running this option will clear up the issue and allow OCR to be run on the document.

Solution 4: Print the document to PDF (available in Acrobat Standard and Acrobat Pro).

  • If you have Acrobat installed (Standard or Pro), you’ll probably also have access to an “Adobe PDF” virtual printer. Printing the document to this virtual printer creates a new PDF that will often avoid the renderable text issue.

Solution 5: “Sanitize” the document then rerun OCR (available in Acrobat Pro).

  • From the “Protection” menu run “Sanitize Document”.  This will remove all of the document metadata including some of the rendered text that might be causing the error.
  • Re-run the OCR process.

Solution 6: Convert to TIFF files and back, and then re-run OCR (available in Acrobat Standard and Acrobat Pro).

  • Open the PDF document in Acrobat and choose “File > Save As”.
  • In the “Save As” dialog box, choose TIFF (*.tif, *.tiff) from the Save As Type (Windows) or Format (Mac OS) pop-up menu. Specify a location, and then click Save.  Acrobat saves each page of the PDF document as a separate, sequentially numbered TIFF file.
  • Combine the single pages back into a multipage document and re-run the OCR process.

Solution 7: Convert to XPS file format and back, and then re-run OCR.

  • If your computer has the “XPS” virtual printer installed (it comes with many versions of MS Office), print the file using the “Microsoft XPS Document Writer” printer.
    • The XPS printer will ask you to save the file.
    • Convert the saved XPS file to PDF.
    • Re-run the OCR process on the new PDF.

Solution 8: Try running the OCR using a different program.
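As one illustration of Solution 8, the sketch below shows how a different OCR program could be called from a script. It assumes the open-source command-line tool OCRmyPDF is installed; that tool is named here only as an example and is not one this post has tested or recommends. Its `--force-ocr` option rasterizes each page before recognizing text, so any existing text layer is discarded and rebuilt. The helper names `build_ocr_command` and `run_ocr` are hypothetical.

```python
import shutil
import subprocess
from typing import List


def build_ocr_command(input_pdf: str, output_pdf: str, force: bool = True) -> List[str]:
    """Build the OCRmyPDF command line (assumes the ocrmypdf tool is installed)."""
    command = ["ocrmypdf"]
    if force:
        # --force-ocr rasterizes each page and runs OCR even if text is present.
        command.append("--force-ocr")
    command += [input_pdf, output_pdf]
    return command


def run_ocr(input_pdf: str, output_pdf: str) -> bool:
    """Run the OCR tool; return False if it is missing or reports a failure."""
    if shutil.which("ocrmypdf") is None:
        return False
    result = subprocess.run(build_ocr_command(input_pdf, output_pdf))
    return result.returncode == 0


if __name__ == "__main__":
    # File names are placeholders for illustration.
    print(" ".join(build_ocr_command("discovery.pdf", "discovery_ocr.pdf")))
```

Always write the result to a new file, as above, so the PDF you received in discovery is preserved unchanged.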

Adobe Acrobat Training Videos: Searching Fundamentals

Adobe Acrobat Pro is one of the most popular computer software programs on the market for FDO and CJA panel attorneys. Since so much of the discovery we currently receive in criminal cases is provided in paper or scanned paper format, Acrobat Pro is an excellent tool to help you better organize and review it.

As part of our team’s continued effort to provide resources to CJA panel attorneys and FDO staff, we are creating a series of training videos. Each short video will address a specific feature in a computer software program, with our first set focused on Adobe Acrobat Pro XI.

Future videos we are developing will also be posted on this blog. Make sure to check back in or subscribe to our blog to get notices of new posts by email.

These videos do not take the place of hands-on training sessions, where we can go in depth on a variety of software programs and legal strategies for addressing complex cases, but they will hopefully provide you with some basic background information that can help you in your cases.

Adobe Acrobat Training Videos: Text Recognition

The first video (created by Kelly Scribner and Alex Roberts) gives key information to consider when using OCR text recognition with Adobe Acrobat Pro on scanned paper. Though much has been written about the incredible functionality available in Adobe Acrobat Pro, this short seven-minute demonstration focuses on the points we think are most important to consider when using OCR in Acrobat Pro.
