Three Types of PDFs

PDFs (portable document format files) are a common file format in federal criminal discovery. But are all PDFs created equal? As you all have experienced, the answer is no, they are not.

Think about PDFs in three distinct categories:

  1. True PDFs;
  2. Image-based PDFs; and
  3. Made-searchable PDFs.

For discovery review, these distinctions are important because they affect whether a PDF is searchable and how accurate your text searches within it will be. With voluminous discovery, the ability to search PDFs is critical for organizing and reviewing the material.

  • True PDFs (also known as text-based or digitally created PDFs). These PDFs are created directly from software such as Microsoft Word or Excel, either by saving the file as a PDF or by using the “print to PDF” function in those programs. They consist of both text and images. Think of these PDFs as having two layers: an image layer and a text layer. The image layer shows what the document will look like if it is printed to paper. The text layer is searchable text that is carried over from the original Word file into the new PDF file (the technical term for this layer is “extracted text”). There is no need to make the PDF searchable, and it will have the same text as the original Word file. An example of True PDFs that federal defenders and CJA panel attorneys will be familiar with is the pleadings filed in CM/ECF. A pleading is originally created in Word, and the attorney then either saves it as a PDF or prints it to PDF and files that PDF with the court. Either process creates a PDF file with an image layer plus a text layer. In terms of usability, this is the best type of PDF to receive in discovery because its text searchability will be the closest to that of the original file. Click here to see an example of a True PDF.
  • Image-based PDFs (also known as image-only PDFs). Image-based PDFs are typically created by scanning paper on a copier, taking photographs, or taking screenshots. To a computer, they are images. Though we humans can see text in the image, the file consists only of the image layer and lacks the searchable text layer that True PDFs contain. As a result, we cannot use a computer to search the text we see in the image. Discovery is sometimes produced in image-based PDF format. When you come across image-based PDFs, first ask the U.S. Attorney’s Office what format the file was in originally. Second, ask whether they have it in a searchable format, and specifically whether they have it as a digitally created, text-based True PDF. They may not, as they often receive PDFs from other sources before they provide them to you, but you will want to know the format in which they have the file and its original format (as far as they know). Click here to see an example of an Image-based PDF.
  • Made-searchable PDFs (also known as “OCRed” PDFs). Image-based PDFs can be made text searchable by applying optical character recognition (OCR). CJA panel attorneys frequently use Adobe Acrobat Pro (or other PDF editor software) to make image-based PDFs searchable. During the OCR process, the software interprets each character in the image as text and adds a text layer to the image layer. Made-searchable PDFs are like True PDFs in that they have both layers, but the searchability of an OCRed document depends on the quality of the image and how recognizable the writing is. As a result, keyword searches of the text are often less than 100% accurate. Click here to see an example of a Made-searchable PDF.
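The three categories above can be summarized as a simple decision rule. The following is only a rough sketch in Python, not how Acrobat or any other program identifies a PDF. It assumes you already know two facts about each page (gathered with whatever PDF tool you use): how many characters are in the text layer, and whether a single image covers the whole page, which is typical of scans. The names `PageFacts` and `classify_pdf` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageFacts:
    """Hypothetical per-page facts; how you collect them is up to your PDF tool."""
    text_chars: int        # number of characters in the page's text layer
    full_page_image: bool  # True if one image covers the whole page (typical of scans)


def classify_pdf(pages: List[PageFacts]) -> str:
    """Return 'true', 'image-based', or 'made-searchable' (a heuristic, not a rule)."""
    if not pages:
        raise ValueError("no pages to classify")
    has_text = any(p.text_chars > 0 for p in pages)
    looks_scanned = all(p.full_page_image for p in pages)
    if not has_text:
        return "image-based"      # image layer only, nothing to search
    if looks_scanned:
        return "made-searchable"  # scanned image plus a text layer added by OCR
    return "true"                 # text carried over from the source file


print(classify_pdf([PageFacts(1850, False)]))  # true
print(classify_pdf([PageFacts(0, True)]))      # image-based
print(classify_pdf([PageFacts(1790, True)]))   # made-searchable
```

The rule can be wrong in edge cases (for example, a digitally created PDF whose pages each carry a full-page background image), so treat the result as a prompt for the questions to ask about the file, not as a final answer.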

The ESI Protocol (formally known as the Recommendations for Electronically Stored Information (ESI) Discovery Production in Federal Criminal Cases) notes the limitations of the OCR process on scanned paper:

“Generally speaking, OCR does not handle handwritten text or text in graphics well. OCR conversion rates can range from 50 to 98% accuracy depending on the underlying document. A full page of text is estimated to contain 2,000 characters, so OCR software with even 90% accuracy would create a page of text with approximately 200 errors.”

People often ask how accurate a particular software program is at OCR conversion. That matters, but the biggest factor in how searchable your OCRed PDF will be is the underlying quality of the scanned image. A clean copy of a pleading will convert with high accuracy; a twice-photocopied paper school record from the 1950s will be less accurate.
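The arithmetic behind the ESI Protocol's estimate is easy to check, and it also shows why keyword searches suffer. The short Python sketch below reproduces the 2,000-character example from the quotation. The second function goes one step beyond the quotation and rests on a simplifying assumption that is ours, not the Protocol's: that each character is misread independently of the others.

```python
def expected_errors(chars_per_page: int, accuracy: float) -> int:
    """Expected number of misread characters on one page."""
    return round(chars_per_page * (1.0 - accuracy))


def chance_word_is_intact(word_length: int, accuracy: float) -> float:
    """Chance that every character of a word was read correctly.

    Assumes errors are independent and evenly spread, which is a simplification.
    """
    return accuracy ** word_length


print(expected_errors(2000, 0.90))                # 200, matching the ESI Protocol example
print(expected_errors(2000, 0.98))                # 40
print(round(chance_word_is_intact(8, 0.90), 2))   # 0.43
```

Under that assumption, an eight-letter search term on a page converted at 90% accuracy would be fully intact, and therefore findable by an exact keyword search, less than half the time.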

A quick way to compare the quality of the text layer with the image is to select the text in question in the PDF file (Control + A in Windows or Command + A on a Mac selects all the text), copy it, and paste it into a Word document. Then put the two files side by side and visually compare them.

Side by Side
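For a longer passage, the same side-by-side check can be partly automated. The sketch below is one possible approach using Python's standard `difflib` module. It assumes you have typed out a short passage exactly as it reads in the image and pasted the matching passage from the PDF's text layer; the sample sentences and the function names `text_similarity` and `changed_words` are made up for illustration.

```python
import difflib


def text_similarity(reference: str, ocr_text: str) -> float:
    """Word-level ratio from 0.0 (nothing alike) to 1.0 (identical)."""
    ref_words = reference.split()
    ocr_words = ocr_text.split()
    return difflib.SequenceMatcher(None, ref_words, ocr_words).ratio()


def changed_words(reference: str, ocr_text: str):
    """List (expected, found) word pairs wherever the two texts differ."""
    ref_words = reference.split()
    ocr_words = ocr_text.split()
    matcher = difflib.SequenceMatcher(None, ref_words, ocr_words)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(ref_words[i1:i2]), " ".join(ocr_words[j1:j2])))
    return diffs


# What you read in the image versus what was pasted from the text layer.
what_you_see = "The defendant was arrested on March 3, 2019."
what_was_pasted = "The defendant was arrestcd on March 3, 2O19."

print(text_similarity(what_you_see, what_was_pasted))
print(changed_words(what_you_see, what_was_pasted))
```

In this made-up sample, the misread words are the ones a keyword search for "arrested" or "2019" would miss, which is the practical cost of a poor text layer.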

dtSearch User Preferences

When you first open dtSearch, the window layout and user preferences will use the program’s default settings.  We’ve found that modifying certain settings increases the search capabilities and makes the program easier to navigate and work with.  The system will remember your preferences, so you only have to modify these settings once.

By default, the program is set to search document content but not file or folder names, and there are times when searching file and folder names can be helpful.  Additionally, the search results screen uses a top-bottom layout (the list of results is on the top with a document preview on the bottom).  Since most documents have a portrait orientation, a side-by-side layout is generally easier to work with.  For Adobe Acrobat documents, an additional plug-in is needed to navigate between search hits within the same document.

To change the user preferences, go to the “Options” menu and choose “Preferences”.

In the Preferences window, under “Indexing Options” place a check next to “Index filenames as text” (leave “Include path information” checked as well).

Next, go to “Search results” within the “Search Options” section and place a check next to “Checkbox” and “Type” within the “Items to include in search results” section.  Then under the “Window layout” section, select “Vertical split”.

Finally, select “PDF view options” in the “Document Options” section.  Look in the “Highlighting hits in Adobe Reader” area.  If the screen reads “A plug-in is needed…”, select the “Configure Plug-in” button and follow the screen prompts to install it.  If the screen reads that the plug-in is installed, there is nothing more you need to do.

Once you have made the changes, click “OK”.  You will receive a message notifying you that the new window layout won’t appear until you close and restart dtSearch.

To see the changes, close dtSearch and re-open it.  You will see the window layout is in the side-by-side “Vertical split” view.  When you run a search, your search results will now appear on the left with checkboxes, and the document viewer will appear on the right.  Within PDF documents you will now be able to use the hit navigation buttons.

Going forward, any new indexes you create will include the ability to search file and folder names.  If you wish to add this feature to any of your existing indexes, run “Update Index” from the “Index” menu.
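To see why indexing file and folder names makes a difference, consider the toy word index below. It is written in Python purely for illustration and is not how dtSearch builds its indexes; the folder names, file contents, and the `build_index` function are all made up. The only point is that a word appearing in a path, but not in the document's content, becomes findable once names are indexed as text.

```python
import re
from collections import defaultdict


def tokenize(text: str):
    """Lowercase and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_index(documents: dict, index_filenames: bool):
    """Map each word to the set of document paths that contain it.

    `documents` maps a file path to the text of that file.
    """
    index = defaultdict(set)
    for path, content in documents.items():
        words = tokenize(content)
        if index_filenames:
            words += tokenize(path)  # treat folder and file names as searchable text
        for word in words:
            index[word].add(path)
    return index


# Hypothetical discovery folder; paths and contents are made up.
docs = {
    "Discovery/Bank_Records/statement_2019.pdf": "account balance and deposits",
    "Discovery/Reports/interview.pdf": "agent interviewed the teller at the bank",
}

content_only = build_index(docs, index_filenames=False)
with_names = build_index(docs, index_filenames=True)

print(sorted(content_only["records"]))  # []
print(sorted(with_names["records"]))    # ['Discovery/Bank_Records/statement_2019.pdf']
```

In the same way, an index built before the setting was changed has no entries for file and folder names, so existing indexes only gain them after they are updated.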

For additional help with dtSearch, please use the “Help” menu or visit dtSearch.com.

dtSearch Desktop Demonstration Video

dtSearch is a popular search and retrieval program. Here is a brief 12-minute video that demonstrates how to set up a new dtSearch index and how to run searches within an index.

As mentioned in the dtSearch Desktop post, we have been able to obtain a limited number of licenses that will be made available to CJA panel attorneys with current, active cases.  To request a license go to the dtSearch Desktop post and fill out the request form on the bottom.

Note: Like most litigation software programs, this program was developed for Windows-based operating systems and does not work with Macintosh operating systems.

dtSearch Desktop

Limited Licenses of dtSearch Desktop Available for CJA Panel Attorneys

We are pleased to announce that we have been able to obtain a limited number of dtSearch Desktop software licenses for CJA panel attorneys with current, active cases.

dtSearch is a popular search and retrieval program, and it is the search engine used in well-known computer programs such as Forensic Toolkit (FTK, a computer forensic tool), CaseMap and Adobe Acrobat Pro.  This type of program is a useful tool to assist legal teams in searching discovery, creating brief banks, and viewing different file types (including non-PDF files) even if you don’t have the associated application.  We have a limited number of licenses available for CJA panel attorneys to use for free (a $200 value).

The program is well suited to searching both electronic documents and paper documents that have been scanned and converted to a text-searchable format, especially since it can search and retrieve information in many different file types.  dtSearch is a user-friendly software program that provides immediate results and utility even for the novice computer user.  As electronic discovery in federal criminal matters continues to grow in volume and in the variety of formats, dtSearch is a useful tool for CJA panel attorneys faced with the daunting task of organizing and searching through their case material.

To obtain the software, please fill out the dtSearch Request Form below. When finished filling out this form, press the “submit” button on the bottom of the form. This will attach your completed form to an email message sent to National Litigation Support Paralegal Kalei Achiu. You will then receive an email with download instructions and the activation code necessary to obtain your free copy of the dtSearch Desktop. Please allow up to 5 business days to process your request.  Each user license can be installed for that user on two machines.

You must have an active appointed case to continue to utilize the license.  If you are no longer on the panel and don’t have an active appointed case, we request you return the license to the National Litigation Support Team (NLST) by contacting Kalei Achiu so the license can be used by other CJA panel attorneys.  Like most litigation software programs, this program was developed for Windows-based operating systems and does not work with Macintosh operating systems.

For technical support or if you have any questions regarding the utilization of dtSearch within your office, please contact either Alex Roberts or Kalei Achiu.  If you want to learn more about dtSearch, go to http://dtsearch.com/.

dtSearch Desktop Request Form: