Scanned Paper (part II)

Searchable Scanned Documents
Can you Really Find that Needle in the Haystack?

Searchable PDFs

PDFs created from scanned paper are typically just images. Hard to believe, when we think about all the things we can do with a PDF, including using it to create forms that can be filled out electronically and having it act as a container to hold a variety of other document formats.  A scanned PDF only becomes searchable once it has been OCRed.

What is OCR, you might ask?

OCR stands for Optical Character Recognition.  I know that’s not much help, but two of my goals are to provide you with interesting dinner conversation and obscure words for the latest Geek version of Trivial Pursuit.  The third, and most practical, goal is to allow you to communicate with your team and with vendors about getting your PDFs into a searchable format.

The process of having a document OCRed is actually quite miraculous and the technical aspects can get quite complex.  What is most important to note is that OCRing a PDF allows the text on the PDF to be captured and layered underneath the image in such a way that the text on the image is now searchable.  However, there is always a chance that the OCR within a document can be of poor quality, making the searching of the text inaccurate.

Why might you get poor quality OCR, you ask? 

Most likely, it has to do with the quality of the document being scanned into PDF, but here are some additional reasons to consider:

  • handwriting
  • the original paper is of poor quality (e.g., a photocopy of a photocopy of a fax)
  • scanned in grayscale instead of black and white
  • scanned at a low resolution
  • black and white documents scanned as color
  • foreign language
  • graphics or lines on the page
  • size of font and type of font on the original document
  • …there are always more…

 

TIFFs with Associated Text Files

TIFFs are a lot like pictures or static images.  Unlike PDFs, the text seen on a TIFF can only be captured in a separate file called an Associated Text File, and the image and the text can only be married together to make the TIFF searchable in applications we like to call Evidence Review Platforms (ERPs). 

TIFFs can come as single page documents or multi-page documents.  If they come as single page documents, the only way you would know where one document ends and the next begins is by looking at a load file that uses document ID numbers to identify those document breaks.  Again, load files are designed to be interpreted in an ERP alongside your set of single page TIFFs, and when all of those pieces of information are put together, you are able to move, within the ERP, from one document to the next as well as search through the captured text. 

However, just like the OCR of a PDF, Associated Text Files can vary in quality depending on the quality of the TIFF, the quality of the program used to create the Associated Text File and many other factors.  With all these moving parts, it is hard to really review TIFFs efficiently without some sort of specialized application – like an ERP.  However, once TIFFs are placed into an ERP, the load file and the Associated Text File fall right into place and allow you to perform text searches and to move fairly easily from one document to the next.

Up next:

Part III: Unitization – Where One Document Ends, Another Begins…

Scanned Paper (part I)

Does a Picture Really Speak a Thousand Words?
(part of an ongoing series about scanned paper)

Whether we like it or not, technology is not only becoming part of the legal world, but oftentimes taking it over by storm.  Where we once received paper, we now receive .tiffs, .pdfs and native files.  Where we once could organize the paper in binders or boxes, we now use a combination of tools to view, search, organize and review our much more voluminous and complex set of discovery on our computers because there is just simply too much material to print out. 

In adapting to the influx of electronic discovery, we have to realize that not all electronic discovery is created equal.  Native files such as Word documents and Excel spreadsheets come with a host of information about each file as part of its metadata (a topic we will cover later in our series), while paper that has been scanned and turned into electronic discovery in formats such as .tiff and .pdf is really just a set of pictures of the pieces of paper we once clipped, stapled and three-hole punched.  We can now do more with those pieces of paper once they have been scanned, but keep in mind that they are just pictures of the real thing and, unlike the native files we get, these pictures don’t really say as much as we want them to.

Things to consider:

  • Is the document searchable?  Is there associated text with the .tiffs and have the .pdfs been OCRed?  If not, should you consider having the documents OCRed?
  • Are the documents unitized?  Do you know where one document ends and the next one begins?  Is there a load file that shows you the document breaks?  If not, should you consider unitization?
  • Was there objective coding done?  Is there a load file that provides you with the objective coding?  If not, should you consider objective coding? 
  • Should you run a document inventory to get a better handle on the various file formats that may be included in the discovery?  Are there color images?  Will you need to take that into consideration when you need to print the documents?  Are there formats that require you to have the associated application in order for you to view the document or database? Are there load files included that may contain objective coding?  
  • Do you already have programs that can handle the viewing, organization and review of scanned paper?  Can they handle one format and not another (e.g., Adobe can handle .pdfs but not .tiffs)?  Do you need to convert your scanned paper into one format that you can handle?  If you don’t already have a program, then what types of programs should you consider?
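
A document inventory like the one suggested above can be approximated with a short script.  The following is a minimal sketch in Python, not a feature of any particular litigation support product: it walks a folder and counts files by extension so you can see at a glance how many .tif, .pdf and other files were produced.  The function name inventory_by_extension is an assumption made for illustration, and the script simply defaults to the current folder if no folder is given.

```python
import sys
from collections import Counter
from pathlib import Path


def inventory_by_extension(root):
    """Count the files under `root` (including subfolders) by file extension.

    Extensions are lower-cased so that .TIF and .tif are counted together.
    Files without an extension are counted under "(no extension)".
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical usage: pass the discovery folder as the first argument.
    folder = sys.argv[1] if len(sys.argv) > 1 else "."
    for extension, count in inventory_by_extension(folder).most_common():
        print(f"{extension:>16}  {count}")
```

A count like this will not tell you whether images are in color or whether the text is searchable, but it is a quick first pass for spotting unexpected formats and load files in a production.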

Up next:

Part II: Searchable Scanned Documents
Can you Really Find that Needle in the Haystack?

What is a “Load File”?

A “load file” is a special kind of file that you may encounter in sets of case-related materials.  While there are many different flavors of load files, they all serve the same general purpose: they can be used by litigation support software to import (i.e., “load”) information about case-related documents.

Document information may include:

  • Name and locations of image files (typically scanned paper files).
  • Document unitization information (i.e. document breaks).
  • OCR (searchable text) file names and locations.
  • Electronic document (ESI) file names and locations.
  • Extracted metadata information.
  • Other fielded document information.

Load files can play an important role in assisting with the setup of a case document database.  When properly used, they can make the process of importing documents into litigation support applications faster and more efficient.  Some programs that support the importing of load files include evidence review programs (like Summation, Concordance and IPRO) and trial presentation programs (like TrialDirector and Sanction).

Load files have different file extensions depending on the program they are designed to work with.  When talking with litigation support vendors, or discussing the format of discovery with opposing counsel, it is important to recognize which load file formats work with your litigation support programs.

Some common file extensions of load files that you might encounter are:

  • .DII      designed to work with AD Summation
  • .OPT   designed to work with Concordance 
  • .LFP    designed to work with IPRO products
  • .OLL    designed to work with TrialDirector
  • .SDT   designed to work with Sanction
  • .DAT   generic document information load file   
  • .CSV   generic document information load file
  • .XML   new “EDRM” style load file format that works with many platforms 

Many load files contain the path of image files associated with a record.  They may also contain meaningful additional information about the documents.  For scanned documents, this may include a Bates or control number, coded document information (like document type, date, title, etc.) and information about OCR (searchable text) files that might be associated with the document.  Load files for electronic documents (ESI) may also include extracted metadata (associated information about the files such as author, date created, file size, etc.).

Most load files are simple lines of text that can be read by litigation support programs.  When viewed in a text program like WordPad or MS Word, we can see what the lines contain.  Here is an excerpt from a sample .LFP load file (as seen in WordPad):

IM,D0022,D,0,@DISK001;DATA\IMAGES00;D0022.TIF;2,0
IM,D0023,D,0,@DISK001;DATA\IMAGES00;D0023.TIF;2,0
IM,D0024, ,0,@DISK001;DATA\IMAGES00;D0024.TIF;2,0
IM,D0025,D,0,@DISK001;DATA\IMAGES00;D0025.TIF;2,0

This particular load file contains information about document images.  Litigation support software programs can read this file and know:

  1. the record identifier (usually the Bates number) of a document
  2. where one document ends and another begins
  3. where to find the scanned paper .TIF files associated with a document
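
As a rough illustration of how a program might read the excerpt above, here is a minimal Python sketch.  It assumes the commonly described IPRO layout, in which the third field is a document-break flag (D marks the first page of a document, a blank marks a continuation page) and the fifth field holds the volume, folder and file name separated by semicolons.  The names ImageRecord, parse_lfp_line and group_documents are hypothetical, and real .LFP files may contain other line types that this sketch simply skips.

```python
from dataclasses import dataclass

# The four sample lines from the excerpt (raw string keeps the backslashes).
SAMPLE = r"""IM,D0022,D,0,@DISK001;DATA\IMAGES00;D0022.TIF;2,0
IM,D0023,D,0,@DISK001;DATA\IMAGES00;D0023.TIF;2,0
IM,D0024, ,0,@DISK001;DATA\IMAGES00;D0024.TIF;2,0
IM,D0025,D,0,@DISK001;DATA\IMAGES00;D0025.TIF;2,0
"""


@dataclass
class ImageRecord:
    image_id: str          # record identifier, usually the Bates number
    starts_document: bool  # True when the break flag is "D" (assumed meaning)
    volume: str
    folder: str
    filename: str


def parse_lfp_line(line):
    """Parse one IM (image) line into an ImageRecord."""
    fields = line.strip().split(",")
    if len(fields) < 5 or fields[0] != "IM":
        raise ValueError(f"not an IM line: {line!r}")
    parts = fields[4].split(";")
    if len(parts) < 3:
        raise ValueError(f"unexpected image location: {fields[4]!r}")
    return ImageRecord(
        image_id=fields[1],
        starts_document=fields[2].strip() == "D",
        volume=parts[0].lstrip("@"),
        folder=parts[1],
        filename=parts[2],
    )


def group_documents(lines):
    """Group page images into documents using the break flag."""
    documents = []
    for line in lines:
        if not line.startswith("IM,"):
            continue  # skip blank lines and line types this sketch ignores
        record = parse_lfp_line(line)
        if record.starts_document or not documents:
            documents.append([record])
        else:
            documents[-1].append(record)
    return documents


if __name__ == "__main__":
    for pages in group_documents(SAMPLE.splitlines()):
        print(pages[0].image_id, len(pages), "page(s)")
```

Under these assumptions, the four sample lines describe three documents: D0024 has no break flag, so it is treated as the second page of the document that begins at D0023, while D0022 and D0025 are one-page documents.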

There may be times when you will receive multiple types of load files within the same set of documents.  Some of the files may contain the same information, but are designed to work with different database programs.  When working with vendors, let them know what litigation support database programs you intend to use so that they give you compatible load files. 

In the event you receive load files that are not designed for your database program, you may need to convert the file to make it compatible.  Fortunately there are a few free load file conversion programs available.  Two such programs are: 

    1. ReadyConvert from Compiled Services (compiledservices.com)
    2. iConvert+ from IPRO Tech (iprotech.com)

To find out more about how load files can best be used to interact with your existing litigation support applications, refer to the help and support documents of the program.  Quite often, these are the best resource for describing how load files interact with the case database and will often demonstrate the load file import process.