Searchable Scanned Documents
Can You Really Find That Needle in the Haystack?
Scanned PDFs are typically just images. That can be hard to believe when we think about all the things we can do with a PDF, including using it to create forms that can be filled out electronically and having it act as a container for a variety of other document formats. But scanned paper only becomes searchable once it has been OCRed.
What is OCR, you might ask?
OCR stands for Optical Character Recognition. I know that’s not much help, but two of my goals are to provide you with interesting dinner conversation and obscure words for the latest Geek version of Trivial Pursuit. The third, and most practical, goal is to allow you to communicate with your team and with vendors about getting your PDFs into a searchable format.
The process of having a document OCRed is actually quite miraculous, and the technical aspects can get quite complex. What is most important to note is that OCRing a PDF captures the text shown on each page and layers it underneath the image, so that the words you see on the image become searchable. However, there is always a chance that the OCR within a document is of poor quality, which makes text searches unreliable: a word that was misread during OCR will not be found when you search for it.
Why might you get poor quality OCR, you ask?
Most likely, it has to do with the quality of the document being scanned into PDF, but here are some additional reasons to consider:
- handwriting
- the original paper is of poor quality (e.g., a photocopy of a photocopy of a fax)
- scanned in grayscale instead of black and white
- scanned at a low resolution
- black and white documents scanned as color
- foreign language
- graphics or lines on the page
- size of font and type of font on the original document
- …there are always more…
TIFFs with Associated Text Files
TIFFs are a lot like pictures or static images. Unlike PDFs, the text seen on a TIFF can only be captured in a separate file called an Associated Text File, and the image and the text can only be married together to make the TIFF searchable in applications we like to call Evidence Review Platforms (ERPs).
TIFFs can come as single-page documents or multi-page documents. If they come as single-page documents, the only way you would know where one document ends and the next begins is by looking at a load file that uses document ID numbers to identify those document breaks. Again, load files are designed to be interpreted in an ERP alongside your set of single-page TIFFs, and when all of those pieces of information are put together, you are able to move, within the ERP, from one document to the next as well as search through the captured text.
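To make the idea of document breaks a little more concrete, here is a minimal Python sketch of how a load file might be read. The layout is an assumption made up for illustration (page ID, volume, image path, and a "Y" flag on the first page of each document); real load files come in several formats, and an ERP handles all of this for you.

```python
import csv
import io

# Hypothetical load file: one row per single-page TIFF.
# Assumed columns: page ID, volume, image path, document-break flag.
# A "Y" in the last column marks the first page of a new document.
SAMPLE_LOAD_FILE = """\
ABC0001,VOL01,IMAGES/ABC0001.tif,Y
ABC0002,VOL01,IMAGES/ABC0002.tif,
ABC0003,VOL01,IMAGES/ABC0003.tif,Y
ABC0004,VOL01,IMAGES/ABC0004.tif,
ABC0005,VOL01,IMAGES/ABC0005.tif,
"""


def group_pages_into_documents(load_file_text):
    """Return a dict mapping each document ID to its list of page images."""
    documents = {}
    current_doc_id = None
    for row in csv.reader(io.StringIO(load_file_text)):
        if not row:
            continue  # skip blank lines
        # Pad short rows so a missing break flag reads as empty.
        page_id, _volume, image_path, doc_break = (row + [""] * 4)[:4]
        # Start a new document on a break flag (or on the very first row).
        if doc_break.strip().upper() == "Y" or current_doc_id is None:
            current_doc_id = page_id
            documents[current_doc_id] = []
        documents[current_doc_id].append(image_path)
    return documents


documents = group_pages_into_documents(SAMPLE_LOAD_FILE)
for doc_id, pages in documents.items():
    print(doc_id, len(pages), "page(s)")
```

In this made-up sample, five single-page TIFFs resolve into two documents: one of two pages and one of three. Without the break flags there would be no way to tell that from the images alone.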
However, just like the OCR of a PDF, Associated Text Files can vary in quality depending on the quality of the TIFF, the quality of the program used to create the Associated Text File, and many other factors. With all these moving parts, it is hard to really review TIFFs efficiently without some sort of specialized application – like an ERP. Once TIFFs are placed into an ERP, though, the load file and the Associated Text File fall right into place and allow you to perform text searches and to move fairly easily from one document to the next.
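Here is a small Python sketch of why the quality of the captured text matters for searching. Everything in it is hypothetical: the document IDs and text are invented, and a plain dictionary stands in for the Associated Text Files that an ERP would actually load from disk.

```python
# Hypothetical associated text, keyed by document ID.
# The third entry imitates poor-quality OCR: "lnvoice" starts with a
# lowercase L, and several letters were misread as digits.
ASSOCIATED_TEXT = {
    "ABC0001": "Invoice for consulting services rendered in March.",
    "ABC0003": "Meeting notes: the invoice dispute was discussed at length.",
    "ABC0006": "lnvoice f0r c0nsulting serv1ces",
}


def search_documents(texts, term):
    """Return the sorted IDs of documents whose text contains the term."""
    needle = term.lower()
    return sorted(
        doc_id for doc_id, text in texts.items() if needle in text.lower()
    )


hits = search_documents(ASSOCIATED_TEXT, "invoice")
print(hits)
```

The search finds the first two documents but misses the third, even though a person looking at that page image would plainly read the word "Invoice". That is the needle slipping through the haystack.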
Part III: Unitization – Where One Document Ends, Another Begins…