eDiscovery, or electronic discovery, is the process of identifying, collecting, and analyzing electronically stored information (ESI) in order to be used as evidence in legal cases. This process can be time-consuming and costly, as it often involves manually reviewing large amounts of data. However, advances in artificial intelligence (A.I.) have opened up new opportunities for streamlining the eDiscovery process. One such technology is ChatGPT, a large language model developed by OpenAI.
ChatGPT is a powerful tool for natural language processing (NLP) that can understand and generate human-like text. This makes it an ideal candidate for use in eDiscovery, as it can quickly and accurately analyze large amounts of ESI in order to identify relevant information. For example, ChatGPT can be used to identify specific keywords or phrases within a document, classify documents by type, or even summarize the content of a document.
The introductory paragraphs above were generated by ChatGPT in response to a request to write a blog post on ChatGPT and eDiscovery. They are an example of how ChatGPT can generate text in such a way that one cannot immediately tell whether it was written by a machine or a human. This blog post offers initial takes on the potential ramifications of ChatGPT and similar Artificial Intelligence (A.I.) tools for the work CJA panel attorneys and federal defenders do. It does not advocate any specific position regarding A.I. technology, which has wide-ranging and yet-to-be-realized implications in many fields. The goal is to provide a general idea of how this new A.I. technology might impact our work.
What is ChatGPT?
The current version of ChatGPT, 3.5, was released in late 2022 (openai.com/blog/ChatGPT). It is an artificial intelligence tool built on a natural language processing model known as a Generative Pre-trained Transformer (‘GPT’), or ‘generative A.I.’, developed by OpenAI. ChatGPT is well suited to generating human-like text to help solve problems. This can include answering questions, summarizing or translating large volumes of text, generating lines of code, or providing step-by-step, conversational instructions for a wide range of complex software applications.
ChatGPT is trained on a massive corpus of datasets drawn from many publicly available domains on the internet, including Google, the Wayback Machine, GitHub, WordPress, Wikipedia, and so forth. However, it is not connected to the internet in real time and has limited knowledge of the world and events after 2021. This means it can occasionally produce inaccurate information, a problem that OpenAI acknowledges (help.openai.com/en/articles/6783457-chatgpt-general-faq). In some instances it will tell you it doesn’t know; in others it will provide an answer with a disclaimer. It can also provide an authoritative-sounding answer that is wrong, without any qualifier. It has even been known to fill in the gap with made-up information. For example, eDiscovery expert Ralph Losey asked the robot to identify the top five eDiscovery cases for 2022. Since it did not have any 2022 cases to reference, it ignored the date, listed only 2021 cases, and even made up the name of a judge! (ediscoverytoday.com/2023/01/02/ai-top-cases-of-2022-doesnt-include-any-cases-from-2022-artificial-intelligence-trends/)
In response to these sorts of user experiences, OpenAI recently sent out a tweet warning that ChatGPT is useful for general information in subject areas such as language, science, engineering, finance, history, and culture, and less suitable for high-context or niche areas such as legal advice and real-time events (twitter.com/openaicommunity).
Can ChatGPT be used for discovery review?
Artificial Intelligence models based on natural language processing have been deployed extensively in eDiscovery for some time. Foremost among these approaches is Technology Assisted Review (TAR), which uses algorithms to identify and highlight relevant information based on input from subject matter experts. This technique helps reduce attorney review time and thereby creates time, cost, and workflow efficiencies.
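To make the idea behind TAR more concrete, here is a minimal sketch in Python. It only illustrates the general principle of learning from reviewer input and is not how any particular TAR product works. The function names (train, score, rank), the scoring method (a simple naive-Bayes-style word weighting with add-one smoothing), and the sample documents are all assumptions invented for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_docs):
    """Learn a weight per word from (text, is_relevant) pairs coded by a reviewer.

    A positive weight means the word appeared relatively more often in
    documents the reviewer marked relevant; a negative weight means the opposite.
    """
    relevant, other = Counter(), Counter()
    for text, is_relevant in labeled_docs:
        (relevant if is_relevant else other).update(tokenize(text))
    vocab = set(relevant) | set(other)
    # Add-one smoothing so unseen words never produce log(0).
    rel_total = sum(relevant.values()) + len(vocab)
    oth_total = sum(other.values()) + len(vocab)
    return {
        word: math.log((relevant[word] + 1) / rel_total)
        - math.log((other[word] + 1) / oth_total)
        for word in vocab
    }


def score(weights, text):
    """Sum the learned weights of the words in a document (unknown words count as 0)."""
    return sum(weights.get(word, 0.0) for word in tokenize(text))


def rank(weights, docs):
    """Order unreviewed documents from most to least likely relevant."""
    return sorted(docs, key=lambda doc: score(weights, doc), reverse=True)


# Hypothetical reviewer decisions and unreviewed documents, invented for illustration.
LABELED = [
    ("wire transfer to offshore account", True),
    ("payment routed through shell company account", True),
    ("lunch menu for friday", False),
    ("office party schedule friday", False),
]
UNREVIEWED = [
    "reminder about friday lunch",
    "second wire transfer to the shell company",
]

if __name__ == "__main__":
    weights = train(LABELED)
    for doc in rank(weights, UNREVIEWED):
        print(round(score(weights, doc), 2), doc)
```

In this toy version the reviewer's coding decisions are the only "expert input". Production TAR tools use far more sophisticated models and typically keep retraining as attorneys code more documents.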
Since TAR and generative A.I. are both based on the natural language processing branch of artificial intelligence (Figure 1), one might assume that ChatGPT’s ability to generate human-like information about a broad and complex range of data sets could be easily applied to eDiscovery to enhance eDiscovery review methods such as TAR. Indeed, in the second introductory paragraph above, ChatGPT generated text that describes common eDiscovery tasks that artificial intelligence software can perform under the proper conditions. But it also wrote that it, ChatGPT, could do these types of tasks. While it is true that ChatGPT can perform these tasks based on information it has been trained on, it was not designed to perform eDiscovery tasks, and OpenAI has not developed a version of the GPT technology that can be utilized for eDiscovery. Furthermore, even if the underlying GPT-3.5 model could be developed for an eDiscovery environment, the immense computing resources it currently requires, designed for vast amounts of data, would make it non-scalable and cost-prohibitive (law.com/legaltechnews/2023/01/25/what-will-eDiscovery-lawyers-do-after-chatgpt/).
What can ChatGPT do right now?
ChatGPT has more direct application in terms of workflow and analysis. Discovery in criminal cases increasingly includes both structured (databases, spreadsheets) and unstructured (documents, videos, audio files, phone extractions, social media, emails) data. Currently, most workflows designed to integrate and synthesize these heterogeneous formats are necessarily cumbersome, requiring a patchwork of approaches. Many readily available open source tools (e.g. OpenRefine, referenced below) and applications such as Microsoft Excel, which can be helpful to practitioners, are under-utilized, if leveraged at all. ChatGPT has the potential to help bridge the gap between the utility of these applications and practitioners’ ability to utilize them.
For example, below (Figure 2) is a screenshot showing ChatGPT’s response to a question about importing a CSV file into CaseMap (a fact and case organization and analysis tool; see nlsblog.org/2011/10/05/cja-panel-attorney-software-discounts). ChatGPT was able to provide a step-by-step guide, and the feedback is helpful, but it does not amount to specific, practical instructions for carrying out the import into a CaseMap database: there are better and more efficient ways to import a CSV file into CaseMap than what ChatGPT prescribed. This is due to the limited information about CaseMap built into the OpenAI model.
In our second example (Figure 3), we see how ChatGPT can help us deal with CSV files containing ‘messy’ data, in this case duplicate rows in a spreadsheet. It provided guidance on how to use a tool called OpenRefine (openrefine.org) to ‘clean up’ the spreadsheet.
Since OpenRefine is a free, open source tool, ChatGPT was able to provide more accurate information than one might expect when dealing with ‘closed’, proprietary tools such as CaseMap.
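For readers comfortable with a little scripting, the same kind of duplicate-row cleanup can also be sketched in a few lines of Python using only its built-in csv module. This is not the OpenRefine procedure ChatGPT described, just an alternative illustration of the task. The function names and the sample rows are hypothetical, and the sketch assumes that "duplicate" means every value in two rows matches once stray spaces are trimmed.

```python
import csv
import io


def drop_duplicate_rows(rows):
    """Return the rows with exact duplicates removed, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        # Compare rows with surrounding whitespace trimmed from every value.
        key = tuple(cell.strip() for cell in row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def dedupe_csv_text(text):
    """Parse CSV text, drop duplicate rows, and return the cleaned CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(drop_duplicate_rows(rows))
    return out.getvalue()


# Hypothetical sample data, invented for illustration only.
SAMPLE = (
    "date,sender,subject\n"
    "2021-03-01,alice@example.com,Invoice\n"
    "2021-03-01,alice@example.com,Invoice\n"
    "2021-03-02,bob@example.com,Meeting\n"
)

if __name__ == "__main__":
    print(dedupe_csv_text(SAMPLE))
```

Run on the sample, the repeated "Invoice" row is dropped and the header and the two distinct rows are kept in their original order. Real discovery spreadsheets often contain near-duplicates (different capitalization, date formats, and so on) that a tool such as OpenRefine is better equipped to handle.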
The need to harness software to effectively work our cases will only increase as data complexity continues to ratchet up. ChatGPT can help facilitate the utilization and adoption of open source and business applications in response to these challenges, lowering the bar to access by providing on-demand, human-like support to practitioners. This can help with the ‘trees’ we believe are relevant to our cases, e.g. a subset of files responsive to a search query. This still leaves the ‘forest’: the large tranches of discovery which we load into review platforms such as Eclipse SE and Casepoint to parse and organize the data. Whether or how the generative A.I. technology underlying ChatGPT will have an impact in this latter arena remains to be seen.
 Also known as predictive coding, computer assisted review, or supervised machine learning.
 A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record, and usually consists of tabular data from a database. The CSV file format is supported by a wide variety of business applications, including MS Excel (en.wikipedia.org/wiki/Comma-separated_values).