The faster you adopt, the more likely you are to be riding, rather than resisting, the innovation that is changing legal work as we speak.
There is so much you can do with what is already available cheaply, commercially and safely that knowing how to implement this technology in your legal work will give you quicker returns than trying to fight big tech (by developing new language models, for instance).
If you want to try to use the GPT API for yourself, I wrote a short course on this blog which might be of interest to the serious learners among you. For those of you who are still afraid to tinker with code, the ideas here will introduce you to legal review using OpenAI’s GPT models.
How do you get GPT to look at your contract or paperwork?
Let's start at the beginning. For GPT models to do anything for you, you have to provide an input, also known as a prompt. This could include, for example, the text you want analysed and the questions you want addressed.
Prompts are a complex subject, so complex that even the folks who built the language models don’t completely understand how prompts work. Engineering prompts is now a whole field of study, with the AI stars scrambling to make new courses around this.
For now, apart from learning as much of the theory as you can understand, the best way to make things work really is trial and error - or hanging out with really smart data scientists.
OpenAI's GPT 3.5 Turbo and GPT 4 prompts work in the following manner.
You are required to give OpenAI a prompt that covers the necessary aspects. The explanations here are based on what has worked well for us.
Role: The prompt you provide must answer the question - what is the role of the AI assistant? In this case, a legal counsel. Simple enough.
System prompt: This prompt should inform the language model what must be done and how. For example: Review this legal notice and analyse what is being demanded of the recipient. Your client is the recipient. Here is the legal notice - {draft of the legal notice}.
User prompt: This prompt should answer the question - in what format do you want the response? For example: Provide a 300 word memo addressed to the client regarding the demands set out in the legal notice.
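To make this concrete, here is a minimal sketch of how the role, system prompt and user prompt can map onto a chat completion call using the openai Python package (the pre-1.0 ChatCompletion interface). The model name, placeholder notice text and the folding of the role into the system message are illustrative assumptions, not our production setup.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # assumption: supply your own key via config

notice_text = "..."  # placeholder: the legal notice you want reviewed

# The system message carries the role and the instructions;
# the user message carries the requested output format.
response = openai.ChatCompletion.create(
    model="gpt-4",  # or "gpt-3.5-turbo"
    temperature=0,  # keep the review as deterministic as possible
    messages=[
        {
            "role": "system",
            "content": (
                "You are a legal counsel. Review this legal notice and analyse "
                "what is being demanded of the recipient. Your client is the "
                f"recipient. Here is the legal notice: {notice_text}"
            ),
        },
        {
            "role": "user",
            "content": (
                "Provide a 300 word memo addressed to the client regarding "
                "the demands set out in the legal notice."
            ),
        },
    ],
)

print(response["choices"][0]["message"]["content"])
```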
If you want to experiment with prompts, you can head over to the OpenAI Playground and try shorter passages of text to see how well this prompt structure works. You might develop it further and refine it for your use cases. In our experience, after a while you will start seeing higher accuracy and usability for your real use cases.
What are the limitations?
Prompt structure.
Because the model is trained on a large amount of public data, the documents you ask it to review might not respond well to the prompt structure you have in mind.
While many see the solution as building new language models (why not?), our approach has been to work harder on prompts. It doesn't fail, but it requires training, patience, commitment and practice. Not unlike, say, the piano.
Over many failures (and some successes), I have come to realise why prompts are complex. While there are things to learn, you have to develop an instinct for them. You can’t lift these weights well enough without building and strengthening new muscles upstairs. Having neurons build new connections, and developing a new mental muscle memory…
Formats.
The formats you use (such as Word or PDF) can present their own set of challenges. Working in the language processing space has taught us the hard way that format issues are often bigger challenges than the review problems themselves. While many use cases are simple and solvable, there are some where format issues rear their ugly head and contribute to inaccuracy no matter what.
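If you are tinkering in Python, the extraction step might look roughly like the sketch below. The pypdf and python-docx packages are common choices, not necessarily what we use, and scanned (image-only) PDFs will still need OCR that this sketch does not handle.

```python
from pathlib import Path

from docx import Document  # pip install python-docx
from pypdf import PdfReader  # pip install pypdf


def extract_text(path: str) -> str:
    """Pull plain text out of a .pdf or .docx file before prompting."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        reader = PdfReader(path)
        # extract_text() can return None for image-only (scanned) pages
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix == ".docx":
        doc = Document(path)
        return "\n".join(paragraph.text for paragraph in doc.paragraphs)
    raise ValueError(f"Unsupported format: {suffix}")
```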
Length of documents.
One of the constraints of OpenAI's models is the 4-8K token limit (32K in the limited-release GPT 4). Token limits are counted across the input and the output combined. It is longer legal documents that need automated review the most, and lawyers cannot get results with integrity if they split those documents up and feed the pieces one by one into the ChatGPT text box or the OpenAI Playground.
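One practical way to see whether a document will fit is to count its tokens before sending anything. Here is a small sketch using OpenAI's tiktoken library (the library choice and file name are assumptions, and remember the budget has to cover your prompts and the model's answer too):

```python
import tiktoken  # pip install tiktoken


def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Count how many tokens a piece of text will consume for a given model."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


contract = open("contract.txt").read()
used = count_tokens(contract)
# Leave headroom for the system/user prompts and the model's response.
print(f"{used} tokens used by the document alone, against a ~4,096 token budget")
```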
Our solution (not very different from what others do with large-token problems in AI use cases) involves using the OpenAI API, splitting the documents, and having the results update each time a new part of the document is analysed. This is the simplest and most rudimentary way to do it, also known as "serialised prompting" or "chain prompting". Yesterday, we shared the first set of our experiments with a limited number of law practitioners who joined us in trying this out. If you are interested in participating next time, you can become a part of our trials by signing up now here.
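For the curious, the serialised approach looks roughly like the following: split the document into parts that fit under the limit, review each part, and feed the running findings back in with the next part. This is a simplified sketch of the general pattern; the chunk size, prompts and model are illustrative assumptions, not our exact pipeline.

```python
import openai

openai.api_key = "YOUR_API_KEY"

CHUNK_CHARS = 8000  # rough character budget per chunk; tune against token counts


def split_document(text: str, size: int = CHUNK_CHARS) -> list[str]:
    """Naive splitter; in practice you would split on clause or section boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def review_document(text: str, question: str) -> str:
    """Review a long document part by part, carrying findings forward each time."""
    findings = "Nothing reviewed yet."
    for part in split_document(text):
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=[
                {"role": "system",
                 "content": "You are a legal counsel reviewing a long document "
                            "one part at a time. Update the running findings "
                            "using only what appears in the new part."},
                {"role": "user",
                 "content": f"Question: {question}\n\n"
                            f"Findings so far: {findings}\n\n"
                            f"Next part of the document:\n{part}\n\n"
                            "Return the updated findings."},
            ],
        )
        findings = response["choices"][0]["message"]["content"]
    return findings
```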
Cost.
GPT 3.5 Turbo is powerful and can accomplish review tasks. What we have noticed, though, is that depending on the kind of review problem and document, a percentage of the results (0-30%) can be either outright wrong or, worse, 'hallucinations'!
GPT 4, on the other hand, is just impressive.
This is not to say that you should rely on GPT 4 100% without testing the results, but you will be starting from a very good place indeed!
As good as GPT 4 might be, it costs 60x (yes, sixty times!) as much as GPT 3.5 Turbo. Put the two together and you still have a cost-versus-performance dilemma with no straight answer. Before the costs scare you, remember: we are talking cents and single-digit dollars for large contracts with GPT 4, and a minuscule number of cents using GPT 3.5 Turbo.
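As a rough back-of-the-envelope check, here is the arithmetic. The document size and per-1K-token rates below are assumptions for illustration only; check OpenAI's current price list before relying on them.

```python
# Hypothetical worked example: a 60,000-token contract reviewed end to end.
GPT_35_TURBO_PER_1K = 0.002  # USD per 1K tokens (assumed rate)
GPT_4_PER_1K = 0.12          # USD per 1K tokens (assumed rate, ~60x higher)

tokens_processed = 60_000

print(f"GPT 3.5 Turbo: ~${tokens_processed / 1000 * GPT_35_TURBO_PER_1K:.2f}")  # ~$0.12
print(f"GPT 4:         ~${tokens_processed / 1000 * GPT_4_PER_1K:.2f}")          # ~$7.20
```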
If you were to ask me, just mind-boggling! These are truly interesting times.