On May 21, 2024, a report published by MIT researcher Eric Martínez cast doubt on a much-publicized claim: that OpenAI’s GPT-4 model had outperformed 90% of trainee lawyers on the bar exam. The finding challenges the earlier media hype and raises important questions about AI’s real capabilities in legal contexts.
The Initial Claim and Its Impact
In March 2023, OpenAI announced that GPT-4 had scored in the top 10% of test-takers on the Uniform Bar Examination (UBE). The claim generated widespread excitement and speculation about AI’s potential to revolutionize the legal profession. New research, however, suggests that it was significantly overstated.
The New GPT Bar Exam Findings
Eric Martínez, who published his study in the journal Artificial Intelligence and Law, found that GPT-4’s score of 298 out of 400 on the bar exam did not support the headline percentile. The top-10% ranking was skewed because the comparison group consisted of repeat test-takers: individuals who had previously failed the exam and who generally score lower than first-time examinees.
When compared to all test-takers, GPT-4 landed in the 69th percentile on the bar exam. More strikingly, when compared to first-time test-takers, the model only scored in the 48th percentile. These findings suggest that GPT-4’s performance was considerably less impressive than initially reported.
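The crux of the critique is that a percentile is only meaningful relative to its comparison group. A minimal sketch makes this concrete; the score distributions below are entirely made up for illustration and are not the study’s data:

```python
def percentile(score, population):
    """Percent of the population scoring at or below `score`."""
    return 100 * sum(s <= score for s in population) / len(population)

# Hypothetical UBE score samples for three comparison groups.
repeat_takers = [230, 240, 250, 255, 260, 265, 270, 275, 280, 290]
all_takers    = [250, 260, 270, 280, 290, 295, 300, 305, 310, 320]
first_timers  = [270, 285, 295, 300, 305, 310, 315, 320, 325, 330]

gpt4_score = 298
print(percentile(gpt4_score, repeat_takers))  # 100.0 -- looks dominant
print(percentile(gpt4_score, all_takers))     # 60.0  -- middling
print(percentile(gpt4_score, first_timers))   # 30.0  -- below average
```

The raw score never changes; only the population it is ranked against does, which is exactly how a single 298 can be reported as "top 10%" against repeat takers yet land near the median against first-time examinees.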
Essay Writing and Real-World Applications
Martínez’s research also found that GPT-4 struggled significantly with the written components of the bar exam, performing particularly poorly on the Multistate Essay Examination (MEE): it ranked in the 48th percentile overall and just the 15th percentile among first-time test-takers. This is a critical point, as essay writing closely mirrors the practical tasks performed by practicing lawyers.
The study also identified methodological flaws in the original research, particularly in how the essays were graded. Rather than applying the standard grading guidelines set by the National Conference of Bar Examiners, the original researchers compared GPT-4’s answers to “good answers” from Maryland, a deviation that may have contributed to the inflated performance claims.
Implications for the Legal Profession
These findings are a crucial reminder that while AI advancements are impressive, their application in professional fields like law must be approached with caution. Martínez noted that although GPT-4 showed remarkable improvements on the bar exam over its predecessor, GPT-3.5, its poor performance on tasks that resemble real legal work suggests that large language models alone are not yet reliable for such critical applications.
Broader Context and Cautionary Notes
The excitement around AI’s potential in the legal field is understandable, but Martínez’s study underscores the need for rigorous evaluation. AI systems are prone to hallucinations—fabricating facts or connections that don’t exist—which could be particularly harmful in legal contexts. A recent example is a federal appeals court judge’s suggestion that AI could help interpret legal texts, highlighting the growing interest and potential risks.
Conclusion: Bar Exams are for Lawyers, not GPTs
In light of these findings, it’s clear that while GPT-4 and similar AI models have made significant strides, their capabilities must be carefully scrutinized before being integrated into legal practice. The legal profession demands precision and reliability, qualities that current AI systems have yet to fully achieve. Therefore, stakeholders must remain vigilant and critical of AI’s role in law to prevent unintended consequences.
For further insights and updates on AI and its impact on the legal profession, subscribe to our newsletter and stay informed about the latest developments in this rapidly evolving field.

