
AI in the Legal System: Predictive Policing, Algorithmic Sentencing, and the Future of Justice

March 2026 · 17 min read · EasWrk Research

When the Algorithm Takes the Stand

In a courtroom in Wisconsin in 2016, a judge sentenced Eric Loomis to six years in prison, partly on the basis of a risk score generated by a proprietary algorithm called COMPAS. Loomis argued that using a secret algorithm to determine his sentence violated his due process rights because neither he nor his attorney could examine or challenge the factors that produced the score. The Wisconsin Supreme Court upheld the sentence. The algorithm stayed secret. The score stayed in the record. Loomis went to prison.

The Loomis case is neither the beginning nor the end of the story of AI in the legal system. It is a moment of crystallization, a point at which a set of debates that had circulated in academic journals and advocacy reports suddenly had a name, a court docket number, and a human being sitting at the center of them. Since then, AI-assisted decision-making has expanded substantially throughout the American criminal justice system and legal profession, and the debates it has provoked have grown more urgent rather than less.

This paper examines three distinct but intersecting domains in which AI is reshaping legal and criminal justice processes: predictive policing and crime prevention; risk assessment in bail, sentencing, and parole decisions; and the use of AI tools in legal practice itself. Across all three domains, a common set of concerns emerges: the opacity of algorithmic decision-making, the systematic reproduction and amplification of historical biases, the erosion of constitutional due process protections, and the concentration of discretionary power in systems whose internal logic the affected individuals cannot scrutinize.

These are not theoretical concerns. They are playing out in real cases, affecting real people, and they are doing so faster than courts, legislatures, or bar associations have been able to respond. The gap between the deployment of AI in the justice system and the legal frameworks needed to govern that deployment is wide and growing.

150+: U.S. jurisdictions using some form of predictive policing or risk assessment tools
77% vs. 24%: rates at which COMPAS recidivism scores classified Black and white defendants, respectively, as high risk (ProPublica, 2016)
6: wrongful facial recognition arrests documented publicly in the United States by 2024

Predictive Policing: The Technology and Its Deployment

Predictive policing refers to the use of data analysis, statistical modeling, and, increasingly, machine learning to forecast where crimes are likely to occur or to identify individuals deemed likely to commit crimes. The technology has been deployed in various forms across hundreds of jurisdictions in the United States and in law enforcement agencies worldwide.

Place-based predictive policing tools, such as PredPol (now rebranded as Geolitica), generate maps indicating areas with elevated predicted crime risk. Officers are directed to patrol these areas more intensively. The underlying assumption is that past crime patterns are predictive of future crime locations. The statistical logic is borrowed from earthquake prediction: seismic models show that aftershocks cluster around the site of major quakes, and similar clustering patterns appear in crime data.
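
To make that analogy concrete, the sketch below shows a self-exciting point process of the kind place-based tools are often described as using, reduced to one dimension (time only). The parameters MU, ALPHA, and BETA are illustrative values, not figures from any deployed system.

```python
import math

# A one-dimensional (time-only) sketch of a self-exciting point process:
# each recorded event temporarily raises the predicted rate, which then
# decays back toward a baseline. All parameters are illustrative.
MU = 0.2                 # baseline event rate
ALPHA, BETA = 0.5, 1.0   # excitation size and decay speed

def intensity(t: float, past_events: list[float]) -> float:
    """lambda(t) = mu + sum over past events of alpha * exp(-beta * (t - t_i))."""
    return MU + sum(ALPHA * math.exp(-BETA * (t - ti))
                    for ti in past_events if ti < t)

events = [1.0, 1.2, 1.3]        # a burst of recorded incidents
print(intensity(1.4, events))   # elevated right after the burst (~1.4)
print(intensity(5.0, events))   # decayed nearly back to baseline (~0.23)
```

Each recorded incident temporarily raises the predicted rate nearby, which is exactly why the data fed into the model matter so much: the "events" driving the forecast are recorded crimes, not crimes.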

Person-based predictive policing goes further. Systems such as the Chicago Police Department's Strategic Subject List assigned individuals a numerical score indicating their estimated risk of involvement in gun violence, either as perpetrator or victim. The score was derived from factors including prior arrests, gang affiliations, and social connections to individuals who had been shot or arrested. People with high scores were subject to proactive police contact, including home visits.

The AI Now Institute's 2018 report documented the expansion of these systems and raised foundational concerns about their validity and fairness. The core problem is circular: predictive models trained on historical crime data reflect historical policing patterns, not the actual distribution of crime. If police have historically concentrated enforcement in predominantly Black and low-income neighborhoods, those neighborhoods will have disproportionately high rates of documented crime, not necessarily higher rates of actual crime. A model trained on documented crime data will predict more crime in those neighborhoods, directing more police there, generating more arrests, and feeding more data back into the model. The feedback loop produces predictions that appear statistically valid but are, in essence, a formalization of prior enforcement bias.
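
The loop can be made concrete with a stylized simulation. In the sketch below, two neighborhoods have identical true crime rates, but the historical record is skewed three to one; patrols follow the record, and crime is only recorded where patrols are present. All numbers are invented for illustration.

```python
# A stylized feedback loop: two neighborhoods with IDENTICAL true crime
# rates but a skewed historical record. Patrols follow the record, and
# the record grows where the patrols are, so the skew never corrects.
TRUE_RATE = 0.05            # hypothetical true crime rate, same in both
POPULATION = 10_000         # residents per neighborhood
recorded = [300.0, 100.0]   # historical recorded crime: 3:1 enforcement skew

for period in range(5):
    total = sum(recorded)
    # The "predictive model": patrol where the recorded data say crime is.
    patrol_share = [r / total for r in recorded]
    for i in range(2):
        # Crime is only *recorded* where officers are present to see it.
        recorded[i] += TRUE_RATE * POPULATION * patrol_share[i]
    print(f"period {period}: patrol share = "
          f"{patrol_share[0]:.0%} / {patrol_share[1]:.0%}")
```

The patrol share stays at 75/25 in every period: the initial enforcement skew reproduces itself indefinitely, and the model's predictions keep "validating" against data that the patrols themselves generate.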

Berk (2017) examined the effect of algorithmic risk forecasts on parole board decisions and found that boards were significantly influenced by the scores, even though the predictive validity of those scores was modest: high-risk scores predicted recidivism only somewhat better than chance, while the costs of false positives (people labeled high risk who would not have reoffended) were substantial. Those false positives were not randomly distributed. They fell disproportionately on Black and lower-income individuals.

The city of Los Angeles piloted predictive policing through a program called Operation LASER, which combined place-based hot-spot targeting with person-based "chronic offender" lists, before suspending it in 2019 following a report by the LAPD inspector general, who found that the system had targeted people based on weak evidence and that officers had circumvented oversight requirements. Santa Cruz, California, became the first U.S. city to ban predictive policing outright in 2020. These local responses, however, remain the exception. The majority of jurisdictions using predictive tools continue to do so with limited transparency and minimal independent evaluation.

COMPAS and Risk Assessment in Sentencing

The Correctional Offender Management Profiling for Alternative Sanctions, known as COMPAS, is one of the most widely studied and contested AI tools in the criminal justice system. Developed by Equivant (formerly Northpointe), COMPAS generates risk scores across multiple dimensions, including general recidivism risk, violent recidivism risk, and pretrial failure risk. These scores are used by courts and corrections officials in at least 35 states.

In 2016, ProPublica published an investigation, Machine Bias, that analyzed COMPAS scores for more than 7,000 defendants in Broward County, Florida. The investigation found that Black defendants were nearly twice as likely as white defendants to be falsely flagged as future criminals, while white defendants were more likely to be incorrectly labeled low risk and go on to reoffend. These findings ignited a debate that continues today.

Equivant disputed the ProPublica analysis, arguing that the comparison was inappropriate because it applied different statistical fairness criteria than the company had used. This dispute revealed a genuine and unresolved problem in the field: different mathematical definitions of fairness are often mutually exclusive. When the underlying base rates of reoffending differ between groups, a model cannot simultaneously be calibrated (equally accurate across groups), achieve equal false positive rates across groups, and achieve equal false negative rates across groups. Any choice of optimization criterion privileges some version of fairness over others, and that choice is inherently a political and ethical decision, not a technical one.
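
A small worked example makes the incompatibility concrete. The score below is perfectly calibrated in both groups (the probability of reoffending given a score equals the score), but because the groups' base rates differ, a common threshold necessarily yields different error rates. All distributions are invented for illustration.

```python
# Impossibility demo: a score that is perfectly calibrated in BOTH groups,
# applied to groups with different base rates, necessarily produces
# different false positive / false negative rates. Numbers are invented.
groups = {
    # fraction of the group receiving each calibrated score
    "A": {0.8: 0.6, 0.2: 0.4},   # base rate = 0.6*0.8 + 0.4*0.2 = 0.56
    "B": {0.8: 0.2, 0.2: 0.8},   # base rate = 0.2*0.8 + 0.8*0.2 = 0.32
}
THRESHOLD = 0.5  # score >= threshold is flagged "high risk"

for name, dist in groups.items():
    fp = fn = pos = neg = 0.0
    for score, frac in dist.items():
        # Calibration: P(reoffend | score) == score, in both groups.
        pos += frac * score           # mass that reoffends
        neg += frac * (1 - score)     # mass that does not
        if score >= THRESHOLD:
            fp += frac * (1 - score)  # flagged high, did not reoffend
        else:
            fn += frac * score        # flagged low, did reoffend
    print(f"group {name}: base rate {pos:.2f}, "
          f"FPR {fp/neg:.1%}, FNR {fn/pos:.1%}")
```

The output shows false positive rates of roughly 27% versus 6% from a score that is, by construction, equally well calibrated for both groups. This mirrors the COMPAS dispute: Equivant pointed to calibration, ProPublica to the unequal error rates, and with unequal base rates a single threshold rule cannot satisfy both.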

Dressel and Farid (2018) published a study in Science Advances that made the debate significantly sharper. They found that COMPAS performed no better than predictions made by untrained human volunteers who were given only a defendant's age, sex, and criminal history. This raises a pointed question: if a simple model using a handful of variables performs as well as a complex proprietary system, what is the value added by that system, and what is the cost of its opacity?
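
The "simple model" half of that comparison is easy to picture: Dressel and Farid also reported that a two-feature linear classifier matched COMPAS's accuracy. The sketch below fits such a baseline on synthetic data; the feature construction and coefficients are stand-ins for illustration, not the study's actual data.

```python
# A Dressel & Farid-style baseline: logistic regression on two features
# (age, number of priors). Data below are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000
age = rng.integers(18, 70, n)
priors = rng.poisson(2.0, n)
# Synthetic ground truth: reoffending more likely for younger defendants
# with more priors (made-up coefficients, for illustration only).
logit = -0.5 - 0.04 * (age - 35) + 0.35 * priors
y = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([age, priors])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
print(f"two-feature baseline accuracy: {model.score(X_te, y_te):.2f}")
```

A model this simple is also fully inspectable: the two coefficients can be disclosed, examined, and challenged in open court, which is precisely what a proprietary system forecloses.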

"Different mathematical definitions of fairness are often mutually exclusive. Any choice of optimization criterion is inherently a political and ethical decision, not a technical one. COMPAS made that choice invisibly."

Mayson (2019) offered perhaps the most searching theoretical critique of risk assessment in the criminal justice context. Writing in the Yale Law Journal, she argued that any predictive risk instrument will reproduce racial and socioeconomic disparities if it is trained on data that reflects a history of racially and economically unequal treatment. This is not a flaw that can be engineered away. It is a structural consequence of using historical outcome data in a system where those outcomes were shaped by discrimination. The problem, she argued, is not bad algorithms. It is the use of prediction as a basis for punishment at all.

The Loomis decision did not resolve the underlying due process question. The Wisconsin court acknowledged that COMPAS scores should not be used as the determinative factor in sentencing, but it permitted their use as one input among many, without requiring that the algorithm be disclosed. Defense attorneys in subsequent cases have continued to struggle with the fundamental asymmetry: the prosecution can present a risk score as evidence; the defense cannot examine the logic that produced it.

Cahn (2022), writing for the ACLU, identified this asymmetry as the defining due process problem of algorithmic sentencing. The Sixth Amendment's confrontation clause gives defendants the right to confront the witnesses against them. Risk scores produced by closed-source algorithms are effectively unconfrontable. The defendant cannot cross-examine a proprietary model. The score enters the record as a fact when it is, at best, a contested statistical estimate.

Facial Recognition and Wrongful Arrest

Facial recognition technology represents a different but related application of AI in the legal system. Law enforcement agencies use facial recognition to identify suspects by matching images from surveillance cameras, social media, or police databases against a reference database of photographs. The technology has become more capable and more widely deployed over the past decade, with little consistent regulatory oversight.

The National Institute of Standards and Technology evaluated commercial facial recognition algorithms in 2019 and found significant variation in accuracy across demographic groups. The best-performing algorithms were less accurate on images of Black women than on images of white men, with false positive rates up to 100 times higher for some demographic comparisons. These disparities are not marginal. They are the difference between a tool that is useful and one that generates systematic errors against specific populations.

Garvie, Bedoya, and Frankle's 2016 report, The Perpetual Line-Up, found that facial recognition systems were in use by law enforcement agencies in at least 26 states, and that approximately half of American adults were in a facial recognition network. The report documented a near-complete absence of policies governing accuracy standards, testing requirements, or the use of results in investigations.

The consequences of inaccurate facial recognition in a law enforcement context are severe. Several documented cases of wrongful arrest based on facial recognition errors involve Black men who were arrested, detained, and in some cases held for extended periods based on a match that turned out to be incorrect. Robert Williams was arrested in his driveway in front of his children in 2020 after a facial recognition algorithm incorrectly matched his driver's license photo to a shoplifting suspect. He was held in custody for 30 hours before prosecutors agreed to drop the charges.

These cases are not isolated. They reflect a pattern produced by the interaction of algorithmic inaccuracy, sparse investigative follow-up, and a presumption that algorithmic outputs are reliable. When a facial recognition match is treated as probable cause rather than as a hypothesis to be investigated, the error rate of the underlying technology translates directly into wrongful arrests. The problem is compounded by the fact that defendants often do not know that facial recognition was used in their case, because law enforcement agencies do not consistently disclose the technology in discovery.
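
The arithmetic behind that point is unforgiving. In a one-to-many search, even a very small false match rate gets multiplied by the size of the gallery, so most candidate matches can be false even when the underlying system is highly "accurate." The back-of-the-envelope calculation below uses invented parameters.

```python
# Back-of-the-envelope: why a one-to-many search with a tiny per-comparison
# error rate still produces mostly false leads. Parameters are assumptions.
GALLERY_SIZE = 1_000_000    # photos searched against (e.g., a license DB)
FMR = 1e-5                  # assumed false match rate per comparison
P_SUSPECT_IN_GALLERY = 0.5  # chance the real suspect is in the DB at all
TRUE_MATCH_RATE = 0.95      # chance the system finds them if present

expected_false_hits = FMR * GALLERY_SIZE              # = 10 per search
expected_true_hits = P_SUSPECT_IN_GALLERY * TRUE_MATCH_RATE  # = 0.475

# Share of expected hits that are actually the suspect:
ppv = expected_true_hits / (expected_true_hits + expected_false_hits)
print(f"expected false candidates per search: {expected_false_hits:.0f}")
print(f"chance a returned candidate is the true suspect: {ppv:.1%}")
```

Under these assumptions, a returned candidate is the true suspect less than five percent of the time. That is why a match should function as an investigative lead to be corroborated, not as probable cause, and why the demographic disparities documented by NIST make the false leads fall unevenly.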

AI in Legal Practice and the Hallucination Problem

The concern about AI in the legal system extends beyond policing and courts to the legal profession itself. Lawyers have adopted AI tools for legal research, contract analysis, document drafting, and case strategy. These tools offer genuine efficiency gains but carry a risk that became publicly visible in 2023 when a case before the Southern District of New York brought the failure mode of generative AI into sharp relief.

In the case involving the airline Avianca and a passenger named Roberto Mata, a lawyer named Steven Schwartz submitted a brief containing citations to multiple court decisions. The opposing counsel and eventually the judge noticed that the cited cases appeared to be fabrications. When confronted, Schwartz acknowledged that he had used ChatGPT to assist in research and that he had not independently verified the cases. The AI had generated plausible-sounding but entirely fictional legal citations, complete with case names, docket numbers, and judicial quotations that had never existed.

This phenomenon, which researchers call hallucination, is not a bug in the implementation of a particular tool. It is a structural feature of how large language models generate text. These models produce outputs that are statistically likely given their training data, not outputs that are verified against reality. A language model has no mechanism to distinguish between a real case it learned from training data and a case it invents that follows the same surface patterns. When asked to provide legal research, it will produce text that looks like legal research, because it has been trained on vast quantities of legal text, regardless of whether the specific citations are real.

The Mata v. Avianca incident prompted immediate responses from the courts. Individual federal judges in several districts began issuing standing orders requiring attorneys to certify that any AI-generated content in their filings had been reviewed and verified by a licensed attorney. The New York City Bar Association issued guidance on the responsible use of AI in legal practice. State bar associations began developing their own ethical frameworks.
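
Verification of the kind these orders contemplate is mechanically simple, which is part of why the failures are so avoidable. The sketch below extracts citation-like strings from a draft and flags any that cannot be found in a trusted source. The hard-coded KNOWN_CITATIONS set and the naive regex are stand-ins for a real citator or court-records lookup and proper citation parsing.

```python
import re

# Stand-in for a real citator / court-records lookup; in practice this
# would be a query against an authoritative database, not a literal set.
KNOWN_CITATIONS = {"881 N.W.2d 749", "137 S. Ct. 2290"}

# Naive reporter-citation pattern: volume, reporter abbreviation(s), page.
# Real citation parsing is far messier; this is illustrative only.
CITE_RE = re.compile(r"\b\d{1,4}\s+(?:[A-Z][A-Za-z0-9]*\.[A-Za-z0-9.]*\s?)+\d{1,5}\b")

def flag_unverified(draft: str) -> list[str]:
    """Return citation-like strings that could not be verified."""
    return [c for c in CITE_RE.findall(draft)
            if re.sub(r"\s+", " ", c) not in KNOWN_CITATIONS]

draft = ("As held in Loomis v. Wisconsin, 881 N.W.2d 749, and in "
         "Varghese v. China S. Airlines, 925 F.3d 1339, ...")
print(flag_unverified(draft))  # ['925 F.3d 1339']: flag for human review
```

Varghese was one of the fabricated citations in the actual Avianca brief. A check this crude would have caught it; the standing orders essentially demand that some human or tool perform this step before filing.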

These responses are reasonable first steps, but they do not address the underlying structural problem: lawyers, like business owners, face asymmetric incentive structures when it comes to AI adoption. The productivity gains from using AI tools for legal research are real and immediate. The risks are latent and realized only when something goes wrong. The temptation to use AI tools without adequate verification is strong, particularly for solo practitioners and small firms under cost pressure, and the consequences of getting it wrong can be severe for their clients.

The hallucination problem in legal AI is also not limited to case citations. AI tools have generated contract clauses that do not reflect the parties' actual agreement, legal opinions that mischaracterize the law of a jurisdiction, and due diligence reports that omit material information. In each case, the output looks authoritative. The attorney who does not read carefully, or who trusts the tool more than they should, may not catch the error before it causes harm.

Due Process and the Right to Understand Your Score

The due process implications of algorithmic decision-making in the legal system raise questions that existing constitutional doctrine is poorly equipped to answer. The Fifth and Fourteenth Amendments guarantee that the government may not deprive any person of life, liberty, or property without due process of law. What due process requires in any given context depends on the nature of the interest at stake, the risk of erroneous deprivation, and the governmental interest in the challenged procedure. The Supreme Court articulated this balancing test in Mathews v. Eldridge (1976), a case about Social Security disability benefits.

Applied to algorithmic sentencing or bail determinations, the Mathews balancing test produces a strong argument for disclosure. The liberty interest at stake is as significant as it gets: prison time, pretrial detention, conditions of release. The risk of erroneous deprivation is real and documented. The governmental interest in keeping the algorithm secret is primarily commercial, protecting the intellectual property of the tool's vendor, which is not the type of governmental interest the due process clause was designed to protect.

Cahn (2022) argued that this analysis leads to a clear constitutional conclusion: defendants have a due process right to meaningful access to the factors and logic that produced an algorithmic score used against them. Courts have been reluctant to reach this conclusion, in part because doing so would effectively require vendors to disclose proprietary algorithms. The courts have instead engaged in a kind of procedural sleight of hand, holding that the score can be used as long as it is not the sole basis for the decision, without confronting the disclosure question directly.

The AI Now Institute (2018) and other organizations have proposed a different approach: a presumption against the use of any algorithmic tool in high-stakes legal proceedings unless it has been independently validated for accuracy, tested for demographic disparities, and made available for adversarial examination. This is a higher bar than current practice requires, but it is arguably the only bar consistent with the constitutional values at stake.

The Algorithmic Accountability Act of 2022, reintroduced by Senator Ron Wyden (S. 3572, 117th Congress), would require companies deploying automated decision systems in consequential contexts, including criminal justice, to conduct impact assessments evaluating accuracy, fairness, and potential for bias. The bill had not passed as of 2026. Even if it had, its application to government-operated systems or government procurement of algorithmic tools would require interpretation. The legislative response to algorithmic accountability in criminal justice has lagged significantly behind the deployment of the technology.

Toward Accountable Algorithmic Justice

The path toward a legal system that uses AI responsibly is not a path that rejects AI entirely. Algorithmic tools can improve consistency in decision-making, reduce the idiosyncratic variation that produces sentencing disparities across judges, and help understaffed probation and parole offices allocate limited resources. The critique of algorithmic justice is not that prediction has no place in the system. The critique is that prediction is being used without adequate transparency, validation, or accountability, and that the populations bearing the costs of that inadequacy are those who were already most harmed by the pre-algorithmic system.

Several principles emerge from the literature that could form the basis of a more accountable framework. First, algorithmic tools used in high-stakes legal proceedings should be subject to mandatory independent validation before deployment, and that validation should include disaggregated accuracy testing across demographic groups. A tool that performs adequately on average but performs poorly on Black defendants is not an adequate tool.

Second, defendants and their counsel must have meaningful access to the factors and logic underlying any algorithmic score used against them. This does not necessarily require full disclosure of proprietary source code, but it does require, at minimum, disclosure of the variables used, the weighting applied, the demographic distribution of scores in the relevant population, and the validated error rates. Meaningful due process requires meaningful information.
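
What that minimum disclosure could look like in practice is not mysterious. The sketch below renders it as a machine-readable record; the field names and values are hypothetical illustrations, not an existing standard or any vendor's actual figures.

```python
# A hypothetical machine-readable disclosure covering the minimum items
# listed above. Field names and values are illustrative only.
from dataclasses import dataclass

@dataclass
class AlgorithmicToolDisclosure:
    tool_name: str
    version: str
    input_variables: list[str]          # every variable the score uses
    variable_weights: dict[str, float]  # coefficients or importances
    score_distribution_by_group: dict[str, dict[str, float]]
    false_positive_rate_by_group: dict[str, float]
    false_negative_rate_by_group: dict[str, float]
    validation_study: str               # citation to independent validation

disclosure = AlgorithmicToolDisclosure(
    tool_name="ExampleRiskTool",        # hypothetical tool
    version="2.1",
    input_variables=["age", "prior_arrests"],
    variable_weights={"age": -0.04, "prior_arrests": 0.35},
    score_distribution_by_group={"group_A": {"high": 0.45, "low": 0.55},
                                 "group_B": {"high": 0.23, "low": 0.77}},
    false_positive_rate_by_group={"group_A": 0.27, "group_B": 0.06},
    false_negative_rate_by_group={"group_A": 0.14, "group_B": 0.50},
    validation_study="independent validation report filed with the court",
)
```

Nothing in such a record requires disclosing source code; it exposes the inputs, weights, and error rates that adversarial examination actually needs.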

Third, algorithmic outputs should be clearly labeled as statistical estimates with documented error rates, not as findings of fact. The presentation of a COMPAS score as a piece of evidence, without mandatory disclosure of its limitations and margins of error, invites misunderstanding by judges, juries, and defendants who may not have the statistical literacy to evaluate it appropriately.

Fourth, the feedback loops that allow biased historical data to perpetuate and amplify themselves in predictive models need to be broken by regular auditing. Models trained on data generated by a biased system will reproduce that bias. The only way to interrupt this cycle is to scrutinize training data, monitor outcomes, and adjust or discontinue tools that demonstrably produce disparate and unjust outcomes.
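
Such an audit can be as simple as a recurring comparison of flag rates across groups against a tolerance, with continued deployment contingent on passing. The sketch below uses an illustrative 0.8 tolerance, echoing the "four-fifths" rule of thumb from employment law; the threshold choice is itself a policy decision, not a technical one.

```python
# A minimal recurring outcome audit: fail if any group's high-risk flag
# rate falls too far below the most-flagged group's rate. The 0.8
# tolerance is an illustrative benchmark, not a legal standard here.
def audit(flag_rates: dict[str, float], tolerance: float = 0.8) -> bool:
    """Pass only if every group's flag rate is within `tolerance`
    of the most-flagged group's rate."""
    worst = max(flag_rates.values())
    return all(rate / worst >= tolerance for rate in flag_rates.values())

# Hypothetical audit input: 45% vs. 23% flag rates -> ratio 0.51 -> fail.
print(audit({"group_A": 0.45, "group_B": 0.23}))  # False
```

A failed audit should trigger scrutiny of the training data and, if the disparity cannot be explained and remedied, discontinuation of the tool.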

Fifth, the legal profession needs clearer, enforceable standards for the use of AI-generated content in legal practice. The Mata v. Avianca incident was a visible failure because it involved a federal court filing. Quieter failures, involving AI-generated contract language, research memos, or client advice that is not verified before it is acted upon, are likely more numerous and collectively more harmful.

The question animating all of these concerns is what justice means in a system where consequential decisions about liberty are increasingly influenced by algorithms that the affected individuals cannot see, challenge, or understand. The answer, in a society that claims to value due process and equal protection, should be that such a system is not acceptable as currently operated. The technology is not the problem. The lack of accountability around it is. That accountability deficit is a policy choice, and it can be changed.

Key Cases and Legislative Actions Referenced

Loomis v. Wisconsin (2016): Wisconsin Supreme Court upheld the use of COMPAS risk score in sentencing, holding it was one factor among many and did not violate due process.

Mata v. Avianca, SDNY (2023): Attorney Steven Schwartz submitted AI-generated fictitious case citations; court issued sanctions and prompted nationwide court orders on AI use in filings.

Algorithmic Accountability Act (S. 3572, 117th Congress, 2022): Wyden bill requiring impact assessments for automated decision systems in high-stakes contexts; not yet passed as of 2026.

References

  1. Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). Machine Bias. ProPublica. May 23, 2016.
  2. Dressel, J., & Farid, H. (2018). The accuracy, fairness, and limits of predicting recidivism. Science Advances, 4(1), eaao5580.
  3. Loomis v. Wisconsin, 881 N.W.2d 749 (Wis. 2016). Cert. denied, 137 S.Ct. 2290 (2017).
  4. Mayson, S.G. (2019). Bias in, bias out. Yale Law Journal, 128(8), 2218–2300.
  5. Garvie, C., Bedoya, A., & Frankle, J. (2016). The Perpetual Line-Up: Unregulated Police Face Recognition in America. Georgetown Law Center on Privacy and Technology.
  6. National Institute of Standards and Technology (2019). NISTIR 8280: Face Recognition Vendor Test (FRVT) Part 3: Demographic Effects. U.S. Department of Commerce.
  7. AI Now Institute (2018). AI Now 2018 Report. AI Now Institute, New York University.
  8. Berk, R.A. (2017). An impact assessment of machine learning risk forecasts on parole board decisions and recidivism. Journal of Experimental Criminology, 13(2), 193–216.
  9. Cahn, A.F. (2022). The Dangers of AI in Criminal Justice. American Civil Liberties Union.
  10. Wyden, R. (2022). Algorithmic Accountability Act of 2022, S. 3572, 117th Congress.
  11. Mata v. Avianca, No. 22-cv-1461 (PKC) (S.D.N.Y. 2023). Opinion and Order re: sanctions, June 22, 2023.
