
Federal agencies’ growing use of AI raises questions about how judges can adequately review those agencies’ decisions.
The Arkansas Department of Human Services hired a software company in 2016 to build an algorithm to automate the process of assessing disabled patients’ needs. The program failed spectacularly. For example, the algorithm reduced an amputee’s level of home care because the patient had no “foot problems.” Overall, the program negatively affected nearly half of Arkansas Medicaid recipients.
Artificial intelligence (AI) is now used widely in the federal government. According to the U.S. Government Accountability Office, use cases of AI in the federal government doubled—and for generative AI, increased nine-fold—from 2023 to 2024. With this growing use of AI, the risk of systematic errors with potentially harmful consequences, such as those in Arkansas, may increase. Similar issues have already occurred at the state level in Idaho, Texas, Michigan, and Wisconsin, as well as at the federal level at the Department of Homeland Security.
The Trump Administration has reportedly signaled its intent to introduce AI into the rulemaking process. It is unclear whether and how agencies might rely on AI-generated information to inform regulatory decisions. But this development raises key administrative law questions—particularly as horror stories like those in Arkansas become more common.
Legal scholars have long recognized the challenges judges face when enforcing the Administrative Procedure Act’s bar on “arbitrary and capricious” decision-making—which requires that a government agency engage in reasoned decision-making—but these scholars have assumed that the decisionmakers under review are humans. Issues raised by AI may exceed the limits of existing administrative law doctrines. This creates a legal problem under the Administrative Procedure Act, which compels courts to hold unlawful and set aside agency action found to be arbitrary and capricious.
How can judges properly review agencies’ regulatory decision-making that relies on AI? Observers cannot always determine how an AI model made its decision, the so-called “black box” nature of AI. Even when users prompt an AI model to explain its rationale, there is no guarantee that the reason the model produces is the actual reason the model used. Under the Administrative Procedure Act, judges must set aside arbitrary and capricious decision-making, but the black box issue can make it difficult, if not impossible, for reviewing judges to assess the reasonableness of decisions made using AI. This raises concerns similar to the problem of pretext that administrative law already recognizes, since both implicate the distinction between the stated and actual reason for a decision.
But two U.S. Supreme Court principles for applying the arbitrary and capricious standard are particularly relevant to agencies’ use of AI. The Supreme Court has warned agencies that they may not rely on “improper factors,” meaning factors that the U.S. Congress “has not intended the agency to consider,” and that agencies cannot “entirely fail to consider an important aspect” of a given regulatory problem. Reviewing judges use these principles to promote reasoned decisionmaking.
Given the black box problem, an agency has no way of knowing exactly what an AI did or did not consider. If an AI system considers an improper factor and an agency relies on that recommendation, that means the agency will also have relied on that improper factor. Put differently, AI prevents an agency from knowing whether its regulatory decisionmaking fulfills the goals of reasonableness sought under the Administrative Procedure Act.
For example, Amazon attempted to build an AI model to select top job applicants from submitted resumes. The model, however, started displaying a gender bias, to the point of penalizing any resume that merely included the word “women’s,” because it was trained on a disproportionate number of resumes from men. The team tried to salvage the project by neutralizing the bias but ultimately scrapped the effort because it could not guarantee that the model would be free of bias.
If an agency were to rely on such a model to make decisions, it would have relied on improper factors. When, for example, the U.S. Food and Drug Administration evaluates a new drug using an AI model, the agency cannot determine which factors a model is considering or how heavily the model is weighing different factors.
Yet the existing literature on administrative law fails to address a larger problem: It is difficult to ensure the integrity of a model’s training data when the AI product is sourced from a third party, because commercial licensing terms often shield the tool’s architecture and training data, placing them under the control of the vendor rather than the government. This outsourcing is common: A study from the Government Accountability Office found that “over half of the selected agencies are primarily procuring generative AI product and services in lieu of internal development.”
Because AI systems procured by government agencies are often built upon third-party, commercial training data, the problem that the previous literature has identified is vastly understated. The use of such training data introduces the possibility that these AI systems are making decisions based on factors that Congress did not intend for an agency to consider, with those factors inadvertently bleeding in through sources such as social media posts and online browsing behavior. Social media posts and browsing behavior are especially likely to be captured in third-party training data because modern large language models are trained on massive datasets that draw heavily from forums, comment sections, and other forms of user-generated content. ChatGPT, for example, is trained in part using data from Reddit.
These data sources are valuable because of their size, but their substance can be problematic and lead to the potential consideration of improper factors. Social media posts frequently contain content that, if considered by a human decision-maker, would be clearly impermissible as a basis for decisionmaking, such as racist remarks. Yet such remarks, by virtue of their prevalence on the internet, become embedded in training data and statistically reinforced during model training. By using that same model to inform regulatory decisionmaking, an agency would be, in some respect, considering an improper factor: in this instance, race.
This reality of third-party data changes the nature of the risk. The concern is no longer only that a model infers information about improper factors from proxy variables, but that the model actually uses concrete information about improper factors, obtained from another data source, to influence government decisionmaking. These outside factors could even include constitutionally protected characteristics, such as race or gender, as occurred when Amazon’s AI model displayed gender bias, raising the possibility of violations of the Constitution’s Equal Protection Clause in addition to the Administrative Procedure Act.
Still, courts face a similar black box problem with regard to human decision-makers and have found a workaround. Indeed, the possibility always exists that agencies will construct a legally permissible rationale for a given decision that was actually reached through consideration of improper factors. For instance, the U.S. Environmental Protection Agency may consider economic costs in setting the National Ambient Air Quality Standards, whether consciously or unconsciously, even though the Clean Air Act prohibits the agency from doing so. Confronted with this challenge, courts have instead focused on policing the procedures that agencies rely on as a kind of proxy for ensuring reason-based decisionmaking. But should this type of functional fix be applied to reviewing the use of AI in decisionmaking as well?
Several distinguishing features of AI weigh against this approach. Courts rely on this workaround with human decision-makers out of necessity—there is no avoiding human imperfection in agency decision-making. The same cannot be said of AI, however.
Moreover, several other institutional safeguards exist that are designed to promote forthrightness among agency decision-makers, including that agency staff engaged in decision-making swear an oath of office. Additional institutional safeguards include civil service and whistleblower protections for agency staff intended to permit them to give their candid thoughts, as well as transparency laws, such as the Freedom of Information Act, which also seek to discourage agencies from relying on misrepresentations in their work. Again, these conditions do not apply to AI.
Still, the questions AI poses are not unprecedented. The issues that arise when agencies outsource government functions to third parties offer analogies that may be useful for thinking through the challenges of relying on AI in agency decision-making. Literature in law and political science explores these questions and may offer insight on how to restore the principles of accountable and evidence-based decisionmaking that administrative law seeks to promote.



