
May 2026

Executive Judgment

John Koblinsky


Why Human Oversight of AI Fails: The Judgment Gap


Everyone's producing more. Someone still has to decide what to do with it.

The standard solution to AI risk — putting experienced people in the review chain — assumes those people are cognitively equipped to push back on AI outputs. Research shows the assumption fails. John Koblinsky at Marsh Island Group identifies how AI deployment actively erodes the judgment capacity of designated reviewers, often without their awareness, producing oversight that looks functional while quietly ceasing to work.

The Case for Human Oversight — and Where It Breaks

According to Deloitte's 2026 Global Human Capital Trends survey, sixty percent of executives now regularly use AI to support their decisions. The standard organizational response is intuitive: keep humans in the loop. Designate your researchers, senior analysts, and subject-matter experts as the review layer. The assumption follows — experienced people will catch what AI gets wrong.

The instinct is right. But the deployment that created the need for oversight is the same force degrading the capacity of the people designated to provide it. This is the judgment gap that Marsh Island Group's work identifies: AI rollout scaled output volume without scaling the human readiness to evaluate it.

Why the AI Oversight Layer Is Failing

Three research-documented mechanisms explain how the oversight layer breaks down. Each operates independently. Together they make oversight failure nearly structural.

AI Use Accelerates Skill Decay Without Practitioners Realizing It

Brooke Macnamara and colleagues, writing in Cognitive Research: Principles and Implications in 2024, found that "AI assistance accelerates skill decay in ways performers don't perceive." The degradation happens at the level of cognitive skill engagement, not task engagement. A researcher using AI to analyze transcripts still looks busy and feels capable — while the underlying capacity to catch errors quietly erodes.

The authors describe skill decay that "operates outside the performer's awareness because the disuse is only at the level of cognitive skill engagement, not with engagement with the task." In practical terms: the people designated as the check on AI output may feel fully capable while losing the very capacity that made their oversight meaningful.

This is the mechanism that makes AI oversight failure invisible to the organization. The reviewer doesn't know what they've lost. The manager doesn't know to look. The work continues to flow through a review layer that is no longer doing the work of review.

Moderate AI Experience Produces Peak Overconfidence in Reviewers

Michael Horowitz and Lauren Kahn, in a 2024 study in International Studies Quarterly, found that automation bias — the tendency to defer to AI recommendations without sufficient scrutiny — follows an inverted-U curve across AI experience levels. Minimal exposure produces healthy skepticism. Deep expertise allows practitioners to recognize failure modes. Moderate exposure, where most organizational users currently sit, produces peak overconfidence.

The subject-matter experts most likely to be put in the review chain are, right now, statistically the most likely to trust AI output without pushing back. They have enough familiarity to feel confident — and not enough depth to catch systematic errors. This is the danger zone Horowitz and Kahn identify: too experienced to be skeptical, not experienced enough to know what they're missing.

The good news, embedded in this finding, is that deep expertise does reduce automation bias. The danger is not permanent — it's stage-specific. The problem is that most organizations are deploying at exactly the stage where the risk peaks.

Building Experience Is What Makes the Check Valid

The cognitive map required for effective oversight is not built by deploying AI tools. It is built by constructing them — iterating through failures, learning the failure's anatomy, developing the engineering intuition that lives underneath the interface.

John Koblinsky tested this directly at Marsh Island Group. After six months of building a messaging analysis agent — learning to chunk unstructured transcript data, structure semantic coding, and iterate through failures — he ran Dovetail, a vendor platform marketed for this kind of work, against 25 hour-long interview transcripts. Dovetail returned a confident answer. It was objectively wrong.

Koblinsky caught it because prior building work had made the task's difficulty legible: he knew what it took to get usable output from even ten transcripts, and he had been in the room for all 25 interviews. Vendor deployment doesn't build that. When organizations move teams directly from no AI tool to enterprise AI platform, they may also skip the developmental work that makes oversight valid.

Institutional Review Often Confirms Rather Than Catches Error

Saar Alon-Barkat and Madalina Busuioc, in a 2022 study in the Journal of Public Administration Research and Theory, found that human review of AI outputs is often not neutral. They call it selective adherence: "using the AI's recommendation as permission to confirm an existing assumption." The review layer amplifies existing bias rather than catching error.

A compounding constraint tightens the problem further. Aruna Ranganathan and Xingqi Maggie Ye, writing in Harvard Business Review in February 2026, found that AI availability leads workers to expand scope and hours rather than reduce them. The time that should go to careful review is going to more work instead. The oversight layer is stretched thin even where cognitive capacity remains intact.

What This Means for Knowledge Workers and the Leaders Above Them

Output got cheap. AI tools scaled the supply side of organizational work — more analyses, more recommendations, more content — without scaling the judgment required to evaluate what was produced. Marsh Island Group's analysis identifies this as the defining gap: the ratio of volume to informed review keeps widening as deployment accelerates.

Deloitte's 2026 Human Capital Trends research named the right question: "The moment AI enters the workflow, the real question isn't 'What does the model say?' It's 'Who gets to disagree with it, and how fast?'" Most practitioners and most teams cannot answer that cleanly. The judgment to push back is built before the tool arrives. Deployment assumes it's already there.

For knowledge workers, the choice is a fork, not a ramp. Either dive deeper into how the tools work — building the engineering intuition that makes oversight meaningful — or produce less and make what you produce worth deciding on. Certification and AI literacy training have value, but they teach categories. Building teaches specificity: which outputs to trust, at what scale, and where a confident AI answer is actually just a well-formatted summary.

For executives who approved these tool deployments, the parallel shift is learning to ask for less and mean it. The volume of AI-generated output your teams now produce is not a signal of productivity. It may be a signal that the judgment layer has been overwhelmed — and that the check you believe is working has quietly stopped doing so.

Frequently Asked Questions

Why does putting experienced people in the AI review chain often fail to catch errors?

Because AI deployment itself erodes the cognitive capacity of the designated reviewers. Research by Brooke Macnamara and colleagues (2024) found that AI assistance accelerates skill decay without practitioners' awareness — degrading the ability to catch AI errors while the person still feels fully capable of performing the review. The degradation is invisible to both the reviewer and the organization.

How does AI assistance cause skill decay without practitioners realizing it?

Macnamara et al. (2024) found that skill decay under AI use "operates outside the performer's awareness because the disuse is only at the level of cognitive skill engagement, not with engagement with the task." Workers remain busy and feel capable while the underlying verification capacity quietly erodes. The task looks the same from the outside; the cognitive work required to do it well has diminished.

What is automation bias, and why are moderate AI users most at risk?

Automation bias is the tendency to defer to AI recommendations without sufficient scrutiny. Horowitz and Kahn (2024) found it follows an inverted-U curve: minimal exposure produces healthy skepticism, deep expertise reduces deference, but moderate exposure — where most organizational users currently sit — produces peak overconfidence. The people most likely to be put in the review chain are statistically the most likely to trust AI output without questioning it.

What is selective adherence in AI-assisted decision making?

Selective adherence, documented by Alon-Barkat and Busuioc (2022), describes the pattern in which human reviewers use AI recommendations as permission to confirm existing assumptions rather than challenging them. Instead of catching AI errors, the review process amplifies pre-existing bias — making the oversight layer a source of laundered bias rather than independent verification.

How can knowledge workers maintain judgment capacity when using AI tools?

The most reliable path is building experience with how the tools work at an engineering level — not just using vendor platforms, but understanding where they fail and why. Practitioners who have constructed similar systems from scratch know what a confidently wrong answer looks like at the output layer. Certification and AI literacy training teach categories. Hands-on construction teaches specificity.

Does AI availability actually increase work hours rather than reduce them?

Yes. Ranganathan and Ye, writing in Harvard Business Review in February 2026, found that workers with AI access expanded their scope and worked longer hours rather than reducing effort. This creates a structural problem for oversight: the people designated to review AI outputs are using their available capacity to take on more work, not to review more carefully.

How should executives change how they commission AI-assisted work?

Start by asking not how much output AI enables, but whether the people reviewing that output are equipped to catch its errors. Executives need to ask less of their teams in terms of volume — and more in terms of depth of engagement. Output volume is no longer a proxy for productivity. It may be a proxy for an overwhelmed judgment layer.

Are AI literacy training programs sufficient to close the judgment gap?

Training programs have value but don't build what meaningful AI oversight requires. Certification teaches general categories of AI failure. Building something from scratch, failing at it, and understanding the failure's anatomy teaches specificity: which outputs to trust, at what scale, and where a confident AI answer is actually just a well-formatted summary. The two are not interchangeable.
