Published 2026-05-04

9 min read

Risk, accountability, and failure modes

Part 2 of 3. Faster generation changes how quickly code is produced. It does not move responsibility onto the tool. Volume, vendor dependency, incidents, and cost still need a named owner inside your team.

TL;DR

  • Code that runs locally is not proof that it will work in production. Load, integration, legacy constraints, and long-running operation expose problems that isolated generation cannot.
  • Hosted models are vendors like any other. Plan for availability, data handling, retention, jurisdictions, and a fallback when inference is unavailable.
  • Recent public incidents and audit data point at governance, change control, and review capacity as the limiting factors. The model itself is rarely the issue.
  • AI spend belongs on the same budget sheet as other infrastructure. Unbounded agent use can approach senior engineering cost.

Corsair Media Group

Faster generation does not move responsibility

Adding AI to a delivery process changes the speed at which code is produced. It does not change who is responsible for that code. Generative AI still requires a named technical owner, whether the tool is producing documentation, an exploratory prototype, or a routine refactor inside known patterns, and especially in any situation where a reviewer might be tempted to treat code that runs locally as ready to ship.

A common failure mode is treating generated output as safe because it compiles or runs locally. Correctness in production depends on context that is not present during isolated generation. That context includes load, legacy constraints, integration behavior, and operational edge cases. The gap becomes more visible under real traffic and in long-running systems, where small inconsistencies add up into larger problems.

When the volume of output exceeds review capacity

Security and reliability risks grow with the volume of output. High-throughput generation can produce diffs larger than the team can practically review. That increases the chance that subtle defects reach production.

Consider a sprint that lands twenty thousand lines of code with only light review. Reviewers face a diff so large that serious flaws can hide inside ordinary-looking sections on any screen they open. The model produces code faster than people can read it, and that is where load-dependent and security-sensitive logic typically goes unnoticed.

Independent 2026 audits underline the bottleneck. Veracode reported forty-five percent of sampled AI-assisted code carrying common categories of web risk aligned to the OWASP Top 10 family. Related work flagged thirty-one percent as plainly exploitable. Commentary on loosely reviewed AI-heavy repos cites up to ninety-two percent carrying at least one critical finding. Security teams have reported spending more time vetting model output than fixing conventional bugs.

Third-party scan headlines, early 2026

Rounded to match percentages often cited in Veracode and partner studies. Use them as a reason to verify your own setup rather than as a final ruling.

  • AI-heavy samples tagged with issues similar to the OWASP Top 10: ~45%
  • Audited subsets called "straight-up exploitable": ~31%
  • Loosely reviewed AI-heavy repos allegedly carrying at least one critical finding: ~92%
  • AI-authored changes allegedly needing manual production debugging afterward: ~43%

Human and model pairing still works when the engineer who merges the code accepts responsibility for the production impact, and when each change is small enough to read end to end before it ships.
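The "small enough to read end to end" rule can be enforced mechanically. Below is a minimal sketch of a CI-style check that parses `git diff --numstat` output and refuses changes above a review budget; the 400-line budget is a hypothetical number, not a recommendation from any study cited here.

```python
MAX_CHANGED_LINES = 400  # hypothetical review budget, tune per team

def changed_lines(numstat: str) -> int:
    """Sum added plus removed lines from `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        if not line.strip():
            continue
        added, removed, _path = line.split("\t", 2)
        if added != "-":      # binary files are reported as "-"
            total += int(added)
        if removed != "-":
            total += int(removed)
    return total

def within_review_budget(numstat: str, budget: int = MAX_CHANGED_LINES) -> bool:
    """True when the diff is small enough to be read end to end."""
    return changed_lines(numstat) <= budget
```

In practice the check would run in CI against the output of `git diff --numstat origin/main` and fail the build with a message asking the author to split the change.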

Treat hosted models like any other vendor

Anything you send to a hosted model crosses another company's infrastructure. That traffic can include prompt history, repository snippets, internal identifiers, and follow-up messages in the same session. Treat the path the same as any other vendor that handles confidential data. Decide explicitly whether content may leave your network, how long the provider may retain it, in which countries it may be stored, and how you will respond to a breach. Apply the same standard you would apply to a hosted continuous integration service that clones your repositories.
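The "may this content leave the network" decision can be made explicit in code rather than left to habit. A minimal sketch of a pre-send gate follows; the deny-list patterns are hypothetical stand-ins, and a real deployment would use a proper secrets scanner rather than three regexes.

```python
import re

# Hypothetical patterns for content that must not leave the network.
BLOCKED_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                     # AWS access key id shape
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),   # PEM private key header
    re.compile(r"\bCUST-\d{6}\b"),                       # made-up internal customer id
]

def safe_to_send(prompt: str) -> bool:
    """Return False if the prompt matches any blocked pattern."""
    return not any(p.search(prompt) for p in BLOCKED_PATTERNS)
```

A gate like this sits in front of every hosted-model call, so the retention and jurisdiction questions only ever apply to content that was cleared to leave.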

Hosted models and their APIs fail like any other dependency. If inference is required to produce a production fix, then keep a fallback path that does not assume the service will be available.
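That fallback path can be a small, boring wrapper. The sketch below retries a hosted call a bounded number of times and then degrades to a path that does not depend on the vendor; `hosted` and `fallback` are hypothetical callables, where the fallback might return a cached answer, route to a smaller local model, or point at a manual runbook.

```python
import time

def call_with_fallback(hosted, fallback, attempts: int = 2, backoff_s: float = 0.5):
    """Try the hosted inference path a bounded number of times,
    then degrade deliberately instead of blocking a production fix."""
    last_err = None
    for i in range(attempts):
        try:
            return hosted()
        except Exception as err:  # timeouts, network errors, 5xx wrappers
            last_err = err
            time.sleep(backoff_s * (2 ** i))  # simple exponential backoff
    return fallback(last_err)
```

The point is not the retry logic; it is that the fallback branch exists and is exercised before the day the vendor is down.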

For a deeper discussion of data access risk under broad AI integration, including what the system can retrieve, which regulatory obligations apply, and what vendor retention means for sensitive content, see AI full data access risk.

Production incidents and literal interpretation

Once revenue depends on a system and there are years of decisions inside it, a full rewrite is no longer realistic, so failures have to be handled in place. Recent public failures look less like random hallucination and more like literal execution of dangerous instructions. Replit's AI agent reportedly deleted a production database holding over a thousand executive records after being told not to touch live data. Google's Gemini CLI deleted user files while "organizing" folders that did not exist. The pattern is faithful execution of dangerous prompts, not a model acting on its own.

Those failures are governance and change-control problems as much as they are model problems. In March 2026, Amazon publicly attributed outages to AI-assisted changes that lacked safeguards. The result was hours of downtime, six-figure missed orders in one incident, and, in another widely cited case, roughly $6.3 million in losses. Separate 2026 analyses still cite about 45 percent of sampled AI-heavy code carrying OWASP-class issues and about 43 percent of AI-guided changes requiring manual debugging after deployment. Larger models do not remove that risk on their own.

The business damage often lasts beyond the outage itself. In widely reported cases over the past few years, critical production datasets and their working backups have disappeared together for paying customers, and the restore procedure failed when operators tried it under pressure. Customers and procurement teams often accelerate replacement once confidence is gone, even when the vendor promises a fix.

Models recognize patterns. They do not have access to your operational history, your business model, or the implicit rules your team built up across releases. Output can look correct on screen and still fail under real traffic, an audit, or a coordinated rollout. Larger models give the team a wider design and debugging space, but turning a plausible branch into code that you will run for years is still human work.

Cost belongs on the same budget sheet

AI use introduces a new infrastructure expense with an ongoing per-use cost. In higher-throughput environments with no limits applied, a senior engineer running agents with large context windows can push API spend toward the fully loaded cost of hiring another senior. Our own usage measurements on long sessions, without using flagship models, came in at about one dollar per minute.

After months of heavy agent use, spend often lands near two hundred to six hundred dollars per developer per month once you leave vendor starter tiers and start paying per token. On modest sites and apps we still recorded more than eight hundred dollars in under two weeks for a single developer who was not immersed in the stack full time. A single modest prompt cost about ten dollars; the output was largely usable but still needed edits, and the invoice for that slice sat in the same band as senior time while we were also paying that senior to supervise. Run that habit for a year, and a small shop can approach twenty thousand dollars per engineer when agents are the default on every workflow.
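The arithmetic behind that year-end figure is easy to reproduce. The sketch below uses the one-dollar-per-minute rate measured above with a hypothetical heavy-agent habit; the minutes-per-day figure is an assumption for illustration, not a measurement.

```python
# Rate observed in the text on long sessions; habit figures are assumptions.
cost_per_minute = 1.00
minutes_per_day = 90            # hypothetical heavy-agent habit
working_days_per_month = 21

monthly = cost_per_minute * minutes_per_day * working_days_per_month
yearly = monthly * 12

print(f"monthly per engineer: ${monthly:,.0f}")  # $1,890
print(f"yearly per engineer:  ${yearly:,.0f}")   # $22,680
```

Even with conservative assumptions, the yearly total lands in the same band as the twenty-thousand-dollar figure quoted above.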

Illustrative monthly AI spend per engineer (published ranges vs. our spikes)

Ranges are directional, based on vendor commentary plus invoices we actually saw. They are not GAAP audited.

  • Light experimental use: ~$75 to $120
  • Quoted industry band for habitual agent pipelines: $200 to $600
  • Burst weeks matching our exploratory spend: more than $800 in about two weeks

Engineers who rely heavily on agents during large refactors can push monthly invoices into the thousands when prompts run in long chains without limits. Higher spend and faster commits usually mean more security review, more QA, and rework that a more cautious approach would have avoided.
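Long prompt chains overrun budgets because nothing refuses the next call. A minimal sketch of an explicit guard follows; the monthly limit is a hypothetical number a team would set for itself, and a real system would estimate per-call cost from token counts and vendor pricing.

```python
class SpendBudget:
    """Hypothetical per-engineer monthly budget guard for agent calls."""

    def __init__(self, monthly_limit_usd: float):
        self.limit = monthly_limit_usd
        self.spent = 0.0

    def allow(self, estimated_cost_usd: float) -> bool:
        """Refuse the call up front instead of silently overrunning."""
        return self.spent + estimated_cost_usd <= self.limit

    def record(self, cost_usd: float) -> None:
        """Book the actual cost of a completed call."""
        self.spent += cost_usd
```

The agent harness checks `allow()` before each call and books actuals with `record()`, so an unbounded chain turns into a visible refusal rather than a surprise invoice.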

Accountability cannot be delegated to the tool

Across correctness, security, and cost, the requirement is the same. Accountability cannot be delegated to the tool. A named person inside your team still has to weigh the trade-offs, defend the change in review, and answer for the behavior of the system during an incident.

The next article in this series explains how architecture-first delivery narrows what models are allowed to produce in the first place, so that the failures described above have fewer opportunities to occur.

If this matches your situation, then reach out through our contact page so that we can discuss what to address first in your current delivery process.

If you want to scale AI use without losing review discipline or producing unexpected budget items, then talk with Corsair about your next build.

Contact Corsair