Risk, accountability, and failure modes
Part 2 of 3. Faster generation changes how quickly code is produced. It does not move responsibility onto the tool. Volume, vendor dependency, incidents, and cost still need a named owner inside your team.
TL;DR
- Code that runs locally is not proof that it will work in production. Load, integration, legacy constraints, and long-running operation expose problems that isolated generation cannot anticipate.
- Hosted models are vendors like any other. Plan for availability, data handling, retention, jurisdictions, and a fallback when inference is unavailable.
- Recent public incidents and audit data point at governance, change control, and review capacity as the limiting factors. The model itself is rarely the issue.
- AI spend belongs on the same budget sheet as other infrastructure. Unbounded agent use can approach senior engineering cost.
Faster generation does not move responsibility
Adding AI to a delivery process changes the speed at which code is produced. It does not change who is responsible for that code. Generative AI still requires a named technical owner, whether the tool is producing documentation, an exploratory prototype, a routine refactor inside known patterns, or any other output a reviewer might be tempted to treat as ready to ship because it runs locally.
A common failure mode is treating generated output as safe because it compiles or runs locally. Correctness in production depends on context that is not present during isolated generation. That context includes load, legacy constraints, integration behavior, and operational edge cases. The gap becomes more visible under real traffic and in long-running systems, where small inconsistencies add up into larger problems.
When the volume of output exceeds review capacity
Security and reliability risks grow with the volume of output. High-throughput generation can produce diffs larger than the team can practically review. That increases the chance that subtle defects reach production.
Consider a sprint that lands twenty thousand lines of code with only light review. Reviewers then face a diff so large that serious flaws can hide inside ordinary-looking sections on any screen. The model can produce code faster than people can read it, and that is where load-sensitive and security-critical logic typically goes unnoticed.
Independent 2026 audits underline the bottleneck. Veracode reported that forty-five percent of sampled AI-assisted code carried vulnerability categories from the OWASP Top 10 family. Related work flagged thirty-one percent as plainly exploitable. Commentary on loosely reviewed, AI-heavy repositories cites up to ninety-two percent carrying at least one critical finding. Security teams have reported spending more time vetting model output than fixing conventional bugs.
These figures are rounded to match percentages often cited from Veracode and partner studies. Treat them as a reason to verify your own setup rather than as a final ruling.
Human-and-model pairing still works when the engineer who merges the code accepts responsibility for its production impact, and when each change is small enough to read end to end before it ships.
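One way to make "small enough to read" concrete is a gate in continuous integration that flags changes too large to review in one sitting. The sketch below is illustrative only; the 400-changed-line threshold and the comparison against origin/main are assumptions to adapt, not a recommended standard.

```python
# Minimal sketch of a CI gate that fails a change too large to review end to end.
# The 400-line threshold and the origin/main base are assumptions; tune them.
import subprocess
import sys

MAX_CHANGED_LINES = 400  # assumed reviewable limit, not an industry standard

def changed_lines(base: str = "origin/main") -> int:
    """Count added plus removed lines in the current branch against the base."""
    out = subprocess.run(
        ["git", "diff", "--numstat", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for line in out.splitlines():
        added, removed, _path = line.split("\t", 2)
        if added.isdigit() and removed.isdigit():  # binary files report "-"
            total += int(added) + int(removed)
    return total

if __name__ == "__main__":
    n = changed_lines()
    if n > MAX_CHANGED_LINES:
        print(f"{n} changed lines exceeds the review budget of {MAX_CHANGED_LINES}; split the change.")
        sys.exit(1)
    print(f"{n} changed lines is within the review budget.")
```

A gate like this does not judge quality; it only keeps each merge inside the volume a reviewer can actually read.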
Treat hosted models like any other vendor
Anything you send to a hosted model crosses another company's infrastructure. That traffic can include prompt history, repository snippets, internal identifiers, and follow-up messages in the same session. Treat the path the same as any other vendor that handles confidential data. Decide explicitly whether content may leave your network, how long the provider may retain it, in which countries it may be stored, and how you will respond to a breach. Apply the same standard you would apply to a hosted continuous integration service that clones your repositories.
Hosted models and their APIs fail like any other dependency. If inference is required to produce a production fix, then keep a fallback path that does not assume the service will be available.
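As a concrete illustration, the sketch below wraps a hosted inference call in a timeout and falls back to a path that does not depend on the vendor. The endpoint URL, header, and fallback message are hypothetical placeholders, not any specific provider's API.

```python
# Minimal sketch: call a hosted model with a timeout and keep a fallback path
# that does not assume the vendor is available. Endpoint, headers, and the
# fallback message are hypothetical placeholders, not a real provider's API.
import requests

INFERENCE_URL = "https://api.example-model-vendor.com/v1/complete"  # hypothetical
TIMEOUT_SECONDS = 10

def suggest_fix(prompt: str, api_key: str) -> str:
    """Ask the hosted model for a suggestion, but never block an incident on it."""
    try:
        resp = requests.post(
            INFERENCE_URL,
            json={"prompt": prompt},
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=TIMEOUT_SECONDS,
        )
        resp.raise_for_status()
        return resp.json()["text"]
    except (requests.RequestException, KeyError, ValueError):
        # Vendor down, slow, or returning something unexpected: fall back to the
        # written runbook and a human operator rather than retrying in a loop.
        return "Model unavailable. Follow the manual runbook for this change."
```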
For a deeper discussion of data access risk under broad AI integration, including what the system can retrieve, which regulatory obligations apply, and what vendor retention means for sensitive content, see AI full data access risk.
Production incidents and literal interpretation
Once revenue depends on a system and years of decisions are embedded in it, a full rewrite is no longer realistic, so failures inside it carry lasting cost. Recent public failures look less like random hallucination and more like literal execution of dangerous instructions. Replit's AI agent reportedly deleted a production database holding records for over a thousand executives after being told not to touch live data. Google's Gemini CLI deleted user files while "organizing" them into folders that did not exist. The pattern is faithful execution of dangerous prompts; the model is not acting on its own.
Those failures are governance and change-control problems as much as they are model problems. In March 2026, Amazon publicly attributed outages to AI-assisted changes that lacked safeguards. The result was hours of downtime, six-figure missed orders in one incident, and, in another widely cited case, roughly $6.3 million in losses. Separate 2026 analyses still cite about 45 percent of sampled AI-heavy code carrying OWASP-class issues and about 43 percent of AI-guided changes requiring manual debugging after deployment. Larger models do not remove that risk on their own.
The business damage often lasts beyond the outage itself. In widely reported cases over the past few years, critical production datasets and their working backups have disappeared together for paying customers, and the restore procedure failed when operators tried it under pressure. Customers and procurement teams often accelerate replacement once confidence is gone, even when the vendor promises a fix.
Models recognize patterns. They do not have access to your operational history, your business model, or the implicit rules your team built up across releases. Output can look correct on screen and still fail under real traffic, an audit, or a coordinated rollout. Larger models give the team a wider design and debugging space, but turning a plausible branch into code that you will run for years is still human work.
Cost belongs on the same budget sheet
AI use introduces a new infrastructure expense with an ongoing per-use cost. In higher-throughput environments, a senior engineer running agents with large context windows and no limits can push API spend toward the fully loaded cost of hiring another senior. Our own usage measurements on long sessions, without flagship models, came in at about one dollar per minute.
After months of heavy agent use, per-developer spend often lands near two hundred to six hundred dollars once you leave vendor starter tiers and start paying per token. On modest sites and apps we still recorded more than eight hundred dollars in under two weeks for a single developer who was not even immersed in the stack full time. A single modest prompt cost about ten dollars; the output was largely usable but still needed edits, and the invoice for that slice sat in the same band as senior time while we were also paying that senior to supervise. Run that habit for a year and a small shop can approach twenty thousand dollars per engineer when agents are the default on every workflow.
These ranges are directional, drawn from vendor commentary plus invoices we actually saw. They are not GAAP-audited figures.
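To see how those figures compound, here is a rough back-of-envelope calculation built on the per-minute cost above. The usage assumptions are ours, the band is read as a monthly figure, and none of the inputs come from a vendor price list.

```python
# Back-of-envelope spend estimate built on the roughly one-dollar-per-minute
# figure above. All inputs are illustrative assumptions, not a vendor price list.
COST_PER_MINUTE = 1.00          # observed on long sessions, non-flagship models
WORKING_DAYS_PER_MONTH = 21

def monthly_spend(minutes_per_day: float) -> float:
    return COST_PER_MINUTE * minutes_per_day * WORKING_DAYS_PER_MONTH

moderate = monthly_spend(30)    # ~$630, near the top of the band above
default_on = monthly_spend(80)  # ~$1,680, roughly $20k over a year

print(f"Moderate agent use:   ${moderate:,.0f} per developer per month")
print(f"Agents on every task: ${default_on:,.0f} per month, ~${default_on * 12:,.0f} per year")
```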
Engineers who rely heavily on agents during large refactors can push monthly invoices into the thousands when prompts run in long chains without limits. Higher spend and faster commits usually mean more security review, more QA, and rework that a more cautious approach would have avoided.
Accountability cannot be delegated to the tool
Across correctness, security, and cost, the requirement is the same. Accountability cannot be delegated to the tool. A named person inside your team still has to weigh the trade-offs, defend the change in review, and answer for the behavior of the system during an incident.
The next article in this series explains how architecture-first delivery narrows what models are allowed to produce in the first place, so that the failures described above have fewer opportunities to occur.
If this matches your situation, then reach out through our contact page so that we can discuss what to address first in your current delivery process.
If you want to scale AI use without losing review discipline or producing unexpected budget items, then talk with Corsair about your next build.
Contact Corsair
Continued reading
Keep exploring related topics that connect strategy, implementation, and long-term maintenance.
AI usage in software development: where it helps and where ownership still matters
Teams usually ask about speed first. The more important question is who is responsible for what ships, and how that responsibility is enforced before anything reaches production.
How AI fits into engineering workflows
Part 1 of 3. AI is most useful inside a well-defined engineering process. It supports the work. It does not define the work. The same review standard that applies to human-written code applies to anything a model produces.
Why architecture-first delivery controls AI behavior
Part 3 of 3. Most of the risk does not come from the model itself. It comes from how loosely structured the surrounding system is. Generators and scaffolding narrow what AI can produce before any review takes place.