Technical debt risks from AI-driven development: an evaluation framework for investors and DD practitioners

The proliferation of AI coding tools has generated widespread claims that “engineer productivity has increased 10x.” Yet the flip side of that productivity story, the technical debt risks unique to AI-driven development, is consistently underappreciated.

The conclusion upfront: technical debt generated by vibe coding and AI-driven development differs fundamentally in its formation mechanism from the debt traditional startups accumulate. Three patterns drive the difference: “no one understands the design intent of the code,” “test absence is baked in from the start,” and “attribution has shifted from people to prompt histories.” When these three patterns compound, source code review alone can no longer yield a complete risk assessment. Investors and M&A practitioners who apply a traditional technical DD framework without modification will miss the full picture.

Defining vibe coding: concept and reality

Vibe coding is a term popularized in 2025 by AI researcher Andrej Karpathy to describe a development style in which code generated by LLMs (large language models) is adopted without deep understanding or verification. Karpathy described the experience as “barely reading the code — when an error appears, paste it straight into the LLM and let it fix it.”

The term has since expanded in usage and now carries two overlapping meanings:

Narrow definition (Karpathy’s original): An exploratory, high-speed prototyping style driven by LLM dialogue, where the developer does not grasp code details and moves on once something works.
Broad definition (industry usage): A general style of development that relies heavily on AI coding assistants such as GitHub Copilot, Cursor, or Claude Code, where engineers increasingly play a reviewer role rather than an author role.

For investment and DD purposes, the critical question is which of the two senses describes a given startup’s AI-driven development. When vibe coding in the narrow sense is applied to core production code, the risk profile changes substantially.

Figure 1: Vibe coding types and the difference in technical debt risk

Technical debt patterns unique to AI-driven development

Traditional technical debt formation can be classified into three types: speed-priority trade-offs, knowledge and experience gaps, and lag in adapting to environmental change (see Technical debt in startups: patterns and root causes). AI-driven development introduces additional debt patterns on top of these.

Pattern 1: Design intent hollowing

Traditional technical debt operates under the assumption that “someone wrote this code.” Even when design quality is poor, the person who made the decision exists and the rationale can potentially be recovered through interviews or documentation.

The essential problem with vibe coding debt is that there is no “author” of the code. When LLM-generated code is adopted as-is, no one can explain the design intent. The answer to “why this design?” and “what were the trade-offs with other options?” exists only in a prompt history — if that history was kept at all.

As design intent hollowing progresses, subsequent developers are left in a state of “I don’t understand what this running code means, but touching it might break something.” This is more serious than traditional attribution risk, because the option of “asking someone” simply does not exist.

Pattern 2: Entrenched test absence

AI coding tools excel at code generation but tend not to write tests unless explicitly instructed. Vibe coding’s bias toward “make it work first” means test creation gets deferred — but the problem goes deeper.

In traditional development, “we didn’t have time to write tests” is recognized by developers as a known liability. In AI-driven development, however, an unfounded sense of security — “it was written by AI, so it must be fine” — takes hold, and test absence becomes entrenched without being recognized as a problem.

As AI continues adding changes to a test-free codebase, regression bugs accumulate invisibly. Small teams may paper over this with informal familiarity, but the problem surfaces sharply at scale.

Pattern 3: Prompt-dependent attribution

Traditional attribution risk took the form of “code only a specific engineer understands.” If that engineer is still employed, knowledge can still be retrieved.

AI-driven development changes the form of attribution. Not “who built this product?” but “what prompts built this product?” becomes the critical question — and prompt history is rarely managed organizationally.

When the responsible engineer leaves, the only information remaining for their successor may be “it was built in ChatGPT.” The prompts needed to reproduce the code’s behavior, the context that informed them, and even which model was used — all can be unknown. This is the new attribution risk of the AI era.

Pattern 4: Hidden security vulnerabilities

LLMs occasionally produce code with insufficient security consideration: missing SQL injection protection, improper authentication implementation, hardcoded credentials. These issues are catchable in a traditional code review, but vibe coding’s tendency to skip review creates gaps.

The compounding problem is that generated code often looks correct on the surface. LLM-produced code is syntactically valid, variable names are clean, and comments are present. Its surface quality is high, making it easy for human reviewers to judge “looks fine.” But the soundness of actual security design is a different question from the code’s appearance.
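The gap between surface quality and security design can be illustrated with a minimal sketch. It assumes a hypothetical `users` table in an in-memory SQLite database; the function names are illustrative and not taken from any real codebase. Both functions look equally tidy, but only the second passes user input as a bound parameter.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, is_admin INTEGER)")
    conn.executemany(
        "INSERT INTO users (name, is_admin) VALUES (?, ?)",
        [("alice", 0), ("root", 1)],
    )
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list[str]:
    """Looks clean, but interpolates user input into the SQL text (injectable)."""
    query = f"SELECT name FROM users WHERE name = '{name}'"
    return [row[0] for row in conn.execute(query)]


def find_user_safe(conn: sqlite3.Connection, name: str) -> list[str]:
    """Passes user input as a bound parameter, so it is never parsed as SQL."""
    query = "SELECT name FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


if __name__ == "__main__":
    conn = build_demo_db()
    payload = "nobody' OR '1'='1"
    print("unsafe:", find_user_unsafe(conn, payload))  # every row leaks
    print("safe:  ", find_user_safe(conn, payload))    # no match
```

A reviewer skimming for readability would likely approve both functions, which is why this class of defect calls for a reviewer who checks how inputs reach the query rather than how the code reads.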

Figure 2: Four technical debt patterns unique to AI-driven development

The cascade: review gaps, test gaps, and attribution collapse

These four patterns do not occur in isolation — they cascade and amplify each other. The typical degradation cycle:

Early phase: Vibe coding builds a prototype at high speed. No tests, review skipped.
Expansion phase: Features added via AI on top of running code. No one understands the full codebase, but additions are still possible.
Team growth: New members join but cannot be onboarded because “no one knows the overall design.” Handoff fails.
Problem emergence: Bugs appear but root causes cannot be traced. Fixes break other things. Without tests, the safe boundary of change is unknown.

Once this cascade begins, the cost of recovery escalates with every additional change. Restoring design intent, retrofitting tests, and conducting security audits each require substantial resources, and executing all three while continuing to build product is extremely difficult.

The seven axes of technical due diligence establishes the principle that “evaluating a codebase means evaluating the process, not just the output.” Evaluating AI-driven development makes this focus on process even more central.

Guardrails for healthy AI-driven development

The argument here is not that vibe coding or AI-driven development is inherently bad. The problem is applying it to production without guardrails. Organizations with the right conditions in place realize genuine productivity gains from AI coding tools.

Guardrail 1: Position AI as “draft generation”

AI-generated code needs to be treated as a “proposal” that engineers evaluate and consciously adopt. The accountability standard should be “an engineer reviewed and adopted the AI’s proposal” rather than “the AI wrote it, so it’s correct.”

Requiring engineers in code review to explain “why this implementation was chosen” is a concrete mechanism to prevent mindless adoption of AI-generated code.

Guardrail 2: Test-first (or test-concurrent) discipline

TDD (test-driven development) thinking remains effective in AI-driven development. A flow of “have AI write the tests first, then have AI write the implementation” prevents test absence from becoming entrenched.

AI is also capable of generating test code. The instruction “write unit tests for this function” as a preceding step is fully practical.
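A minimal sketch of the tests-first flow follows. The function `apply_discount` and its rules are hypothetical, chosen only to show the order of work: the test class is written (or generated) first and fixes the expected behavior, and the implementation is added afterwards to satisfy it.

```python
import unittest


# Step 1: tests are written first and define the expected behavior.
class ApplyDiscountTest(unittest.TestCase):
    def test_regular_discount(self):
        self.assertEqual(apply_discount(1000, 10), 900)

    def test_zero_discount_keeps_price(self):
        self.assertEqual(apply_discount(1000, 0), 1000)

    def test_out_of_range_percent_is_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(1000, 101)


# Step 2: the implementation is written afterwards, against the tests above.
def apply_discount(price_cents: int, percent: int) -> int:
    """Return the discounted price in cents; percent must be within 0..100."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents * (100 - percent) // 100


def run_tests() -> bool:
    """Run the suite and report whether every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyDiscountTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

The point of the ordering is that the tests become a reviewable statement of intent: an engineer can approve or reject the expected behavior before any generated implementation exists.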

Guardrail 3: Recording prompts and context

As a countermeasure to prompt-dependent attribution risk, it is effective to manage the prompts and AI dialogue logs behind significant design decisions as organizational assets.

Recording “why this AI suggestion was adopted” in GitHub Issues, PR descriptions, or ADRs (Architecture Decision Records) substantially reduces context loss over time.
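One way to make such records consistent is a small structured template. The sketch below is an assumption about what a record might contain; the class name, field names, and output layout are hypothetical and do not follow any particular ADR standard.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class AiDecisionRecord:
    """Hypothetical record of why an AI suggestion was adopted."""
    title: str
    decided_on: date
    reviewer: str             # the engineer accountable for adopting the suggestion
    model: str                # which model or tool produced the suggestion
    prompt_summary: str       # short summary of, or pointer to, the prompt log
    adoption_rationale: str
    rejected_alternatives: list[str] = field(default_factory=list)

    def to_text(self) -> str:
        """Render the record as plain text for a PR description or ADR file."""
        alternatives = [f"  - {alt}" for alt in self.rejected_alternatives]
        if not alternatives:
            alternatives = ["  - (none recorded)"]
        lines = [
            f"ADR: {self.title}",
            f"Date: {self.decided_on.isoformat()}",
            f"Reviewed and adopted by: {self.reviewer}",
            f"Model used: {self.model}",
            f"Prompt summary: {self.prompt_summary}",
            f"Why the suggestion was adopted: {self.adoption_rationale}",
            "Rejected alternatives:",
            *alternatives,
        ]
        return "\n".join(lines)
```

Even a record this small captures the three items most often lost when an engineer leaves: which model was used, what was asked of it, and why the result was accepted.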

Guardrail 4: Specialized security review

Security review of AI-generated code requires higher expertise than standard review. Because generated code “looks correct,” surface-level review misses problems easily. Domains such as authentication, input validation, and secrets management should receive dedicated specialist review.

Guardrail 5: Implementing an agent harness

A more fundamental approach to embedding the four guardrails above — rather than relying on culture and habit — is implementing an agent harness. An agent harness is an execution environment that systematizes the rules, permissions, hooks (pre/post processing), and skills under which an AI agent operates.

Its components typically include:

Context files (CLAUDE.md / AGENTS.md): Per-repository files that define the rules, design principles, and prohibited actions an AI agent must observe. The AI reads these files before writing any code and operates within the project-specific constraints they define.
Hooks (Pre/PostToolUse): Scripts that automatically execute before and after the AI makes code changes. Gates such as “always run tests after changes” or “require human confirmation before writing to specific files” can be automated here.
Skill definitions: Standardize recurring tasks (code review criteria, pre-deploy checks, quality validation) as prompt templates, so the AI applies the same quality baseline each time.
Permission controls: Explicitly define which repositories, branches, and commands an AI may operate on, minimizing the blast radius of unintended changes.

With a well-configured agent harness, the guardrails are built into the execution environment itself, so they hold even when the development style is close to vibe coding. Unlike guardrails 1–4, which depend on individual attention and organizational culture, an agent harness enforces them structurally.

For investment and DD purposes, checking whether a target company has agent configuration files such as CLAUDE.md or AGENTS.md, and whether CI-integrated hooks are present, provides a rapid signal of AI-driven development governance maturity. These files are typically committed to the Git repository and can be reviewed directly when repository access is available.
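When repository access is available, this first check can be scripted. The sketch below assumes the context files sit at the repository root and that CI configuration lives in one of two common locations (`.github/workflows` or `.gitlab-ci.yml`); the function name and result keys are hypothetical. Finding these files shows only that configuration exists, not that it is well designed or actually enforced.

```python
from pathlib import Path

# File names mentioned in the text above.
CONTEXT_FILES = ("CLAUDE.md", "AGENTS.md")
# Assumed CI locations; other CI systems use other paths.
CI_PATHS = (".github/workflows", ".gitlab-ci.yml")


def scan_governance_signals(repo_root: str) -> dict[str, bool]:
    """Report whether agent context files and CI configuration are present."""
    root = Path(repo_root)
    has_context = any((root / name).is_file() for name in CONTEXT_FILES)
    has_ci = any((root / path).exists() for path in CI_PATHS)
    return {
        "has_agent_context_file": has_context,
        "has_ci_config": has_ci,
    }
```

A result of `False` on both keys is a prompt for follow-up questions rather than a verdict, since a team may keep equivalent rules elsewhere.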

Three questions for investors and DD practitioners

When evaluating AI-driven development risk in technical DD, three questions are central.

Question 1: “Who can explain the design decisions in this codebase?”

Ask the engineering lead or CTO. Confirm whether a person exists who can articulate — specifically — the rationale for the core architecture and critical business logic decisions.

Answers that frequently reduce to “the AI generated it, so…” indicate advancing design intent hollowing. A codebase whose design judgments cannot be articulated is an asset whose change cost cannot be estimated.

Question 2: “What does the review process look like for AI-generated code?”

Investigate how code review operates. Ask whether there is a difference in treatment between AI-generated and hand-written code in PRs, whether a usage policy for AI coding tools exists, and whether test presence is included as a review criterion.

“Everyone uses it however they want” signals that guardrails do not exist. Even without a written policy, implicit rules sometimes function; asking for concrete examples is more revealing than asking about policies.

Question 3: “What portion of production code has no tests?”

Confirm test coverage. When the team cannot state a number, converting to a concrete question — “are there tests covering the core payment and authentication flows?” — is more effective.

In organizations using AI-driven development, “I’m not sure” is a common answer to this question, and that response itself is a risk signal. Not knowing the test coverage means not knowing the boundary between safely changeable and high-risk-to-change code.
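When the team cannot produce a coverage figure, a rough presence check can stand in as a first approximation. The sketch below assumes a hypothetical naming convention in which `payment.py` is tested by `test_payment.py`; it measures whether a test file exists, not how much of the code the tests exercise, so it is not a substitute for a real coverage tool.

```python
from pathlib import PurePosixPath


def untested_modules(source_files: list[str], test_files: list[str]) -> list[str]:
    """Return source files with no matching test_<name>.py file (assumed convention)."""
    tested_names = {
        PurePosixPath(test).name.removeprefix("test_") for test in test_files
    }
    return sorted(
        source for source in source_files
        if PurePosixPath(source).name not in tested_names
    )


def untested_share(source_files: list[str], test_files: list[str]) -> float:
    """Fraction of source files lacking a matching test file (0.0 if no sources)."""
    if not source_files:
        return 0.0
    return len(untested_modules(source_files, test_files)) / len(source_files)
```

Running it over the modules behind the core payment and authentication flows gives a concrete list to discuss with the team, which is usually more productive than debating a percentage.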

Figure 3: Three questions for evaluating AI-driven development

Translating evaluation results into investment decisions

The fact that a startup uses AI-driven development is not automatically disqualifying. What matters is whether AI is used with guardrails in place and whether risks are recognized and managed.

Lower-concern case: The organization positions AI as an assistive tool; engineers own design decisions; review and testing function as intended. Development velocity gains from AI usage exist alongside maintained baseline quality standards.

Watch-list case: Review of AI-generated code has become perfunctory and tests are sparse, but the team recognizes the problem and has an improvement roadmap. This can be addressed as a negotiating point or PMI condition.

High-risk case: No one can explain the design intent, tests are virtually absent, prompt management does not exist, and the team does not recognize any of this as a problem. In this state, valuing the technical asset itself becomes difficult.
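The three cases can be expressed as a simple decision rule. This is a simplification under stated assumptions: the field names are hypothetical, each finding is reduced to a yes/no judgment, and combinations the text does not describe (for example, recognized gaps with no roadmap) are treated conservatively as high-risk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DdFindings:
    """Hypothetical yes/no summary of what the DD interviews established."""
    engineers_own_design: bool        # someone can explain the design decisions
    review_and_tests_function: bool   # review and testing work as intended
    team_recognizes_gaps: bool        # the team acknowledges the problems
    has_improvement_roadmap: bool     # a concrete remediation plan exists


def classify(findings: DdFindings) -> str:
    """Map findings to 'lower-concern', 'watch-list', or 'high-risk'."""
    if findings.engineers_own_design and findings.review_and_tests_function:
        return "lower-concern"
    if findings.team_recognizes_gaps and findings.has_improvement_roadmap:
        return "watch-list"
    # Gaps exist and are either unrecognized or unplanned: treated as high-risk.
    return "high-risk"
```

In practice the inputs are judgments formed from interviews and repository review, so the rule is best used to make the reasoning explicit rather than to automate the decision.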

The evaluation dimensions for startups building LLMs into their products are covered in How AI changes technical due diligence. Vibe coding risk is a separate problem from the product-embedding risks described there — it must be assessed independently as a development process risk.

Summary

Technical debt generated by vibe coding and AI-driven development has the following characteristics:

Characteristic	Traditional technical debt	AI-driven development debt
Location of design intent	Developer’s memory	Absent (or in prompt history)
Recognition of test absence	“We should have, but had no time”	“AI wrote it, so it’s fine”
Form of attribution risk	Dependence on individuals	Dependence on prompts, tools, models
Surface quality	Often visibly low	Often high — easy to overlook

For DD practitioners and investors to evaluate these risks, the traditional approach of “read the code” and “check test coverage” must be supplemented by “verify the process by which design decisions are made.” Concretely, using the three questions outlined above — who can articulate design intent, whether review processes exist, and the actual state of testing — surfaces the technical debt risk profile of the AI era.

For a full framework on classifying and evaluating technical debt, see Technical debt in startups: patterns and root causes. For an overview of technical DD evaluation axes, see The seven axes of technical due diligence.


FAQ

Q. Will banning vibe coding solve the problem?

Prohibition is neither realistic nor necessary. The problem is not vibe coding itself but adoption without guardrails. Mandating review of AI-generated code, setting test standards, and recording design decisions are the three requirements — once these are in place, the productivity benefits of AI-driven development can be maintained while managing the risks.

Q. Can source code review detect AI-driven development risks?

Partially, but with limitations. Design intent hollowing and prompt-dependent attribution cannot be detected by reading code alone. Evaluating the development process, the team’s actual understanding, and the state of documentation is a necessary complement.

Q. How much does it cost to remediate AI-driven development debt post-investment?

It varies widely, but recovery from a state where “no one understands the design” is particularly expensive. Understanding the full codebase can take weeks; test remediation can take months; security audit and remediation require additional time. Assessing risk level pre-investment and incorporating it into the PMI plan is strongly recommended.