Where design flaws come from — and when to fix them
Design flaws have a way of becoming critical at the worst possible time. Teams that say “we’ll fix it later” often watch that debt compound until development grinds to a halt. On the other side, teams that chase every imperfection burn engineering cycles on problems that would never have mattered.
The practical challenge is knowing which design problems to fix, which to watch, and which to leave alone. This article gives you the tools to make that call — a three-scenario taxonomy of how design debt forms, a two-axis prioritization matrix, a decision framework for refactor vs. rewrite vs. tolerate, and a way to translate those technical choices into business language that operators and investors can act on.
How design flaws form: three typical scenarios
Design flaws rarely emerge from negligence. In most cases, a decision that made sense at the time becomes a liability as circumstances change. Three patterns account for the majority of cases.
① Requirement drift (most common)
A system is designed for one set of requirements. New capabilities are bolted on, one at a time, without revisiting the original structure. Eventually the design no longer fits the product.
A common example: a database schema designed for a few hundred users in a single-tenant MVP. When multi-tenancy is added — each enterprise customer needs isolated data, separate billing, and fine-grained permissions — patching the existing schema becomes increasingly fragile. The original design wasn’t wrong; the product just outgrew it.
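Here is a minimal sketch of why the patching gets fragile, using Python’s built-in sqlite3 module. The table, column, and tenant names are hypothetical. Once a tenant_id column is bolted onto a single-tenant table, every query written before multi-tenancy has to be found and given a tenant filter. Any query that is missed returns other tenants’ data.

```python
import sqlite3


def demo_tenant_leak() -> tuple[list[str], list[str]]:
    """Show how an unscoped legacy query leaks data after tenant_id is bolted on.

    Hypothetical schema: a single-tenant `users` table later patched with a
    `tenant_id` column.
    """
    conn = sqlite3.connect(":memory:")
    # Original single-tenant MVP schema.
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    # The multi-tenancy patch: a new column with a default for pre-existing rows.
    conn.execute("ALTER TABLE users ADD COLUMN tenant_id TEXT NOT NULL DEFAULT 'legacy'")
    conn.executemany(
        "INSERT INTO users (email, tenant_id) VALUES (?, ?)",
        [("a@acme.example", "acme"), ("b@globex.example", "globex")],
    )

    # Legacy query written before multi-tenancy: no tenant filter.
    legacy = [row[0] for row in conn.execute("SELECT email FROM users ORDER BY id")]

    # Patched query: correct only because its author remembered the filter.
    scoped = [
        row[0]
        for row in conn.execute(
            "SELECT email FROM users WHERE tenant_id = ? ORDER BY id", ("acme",)
        )
    ]
    conn.close()
    return legacy, scoped


legacy, scoped = demo_tenant_leak()
print(legacy)  # both tenants' emails
print(scoped)  # only acme's email
```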
② Deliberate shortcuts
Speed-to-market trade-offs that were never resolved. Startups often make the right call to ship fast and clean up later. The problem is that “later” tends to get deferred indefinitely.
A common example: authentication logic copied and pasted across controllers because “it’s fast.” When security policy changes, every copy must be updated in sync. Missed instances become vulnerabilities. The original shortcut was a reasonable bet — the failure was not scheduling the payoff.
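Below is a minimal Python sketch of the alternative to copy-pasted checks. The names (User, Forbidden, require_role, and the two controller functions) are hypothetical and not tied to any web framework. Because the check lives in one decorator, a policy change is a single edit instead of a hunt through every controller.

```python
from dataclasses import dataclass
from functools import wraps
from typing import Callable


@dataclass
class User:
    name: str
    roles: set[str]


class Forbidden(Exception):
    """Raised when a user lacks the role a controller requires."""


def require_role(role: str) -> Callable:
    """The single place where this (hypothetical) authorization policy is enforced."""
    def decorator(handler: Callable) -> Callable:
        @wraps(handler)
        def wrapper(user: User, *args, **kwargs):
            if role not in user.roles:
                raise Forbidden(f"{user.name} lacks role {role!r}")
            return handler(user, *args, **kwargs)
        return wrapper
    return decorator


# Controllers declare the policy instead of re-implementing the check inline.
@require_role("admin")
def delete_account(user: User, account_id: int) -> str:
    return f"deleted {account_id}"


@require_role("billing")
def view_invoices(user: User) -> str:
    return "invoices"
```

If the policy later tightens, only `require_role` changes. With copy-pasted checks, every controller is one more place where the update can be missed.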
③ Knowledge gaps widening over time
Designs that were sound given the team’s experience level at the time, but that show their limits as the product scales or requirements evolve.
A three-person startup team builds a system that works well at launch. Four years later, the same system must handle ML pipelines, real-time processing, and analytics across millions of records. The constraints baked into early architectural choices become constraints on what the business can do. This isn’t a failure of the original team — it’s what growth looks like.
The prioritization matrix: impact × change frequency
Not all design debt is equal. The question is not “does a flaw exist?” but “what does it cost to leave it alone?”
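Two dimensions determine that cost: impact (what breaks if this fails) and change frequency (how often engineers have to touch it). Crossing them gives four quadrants, numbered in priority order: ① is high impact and high change frequency (fix first), and ④ is low impact and low change frequency (usually safe to tolerate).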
Assessing impact
Ask: “What breaks for users or the business if this component fails?” High-impact areas include payment processing, authentication, core business logic, SLA-critical paths, and anything that affects data integrity. Low-impact areas include internal tooling, log formatting, and admin-only reports.
A common mistake is conflating technical complexity with business impact. A tangled piece of code in a rarely used internal report is far lower priority than simpler but frequently called payment logic.
Assessing change frequency
Pull the Git commit history. Files changed more than once per week over the past six months (more than about 26 commits) are high-frequency. This data is objective and takes minutes to collect: `git log --since="6 months ago" --format= --name-only | grep -v '^$' | sort | uniq -c | sort -rn | head -20` gives a rough ranking of the most frequently changed files.
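For something more repeatable than a shell pipeline, here is a small Python sketch. It parses the output of `git log --since="6 months ago" --format= --name-only` and flags files touched in more than 26 commits, which is roughly once a week over six months. The threshold and the function names are assumptions for illustration. Commit counts measure how often a file changes, not how much.

```python
import subprocess
from collections import Counter

WEEKS_IN_WINDOW = 26  # ~six months; assumed threshold matching "once per week"


def count_file_changes(git_log_output: str) -> Counter:
    """Count how many commits touched each file.

    Expects the output of `git log --format= --name-only`: one line per changed
    file per commit, with blank lines between commits.
    """
    return Counter(line.strip() for line in git_log_output.splitlines() if line.strip())


def high_frequency_files(counts: Counter, threshold: int = WEEKS_IN_WINDOW) -> list[tuple[str, int]]:
    """Files changed in more commits than the threshold, most-changed first."""
    return [(path, n) for path, n in counts.most_common() if n > threshold]


def read_git_log(since: str = "6 months ago", repo: str = ".") -> str:
    """Run git in `repo` and return the raw file-per-line log (requires git on PATH)."""
    return subprocess.run(
        ["git", "-C", repo, "log", f"--since={since}", "--format=", "--name-only"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
```

Inside a repository, `high_frequency_files(count_file_changes(read_git_log()))` returns the candidate hotspots. Keeping the parsing separate from the git call makes the logic easy to test on saved output.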
Refactor, rewrite, or tolerate: a decision framework
Once a problem ranks high enough on the matrix to demand a decision, there are three paths: refactor (improve the existing code), rewrite (replace it from scratch), or tolerate (accept the cost and move on).
| Criterion | Refactor | Rewrite | Tolerate |
|---|---|---|---|
| Readability of existing code | Understandable with effort | Faster to rewrite than decipher | Either |
| Test coverage | Tests exist | No tests and adding them is difficult | — |
| Problem scope | Localized | Systemic / foundational | Narrow and non-spreading |
| Business logic value | Logic is correct; structure needs work | Premises are wrong | Fix cost exceeds benefit |
| Risk appetite | Change can be staged | Staging is impractical | Risk is acceptable |
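The table can be read as a rough decision procedure. The Python sketch below encodes one possible reading. The field names, the equal weighting of rows, and the majority threshold are all assumptions, and a real decision still needs judgment that no function captures.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    REFACTOR = "refactor"
    REWRITE = "rewrite"
    TOLERATE = "tolerate"


@dataclass
class Assessment:
    # Hypothetical inputs, one per row of the decision table.
    fix_cost_exceeds_benefit: bool     # "Business logic value", tolerate column
    premises_wrong: bool               # "Business logic value", rewrite column
    systemic: bool                     # "Problem scope"
    faster_to_rewrite_than_read: bool  # "Readability of existing code"
    has_tests_or_can_add: bool         # "Test coverage"
    can_stage_changes: bool            # "Risk appetite"


def choose_path(a: Assessment) -> Path:
    """One possible reading of the table: tolerate first, then a majority vote."""
    if a.fix_cost_exceeds_benefit:
        return Path.TOLERATE
    # Each table row casts one vote for rewriting (equal weights are an assumption).
    rewrite_votes = sum([
        a.premises_wrong,
        a.systemic,
        a.faster_to_rewrite_than_read,
        not a.has_tests_or_can_add,
        not a.can_stage_changes,
    ])
    return Path.REWRITE if rewrite_votes >= 3 else Path.REFACTOR
```

The majority vote is deliberately crude. Its value is that it forces every row of the table to be answered explicitly before anyone argues for a rewrite.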
When to refactor
Refactoring is the right call when code is still understandable, tests exist, and the problem is bounded. The key constraint: never refactor without tests. Restructuring code without a test safety net introduces new bugs at a high rate. Add tests first, then restructure.
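A common way to build that safety net for untested code is a characterization test. You record what the existing code does today, then check the restructured version against that record. The minimal Python sketch below uses hypothetical functions, `legacy_shipping_fee` and `shipping_fee`.

```python
def legacy_shipping_fee(weight_kg, express):
    # Tangled original logic we want to restructure without changing behavior.
    fee = 0
    if weight_kg <= 1:
        fee = 5
    else:
        if weight_kg <= 5:
            fee = 8
        else:
            fee = 8 + (weight_kg - 5) * 2
    if express == True:
        fee = fee * 2
    return fee


def shipping_fee(weight_kg: float, express: bool) -> float:
    """Refactored version: flatter structure, intended to behave identically."""
    if weight_kg <= 1:
        base = 5
    elif weight_kg <= 5:
        base = 8
    else:
        base = 8 + (weight_kg - 5) * 2
    return base * 2 if express else base


# Step 1: pin current behavior across representative inputs, including boundaries.
CASES = [(w, e) for w in (0, 0.5, 1, 1.01, 3, 5, 5.5, 12) for e in (False, True)]
GOLDEN = {case: legacy_shipping_fee(*case) for case in CASES}


# Step 2: the refactored code must reproduce every recorded result.
def check_characterization() -> None:
    for case, expected in GOLDEN.items():
        actual = shipping_fee(*case)
        assert actual == expected, f"{case}: expected {expected}, got {actual}"


check_characterization()
```

In a real codebase, the recorded results would be saved to a fixture before restructuring begins. That way the comparison runs against the old behavior even after the old code is gone.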
When to rewrite
Rewrite when the question “how long to understand and fix this?” has a longer answer than “how long to write it fresh?”
The risk of a full rewrite is well documented. Joel Spolsky described Netscape’s decision to rewrite its code from scratch as “the single worst strategic mistake that any software company can make.” Working systems encode years of edge-case handling, regulatory constraints, and implicit business rules that are invisible until they break. A rewrite discards that knowledge.
When a rewrite is genuinely the right call, use a strangler fig approach: run old and new systems in parallel, migrate one slice at a time, and retire old code incrementally. Never switch everything at once.
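The Python sketch below shows the routing at the core of the strangler fig approach. The facade, the handler names, and the choice to migrate by tenant are hypothetical. Real systems often do this routing in an API gateway or reverse proxy rather than in application code. Unmigrated traffic is also sent to the new code in shadow mode, so discrepancies surface before a slice is cut over.

```python
from typing import Callable

Handler = Callable[[str, dict], dict]


def legacy_invoice(tenant: str, order: dict) -> dict:
    return {"total": order["amount"], "system": "legacy"}


def new_invoice(tenant: str, order: dict) -> dict:
    return {"total": order["amount"], "system": "new"}


class StranglerFacade:
    """Routes each request to the old or new implementation, one slice at a time."""

    def __init__(self, legacy: Handler, replacement: Handler) -> None:
        self.legacy = legacy
        self.replacement = replacement
        self.migrated: set[str] = set()   # slices (here: tenants) already cut over
        self.mismatches: list[str] = []   # shadow-run discrepancies to investigate

    def migrate(self, tenant: str) -> None:
        self.migrated.add(tenant)

    def handle(self, tenant: str, order: dict) -> dict:
        if tenant in self.migrated:
            return self.replacement(tenant, order)
        # Not yet migrated: legacy answers, and the new code runs in the shadow.
        result = self.legacy(tenant, order)
        try:
            shadow = self.replacement(tenant, order)
            if shadow.get("total") != result.get("total"):
                self.mismatches.append(tenant)
        except Exception:
            # A failure in the new code must never affect the live response.
            self.mismatches.append(tenant)
        return result
```

Once every slice is migrated and the mismatch list stays empty, the legacy handler and the facade’s fallback branch can be deleted.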
When to tolerate
Tolerate when the area falls in quadrant ④ — low impact, low frequency — or when the feature is on a deprecation path. Perfectionism-driven refactoring of low-stakes code consumes capacity that could go toward problems that actually matter. “Leave it alone” is a valid engineering decision, not a failure.
The perfectionism and cleanliness trap
Conscientious engineers have a hard time leaving imperfect code alone. That instinct is generally healthy — but when it overrides business priorities, it becomes a liability.
The most common failure mode is refactoring code that nobody was planning to touch. When an engineer improves low-frequency, low-impact code simply because it bothers them, they absorb the cost of the change (time, bug-introduction risk, review cycles) in exchange for aesthetic satisfaction rather than measurable value. In a startup where engineering capacity is scarce, this is a real problem.
The Boy Scout Rule — “leave the code a little cleaner than you found it” — is sound, but the scope matters. “A little cleaner” means the immediate area of your change, not every adjacent file that could use tidying. Interpreted too broadly, it turns feature work into extended refactoring sessions.
A practical countermeasure: make the decision to not fix something explicit. When a flaw is identified but deliberately left alone, note it with a code comment or a low-priority ticket — including the reasoning. This prevents a future engineer from encountering the same code, assuming nobody noticed, and refactoring it without context. Acknowledged technical debt is far less dangerous than invisible technical debt.
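Here is one lightweight convention, sketched in Python. A tagged comment records the decision and the reasoning, and a tiny scanner lists every such tag so the acknowledged debt stays visible. The `TOLERATED-DEBT` tag, the ticket ID, and the scanner are hypothetical, not a standard tool.

```python
import re

# Example of an explicitly tolerated flaw, as it might appear in source code.
EXAMPLE_SOURCE = '''
def export_admin_report(rows):
    # TOLERATED-DEBT(OPS-123): string-built CSV instead of the csv module.
    # Why: admin-only report, rarely changed, slated for removal.
    # Revisit only if the report gains external users.
    return "\\n".join(",".join(str(v) for v in r) for r in rows)
'''

# Matches "# TOLERATED-DEBT(<ticket>): <summary>" anywhere on a line.
TAG = re.compile(r"#\s*TOLERATED-DEBT\(([^)]+)\):\s*(.+)")


def find_tolerated_debt(source: str, filename: str = "<string>") -> list[tuple[str, int, str, str]]:
    """Return (file, line number, ticket, summary) for every tagged decision."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = TAG.search(line)
        if match:
            hits.append((filename, lineno, match.group(1), match.group(2).strip()))
    return hits
```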
Putting it into practice: a four-step process
Step 1: Surface the debt
Ask engineers “what code do you most want to fix?” They know. Supplement with Git commit frequency data and bug clustering by component.
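Bug clustering can be as simple as counting bug tickets per component. The minimal Python sketch below assumes tickets have already been exported from the tracker as (ticket_id, component) pairs. The sample data is made up for illustration.

```python
from collections import Counter


def bug_clusters(tickets: list[tuple[str, str]], top: int = 5) -> list[tuple[str, int]]:
    """Rank components by the number of bug tickets filed against them."""
    return Counter(component for _, component in tickets).most_common(top)


# Hypothetical export from an issue tracker.
sample = [
    ("BUG-1", "billing"), ("BUG-2", "auth"), ("BUG-3", "billing"),
    ("BUG-4", "reports"), ("BUG-5", "billing"), ("BUG-6", "auth"),
]
print(bug_clusters(sample, top=2))  # [('billing', 3), ('auth', 2)]
```

Components that appear near the top of both this ranking and the commit-frequency ranking deserve close attention in Step 2.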
Step 2: Assess impact
Get operators, PMs, and engineers in the same room. Ask “what happens to the business if this fails?” Engineers often flag technically messy code that is low-stakes for the product. Operators often don’t realize a clean-looking surface hides critical fragility. The joint conversation closes the gap.
Step 3: Place on the matrix
Place each issue in the four quadrants. Work from ① downward. This turns a subjective list of “things to fix someday” into a ranked, defensible prioritization.
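The placement step can be sketched in Python. The text places low impact and low frequency in ④ and starts work at ①, which implies ① is high on both axes. The assignment of ② and ③ below, which ranks high impact above high frequency, is an assumption. So are the field names and the 26-change threshold.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    name: str
    high_impact: bool  # from the joint operator/PM/engineer discussion (Step 2)
    changes_6mo: int   # commits touching the area over six months (Step 1)


def quadrant(issue: Issue, freq_threshold: int = 26) -> int:
    """Map an issue to quadrant 1-4. The ordering of 2 vs 3 is an assumption."""
    high_freq = issue.changes_6mo > freq_threshold
    if issue.high_impact and high_freq:
        return 1
    if issue.high_impact:
        return 2
    if high_freq:
        return 3
    return 4


def prioritize(issues: list[Issue]) -> list[tuple[int, str]]:
    """Sort by quadrant, then by change count (most-changed first) within a quadrant."""
    ranked = sorted(issues, key=lambda i: (quadrant(i), -i.changes_6mo))
    return [(quadrant(i), i.name) for i in ranked]


backlog = [
    Issue("admin report formatting", high_impact=False, changes_6mo=3),
    Issue("payment retry logic", high_impact=True, changes_6mo=40),
    Issue("auth session handling", high_impact=True, changes_6mo=12),
    Issue("search filters UI", high_impact=False, changes_6mo=55),
]
print(prioritize(backlog))
```

The output is the ranked, defensible list described above: payment retry logic in ①, auth session handling in ②, the search UI in ③, and the admin report in ④.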
Step 4: Integrate with development
Dedicated “refactor sprints” are rarely sustained. The Boy Scout Rule works better: leave every file you touch slightly cleaner than you found it. Pair design improvements with feature work in the same area, because the change is happening anyway and the context is fresh.
Translating design decisions into business language
Engineers speak in technical terms. Operators and investors need business terms. The translation is straightforward:
Technical framing:
“The authentication module is tightly coupled. Every change requires full regression testing across all features.”
Business framing:
“Changing how users log in takes us two weeks. Our competitors do it in three days. This one design issue makes us more than 3× slower on auth-related features.”
This translation matters in resource allocation discussions. It also matters when investors assess a company. Slow feature velocity relative to peers, engineers leaving with “codebase quality” as a reason, and recurring bug patterns in similar areas are all signals that design debt has been allowed to compound.
Calculating the cost of inaction
Proposals to fix design debt typically arrive as “this will take N weeks.” They should always be compared against the cost of not fixing it.
| Item | Fix now | Leave alone |
|---|---|---|
| Upfront cost | 2-week refactor | None |
| Ongoing cost | Eliminated | 2 extra days per change × 4 changes/month = 8 days/month |
| Risk | Temporary instability during migration | Bug risk on every future change |
| 3-month outcome | Change cost reduced by 2/3 | Situation worsens |
| Verdict | Pays back in ~5 weeks | Compounds indefinitely |
A table like this shifts the conversation from “this is expensive” to “this is an investment with a calculable return.”
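The verdict row is simple arithmetic, and writing it down makes the assumptions explicit. The Python sketch below uses the table’s figures. It assumes that a 2-week refactor means 10 working days and converts months to weeks at 52/12.

```python
def payback(upfront_days: float, extra_days_per_change: float, changes_per_month: float) -> dict:
    """Months and weeks until a fix's upfront cost is recovered by avoided ongoing cost."""
    monthly_cost_of_inaction = extra_days_per_change * changes_per_month
    months = upfront_days / monthly_cost_of_inaction
    return {
        "monthly_cost_days": monthly_cost_of_inaction,
        "payback_months": months,
        "payback_weeks": months * 52 / 12,
    }


result = payback(upfront_days=10, extra_days_per_change=2, changes_per_month=4)
print(result)  # payback_months is 1.25; payback_weeks is about 5.4
```

This flat estimate is the conservative case. If the cost of leaving the code alone grows over time, as the “compounds indefinitely” verdict suggests, the payback period only gets shorter.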
For a broader treatment of technical debt classification, see The true nature of technical debt — a redefinition for investors and executives. For how design debt fits into a full technical due diligence assessment, see The complete picture of technical due diligence.
Summary
| Step | Question | Action |
|---|---|---|
| Surface | Where are the flaws? | Engineer interviews + Git commit frequency |
| Prioritize | Fix or leave alone? | Impact × change frequency matrix |
| Choose path | How to fix? | Refactor / rewrite / tolerate decision framework |
| Translate | How to justify it? | Development speed and cost-of-inaction comparison |
| Execute | When and how? | Boy Scout Rule embedded in regular feature work |
The goal is not a codebase free of design flaws — that does not exist. The goal is knowing which flaws are costing you the most right now, fixing those, and leaving the rest alone without guilt. Design debt management is an investment decision, not a quality crusade.
FAQ
Can non-engineers make these prioritization decisions?
Identifying specific flaws requires engineering input. Prioritizing them does not — “how bad would it be if this broke?” is a business question. Operators and investors can and should weigh in on impact ratings. Frequency data comes from Git history and requires no engineering expertise to read.
How do we find time to refactor without stopping feature work?
Don’t carve out separate refactor time. Tie improvements to feature work in the same codebase area. When an engineer is already in a component for a new feature, the marginal cost of cleaning up nearby debt is low. Dedicated refactor cycles require justification that routine improvements don’t.
How do we know when a rewrite has gone too far?
A rewrite that is still ongoing six months after the original estimate is a warning sign. Common failure modes: scope expands as hidden requirements surface, the old system accumulates new features while the rewrite is in flight, and there is no clear migration path. Set hard deadlines for the parallel-run period before starting.