How AI changes technical due diligence: evaluating startups that use LLMs in their products

Technical due diligence (DD) on startups that use large language models (LLMs) in their products is becoming steadily more common. Applying the traditional 7-axis framework (architecture, code quality, infrastructure, security, development process, engineering organization, competitive advantage) to these companies leaves significant blind spots.

The conclusion upfront: startups that use LLMs in their products have four evaluation dimensions that traditional technical DD cannot capture: ① accurately classify the LLM usage pattern (API-based / fine-tuning / proprietary model), ② evaluate prompt management quality and its IP implications, ③ verify LLM-driven cost variability risks and the existence of evaluation systems, ④ assess whether safety mechanisms are designed to prevent unexpected AI behavior. Layering these four dimensions on top of traditional axes completes technical DD for the AI era.

What has changed: the delta from traditional DD

Traditional technical DD assumed that “code written by in-house engineers” was the core of a company’s technical assets. Code quality, architectural design, and test coverage could all be evaluated by reading source code and CI/CD configurations.

LLMs have upended that assumption. A product’s core behavior may now be determined not by in-house code but by results returned from OpenAI or Anthropic APIs. Two consequences follow: “prompts become assets” and “model behavior determines product quality.”

This shift has three concrete impacts on technical DD.

Expanded evaluation scope: Beyond code, prompts, evaluation datasets, and fine-tuning data become evaluation targets — none of which are visible in a traditional source code review.

Changed risk structure: “External API dependency risk” and “model version update behavior change risk” are added to the technical risk profile. This resembles cloud service dependency but differs in that behavioral predictability is substantially lower.

Changed competitive advantage criteria: “Has the company fine-tuned a model on proprietary data?” and “Is there a proprietary data asset?” gain importance as sources of differentiation. An architecture that “just calls the ChatGPT API” and one built on “a model undergoing continuous learning on proprietary data” have fundamentally different competitive moats.

Classifying LLM usage: three patterns

The first question when starting a technical DD is “How does this product use LLMs?” Mapping the answer to the following three patterns organizes the subsequent evaluation points.

Pattern 1: API-based

The product calls APIs of general-purpose models provided by OpenAI, Anthropic, Google, etc. This is the current mainstream; most AI startups fall here.

Strengths: Fast development velocity. Low cost of tracking the latest models. No infrastructure management required.

Inherent risks:

API dependency risk: Directly affected by terms of service changes, price revisions, and service termination. When the architecture cannot function without this API, the valuation impact is significant.
Vendor lock-in: Code that depends on OpenAI-specific features like Function Calling syntax or context window characteristics incurs high migration costs when switching models.
Model version management: Verify whether the API model version is explicitly pinned. With “always use latest model” settings, a model update can suddenly change production behavior.
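
The version-pinning check above can be partly automated during code review. The following is a minimal sketch, assuming that pinned identifiers end in a date-like snapshot suffix; that naming convention is a heuristic assumed here, not a rule published by any provider, and the model identifiers are used only as illustrative inputs.

```python
import re

# Heuristic (an assumption): a pinned identifier ends in a date-like
# snapshot suffix, either YYYY-MM-DD or YYYYMMDD.
_SNAPSHOT_SUFFIX = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8})$")


def is_pinned(model_id: str) -> bool:
    """Return True when the identifier appears to name a fixed snapshot."""
    return bool(_SNAPSHOT_SUFFIX.search(model_id))


def unpinned_models(model_ids: list[str]) -> list[str]:
    """Return the identifiers that look like floating aliases."""
    return [m for m in model_ids if not is_pinned(m)]


# Identifiers as they might be collected from API call code (illustrative).
found_in_code = ["gpt-4o", "gpt-4o-2024-08-06", "claude-3-5-sonnet-latest"]
print(unpinned_models(found_in_code))
```

Any identifier this flags is a prompt for a follow-up question in the engineering interview, not a finding on its own, since a team may pin versions through configuration rather than in code.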

Key questions: Which API providers does the product depend on? Has a vendor-switching cost estimate been prepared? What are the monthly API cost trends, and what is their ratio to revenue?

Pattern 2: Fine-tuning

A general-purpose model is additionally trained on proprietary data. OpenAI, Google Cloud, and others offer commercial fine-tuning services.

Strengths: Domain-specific accuracy tends to exceed generic models. Incorporating proprietary data into training creates some degree of differentiation.

Inherent risks:

Training data quality and copyright: Is the license and copyright handling of fine-tuning data appropriate? Using scraped data or third-party content as-is carries IP risk.
Obsolescence cycle: Each time the base model is updated, fine-tuning has to be redone. Does the organization have the capability to keep up with this update cycle?
Evaluation difficulty: Does a benchmark exist to quantitatively measure fine-tuning effectiveness? A qualitative “it got better” feeling doesn’t constitute quality assurance.

Key questions: Training data sources and copyright handling, fine-tuning update frequency and organizational capacity, evaluation dataset design.
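
One way to make the training data question concrete is to ask for a provenance manifest and check it mechanically. The sketch below assumes a hypothetical manifest format (the `DatasetRecord` fields are invented for illustration) and simply lists datasets with no documented license or consent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetRecord:
    """One entry in a hypothetical training data manifest."""
    name: str
    source: str              # e.g. "customer uploads", "web scrape"
    license: Optional[str]   # None when no license or consent is documented


def undocumented(records: list[DatasetRecord]) -> list[str]:
    """Names of datasets that have no recorded license or consent basis."""
    return [r.name for r in records if not r.license]


manifest = [
    DatasetRecord("support_tickets", "customer uploads", "customer agreement"),
    DatasetRecord("forum_posts", "web scrape", None),
]
print(undocumented(manifest))
```

A documented license still needs legal review; the check only surfaces the entries where nothing has been recorded at all.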

Pattern 3: Proprietary model

A custom model architecture is designed and trained internally. Resource requirements are substantial, so this is limited to large tech companies, specialized AI research organizations, or companies holding massive proprietary data in specific domains.

Strengths: Hardest for competitors to replicate. Data, model, and inference pipeline all become proprietary assets.

Inherent risks:

Organizational dependency risk: Specialized ML/MLOps skill sets are required to develop and maintain the model, creating key-person dependency.
Infrastructure cost: GPU costs for training and inference scale rapidly. Verifying whether unit economics are viable is essential.
Development velocity: Development cycles are structurally slower than with general-purpose APIs.

When a startup claims this pattern, verifying “is it truly a proprietary model?” is important. Cases exist where a Hugging Face public model is used nearly as-is while being described as “proprietary AI.”

Four additional evaluation dimensions

Once the pattern classification is complete, examine the following four dimensions regardless of which pattern applies.

Dimension 1: Prompt management

In products that use LLMs, prompts (instruction text to the LLM) become critical assets that determine product quality. Assess prompt management with the following:

Version control: Are prompts managed in Git like code? “Written in Notion” or “scattered across engineer workstations” is a risk.
A/B testing and evaluation cycles: Is there a mechanism to quantitatively measure the effect when prompts are changed? An intuition-based improvement loop doesn’t constitute quality assurance.
IP implications: Prompts themselves are currently difficult to protect via patent; managing them as trade secrets is the realistic approach. When a competitor replicates them, what serves as the barrier to entry?
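
Version control for prompts can be checked beyond "is it in Git". A minimal sketch, assuming prompts are stored as named templates (the `PromptVersion` class and registry layout are hypothetical): each template gets a content fingerprint that can be logged with every LLM call, so any output can be traced back to the exact prompt that produced it.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    name: str
    template: str

    @property
    def fingerprint(self) -> str:
        """Short content hash; any edit to the template changes it."""
        digest = hashlib.sha256(self.template.encode("utf-8")).hexdigest()
        return digest[:12]


def changed_prompts(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Names whose fingerprint differs between two name-to-fingerprint maps.

    Prompts that were added or removed are also reported.
    """
    return sorted(
        name for name in old.keys() | new.keys()
        if old.get(name) != new.get(name)
    )


v1 = PromptVersion("summarize", "Summarize the text: {text}")
v2 = PromptVersion("summarize", "Summarize the text in one sentence: {text}")
print(changed_prompts({v1.name: v1.fingerprint}, {v2.name: v2.fingerprint}))
```

If a team cannot say which prompt version produced a given production output, that gap is itself a useful DD finding.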

Dimension 2: Evaluation systems (evals)

LLM outputs are probabilistic, and many tasks lack a single correct answer. This means products that use LLMs require purpose-built evaluation design for quality assurance.

Organizations without an eval system can only discover production problems incidentally. Verify the following:

Is a golden dataset (a collection of expected input-output pairs) maintained?
Is there a CI pipeline that runs automatic evaluation before and after model updates or prompt changes?
Is there a mechanism to incorporate negative user feedback into the evaluation improvement loop?
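
The golden dataset and CI check in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: the substring check, the 0.9 default threshold, and the `fake_model` stand-in are all hypothetical, and real evals often need task-specific graders instead.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    prompt: str
    must_contain: str  # simplest possible check, used here for illustration


def pass_rate(model: Callable[[str], str], cases: list[GoldenCase]) -> float:
    """Fraction of golden cases whose output contains the expected text."""
    if not cases:
        raise ValueError("golden dataset is empty")
    passed = sum(
        1 for case in cases
        if case.must_contain.lower() in model(case.prompt).lower()
    )
    return passed / len(cases)


def gate(model: Callable[[str], str], cases: list[GoldenCase],
         threshold: float = 0.9) -> bool:
    """CI-style gate: True only when the pass rate meets the threshold."""
    return pass_rate(model, cases) >= threshold


def fake_model(prompt: str) -> str:
    """Deterministic stand-in for a real LLM call (hypothetical)."""
    answers = {"Capital of France?": "The capital is Paris.", "2 + 2?": "4"}
    return answers.get(prompt, "I don't know.")


golden = [
    GoldenCase("Capital of France?", "paris"),
    GoldenCase("2 + 2?", "4"),
    GoldenCase("Capital of Japan?", "tokyo"),
]
print(gate(fake_model, golden))
```

Running the same gate before and after a prompt change or model update turns "it got better" into a number that can block a release.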

The presence or absence of an eval system is a strong indicator of development process maturity. Organizations with strong technical capabilities tend to invest in this area early.

Dimension 3: Cost structure and scaling characteristics

In both API-based and fine-tuning patterns, LLM-related costs scale proportionally with usage. Verify whether this cost structure is aligned with the business model.

Unit economics viability: Is there a calculation of LLM cost per user or per request and its relationship to corresponding revenue?
Cost cap design: Is there a per-user API call limit? An unlimited design can cause cost explosions when heavy users exist.
Model price change risk: Is the sensitivity to API provider price revisions understood? Reference historical revision patterns (generally downward) in the assessment.
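
The three checks above reduce to simple arithmetic that a target company should be able to produce on request. A minimal sketch, assuming per-token pricing quoted per million tokens; every price, token count, and volume below is a placeholder, not a real provider rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """USD per one million tokens. The values used below are placeholders."""
    input_per_million: float
    output_per_million: float


def cost_per_request(price: TokenPrice, input_tokens: int,
                     output_tokens: int) -> float:
    """LLM cost in USD for a single request."""
    total = (input_tokens * price.input_per_million
             + output_tokens * price.output_per_million)
    return total / 1_000_000


def llm_cost_ratio(monthly_requests: int, cost_per_req: float,
                   monthly_revenue: float) -> float:
    """Monthly LLM spend as a fraction of monthly revenue."""
    if monthly_revenue <= 0:
        raise ValueError("monthly_revenue must be positive")
    return monthly_requests * cost_per_req / monthly_revenue


def reprice(price: TokenPrice, factor: float) -> TokenPrice:
    """Scale both prices to test sensitivity to a provider price revision."""
    return TokenPrice(price.input_per_million * factor,
                      price.output_per_million * factor)


def within_cap(requests_today: int, daily_cap: int) -> bool:
    """Per-user cost cap: allow the request only while under the cap."""
    return requests_today < daily_cap


price = TokenPrice(input_per_million=2.0, output_per_million=8.0)
per_request = cost_per_request(price, input_tokens=1500, output_tokens=500)
print(per_request, llm_cost_ratio(200_000, per_request, 20_000.0))
```

Rerunning the ratio with `reprice(price, 2.0)` or `reprice(price, 0.5)` shows how exposed the margin is to a price revision in either direction.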

Dimension 4: Safety mechanisms and guardrail design

Whether mechanisms exist to prevent AI from behaving unexpectedly is an increasingly important evaluation axis. “Safety mechanisms” here should be assessed not only in the context of ethical AI use, but as implementation design that protects product quality, reliability, and legal compliance.

Key items to verify:

Input validation: Is there filtering to prevent prompt injection attacks and unintended instructions embedded in user inputs? (Prompt injection countermeasures)
Output validation (guardrails): Is there a mechanism to detect and filter harmful content, incorrect numbers, or personal information leakage in LLM outputs before they reach production? Verify adoption of guardrail libraries (NeMo Guardrails, Guardrails AI, etc.).
Human-in-the-loop (HITL): For high-stakes decisions (medical, financial, legal domains), does the design avoid fully autonomous AI execution? Assess whether the HITL design is appropriate.
Anomaly detection and alerting: Is there monitoring of AI output in production, with alerts for sudden behavioral changes (accuracy degradation, response time anomalies, increased harmful outputs)?
Test harness setup: Is there a reproducible environment for testing LLM behavior? Testing probabilistic LLM outputs requires different design from traditional unit tests. Verify whether techniques like mock LLMs, seed fixing, and behavioral snapshots are used appropriately.
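
To show where input validation, output validation, and a mock LLM sit relative to each other, here is a minimal sketch in plain Python. It does not use NeMo Guardrails or Guardrails AI; the keyword list, the email pattern, and `mock_llm` are hypothetical stand-ins, and keyword matching alone is not an adequate defense against prompt injection.

```python
import re
from typing import Callable

# Simplified email pattern for illustration; real PII detection is broader.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical injection hints. A real filter needs far more than this.
INJECTION_HINTS = ("ignore previous instructions", "reveal the system prompt")


def looks_like_injection(user_input: str) -> bool:
    """Input validation: flag inputs containing a known injection phrase."""
    lowered = user_input.lower()
    return any(hint in lowered for hint in INJECTION_HINTS)


def redact_emails(text: str) -> str:
    """Output validation: mask email addresses before text reaches the user."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def guarded_call(llm: Callable[[str], str], user_input: str) -> str:
    """Run the input filter, call the model, then run the output filter."""
    if looks_like_injection(user_input):
        return "Request blocked by input filter."
    return redact_emails(llm(user_input))


def mock_llm(prompt: str) -> str:
    """Deterministic stand-in so guardrails can be tested without an API."""
    return "Contact alice@example.com for details."


print(guarded_call(mock_llm, "Who should I contact?"))
```

Because `mock_llm` is deterministic, the guardrail logic can be covered by ordinary unit tests, while the probabilistic behavior of the real model is left to the eval pipeline.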

Startups that are thin on this dimension carry the risk of post-launch incidents: mass generation of misinformation, security breaches, or delayed regulatory compliance. For startups operating in high-sensitivity domains (healthcare, finance, education, hiring), explicitly incorporating safety mechanism readiness into investment terms is advisable.

Translating evaluation to investment decisions

How do these assessments translate to investment decisions? The purpose of technical DD comes down to three outputs: identifying critical risks, providing negotiation leverage, and laying the groundwork for post-merger integration (PMI) plans. This principle remains unchanged in the AI era.

For AI-specific dimensions, structure outputs with particular attention to the following.

Items to treat as critical risks: When the terms of service of the API in use restrict the product’s use case (e.g., a prohibition on use for competing services); when training data has copyright issues; when safety mechanisms are undeveloped in a product already running in production in a high-sensitivity domain; when only a single key person understands how the model works.

Valuation adjustment material: Discoveries like “claims proprietary data assets, but actually uses only public data” or “claims fine-tuning, but uses a general-purpose model nearly as-is” may undermine the premises of claimed competitive advantages.

PMI plan integration: If an eval system is undeveloped, include it as an early post-investment improvement item. If API dependency is high, incorporate a vendor risk diversification plan into the post-investment 100-day plan. If safety mechanisms are insufficient, tying a readiness roadmap to investment conditions is effective.

Mapping to the traditional DD checklist

Adding the following items to the relevant axes of the complete technical DD checklist makes it applicable to startups that use LLMs in their products.

Additional item	Source	Priority
LLM usage pattern classification (API/FT/proprietary)	Architecture diagram, engineering interviews	◎
Degree of dependency on API providers	Infrastructure cost breakdown, terms of service	◎
Prompt version management status	Repository review	○
Evaluation dataset and CI pipeline existence	CI/CD pipeline, test code	○
Training data licensing and copyright handling	Data procurement records, legal documents	◎
LLM cost estimate per request	Unit economics documentation	○
Explicit model version pinning	API call code	△
Prompt injection countermeasure implementation	Security design docs, code review	◎
Output guardrail implementation status	Architecture design, library selection	○
Test harness readiness	Test code, CI configuration	○
HITL design presence (for high-sensitivity domains)	Spec docs, operational flow	◎

Items marked ◎ should be treated as “critical risk if undeveloped”: they can become leverage for investment conditions or price adjustments.
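
The checklist can also be kept as structured data so that critical gaps fall out automatically. A minimal sketch, assuming a hypothetical `ChecklistItem` record; only the meaning of ◎ is defined in the text, so the enum names for ○ and △ are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    CRITICAL = "◎"       # defined in the text: critical risk if undeveloped
    IMPORTANT = "○"      # name assumed for illustration
    NICE_TO_HAVE = "△"   # name assumed for illustration


@dataclass(frozen=True)
class ChecklistItem:
    name: str
    priority: Priority
    in_place: bool  # True when the DD review found the item implemented


def critical_gaps(items: list[ChecklistItem]) -> list[str]:
    """Names of ◎ items that the review found to be undeveloped."""
    return [
        item.name for item in items
        if item.priority is Priority.CRITICAL and not item.in_place
    ]


review = [
    ChecklistItem("Prompt injection countermeasure implementation",
                  Priority.CRITICAL, in_place=False),
    ChecklistItem("Training data licensing and copyright handling",
                  Priority.CRITICAL, in_place=True),
    ChecklistItem("Explicit model version pinning",
                  Priority.NICE_TO_HAVE, in_place=False),
]
print(critical_gaps(review))
```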

Conclusion

Technical DD in the AI era is completed by adding four dimensions to the traditional 7-axis framework: “LLM usage pattern classification,” “prompt management,” “evaluation systems and cost structure,” and “safety mechanism design.” These additional dimensions also serve as criteria for distinguishing startups that merely use AI as a wrapper from those with genuinely proprietary and robust AI capabilities.

Rather than accepting “we use AI” claims at face value, the foundational stance for technical DD of AI-powered startups is to decompose and evaluate which pattern applies and which risks are present, including how well-defended the system is against unexpected behavior.