Architecture Decision Axes for Integrating LLM/AI into Products

When integrating an LLM into a product, the question of “which architecture to choose” is simultaneously a technical decision and a strategic business choice. Cost structure, behavior at scale, and the durability of competitive advantage all flow from this single architectural choice.

The conclusion upfront: LLM integration architectures fall into four categories, and the selection is determined by four axes: accuracy requirements, data confidentiality, cost tolerance, and development speed. API usage excels in speed and flexibility, RAG is suited to injecting domain knowledge, fine-tuning buys accuracy and differentiation at the price of upfront investment, and custom models offer ultimate control at maximum cost. Understanding these four patterns and evaluating them against the four axes is the foundational design decision in AI product development.

The Four Architecture Patterns

Pattern 1: API Usage (Prompt Engineering)

This pattern calls APIs from external model providers (OpenAI, Anthropic, Google, etc.) and controls output through prompt design. It is currently the most widely adopted approach and serves as the starting point for most AI startups.

Cost structure: Pay-per-use billing (token price × usage volume). Initial fixed costs are effectively zero, and costs scale in proportion to users and processing volume. The choice of model tier (GPT-4o vs. GPT-4o-mini, etc.) can change costs several-fold or more, so matching the model to the use case matters.
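
The "token price × usage volume" relationship can be sketched in a few lines. This is a minimal illustration, not a pricing reference: the TokenPrice class, the LARGE and SMALL tiers, and their per-million-token prices are hypothetical values chosen for easy arithmetic, not any provider's actual rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """Hypothetical USD prices per one million tokens (not real provider rates)."""
    input_per_million: float
    output_per_million: float


def monthly_cost(price: TokenPrice, requests_per_month: int,
                 input_tokens: int, output_tokens: int) -> float:
    """Estimate the monthly bill as token price x usage volume."""
    input_cost = requests_per_month * input_tokens * price.input_per_million / 1_000_000
    output_cost = requests_per_month * output_tokens * price.output_per_million / 1_000_000
    return input_cost + output_cost


# Assumed tiers: a larger model priced 10x above a smaller one.
LARGE = TokenPrice(input_per_million=5.0, output_per_million=15.0)
SMALL = TokenPrice(input_per_million=0.5, output_per_million=1.5)

if __name__ == "__main__":
    for name, tier in (("large", LARGE), ("small", SMALL)):
        cost = monthly_cost(tier, requests_per_month=100_000,
                            input_tokens=1_000, output_tokens=500)
        print(f"{name}: ${cost:,.2f} per month")
```

Running the same workload through both tiers makes the cost gap between model choices concrete before any traffic exists.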

Scale characteristics: Horizontal scaling is straightforward but bounded by the provider's API rate limits. Large-scale traffic requires both client-side throttling (so requests stay within those limits) and an explicit ceiling on the cost budget.
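
One way to picture the two safeguards is a token-bucket throttle paired with a spend ceiling. The sketch below is an assumption-laden illustration: TokenBucket and BudgetGuard are hypothetical helper names, the limits are arbitrary, and real providers publish their own rate-limit rules that would determine the actual parameters.

```python
import time


class TokenBucket:
    """Client-side throttle: allows `rate_per_sec` requests on average,
    with bursts up to `capacity`. The clock is injectable for testing."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class BudgetGuard:
    """Refuses any call that would push spend above the monthly ceiling."""

    def __init__(self, monthly_ceiling_usd: float):
        self.ceiling = monthly_ceiling_usd
        self.spent = 0.0

    def charge(self, cost_usd: float) -> bool:
        if self.spent + cost_usd > self.ceiling:
            return False
        self.spent += cost_usd
        return True
```

A caller would check both try_acquire() and charge() before each API request, and queue or reject the request when either returns False.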

Strengths and weaknesses: Following the latest models is cheap and no infrastructure management is required, but you are directly exposed to provider pricing changes, specification changes, and service terminations. When your differentiation resides primarily in prompt design, competitors can also imitate it quickly.

Pattern 2: RAG (Retrieval-Augmented Generation)

Without modifying base model parameters, this pattern dynamically injects relevant documents and knowledge as context through search to improve output accuracy and freshness.

Cost structure: In addition to API usage costs, you incur vector database hosting fees (Pinecone, Weaviate, pgvector, etc.) and document indexing/embedding generation costs. However, initial investment is considerably lower than fine-tuning.

Scale characteristics: As document counts grow, maintaining retrieval accuracy becomes a challenge. As a rough guide, reranking or hybrid search (keyword + vector) tends to become necessary once the corpus reaches hundreds of thousands of documents, though the actual threshold depends on the corpus and the query mix.
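
Hybrid search needs some way to merge a keyword ranking with a vector ranking. The article does not name a method; the sketch below uses reciprocal rank fusion (RRF) as one common option, with the conventional constant k=60. The document IDs are placeholders.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    keyword_hits = ["a", "b", "c"]   # e.g. from a BM25 index (assumed)
    vector_hits = ["b", "d", "a"]    # e.g. from a vector store (assumed)
    print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

RRF only needs ranks, not comparable scores, which is why it is a convenient first choice when the two retrievers produce scores on different scales.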

Strengths and weaknesses: Domain knowledge can be injected without changing the model, so no retraining is needed when new data arrives. RAG is particularly well suited to use cases that always need current information and to internal document Q&A. The counterintuitive risk is that poor retrieval quality increases hallucinations: when incorrect context is passed, the model treats it as fact.
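
One hedge against the bad-context risk is to drop low-scoring chunks before they reach the prompt and to decline to answer when nothing relevant remains. The following is a minimal sketch under stated assumptions: the Chunk type, the [0, 1] similarity score, and the 0.75 threshold are all hypothetical and would need tuning against a real retriever.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    text: str
    score: float  # assumed similarity in [0, 1]; higher means more relevant


def build_prompt(question: str, chunks: list[Chunk],
                 min_score: float = 0.75, max_chunks: int = 3) -> Optional[str]:
    """Assemble a grounded prompt, or return None when no chunk is relevant enough.

    Returning None lets the caller fall back (e.g. "no answer found")
    instead of passing weak context that the model may treat as fact.
    """
    relevant = sorted(
        (c for c in chunks if c.score >= min_score),
        key=lambda c: c.score,
        reverse=True,
    )[:max_chunks]
    if not relevant:
        return None
    context = "\n".join(f"[{i}] {c.text}" for i, c in enumerate(relevant, start=1))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )
```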

Pattern 3: Fine-tuning

This pattern performs additional training on a general-purpose model using proprietary data. The goal is to standardize tone, format, and terminology for a specific domain, and to improve accuracy on specialized tasks.

Cost structure: Data preparation (including annotation), fine-tuning execution costs (GPU resources or provider billing), and evaluation/iteration cycle costs all accumulate. Since data quality and quantity directly affect accuracy, underestimating data preparation costs will derail plans.

Scale characteristics: Inference costs for the trained model may be comparable to or lower than the API-usage pattern (especially when using smaller base models). However, model update cycle management is added as an operational concern.

Strengths and weaknesses: Domain-specific accuracy can surpass general-purpose APIs, providing a differentiation rationale rooted in proprietary data. Behavior can be controlled without relying solely on system prompts, which may also make it harder, though not impossible, to override through prompt injection. The weaknesses are the large upfront investment and the fact that the accuracy ceiling depends on the base model. With insufficient or low-quality training data, performance can actually degrade below that of a general-purpose model.

Pattern 4: Custom Model

This pattern handles everything from model architecture design to training in-house. In practice, adoption outside of research institutions, major tech companies, and certain regulated industries (medical, defense, etc.) is limited.

Cost structure: Building and operating GPU/TPU clusters, maintaining an ML engineering team, and collecting/managing large-scale training data are all required. Initial investment can reach hundreds of millions to billions of yen.

Scale characteristics: Complete control is possible, but capturing the benefits of scale requires both model size and data volume. Scaling laws pay off more at larger scale, and larger scale in turn demands more computational resources.

Strengths and weaknesses: There is no API dependency risk, and proprietary data never has to be sent to external parties, which is a security advantage. For startups, however, practical use cases are limited, primarily to situations with specific regulatory or confidentiality requirements or to cases where the model itself is the core product.

The Four-Axis Decision Framework

Architecture selection is evaluated across four axes. Rate each axis as High/Mid/Low and assess fit with each pattern.

Axis 1: Accuracy Requirements

Evaluate the accuracy level demanded by the use case.

High (medical diagnosis support, legal document creation, financial reports): Fine-tuning or combination with RAG. Hallucination tolerance is low; domain-specific correctness is required.
Mid (customer support, content generation): API usage with appropriate prompt design is sufficient. Implement output quality controls (guardrails) separately.
Low (brainstorming, draft generation): API usage suffices. Prioritize response speed and cost.

The cardinal rule when accuracy requirements are high is to define evaluation criteria first. Before moving into fine-tuning, a benchmark set defining “what constitutes high accuracy” must be constructed—otherwise the improvement cycle cannot function.
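
A benchmark set can start very small. The sketch below shows the shape of such a harness under simplifying assumptions: the model is any callable from prompt to answer, scoring is case-insensitive exact match (real legal or medical tasks would need a richer metric), and the evaluate and meets_bar names are hypothetical.

```python
from typing import Callable


def evaluate(model: Callable[[str], str],
             benchmark: list[tuple[str, str]]) -> float:
    """Return the fraction of benchmark prompts answered correctly.

    `benchmark` is a list of (prompt, expected_answer) pairs.
    Scoring here is case-insensitive exact match, which is an assumption.
    """
    if not benchmark:
        raise ValueError("benchmark must contain at least one example")
    correct = sum(
        1 for prompt, expected in benchmark
        if model(prompt).strip().lower() == expected.strip().lower()
    )
    return correct / len(benchmark)


def meets_bar(accuracy: float, threshold: float) -> bool:
    """The agreed definition of 'high accuracy', fixed before any fine-tuning."""
    return accuracy >= threshold
```

Running the same benchmark against the plain API, a RAG variant, and a fine-tuned model turns "is fine-tuning worth it?" into a measurable comparison.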

Axis 2: Data Confidentiality

Evaluate whether the data used for training and inference can be sent externally.

High (contains personal information, medical records, trade secrets): Custom models or on-premises inference environments required. Sending data to API provider servers may create regulatory and contractual risks.
Mid (internal documents, customer data): Combination of RAG and private cloud, or enterprise API plans (with contractual guarantees against data being used for training).
Low (public information, general content): Standard API usage is appropriate.

A commonly overlooked point is the confidentiality of data sent during inference. Even if training data isn’t sent, prompt data at inference time may contain confidential information. If PII (personally identifiable information) enters prompts by design, an enterprise contract or on-premises environment is necessary.
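
Where a standard API must be used anyway, one partial mitigation is to mask obvious identifiers before the prompt leaves your infrastructure. The sketch below is deliberately simplistic: the two regular expressions (email addresses and hyphenated phone numbers) are illustrative assumptions, will miss many PII forms, and are not a substitute for the contractual or on-premises measures described above.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{2,4}-\d{2,4}-\d{4}\b")


def redact(prompt: str) -> str:
    """Replace email addresses and hyphenated phone numbers with placeholders."""
    prompt = EMAIL_RE.sub("[EMAIL]", prompt)
    prompt = PHONE_RE.sub("[PHONE]", prompt)
    return prompt


if __name__ == "__main__":
    print(redact("Contact taro@example.com or 03-1234-5678."))
```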

Axis 3: Cost Tolerance

Consider both current monthly costs and per-unit costs after scaling.

High (VC-funded growth stage, speed over cost): Run fast with API usage and defer cost optimization. Gain efficiency by routing tasks to different models (a high-accuracy model where it matters, a cheaper one elsewhere).
Mid (post-PMF, monetizing while optimizing): RAG to reduce API costs + use-case-specific model selection.
Low (cost structure is core to competitive advantage): Consider fine-tuning into smaller models via distillation, or building proprietary inference infrastructure.

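“API usage is cheap” applies only in the early stage. At a scale on the order of 10 billion tokens per month, the comparison can reverse: the fixed cost of fine-tuned smaller models or proprietary inference infrastructure may fall below the usage-based API bill. Run scale plans and cost trajectory simulations immediately after achieving PMF.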
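
The reversal can be estimated as a break-even point: the monthly token volume at which a fixed infrastructure cost is paid back by a lower per-token cost. This is a rough sketch; the function name and every number in the example (API price, fixed monthly cost, marginal cost) are hypothetical and chosen only so the arithmetic lands on a round figure.

```python
from typing import Optional


def break_even_tokens_millions(api_price_per_million: float,
                               fixed_monthly_cost: float,
                               marginal_cost_per_million: float) -> Optional[float]:
    """Monthly volume (in millions of tokens) above which self-hosting is cheaper.

    API cost:         volume * api_price_per_million
    Self-hosted cost: fixed_monthly_cost + volume * marginal_cost_per_million
    Returns None when self-hosting never becomes cheaper.
    """
    saving_per_million = api_price_per_million - marginal_cost_per_million
    if saving_per_million <= 0:
        return None
    return fixed_monthly_cost / saving_per_million


if __name__ == "__main__":
    # Assumed: $2.00 per million via API, $15,000 per month fixed, $0.50 marginal.
    volume = break_even_tokens_millions(2.0, 15_000.0, 0.5)
    print(f"Break-even at {volume:,.0f} million tokens per month")
```

With these assumed inputs the break-even falls at 10,000 million tokens per month; different prices move the point substantially, which is why the simulation should use your own figures.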

Axis 4: Development Speed

Evaluate product time-to-market and the team’s technical proficiency.

High (release within 3 months, small team): API usage only. No engineering capacity to build RAG or fine-tuning pipelines.
Mid (within 6 months, 1-2 ML engineers): RAG construction is feasible. Fine-tuning limited to small-scale use cases.
Low (1+ year development timeline, specialized team): Fine-tuning or proprietary inference environment can be incorporated.

The key insight is to choose the initial architecture with the expectation of discarding it. The practical pattern is to achieve PMF with API usage, then migrate to RAG or fine-tuning once challenges are clearly defined. Trying to design the “optimal architecture” from the start sacrifices hypothesis validation speed.

Architecture Selection Framework: Decision Matrix

Combining the four axes yields the following decision matrix:

Accuracy Req.	Data Confidentiality	Recommended Pattern	Notes
Low–Mid	Low	API usage	Focus on prompt design
Mid–High	Low	API usage + RAG	Retrieval quality design is key
Mid–High	Mid	Private RAG + Enterprise API	Data governance design is essential
High	High	Fine-tuning (on-prem or private cloud)	Pre-defining evaluation criteria is a prerequisite
High	High + core differentiation	Custom model	Generally not recommended for startups

When cost tolerance is low (in the sense that upfront investment is constrained) or the required development speed is high, choose a simpler pattern (API usage or API+RAG) one step down from the matrix above.
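
The matrix and the step-down rule can be expressed as a small lookup. This sketch makes assumptions the table leaves open: the overlapping Mid-accuracy/Low-confidentiality cell is resolved to the simpler pattern, combinations the table does not list return None, and the LADDER, MATRIX, and recommend names are hypothetical.

```python
from typing import Optional

# Patterns ordered from simplest to most complex, as in the matrix rows.
LADDER = [
    "API usage",
    "API usage + RAG",
    "Private RAG + Enterprise API",
    "Fine-tuning (on-prem or private cloud)",
    "Custom model",
]

# (accuracy, confidentiality) -> index into LADDER.
MATRIX = {
    ("low", "low"): 0,
    ("mid", "low"): 0,    # rows overlap here; assumption: the simpler pattern wins
    ("high", "low"): 1,
    ("mid", "mid"): 2,
    ("high", "mid"): 2,
    ("high", "high"): 3,
}


def recommend(accuracy: str, confidentiality: str, *,
              core_differentiation: bool = False,
              cost_tolerance: str = "mid",
              speed_need: str = "mid") -> Optional[str]:
    """Look up the matrix, then apply the one-step-down rule."""
    idx = MATRIX.get((accuracy, confidentiality))
    if idx is None:
        return None  # combination not covered by the matrix
    if idx == 3 and core_differentiation:
        idx = 4
    if cost_tolerance == "low" or speed_need == "high":
        idx = max(0, idx - 1)
    return LADDER[idx]
```

For example, recommend("high", "mid") returns the private RAG pattern, which matches the manufacturing knowledge-base case discussed later in the article.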

Reading Decision Logic Through Typical Cases

Case 1: Legal Document Review Tool for Enterprise Clients

Axis evaluation: Accuracy=High (precision of legal interpretation) · Confidentiality=High (client legal documents) · Cost=Mid · Speed=Mid

Selection: Fine-tuning (legal domain specialization) + on-premises inference or enterprise API contract

Rationale: Sending client data to an API provider’s servers may violate confidentiality obligations in client contracts. Fine-tuning is needed to capture the characteristics of legal interpretation (statutory interpretation, case law citation styles) more accurately than general-purpose models can.

Case 2: Product Description Generation for an E-commerce Platform

Axis evaluation: Accuracy=Mid (natural language, product characteristics) · Confidentiality=Low · Cost=High (initial) → Mid (at scale) · Speed=High

Selection: API usage (initial) → Consider fine-tuning after scaling

Rationale: Move fast with API usage until PMF is confirmed. When monthly generation volume reaches hundreds of thousands of descriptions, consider fine-tuning for cost reduction (distillation into smaller models) and quality improvement (brand tone consistency). If fine-tuning is incorporated from the start, the training data is wasted when the use case changes.

Case 3: Equipment Maintenance Knowledge Base for Manufacturing

Axis evaluation: Accuracy=High (accurate reference to equipment specifications) · Confidentiality=Mid (contains manufacturing know-how) · Cost=Mid · Speed=Mid

Selection: RAG (vectorizing internal manuals and maintenance records) + enterprise API contract

Rationale: Equipment maintenance knowledge is frequently updated (new model introductions, accumulation of failure cases). The operational cost of RAG with dynamic knowledge injection is lower than that of repeated fine-tuning. Confidentiality is addressed through enterprise contracts (guarantees that data is not used for training).

Application to Evaluation and Investment Due Diligence

AI-integrated startups require due diligence (DD) considerations beyond the traditional seven axes, and architecture selection is one of the important evaluation points. For investment and acquisition, confirming whether a team can explain “why this architecture” serves as an indicator of technical maturity.

Three questions to confirm:

“Why did you choose API usage (or RAG/fine-tuning)?”

Teams that can articulate rational reasons (speed priority, cost optimization, accuracy requirements, confidentiality) make systematic technical decisions. If “we just used OpenAI” is the answer, there may be insufficient consideration of cost structure and risks at scale.

“Is there an architecture migration plan for after scaling?”

Confirm both the reasoning behind the current architecture selection and whether there is a roadmap for when and where to migrate. Since the source of competitive advantage for AI startups lies in proprietary data assets and encoded domain knowledge, having a migration plan aimed in that direction matters.

“Is the LLM cost structure understood?”

The team should be able to state its current monthly LLM costs, its inference cost per user, and its outlook for unit cost changes at scale. Teams that cannot answer these immediately are harboring cost structure risk. Confirm that LLM costs are incorporated into unit economics calculations.
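
The figures a team should have at hand reduce to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the inputs are made-up numbers, and gross margin is computed per user from revenue, LLM cost, and other cost of goods sold.

```python
def llm_cost_per_user(monthly_llm_cost: float, active_users: int) -> float:
    """Average monthly inference cost attributable to one active user."""
    if active_users <= 0:
        raise ValueError("active_users must be positive")
    return monthly_llm_cost / active_users


def gross_margin(revenue_per_user: float, llm_cost: float,
                 other_cogs_per_user: float = 0.0) -> float:
    """Per-user gross margin as a fraction of revenue, with LLM cost included."""
    if revenue_per_user <= 0:
        raise ValueError("revenue_per_user must be positive")
    return (revenue_per_user - llm_cost - other_cogs_per_user) / revenue_per_user


if __name__ == "__main__":
    # Assumed figures: $3,000 monthly LLM bill, 1,000 users, $20 revenue per user.
    per_user = llm_cost_per_user(3_000.0, 1_000)
    print(f"LLM cost per user: ${per_user:.2f}")
    print(f"Gross margin: {gross_margin(20.0, per_user, 2.0):.0%}")
```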

Conclusion

The selection of LLM integration architecture can be structured as: “understand the characteristics of the four patterns, then evaluate against the four axes (accuracy requirements, data confidentiality, cost, and speed).”

The most important principle is gradual architectural evolution. The staged migration from API usage → RAG → fine-tuning allows insights gained at each phase to inform the next architectural selection. Aiming for a “perfect architecture” from the start is a high-risk choice in the early, high-uncertainty phase.

For investors and technical DD professionals, evaluating the rationality of architectural selection provides insight into whether the team makes technical decisions systematically. Rather than probing technical details, asking “why did you make that choice?” allows you to assess the team’s technical maturity.
