Use RAG when your model needs fresh, private, or frequently-changing facts. Use fine-tuning when your model needs to learn a style, a format, or a task it can't learn from in-context examples alone. In most production AI products we ship, the right answer is RAG first, then fine-tune only the gaps that RAG can't close. The two approaches are often pitched as rivals — they're not. RAG changes what the model knows; fine-tuning changes how the model behaves. Teams that get this backwards usually spend six weeks fine-tuning a base model to answer questions about their product documentation, when a two-day RAG setup over the same docs would have landed better accuracy with lower cost.
The decision matrix is simpler than the vendor marketing makes it sound. Data that changes (pricing, availability, policies, customer records, ticket history) belongs in a retrieval layer, because baking it into weights means re-training every time something changes. Behaviour that's hard to describe in a prompt (a consistent brand voice, a specific output schema the model keeps breaking, a domain-specific vocabulary) belongs in fine-tuning. Real products usually want both — and the order matters: RAG first, measure the failure modes, fine-tune only the ones retrieval can't solve. We cover this decision on our AI development services page and it's one of the first architectural conversations we have on every engagement.
This post is the walk-through we wish existed when we started shipping LLM products: what RAG actually is under the hood, what fine-tuning actually does (and doesn't), the four-question test we use to decide per feature, and the hybrid architectures we deploy in production at CodeLamda. If you want to skip the reading and talk through your specific use case, grab a 30-minute call.
What is RAG and what problem does it actually solve?
RAG — retrieval-augmented generation — gives a language model access to information it didn't see during training by pulling relevant chunks of your data into the prompt at query time. The typical architecture: chunk your documents, embed them, store the embeddings in a vector database, retrieve the top-K most relevant chunks for each user query, and paste those chunks into the prompt alongside the question. The model answers using the retrieved context.
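The whole loop is simple enough to sketch end-to-end. This is a toy version, not a production implementation: a bag-of-words counter stands in for the real embedding model, and an in-memory list stands in for the vector database.

```python
from collections import Counter
import math

def chunk(text, size=200, overlap=50):
    """Split text into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text):
    """Toy bag-of-words 'embedding' -- a real system calls an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the top-K chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, context_chunks):
    """Paste retrieved context into the prompt alongside the question."""
    context = "\n---\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Swap `embed` for a real embedding call and the list for a vector store, and the shape of the system stays exactly this.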
What kinds of problems is RAG the right tool for?
Anything that boils down to "answer questions using this specific pile of text." Customer support bots that need to cite your docs. Internal knowledge-base assistants. Product search over a catalogue. Sales-enablement tools that summarise past deal notes. Any time the answer is in your data but isn't in the model's training set, RAG is almost always the first thing to try.
What are the parts of a production-grade RAG system?
Chunker (usually a recursive splitter with overlap), embedder (OpenAI text-embedding-3-large, Cohere, or a local model), vector store (pgvector if you already run Postgres, Pinecone or Qdrant if you need scale), retriever (dense, sparse BM25, or hybrid), reranker (Cohere Rerank or a cross-encoder), and a prompt template that cites sources cleanly. Skip the reranker and your accuracy ceiling is 10–20 points lower than it needs to be.
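The reranker itself is usually a hosted model or cross-encoder, but the hybrid-retrieval step is simple enough to sketch. Reciprocal-rank fusion is one common way to merge the dense and BM25 result lists without having to reconcile their score scales; the document ids below are hypothetical.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal-rank fusion: merge several ranked lists of doc ids.
    Each list contributes 1/(k + rank) per document; k=60 is the
    conventional default from the original RRF formulation."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that ranks well in both lists floats to the top even if neither retriever put it first, which is exactly the behaviour you want from hybrid retrieval.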
What is fine-tuning and when does it beat RAG?
Fine-tuning updates the model's weights by training it further on your examples. It changes the model's behaviour in a way no prompt can replicate. The weights now "know" — in a pattern-matching sense — that when they see inputs like X, they should produce outputs like Y. This is permanent, cheap at inference time, and completely invisible to the user.
What's the difference between full fine-tuning, LoRA, and instruction tuning?
Full fine-tuning updates every weight in the model — expensive, and rarely what anyone actually wants. LoRA (and its quantised variant, QLoRA) freezes most weights and trains small low-rank adapters instead — roughly 90% of the gain at 10% of the cost, and what most production fine-tunes actually use. Instruction tuning is defined by the data rather than the method: you train the model (often via LoRA) on instruction–response pairs so it learns to follow a new instruction format.
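The cost gap shows up directly in parameter counts. A rough sketch, assuming a single square projection matrix (real adapters target several such matrices per layer, but the ratio is the same):

```python
def lora_param_counts(d_in, d_out, rank):
    """Compare trainable parameters: full update vs a rank-r LoRA adapter.
    LoRA freezes W (d_out x d_in) and trains B (d_out x r) and A (r x d_in),
    so the effective weight becomes W + B @ A."""
    full = d_in * d_out
    adapter = rank * (d_in + d_out)
    return full, adapter

# A 4096x4096 projection at rank 8: the adapter is ~0.4% of the full matrix.
full, adapter = lora_param_counts(4096, 4096, 8)
```

That three-orders-of-magnitude reduction in trainable parameters is why LoRA fits on commodity GPUs where full fine-tuning doesn't.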
What workflows does fine-tuning uniquely solve?
Three patterns: forcing a strict output schema the base model keeps breaking (structured extraction into exact JSON); matching a domain writing style a prompt can't capture (legal drafting, clinical notes); and teaching a classifier-style task with millions of examples where pasting them all into context is impractical. Everything else, try RAG first.
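For the structured-extraction case, the training data is the whole trick: each example pairs a messy input with the exact JSON you want back. A minimal sketch of one training record in the chat-style JSONL format most hosted fine-tuning APIs accept (exact field names vary by provider, and the invoice content is made up for illustration):

```python
import json

# One training example: system prompt fixes the schema, assistant turn
# shows the exact JSON the model should emit. A real dataset repeats
# this pattern across thousands of varied inputs.
example = {
    "messages": [
        {"role": "system", "content": "Extract fields as JSON: {name, date, amount}."},
        {"role": "user", "content": "Invoice from Acme on 2024-03-01 for $1,200."},
        {"role": "assistant", "content": '{"name": "Acme", "date": "2024-03-01", "amount": 1200}'},
    ]
}
line = json.dumps(example)  # one line of the training JSONL file
```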
When do you use RAG, when do you fine-tune, and when do you do both?
Use RAG alone when your data changes more often than monthly, or when you need citations. Use fine-tuning alone when you need a specific output format and your knowledge is stable. Use both when you need frequently-changing knowledge and a specific behaviour — RAG handles the freshness, fine-tuning handles the behaviour. In our work at CodeLamda, the ratio is roughly 70% pure RAG, 20% hybrid, 10% pure fine-tuning.
The four-question test we run on every client build
Does the required knowledge change more than once a month? RAG. Do you need to cite sources to the user? RAG. Does the base model keep breaking an output schema a prompt can't fix? Fine-tune. Do you have >10,000 labelled examples of the behaviour you want? Fine-tune. Two yeses from the same column? Start there.
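The test above is mechanical enough to write down as a function. This is a sketch of the tally logic, with one assumption the post implies but doesn't state: two yeses in both columns means hybrid.

```python
def choose_approach(knowledge_changes_monthly, needs_citations,
                    schema_breaks_despite_prompting, labelled_examples):
    """Tally the four-question test: two yeses in the same column
    decide the starting point."""
    rag_votes = int(knowledge_changes_monthly) + int(needs_citations)
    ft_votes = int(schema_breaks_despite_prompting) + int(labelled_examples > 10_000)
    if rag_votes >= 2 and ft_votes >= 2:
        return "hybrid"
    if rag_votes >= 2:
        return "rag"
    if ft_votes >= 2:
        return "fine-tune"
    return "rag-first"  # no clear winner: try RAG, measure, then decide
```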
What does the hybrid architecture look like in production?
Base model, optionally fine-tuned for output format and brand voice, wrapped in a RAG pipeline that injects fresh facts from your data layer. The fine-tune makes the model behave right; the RAG makes it know right. Observability at both layers — we recommend LangSmith or Langfuse for traces — lets you see which layer is responsible when something goes wrong.
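The request flow can be sketched with the two layers and their trace points made explicit. The `retriever`, `model`, and `tracer` callables here are hypothetical stand-ins: in production they'd be your vector-store client, your (fine-tuned) model endpoint, and a LangSmith- or Langfuse-style trace emitter.

```python
def answer(query, retriever, model, tracer=print):
    """Hybrid flow: retrieval supplies fresh facts, the fine-tuned model
    supplies behaviour. Each layer emits a trace event so you can tell
    which one is responsible when an answer goes wrong."""
    chunks = retriever(query)
    tracer({"layer": "retrieval", "query": query, "chunks": len(chunks)})
    prompt = "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {query}"
    reply = model(prompt)
    tracer({"layer": "generation", "chars": len(reply)})
    return reply
```

Because both dependencies are injected, the same function runs against stubs in tests and real services in production.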
What does RAG cost compared to fine-tuning?
RAG is cheap to start, linear to run. Setup: 1–2 weeks of engineering, <$100 of embedding spend for most corpora. Per-query cost: the retrieval hop (~$0.0001) plus the LLM call ($0.005–$0.02). No GPU, no training, no ops. You can be in production in a week.
Fine-tuning has a one-time cost and then runs cheap. Training a LoRA adapter on a good dataset: 3–6 weeks, $1,000–$20,000 depending on model and data prep. Per-query cost is the same as the base model, or lower if you're swapping in a smaller fine-tuned model for a larger base. The hidden cost is re-training — every time the behaviour needs to shift, you're back in the training loop.
When is fine-tuning cost-effective?
When you'd be running the same prompt template hundreds of thousands of times with the same output-shape requirements. Fine-tuning lets you drop most of the prompt boilerplate, which reduces per-query tokens, which saves real money at scale. Below ~100k queries/month, the economics favour RAG.
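The break-even point falls out of simple division: one-time training cost over per-query savings. A sketch using the post's rough figures (the $5,000 fine-tune and the $0.01 → $0.004 per-query drop are illustrative numbers, not quotes):

```python
def breakeven_queries(training_cost, rag_cost_per_query, ft_cost_per_query):
    """Queries needed before a one-time fine-tune pays for itself
    through lower per-query spend."""
    saving = rag_cost_per_query - ft_cost_per_query
    if saving <= 0:
        return float("inf")  # fine-tune never pays back on cost alone
    return training_cost / saving

# A $5,000 fine-tune that trims prompt boilerplate from $0.01 to $0.004/query
# needs roughly 830k queries to pay for itself.
q = breakeven_queries(5_000, 0.010, 0.004)
```

At 100k queries/month that's most of a year to break even, which is where the "below ~100k/month, RAG wins" rule of thumb comes from.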
What are the common mistakes we see teams make?
Fine-tuning to add knowledge instead of to change behaviour. Shipping RAG without a reranker or without citations. Using cosine similarity alone when hybrid (BM25 + dense) would double accuracy. Testing retrieval quality only with a handful of hand-picked queries instead of a real eval set. Skipping evals entirely and shipping on vibes.
What's the 30-day test we run on every RAG build?
Build a 100-query eval set with expected answers on day one. Run it after every change to chunking, embedding, retrieval, or prompt. If accuracy drops below the previous release, revert. Without this loop you're flying blind, and LLM behaviour changes subtly every time a model provider updates weights.
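The loop is a regression gate, and a minimal version fits in a few lines. This sketch assumes the simplest scoring rule (expected substring appears in the answer); real eval sets usually grade with an LLM judge or exact-match on structured fields.

```python
def run_eval(eval_set, answer_fn, baseline_accuracy):
    """Score the pipeline against a fixed eval set and fail the release
    if accuracy drops below the previous one.
    eval_set: [(query, expected_substring)]; answer_fn: the pipeline."""
    hits = sum(1 for q, expected in eval_set
               if expected.lower() in answer_fn(q).lower())
    accuracy = hits / len(eval_set)
    return accuracy, accuracy >= baseline_accuracy

# Stub pipeline for illustration -- in practice answer_fn is your RAG stack.
evals = [("refund window?", "30 days"), ("free shipping threshold?", "$50")]
acc, ok_to_ship = run_eval(evals, lambda q: "Returns within 30 days; free shipping over $50.", 0.9)
```

Wire this into CI so every chunking, embedding, or prompt change runs the gate automatically.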
Should you run open-source models?
If your data can't leave your cloud, yes. Llama 3.x served from a pinned vLLM stack or via Bedrock, paired with an embedding model you can host the same way (Bedrock offers Titan and Cohere embeddings; OpenAI's text-embedding-3-large doesn't run inside your VPC), is a reasonable production setup. Otherwise, the frontier-model economics still beat self-hosting for most use cases. The gap is narrowing, but it's not closed.
What's the right approach for your AI product?
If you're early and your data is text-heavy, start with RAG. You'll get to production faster, you'll learn from real user queries, and you'll discover which behaviours RAG can't fix — those are your fine-tuning candidates. If you already have RAG and it's plateauing, a targeted LoRA adapter on the failure cases is usually the next move. If you're not sure which category your product falls in, we do a one-hour architecture review as part of every AI development engagement — bring whatever you have in Figma or a Google Doc and we'll tell you honestly. Book a call and we'll scope it with you.