
RAG Pipeline: How to Build One

By James Killick, November 9, 2025

TL;DR: A RAG pipeline retrieves relevant chunks from your own data and feeds them to an LLM so it answers from your content, not from training data alone. You need four parts: a data source, an embeddings model, a vector database, and a query layer. This guide walks through each one in plain terms.

A RAG pipeline lets an LLM answer questions using your data. Instead of relying on what the model learned during training, it pulls relevant content from your documents first, then generates a response. That is why it works for internal knowledge bases, customer support tools, and anything where accuracy to your specific content matters.

RAG stands for retrieval-augmented generation. The retrieval part is the hard part.

What does a RAG pipeline actually do?

At its core, a RAG pipeline does two things. First, it finds the most relevant chunks of text from your data store. Second, it sends those chunks to the LLM as context alongside the user's question.

Without retrieval, the LLM is more likely to make things up or give generic answers. With retrieval, it draws from your actual content.

The pipeline sits between your data and the model. It is the plumbing that makes AI responses grounded and trustworthy.

What are the four parts you need to build one?

Every RAG pipeline has the same four building blocks.

1. A data source

This is where your content lives. PDFs, Notion pages, database records, help docs, whatever you want the AI to know about.

2. An embeddings model

This converts your text into vectors, numerical representations that capture meaning. OpenAI's `text-embedding-3-small` is a solid starting point. Open-source options like `nomic-embed-text` work well for self-hosted setups.

3. A vector database

This stores the embeddings and lets you search by similarity rather than keyword. Pinecone, Weaviate, and pgvector (an extension for Postgres) are the common choices.

4. A query layer

This is the code that takes a user's question, embeds it, searches the vector database for similar chunks, and sends the top results to the LLM with a prompt.

Get these four parts right and you have a working RAG pipeline.
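To make "search by similarity rather than keyword" concrete, here is a minimal sketch using cosine similarity. The three-dimensional vectors are hand-written for illustration (an assumption); a real embeddings model produces vectors with hundreds or thousands of dimensions, and a vector database would run this ranking for you.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical, hand-written "embeddings". Real ones come from an embeddings model.
toy_index = {
    "How to reset your password": [0.9, 0.1, 0.0],
    "Refund policy for annual plans": [0.1, 0.9, 0.1],
    "Changing your login credentials": [0.8, 0.2, 0.1],
}

# Pretend this is the embedding of the question "I forgot my password".
query_vector = [0.85, 0.15, 0.05]

# Rank stored texts by similarity to the query, most similar first.
ranked = sorted(
    toy_index,
    key=lambda text: cosine_similarity(query_vector, toy_index[text]),
    reverse=True,
)
```

Note that the two account-related texts rank above the refund text even though the query shares no exact wording with "Changing your login credentials". That is the property keyword search lacks.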

How do you prepare and chunk your data?

Before anything goes into a vector database, you need to split your documents into chunks. The chunking strategy matters more than most people expect.

Chunks that are too large carry too much irrelevant content. Chunks that are too small lose context. A common starting point is 512 tokens per chunk with a 50-token overlap so sentences do not get cut at the edges.

For structured documents like product specs or support articles, chunk by section heading. For unstructured text like PDFs or transcripts, use a sliding window.
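A sliding window with overlap might look like the sketch below. It approximates tokens by splitting on whitespace, which is an assumption made to keep the example dependency-free; in practice you would count tokens with the tokenizer that matches your embeddings model. The function name and defaults are illustrative.

```python
def sliding_window_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to chunk_size tokens, each sharing
    `overlap` tokens with the previous chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    tokens = text.split()  # assumption: whitespace words stand in for model tokens
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the defaults, each new chunk starts 462 tokens after the previous one, so the last 50 tokens of one chunk are repeated at the start of the next.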

Store metadata alongside each chunk: the source file, page number, section title. You will need it later to show users where the answer came from.
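As a rough sketch of heading-based chunking with metadata attached, the example below assumes the document uses Markdown-style `#` headings. The `Chunk` class and its field names are hypothetical; add a page number or any other field your sources provide.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source_file: str
    section_title: str


def chunk_by_heading(document: str, source_file: str) -> list[Chunk]:
    """Start a new chunk at every line beginning with '#'."""
    # Text before the first heading is filed under a placeholder title.
    sections: list[tuple[str, list[str]]] = [("Untitled", [])]
    for line in document.splitlines():
        if line.startswith("#"):
            sections.append((line.lstrip("#").strip(), []))
        else:
            sections[-1][1].append(line)

    chunks: list[Chunk] = []
    for title, lines in sections:
        body = "\n".join(lines).strip()
        if body:  # skip headings with no content under them
            chunks.append(Chunk(text=body, source_file=source_file, section_title=title))
    return chunks
```

Because every chunk carries its source file and section title, the query layer can cite them next to the generated answer.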

How do you build the retrieval step?

Retrieval is where most RAG pipelines fail or succeed. The basic version is straightforward.

1. Embed the user's question using the same model you used to embed your data.
2. Run a similarity search against your vector database.
3. Return the top 3-5 chunks.
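The three steps above can be sketched in a few lines. To stay runnable without external services, this example swaps in a toy bag-of-words embedding and an in-memory list for the embeddings model and vector database (both are assumptions for illustration). The shape of the code is what matters: the question and the chunks go through the same embedding function.

```python
import math
import re


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(texts: list[str]) -> dict[str, int]:
    """Map every distinct word in the corpus to a vector position."""
    words = sorted({word for text in texts for word in tokenize(text)})
    return {word: i for i, word in enumerate(words)}


def bow_embed(text: str, vocab: dict[str, int]) -> list[float]:
    """Toy stand-in for an embeddings model: normalised word counts."""
    vec = [0.0] * len(vocab)
    for word in tokenize(text):
        if word in vocab:
            vec[vocab[word]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def retrieve(question: str, chunks: list[str], vocab: dict[str, int],
             index: list[list[float]], top_k: int = 3) -> list[str]:
    # Step 1: embed the question with the same function used for the chunks.
    query_vec = bow_embed(question, vocab)
    # Step 2: similarity search (dot product of unit vectors = cosine similarity).
    scored = [
        (sum(q * d for q, d in zip(query_vec, doc_vec)), chunk)
        for chunk, doc_vec in zip(chunks, index)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Step 3: return the top results.
    return [chunk for _, chunk in scored[:top_k]]


chunks = [
    "To reset your password, open account settings.",
    "Refunds are processed within five business days.",
    "Shipping to Australia takes two weeks.",
]
vocab = build_vocab(chunks)
index = [bow_embed(chunk, vocab) for chunk in chunks]
top = retrieve("How do I reset my password?", chunks, vocab, index, top_k=2)
```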

That works for simple use cases. For production, you usually need to go further.

Hybrid search combines vector similarity with keyword search (BM25). It catches cases where exact terms matter, like product codes or names.
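One common way to merge the two result lists is reciprocal rank fusion (RRF), sketched below. This is one fusion option among several, not necessarily what any particular vector database does internally. The document IDs are hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by both searches rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["doc_a", "doc_b", "doc_c"]   # from the similarity search
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # from BM25 or another keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF only needs ranks, not raw scores, which sidesteps the problem that vector similarities and BM25 scores are on different scales.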

Re-ranking runs a second pass over your top results using a cross-encoder model to sort them by relevance to the specific question. Cohere and Jina both offer re-ranking APIs.
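The two-pass shape of re-ranking can be sketched as follows. The scoring function here is a simple word-overlap stand-in so the example runs on its own (an assumption); in a real system `score_pair` would call a cross-encoder model or a hosted re-ranking API instead.

```python
import re
from typing import Callable


def overlap_score(question: str, chunk: str) -> float:
    """Toy relevance score: the share of question words that appear in the chunk."""
    question_words = set(re.findall(r"[a-z0-9]+", question.lower()))
    chunk_words = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    if not question_words:
        return 0.0
    return len(question_words & chunk_words) / len(question_words)


def rerank(question: str, candidates: list[str],
           score_pair: Callable[[str, str], float] = overlap_score,
           top_n: int = 3) -> list[str]:
    """Second pass: score each (question, chunk) pair and keep the best top_n."""
    ordered = sorted(candidates, key=lambda chunk: score_pair(question, chunk), reverse=True)
    return ordered[:top_n]


candidates = [
    "General warranty terms apply to all products.",
    "The X200 model has a two year warranty.",
    "Contact support for help.",
]
best = rerank("warranty for model x200", candidates, top_n=2)
```

The first-pass search stays cheap and broad; the more expensive pairwise scoring only runs on the handful of candidates it returns.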

HyDE (Hypothetical Document Embeddings) asks the LLM to generate a hypothetical answer first, embeds that, and uses it to search. It often finds better matches than embedding the raw question, because the hypothetical answer reads more like the documents you have stored.
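A minimal sketch of the HyDE flow is shown below. Everything model-related is a stand-in: `fake_llm` returns a canned answer where a real system would call an LLM, and the "embedding" is just a set of words compared by overlap. These are assumptions made so the example runs without external services.

```python
import re
from typing import Callable


def word_set_embed(text: str) -> frozenset[str]:
    """Toy stand-in for an embeddings model: the set of words in the text."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def toy_similarity(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard overlap between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search(query_embedding: frozenset[str], documents: list[str], top_k: int) -> list[str]:
    ranked = sorted(
        documents,
        key=lambda doc: toy_similarity(query_embedding, word_set_embed(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def hyde_search(question: str, generate_hypothetical: Callable[[str], str],
                documents: list[str], top_k: int = 1) -> list[str]:
    # Embed a hypothetical answer instead of the raw question.
    hypothetical = generate_hypothetical(question)
    return search(word_set_embed(hypothetical), documents, top_k)


def fake_llm(question: str) -> str:
    # Hypothetical canned output; a real implementation would prompt an LLM.
    return "Your invoice increased because usage charges exceeded the plan allowance."


documents = [
    "Our office is closed on public holidays.",
    "Usage charges above the plan allowance are added to the invoice.",
]
question = "Why is my bill higher this month?"
raw_result = search(word_set_embed(question), documents, top_k=1)
hyde_result = hyde_search(question, fake_llm, documents)
```

In this toy setup the raw question matches the wrong document on the word "is", while the hypothetical answer shares the billing vocabulary of the right one. The trade-off is an extra LLM call before every search.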

For teams with complex data, we find hybrid search plus re-ranking covers 80% of retrieval quality problems. Start there before building anything more involved.

How do you connect the retrieved chunks to the LLM?

Once you have your top chunks, you build a prompt. A standard structure looks like this:

```
You are a helpful assistant. Use only the context below to answer the question.
If the answer is not in the context, say so.

Context:
[chunk 1]
[chunk 2]
[chunk 3]

Question: [user question]
```

Keep the system prompt short and direct. Tell the model to stay within the provided context. That instruction does the most to reduce hallucination, although it does not eliminate it.
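Filling in that template is a small string-building step. The sketch below uses a hypothetical `build_prompt` helper; how you pass the result to the model depends on your LLM client.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble the instruction, retrieved chunks, and user question into one prompt."""
    context = "\n\n".join(chunks)
    return (
        "You are a helpful assistant. Use only the context below to answer the question.\n"
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days.", "Contact support to start a refund."],
)
```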

For multi-turn conversations, you need to manage history carefully. Sending the full conversation plus retrieved chunks can push you over the context window. A common fix is to summarise older turns or use a memory layer like a short-term store keyed by session.
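One simple way to stay inside the context window is to reserve room for the retrieved chunks first, then keep only the most recent turns that fit. The sketch below estimates tokens by counting whitespace-separated words, which is an assumption; use your model's tokenizer for real budgets. Dropped turns could be summarised rather than discarded.

```python
def estimate_tokens(text: str) -> int:
    # Assumption: one word is roughly one token. Replace with a real tokenizer.
    return len(text.split())


def fit_history(turns: list[str], retrieved_chunks: list[str], max_tokens: int) -> list[str]:
    """Return the most recent turns that fit after reserving space for the chunks."""
    budget = max_tokens - sum(estimate_tokens(chunk) for chunk in retrieved_chunks)
    kept: list[str] = []
    for turn in reversed(turns):  # walk from newest to oldest
        cost = estimate_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept


# A short-term store keyed by session ID (hypothetical structure).
history_by_session: dict[str, list[str]] = {
    "session-1": ["user: hi there how are you", "assistant: fine thanks", "user: what about refunds"],
}
recent = fit_history(history_by_session["session-1"], ["refund policy text here"], max_tokens=11)
```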

If your app needs to add AI to an existing platform without a full rebuild, RAG is usually the integration pattern that makes the most sense. You do not need to retrain anything.

What does a production RAG pipeline look like?

A local prototype and a production system are very different things.

In production you need:

- Async ingestion so new documents get embedded and indexed without blocking the app
- Chunk versioning so you can re-embed when you change your chunking strategy
- Observability so you can see which queries returned poor results and why
- Latency budgeting because retrieval + re-ranking + LLM inference all add up
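Chunk versioning can be as light as storing a fingerprint of the chunking settings next to each embedded chunk. The sketch below is one possible approach, with hypothetical function names and config keys: if the stored fingerprint no longer matches the current settings, the chunk is due for re-embedding.

```python
import hashlib
import json


def chunking_version(config: dict) -> str:
    """Derive a short, stable fingerprint from the chunking settings."""
    canonical = json.dumps(config, sort_keys=True)  # key order must not matter
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def needs_reembedding(stored_version: str, current_config: dict) -> bool:
    """True when a chunk was produced under different chunking settings."""
    return stored_version != chunking_version(current_config)


original_config = {"chunk_size": 512, "overlap": 50, "strategy": "sliding_window"}
stored_version = chunking_version(original_config)
```

Including the embeddings model name in the config would also flag chunks for re-embedding when you switch models.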

Tools like LangSmith and Langfuse help with observability. For latency, cache frequent queries and use a faster model for retrieval scoring.

For teams that want a reference implementation, AILED publishes open tooling for AI-led document systems that covers a lot of this groundwork.

We have built RAG systems for clients across government and enterprise. The NSW Government work involved strict data handling requirements, which shaped how we structured the retrieval layer. Briometrix needed high-accuracy retrieval across dense technical content. The architecture decisions are different in each case, but the four building blocks stay the same.

If your team is working through these decisions, our AI app development service covers RAG builds end to end.

What evaluation approach should you use?

RAG pipelines are hard to evaluate because the quality of an answer depends on both retrieval and generation.

Start with two metrics:

- Retrieval precision: did the retrieved chunks actually contain the answer?
- Answer faithfulness: did the LLM stay within those chunks?

RAGAS is a popular open-source framework for automated RAG evaluation. It scores both metrics against a test set of question-answer pairs you build from your own data.

Build that test set before you go to production. Twenty to thirty representative queries is enough to catch most regressions when you change chunking or retrieval settings.
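Before adopting a framework, a hand-rolled check over that test set is enough to catch retrieval regressions. The sketch below computes a simple hit rate: the share of questions for which the expected source appears among the retrieved results. It is a rough proxy, not the metric RAGAS computes, and the test-set fields and stub retriever are hypothetical. Faithfulness is harder to score without an LLM or human judge, so it is left out here.

```python
from typing import Callable


def retrieval_hit_rate(test_set: list[dict[str, str]],
                       retrieve: Callable[[str], list[str]]) -> float:
    """Share of test questions whose expected source is among the retrieved sources."""
    if not test_set:
        return 0.0
    hits = sum(
        1 for case in test_set
        if case["expected_source"] in retrieve(case["question"])
    )
    return hits / len(test_set)


test_set = [
    {"question": "How do I reset my password?", "expected_source": "account.md"},
    {"question": "What is the refund window?", "expected_source": "billing.md"},
    {"question": "Do you ship overseas?", "expected_source": "shipping.md"},
]


def stub_retrieve(question: str) -> list[str]:
    # Hypothetical retriever returning source file names; swap in your real one.
    canned = {
        "How do I reset my password?": ["account.md", "security.md"],
        "What is the refund window?": ["billing.md"],
        "Do you ship overseas?": ["billing.md", "account.md"],
    }
    return canned.get(question, [])


hit_rate = retrieval_hit_rate(test_set, stub_retrieve)
```

Run it before and after every change to chunking or retrieval settings and compare the numbers.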

CTOs managing AI integrations across a platform will find more detail on the system-level considerations at our CTO resource page.

---

Devwiz has built over 200 apps since 2015. We are AI specialists based in Sydney. If you want a RAG pipeline built properly, talk to us.

---


Frequently asked questions

What is a RAG pipeline in simple terms?

A RAG pipeline connects an LLM to your own data. When a user asks a question, the pipeline searches your documents for relevant content, then passes that content to the LLM so it answers from your data rather than guesswork. It is the standard approach for building AI that knows about your specific business, products, or knowledge base.

How long does it take to build a RAG pipeline?

A working prototype can be built in a day or two with tools like LangChain or LlamaIndex. A production-ready system with proper ingestion, re-ranking, observability, and latency management takes two to four weeks depending on data complexity. Most of that time goes into data preparation and retrieval quality tuning, not the LLM integration itself.

Which vector database should I use?

For most projects, pgvector (Postgres extension) is the lowest-friction starting point. You probably already have Postgres. Pinecone is worth it when you have millions of vectors and need managed scaling. Weaviate suits teams who want built-in hybrid search without configuring it separately. Start with pgvector and switch if you hit scale limits.

What is the difference between RAG and fine-tuning?

Fine-tuning changes the model's weights so it learns new behaviour or style. RAG gives the model information at query time without changing the model itself. RAG is better for dynamic or frequently updated content. Fine-tuning suits cases where you want the model to respond in a specific format or tone consistently. Many production systems use both.

Can RAG work with structured data like databases?

Yes, but the approach is different. You either convert structured data to natural language summaries and embed those, or you use a text-to-SQL layer so the LLM queries your database directly. The second approach works well for analytics questions but needs guardrails to prevent bad queries. For mixed structured and unstructured data, a hybrid architecture is usually the right call.
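As an illustration of the guardrail idea, the sketch below checks LLM-generated SQL before running it: a single statement, starting with `SELECT`, with no write or schema keywords, and a cap on returned rows. It uses SQLite from the standard library with a hypothetical `orders` table. The keyword filter is deliberately conservative and not a complete defence; a read-only database user is the stronger control.

```python
import re
import sqlite3

# Conservative blocklist; it may also reject harmless queries that mention these words.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|detach|pragma|replace)\b",
    re.IGNORECASE,
)


def is_safe_select(sql: str) -> bool:
    """Accept only a single SELECT statement with no write or schema keywords."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:  # more than one statement
        return False
    if not statement.lower().startswith("select"):
        return False
    return FORBIDDEN.search(statement) is None


def run_guarded(conn: sqlite3.Connection, sql: str, max_rows: int = 100) -> list[tuple]:
    """Run the query only if it passes the check, and cap the number of rows returned."""
    if not is_safe_select(sql):
        raise ValueError("Query rejected by guardrail")
    return conn.execute(sql).fetchmany(max_rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders (total) VALUES (?)", [(10.0,), (25.5,), (4.5,)])
rows = run_guarded(conn, "SELECT COUNT(*), SUM(total) FROM orders;")
```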

About James Killick

James is a co-founder of Devwiz and an AI product specialist. Since 2015 he has helped ship 200+ apps for founders, businesses and government, including work for NSW Government, Briometrix and Huskee. He builds AI-first platforms and writes about turning a proven program into software. He also hosts the Up in the AI podcast.

jameskillick.co · LinkedIn · AI Orchestrators
