Getting appropriate responses from a RAG-empowered LLM is an art. The three primary knobs to adjust are Chunk Size, Document Return Count, and the RAG System Prompt.
Use this guide to help you fine-tune your RAG-empowered LLM's responses.
Chunk Size
Chunk size is the amount of text each piece of a document contains after splitting, before embedding and retrieval (often measured in tokens or characters). The goal is to make chunks large enough to contain complete, useful context, but small enough that retrieval stays precise and you don’t hit token limits when multiple chunks are added to the prompt.
- Too small: Answers are incomplete or the model fills in gaps (increasing the chance of hallucinations).
- Too large: Retrieval becomes less targeted. You may retrieve big blocks with lots of irrelevant content, which increases cost and can cause the model to miss the important part inside the chunk. Large chunks also make it easier to hit context/token limits when you include multiple results.
Choose chunk sizes based on the structure of the documents and the kind of questions users ask:
- Dense technical docs / API references: smaller chunks tend to work better because users ask for specific details.
- Policies / manuals / FAQs: medium chunks often work best because answers usually span a few paragraphs.
- Narrative docs / research / long-form: larger chunks can help preserve argument flow, but you’ll need stricter retrieval limits and good overlap.
No matter what chunk size you pick, add chunk overlap so important context isn’t split across boundaries. Overlap helps preserve continuity (e.g., a definition at the end of one chunk and its usage at the start of the next).
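The mechanics above can be sketched with a minimal character-based chunker. This is an illustrative example, not a production splitter (real pipelines usually split on sentence or paragraph boundaries and count tokens with the embedding model's tokenizer), but it shows how overlap repeats the tail of one chunk at the head of the next so boundary-spanning context survives in at least one chunk:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    The last `overlap` characters of each chunk are repeated at the
    start of the next, so context spanning a boundary (e.g. a definition
    and its usage) appears intact in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the final chunk already reaches the end of the text
    return chunks
```

Note that overlap trades storage and retrieval cost for continuity: every overlapped region is embedded twice.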
Default Recommendation
A general-purpose default that works across many document types is:
- Default chunk size: ~800 tokens
- Default overlap: ~120–160 tokens (≈15–20%)
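Applied in code, the defaults above look roughly like the following sketch. It approximates tokens with whitespace-split words, which is a loose stand-in; a real pipeline should count tokens with the embedding model's own tokenizer:

```python
def chunk_by_tokens(text: str, chunk_size: int = 800, overlap: int = 150) -> list[str]:
    """Chunk text by an approximate token count using the recommended
    defaults (~800-token chunks, ~150-token overlap, about 19%).

    Words are used as a rough token proxy here; swap in the embedding
    model's actual tokenizer for accurate sizing.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```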
When to Adjust
- If answers seem to be missing context or feel “cut off,” increase chunk size (or increase overlap).
- If retrieved passages feel too broad or include lots of unrelated text, decrease chunk size.
- If you often hit token limits when injecting retrieved context, decrease chunk size, retrieve fewer chunks, or tighten your filtering/reranking.
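The last adjustment, staying within token limits, can also be handled at injection time. The sketch below enforces a context budget over retrieved chunks, keeping the best-ranked ones that fit; the word-count token proxy is an assumption, and `fit_context_budget` is a hypothetical helper name, not a library API:

```python
def fit_context_budget(ranked_chunks: list[str], max_tokens: int) -> list[str]:
    """Keep the highest-ranked retrieved chunks that fit a token budget.

    ranked_chunks: chunk strings ordered best match first.
    Uses whitespace word count as a rough token proxy; replace with the
    LLM's real tokenizer for accurate budgeting.
    """
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())
        if used + cost > max_tokens:
            continue  # skip chunks that would blow the budget
        kept.append(chunk)
        used += cost
    return kept
```

Using `break` instead of `continue` is stricter: it stops at the first chunk that doesn't fit, guaranteeing no lower-ranked chunk displaces a higher-ranked one.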