Cost-Based Context Assembly for RAG

2026-05-23

The context window is not a bag of chunks. It is a constrained execution budget where every token competes for expected answer value.

Most RAG systems treat the context window like a shopping bag: retrieve chunks, sort by score, stuff until the token budget is full. That is not planning. It is a greedy approximation with no cost model.

A context window is scarce memory. Every token included has an opportunity cost: it displaces another piece of evidence, increases latency, and may distract the model. The planner should spend tokens the way a database optimizer spends IO.

[Figure: Context candidates competing for token budget]

Each candidate chunk needs more than a vector score. It needs estimated utility, token cost, source authority, freshness, graph distance, and redundancy against already selected evidence.

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    chunk_id: int
    tokens: int
    vector_score: float
    authority: float
    graph_distance: int | None
    novelty: float


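def estimated_utility(c: Candidate) -> float:
    # Graph bonus shrinks by 0.03 per hop from 0.12 and fades to zero by distance 4.
    graph_bonus = 0.0 if c.graph_distance is None else max(0.0, 0.12 - c.graph_distance * 0.03)
    # Weighted blend of relevance, authority, and novelty, plus the graph bonus.
    return c.vector_score * 0.65 + c.authority * 0.20 + c.novelty * 0.15 + graph_bonus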
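The Candidate above carries novelty as a precomputed number, and the post does not say how it is derived. One plausible reading of "redundancy against already selected evidence" is one minus the highest cosine similarity to any chunk already chosen. The sketch below assumes that definition; the helper names `cosine_similarity` and `novelty_against`, and the use of raw embedding lists, are illustrative assumptions rather than part of the original design.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def novelty_against(embedding: list[float], selected: list[list[float]]) -> float:
    """Hypothetical novelty: 1 minus the highest similarity to anything already selected."""
    if not selected:
        return 1.0  # nothing selected yet, so the chunk is fully novel
    closest = max(cosine_similarity(embedding, s) for s in selected)
    # Clamp so negative similarities do not push novelty above 1.0.
    return 1.0 - max(0.0, min(1.0, closest))
```

Under this definition novelty depends on what has already been selected, so a planner that wants it to stay accurate would recompute it after each pick instead of reading a fixed field.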

The naive ranking is utility. The better ranking is utility per token under constraints. A 200-token chunk with one decisive invariant may be worth more than a 900-token chunk with broad but diluted relevance.

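def select_context(candidates: list[Candidate], budget: int) -> list[Candidate]:
    # Drop zero-token candidates up front so the density key never divides by zero.
    usable = [c for c in candidates if c.tokens > 0]
    # Rank by utility per token, densest evidence first.
    ranked = sorted(usable, key=lambda c: estimated_utility(c) / c.tokens, reverse=True)
    selected: list[Candidate] = []
    remaining = budget
    for candidate in ranked:
        # Take a chunk only if it fits and is not mostly redundant.
        if candidate.tokens <= remaining and candidate.novelty > 0.25:
            selected.append(candidate)
            remaining -= candidate.tokens
    return selected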
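To make the per-token ordering concrete, here is the arithmetic for the 200-token versus 900-token comparison, using the same weights as estimated_utility with no graph bonus. The scores are made-up values chosen for illustration, not measurements.

```python
def utility(vector_score: float, authority: float, novelty: float) -> float:
    # Same weights as estimated_utility, with the graph bonus left out.
    return vector_score * 0.65 + authority * 0.20 + novelty * 0.15


# Hypothetical chunks: the long one has the slightly higher vector score.
short_tokens, long_tokens = 200, 900
short_utility = utility(vector_score=0.80, authority=0.90, novelty=0.90)
long_utility = utility(vector_score=0.85, authority=0.90, novelty=0.90)

# Utility per token is what the planner actually pays for.
short_density = short_utility / short_tokens
long_density = long_utility / long_tokens

print(f"short: utility={short_utility:.4f} per_token={short_density:.6f}")
print(f"long:  utility={long_utility:.4f} per_token={long_density:.6f}")
```

Ranked by raw utility, the long chunk wins by a small margin. Ranked per token, the short chunk delivers more than four times the value for each token spent, which is why the selection loop sorts on density.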

This greedy version is intentionally simple. The deeper point is that the planner has a measurable objective. Once you have an objective, you can test alternative planners: MMR, knapsack, learned rerankers, graph-first expansion, or query-class-specific policies.
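As one example of an alternative planner, the sketch below solves the same objective exactly as a 0/1 knapsack with dynamic programming. It assumes candidates have already been filtered and scored, and it takes plain `(chunk_id, tokens, utility)` tuples; the function name `knapsack_select` and that tuple shape are assumptions made for this illustration.

```python
def knapsack_select(items: list[tuple[int, int, float]], budget: int) -> list[int]:
    """Exact 0/1 knapsack over (chunk_id, tokens, utility) tuples.

    Returns the chunk_ids that maximize total utility within the token budget.
    Runs in O(len(items) * budget) time and memory.
    """
    # best[i][b] = best total utility using the first i items within b tokens.
    best = [[0.0] * (budget + 1) for _ in range(len(items) + 1)]
    for i, (_, tokens, utility) in enumerate(items, start=1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: skip item i
            if 0 < tokens <= b:
                take = best[i - 1][b - tokens] + utility
                if take > best[i][b]:
                    best[i][b] = take  # option 2: take item i

    # Walk backwards through the table to recover which items were taken.
    chosen: list[int] = []
    b = budget
    for i in range(len(items), 0, -1):
        if best[i][b] != best[i - 1][b]:
            chunk_id, tokens, _ = items[i - 1]
            chosen.append(chunk_id)
            b -= tokens
    return chosen[::-1]


# Hypothetical case where density-greedy selection leaves value on the table.
items = [(1, 600, 0.90), (2, 500, 0.70), (3, 500, 0.70)]
print(knapsack_select(items, budget=1000))  # [2, 3]
```

With a 1000-token budget, a density-first greedy pass takes chunk 1 (the densest) and then has no room for either 500-token chunk, ending with total utility 0.90. The exact solver picks chunks 2 and 3 for 1.40. The cost is a table proportional to items times budget, so whether the gain is worth it is something the evaluation set would have to show.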

ACL filtering must happen before utility estimation. Otherwise the planner computes value for evidence it is not allowed to show. That is not just a security issue; it also corrupts evaluation because forbidden chunks may crowd out legal ones during ranking.

WITH legal_chunks AS (
    SELECT c.*
    FROM chunks c
    JOIN document_acl acl ON acl.document_id = c.document_id
    WHERE acl.principal_id = %(principal_id)s
      AND c.tenant_id = %(tenant_id)s
)
SELECT *
FROM legal_chunks
ORDER BY embedding <=> %(query_embedding)s
LIMIT 200;

The evaluation set should measure planner behavior, not only final text. Did the planner spend too many tokens on redundant chunks? Did it include at least one authoritative source? Did it preserve ACL invariants? Did it retrieve a graph-neighbor document that pure vector search missed?

def context_density(selected: list[Candidate]) -> float:
    if not selected:
        return 0.0
    return sum(estimated_utility(c) for c in selected) / sum(c.tokens for c in selected)

This is where RAG stops being “semantic search plus prompt” and becomes systems work. The language model is downstream from a planner. If the planner spends the context window badly, no prompt style can recover the missing evidence.

The context window is not large. It is merely expensive. Treat it like a budget, and retrieval quality becomes something you can reason about instead of something you vibe-check after every embedding model change.