Your LLM call isn't slow. One of the four stages is.

You shipped RAG two weeks ago. The demo was 800ms, production is 4.2 seconds at p95, and the only thing your trace tells you is "LLM call: 3.8s". So you start guessing. Maybe the model is slow today. Maybe Pinecone is throttling. Maybe a model upgrade slipped in. You bump an index, switch a model, restart a pod. p95 does not move.Stop guessing.

RAG is not slow. One stage is. This issue is the thirty-minute profiler that names the stage, and the triage table that maps the measurement to the fix.

Why your current instrumentation lies

A RAG request is at least four sequential network hops: embed the query, search the index, assemble the context, call the model. If you stream, that is a fifth stage with its own behaviour. Wrapping the whole handler in a single span collapses all of it into one number that points nowhere.

The cost of that collapse is concrete. Teams who profile their pipeline carefully usually discover that the LLM call is not the slow stage they assumed. The four common bottlenecks hiding behind "LLM: 3.8s" are:

Unbatched context fetch from the doc store. The span starts before the fetch and ends after the model returns, so 200 to 800ms of N+1 queries get attributed to the model. A fetch_many change you can ship in ten lines often takes a slice off your p95 that no amount of prompt-engineering would.

Provider-side queueing. When your LLM provider rate-limits you under load, time-to-first-token inflates but per-token throughput stays flat. From a single span, the model looks slower; in reality the request is sitting in the provider's queue. Different problem, different fix.

Cold-cache vector search. A freshly loaded vector index runs five to ten times slower than its warm steady state. If your service autoscales, the first request after a scale-up looks pathological in your dashboards. Mixed into your p95 calculation, the cold tail makes the warm latency look worse than it is.

Streaming consumer slowness. The server returned tokens fast; the React state update or JSON parser on the client is what the user actually feels. Backend metrics will say the request was quick, and the user will tell you it was not.

Optimise the wrong stage and you spend a week tuning prompt caching when the real fix was batched fetching. The instrumentation that would tell you which one is yours is not in your current trace.

The mental model: stages and budgets

Treat a RAG request as a small pipeline, not a single API call. Each stage has its own SLO and its own failure mode. A reasonable budget for a p95 = 1.5s target looks like this:

Stage

Budget

Where it usually breaks

Embed

100ms

Cross-region call, no connection pooling

Search

80ms warm / 400ms cold

k too high, no filter pushdown, cold index

Context build

50ms

N+1 fetch to the doc store

LLM (TTFT, within TTLT)

600ms

Provider queueing, prompt-eval cost on long prompts

LLM (TTLT total)

1270ms

Output bloat, no concision instruction

TTFT is a measurement within the LLM stage, not an additional stage. The additive p95 budget is embed + search + context + TTLT = 100 + 80 + 50 + 1270 = 1500ms. TTFT gets its own row because it has its own failure mode and its own fix (see the triage table); a healthy TTFT of 600ms within a healthy TTLT of 1270ms means roughly 670ms of generation, which is realistic for a streamed RAG answer.

The diagnostic question is not "why is RAG slow". It is "which budget did we blow and by how much". Reframing the question is half the work. The rest is measurement, and that is what the next section ships.

The profiler

What follows is a sixty-line standalone profiler. Drop it next to your RAG handler. Wrap each stage of your existing pipeline with the stage context manager. Wrap the streaming LLM call with measure_stream. Run it against a representative query set. It prints a per-stage table with budget violations.

It is deliberately a single file with no dependencies beyond the standard library and your existing pipeline. You do not need to deploy a collector, agree on a span schema, or wait for production traffic to fill a dashboard. You need to know which stage is your slow stage today, and this is the tool.

One scope note before the snippet: this profiler measures the four backend stages. Client-side streaming consumer slowness (the fourth bottleneck above) is real but invisible from here; it needs a front-end profile and surfaces in the last row of the triage table.

# rag_profiler.py - standalone per-stage profiler for a RAG handler.
# Wrap each stage in your handler with `with stage("name"):`, wrap the
# streaming LLM call with `measure_stream(...)`, then call `report(BUDGETS)`.

import time, statistics
from contextlib import contextmanager
from collections import defaultdict

_STATS: dict[str, list[float]] = defaultdict(list)

@contextmanager
def stage(name: str):
    t0 = time.perf_counter()
    try:
        yield
    finally:
        _STATS[name].append((time.perf_counter() - t0) * 1000)

def measure_stream(make_stream):
    """make_stream() returns an iterator of text chunks. Records ttft_ms and ttlt_ms."""
    t0 = time.perf_counter()
    t_first = None
    chunks: list[str] = []
    for chunk in make_stream():
        if t_first is None:
            t_first = time.perf_counter()
            _STATS["llm.ttft"].append((t_first - t0) * 1000)
        chunks.append(chunk)
    _STATS["llm.ttlt"].append((time.perf_counter() - t0) * 1000)
    return "".join(chunks)

def _pct(samples, q):
    if len(samples) < 2: return samples[0]
    return sorted(samples)[min(len(samples) - 1, int(len(samples) * q))]

def report(budgets: dict[str, float], cold_runs: int = 0):
    print(f"{'stage':<14}{'n':>5}{'mean':>9}{'p50':>9}{'p95':>9}{'p99':>9}{'budget':>9}{'status':>10}")
    # sort by mean descending so the dominant stage is on top
    ordered = sorted(_STATS.items(), key=lambda kv: -statistics.mean(kv[1][cold_runs:] or kv[1]))
    for name, samples in ordered:
        warm = samples[cold_runs:] if cold_runs else samples
        if not warm: continue
        budget = budgets.get(name)
        p95 = _pct(warm, 0.95)
        status = "-" if budget is None else ("OVER" if p95 > budget else "ok")
        print(f"{name:<14}{len(warm):>5}{statistics.mean(warm):>9.1f}"
              f"{_pct(warm, 0.5):>9.1f}{p95:>9.1f}{_pct(warm, 0.99):>9.1f}"
              f"{(str(budget) if budget else '-'):>9}{status:>10}")

# --- usage ----------------------------------------------------------------
# from your_handler import handle_query   # the handler under test
# QUERIES = [...]                          # 50 representative production queries
# BUDGETS = {"embed":100,"search":80,"context":50,"llm.ttft":600,"llm.ttlt":1270}
# for q in QUERIES: handle_query(q)        # internally uses stage(...) + measure_stream(...)
# report(BUDGETS, cold_runs=1)             # exclude the first run from steady-state stats

Four design choices in this profiler that matter:

The stage context manager nests, so a call inside another stage is recorded independently. You do not have to thread state through the handler.

TTFT is captured inside the streaming loop on the first chunk, not after the call returns. Without that split, time-to-first-token inflation from provider queueing is invisible.

The report sorts stages by mean time so the dominant stage is on top. The status column flags budget violations independently, because a fifty-millisecond stage that blew a ten-millisecond budget matters more than a six-hundred-millisecond LLM call hitting its budget exactly. The eye wants to optimise the big number; the data should tell you to optimise the wrong-shape one.

Cold-start runs are excluded from steady-state stats via the cold_runs argument. Mix them in and the average lies about both the cold case and the warm one. Run them and report them separately.

Wrap your handler stages, run fifty queries, read the table. In practice, the slow stage usually surfaces in the first profile run.

If you found the profiler useful, the next issue covers what to do with the production traces once the slow stage is named. Subscribe at firsttoken.dev.

The triage decision tree

The profiler tells you where. The next table tells you what. Each row maps a profiler symptom to the likely cause, the confirmation test, and the production fix.

If the profiler says...

Likely cause

Confirm with

Fix

embed > 200ms

Cross-region call or no connection pooling

TCP connection reuse on the embed client

Co-locate the embed endpoint; enable keep-alive

search > 200ms warm

k too high, no filter pushdown, ANN degraded

Run with top_k=5; verify filter pushdown

Cheatsheet #13 (push filter to index), #11 (right-size k)

search > 600ms cold only

Index not warm after autoscale

Tag cold_start=True on the stage

Pre-warm on health check, or exclude cold from SLO

context > 200ms

N+1 fetch to the doc store

Count DB queries per request

Batched fetch_many at context assembly

llm.ttft > 1.5s, output tokens normal

Provider rate-limiting / queueing

Log x-ratelimit-remaining-* headers

Negotiate a higher tier; consider a fallback provider

llm.ttft > 1.5s, prompt > 8k tokens

Prompt-eval cost on a long prompt

Compare TTFT with and without retrieved context

Turn on prompt caching (Worksheet Q8)

llm.ttlt > 3s, llm.ttft < 500ms

Output bloat

Plot completion_tokens p99

Tighten max_tokens; add "be concise" (Worksheet Q6)

Total fast, UX feels slow

Client-side streaming consumer

Time the front-end separately

Profile your client, not your backend

The pair is the deliverable. Measurement without triage is curiosity; triage without measurement is hope. Together they turn "RAG is slow" into a tracked, fixable claim.

Trade-offs and gotchas

The profiler is a diagnostic tool, not a permanent telemetry layer. Use it to find the stage; instrument that stage properly in production once you know what matters.

A representative query set is non-negotiable. Five canned queries lie. Sample fifty from real traffic, including the long, the short, the multi-hop, and the empty-result ones. Profile what production looks like, not what your eval set rewards.

Separate cold runs from warm. The cold_runs argument exists because mixing them collapses two distributions into a misleading average. If your service autoscales, your cold case is real and worth knowing. It is also not what you want polluting your p95.

Provider queueing varies by time of day. A profile run at 3am tells you about your code; a profile run at peak tells you about your provider relationship. Run both. The delta between them is your queueing exposure.

Do not point the profiler at production traffic. It is designed to run N copies of the same query, which is not what your users want. Run it against a clone, a shadow path, or a load-test environment that mirrors your real index state.

TTFT measurement requires a streaming client. If your provider does not support streaming for the model you use, you can still report TTLT, but you lose the ability to distinguish queueing from generation.

Profiling overhead is on the order of microseconds per stage (around 0.5µs in a benchmark of the empty context manager). Immaterial against the latencies you are debugging; mentioned because someone will ask.

Production checklist

  • Profile before you instrument. Instrument the stage the profile told you matters.

  • Set per-stage budgets in code, alongside the handler, so they live with the system they describe.

  • Split TTFT from TTLT on every LLM call, in dev and in production.

  • Tag cold-cache runs and exclude them from your SLO maths.

  • Alert on per-stage budget violations, not on total latency.

  • Log rate-limit headers from every LLM call so provider-side queueing becomes a visible signal.

  • Re-profile after every retrieval change: a new chunker, a different k, a swapped embedding model.

  • Once you know your bottleneck, port the profiler's span shape to OpenTelemetry for permanent visibility.

The takeaway

RAG is not slow. One stage is. Profile per-stage, not per-request, because the four common slow stages each have a different fix and the average call hides which one is yours. Thirty minutes of measurement is worth a week of guessing, and turns every later optimisation into a data-driven one. Run the profiler at peak, with cold and warm separated, against real queries. The table tells you the rest.