Your LLM call isn't slow. One of the four stages is.
You shipped RAG two weeks ago. The demo was 800ms, production is 4.2 seconds at p95, and the only thing your trace tells you is "LLM call: 3.8s". So you start guessing. Maybe the model is slow today. Maybe Pinecone is throttling. Maybe a model upgrade slipped in. You bump an index, switch a model, restart a pod. p95 does not move.Stop guessing.
RAG is not slow. One stage is. This issue is the thirty-minute profiler that names the stage, and the triage table that maps the measurement to the fix.
Why your current instrumentation lies
A RAG request is at least four sequential network hops: embed the query, search the index, assemble the context, call the model. If you stream, that is a fifth stage with its own behaviour. Wrapping the whole handler in a single span collapses all of it into one number that points nowhere.
The cost of that collapse is concrete. Teams who profile their pipeline carefully usually discover that the LLM call is not the slow stage they assumed. The four common bottlenecks hiding behind "LLM: 3.8s" are:
Unbatched context fetch from the doc store. The span starts before the fetch and ends after the model returns, so 200 to 800ms of N+1 queries get attributed to the model. A fetch_many change you can ship in ten lines often takes a slice off your p95 that no amount of prompt-engineering would.
Provider-side queueing. When your LLM provider rate-limits you under load, time-to-first-token inflates but per-token throughput stays flat. From a single span, the model looks slower; in reality the request is sitting in the provider's queue. Different problem, different fix.
Cold-cache vector search. A freshly loaded vector index runs five to ten times slower than its warm steady state. If your service autoscales, the first request after a scale-up looks pathological in your dashboards. Mixed into your p95 calculation, the cold tail makes the warm latency look worse than it is.
Streaming consumer slowness. The server returned tokens fast; the React state update or JSON parser on the client is what the user actually feels. Backend metrics will say the request was quick, and the user will tell you it was not.
Optimise the wrong stage and you spend a week tuning prompt caching when the real fix was batched fetching. The instrumentation that would tell you which one is yours is not in your current trace.
The mental model: stages and budgets
Treat a RAG request as a small pipeline, not a single API call. Each stage has its own SLO and its own failure mode. A reasonable budget for a p95 = 1.5s target looks like this:

Stage | Budget | Where it usually breaks |
|---|---|---|
Embed | 100ms | Cross-region call, no connection pooling |
Search | 80ms warm / 400ms cold | k too high, no filter pushdown, cold index |
Context build | 50ms | N+1 fetch to the doc store |
LLM (TTFT, within TTLT) | 600ms | Provider queueing, prompt-eval cost on long prompts |
LLM (TTLT total) | 1270ms | Output bloat, no concision instruction |
TTFT is a measurement within the LLM stage, not an additional stage. The additive p95 budget is embed + search + context + TTLT = 100 + 80 + 50 + 1270 = 1500ms. TTFT gets its own row because it has its own failure mode and its own fix (see the triage table); a healthy TTFT of 600ms within a healthy TTLT of 1270ms means roughly 670ms of generation, which is realistic for a streamed RAG answer.
The diagnostic question is not "why is RAG slow". It is "which budget did we blow and by how much". Reframing the question is half the work. The rest is measurement, and that is what the next section ships.
The profiler
What follows is a sixty-line standalone profiler. Drop it next to your RAG handler. Wrap each stage of your existing pipeline with the stage context manager. Wrap the streaming LLM call with measure_stream. Run it against a representative query set. It prints a per-stage table with budget violations.
It is deliberately a single file with no dependencies beyond the standard library and your existing pipeline. You do not need to deploy a collector, agree on a span schema, or wait for production traffic to fill a dashboard. You need to know which stage is your slow stage today, and this is the tool.
One scope note before the snippet: this profiler measures the four backend stages. Client-side streaming consumer slowness (the fourth bottleneck above) is real but invisible from here; it needs a front-end profile and surfaces in the last row of the triage table.
# rag_profiler.py - standalone per-stage profiler for a RAG handler.
# Wrap each stage in your handler with `with stage("name"):`, wrap the
# streaming LLM call with `measure_stream(...)`, then call `report(BUDGETS)`.
import time, statistics
from contextlib import contextmanager
from collections import defaultdict
_STATS: dict[str, list[float]] = defaultdict(list)
@contextmanager
def stage(name: str):
t0 = time.perf_counter()
try:
yield
finally:
_STATS[name].append((time.perf_counter() - t0) * 1000)
def measure_stream(make_stream):
"""make_stream() returns an iterator of text chunks. Records ttft_ms and ttlt_ms."""
t0 = time.perf_counter()
t_first = None
chunks: list[str] = []
for chunk in make_stream():
if t_first is None:
t_first = time.perf_counter()
_STATS["llm.ttft"].append((t_first - t0) * 1000)
chunks.append(chunk)
_STATS["llm.ttlt"].append((time.perf_counter() - t0) * 1000)
return "".join(chunks)
def _pct(samples, q):
if len(samples) < 2: return samples[0]
return sorted(samples)[min(len(samples) - 1, int(len(samples) * q))]
def report(budgets: dict[str, float], cold_runs: int = 0):
print(f"{'stage':<14}{'n':>5}{'mean':>9}{'p50':>9}{'p95':>9}{'p99':>9}{'budget':>9}{'status':>10}")
# sort by mean descending so the dominant stage is on top
ordered = sorted(_STATS.items(), key=lambda kv: -statistics.mean(kv[1][cold_runs:] or kv[1]))
for name, samples in ordered:
warm = samples[cold_runs:] if cold_runs else samples
if not warm: continue
budget = budgets.get(name)
p95 = _pct(warm, 0.95)
status = "-" if budget is None else ("OVER" if p95 > budget else "ok")
print(f"{name:<14}{len(warm):>5}{statistics.mean(warm):>9.1f}"
f"{_pct(warm, 0.5):>9.1f}{p95:>9.1f}{_pct(warm, 0.99):>9.1f}"
f"{(str(budget) if budget else '-'):>9}{status:>10}")
# --- usage ----------------------------------------------------------------
# from your_handler import handle_query # the handler under test
# QUERIES = [...] # 50 representative production queries
# BUDGETS = {"embed":100,"search":80,"context":50,"llm.ttft":600,"llm.ttlt":1270}
# for q in QUERIES: handle_query(q) # internally uses stage(...) + measure_stream(...)
# report(BUDGETS, cold_runs=1) # exclude the first run from steady-state statsFour design choices in this profiler that matter:
The stage context manager nests, so a call inside another stage is recorded independently. You do not have to thread state through the handler.
TTFT is captured inside the streaming loop on the first chunk, not after the call returns. Without that split, time-to-first-token inflation from provider queueing is invisible.
The report sorts stages by mean time so the dominant stage is on top. The status column flags budget violations independently, because a fifty-millisecond stage that blew a ten-millisecond budget matters more than a six-hundred-millisecond LLM call hitting its budget exactly. The eye wants to optimise the big number; the data should tell you to optimise the wrong-shape one.
Cold-start runs are excluded from steady-state stats via the cold_runs argument. Mix them in and the average lies about both the cold case and the warm one. Run them and report them separately.
Wrap your handler stages, run fifty queries, read the table. In practice, the slow stage usually surfaces in the first profile run.
If you found the profiler useful, the next issue covers what to do with the production traces once the slow stage is named. Subscribe at firsttoken.dev.
The triage decision tree
The profiler tells you where. The next table tells you what. Each row maps a profiler symptom to the likely cause, the confirmation test, and the production fix.
If the profiler says... | Likely cause | Confirm with | Fix |
|---|---|---|---|
| Cross-region call or no connection pooling | TCP connection reuse on the embed client | Co-locate the embed endpoint; enable keep-alive |
| k too high, no filter pushdown, ANN degraded | Run with | Cheatsheet #13 (push filter to index), #11 (right-size k) |
| Index not warm after autoscale | Tag | Pre-warm on health check, or exclude cold from SLO |
| N+1 fetch to the doc store | Count DB queries per request | Batched |
| Provider rate-limiting / queueing | Log | Negotiate a higher tier; consider a fallback provider |
| Prompt-eval cost on a long prompt | Compare TTFT with and without retrieved context | Turn on prompt caching (Worksheet Q8) |
| Output bloat | Plot | Tighten |
Total fast, UX feels slow | Client-side streaming consumer | Time the front-end separately | Profile your client, not your backend |
The pair is the deliverable. Measurement without triage is curiosity; triage without measurement is hope. Together they turn "RAG is slow" into a tracked, fixable claim.
Trade-offs and gotchas
The profiler is a diagnostic tool, not a permanent telemetry layer. Use it to find the stage; instrument that stage properly in production once you know what matters.
A representative query set is non-negotiable. Five canned queries lie. Sample fifty from real traffic, including the long, the short, the multi-hop, and the empty-result ones. Profile what production looks like, not what your eval set rewards.
Separate cold runs from warm. The cold_runs argument exists because mixing them collapses two distributions into a misleading average. If your service autoscales, your cold case is real and worth knowing. It is also not what you want polluting your p95.
Provider queueing varies by time of day. A profile run at 3am tells you about your code; a profile run at peak tells you about your provider relationship. Run both. The delta between them is your queueing exposure.
Do not point the profiler at production traffic. It is designed to run N copies of the same query, which is not what your users want. Run it against a clone, a shadow path, or a load-test environment that mirrors your real index state.
TTFT measurement requires a streaming client. If your provider does not support streaming for the model you use, you can still report TTLT, but you lose the ability to distinguish queueing from generation.
Profiling overhead is on the order of microseconds per stage (around 0.5µs in a benchmark of the empty context manager). Immaterial against the latencies you are debugging; mentioned because someone will ask.
Production checklist
Profile before you instrument. Instrument the stage the profile told you matters.
Set per-stage budgets in code, alongside the handler, so they live with the system they describe.
Split TTFT from TTLT on every LLM call, in dev and in production.
Tag cold-cache runs and exclude them from your SLO maths.
Alert on per-stage budget violations, not on total latency.
Log rate-limit headers from every LLM call so provider-side queueing becomes a visible signal.
Re-profile after every retrieval change: a new chunker, a different k, a swapped embedding model.
Once you know your bottleneck, port the profiler's span shape to OpenTelemetry for permanent visibility.
The takeaway
RAG is not slow. One stage is. Profile per-stage, not per-request, because the four common slow stages each have a different fix and the average call hides which one is yours. Thirty minutes of measurement is worth a week of guessing, and turns every later optimisation into a data-driven one. Run the profiler at peak, with cold and warm separated, against real queries. The table tells you the rest.