Skip to main content
Claude & LLMs 9 min read By Benjamin Holczer

Claude in production: evals, guardrails, and keeping costs honest

What actually separates a Claude demo from a Claude system your ops team relies on every day. Evals, caching, routing, and the boring engineering that makes LLM workflows reliable.

  • #Claude
  • #evals
  • #prompt caching
  • #LangGraph
  • #production AI

Most teams we meet have a Claude-powered prototype that “works” and a conviction that putting it in production is one more weekend of effort. It rarely is.

The gap between a working prompt and a production LLM workflow is not intelligence — it is discipline. Here is what that discipline looks like.

1. You cannot ship what you cannot measure

Every production Claude workflow we ship starts with an eval set. Always. Before the first prompt revision.

An eval set is:

  • A fixed corpus of realistic inputs (emails, tickets, documents, call transcripts — whatever the workflow consumes).
  • Expected outputs or rubrics for each input. Sometimes a gold answer. More often a set of criteria the output must satisfy.
  • A grader. Sometimes deterministic (regex, schema validation). Sometimes another LLM acting as judge, with its own eval.

You run the full set on every prompt change, model change, or temperature change. The number goes up or down. The decision becomes obvious.

Without an eval set, you are doing vibes-based engineering on a system where the vibes are determined by tokens your team has never seen. It will degrade, silently, every week.

2. Prompt caching is the difference between $8 and $80

Claude’s prompt caching changes the economics of LLM workflows. Done right, it cuts cost 70–90% and latency 30–50%. Done wrong, you get the same bill and think caching is oversold.

The shape that works:

[CACHED: system prompt + persistent context + large reference docs]
[CACHED: few-shot examples]
[UNCACHED: the user's current request]

The cache lives 5 minutes by default. If you call the workflow more often than every 5 minutes, you stay hot. If less, you pay a cache-write on each call, which is slightly more expensive than the uncached path.

Rules we enforce on every Claude integration:

  • Stable system prompts. Never include timestamps or counters in the cached block.
  • Order matters — cached content must appear before anything dynamic.
  • Monitor cache_read_input_tokens vs cache_creation_input_tokens on every request. The ratio tells you if caching is actually working.

3. Route to the cheapest model that gets it right

Not every LLM call needs Opus. Not every call even needs Sonnet.

Our default routing:

  • Claude Haiku — routing, classification, simple extraction, tool-choice decisions. Sub-cent cost.
  • Claude Sonnet — most summarization, reasoning, email drafting, document Q&A. The workhorse.
  • Claude Opus — complex multi-step reasoning, long-context synthesis, agentic workflows with heavy tool use.

The router is a small Haiku call that decides which model to send to. It sounds paranoid; it pays back immediately. One of our clients’ email triage workflow drops from $0.04 per email to $0.004 per email when routing is in place — with no measurable quality drop.

4. Tool use beats free-form reasoning

If you want the model to return a date, make it call a tool called set_date(date), not say “the appointment is on March 14, 2026.”

Structured tool calls give you:

  • Typed arguments, validated before they do anything.
  • Retry logic if arguments fail validation.
  • Idempotency keys — the same tool call with the same args is safely re-runnable.
  • Audit trails your compliance team will thank you for.

Claude is particularly good at tool use. The Agent SDK makes chained tool calls reliable at production volumes. If your workflow involves more than one action, you want tools, not parsing.

5. Guardrails go on the boundary, not in the prompt

Every time a prompt says “do not reveal internal information” you are one jailbreak away from embarrassment. Prompts should guide behavior; boundaries should be enforced by code.

In practice:

  • PII redaction on every input before it hits the model.
  • Output schema validation — Claude returns JSON that validates against a Zod schema, or we regenerate.
  • Allow-list for tool calls — the model cannot call tools it is not explicitly granted.
  • Content filtering on output for specific compliance domains.
  • Rate limits on the whole workflow by tenant, not just by API key.

6. Human-in-the-loop where it matters

For anything that moves money, sends external communications, or changes customer records, we add a review gate. The LLM drafts. A human approves. The model learns from the delta between its draft and the approved version.

This is not a failure of AI. It is how you earn the right to eventually remove the gate for the specific, measurable cases where the model is as good as or better than the human. Which, for a well-evaluated workflow, eventually happens for 70–90% of cases.

7. Observability, or stop calling it production

Every Claude integration we ship has:

  • Per-request logging of model, tokens, cache hits, latency, and final output.
  • Per-workflow dashboards showing cost, p50/p95 latency, and quality metric over time.
  • Alerts on quality regressions (from the continuous eval), cost spikes, and cache-hit-rate drops.
  • A “last 100 calls” UI your ops team can scroll through on a bad Tuesday.

Without this, you are flying blind. With it, problems get caught in hours instead of quarters.

The shape of a good Claude project

When we scope a Claude automation, the line items look like this:

  1. Eval set (the first thing we build).
  2. Prompt and tool design.
  3. Prompt caching architecture.
  4. Model routing.
  5. Guardrails (PII, schema, allow-lists).
  6. Human review UI where applicable.
  7. Observability and alerts.
  8. Runbook and hand-off docs.

Every one of those is boring. Every one of those is why the system still works 6 months later when the demo version would have quietly died.

If you have a Claude prototype that works in a notebook and a production roadmap that has been “next week” for three months, we can probably help. The gap is smaller than it looks — but it is real engineering, not prompt tweaks.

Start where it pays back fastest

Let’s find the automation that moves your biggest number.

Free 30-minute call. We review your stack, point at the 2–3 highest-ROI automations, and tell you honestly whether we’re the right team to build them.