
Run Claude on your own server — no API key required

6 min read · Claude · LLM · self-hosted · Go · cost

You're paying for a Claude subscription. You want Claude inside an automation (an n8n flow, a Worker, a nightly enrichment job), and the moment you reach for the backend, you reach for the Anthropic API and a second meter starts ticking. For background and batch work, that meter is most of the bill.

There's another path: run the Claude Code CLI headless on your own server, authenticated by the subscription you already pay for, behind one HTTP endpoint. No Anthropic API key. That's exactly what llm-gateway, a small, open-source (MIT) Go service, does, and this post covers how it works and how to stand it up.

The double-pay problem

A Claude subscription buys you Claude in the app and in the Claude Code CLI. But the API is billed separately, per token. So the instant you want Claude in a backend — classify these rows, draft this copy, summarize this ticket — you open a pay-as-you-go API account and start paying twice for the same model.

For anything that isn't a live user request — the kind of batch and agent work where the bill is mostly re-sent context — that API meter can dwarf what you're already paying for the subscription.

The trick: drive the Claude Code CLI

The Claude Code CLI can authenticate with your subscription instead of an API key. One command mints a long-lived OAuth token:

claude setup-token        # prints a CLAUDE_CODE_OAUTH_TOKEN

llm-gateway wraps that CLI in a tiny HTTP service. Any app — Python, a Worker, n8n, another Go service — POSTs a prompt to one endpoint and gets back cached, retried, schema-validated JSON. Under the hood it's the claude binary running on your subscription, no API key in sight.

⚠️ It's a CLI-orchestration service, not a hosted SDK: the claude CLI must be installed, on PATH, and authenticated on the host (the Docker image bakes the binary; you supply the token at runtime).

Two kinds of key — don't mix them up

This is the one thing people trip on. There are two independent tokens:

  • LLM_AUTH_TOKEN — a bearer token you choose. It locks down your gateway so only your apps can call it. Set this on any public server.
  • CLAUDE_CODE_OAUTH_TOKEN — how the claude CLI authenticates to Anthropic. This is the subscription token from claude setup-token (an ANTHROPIC_API_KEY works too if you'd rather go pay-as-you-go).

A locked-down, subscription-auth server sets both: your gate, plus Claude's auth with no API key.

Quick start (Docker)

Clone the repo, build the image — it bakes the Claude Code binary — and supply auth at runtime:

git clone https://github.com/sundarshahi/llm-gateway && cd llm-gateway
docker build -t llm-gateway .

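# Generate the gateway token once and keep it in your shell, so the curl calls below can reuse it.
export LLM_AUTH_TOKEN="$(openssl rand -hex 24)"

docker run -p 8787:8787 \
  -e LLM_AUTH_TOKEN \
  -e CLAUDE_CODE_OAUTH_TOKEN="<from: claude setup-token>" \
  llm-gateway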

Then call it like any HTTP API. Plain text:

curl -s localhost:8787/v1/llm -H "Authorization: Bearer $LLM_AUTH_TOKEN" \
  -d '{"model":"claude","prompt":"Reply with exactly: PONG"}'
# {"model":"claude","ok":true,"text":"PONG"}

Or schema-validated JSON — native structured output, so you get a typed object back, not a string you have to coax into shape:

curl -s localhost:8787/v1/llm -H "Authorization: Bearer $LLM_AUTH_TOKEN" \
  -d '{"model":"claude","prompt":"Capital of Japan and its population in millions.",
       "json_schema":{"type":"object","properties":{"capital":{"type":"string"},
       "population_millions":{"type":"number"}},"required":["capital","population_millions"]}}'
# {"data":{"capital":"Tokyo","population_millions":37.4},"model":"claude","ok":true}

There's also POST /v1/llm/batch, which runs many jobs concurrently, and an unauthenticated GET /health for your load balancer.

Use it as a Go library

It's a service and a library — go get github.com/sundarshahi/llm-gateway. If you're already in Go, skip the HTTP hop and embed it:

cfg := llmgateway.Config{
    MaxConcurrency: 4,
    Cache:          llmgateway.NewMemoryCache(),
    // Providers default to Claude + Gemini.
}
gw, err := llmgateway.New(cfg)
if err != nil {
    log.Fatal(err) // surface config errors instead of discarding them
}

res := gw.Run(ctx, llmgateway.Job{
    Model:  "claude",
    Prompt: "Summarize this in one sentence: ...",
    JSON:   true,
})
fmt.Println(res.OK, res.Data)

The core carries no application logic: providers, cache, prompt store, post-processing, and metrics are all seams on Config. The examples/customserver example shows the pattern: import the library, mount its generic /llm routes, and add your own handlers alongside them instead of forking. The examples/seohooks example contains a real PostProcess hook that rewrites replies on the way out.

It's not just "no API key"

Skipping the API bill is the headline, but the reason I'd point a production automation at this rather than exec-ing the CLI myself is everything around the call:

  • Content-addressed cache. Identical requests return byte-identical replies and skip the subprocess entirely — sub-10ms cache hits. This is the single biggest LLM cost lever, applied for free.
  • Self-consistency voting. Set samples > 1 and it runs N times and returns the JSON majority — buying accuracy on the hard calls where you'd otherwise route to a pricier frontier model.
  • Tolerant JSON. Native schema mode when you want it; tolerant extraction and repair (double-escaped envelopes, stray code fences, bad escapes) when the model gets creative.
  • Retries, timeouts, and cancellation. Transient errors and invalid JSON are retried; timeouts never are; a disconnected client kills the spawn so you don't pay for output nobody's waiting on.
  • Per-stage tuning. Model, vote count, and thinking budget resolved per named prompt — your cheap, structural steps run small and your synthesis step runs big, the routing discipline that actually moves the bill.

Where this fits (and where it doesn't)

This is for your own automation against your own subscription — internal tools, side projects, content pipelines, background enrichment. It's not a way to resell Claude or fan one seat out to a thousand users: you're bound by your plan's usage limits and Anthropic's terms, the same as in the app. And because it's a long-running CLI on a box, treat it like one — bind it to localhost or put LLM_AUTH_TOKEN in front of it before it ever faces the public internet.

Within those lines, it's the difference between a subscription you use interactively and a subscription that also quietly powers your backend. If you're sizing how many concurrent jobs one box can take, the throughput & concurrency calculator translates LLM_MAX_CONCURRENCY and your per-call latency into a real ceiling.

Standing this up against a real workload? Star or fork it on GitHub, then model the before/after with the AI Agent Cost Calculator — drop the API line to zero and watch where your number lands.

The takeaway: if you're already paying for Claude, you don't have to pay the API a second time to put it in your stack. Drive the CLI you already have, lock it behind a token, and let the cache and retries do the unglamorous work. The whole thing is one MIT-licensed Go repo: github.com/sundarshahi/llm-gateway.

Want this wired into your pipeline and sized for your traffic? That's exactly what a production-readiness audit covers.
