What Is LLM Inference?
Inference is the process of using a trained machine learning model to generate predictions or outputs from new input data. In the context of LLMs, inference means running the model on a prompt to produce a response.
This is distinct from:
- Training — Adjusting model weights using gradient descent on a large dataset. Happens once (or infrequently).
- Fine-tuning — A second round of training on a smaller domain-specific dataset.
- Inference — Running the already-trained model to generate outputs. Happens every time a user makes a request.
When you call the OpenAI API, the Anthropic API, or any LLM endpoint, you are performing inference.
How LLM Inference Works
LLM inference is an autoregressive process — the model generates one token at a time, appending each token to the growing sequence and feeding it back as input for the next step:
```text
Prompt: "The capital of France is"
Step 1: model → P("Paris" | "The capital of France is") = 0.94 → sample "Paris"
Step 2: model → P(token | "The capital of France is Paris") = ... → sample " ."
Step 3: model → P(EOS | ...) = 0.91 → stop
```
This sequential generation is the primary reason LLM inference is slow compared to, say, image classification — you cannot parallelize across output tokens.
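The loop above can be sketched with a toy stand-in for the model. The vocabulary and probabilities below are made up for illustration; a real LLM produces a distribution over its full vocabulary with a transformer forward pass:

```javascript
// Toy stand-in for a language model: returns a probability distribution over a
// tiny vocabulary given the sequence so far. The hard-coded numbers are purely
// illustrative, not real model outputs.
function toyModel(sequence) {
  const last = sequence[sequence.length - 1];
  if (last === "is") return { "Paris": 0.94, "Lyon": 0.05, "<eos>": 0.01 };
  if (last === "Paris") return { ".": 0.88, ",": 0.11, "<eos>": 0.01 };
  return { "<eos>": 0.91, ".": 0.09 };
}

// Greedy autoregressive decoding: pick the most likely next token, append it,
// and feed the grown sequence back in until the model emits end-of-sequence.
function generate(prompt, maxTokens = 10) {
  const sequence = [...prompt];
  for (let i = 0; i < maxTokens; i++) {
    const dist = toyModel(sequence);
    const next = Object.entries(dist).sort((a, b) => b[1] - a[1])[0][0];
    if (next === "<eos>") break;
    sequence.push(next);
  }
  return sequence;
}

const output = generate(["The", "capital", "of", "France", "is"]);
// output: ["The", "capital", "of", "France", "is", "Paris", "."]
```

Note that each iteration depends on the previous one's output, which is exactly why output tokens cannot be generated in parallel.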
The Inference Stack
A production inference serving stack typically includes:
- Model weights — The billions of parameters stored in GPU memory (VRAM).
- KV cache — Cached attention key/value tensors for every token processed so far, so each generation step computes attention inputs only for the newest token instead of recomputing them for the whole sequence. Critical for performance with long contexts.
- Batching — Combining multiple requests into a single GPU forward pass to improve throughput.
- Quantization — Reducing weight precision (e.g., FP16 → INT8 → INT4) to fit larger models in less VRAM.
Popular open-source inference engines: vLLM, TGI (Text Generation Inference), llama.cpp, Ollama.
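A back-of-the-envelope sketch of what the KV cache saves, counting per-token passes rather than exact FLOPs (the numbers are illustrative, not benchmarks):

```javascript
// Without a cache, every generation step re-encodes the entire sequence so far;
// with one, the prompt is encoded once (prefill) and each step adds one token.

function tokensProcessedWithoutCache(promptLen, outputLen) {
  let total = 0;
  for (let t = 0; t < outputLen; t++) {
    total += promptLen + t; // step t re-processes the prompt plus t generated tokens
  }
  return total;
}

function tokensProcessedWithCache(promptLen, outputLen) {
  return promptLen + outputLen; // each token is processed exactly once
}

// e.g. a 1,000-token prompt generating 200 tokens:
const withoutCache = tokensProcessedWithoutCache(1000, 200); // 219,900 token passes
const withCache = tokensProcessedWithCache(1000, 200);       // 1,200 token passes
```

The gap widens quadratically with sequence length, which is why the KV cache matters most for long contexts.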
Latency Components
End-to-end LLM inference latency has two parts:
- Time to First Token (TTFT) — How long until the first output token appears. Dominated by prompt processing time, which scales with input length.
- Inter-Token Latency (ITL) — How long between each subsequent output token. Relatively constant per token.
For a typical API call:
```text
Total latency ≈ TTFT + (output_tokens × ITL)
              ≈ 300 ms + (200 tokens × 20 ms)
              ≈ 4.3 seconds
```
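The same estimate as a tiny helper, useful for budgeting latency before choosing a model or output length:

```javascript
// Total latency ≈ TTFT + output_tokens × ITL (all values in milliseconds).
function totalLatencyMs(ttftMs, outputTokens, itlMs) {
  return ttftMs + outputTokens * itlMs;
}

const latency = totalLatencyMs(300, 200, 20); // 4300 ms ≈ 4.3 s
```

Note the asymmetry: input length mostly affects TTFT, while output length multiplies into the total, so capping `max_tokens` is often the cheapest latency win.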
Inference in Application Architecture
In most LLM applications, inference is the most expensive operation — both in time and cost. Good architecture minimizes unnecessary inference calls through:
- Caching — Cache responses to identical prompts.
- Streaming — Stream tokens to the user as they are generated rather than waiting for the full response.
- Prompt optimization — Shorter prompts with cleaner input reduce TTFT and cost.
```javascript
// Streaming inference with the OpenAI Node SDK
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  stream: true,
  messages: [{ role: "user", content: cleanContent }],
});

// Write each token to stdout as it arrives instead of waiting for the full response
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```
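The caching idea can be sketched as a thin wrapper around the inference call. This is a minimal sketch, assuming a `runInference` helper that wraps the actual API call (both names are hypothetical); production systems typically hash the full message array and attach a TTL:

```javascript
// Cache keyed on the exact prompt: identical prompts skip inference entirely.
const responseCache = new Map();

async function cachedCompletion(prompt, runInference) {
  const hit = responseCache.get(prompt);
  if (hit !== undefined) return hit; // cache hit: no inference call made
  const response = await runInference(prompt);
  responseCache.set(prompt, response);
  return response;
}
```

Because inference dominates both latency and cost, even a modest cache hit rate on repeated prompts pays for itself quickly.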
Clean Input = Faster, Cheaper Inference
The quality of content you feed into an inference call directly affects both cost and output quality. Noisy HTML inflates input token counts, increasing TTFT and cost while degrading response quality.
KnowledgeSDK preprocesses web content before your inference calls. The /v1/scrape and /v1/extract endpoints strip HTML noise and return efficient markdown, reducing input tokens by 50–80%. This means faster TTFT, lower cost, and better model attention on what actually matters.
```javascript
const { content } = await sdk.scrape(url);
// content is clean markdown: ~1,200 tokens instead of ~8,000 tokens of raw HTML
// Your inference call is now ~6x cheaper and faster
```
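The arithmetic behind that claim, using an assumed illustrative input price of $2.50 per million tokens (real per-token prices vary by model and provider):

```javascript
// Input cost scales linearly with input token count.
function inputCostUSD(tokens, pricePerMillionUSD) {
  return (tokens / 1_000_000) * pricePerMillionUSD;
}

const rawHtmlCost = inputCostUSD(8000, 2.5);  // ~8,000 tokens of raw HTML
const markdownCost = inputCostUSD(1200, 2.5); // ~1,200 tokens of clean markdown
const savings = rawHtmlCost / markdownCost;   // ≈ 6.7x less spent per call
```

The ratio is independent of the price itself, so the savings hold at any per-token rate; TTFT improves for the same reason, since prefill time scales with input length.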
Managed Inference vs. Self-Hosted
| Option | Pros | Cons |
|---|---|---|
| API (OpenAI, Anthropic) | No infra, latest models | Cost scales with usage, rate limits |
| Cloud GPU (RunPod, Modal) | Cost control, private models | DevOps overhead |
| Self-hosted (vLLM, Ollama) | Full control, no data egress | High upfront infra cost |