What Is LLM Inference?
Inference is the process of using a trained machine learning model to generate predictions or outputs from new input data. In the context of LLMs, inference means running the model on a prompt to produce a response.
This is distinct from:
- Training — Adjusting model weights using gradient descent on a large dataset. Happens once (or infrequently).
- Fine-tuning — A second round of training on a smaller domain-specific dataset.
- Inference — Running the already-trained model to generate outputs. Happens every time a user makes a request.
When you call the OpenAI API, the Anthropic API, or any LLM endpoint, you are performing inference.
How LLM Inference Works
LLM inference is an autoregressive process — the model generates one token at a time, appending each token to the growing sequence and feeding it back as input for the next step:
```text
Prompt: "The capital of France is"
Step 1: model → P("Paris" | "The capital of France is") = 0.94 → sample "Paris"
Step 2: model → P(token | "The capital of France is Paris") = ... → sample " ."
Step 3: model → P(EOS | ...) = 0.91 → stop
```
This sequential generation is the primary reason LLM inference is slow compared to, say, image classification — you cannot parallelize across output tokens.
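The loop above can be sketched with a toy stand-in for the model. The vocabulary and probabilities below are made up for illustration; a real LLM produces a distribution over its full vocabulary with a transformer forward pass:

```javascript
// Toy stand-in for a language model: returns a probability distribution over a
// tiny vocabulary given the sequence so far. The hard-coded numbers are purely
// illustrative, not real model outputs.
function toyModel(sequence) {
  const last = sequence[sequence.length - 1];
  if (last === "is") return { "Paris": 0.94, "Lyon": 0.05, "<eos>": 0.01 };
  if (last === "Paris") return { ".": 0.88, ",": 0.11, "<eos>": 0.01 };
  return { "<eos>": 0.91, ".": 0.09 };
}

// Greedy autoregressive decoding: pick the most likely next token, append it,
// and feed the grown sequence back in until the model emits end-of-sequence.
function generate(prompt, maxTokens = 10) {
  const sequence = [...prompt];
  for (let i = 0; i < maxTokens; i++) {
    const dist = toyModel(sequence);
    const next = Object.entries(dist).sort((a, b) => b[1] - a[1])[0][0];
    if (next === "<eos>") break;
    sequence.push(next);
  }
  return sequence;
}

const output = generate(["The", "capital", "of", "France", "is"]);
// output: ["The", "capital", "of", "France", "is", "Paris", "."]
```

Note that each iteration depends on the previous one's output, which is exactly why output tokens cannot be generated in parallel.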
The Inference Stack
A production inference serving stack typically includes:
- Model weights — The billions of parameters stored in GPU memory (VRAM).
- KV cache — Cached attention key/value tensors for every token processed so far, so each generation step computes attention inputs only for the newest token instead of recomputing them for the whole sequence. Critical for performance with long contexts.
- Batching — Combining multiple requests into a single GPU forward pass to improve throughput.
- Quantization — Reducing weight precision (e.g., FP16 → INT8 → INT4) to fit larger models in less VRAM.
Popular open-source inference engines: vLLM, TGI (Text Generation Inference), llama.cpp, Ollama.
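A back-of-the-envelope sketch of what the KV cache saves, counting per-token passes rather than exact FLOPs (the numbers are illustrative, not benchmarks):

```javascript
// Without a cache, every generation step re-encodes the entire sequence so far;
// with one, the prompt is encoded once (prefill) and each step adds one token.

function tokensProcessedWithoutCache(promptLen, outputLen) {
  let total = 0;
  for (let t = 0; t < outputLen; t++) {
    total += promptLen + t; // step t re-processes the prompt plus t generated tokens
  }
  return total;
}

function tokensProcessedWithCache(promptLen, outputLen) {
  return promptLen + outputLen; // each token is processed exactly once
}

// e.g. a 1,000-token prompt generating 200 tokens:
const withoutCache = tokensProcessedWithoutCache(1000, 200); // 219,900 token passes
const withCache = tokensProcessedWithCache(1000, 200);       // 1,200 token passes
```

The gap widens quadratically with sequence length, which is why the KV cache matters most for long contexts.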
Latency Components
End-to-end LLM inference latency has two parts:
- Time to First Token (TTFT) — How long until the first output token appears. Dominated by prompt processing time, which scales with input length.
- Inter-Token Latency (ITL) — How long between each subsequent output token. Relatively constant per token.
For a typical API call:
```text
Total latency ≈ TTFT + (output_tokens × ITL)
              ≈ 300 ms + (200 tokens × 20 ms)
              ≈ 4.3 seconds
```
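The same estimate as a tiny helper, useful for budgeting latency before choosing a model or output length:

```javascript
// Total latency ≈ TTFT + output_tokens × ITL (all values in milliseconds).
function totalLatencyMs(ttftMs, outputTokens, itlMs) {
  return ttftMs + outputTokens * itlMs;
}

const latency = totalLatencyMs(300, 200, 20); // 4300 ms ≈ 4.3 s
```

Note the asymmetry: input length mostly affects TTFT, while output length multiplies into the total, so capping `max_tokens` is often the cheapest latency win.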
Inference in Application Architecture
In most LLM applications, inference is the most expensive operation — both in time and cost. Good architecture minimizes unnecessary inference calls through:
- Caching — Cache responses to identical prompts.
- Streaming — Stream tokens to the user as they are generated rather than waiting for the full response.
- Prompt optimization — Shorter prompts with cleaner input reduce TTFT and cost.
```javascript
// Streaming inference with the OpenAI Node SDK
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  stream: true,
  messages: [{ role: "user", content: cleanContent }],
});

// Write each token to stdout as it arrives instead of waiting for the full response
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```
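The caching idea can be sketched as a thin wrapper around the inference call. This is a minimal sketch, assuming a `runInference` helper that wraps the actual API call (both names are hypothetical); production systems typically hash the full message array and attach a TTL:

```javascript
// Cache keyed on the exact prompt: identical prompts skip inference entirely.
const responseCache = new Map();

async function cachedCompletion(prompt, runInference) {
  const hit = responseCache.get(prompt);
  if (hit !== undefined) return hit; // cache hit: no inference call made
  const response = await runInference(prompt);
  responseCache.set(prompt, response);
  return response;
}
```

Because inference dominates both latency and cost, even a modest cache hit rate on repeated prompts pays for itself quickly.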
Clean Input = Faster, Cheaper Inference
The quality of content you feed into an inference call directly affects both cost and output quality. Noisy HTML inflates input token counts, increasing TTFT and cost while degrading response quality.
KnowledgeSDK preprocesses web content before your inference calls. The /v1/scrape and /v1/extract endpoints strip HTML noise and return efficient markdown, reducing input tokens by 50–80%. This means faster TTFT, lower cost, and better model attention on what actually matters.
```javascript
const { content } = await sdk.scrape(url);
// content is clean markdown: ~1,200 tokens instead of ~8,000 tokens of raw HTML
// Your inference call is now ~6x cheaper and faster
```
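The arithmetic behind that claim, using an assumed illustrative input price of $2.50 per million tokens (real per-token prices vary by model and provider):

```javascript
// Input cost scales linearly with input token count.
function inputCostUSD(tokens, pricePerMillionUSD) {
  return (tokens / 1_000_000) * pricePerMillionUSD;
}

const rawHtmlCost = inputCostUSD(8000, 2.5);  // ~8,000 tokens of raw HTML
const markdownCost = inputCostUSD(1200, 2.5); // ~1,200 tokens of clean markdown
const savings = rawHtmlCost / markdownCost;   // ≈ 6.7x less spent per call
```

The ratio is independent of the price itself, so the savings hold at any per-token rate; TTFT improves for the same reason, since prefill time scales with input length.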
Managed Inference vs. Self-Hosted
| Option | Pros | Cons |
|---|---|---|
| API (OpenAI, Anthropic) | No infra, latest models | Cost scales with usage, rate limits |
| Cloud GPU (RunPod, Modal) | Cost control, private models | DevOps overhead |
| Self-hosted (vLLM, Ollama) | Full control, no data egress | High upfront infra cost |