LLMs · beginner

Also known as: LLM temperature, sampling temperature

Temperature

A sampling parameter that controls the randomness of LLM outputs — lower values make responses more deterministic, higher values more creative.

What Is Temperature?

Temperature is a hyperparameter applied during LLM inference that scales the probability distribution over the model's vocabulary before a token is sampled. In plain terms, it controls how "creative" or "random" the model's outputs are.

  • Temperature = 0 — Always pick the single highest-probability token (greedy decoding). Maximally deterministic; the same prompt produces identical output every time.
  • Temperature between 0 and 1 — Sharpen the distribution, concentrating probability on the most likely tokens. More focused and predictable than sampling from the raw distribution.
  • Temperature = 1 — Sample from the raw probability distribution the model learned. Balanced between coherence and variation.
  • Temperature > 1 — Flatten the distribution, making unlikely tokens more probable. More creative, but with a higher risk of incoherence.

The Math Behind Temperature

After the model's final linear layer produces logits for each vocabulary token, a softmax converts them to probabilities. Temperature T is applied by dividing the logits before softmax:

P(token_i) = exp(logit_i / T) / Σ exp(logit_j / T)
  • When T → 0, the highest-logit token gets probability ≈ 1 (greedy).
  • When T = 1, standard softmax.
  • When T > 1, the distribution flattens toward uniform.
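The formula above is easy to verify numerically. A minimal sketch in Python, using a toy three-token vocabulary with hypothetical logit values:

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract the max so exp() never overflows
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]  # hypothetical logits for three tokens

for T in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

Running this shows the effect directly: at T = 0.2 nearly all probability mass collapses onto the highest-logit token, at T = 1 you recover the standard softmax, and at T = 2 the distribution visibly flattens toward uniform.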

Choosing the Right Temperature

Use Case | Recommended Temperature
Factual Q&A, data extraction | 0 – 0.2
Summarization, translation | 0.2 – 0.5
Conversational assistants | 0.5 – 0.8
Creative writing, brainstorming | 0.8 – 1.2
Poetry, experimental generation | 1.2 – 2.0
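If you want these guidelines in code, they can be captured as a simple lookup. The task names and the midpoint choice below are illustrative, not part of any API:

```python
# Hypothetical lookup mirroring the table above; ranges are guidelines, not rules.
RECOMMENDED_TEMPERATURE = {
    "factual_qa": (0.0, 0.2),
    "summarization": (0.2, 0.5),
    "conversation": (0.5, 0.8),
    "creative_writing": (0.8, 1.2),
    "experimental": (1.2, 2.0),
}

def pick_temperature(task: str) -> float:
    """Return the midpoint of the recommended range for a task type."""
    lo, hi = RECOMMENDED_TEMPERATURE[task]
    return (lo + hi) / 2
```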

Temperature in API Calls

Most LLM APIs expose temperature as a top-level parameter. In JavaScript with the OpenAI SDK:

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  temperature: 0.2, // low for factual, structured tasks
  messages: [
    { role: "system", content: "Extract product details as JSON." },
    { role: "user", content: pageContent }
  ]
});

And the same parameter in Python with the Anthropic SDK:

response = anthropic.messages.create(
    model="claude-opus-4-6",
    temperature=0.0,  # deterministic for structured extraction
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}]
)

Temperature vs. Top-K vs. Top-P

Temperature is rarely used in isolation. It is commonly combined with:

  • Top-K sampling — Restricts sampling to the K highest-probability tokens before temperature is applied.
  • Top-P (nucleus sampling) — Restricts sampling to the smallest set of tokens whose cumulative probability exceeds P.

A typical production setup uses temperature=0.7, top_p=0.9 together.
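One way the three knobs fit together can be shown as a rough sketch — this mirrors a common pipeline (temperature scales the logits, top-k and top-p prune the candidate set, then a token is drawn from what remains), not any particular library's implementation:

```python
import math
import random

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    """Sketch of combined sampling: top-k filter, temperature-scaled
    softmax, then top-p (nucleus) truncation before drawing a token."""
    indexed = sorted(enumerate(logits), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        indexed = indexed[:top_k]  # keep only the K highest-logit tokens
    # Temperature-scaled softmax over the surviving tokens
    m = max(v for _, v in indexed)
    exps = [(i, math.exp((v - m) / temperature)) for i, v in indexed]
    total = sum(e for _, e in exps)
    probs = [(i, e / total) for i, e in exps]
    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p
        kept, cum = [], 0.0
        for i, p in probs:
            kept.append((i, p))
            cum += p
            if cum >= top_p:
                break
        total = sum(p for _, p in kept)
        probs = [(i, p / total) for i, p in kept]  # renormalize
    # Draw one token index from the remaining distribution
    r = random.random()
    cum = 0.0
    for i, p in probs:
        cum += p
        if r <= cum:
            return i
    return probs[-1][0]
```

With an aggressive setting like top_k=1 or a tight top_p, the candidate set collapses to the single most probable token and sampling becomes effectively greedy, regardless of temperature.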

Practical Advice

For structured extraction tasks — where you want JSON, specific field values, or deterministic classifications — always set temperature to 0 or close to it. Variation is a bug, not a feature, when you need reliable data.

When using KnowledgeSDK's /v1/extract or /v1/classify endpoints, the underlying models are configured with low temperature internally to ensure consistent, structured outputs. You get deterministic, well-formed results without needing to manage inference parameters yourself.

Common Mistakes

  • Setting temperature = 0 for creative tasks and wondering why outputs feel robotic.
  • Setting temperature = 1.5 for factual tasks and wondering why the model invents facts.
  • Assuming temperature = 0 means the model is "correct" — it only means it is consistent; greedy decoding can still hallucinate.

Related Terms

LLMs · intermediate
Top-K Sampling
A decoding strategy that restricts token selection to the K most probable next tokens, balancing coherence and diversity.
LLMs · beginner
Large Language Model
A neural network trained on vast text corpora that can generate, summarize, translate, and reason about language.
LLMs · beginner
Inference
The process of running a trained LLM to generate output from a given input prompt, as opposed to training or fine-tuning the model.
