Temperature, Top‑p & Friends: Tuning LLM Randomness
A practical guide to sampling parameters like temperature, top‑p, and seeds: what they actually do, when to change them, and how to choose reliable defaults for your LLM apps.

If prompts are the specification you hand to an AI coworker, then sampling parameters are that coworker’s mood sliders. Same spec, different mood: sometimes you want a safe, by‑the‑book engineer; sometimes you want an over‑caffeinated brainstorm buddy.
In the first article we zoomed in on what a prompt is—instruction, constraints, context, examples, output spec. This one is about what happens after you’ve written a good prompt: how the model actually chooses words, why different runs give different outputs, and how knobs like temperature, top‑p, and seed let you deliberately control that randomness instead of just hoping for the best.
We’ll stay grounded in the OpenAI API (Responses / Chat) and their docs on text generation, structured outputs, and production best practices.
What “sampling” means in LLMs
Language models don’t “decide” the next word; they sample it.
After reading your prompt (and previous conversation), the model internally computes, for the next token, a probability for every possible token in its vocabulary. Think of a huge table like:
- “the” → 23%
- “a” → 15%
- “this” → 7%
- “function” → 3%
- thousands of other tokens with tiny probabilities
This is the probability distribution over the next token.
Two important ways you can use that distribution:
- Greedy decoding – always pick the most likely token (the argmax). That tends to be accurate but boring and sometimes oddly stuck in loops.
- Sampling – interpret the distribution as a weighted die and roll it. “the” is still most likely, but sometimes “this” or “a” wins.
The model repeats this step token by token until it hits an end condition (like running out of max_output_tokens or emitting a stop sequence).
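To make the difference concrete, here’s a toy sketch in plain Python. The probabilities are invented (a real vocabulary has tens of thousands of tokens), but the two decoding strategies work exactly like this for a single step:

```python
import random

# Toy next-token distribution (invented numbers; real vocabularies
# contain tens of thousands of tokens sharing the remaining mass).
next_token_probs = {"the": 0.23, "a": 0.15, "this": 0.07, "function": 0.03}

# Greedy decoding: always take the single most likely token.
greedy = max(next_token_probs, key=next_token_probs.get)

# Sampling: roll a weighted die over the same distribution.
sampled = random.choices(
    population=list(next_token_probs),
    weights=list(next_token_probs.values()),
    k=1,
)[0]

print(greedy)   # always "the"
print(sampled)  # usually "the", sometimes "a" or "this"
```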
All the “randomness knobs” you see—temperature, top‑p/nucleus, top‑k, penalties, seed—do one of two things:
- Reshape the distribution (make it sharper, flatter, or truncated).
- Control the randomness of the sampling process (how you roll the die).
You don’t get more intelligence out of these knobs, only different trade‑offs between diversity, stability, verbosity, and safety. Good engineering is knowing which trade‑off a task needs.
Temperature vs. top‑p: what changes what
OpenAI’s APIs expose two main controls for randomness:
- temperature
- top_p (nucleus sampling)
They both influence how adventurous the sampling is, but in very different ways.
Temperature: reshaping the curve
Mathematically, temperature rescales the model’s raw scores before they become probabilities. Practically:
- Low temperature (0–0.3) → the distribution gets sharper. The most likely token becomes even more dominant; you get more predictable, repetitive outputs.
- Medium (0.4–0.8) → a balanced curve. The model usually picks the best token, but occasionally explores nearby options.
- High (0.9–1.5+) → the curve flattens. Lower‑probability tokens get a boost; you get more creative, surprising, and noisy outputs.
OpenAI documents temperature as a number between 0 and 2, with higher values making outputs more random and lower values more deterministic.
At temperature ≈ 0, you’re very close to greedy decoding: the model tends to give the same answer every time for the same prompt (though underlying systems can still introduce tiny variations).
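If you want to see the reshaping directly, here’s a minimal sketch of temperature applied to raw model scores (logits) before they become probabilities. The logits are invented; the point is only how the curve sharpens or flattens:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide raw scores by the temperature, then apply softmax.
    # A tiny floor avoids division by zero as temperature approaches 0.
    t = max(temperature, 1e-6)
    scaled = [x / t for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # invented raw scores for 4 tokens

print(softmax_with_temperature(logits, 0.2))  # sharp: the top token dominates
print(softmax_with_temperature(logits, 1.0))  # the model's "native" distribution
print(softmax_with_temperature(logits, 1.5))  # flatter: tail tokens get a boost
```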
Top‑p (nucleus sampling): trimming the tail
top_p is a different kind of control. Instead of reshaping the entire curve, it cuts off the long tail of unlikely tokens.
- Imagine you sort all tokens by probability, highest to lowest.
- Then you keep only the smallest set whose probabilities add up to top_p (say 0.9).
- You renormalize those and sample only from this “nucleus”.
So with:
- top_p = 1.0 → use the full distribution (no truncation).
- top_p = 0.9 → ignore the least likely 10% of probability mass.
- top_p = 0.5 → be extremely conservative, only sampling from a tiny set of very likely tokens.
This is why OpenAI’s docs describe top‑p as “nucleus sampling” and specifically note that it’s an alternative to temperature.
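Here’s a minimal sketch of that truncation step, again with toy probabilities, to show how the tail simply disappears from the draw:

```python
import random

def nucleus_sample(probs, top_p, rng=random):
    """Sample one token from the smallest set whose cumulative probability reaches top_p."""
    # Sort tokens by probability, highest first.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

    nucleus, cumulative = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # stop once we've covered top_p of the probability mass

    # Renormalize within the nucleus and sample only from it.
    tokens = [t for t, _ in nucleus]
    weights = [p for _, p in nucleus]
    return rng.choices(tokens, weights=weights, k=1)[0]

toy = {"the": 0.45, "a": 0.25, "this": 0.15, "function": 0.10, "zebra": 0.05}
print(nucleus_sample(toy, top_p=0.9))  # "zebra" can never be chosen here
```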
How they feel different
A simple mental model:
- Temperature – How “spiky” is the whole curve?
  - Turn it down to make the model more confident and repetitive.
  - Turn it up to make it more curious across the board.
- Top‑p – Where do we cut off low‑probability weirdness?
  - Lower it to ban the weirdest words.
  - Raise it to allow more wild, rare choices.
OpenAI explicitly recommends tuning either temperature or top‑p, but not both at once, because combining them can be tricky and produce non‑intuitive effects.
When to lower/raise temperature
You almost never want one universal temperature. Different jobs want different levels of randomness.
Let’s turn this into practical guidance.
When to go low (0–0.3)
Use low temperatures when correctness and stability matter more than variety:
- Code generation & refactors – You want compilable, deterministic code snippets, close to the training distribution of “good code”. (For serious pipelines, also wrap with tests or static analysis.)
- Schema‑bound JSON / structured outputs – When you’re using JSON schemas or other structured formats (response_format / structured outputs), a lower temperature reduces the odds of malformed responses.
- Policy‑sensitive replies – Support bots, compliance checks, anything where you need consistent interpretations and phrasing.
- Scoring, classification, extraction – Tasks where the model is mapping input → label/fields; you want minimal drift.
Typical ranges: 0.0–0.2. I’ll start at 0 for these and only increase if the model becomes oddly terse or brittle.
When to stay in the middle (0.3–0.7)
Medium temperatures are a good default when you want helpful explanations that don’t all look copy‑pasted:
- Educational explanations, tutoring, walkthroughs
- Summaries and rewrites where tone matters but strict consistency doesn’t
- Product copy & UX microcopy when you’ll still review manually
- Most “assistant” style chat UIs
Typical range: 0.3–0.5 for production, maybe up to 0.7 for internal tools.
When to go high (0.8–1.2+)
High temperatures are for ideation and exploration, where unexpected outputs are a feature, not a bug:
- Brainstorming campaign ideas or variants
- Fiction, character voices, creative metaphors
- Alternative phrasings, risky suggestions, or “tell me 20 wild options”
Typical range: 0.8–1.0 is already pretty adventurous; beyond 1.2 can get very chaotic, especially if the prompt is vague.
Top‑k/top‑p together: cautions
OpenAI’s current APIs focus mainly on temperature and top_p; you won’t see a top_k parameter there, but you will see it in libraries like Hugging Face Transformers or some other providers.
Conceptually:
- Top‑k = “Only consider the k most likely tokens.”
- Top‑p = “Only consider the smallest set of tokens whose probabilities sum to p.”
You can combine them—e.g., top_k=50, top_p=0.95—but that’s like putting two filters in series. Great if you know what you’re doing, dangerous if you don’t.
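To see why stacking filters can over‑constrain, here’s a toy sketch of both applied in series (plain Python, invented probabilities; real implementations differ in details, but the narrowing effect is the same):

```python
def top_k_filter(ranked, k):
    # Keep only the k most likely tokens.
    return ranked[:k]

def top_p_filter(ranked, p):
    # Keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return kept

# Toy distribution, sorted highest-first (invented numbers).
ranked = [("the", 0.40), ("a", 0.25), ("this", 0.15), ("function", 0.12), ("zebra", 0.08)]

# Filters in series: top_k=3 first, then top_p=0.5 on what's left.
after_k = top_k_filter(ranked, k=3)
after_p = top_p_filter(after_k, p=0.5)

print(after_k)  # [("the", 0.40), ("a", 0.25), ("this", 0.15)]
print(after_p)  # [("the", 0.40), ("a", 0.25)]  only two options survive
```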
Typical things that can go wrong:
- Over‑constraining – Low temperature + small top_k + low top_p can make the model weirdly repetitive or get it stuck in short loops; it simply doesn’t have enough options to escape.
- Unexpected style changes – Some rare but high‑impact tokens (like punctuation or formatting tokens) can get trimmed away, changing the “feel” of responses in subtle ways.
- Tuning hell – It becomes hard to debug whether an issue is due to the prompt, the model, temperature, top_p, or top_k.
OpenAI’s own recommendation to adjust either temperature or top_p, but typically not both, is a good default philosophy across providers.
One dial at a time
If you’re stuck, set top_p to 1.0, leave penalties at 0, and tune only temperature until you like the behavior. Then maybe touch other knobs.
Determinism, retries, and seeds
Random sampling is great until you ship a product and suddenly need reproducibility:
- Offline evaluation (“Did the new prompt really improve accuracy?”)
- Audit logs (“Can we reproduce the exact answer the user saw?”)
- Consistent experiments across environments
OpenAI exposes a seed parameter that lets the system aim for deterministic sampling: using the same model, prompt, parameters, and seed should give you the same output in most cases.
How to think about seeds
You can think of seed as setting the random number generator for sampling:
- Same seed + same everything else → highly likely identical output.
- Different seed → different random choices along the way.
For strict determinism, combine:
- temperature = 0 (or very low),
- top_p = 1.0,
- a fixed seed.
This is especially useful for:
- Regression tests for prompts
- Golden test cases in your CI/CD pipeline
- Training flows, where you want stable inputs for a judge model, like PractiqAI’s “judge” that evaluates whether a response meets task criteria.
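Here’s a minimal sketch of a reproducible evaluation call, assuming the OpenAI Python SDK and Chat Completions; the model name is a placeholder, and determinism remains best‑effort even with a fixed seed:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_golden_case(prompt: str) -> str:
    """One regression-test call: same prompt, parameters, and seed every run."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder; use any model your account can access
        messages=[{"role": "user", "content": prompt}],
        temperature=0,         # near-greedy decoding
        top_p=1.0,             # no nucleus truncation
        seed=1234,             # fixed seed for best-effort determinism
        max_tokens=300,
    )
    return response.choices[0].message.content

# Compare this against a stored "golden" answer in CI.
print(run_golden_case("Summarize the refund policy in 3 bullet points."))
```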
Seeds and retries
In production, you typically retry on network errors or transient API issues. You have a few options:
- Idempotent retries – Reuse the same seed so that a retried request yields the same content, minimizing user confusion.
- Diverse retries – Intentionally change the seed (or rely on default randomness) on retry to hunt for a better answer if the first one fails schema validation or a safety check.
Both are valid; just be explicit. For user‑facing text you might prefer diverse retries, while for internal scoring you probably want idempotence.
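One way to make that choice explicit in code is a small retry wrapper. This is a rough sketch around a hypothetical call_model(...) helper, not a complete client:

```python
import random

def generate_with_retries(call_model, prompt, *, max_attempts=3,
                          base_seed=1234, diverse=False):
    """Retry wrapper: reuse the seed (idempotent) or vary it (diverse) per attempt."""
    last_error = None
    for attempt in range(max_attempts):
        seed = random.randrange(2**31) if diverse else base_seed
        try:
            return call_model(prompt, seed=seed)
        except Exception as exc:  # e.g. transient network error or failed validation
            last_error = exc
    raise last_error

# Internal scoring: diverse=False for idempotent retries.
# User-facing text that failed a schema check: diverse=True to hunt for a better answer.
```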
Creative vs. precise tasks (recommended presets)
Let’s translate all this into ready‑to‑use “personalities”. These are starting points, not laws of physics, but they’re surprisingly effective in practice.
I’ll assume:
- You’re using a modern GPT‑5.x‑style model via Responses or Chat.
- Penalties (frequency_penalty, presence_penalty) are 0 unless noted.
1. Strict JSON / tools / automation
You care about valid structure, not style.
- temperature: 0
- top_p: 1.0
- response_format: { "type": "json_schema", ... } (or structured outputs)
- Small max_output_tokens, tuned to your schema size
Use this preset for:
- Database row generation
- Function/tool calling
- Any PractiqAI‑style task where the judge model expects strict, machine‑checkable results.
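As a concrete starting point, here’s a sketch of this preset using Chat Completions with a JSON schema. The invoice schema is hypothetical, and the exact response_format shape can vary between SDK versions, so check the structured outputs docs for your client:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical schema for a tiny "extract an invoice row" task.
invoice_schema = {
    "name": "invoice_row",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "amount": {"type": "number"},
            "currency": {"type": "string"},
        },
        "required": ["vendor", "amount", "currency"],
        "additionalProperties": False,
    },
}

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Extract the invoice row from: ..."}],
    temperature=0,
    top_p=1.0,
    max_tokens=200,  # small, tuned to the schema size
    response_format={"type": "json_schema", "json_schema": invoice_schema},
)

print(response.choices[0].message.content)  # machine-checkable JSON
```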
2. Analytical explainer
You want accurate, consistent, human‑readable explanations.
- temperature: 0.2–0.4
- top_p: 0.9–1.0
- max_output_tokens: modest (e.g. enough for 3–6 paragraphs, not an essay)
Use this for:
- “Explain this code/contract/log to me.”
- “Summarize this document for a specific role.”
- PractiqAI tasks that train “explain your reasoning in plain language”.
3. Helpful general assistant
The “default chat bot” vibe.
- temperature: 0.4–0.7
- top_p: 0.9–1.0
- No penalties, or small positives (e.g. presence_penalty: 0.2) if you see repetition
Use when:
- You want nice phrasing and some variety.
- Answers are reviewed by a human before acting.
4. Creative brainstormer
You explicitly want lots of varied ideas, even if some are off.
- temperature: 0.8–1.0
- top_p: 0.9–1.0
- Mild positive presence_penalty to push new ideas instead of repeating old ones
Use for:
- Campaign ideas, taglines, experiments, story beats
- “Give me 5 drastically different approaches…”
For each of these presets, the prompt structure from the first article still matters: clear task, constraints, context, and output spec. Sampling settings amplify or dampen what your prompt is already doing; they don’t fix a weak prompt.
Guarding against verbosity and off‑task rambling
If your model keeps writing way too much, or drifts into weird tangents, it’s tempting to blame temperature. Sometimes that’s true, but very often the prompt and other parameters are the real culprits.
Here’s a layered approach.
1. Control length explicitly
Models are trained to be “helpful” and often interpret that as “say more stuff”. Counteract that with:
- Hard limits: Set max_output_tokens / max_tokens appropriately; fewer tokens naturally trim verbosity and also reduce latency and cost.
- Soft limits in the prompt: “Use at most 120 words.” “Return 3 bullet points of 1 short sentence each.” “Do not write explanations or disclaimers.”
And, crucially, say who the audience is and what “done” means—just like in your prompt design fundamentals.
2. Use structured outputs where possible
If your downstream system expects JSON or another structure, don’t ask for free‑form prose and then clean it; ask directly for structure:
- response_format with type: "json_schema" in Chat
- Structured Outputs in the Responses API
This naturally discourages rambling and makes off‑task digressions easier to detect (extra fields, unexpected strings, etc.).
3. Tighten the prompt before tightening randomness
Typical mistakes that cause rambling:
- Vague instructions (“Teach me about X”) with no length or target audience
- Multiple questions in one prompt, no clear priority
- Encouraging tone (“be comprehensive”) without a bound
Fix those first. Temperature 0.2 with a fuzzy prompt is still fuzzy.
4. Use penalties to curb repetition
If you’re getting the same phrases over and over, consider:
- Positive frequency penalties to discourage repeating exact tokens.
- Positive presence penalties to encourage introducing new tokens.
These don’t directly reduce verbosity, but they fight the worst kind of it: repetitive waffle.
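If you want to see where those knobs live, here’s a minimal sketch assuming the OpenAI Python SDK and Chat Completions; both penalties accept values roughly between -2.0 and 2.0, and the values below are just illustrative starting points:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Give me 10 taglines for a cycling app."}],
    temperature=0.8,
    frequency_penalty=0.3,  # discourage repeating the exact same tokens
    presence_penalty=0.4,   # nudge the model toward tokens it hasn't used yet
)

print(response.choices[0].message.content)
```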
5. Then adjust temperature/top‑p
Finally:
- If the model goes on weird tangents → lower temperature and maybe lower top_p.
- If it’s verbose but on topic → reduce max_output_tokens and tighten your “done” definition instead of driving temperature to zero.
Fast experiments to pick parameters
You don’t have to guess your way to good settings. You can run tiny, cheap experiments that fit in an afternoon.
A practical loop:
1. Fix your prompt and context – Use the prompting structure from article #1 (role, task, constraints, context, examples, acceptance criteria).
2. Define what “good” means for this task:
   - Did the output follow the schema?
   - Was the answer correct (as judged by a script or another model)?
   - Was it within your word limit?
3. Pick a small parameter grid – For example:
   - Temperatures: [0, 0.3, 0.7]
   - top_p: [1.0] (fixed for now)
4. Run a small batch – For each parameter combo, run 20–50 test prompts (or PractiqAI‑style tasks). You can even have a judge model score them, like the PractiqAI “judge” that verifies outputs against task conditions.
5. Record simple metrics:
   - % outputs that passed schema or validation
   - Average length
   - A rough quality score (e.g. 1–5 from a human or model)
6. Choose the best combo – Often the middle temperature will dominate: more variety than 0, but fewer failures than 0.7+.
OpenAI’s advanced usage and production guides explicitly encourage this kind of offline evaluation and A/B testing before shipping parameter changes to users.
You can repeat this process per task type (coding, summarization, brainstorming) and bake the winner into your app’s presets.
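Here’s a compact sketch of that sweep, assuming a hypothetical run_prompt(prompt, temperature=...) wrapper around your API client and a passes_checks(output) validator you define per task:

```python
from statistics import mean

def sweep(prompts, run_prompt, passes_checks, temperatures=(0.0, 0.3, 0.7)):
    """Run every prompt at every temperature and report simple metrics."""
    results = {}
    for t in temperatures:
        outputs = [run_prompt(p, temperature=t) for p in prompts]
        results[t] = {
            "pass_rate": mean(1.0 if passes_checks(o) else 0.0 for o in outputs),
            "avg_length": mean(len(o) for o in outputs),
        }
    return results

# Pick the lowest temperature with an acceptable pass_rate,
# then eyeball a handful of transcripts for quality before shipping.
```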
Production defaults and overrides
At some point you have to pick sensible defaults and stop tweaking every single request by hand.
A pragmatic strategy:
1. Establish conservative global defaults
For most production text‑generation endpoints:
- temperature: 0.2–0.4
- top_p: 1.0
- max_output_tokens: chosen per endpoint (summary vs long email)
- seed: unset in production unless you need strict reproducibility
These settings give you:
- Enough diversity to avoid robotic answers
- Stable reasoning
- Predictable latency and token use (shorter outputs mean faster responses, per OpenAI’s latency and cost docs).
2. Add role‑ or endpoint‑specific overrides
Define a few “named modes”:
mode: "strict_json"→ temperature 0, structured outputs, tight token limitmode: "analysis"→ temperature 0.3, moderate tokensmode: "creative"→ temperature 0.9, larger token budget, mild presence penalty
Your application logic selects the mode; your LLM client translates it into parameters.
This mirrors how PractiqAI courses map tasks to roles and skills—each job type or course module might naturally correspond to a different sampling profile.
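One simple way to wire this up is a small mapping from mode name to parameters. This is a sketch with illustrative values (the token budgets are made up); your client layer translates the dict into whatever parameter names your chosen endpoint supports:

```python
# Named sampling modes; application code picks the mode, the client applies it.
SAMPLING_MODES = {
    "strict_json": {"temperature": 0.0, "top_p": 1.0, "max_output_tokens": 300},
    "analysis":    {"temperature": 0.3, "top_p": 1.0, "max_output_tokens": 800},
    "creative":    {"temperature": 0.9, "top_p": 1.0, "max_output_tokens": 1200,
                    "presence_penalty": 0.3},
}

def params_for(mode: str, **overrides):
    """Merge a named mode with per-request overrides (overrides win)."""
    return {**SAMPLING_MODES[mode], **overrides}

# e.g. your LLM client maps params_for("analysis") onto the actual request fields.
```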
3. Keep a panic button
Have an ops‑level switch that can:
- Force a lower temperature globally if you detect instability or new safety requirements.
- Reduce max_output_tokens during incident response to control costs and latency spikes.
Because tokens dominate latency, especially completion tokens, this is one of the fastest knobs you can turn in a crisis.
Troubleshooting guide
Finally, a quick “if this, then that” map you can keep by your keyboard.
Symptom: Outputs are too random / inconsistent
- Check prompts: Are instructions clear and uniquely specifying the task?
- Lower temperature (e.g. from 0.7 → 0.3).
- Lower top_p slightly (e.g. 1.0 → 0.9), or just fix it at 1.0 and tune temperature first.
- For critical tasks, consider temperature = 0.
Symptom: Outputs are boring or nearly identical every time
- Raise temperature a bit (0.2 → 0.5, 0.5 → 0.7).
- Consider a higher top_p if you lowered it earlier.
- Prompt for variation: “Give me 5 very different ideas.”
Symptom: JSON is often invalid
- Lower temperature toward 0.
- Use structured outputs / JSON schema (response_format or Responses text formats).
- Reduce max_output_tokens to discourage multi‑paragraph commentary.
- Add explicit instructions: “Return only JSON, no extra text.”
Symptom: Model repeats itself or loops
- Slightly increase frequency or presence penalties.
- Check if your prompt accidentally encourages restatements (“reiterate the key points again…”).
- Lower temperature if set very high.
Symptom: Long latency / high cost
- Reduce max_output_tokens (and maybe trim the input context).
- Avoid unnecessary retries.
- Use cheaper / faster models for simple tasks.
Symptom: A/B experiment results are noisy
- Use a fixed seed for offline evaluation.
- Make sure you’re not changing multiple things (prompt + temperature + model) at once.
- Increase sample size; a dozen prompts is not enough.
Closing thought
Sampling parameters look intimidating at first—lots of cryptic Greek letters hiding behind a simple text box. But in practice, they reduce to a few clear ideas:
- Temperature controls how bold the model is.
- Top‑p controls how much of the tail you allow.
- Seeds and penalties refine behavior around determinism and repetition.
- Prompts, structure, and token limits still do most of the heavy lifting.
Use one dial at a time, run small experiments, and treat your presets as skills you can learn, just like writing better prompts or designing better tasks. That’s exactly the muscle PractiqAI is built to train: turning vague “AI magic” into concrete, improvable techniques you can practice, measure, and eventually master.
Paweł Brzuszkiewicz
PractiqAI Team
PractiqAI designs guided drills and feedback loops that make learning with AI feel like muscle memory training. Follow along for product notes and workflow ideas from the team.