Chain-of-Thought Prompting

Chain-of-thought prompting is a technique that encourages a model to reason through intermediate steps before producing an answer. The simplest version is a phrase such as:

Let's think step by step.

The motivation is straightforward: many problems are not solved by recalling a single fact. They require decomposition, intermediate calculations, comparison, planning, or logical consistency. A model that jumps directly to an answer may skip important steps.

A structured reasoning prompt might ask the model to:

1. Restate the problem.
2. Identify relevant facts.
3. Break the task into steps.
4. Solve each step.
5. Check the result.
6. Provide a final answer.
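
One way to assemble such a prompt in code is sketched below. The function name `build_structured_prompt`, the constant `REASONING_STEPS`, and the framing sentence are illustrative assumptions, not a fixed recipe.

```python
from typing import Sequence

# The six steps listed above, kept as data so they can be trimmed or reordered.
REASONING_STEPS = (
    "Restate the problem.",
    "Identify relevant facts.",
    "Break the task into steps.",
    "Solve each step.",
    "Check the result.",
    "Provide a final answer.",
)


def build_structured_prompt(problem: str, steps: Sequence[str] = REASONING_STEPS) -> str:
    """Return a prompt asking the model to follow numbered reasoning steps."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"Work through the following steps in order:\n{numbered}\n\nProblem: {problem}"
```

Keeping the steps as data rather than hard-coding them in the string makes it easy to drop a step, such as the restatement, for tasks where it only adds length.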

Chain-of-thought prompting can improve performance on arithmetic, logic puzzles, planning, classification with nuanced criteria, and tasks that require comparing multiple constraints.

However, it has limitations. Reasoning text can be verbose, misleading, or confidently wrong. For some applications, you may want the model to reason internally but return only the final answer or a concise explanation.

Before and after: arithmetic reasoning

Consider this prompt without chain-of-thought:

A conference has 18 tables. Each table seats 6 people. 14 seats are empty. How many people are seated?

A weak response might be:

There are 18 × 6 = 108 seats, so 108 people are seated.

The model missed the empty seats.

A chain-of-thought-style prompt adds an explicit reasoning instruction:

Solve the problem step by step, then give the final answer.
A conference has 18 tables. Each table seats 6 people. 14 seats are empty. How many people are seated?

A better response might be:

There are 18 tables with 6 seats each, so the total number of seats is 18 × 6 = 108.
14 of those seats are empty.
So the number of people seated is 108 - 14 = 94.
Final answer: 94 people are seated.

The intermediate steps make it less likely that the model ignores a constraint.
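
Because this problem has a single numeric answer, a response can be spot-checked mechanically. The sketch below recomputes the expected value and reads the number from a `Final answer:` line; the helper name `extract_final_answer` and the regular expression are assumptions made for illustration, and they rely on the response following the format shown above.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[int]:
    """Return the first integer after 'Final answer:', or None if it is absent."""
    match = re.search(r"Final answer:\s*(-?\d+)", response)
    return int(match.group(1)) if match else None


tables, seats_per_table, empty_seats = 18, 6, 14
expected = tables * seats_per_table - empty_seats  # 108 - 14 = 94

response = (
    "There are 18 tables with 6 seats each, so the total number of seats is 18 × 6 = 108.\n"
    "14 of those seats are empty.\n"
    "So the number of people seated is 108 - 14 = 94.\n"
    "Final answer: 94 people are seated."
)
assert extract_final_answer(response) == expected
```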

Before and after: classification with criteria

Without reasoning:

Classify this support ticket as Low, Medium, or High priority.
Ticket: "The billing dashboard is slow for one user, but invoices are still accessible."

Possible output:

High

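This is likely too severe: the issue affects one user, and invoices are still accessible. Adding explicit criteria and a short reasoning instruction gives the model something concrete to apply: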

Classify the ticket as Low, Medium, or High priority.
Reason briefly using the criteria below.

High: system outage, security issue, payment failure for many users
Medium: degraded feature affecting multiple users or a business-critical workflow
Low: inconvenience, isolated issue, workaround available

Ticket: "The billing dashboard is slow for one user, but invoices are still accessible."

Better output:

Priority: Low
Reason: The issue affects one user, the dashboard is slow rather than unavailable, and invoices remain accessible. A workaround appears to exist.

Here, the reasoning is not about math. It is about applying decision criteria consistently.
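
A minimal sketch of how this prompt and its output might be handled in code follows. The criteria text is taken from the example above; the names `PRIORITY_CRITERIA`, `build_triage_prompt`, and `parse_priority` are hypothetical, and the parser assumes the model answers with a `Priority:` line as in the sample output.

```python
# Criteria copied from the prompt above, keyed by priority label.
PRIORITY_CRITERIA = {
    "High": "system outage, security issue, payment failure for many users",
    "Medium": "degraded feature affecting multiple users or a business-critical workflow",
    "Low": "inconvenience, isolated issue, workaround available",
}


def build_triage_prompt(ticket: str) -> str:
    """Assemble the classification prompt with explicit criteria."""
    criteria = "\n".join(f"{level}: {rule}" for level, rule in PRIORITY_CRITERIA.items())
    return (
        "Classify the ticket as Low, Medium, or High priority.\n"
        "Reason briefly using the criteria below.\n\n"
        f"{criteria}\n\n"
        f'Ticket: "{ticket}"'
    )


def parse_priority(response: str) -> str:
    """Read the 'Priority:' line and reject labels outside the allowed set."""
    for line in response.splitlines():
        if line.startswith("Priority:"):
            label = line.split(":", 1)[1].strip()
            if label in PRIORITY_CRITERIA:
                return label
    raise ValueError("no valid priority label found")
```

Rejecting unknown labels keeps a malformed or off-list answer from flowing silently into downstream routing.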

Zero-shot versus few-shot chain-of-thought

Zero-shot chain-of-thought adds a general reasoning instruction without examples:

Answer the question. Think step by step before giving the final answer.

This can help on unfamiliar multi-step tasks, but the depth and format of the reasoning may vary from one run to the next.

Few-shot chain-of-thought provides examples that demonstrate the reasoning pattern:

Question: A box has 5 red balls and 7 blue balls. If 3 red balls are removed, how many balls remain?
Reasoning: Start with 5 + 7 = 12 balls. Remove 3 red balls. 12 - 3 = 9.
Answer: 9

Question: A store has 40 notebooks. It sells 12 and receives 8 more. How many notebooks are there now?
Reasoning: Start with 40. After selling 12, there are 28. After receiving 8, there are 36.
Answer: 36

Question: A team has 9 developers. 2 are on vacation and 3 contractors join. How many people are available?
Reasoning:

Few-shot examples show the model the level of detail, format, and reasoning style expected. This is useful when you need consistent reasoning behavior across many inputs.
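
The prompt above can be assembled from a list of worked examples, which makes it easier to swap examples in and out. This is a sketch under the assumption that each example is a (question, reasoning, answer) triple; `EXAMPLES` and `build_few_shot_prompt` are illustrative names.

```python
from typing import List, Tuple

# (question, reasoning, answer) triples taken from the examples above.
EXAMPLES: List[Tuple[str, str, str]] = [
    (
        "A box has 5 red balls and 7 blue balls. If 3 red balls are removed, how many balls remain?",
        "Start with 5 + 7 = 12 balls. Remove 3 red balls. 12 - 3 = 9.",
        "9",
    ),
    (
        "A store has 40 notebooks. It sells 12 and receives 8 more. How many notebooks are there now?",
        "Start with 40. After selling 12, there are 28. After receiving 8, there are 36.",
        "36",
    ),
]


def build_few_shot_prompt(question: str, examples: List[Tuple[str, str, str]] = EXAMPLES) -> str:
    """Join the worked examples, then leave 'Reasoning:' open for the model to complete."""
    blocks = [f"Question: {q}\nReasoning: {r}\nAnswer: {a}" for q, r, a in examples]
    blocks.append(f"Question: {question}\nReasoning:")
    return "\n\n".join(blocks)
```

Ending the prompt on an open `Reasoning:` line nudges the model to continue in the same Question, Reasoning, Answer format as the examples.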

Self-consistency

Self-consistency is an extension of chain-of-thought. Instead of sampling one reasoning path, the system samples multiple reasoning paths and chooses the most common final answer.

Conceptually:

1. Prompt the model multiple times with reasoning enabled.
2. Collect final answers.
3. Choose the majority answer.

Example:

Run 1 final answer: 42
Run 2 final answer: 40
Run 3 final answer: 42
Run 4 final answer: 42
Run 5 final answer: 40
Majority answer: 42

Self-consistency can improve performance on problems where individual reasoning attempts are noisy. It costs more because it requires multiple model calls. It is most useful for tasks where there is a clear final answer, such as math, logic, or constrained classification.
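
The voting step can be sketched in a few lines. Here `sample` is a hypothetical callable that runs one reasoning attempt and returns only its final answer; in a real system it would wrap a model call with sampling enabled. The stand-in below simply replays the five answers from the example above.

```python
from collections import Counter
from typing import Callable, List


def self_consistent_answer(sample: Callable[[], str], n: int = 5) -> str:
    """Call `sample` n times and return the most common final answer."""
    answers: List[str] = [sample() for _ in range(n)]
    # On a tie, Counter keeps the answer that was seen first.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


# Stand-in for a model: replays the final answers from the five runs above.
canned = iter(["42", "40", "42", "42", "40"])
majority = self_consistent_answer(lambda: next(canned), n=5)
assert majority == "42"
```

Voting on the extracted final answer, rather than on the full reasoning text, matters here: two runs can reach the same answer through differently worded reasoning.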

For open-ended writing, majority voting is less meaningful. There may be many good answers rather than one correct answer.

When chain-of-thought hurts

Chain-of-thought is not always beneficial.

For simple factual recall, it can add unnecessary verbosity:

Question: What is the capital of France?

A step-by-step analysis is not needed. The answer is simply Paris.

For creative tasks, chain-of-thought can make outputs stiff or overplanned. A prompt like “Think step by step before writing a poem” may reduce spontaneity.

For structured extraction, lengthy reasoning may interfere with clean output. If the task is to return valid JSON, asking for reasoning inside the response may increase the chance of format errors.
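
The failure mode is easy to reproduce with a strict parser. In the sketch below, both strings are made-up stand-ins for model output, and `is_valid_json` is a hypothetical helper; the point is only that reasoning text placed ahead of the JSON makes the whole response unparseable.

```python
import json

# Made-up model outputs for illustration.
clean = '{"name": "Ada", "role": "engineer"}'
with_reasoning = (
    "First I look for the name, then the role.\n"
    '{"name": "Ada", "role": "engineer"}'
)


def is_valid_json(text: str) -> bool:
    """Return True only if the entire string parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```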

For sensitive or high-stakes domains, visible reasoning can sound persuasive even when it is wrong. In those cases, it may be better to ask for a concise justification, citations, or a confidence note rather than a long reasoning trace.

Tree of thoughts

Tree of thoughts prompting extends the idea of chain-of-thought. Instead of following a single reasoning path, the model explores several candidate paths, evaluates them, and either selects the best one or combines the strongest parts of several.

A simplified pattern:

1. Generate three possible solution strategies.
2. Evaluate the strengths and weaknesses of each.
3. Choose the best strategy.
4. Solve using that strategy.
5. Check the answer.

This can help with planning, puzzle solving, architecture design, and complex tradeoff analysis. It is more expensive than simple prompting and can become verbose, so it is best used when the task genuinely benefits from exploring alternatives.

Example prompt:

We need to design a caching strategy for a high-traffic API.
Generate three possible approaches.
Evaluate each for complexity, cost, freshness, and failure modes.
Then recommend one approach and explain why.

The model is not just reasoning linearly. It is comparing branches.
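
When the branching is driven from code rather than from a single prompt, the generate, evaluate, and choose steps can be sketched as below. This covers only one level of branching, not a full tree search. `propose` and `evaluate` are hypothetical callables that would wrap model calls; the stub strategies and scores are placeholders invented for the demo.

```python
from typing import Callable, List


def pick_best_strategy(
    task: str,
    propose: Callable[[str], List[str]],
    evaluate: Callable[[str, str], float],
) -> str:
    """Generate candidate strategies, score each one, and return the top scorer."""
    candidates = propose(task)
    if not candidates:
        raise ValueError("propose() returned no candidates")
    return max(candidates, key=lambda strategy: evaluate(task, strategy))


# Stubs standing in for model calls; names and scores are placeholders.
stub_scores = {"approach A": 0.6, "approach B": 0.4, "approach C": 0.8}
best = pick_best_strategy(
    "Design a caching strategy for a high-traffic API.",
    propose=lambda task: list(stub_scores),
    evaluate=lambda task, strategy: stub_scores[strategy],
)
assert best == "approach C"
```

Each candidate costs at least one extra evaluation call, which is where the added expense of this pattern comes from.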

Chain-of-thought and modern reasoning models

Modern reasoning-oriented models may use hidden scratchpads, extended thinking, or internal deliberation mechanisms. The user may not see every internal reasoning step, but the system is still allocating more computation to intermediate reasoning.

The practical relationship is this: chain-of-thought prompting asks the model to externalize its reasoning, while reasoning models may perform more of that reasoning internally. For developers, the goal is not always to display every step. The goal is to get better task performance and appropriate explanations.

In production, you might ask for:

Think carefully before answering. Return only:
- Final answer
- Brief justification
- Any assumptions

This gives the model permission to reason while keeping the user-facing output concise.

Practical guide: when to use chain-of-thought

Use reasoning prompts when the task involves:

Multi-step arithmetic
Logic or constraint satisfaction
Planning
Debugging
Comparing tradeoffs
Applying nuanced criteria
Explaining a decision

Avoid or minimize reasoning prompts when the task involves:

Simple factual recall
Short transformations
Strict JSON extraction
Creative generation where spontaneity matters
Cases where a long explanation may overpersuade the user

A practical compromise is to ask for a brief rationale rather than a full chain:

Return the answer and a brief explanation of the key reason.
Do not include unnecessary intermediate steps.
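
The two lists above can be turned into a simple routing rule. This is a sketch only: the task-type labels, the name `instruction_for`, and the wording of the direct instruction are assumptions, and a real application would need its own way of deciding which category a request falls into.

```python
# Hypothetical task-type labels that mirror the two lists above.
REASONING_TASKS = {
    "arithmetic", "logic", "planning", "debugging",
    "tradeoffs", "criteria", "explanation",
}
DIRECT_TASKS = {
    "factual_recall", "short_transformation", "json_extraction", "creative",
}

STEP_BY_STEP = "Solve the problem step by step, then give the final answer."
DIRECT = "Answer directly. Do not include reasoning."  # assumed wording
BRIEF_RATIONALE = (
    "Return the answer and a brief explanation of the key reason.\n"
    "Do not include unnecessary intermediate steps."
)


def instruction_for(task_type: str) -> str:
    """Pick a reasoning instruction for a task type; unknown types get the compromise."""
    if task_type in REASONING_TASKS:
        return STEP_BY_STEP
    if task_type in DIRECT_TASKS:
        return DIRECT
    return BRIEF_RATIONALE
```

Falling back to the brief-rationale instruction for unrecognized task types keeps the default output short while still asking for a reason.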

Chain-of-thought prompting is a tool, not a universal setting. Use it when the task benefits from decomposition, and constrain it when clarity, brevity, or format reliability matter more.

Learning objectives