True-Thinking Score (TTS)
Replicates and validates the True-Thinking Score metric from "Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought" (Zhao et al., 2025).
TTS measures the causal contribution of each reasoning step in chain-of-thought (CoT) to the model's final prediction. The paper's key finding is that most CoT steps are decorative — they look like reasoning but barely influence the answer. Only ~2% of steps are truly causal.
This recipe implements TTS computation using the Tinker API and validates the finding across 3 models (Qwen3.5-4B, Qwen3.6-27B, DeepSeek-V3.1) on MATH-500 problems.
What the paper reports
The authors test three distilled reasoning models (DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Llama-8B, Nemotron-1.5B) on AMC, AIME, MATH, and CommonsenseQA. Key findings on AIME with DeepSeek-R1-Distill-Qwen-7B:
- Mean TTS ~0.03 — most steps contribute almost nothing
- Only 2.3% of steps have TTS >= 0.7 (truly causal)
- Only 6.4% of steps have TTS >= 0.3
- 12% of self-verification steps in Qwen-7B (21% in Nemotron) have TTS < 0.005 — "aha moments" that are purely decorative
The paper also extracts "TrueThinking" steering vectors from internal activations (Section 6), achieving ~55% prediction flip rates on AMC/AIME. This requires residual-stream access, which Tinker does not expose, so we focus on TTS computation only.
How TTS works
For each reasoning step \(s_i\) in a chain-of-thought, we run 4 forward passes that measure the model's confidence in the correct answer under different perturbation conditions:
| | Intact step (\(x\)=1) | Perturbed step (\(x\)=0) |
|---|---|---|
| Intact context (\(c\)=1) | \(S_1(1)\): original steps 1..i-1 + original step i | \(S_0(1)\): original steps 1..i-1 + perturbed step i |
| Perturbed context (\(c\)=0) | \(S_1(0)\): perturbed steps 1..i-1 + original step i | \(S_0(0)\): perturbed steps 1..i-1 + perturbed step i |
Each cell measures \(S_x(c) = P(y^* \mid \text{context}=c,\;\text{step}=x)\), the probability of the correct answer given that CoT prefix, computed via compute_logprobs_async.
TTS is then the average of the two row-wise diffs:
- Row 1 \(|S_1(1) - S_0(1)|\): hold context intact, toggle the step. Measures necessity — does the model rely on this step?
- Row 2 \(|S_1(0) - S_0(0)|\): hold context perturbed, toggle the step. Measures sufficiency — can this step alone drive the correct answer?
Testing under both contexts matters because a single diff can miss "OR-type" steps: two steps that independently lead to the answer. With intact context, each looks unimportant (the other still works). With perturbed context, each is revealed as sufficient on its own.
A decorative step has TTS \(\approx\) 0: toggling it makes no difference. A true-thinking step has high TTS: the model's prediction meaningfully changes when you perturb it.
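The definition above reduces to a few lines of arithmetic. The following is a minimal sketch rather than the recipe's own code: the function name `true_thinking_score` and its argument names are hypothetical, and the four inputs are assumed to be probabilities already measured for the four cells of the table. The example values are made up to illustrate an OR-type step and a decorative step.

```python
def true_thinking_score(s1_c1: float, s0_c1: float, s1_c0: float, s0_c0: float) -> float:
    """Average of the two row-wise absolute differences.

    s1_c1 = S_1(1): intact context, intact step
    s0_c1 = S_0(1): intact context, perturbed step
    s1_c0 = S_1(0): perturbed context, intact step
    s0_c0 = S_0(0): perturbed context, perturbed step
    """
    necessity = abs(s1_c1 - s0_c1)    # row 1: context held intact
    sufficiency = abs(s1_c0 - s0_c0)  # row 2: context held perturbed
    return 0.5 * (necessity + sufficiency)


# Made-up confidences for an OR-type step: with intact context another step
# already carries the answer, so only the perturbed-context row reveals it.
or_type = true_thinking_score(s1_c1=0.9, s0_c1=0.9, s1_c0=0.9, s0_c0=0.1)

# Made-up confidences for a decorative step: toggling it changes nothing.
decorative = true_thinking_score(s1_c1=0.9, s0_c1=0.9, s1_c0=0.1, s0_c0=0.1)

print(f"OR-type: {or_type:.2f}, decorative: {decorative:.2f}")
```

Using only the first row would score the OR-type step at zero, which is why both rows are averaged.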
What we implement in Tinker
We replicate TTS computation using Tinker's compute_logprobs_async API:
- Generate CoT: Sample from a thinking model (greedy, temperature=0) using the renderer's chat template. The model produces <think>...</think> blocks with extended reasoning.
- Segment steps: Split the thinking text using discourse markers (numbered lists, transition words like "So", "Wait", "Therefore", etc.).
- Perturb steps: For numeric steps, add small integer offsets from {-3,-2,-1,1,2,3} to numbers (matching Appendix A). For example, a real step from Qwen3.5-4B's CoT on an inclusion-exclusion problem:

  For non-numeric steps, drop them entirely.
- Early-exit confidence: For each of the four conditions, build a sequence [prompt + <think> CoT_prefix </think> \boxed{answer}] using the renderer's chat template, then measure \(P(\text{answer tokens} \mid \text{prefix})\) via compute_logprobs_async.
- Compute TTS from the four confidence measurements.
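The segmentation and perturbation steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in tts.py: the function names, the short marker list, and the boundary regex are all hypothetical, and only unsigned integers are shifted (a decimal such as 3.5 would be treated as two integers, and a shifted value may go negative).

```python
import random
import re

# Hypothetical marker set; the recipe's real list is longer and may differ.
_MARKER = r"(?:\d+[.)]\s|So\b|Wait\b|Therefore\b)"
# A step boundary is a marker that starts a line or follows sentence-ending punctuation.
_BOUNDARY = re.compile(rf"(?:(?<=\n)|(?<=[.!?]\s))(?={_MARKER})")
_INT = re.compile(r"\d+")
OFFSETS = (-3, -2, -1, 1, 2, 3)


def segment_steps(thinking: str) -> list[str]:
    """Split thinking text at discourse markers, dropping empty pieces."""
    return [part.strip() for part in _BOUNDARY.split(thinking) if part.strip()]


def shift_numbers(text: str, rng: random.Random) -> str:
    """Add a random nonzero offset to every integer in the text."""
    return _INT.sub(lambda m: str(int(m.group()) + rng.choice(OFFSETS)), text)


def perturb_step(step: str, rng: random.Random) -> str:
    """Perturb one step: shift its numbers, or drop it if it has none."""
    if not _INT.search(step):
        return ""  # non-numeric steps are dropped entirely
    return shift_numbers(step, rng)


rng = random.Random(42)  # seeded so perturbations are reproducible
cot = "First, 3 + 4 = 7. So the total is 7. Wait, let me re-check. Therefore the answer is 7."
for step in segment_steps(cot):
    print(repr(step), "->", repr(perturb_step(step, rng)))
```

Because every offset is nonzero, each integer in a numeric step is guaranteed to change, which keeps the perturbed condition distinct from the intact one.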
Approximations vs. the paper:
- Models: The paper uses DeepSeek-R1-Distill (7B, 8B) and Nemotron-1.5B. We use Qwen3.5-4B, Qwen3.6-27B, and DeepSeek-V3.1 (671B-A37B). All produce <think>...</think> delimited CoT. Our DeepSeek-V3.1 is a much larger non-distilled model than the paper's distilled 7B variant.
- Dataset: The paper tests on AMC, AIME, MATH, and CommonsenseQA. We test on MATH-500 (a held-out subset of MATH).
- Early-exit cue: The paper appends "The final result is" inside the reasoning block, probing "what would you predict mid-thought?" We close the </think> block and use \boxed{} format, probing "if you stopped thinking here, what would your final answer be?" Both measure how the model's answer-prediction changes when a step is perturbed (i.e. TTS is relative), so the choice of cue mainly shifts the baseline probability, not the TTS scores.
- Confidence measurement: The paper uses "model's confidence Pr(y)" via early-exit prompting but does not fully specify the computation. We compute exp(sum(logprobs)) over the answer tokens, giving the joint probability P(answer_tokens | prefix). Since TTS measures relative changes in confidence, the exact metric should not significantly affect the TTS scores.
- Step segmentation: The paper treats sentences as steps (Appendix A). We use discourse markers (numbered lists, transition words). Both are heuristic and produce comparable step counts.
- Perturbation (matches Appendix A): We add integer offsets from {-3,-2,-1,1,2,3} to numbers and drop non-numeric steps entirely, matching the paper. Context perturbation only changes numbers.
- No steering vectors: The paper's Section 6 extracts "TrueThinking" steering directions from internal activations to control step reliance, achieving ~55% prediction flip rates vs <30% for random vectors. Tinker does not expose residual-stream activations, so this part is not replicated.
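The early-exit cue and the exp(sum(logprobs)) confidence described in the list above can be illustrated without calling the API. The sketch below makes several assumptions: `early_exit_text` and `answer_confidence` are hypothetical names, the sequence is built by plain string concatenation rather than the renderer's chat template the recipe uses, and the per-token log-probabilities are assumed to be already available as a list whose final entries correspond to the answer tokens (entries may be None where no log-probability exists, such as the first token).

```python
import math
from typing import Optional, Sequence


def early_exit_text(prompt: str, cot_prefix: str, answer: str) -> str:
    """Close the thinking block after the prefix and append the boxed answer."""
    return f"{prompt}<think>\n{cot_prefix}\n</think>\n\\boxed{{{answer}}}"


def answer_confidence(logprobs: Sequence[Optional[float]], n_answer_tokens: int) -> float:
    """Joint probability of the trailing answer tokens: exp(sum(logprobs))."""
    if n_answer_tokens <= 0 or n_answer_tokens > len(logprobs):
        raise ValueError("n_answer_tokens must be between 1 and len(logprobs)")
    tail = logprobs[-n_answer_tokens:]
    if any(lp is None for lp in tail):
        raise ValueError("answer tokens must all have log-probabilities")
    return math.exp(sum(tail))


# Made-up log-probabilities for a 4-token sequence whose last 2 tokens are the answer.
example_logprobs = [None, -0.5, -0.1, -0.2]
print(early_exit_text("Q: What is 1 + 1?\n", "1 + 1 = 2.", "2"))
print(answer_confidence(example_logprobs, n_answer_tokens=2))
```

Summing log-probabilities before exponentiating yields the joint probability of the whole answer string, so multi-token answers are scored as a unit rather than token by token.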
Setup
No special data download is needed — MATH-500 and GSM8K are loaded automatically from HuggingFace. You only need a Tinker API key:
Running the recipe
50 MATH-500 problems (~14 minutes with concurrency=64):
Quick smoke test (5 problems, ~3 minutes):
DeepSeek-V3.1 (requires thinking renderer override):
```shell
python -m tinker_cookbook.recipes.true_thinking_score.analyze \
  model_name=deepseek-ai/DeepSeek-V3.1 renderer_name=deepseekv3_thinking \
  n_problems=50
```
GSM8K, larger model:
```shell
python -m tinker_cookbook.recipes.true_thinking_score.analyze \
  dataset=gsm8k model_name=Qwen/Qwen3.6-27B n_problems=50
```
Results are saved to /tmp/tinker-examples/tts/<run-name>/:
- tts_per_problem.jsonl — per-problem details (steps, TTS scores)
- tts_summary.json — aggregate statistics
Using TTS programmatically:
```python
import asyncio

import tinker
from tinker_cookbook.recipes.true_thinking_score.tts import generate_cot_and_compute_tts


async def main():
    service_client = tinker.ServiceClient()
    result = await generate_cot_and_compute_tts(
        service_client=service_client,
        model_name="Qwen/Qwen3.5-4B",
        question="How many positive integers less than 100 are divisible by 3, 5, or 7?",
        answer_str="54",
        max_tokens=4096,
    )
    print(result.summary())
    for step in result.step_scores:
        # Tag only the two extremes; steps in between are printed without a label.
        tag = " [DECORATIVE]" if step.tts <= 0.005 else ""
        tag = " [TRUE-THINKING]" if step.tts >= 0.7 else tag
        print(f"  Step {step.step_index}: TTS={step.tts:.4f}{tag}")


asyncio.run(main())
```
Unit tests (no API key needed):
Key parameters
| Parameter | Default | Description |
|---|---|---|
| model_name | Qwen/Qwen3.5-4B | Thinking model to analyze |
| renderer_name | None (auto) | Override renderer (e.g. deepseekv3_thinking for DeepSeek) |
| dataset | math | Dataset: math (MATH-500) or gsm8k |
| n_problems | 50 | Number of problems to analyze |
| concurrency | 64 | Max parallel problems (steps within a problem are sequential) |
| max_tokens | 4096 | Max tokens for CoT generation |
| seed | 42 | Random seed for perturbation reproducibility |
Results
50 MATH-500 problems per model, concurrency=64:
| Metric | Paper (R1-Distill-7B) | Qwen3.5-4B | Qwen3.6-27B | DeepSeek-V3.1 (671B) |
|---|---|---|---|---|
| Steps/problem | — | 31.7 | 27.1 | 11.3 |
| Mean TTS | ~0.03 | 0.054 | 0.086 | 0.144 |
| TTS >= 0.7 | 2.3% | 2.0% | 3.0% | 5.0% |
| TTS >= 0.3 | 6.4% | 5.7% | 11.1% | 19.1% |
| Decorative (TTS <= 0.005) | — | 59.3% | 55.1% | 35.1% |
| Self-verification (SV) steps | — | 115 | 137 | 113 |
| SV steps that are decorative | 12-21% | 56.5% | 54.7% | 36.3% |
| Accuracy | — | 58% | 60% | 70% |
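
The TTS rows of the table are simple aggregates over per-step scores. The sketch below shows one way to compute them. It assumes all steps are pooled across problems before averaging, which the text does not specify, and `summarize_tts`, its key names, and the sample scores are hypothetical and made up.

```python
from statistics import mean
from typing import Sequence


def summarize_tts(scores: Sequence[float]) -> dict[str, float]:
    """Aggregate per-step TTS scores using the thresholds from the table."""
    if not scores:
        raise ValueError("need at least one TTS score")
    n = len(scores)
    return {
        "mean_tts": mean(scores),
        "frac_tts_ge_0.7": sum(s >= 0.7 for s in scores) / n,         # true-thinking
        "frac_tts_ge_0.3": sum(s >= 0.3 for s in scores) / n,
        "frac_decorative": sum(s <= 0.005 for s in scores) / n,       # TTS <= 0.005
    }


# Made-up scores for ten steps, for illustration only.
sample_scores = [0.0, 0.001, 0.004, 0.02, 0.35, 0.8, 0.0, 0.0, 0.1, 0.75]
summary = summarize_tts(sample_scores)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Pooling weights long chains more heavily than short ones. Averaging per problem first would be an equally defensible choice and could shift the percentages.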
Findings
- The paper's core claim is validated across 3 models: The high-TTS fraction is in the low single digits for both Qwen3.5-4B (2.0%) and Qwen3.6-27B (3.0%), both within ~1 point of the paper's 2.3%. DeepSeek-V3.1 is higher at 5.0%, likely because its concise reasoning style (11 steps vs 27-32) packs more causal content per step.
- Scaling reduces decorative reasoning: DeepSeek-V3.1 (671B) has far fewer decorative steps (35.1%) than Qwen models (55-59%). Among Qwen models, the larger 27B also has fewer decorative steps (55.1%) than 4B (59.3%).
- Larger models are more concise: DeepSeek-V3.1 solves problems in 11.3 steps on average vs 27-32 for Qwen models. Each step carries more causal weight: the model "wastes" fewer steps exploring dead ends.
- Self-verification is often fake: 36-57% of "Wait, let me re-check" steps are decorative across all models. DeepSeek-V3.1 is the most honest at 36%. This is higher than the paper's 12-21%, likely because of step granularity: the paper segments by sentences, bundling self-verification cues ("Wait") with the recomputation that follows into one step. Our discourse-marker segmentation splits these apart, creating short SV fragments (e.g. just "Wait, let me re-check.") that are individually decorative even if the surrounding recomputation is not.
- TTS rises near the answer: The final steps before the answer consistently have the highest TTS, suggesting the model "commits" to an answer path late in the chain. Early reasoning steps explore without making progress.
- Wrong answers have lower TTS: Problems where the model gets the wrong answer tend to have near-zero TTS across all steps, meaning the model never locks onto a viable solution path.
Files
| File | Description |
|---|---|
| tts.py | Core TTS computation: step segmentation, number perturbation, early-exit confidence, TTS scoring |
| analyze.py | CLI entry point for running TTS analysis on MATH-500 or GSM8K |
| run_small_experiment.py | Quick validation script (3 hardcoded problems) |
| tts_test.py | 18 unit tests for segmentation, perturbation, and self-verification detection |
References
- Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought (Zhao et al., 2025)
- Tinker API docs: compute_logprobs_async reference