# True-Thinking Score
This recipe implements the True-Thinking Score metric from "Can Aha Moments Be Fake?" (Zhao et al., 2025) using Tinker. The True-Thinking Score measures the causal contribution of each reasoning step in chain-of-thought to the model's final prediction — revealing that most CoT steps are decorative and only ~2% are truly causal.
## What you'll build
A True-Thinking Score analysis pipeline that generates chain-of-thought reasoning from a thinking model, segments it into steps, and measures each step's causal contribution via perturbation experiments. The output is a per-step score that identifies which reasoning steps actually matter.
## Prerequisites
Set your Tinker API key:
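For example (the key value is a placeholder):

```shell
export TINKER_API_KEY=your-api-key-here
```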
No data download needed — MATH-500 and GSM8K are loaded automatically from HuggingFace.
## Key concepts
- True-thinking step — a reasoning step with a high score (>= 0.7), meaning it causally influences the model's final answer
- Decorative step — a step with a score near 0, which looks like reasoning but barely affects the output
- Self-verification step — a step containing patterns like "Wait, let me re-check" — often decorative despite appearing diligent
- Perturbation — adding small integer offsets ({-3..3}) to numbers in a step, or dropping the step entirely if non-numeric
- Early-exit confidence — measuring P(correct answer | CoT prefix) by appending the answer after a partial reasoning chain
## How it works
For each reasoning step, we run 4 forward passes that measure the model's confidence in the correct answer under different perturbation conditions:
| | Intact step | Perturbed step |
|---|---|---|
| Intact context | Original steps 1..i-1 + original step i | Original steps 1..i-1 + perturbed step i |
| Perturbed context | Perturbed steps 1..i-1 + original step i | Perturbed steps 1..i-1 + perturbed step i |
Each cell measures P(correct answer | CoT prefix) via `compute_logprobs_async`. The True-Thinking Score is the average of the two row-wise differences:
- Row 1 measures necessity — does the model rely on this step?
- Row 2 measures sufficiency — can this step alone drive the correct answer?
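Assuming the four passes return the model's probability of the correct answer under each prefix, the score itself reduces to a few lines (function and argument names here are illustrative, not the recipe's actual API):

```python
def true_thinking_score(
    p_intact_orig: float,  # intact context + original step i
    p_intact_pert: float,  # intact context + perturbed step i
    p_pert_orig: float,    # perturbed context + original step i
    p_pert_pert: float,    # perturbed context + perturbed step i
) -> float:
    """Average of the necessity and sufficiency deltas for step i."""
    necessity = p_intact_orig - p_intact_pert   # row 1: does the model rely on step i?
    sufficiency = p_pert_orig - p_pert_pert     # row 2: can step i alone drive the answer?
    return 0.5 * (necessity + sufficiency)
```

A step whose perturbation collapses confidence in both contexts scores near 1; a decorative step, whose perturbation changes nothing, scores near 0.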
For example, a real step from Qwen3.5-4B on an inclusion-exclusion problem:
## Run it
### 50 MATH-500 problems (~14 minutes with concurrency=64)
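No command is shown for this run; following the pattern of the other invocations and the defaults under Key parameters (dataset=math, model_name=Qwen/Qwen3.5-4B), it would presumably be:

```shell
python -m tinker_cookbook.recipes.true_thinking_score.analyze n_problems=50
```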
### Quick smoke test (5 problems, ~3 minutes)
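Again following the pattern of the other invocations (an assumption, since the command isn't shown), the smoke test would be:

```shell
python -m tinker_cookbook.recipes.true_thinking_score.analyze n_problems=5
```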
### DeepSeek-V3.1 (requires thinking renderer override)

```shell
python -m tinker_cookbook.recipes.true_thinking_score.analyze \
  model_name=deepseek-ai/DeepSeek-V3.1 renderer_name=deepseekv3_thinking \
  n_problems=50
```
### GSM8K with a larger model

```shell
python -m tinker_cookbook.recipes.true_thinking_score.analyze \
  dataset=gsm8k model_name=Qwen/Qwen3.5-27B n_problems=50
```
Results are saved to `/tmp/tinker-examples/tts/<run-name>/`:

- `tts_per_problem.jsonl` — per-problem details (steps, scores)
- `tts_summary.json` — aggregate statistics
## Key parameters

| Parameter | Default | Description |
|---|---|---|
| `model_name` | `Qwen/Qwen3.5-4B` | Thinking model to analyze |
| `renderer_name` | `None` (auto) | Override renderer (e.g. `deepseekv3_thinking` for DeepSeek) |
| `dataset` | `math` | Dataset: `math` (MATH-500) or `gsm8k` |
| `n_problems` | `50` | Number of problems to analyze |
| `concurrency` | `64` | Max parallel problems (steps within a problem are sequential) |
| `max_tokens` | `4096` | Max tokens for CoT generation |
| `seed` | `42` | Random seed for perturbation reproducibility |
## Expected results
50 MATH-500 problems per model, concurrency=64:
| Metric | Paper (R1-Distill-7B) | Qwen3.5-4B | Qwen3.5-27B | DeepSeek-V3.1 (671B) |
|---|---|---|---|---|
| Steps/problem | — | 31.7 | 31.9 | 11.3 |
| Mean score | ~0.03 | 0.054 | 0.070 | 0.144 |
| Score >= 0.7 | 2.3% | 2.0% | 1.8% | 5.0% |
| Score >= 0.3 | 6.4% | 5.7% | 8.0% | 19.1% |
| Decorative (<=0.005) | — | 59.3% | 51.1% | 35.1% |
| SV decorative | 12-21% | 56.5% | 49.1% | 36.3% |
| Accuracy | — | 58% | 62% | 70% |
## Findings

- **The paper's core claim is validated across 3 models.** The ~2% high-scoring rate is consistent across Qwen3.5-4B (2.0%) and Qwen3.5-27B (1.8%), closely matching the paper's 2.3%.
- **Scaling reduces decorative reasoning.** DeepSeek-V3.1 (671B) has far fewer decorative steps (35%) than the Qwen models (51-59%), and solves problems in ~11 steps vs ~32 for Qwen.
- **Self-verification is often fake.** 36-57% of "Wait, let me re-check" steps are decorative across all models. DeepSeek-V3.1 is the most honest at 36%.
## Differences from the paper

- Models: The paper tests DeepSeek-R1-Distill (7B, 8B) and Nemotron-1.5B. We test Qwen3.5-4B, Qwen3.5-27B, and DeepSeek-V3.1 (671B).
- Early-exit cue: The paper appends `"The final result is"` inside the thinking block. We close `</think>` and use `\boxed{}` format. Both measure relative confidence changes.
- Step segmentation: The paper uses sentences (Appendix A). We use discourse markers.
- Steering vectors (Section 6): Not replicated — requires internal activation access not available in Tinker.
## Learn more

- Zhao et al., "Can Aha Moments Be Fake?", 2025
- Source code
- Tinker `compute_logprobs_async` reference