True-Thinking Score

This recipe implements the True-Thinking Score metric from "Can Aha Moments Be Fake?" (Zhao et al., 2025) using Tinker. The True-Thinking Score measures the causal contribution of each reasoning step in chain-of-thought to the model's final prediction — revealing that most CoT steps are decorative and only ~2% are truly causal.

What you'll build

A True-Thinking Score analysis pipeline that generates chain-of-thought reasoning from a thinking model, segments it into steps, and measures each step's causal contribution via perturbation experiments. The output is a per-step score that identifies which reasoning steps actually matter.

Prerequisites

uv pip install tinker-cookbook

Set your Tinker API key:

export TINKER_API_KEY=<your-key>

No data download needed — MATH-500 and GSM8K are loaded automatically from HuggingFace.

Key concepts

  • True-thinking step — a reasoning step with a high score (>= 0.7), meaning it causally influences the model's final answer
  • Decorative step — a step with a score near 0, which looks like reasoning but barely affects the output
  • Self-verification step — a step containing patterns like "Wait, let me re-check" — often decorative despite appearing diligent
  • Perturbation — adding small integer offsets ({-3..3}) to numbers in a step, or dropping the step entirely if non-numeric
  • Early-exit confidence — measuring P(correct answer | CoT prefix) by appending the answer after a partial reasoning chain

How it works

For each reasoning step, we run 4 forward passes that measure the model's confidence in the correct answer under different perturbation conditions:

|                   | Intact step                              | Perturbed step                            |
|-------------------|------------------------------------------|-------------------------------------------|
| Intact context    | Original steps 1..i-1 + original step i  | Original steps 1..i-1 + perturbed step i  |
| Perturbed context | Perturbed steps 1..i-1 + original step i | Perturbed steps 1..i-1 + perturbed step i |

Each cell measures P(correct answer | CoT prefix) via compute_logprobs_async. Writing \(S_s(c)\) for the model's confidence when the step is intact (\(s = 1\)) or perturbed (\(s = 0\)) and the preceding context is intact (\(c = 1\)) or perturbed (\(c = 0\)), the True-Thinking Score is the average of the two row-wise differences:

\[\text{True-Thinking Score} = \tfrac{1}{2}\bigl(|S_1(1) - S_0(1)| + |S_1(0) - S_0(0)|\bigr)\]
  • Row 1 measures necessity — does the model rely on this step?
  • Row 2 measures sufficiency — can the intact step alone recover the correct answer even when the surrounding context is perturbed?
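Once the four confidences are in hand, the score itself is plain arithmetic. In this sketch, conf maps (step state, context state) pairs (1 = intact, 0 = perturbed) to hypothetical P(correct answer | CoT prefix) values; the real pipeline obtains them from compute_logprobs_async.

```python
def true_thinking_score(conf: dict[tuple[int, int], float]) -> float:
    """Average the step's effect on answer confidence under each context.
    conf[(s, c)]: s is the step state, c the context state (1 = intact,
    0 = perturbed)."""
    necessity = abs(conf[(1, 1)] - conf[(0, 1)])    # row 1: intact context
    sufficiency = abs(conf[(1, 0)] - conf[(0, 0)])  # row 2: perturbed context
    return 0.5 * (necessity + sufficiency)

# Hypothetical confidences for a causal step: perturbing it moves the answer.
causal = {(1, 1): 0.9, (0, 1): 0.2, (1, 0): 0.8, (0, 0): 0.1}
# Hypothetical confidences for a decorative step: the answer barely depends on it.
decorative = {(1, 1): 0.9, (0, 1): 0.89, (1, 0): 0.3, (0, 0): 0.3}
```

With these made-up numbers the causal step scores 0.7 and the decorative one 0.005, matching the thresholds used in Key concepts.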

For example, here is a real step and its perturbation from Qwen3.5-4B on an inclusion-exclusion problem:

Original:  (33 + 19 + 14) - (6 + 4 + 2) + 0
Perturbed: (30 + 16 + 17) - (5 + 2 + 0) + -2

Run it

50 MATH-500 problems (~14 minutes with concurrency=64)

python -m tinker_cookbook.recipes.true_thinking_score.analyze \
    dataset=math n_problems=50

Quick smoke test (5 problems, ~3 minutes)

python -m tinker_cookbook.recipes.true_thinking_score.analyze \
    n_problems=5

DeepSeek-V3.1 (requires thinking renderer override)

python -m tinker_cookbook.recipes.true_thinking_score.analyze \
    model_name=deepseek-ai/DeepSeek-V3.1 renderer_name=deepseekv3_thinking \
    n_problems=50

GSM8K with a larger model

python -m tinker_cookbook.recipes.true_thinking_score.analyze \
    dataset=gsm8k model_name=Qwen/Qwen3.5-27B n_problems=50

Results are saved to /tmp/tinker-examples/tts/<run-name>/:

  • tts_per_problem.jsonl — per-problem details (steps, scores)
  • tts_summary.json — aggregate statistics

Key parameters

| Parameter     | Default         | Description                                                 |
|---------------|-----------------|-------------------------------------------------------------|
| model_name    | Qwen/Qwen3.5-4B | Thinking model to analyze                                   |
| renderer_name | None (auto)     | Override renderer (e.g. deepseekv3_thinking for DeepSeek)   |
| dataset       | math            | Dataset: math (MATH-500) or gsm8k                           |
| n_problems    | 50              | Number of problems to analyze                               |
| concurrency   | 64              | Max parallel problems (steps within a problem are sequential) |
| max_tokens    | 4096            | Max tokens for CoT generation                               |
| seed          | 42              | Random seed for perturbation reproducibility                |
Expected results

50 MATH-500 problems per model, concurrency=64:

| Metric               | Paper (R1-Distill-7B) | Qwen3.5-4B | Qwen3.5-27B | DeepSeek-V3.1 (671B) |
|----------------------|-----------------------|------------|-------------|----------------------|
| Steps/problem        | —                     | 31.7       | 31.9        | 11.3                 |
| Mean score           | ~0.03                 | 0.054      | 0.070       | 0.144                |
| Score >= 0.7         | 2.3%                  | 2.0%       | 1.8%        | 5.0%                 |
| Score >= 0.3         | 6.4%                  | 5.7%       | 8.0%        | 19.1%                |
| Decorative (<=0.005) | —                     | 59.3%      | 51.1%       | 35.1%                |
| SV decorative        | 12-21%                | 56.5%      | 49.1%       | 36.3%                |
| Accuracy             | —                     | 58%        | 62%         | 70%                  |

Findings

  1. The paper's core claim is validated across 3 models. The ~2% high-scoring rate is consistent across Qwen3.5-4B (2.0%) and Qwen3.5-27B (1.8%), closely matching the paper's 2.3%.

  2. Scaling reduces decorative reasoning. DeepSeek-V3.1 (671B) has far fewer decorative steps (35%) than Qwen models (51-59%), and solves problems in 11 steps vs 32 for Qwen.

  3. Self-verification is often fake. 36-57% of "Wait, let me re-check" steps are decorative across all models. DeepSeek-V3.1 is the most honest at 36%.

Differences from the paper

  • Models: The paper tests DeepSeek-R1-Distill (7B, 8B) and Nemotron-1.5B. We test Qwen3.5-4B, Qwen3.5-27B, and DeepSeek-V3.1 (671B).
  • Early-exit cue: The paper appends "The final result is" inside the thinking block. We close </think> and use \boxed{} format. Both measure relative confidence changes.
  • Step segmentation: The paper uses sentences (Appendix A). We use discourse markers.
  • Steering vectors (Section 6): Not replicated — requires internal activation access not available in Tinker.
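For illustration, discourse-marker segmentation can look like the sketch below. The marker list is a hypothetical one chosen for this example, not the recipe's actual set.

```python
import re

# Hypothetical marker set; the recipe's list may differ.
MARKERS = r"(?:Wait|So|Therefore|First|Next|Then|Now|Let me|Alternatively|Actually)"

def segment_cot(cot: str) -> list[str]:
    """Split a chain of thought at discourse markers that open a new
    reasoning move, keeping each marker with the step it introduces."""
    parts = re.split(rf"(?=\b{MARKERS}\b)", cot)
    return [p.strip() for p in parts if p.strip()]
```

A zero-width lookahead split keeps the marker attached to its step, so self-verification steps like "Wait, let me re-check" stay intact for pattern matching.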

Learn more