True-Thinking Score

This recipe implements the True-Thinking Score metric from "Can Aha Moments Be Fake?" (Zhao et al., 2025) using Tinker. The True-Thinking Score measures the causal contribution of each reasoning step in chain-of-thought to the model's final prediction — revealing that most CoT steps are decorative and only ~2% are truly causal.

What you'll build

A True-Thinking Score analysis pipeline that generates chain-of-thought reasoning from a thinking model, segments it into steps, and measures each step's causal contribution via perturbation experiments. The output is a per-step score that identifies which reasoning steps actually matter.

Prerequisites

uv pip install tinker-cookbook

Set your Tinker API key:

export TINKER_API_KEY=<your-key>

No data download needed — MATH-500 and GSM8K are loaded automatically from HuggingFace.

Key concepts

  • True-thinking step — a reasoning step with a high score (>= 0.7), meaning it causally influences the model's final answer
  • Decorative step — a step with a score near 0, which looks like reasoning but barely affects the output
  • Self-verification step — a step containing patterns like "Wait, let me re-check" — often decorative despite appearing diligent
  • Perturbation — adding small integer offsets ({-3..3}) to numbers in a step, or dropping the step entirely if non-numeric
  • Early-exit confidence — measuring P(correct answer | CoT prefix) by appending the answer after a partial reasoning chain

How it works

For each reasoning step, we run 4 forward passes that measure the model's confidence in the correct answer under different perturbation conditions:

|                   | Intact step                              | Perturbed step                            |
|-------------------|------------------------------------------|-------------------------------------------|
| Intact context    | Original steps 1..i-1 + original step i  | Original steps 1..i-1 + perturbed step i  |
| Perturbed context | Perturbed steps 1..i-1 + original step i | Perturbed steps 1..i-1 + perturbed step i |

Each cell measures P(correct answer | CoT prefix) via compute_logprobs_async. Writing \(S_s(c)\) for the model's confidence when the step is intact (\(s = 1\)) or perturbed (\(s = 0\)) and the preceding context is intact (\(c = 1\)) or perturbed (\(c = 0\)), the True-Thinking Score is the average of the two row-wise differences:

\[\text{True-Thinking Score} = \tfrac{1}{2}\bigl(|S_1(1) - S_0(1)| + |S_1(0) - S_0(0)|\bigr)\]
  • Row 1 measures necessity — does the model rely on this step?
  • Row 2 measures sufficiency — can the intact step alone recover the correct answer even when the surrounding context is perturbed?
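Once the four confidences are in hand, the score itself is plain arithmetic. In this sketch, conf maps (step state, context state) pairs (1 = intact, 0 = perturbed) to hypothetical P(correct answer | CoT prefix) values; the real pipeline obtains them from compute_logprobs_async.

```python
def true_thinking_score(conf: dict[tuple[int, int], float]) -> float:
    """Average the step's effect on answer confidence under each context.
    conf[(s, c)]: s is the step state, c the context state (1 = intact,
    0 = perturbed)."""
    necessity = abs(conf[(1, 1)] - conf[(0, 1)])    # row 1: intact context
    sufficiency = abs(conf[(1, 0)] - conf[(0, 0)])  # row 2: perturbed context
    return 0.5 * (necessity + sufficiency)

# Hypothetical confidences for a causal step: perturbing it moves the answer.
causal = {(1, 1): 0.9, (0, 1): 0.2, (1, 0): 0.8, (0, 0): 0.1}
# Hypothetical confidences for a decorative step: the answer barely depends on it.
decorative = {(1, 1): 0.9, (0, 1): 0.89, (1, 0): 0.3, (0, 0): 0.3}
```

With these made-up numbers the causal step scores 0.7 and the decorative one 0.005, matching the thresholds used in Key concepts.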

For example, here is a real step and its perturbation from Qwen3.5-4B on an inclusion-exclusion problem:

Original:  (33 + 19 + 14) - (6 + 4 + 2) + 0
Perturbed: (30 + 16 + 17) - (5 + 2 + 0) + -2

Run it

50 MATH-500 problems (~14 minutes with concurrency=64)

python -m tinker_cookbook.recipes.true_thinking_score.analyze \
    dataset=math n_problems=50

Quick smoke test (5 problems, ~3 minutes)

python -m tinker_cookbook.recipes.true_thinking_score.analyze \
    n_problems=5

DeepSeek-V3.1 (requires thinking renderer override)

python -m tinker_cookbook.recipes.true_thinking_score.analyze \
    model_name=deepseek-ai/DeepSeek-V3.1 renderer_name=deepseekv3_thinking \
    n_problems=50

GSM8K with a larger model

python -m tinker_cookbook.recipes.true_thinking_score.analyze \
    dataset=gsm8k model_name=Qwen/Qwen3.5-27B n_problems=50

Results are saved to /tmp/tinker-examples/tts/<run-name>/:

  • tts_per_problem.jsonl — per-problem details (steps, scores)
  • tts_summary.json — aggregate statistics

Key parameters

| Parameter     | Default         | Description                                                 |
|---------------|-----------------|-------------------------------------------------------------|
| model_name    | Qwen/Qwen3.5-4B | Thinking model to analyze                                   |
| renderer_name | None (auto)     | Override renderer (e.g. deepseekv3_thinking for DeepSeek)   |
| dataset       | math            | Dataset: math (MATH-500) or gsm8k                           |
| n_problems    | 50              | Number of problems to analyze                               |
| concurrency   | 64              | Max parallel problems (steps within a problem are sequential) |
| max_tokens    | 4096            | Max tokens for CoT generation                               |
| seed          | 42              | Random seed for perturbation reproducibility                |
Expected results

50 MATH-500 problems per model, concurrency=64:

| Metric               | Paper (R1-Distill-7B) | Qwen3.5-4B | Qwen3.5-27B | DeepSeek-V3.1 (671B) |
|----------------------|-----------------------|------------|-------------|----------------------|
| Steps/problem        | —                     | 31.7       | 31.9        | 11.3                 |
| Mean score           | ~0.03                 | 0.054      | 0.070       | 0.144                |
| Score >= 0.7         | 2.3%                  | 2.0%       | 1.8%        | 5.0%                 |
| Score >= 0.3         | 6.4%                  | 5.7%       | 8.0%        | 19.1%                |
| Decorative (<=0.005) | —                     | 59.3%      | 51.1%       | 35.1%                |
| SV decorative        | 12-21%                | 56.5%      | 49.1%       | 36.3%                |
| Accuracy             | —                     | 58%        | 62%         | 70%                  |

Findings

  1. The paper's core claim is validated across 3 models. The ~2% high-scoring rate is consistent across Qwen3.5-4B (2.0%) and Qwen3.5-27B (1.8%), closely matching the paper's 2.3%.

  2. Scaling reduces decorative reasoning. DeepSeek-V3.1 (671B) has far fewer decorative steps (35%) than Qwen models (51-59%), and solves problems in 11 steps vs 32 for Qwen.

  3. Self-verification is often fake. 36-57% of "Wait, let me re-check" steps are decorative across all models. DeepSeek-V3.1 is the most honest at 36%.

Differences from the paper

  • Models: The paper tests DeepSeek-R1-Distill (7B, 8B) and Nemotron-1.5B. We test Qwen3.5-4B, Qwen3.5-27B, and DeepSeek-V3.1 (671B).
  • Early-exit cue: The paper appends "The final result is" inside the thinking block. We close </think> and use \boxed{} format. Both measure relative confidence changes.
  • Step segmentation: The paper uses sentences (Appendix A). We use discourse markers.
  • Steering vectors (Section 6): Not replicated — requires internal activation access not available in Tinker.
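For illustration, discourse-marker segmentation can look like the sketch below. The marker list is a hypothetical one chosen for this example, not the recipe's actual set.

```python
import re

# Hypothetical marker set; the recipe's list may differ.
MARKERS = r"(?:Wait|So|Therefore|First|Next|Then|Now|Let me|Alternatively|Actually)"

def segment_cot(cot: str) -> list[str]:
    """Split a chain of thought at discourse markers that open a new
    reasoning move, keeping each marker with the step it introduces."""
    parts = re.split(rf"(?=\b{MARKERS}\b)", cot)
    return [p.strip() for p in parts if p.strip()]
```

A zero-width lookahead split keeps the marker attached to its step, so self-verification steps like "Wait, let me re-check" stay intact for pattern matching.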

Learn more