CISPO
CISPO (Chen et al., 2024; Khatri et al., 2024) is a policy gradient method that uses a clipped importance ratio as a coefficient for the policy gradient. Unlike PPO, which clips the objective directly, CISPO clips the ratio, treats the clipped value as a constant (no gradient flows through it), and uses it to weight the log probability.
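The CISPO objective is:
\[
\mathcal{L}_{\text{CISPO}}(\theta) = \mathbb{E}_{x \sim q}\left[ \operatorname{sg}\!\left( \operatorname{clip}\!\left( \frac{p_\theta(x)}{q(x)},\ c_{\text{low}},\ c_{\text{high}} \right) \right) \log p_\theta(x)\, A(x) \right]
\]
where \(\operatorname{sg}\) is the stop-gradient (detach) operator, \(A(x)\) is the advantage, and \(c_{\text{low}}\), \(c_{\text{high}}\) correspond to clip_low_threshold and clip_high_threshold.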
This is implemented as:
# Compute probability ratio
prob_ratio = torch.exp(target_logprobs - sampling_logprobs)
# Apply clipping
clipped_ratio = torch.clamp(prob_ratio, clip_low_threshold, clip_high_threshold)
# Compute CISPO objective (detach the clipped ratio)
cispo_objective = clipped_ratio.detach() * target_logprobs * advantages
# CISPO loss is negative of objective
loss = -cispo_objective.sum()
Input tensors:
target_tokens: array[(N,), int] - Target token IDs (from the sampler \(q\))
logprobs: array[(N,), float] - sampling_logprobs for the tokens
advantages: array[(N,), float] - Advantage values for RL
Output tensors:
logprobs: array[(N,), float] - target_logprobs for the tokens
Output diagnostics:
loss:sum (scalar) - Sum of CISPO losses
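As a sanity check on the implementation snippet in this section, here is a minimal pure-Python sketch of the same computation for three tokens, with the gradient with respect to each target log-probability written out by hand. Because the clipped ratio is detached, that gradient is simply `-clipped_ratio * advantage`. The helper name `cispo_loss_and_grad` and the sample numbers are illustrative assumptions, not part of any library.

```python
import math

def cispo_loss_and_grad(target_logprobs, sampling_logprobs, advantages,
                        clip_low_threshold=0.0, clip_high_threshold=4.0):
    """Hypothetical helper: summed CISPO loss and d(loss)/d(target_logprob) per token."""
    loss = 0.0
    grads = []
    for t_lp, s_lp, adv in zip(target_logprobs, sampling_logprobs, advantages):
        ratio = math.exp(t_lp - s_lp)  # p_theta(x) / q(x)
        coeff = min(max(ratio, clip_low_threshold), clip_high_threshold)
        # The coefficient is treated as a constant (detached), so only
        # t_lp carries gradient: d(-coeff * t_lp * adv)/d(t_lp) = -coeff * adv.
        loss += -coeff * t_lp * adv
        grads.append(-coeff * adv)
    return loss, grads

# Illustrative values only: an on-policy token (ratio 1), a token whose
# ratio (about 9) exceeds the upper threshold, and a stale token (ratio about 0.05).
target_logprobs = [math.log(0.5), math.log(0.9), math.log(0.01)]
sampling_logprobs = [math.log(0.5), math.log(0.1), math.log(0.2)]
advantages = [1.0, 1.0, -1.0]

loss, grads = cispo_loss_and_grad(target_logprobs, sampling_logprobs, advantages)
print(loss, grads)
```

The second token's coefficient is capped at 4.0 rather than dropped, and the stale third token keeps a small but nonzero gradient.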
Choosing clipping thresholds
Because the clipped ratio is a detached coefficient on \(\log p_\theta\) (rather than clipping the objective like PPO), CISPO never zeros a token's gradient — it only bounds the coefficient's magnitude. This is what makes the choice of thresholds, especially the lower one, matter.
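The contrast with PPO can be sketched by writing out, per token, the derivative of each objective with respect to \(\log p_\theta\). This is a hand-derived illustration under stated assumptions: the PPO objective is taken to be the standard `min(r*A, clip(r, 1-eps, 1+eps)*A)` with `eps=0.2` chosen only for the example, and both function names are hypothetical.

```python
def ppo_grad_coeff(ratio, advantage, eps=0.2):
    """d(objective)/d(log p) for PPO's min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    # If the clipped branch is strictly smaller, min() selects it, and that
    # branch is constant in r, so the token contributes no gradient.
    if clipped * advantage < ratio * advantage:
        return 0.0
    return ratio * advantage  # d(r*A)/d(log p) = r*A

def cispo_grad_coeff(ratio, advantage,
                     clip_low_threshold=0.0, clip_high_threshold=4.0):
    """d(objective)/d(log p) for sg(clip(r, low, high)) * log p * A."""
    return min(max(ratio, clip_low_threshold), clip_high_threshold) * advantage

# (ratio, advantage) pairs chosen for illustration.
for ratio, adv in [(1.1, 1.0), (2.0, 1.0), (6.0, 1.0), (0.5, -1.0)]:
    print(ratio, adv, ppo_grad_coeff(ratio, adv), cispo_grad_coeff(ratio, adv))
```

For the pairs `(2.0, 1.0)`, `(6.0, 1.0)`, and `(0.5, -1.0)`, the PPO coefficient is zero while the CISPO coefficient is nonzero and bounded by the thresholds.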
The default is a one-sided clip. CISPO defaults to disabling the lower bound
and only capping the upper side (clip_low_threshold=0.0, clip_high_threshold=4.0),
so you get this without passing loss_fn_config. To set it explicitly:
fwd_bwd_future = await training_client.forward_backward_async(
data=data,
loss_fn="cispo",
loss_fn_config={"clip_low_threshold": 0.0, "clip_high_threshold": 4.0}
)
fwd_bwd_result = await fwd_bwd_future.result_async()
This is a safe default in any setting. With no lower bound, the worst case is that
the coefficient just falls back toward the plain importance-sampling weight, which
is well-behaved. A positive clip_low_threshold (e.g. 0.8), by contrast, floors
the coefficient even for tokens whose ratio has dropped well below 1 — tokens the
policy has already moved away from — which removes the attenuation importance
sampling provides for stale tokens. On-policy that rarely matters, but off-policy
(async, where the sampler \(q\) lags the trainer \(p_\theta\)) it can set up a positive
feedback loop: the sampler/trainer KL grows, more tokens fall outside the band, the
bias grows, KL grows further. So there is little downside to dropping the lower
bound and a real upside off-policy.
This matches both papers that introduced and scaled CISPO. MiniMax-M1 (Chen et al., 2024) report: "we did not impose a lower bound on the IS weight by setting \(\epsilon_{\text{low}}^{\text{IS}}\) to a large value; instead, we only tuned \(\epsilon_{\text{high}}^{\text{IS}}\)." ScaleRL (Khatri et al., 2024) use \(\mathrm{clip}(r, 0, \epsilon_{\max})\) and find CISPO is largely insensitive to the upper bound — \(\epsilon_{\max}\) of 4, 5, and 8 perform the same (Fig. 19b) — so a loose upper bound in that range is a safe default.
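To make the effect of a positive lower threshold concrete, the following small sketch compares the coefficient a token receives under the one-sided default (0.0, 4.0) with a hypothetical two-sided setting (0.8, 4.0). The ratios are made-up values and `clipped_coeff` is an illustrative helper, not a library function.

```python
def clipped_coeff(ratio, clip_low_threshold, clip_high_threshold):
    """Coefficient applied to log p * A after clipping the importance ratio."""
    return min(max(ratio, clip_low_threshold), clip_high_threshold)

# Made-up importance ratios, from very stale (0.05) to far above 1 (10.0).
ratios = [0.05, 0.5, 1.0, 3.0, 10.0]

one_sided = [clipped_coeff(r, 0.0, 4.0) for r in ratios]  # default: no lower bound
two_sided = [clipped_coeff(r, 0.8, 4.0) for r in ratios]  # hypothetical floor at 0.8

for r, a, b in zip(ratios, one_sided, two_sided):
    print(f"ratio={r}: one-sided={a}, floor at 0.8={b}")
```

With the floor in place, the token at ratio 0.05 is weighted 0.8 instead of 0.05, a 16x increase for exactly the tokens that importance sampling would otherwise down-weight. Tokens at or above 0.8 are unaffected.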