Method note · TC-RLSD

Teacher-Critique Self-Distilled RLVR

Turning RLVR's sparse success/failure reward into a dense, token-level training signal — by treating an external teacher's natural-language critique as a self-distillation target, with an additive teacher branch that survives even when the verifier signal collapses.

Llama-3.2-3B student · Qwen3-4B critique generator · MATH-lighteval

1. Background — the RLVR sparseness problem

2. Related work — two self-distilled RLVR baselines

Two recent papers turn self-distillation into a way to enrich the RLVR signal. We position TC-RLSD as a generalisation of both.

| | SDPO (arXiv 2601.20802) | RLSD (arXiv 2604.03128, Apple) |
|---|---|---|
| Teacher | self, conditioned on feedback $f$ | self, conditioned on privileged context (gold) |
| Feedback $f$ | env errors / peer rollouts | gold solution (multimodal) |
| Advantage | $\hat A_t^{\text{SDPO}} = \Delta_t$ (per-token, dense) | $\hat A_t^{\text{RLSD}} = A \cdot ((1-\lambda)+\lambda w_t)$ (multiplicative) |
| Handles $A=0$ | yes (no verifier needed) | no (multiplicative on $A$) |
| Token-level shaping | direct log-ratio | sign-anchored magnitude reweighting |

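where $\Delta_t = \text{sg}\bigl[\log \pi(y_t \mid x, f, y_{<t}) - \log \pi(y_t \mid x, y_{<t})\bigr]$ is the stop-gradient log-ratio between the feedback-conditioned (teacher) pass and the plain (student) pass.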

3. Proposed method — TC-RLSD

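The same policy weights $\theta$ are used in two roles: as the student, $\pi(y_t \mid x, y_{<t})$, which is trained, and as the teacher, $\pi(y_t \mid x, f, y_{<t})$, which also sees the critique $f$ and is evaluated under stop-gradient.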

3.0 Method overview

[Figure: TC-RLSD method overview. Inputs: query x (math problem) and critique f (Qwen-4B reads the rollout). Llama-3.2-3B with shared weights θ runs in student mode π(y_t | x, y_<t) (trained, log P) and teacher mode π(y_t | x, f, y_<t) (stop-grad, log Q). The verifier reward r ∈ {0,1} gives A = (r − μ)/σ; the per-token teacher gap is Δ_t = sg[log Q − log P]. RLSD branch (multiplicative): α · A · ((1−λ) + λ · w_t), collapses when A = 0. New branch (additive): β · Δ_t, independent of A, dense. Output: final advantage Â_t = α A · ((1−λ) + λ w_t) + β Δ_t.]
Shared θ runs as Student (trainable) and Teacher (stop-grad, sees critique f). Verifier sets the sign of A; teacher provides per-token Δ_t. Both branches always compute — the additive term in coral survives A = 0.
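As a rough illustration of the two signals in the overview, the sketch below computes the group-normalised verifier advantage $A = (r - \mu)/\sigma$ and the per-token teacher gap $\Delta_t = \log Q - \log P$ from log-probabilities that are assumed to be already available. The function names, the `eps` guard and the toy numbers are assumptions for illustration and are not taken from the codebase; the stop-gradient is implicit here because plain floats carry no gradient.

```python
import math
from typing import List

def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalised advantage A = (r - mean) / (std + eps) over one prompt's rollouts."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def teacher_gap(logp_teacher: List[float], logp_student: List[float]) -> List[float]:
    """Per-token gap Delta_t = log Q(y_t) - log P(y_t), treated as a constant."""
    return [q - p for q, p in zip(logp_teacher, logp_student)]

mixed = group_advantages([1, 0, 0, 1, 0, 0, 0, 0])      # verifier is informative
saturated = group_advantages([1, 1, 1, 1, 1, 1, 1, 1])  # every rollout correct
gap = teacher_gap([-0.2, -1.0, -0.1], [-0.9, -0.8, -0.1])

print(saturated)  # all zeros: no verifier signal for this group
print(gap)        # non-zero wherever the critique shifts the token log-probability
```

In the saturated group every advantage is exactly zero, while the teacher gap is still defined per token; this is the situation the additive branch is meant to cover.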

3.1 The three advantage forms (all in our codebase)

We implement the RLSD form, the SDPO+GRPO additive hybrid, and our combined form — all selectable via algorithm.adv_estimator.

Form 1 — rlsd (multiplicative, RLSD paper)

Verifier advantage $A$, reshaped by the teacher's per-token confidence $w_t$. $$\hat A_t^{\text{RLSD}} = A \cdot \bigl((1-\lambda) + \lambda\, w_t\bigr)$$
$w_t$ is the teacher's clipped log-ratio modulator; it scales the magnitude of $A$ while leaving the sign untouched.

Pros: sign-anchored (verifier directs, teacher shapes magnitude only); leakage-free. Cons: dies when $A=0$ on saturated groups.
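A minimal sketch of Form 1 follows. The note only says that $w_t$ is a clipped log-ratio modulator, so the concrete definition used here, $w_t = \text{clip}(\exp(\text{sign}(A)\,\Delta_t),\ 1-\varepsilon_w,\ 1+\varepsilon_w)$, is an assumption made for illustration, as are the function name and the toy values.

```python
import math
from typing import List

def rlsd_advantage(A: float, delta: List[float], lam: float, eps_w: float = 0.2) -> List[float]:
    """Form 1: A * ((1 - lam) + lam * w_t), one value per token."""
    out = []
    for d in delta:
        # ASSUMPTION: w_t = clip(exp(sign(A) * Delta_t), 1 - eps_w, 1 + eps_w).
        sign = (A > 0) - (A < 0)
        w = min(max(math.exp(sign * d), 1.0 - eps_w), 1.0 + eps_w)
        out.append(A * ((1.0 - lam) + lam * w))
    return out

delta = [0.7, -0.2, 0.0]                    # hypothetical teacher gaps
pos = rlsd_advantage(1.0, delta, lam=0.5)   # magnitudes vary, all stay positive
neg = rlsd_advantage(-1.0, delta, lam=0.5)  # magnitudes vary, all stay negative
zero = rlsd_advantage(0.0, delta, lam=0.5)  # saturated group: every token gets 0
```

Because $w_t$ is positive and $\lambda \in [0, 1]$, the bracket is positive and the sign of $A$ is preserved; when $A = 0$ the whole expression is zero regardless of $\Delta_t$.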

Form 2 — rlsd_add (additive, SDPO+GRPO hybrid)

A linear blend: as $\lambda$ grows, the GRPO advantage shrinks and the dense teacher signal grows. $$\hat A_t^{\text{rlsd\_add}} = (1-\lambda)\cdot A + \lambda\cdot \Delta_t$$
$\lambda$ anneals 0.5 → 0, so the teacher term carries its largest weight (equal to the verifier's) at the start, and the verifier takes over fully once $\lambda$ reaches 0. Standard mix-of-advantage construction; both terms are valid PG estimators.

Pros: linear combination of two valid advantage estimators (the gradient is linear in the advantage, so summing advantages = summing policy gradients). Survives $A=0$ via the $\lambda\Delta_t$ term, which gives a dense teacher signal on saturated data for as long as $\lambda > 0$. Cons: no sign-anchor for the teacher term.
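Form 2 can be sketched in a few lines. The function name and the toy $\Delta_t$ values are assumptions for illustration; the formula itself is the one given above.

```python
from typing import List

def rlsd_add_advantage(A: float, delta: List[float], lam: float) -> List[float]:
    """Form 2: (1 - lam) * A + lam * Delta_t, one value per token."""
    return [(1.0 - lam) * A + lam * d for d in delta]

delta = [0.7, -0.2, 0.0]                         # hypothetical teacher gaps
early = rlsd_add_advantage(0.0, delta, lam=0.5)  # saturated group, lam still at 0.5
late = rlsd_add_advantage(0.0, delta, lam=0.0)   # same group once lam has annealed to 0
```

With $A = 0$ the output is just $\lambda\Delta_t$, so under the 0.5 → 0 schedule the dense term is present only while $\lambda$ is still positive.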

Form 3 — rlsd_hybrid (multiplicative + additive, ours)

Both branches at once — $\alpha$-weighted RLSD reshape plus $\beta$-weighted dense teacher gap. $$\hat A_t^{\text{rlsd\_hybrid}} = \alpha \cdot A \cdot \bigl((1-\lambda) + \lambda\, w_t\bigr) + \beta \cdot \Delta_t$$
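Strict generalisation: $\alpha{=}1,\ \beta{=}0 \Rightarrow$ Form 1; $\lambda{=}0$ with $\alpha{=}1-\lambda',\ \beta{=}\lambda' \Rightarrow$ Form 2 with blend weight $\lambda'$; $\alpha{=}0,\ \beta{=}1 \Rightarrow$ the pure SDPO advantage $\Delta_t$. Defaults: $\alpha = 0.7,\ \beta = 0.3$.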

Combines RLSD's verifier-anchored shaping with SDPO's dense fallback. Recovers either pure form as a limit.
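The combined form can be sketched as below, taking the modulators $w_t$ and gaps $\Delta_t$ as precomputed inputs. The function name and all numeric inputs are hypothetical. The two extra calls check the limits numerically: with $\alpha = 1,\ \beta = 0$ the output is the Form 1 expression, and with $\lambda = 0$ it reduces to $\alpha A + \beta\Delta_t$, a Form 2 style blend.

```python
from typing import List

def rlsd_hybrid_advantage(A: float, w: List[float], delta: List[float],
                          lam: float, alpha: float = 0.7, beta: float = 0.3) -> List[float]:
    """Form 3: alpha * A * ((1 - lam) + lam * w_t) + beta * Delta_t."""
    return [alpha * A * ((1.0 - lam) + lam * w_t) + beta * d_t
            for w_t, d_t in zip(w, delta)]

w = [1.2, 0.9, 1.0]        # hypothetical clipped modulators
delta = [0.7, -0.2, 0.0]   # hypothetical teacher gaps

form1 = rlsd_hybrid_advantage(1.0, w, delta, lam=0.5, alpha=1.0, beta=0.0)
blend = rlsd_hybrid_advantage(1.0, w, delta, lam=0.0, alpha=0.8, beta=0.2)
saturated = rlsd_hybrid_advantage(0.0, w, delta, lam=0.5)  # defaults 0.7 / 0.3
```

In the saturated case the multiplicative branch vanishes and only $\beta\Delta_t$ remains, independent of the $\lambda$ schedule.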

3.2 Why the additive term matters (RL theory)

Policy gradient is linear in advantage:

The gradient is the score function $\nabla\log\pi_\theta$, weighted by advantage. $$\nabla_\theta \mathcal L \;=\; \mathbb E_{y\sim\pi_\theta}\!\bigl[\nabla_\theta \log\pi_\theta(y_t)\cdot \hat A_t\bigr]$$
Because $\hat A_t$ enters linearly, decomposing $\hat A = A_1 + A_2$ gives $\nabla\mathcal L = \nabla\mathcal L_1 + \nabla\mathcal L_2$: the update is the sum of two policy gradients. The same additivity underlies the RLHF KL penalty, the PPO entropy bonus, and multi-task auxiliary losses.

So summing the verifier-grounded $A_{\text{GRPO}}$ (sparse, sequence-level) and the teacher-grounded $\Delta_t$ (dense, per-token) is the principled way to keep both signals.
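The linearity can be checked numerically on a toy case. The sketch below assumes a single softmax policy over three tokens and uses the closed-form score function $\nabla_z \log \text{softmax}(z)_y = e_y - \pi$; the logits and the two advantage values are made up for illustration.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pg_grad(logits: List[float], token: int, advantage: float) -> List[float]:
    """Advantage-weighted score function: A * d/dlogits log softmax(logits)[token]."""
    probs = softmax(logits)
    return [advantage * ((1.0 if i == token else 0.0) - p) for i, p in enumerate(probs)]

logits, token = [0.5, -1.0, 2.0], 2
a_verifier, a_teacher = 1.3, -0.4   # hypothetical A and Delta_t for this token

g_sum = pg_grad(logits, token, a_verifier + a_teacher)
g_parts = [u + v for u, v in zip(pg_grad(logits, token, a_verifier),
                                 pg_grad(logits, token, a_teacher))]
max_diff = max(abs(u - v) for u, v in zip(g_sum, g_parts))
```

Up to floating-point rounding, `g_sum` and `g_parts` coincide: the gradient of the summed advantage is the sum of the two separate gradients.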

Token-level vs sequence-level — a covariance view

The same idea in OPD vocabulary: a convex blend of the on-policy-distillation advantage and the RLVR advantage,

$$\hat A_i^{(\alpha)} \;=\; \alpha \cdot A^{\text{OPD}}_i \;+\; (1-\alpha)\cdot A^{\text{RLVR}}_i \qquad (\alpha = 1 \,\Rightarrow\, \text{pure OPD})$$

has natural-gradient covariance

$$C_{\text{OPD}} \;=\; \mathbb{E}\!\left[\,\sum_{t} a_t\,a_t'\cdot \phi_t\,\phi_t^{\!\top}\,\right] \qquad \text{(per-token heterogeneous coefficients)}$$

$$C_{\text{RLVR}} \;=\; \mathbb{E}\!\left[\,A(y)^{2}\cdot s\,s^{\!\top}\,\right],\qquad s \;=\; \sum_{t}\phi_t \qquad \text{(rank-1: every token collapses into one scalar)}$$

Each sampled sequence contributes a rank-1 term to $C_{\text{RLVR}}$: every token shares the same sequence-level scalar $A(y)$, and only the summed gradient direction $s$ survives the outer product. $C_{\text{OPD}}$ keeps each $\phi_t\phi_t^{\!\top}$ with its own per-token coefficient $a_ta_t'$, so even a single sequence spans several directions and the matrix is generically full-rank. Adding the two, which is what rlsd_add / rlsd_hybrid do, enriches the information geometry of the update.
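A two-token, two-dimensional toy case makes the rank argument concrete. The sketch below looks at a single sampled sequence (no expectation) and uses one hypothetical scalar coefficient $c_t$ per token in place of $a_t a_t'$; the features $\phi_t$ and all numbers are invented for illustration. A 2×2 matrix has rank below 2 exactly when its determinant is zero.

```python
from typing import List

Matrix = List[List[float]]

def outer(u: List[float], v: List[float]) -> Matrix:
    return [[ui * vj for vj in v] for ui in u]

def add(m1: Matrix, m2: Matrix) -> Matrix:
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def scale(c: float, m: Matrix) -> Matrix:
    return [[c * x for x in row] for row in m]

def det2(m: Matrix) -> float:
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

phi = [[1.0, 0.0], [0.5, 1.0]]   # hypothetical per-token features phi_1, phi_2
A = 1.5                          # one sequence-level advantage
coef = [2.0, 0.5]                # hypothetical per-token coefficients c_t

s = [phi[0][i] + phi[1][i] for i in range(2)]          # s = sum_t phi_t
c_rlvr = scale(A ** 2, outer(s, s))                    # A^2 * s s^T
c_opd = add(scale(coef[0], outer(phi[0], phi[0])),
            scale(coef[1], outer(phi[1], phi[1])))     # sum_t c_t * phi_t phi_t^T
c_sum = add(c_rlvr, c_opd)
```

Here `det2(c_rlvr)` is zero (rank 1) while `det2(c_opd)` and `det2(c_sum)` are positive (rank 2), matching the claim for this sample.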

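For TC-RLSD, $A^{\text{OPD}}$ is the teacher gap $\Delta_t$ (dense, per-token) and $A^{\text{RLVR}}$ is the GRPO advantage $A$ (sparse, sequence-level). Summing them is the principled way to keep both signals.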

3.3 Hyperparameters

| symbol | meaning | recommended |
|---|---|---|
| $\lambda$ | RLSD multiplicative-branch decay | $0.5 \to 0$ over 50 steps |
| $\varepsilon_w$ | $w_t$ clip range | 0.2 |
| $\alpha$ (hybrid) | weight on RLSD multiplicative branch | 0.7 |
| $\beta$ (hybrid) | weight on SDPO additive branch | 0.3 |
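The $\lambda$ schedule can be sketched as a small helper. The note gives only the endpoints and the 50-step horizon, so the linear shape of the decay, the function name and the hold-at-zero behaviour afterwards are assumptions.

```python
def lambda_at(step: int, lam_start: float = 0.5, lam_end: float = 0.0,
              decay_steps: int = 50) -> float:
    """Linearly anneal lambda from lam_start to lam_end over decay_steps, then hold."""
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return lam_start + (lam_end - lam_start) * frac

schedule = [lambda_at(s) for s in (0, 25, 50, 200)]
print(schedule)  # [0.5, 0.25, 0.0, 0.0]
```

After step 50 any term weighted by $\lambda$ is switched off, whereas the hybrid form's $\beta$ weight is a constant and does not decay.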

3.4 What's new vs SDPO and RLSD

| | SDPO | RLSD | TC-RLSD (ours) |
|---|---|---|---|
| modality | text (code/math) | multimodal | text (math) |
| teacher source | self | self | external cross-family (Qwen3-4B → Llama-3.2-3B) |
| feedback $f$ | env errors / peers | gold solution | external model's NL critique of student's rollout |
| advantage form | $\Delta_t$ direct | $A\cdot((1-\lambda)+\lambda w_t)$ | 3 forms; ours = $\alpha A m_t + \beta\Delta_t$, with $m_t = (1-\lambda)+\lambda w_t$ |
| handles $A=0$ | yes | no | yes (additive term) |
| cross-family vocab | shared | shared | bridged via text (no shared logits) |

4. Experimental setup

4.1 Models & data

4.2 Training config (faithful to RLSD paper + ReMax setup)

| item | value |
|---|---|
| RL base | GRPO (adv_estimator: grpo / rlsd / rlsd_add / rlsd_hybrid) |
| advantage std-normalisation | on |
| lr | 1e-6 |
| entropy_coeff | 1e-3 |
| KL | 0 (use_kl_loss=False) |
| train batch | 1024 |
| rollout n (group) | 8 |
| max prompt / response | 1024 / 3072 |
| dynamic_bsz, ppo_max_token/gpu | on, 24000 |
| ppo_mini_batch | 256 |
| RLSD λ / ε_w | 0.5→0 (50 steps) / 0.2 |
| hybrid α / β | 0.7 / 0.3 |
| epochs | 10 |
| val_kwargs.n / temp | 32 / 0.6 (avg@32) |
| save_freq | 80 (final ckpt only) |

4.3 Infrastructure

4.4 Critique feedback options (ablation axes)

Prompt design note. The original _WHOLE_REJUDGE prompt, with its "verdict may be a format artifact" hedge, caused a lenient bias: Qwen excused real reasoning errors as format issues. It was replaced with _WHOLE_INCORRECT plus a brief factual note that grader parsing is regex-based (no rejudge instruction).

5. Results

5.1 gsm8k_level1, 10 epochs (mean@1 — early run, too easy → tied)

| method | val acc peak | val acc final | critique coverage |
|---|---|---|---|
| GRPO baseline | 0.366 | 0.36 | |
| TC-RLSD critique+gold | 0.374 | 0.36 | 3–4 % |

Essentially tied (+0.008, within noise). Cause: gsm8k is too easy — by epoch 10 the train reward hits 0.93, so almost no rollouts are wrong → critique coverage collapses to 3–4 %, $w \approx 1.0$, modulation negligible. Confirms the $A=0$ failure mode the additive form targets.

5.2 MATH-lighteval 4-way (mean@1, apples-to-apples at 16 train GPUs, n=8)

| method | val peak | val final | Δ-gap | critique cov. |
|---|---|---|---|---|
| GRPO (vanilla RLVR) | 0.514 | 0.514 | | |
| RLSD-gold (Apple replication) | 0.518 | 0.518 | | 1.0 (gold mode) |
| RLSD-self-critique (SDPO-style) | 0.517 | 0.517 | −0.029 | 0.149 |
| TC-RLSD critique (Qwen-4B external) | 0.525 | 0.525 | −0.037 | 0.136 |

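The cross-family external teacher scores highest: +1.1 / +0.7 / +0.8 points over GRPO, RLSD-gold, and RLSD-self-critique respectively, on harder data where the verifier stays informative. Its Δ-gap is also larger in magnitude (−0.037 vs −0.029 for self-critique), which is consistent with a more discriminating per-token teacher signal.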

5.3 Planned — 6-way comparison with the additive / hybrid forms (avg@32)

| method | advantage form | hyperparameters |
|---|---|---|
| GRPO (baseline) | $A$ | |
| RLSD-gold | $A((1-\lambda)+\lambda w_t)$ | gold mode, $\lambda=0.5\to0$ |
| RLSD-self | $A((1-\lambda)+\lambda w_t)$ | self critique |
| TC-RLSD-critique (multiplicative) | $A((1-\lambda)+\lambda w_t)$ | external Qwen critique |
| TC-RLSD-add (new) | $(1-\lambda)A + \lambda\Delta_t$ | external Qwen critique |
| TC-RLSD-hybrid (new) | $\alpha A m_t + \beta\Delta_t$, with $m_t = (1-\lambda)+\lambda w_t$ | $\alpha=0.7,\;\beta=0.3$ |

Hypothesis: the additive forms outperform multiplicative on saturated groups (later in training, or on easier subsets) because the dense $\Delta_t$ term survives where $A=0$.

6. Key finding — verdict false-negatives + lenient-bias prompt


Repo: oishikimchi97/tc-rlsd2  (private monorepo). Reports / critique samples / metric extractions live under reports/.