Method note · TC-RLSD

Teacher-Critique Self-Distilled RLVR

Turning RLVR's sparse success/failure reward into a dense, token-level training signal — by treating an external teacher's natural-language critique as a self-distillation target, with an additive teacher branch that survives even when the verifier signal collapses.

Llama-3.2-3B student · Qwen3-4B critique generator · MATH-lighteval

1. Background — the RLVR sparseness problem

RLVR (Reinforcement Learning with Verifiable Rewards) trains a policy on tasks with checkable answers — math, code. The reward is 1 if correct, else 0: sparse, outcome-only.
GRPO normalises this per-prompt group: $A = (r - \text{mean}(r)) / \text{std}(r)$.
Failure mode: when all rollouts for a prompt agree (all correct or all wrong), $r - \text{mean}(r) = 0$ for every rollout, so $A = 0$ for every token. The learning signal vanishes entirely on saturated data (unanimously easy or unanimously hard prompts).
There is also no token-level information about where the solution went wrong — only a sequence-level scalar broadcast to every token.
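The saturation failure can be reproduced in a few lines. This is a minimal sketch, not the project's code: the function name and the `eps` guard against a zero standard deviation are assumptions (the formula above omits any such constant).

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalised advantage for the rollouts of one prompt.

    eps is an assumed numerical guard for std == 0; it is not in the note's formula.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

mixed = grpo_advantages([1, 0, 0, 1, 0, 0, 0, 0])  # informative group: 2 of 8 correct
all_wrong = grpo_advantages([0] * 8)               # saturated: unanimously hard
all_right = grpo_advantages([1] * 8)               # saturated: unanimously easy

print(mixed)      # positive for the correct rollouts, negative for the rest
print(all_wrong)  # all zeros: no gradient from this prompt
print(all_right)  # all zeros: no gradient from this prompt
```

Each scalar is then broadcast to every token of its rollout, which is why no token-level information survives.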

2. Related work — two self-distilled RLVR baselines

Two recent papers turn self-distillation into a way to enrich the RLVR signal. We position TC-RLSD as a generalisation of both.

	SDPO arXiv 2601.20802	RLSD arXiv 2604.03128 (Apple)
Teacher	self, conditioned on feedback $f$	self, conditioned on privileged context (gold)
Feedback $f$	env errors / peer rollouts	gold solution (multimodal)
Advantage	$\hat A_t^{\text{SDPO}} = \Delta_t$ (per-token, dense)	$\hat A_t^{\text{RLSD}} = A \cdot ((1-\lambda)+\lambda w_t)$ (multiplicative)
Handles $A=0$	yes (no verifier needed)	no (multiplicative on $A$)
Token-level shaping	direct log-ratio	sign-anchored magnitude reweighting

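where $\Delta_t = \text{sg}\bigl[\log \pi(y_t \mid x, f, y_{<t}) - \log \pi(y_t \mid x, y_{<t})\bigr]$ is the stop-gradient log-ratio between the feedback-conditioned teacher and the student on the sampled token, and $w_t$ is the clipped modulator derived from it (see 3.1).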

3. Proposed method — TC-RLSD

The same policy weights $\theta$ are used in two roles:

Student $\pi(y \mid x)$ — sees only the problem.
Teacher $\pi(y \mid x, f)$ — conditioned on an external teacher's natural-language critique $f$ written after reading the student's actual rollout (the gold answer is never revealed in the critique text).

3.0 Method overview

Figure (method overview): shared θ runs as Student (trainable) and Teacher (stop-grad, sees critique f). The verifier sets the sign of A; the teacher provides the per-token Δ_t. Both branches are always computed, and the additive term survives A = 0.

3.1 The three advantage forms (all in our codebase)

We implement the RLSD form, the SDPO+GRPO additive hybrid, and our combined form — all selectable via algorithm.adv_estimator.

Form 1 — `rlsd` (multiplicative, RLSD paper)

Verifier advantage $A$, reshaped by the teacher's per-token confidence $w_t$. $$\hat A_t^{\text{RLSD}} = A \cdot \bigl((1-\lambda) + \lambda\, w_t\bigr)$$

$w_t$ is the teacher's clipped log-ratio modulator; it scales the magnitude of $A$ while leaving the sign untouched.

Pros: sign-anchored (verifier directs, teacher shapes magnitude only); leakage-free. Cons: dies when $A=0$ on saturated groups.
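A minimal sketch of Form 1 on one rollout. The exact definition of $w_t$ is not spelled out in this note, so the form used here, $w_t = \text{clip}(\exp(\text{sign}(A)\cdot\Delta_t),\ 1-\varepsilon_w,\ 1+\varepsilon_w)$, is an assumption chosen to match "clipped log-ratio modulator"; the function names are hypothetical.

```python
import numpy as np

def teacher_gap(logp_teacher, logp_student):
    # Delta_t: teacher log-prob minus student log-prob on the sampled tokens.
    return np.asarray(logp_teacher, dtype=np.float64) - np.asarray(logp_student, dtype=np.float64)

def rlsd_advantage(A, delta, lam, eps_w=0.2):
    # ASSUMPTION: w_t = clip(exp(sign(A) * Delta_t), 1 - eps_w, 1 + eps_w).
    delta = np.asarray(delta, dtype=np.float64)
    w = np.clip(np.exp(np.sign(A) * delta), 1.0 - eps_w, 1.0 + eps_w)
    return A * ((1.0 - lam) + lam * w)

delta = teacher_gap([-0.5, -1.5, -1.0], [-1.0, -1.0, -1.0])  # [0.5, -0.5, 0.0]
print(rlsd_advantage(1.0, delta, lam=0.5))   # magnitudes reshaped, all positive
print(rlsd_advantage(-1.0, delta, lam=0.5))  # magnitudes reshaped, all negative
print(rlsd_advantage(0.0, delta, lam=0.5))   # saturated group: all zeros
```

Because $w_t$ is clipped to a positive range, the factor $(1-\lambda)+\lambda w_t$ is always positive, so the sign of $A$ is preserved and a zero $A$ stays zero.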

Form 2 — `rlsd_add` (additive, SDPO+GRPO hybrid)

A linear blend of the GRPO advantage and the dense teacher signal. $$\hat A_t^{\text{rlsd\_add}} = (1-\lambda)\cdot A + \lambda\cdot \Delta_t$$

$\lambda$ anneals 0.5 → 0, so the two terms start equally weighted and the teacher term fades as the verifier takes over. This is the standard mix-of-advantage construction.

Pros: linear combination of two valid advantage estimators (gradient is linear in advantage, so summing advantages = summing policy gradients). Survives $A=0$ via the $\lambda\Delta_t$ term while $\lambda > 0$, giving a dense teacher signal even on saturated data. Cons: no sign-anchor for the teacher term, so $\Delta_t$ can push against the verifier's sign.
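A minimal sketch of Form 2, with a hypothetical function name, showing both the pro and the con listed above on toy numbers.

```python
import numpy as np

def rlsd_add_advantage(A, delta, lam):
    # (1 - lam) * A + lam * Delta_t, with the scalar A broadcast over tokens.
    delta = np.asarray(delta, dtype=np.float64)
    return (1.0 - lam) * A + lam * delta

# Saturated group (A = 0): the teacher term still provides a per-token signal.
saturated = rlsd_add_advantage(0.0, [0.4, -0.2], lam=0.5)

# No sign anchor: a strongly negative Delta_t flips a positive verifier advantage.
flipped = rlsd_add_advantage(1.0, [-4.0], lam=0.5)

# Once lam has annealed to 0, the form is plain GRPO again.
annealed = rlsd_add_advantage(1.0, [0.4, -0.2], lam=0.0)

print(saturated, flipped, annealed)
```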

Form 3 — `rlsd_hybrid` (multiplicative + additive, ours)

Both branches at once — $\alpha$-weighted RLSD reshape plus $\beta$-weighted dense teacher gap. $$\hat A_t^{\text{rlsd\_hybrid}} = \alpha \cdot A \cdot \bigl((1-\lambda) + \lambda\, w_t\bigr) + \beta \cdot \Delta_t$$

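Generalises both: $\alpha{=}1,\ \beta{=}0 \Rightarrow$ Form 1; $\lambda{=}0$ in the multiplicative branch with $\alpha{=}1-\lambda',\ \beta{=}\lambda' \Rightarrow$ Form 2 with blend weight $\lambda'$; $\alpha{=}0 \Rightarrow$ pure SDPO scaled by $\beta$. Defaults: $\alpha = 0.7,\ \beta = 0.3$.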

Combines RLSD's verifier-anchored shaping with SDPO's dense fallback. Recovers either pure form as a limit.
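The limits can be read directly off the formula and checked numerically. A minimal sketch with hypothetical function names; the $w_t$ definition inside `modulator` is an assumption (clipped exponential of the sign-aligned teacher gap), not something this note specifies.

```python
import numpy as np

def modulator(A, delta, lam, eps_w=0.2):
    # ASSUMPTION: w_t = clip(exp(sign(A) * Delta_t), 1 - eps_w, 1 + eps_w).
    w = np.clip(np.exp(np.sign(A) * delta), 1.0 - eps_w, 1.0 + eps_w)
    return (1.0 - lam) + lam * w

def rlsd_hybrid(A, delta, lam, alpha=0.7, beta=0.3, eps_w=0.2):
    delta = np.asarray(delta, dtype=np.float64)
    return alpha * A * modulator(A, delta, lam, eps_w) + beta * delta

A, delta = 1.0, np.array([0.5, -0.5, 0.0])

# alpha = 1, beta = 0 reproduces the multiplicative form.
form1 = A * modulator(A, delta, 0.5)
hybrid_as_form1 = rlsd_hybrid(A, delta, lam=0.5, alpha=1.0, beta=0.0)

# lam = 0 in the multiplicative branch gives alpha * A + beta * Delta_t,
# i.e. the additive form with blend weight 0.3 at the default alpha / beta.
form2 = (1.0 - 0.3) * A + 0.3 * delta
hybrid_as_form2 = rlsd_hybrid(A, delta, lam=0.0, alpha=0.7, beta=0.3)

# Saturated group: only the beta * Delta_t branch remains.
saturated = rlsd_hybrid(0.0, delta, lam=0.5)
```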

3.2 Why the additive term matters (RL theory)

Policy gradient is linear in advantage:

The gradient is the score function $\nabla\log\pi_\theta$, weighted by advantage and summed over tokens. $$\nabla_\theta \mathcal L \;=\; \mathbb E_{y\sim\pi_\theta}\!\Bigl[\sum_t \nabla_\theta \log\pi_\theta(y_t \mid x, y_{<t})\cdot \hat A_t\Bigr]$$

Because $\hat A_t$ enters linearly, decomposing $\hat A = A_1 + A_2$ gives $\nabla\mathcal L = \nabla\mathcal L_1 + \nabla\mathcal L_2$: two policy gradients combined. RLHF's KL penalty, PPO's entropy bonus, and multi-task auxiliary losses rely on the same additivity.

So summing the verifier-grounded $A_{\text{GRPO}}$ (sparse, sequence-level) and the teacher-grounded $\Delta_t$ (dense, per-token) is the principled way to keep both signals.
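The linearity argument can be checked numerically on a toy policy. This is an illustrative sketch only: the policy is a context-free softmax over four tokens (an assumption made to keep the score function closed-form), and `pg_grad` is a hypothetical helper.

```python
import numpy as np

def pg_grad(logits, tokens, adv):
    """Sum over t of grad_logits log pi(y_t) * A_t for a context-free softmax policy."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    grad = np.zeros_like(logits)
    for y, a in zip(tokens, adv):
        score = -p.copy()   # d log pi(y) / d logits = onehot(y) - pi
        score[y] += 1.0
        grad += a * score
    return grad

logits = np.array([0.2, -1.0, 0.5, 0.0])
tokens = [2, 0, 3]
A_verifier = np.full(3, 0.8)               # sequence-level scalar broadcast to tokens
delta_teacher = np.array([0.3, -0.1, 0.05])  # dense per-token term

g_sum = pg_grad(logits, tokens, A_verifier + delta_teacher)
g_parts = pg_grad(logits, tokens, A_verifier) + pg_grad(logits, tokens, delta_teacher)
print(np.allclose(g_sum, g_parts))
```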

Token-level vs sequence-level — a covariance view

The same idea in OPD vocabulary: a convex blend of the on-policy-distillation advantage and the RLVR advantage (here $\alpha$ is the blend weight, not the hybrid $\alpha$ of Form 3),

$$\hat A_i^{(\alpha)} \;=\; \alpha \cdot A^{\text{OPD}}_i \;+\; (1-\alpha)\cdot A^{\text{RLVR}}_i \qquad (\alpha = 1 \,\Rightarrow\, \text{pure OPD})$$

The two components have natural-gradient covariances

$$C_{\text{OPD}} \;=\; \mathbb{E}\!\left[\,\sum_{t} a_t\,a_t'\cdot \phi_t\,\phi_t^{\!\top}\,\right] \qquad \text{(per-token heterogeneous coefficients)}$$

$$C_{\text{RLVR}} \;=\; \mathbb{E}\!\left[\,A(y)^{2}\cdot s\,s^{\!\top}\,\right],\qquad s \;=\; \sum_{t}\phi_t \qquad \text{(rank-1: every token collapses into one scalar)}$$

$C_{\text{RLVR}}$ is rank-1 in token-feature space: every token shares the same sequence-level scalar $A(y)$, and only the summed gradient direction $s$ survives the outer product. $C_{\text{OPD}}$ keeps each $\phi_t\phi_t^{\!\top}$ with its own per-token coefficient $a_ta_t'$, so it is generically full-rank. Adding the two — exactly what rlsd_add / rlsd_hybrid do — strictly enriches the information geometry of the update.
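The rank contrast can be illustrated for a single rollout with random features. A minimal sketch under stated assumptions: `phi` holds made-up per-token feature vectors, and `c` stands in for the per-token coefficient $a_ta_t'$; neither comes from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                            # tokens in the rollout, feature dimension
phi = rng.normal(size=(T, d))          # assumed per-token features phi_t
c = rng.uniform(0.5, 1.5, size=T)      # stand-in for the coefficients a_t * a_t'
A = 1.0                                # sequence-level verifier advantage

# RLVR: one scalar times the outer product of the summed feature vector.
s = phi.sum(axis=0)
C_rlvr = A**2 * np.outer(s, s)

# OPD: each token keeps its own outer product and its own coefficient.
C_opd = sum(c_t * np.outer(p_t, p_t) for c_t, p_t in zip(c, phi))

rank_rlvr = np.linalg.matrix_rank(C_rlvr)
rank_opd = np.linalg.matrix_rank(C_opd)
print(rank_rlvr, rank_opd)
```

With more tokens than feature dimensions and generic features, the per-token sum spans the full space, while the sequence-level term only ever contributes the single direction $s$.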

For TC-RLSD:

$A_{\text{GRPO}}$: verifier-grounded, sequence-level, sparse (collapses to 0 on saturated groups).
$\Delta_t$: teacher-grounded, per-token, dense (always informative when feedback exists).

Summing them is the principled way to keep both signals.

3.3 Hyperparameters

symbol	meaning	recommended
$\lambda$	RLSD multiplicative-branch decay	$0.5 \to 0$ over 50 steps
$\varepsilon_w$	$w_t$ clip range	0.2
$\alpha$ hybrid	weight on RLSD multiplicative branch	0.7
$\beta$ hybrid	weight on SDPO additive branch	0.3
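The recommended values can be collected in one place, for example as below. This is a sketch with hypothetical names, not the project's config schema; the table only says $\lambda$ goes from 0.5 to 0 over 50 steps, so the linear shape of the decay is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RLSDHyperparams:
    lam_start: float = 0.5
    lam_end: float = 0.0
    lam_decay_steps: int = 50
    eps_w: float = 0.2   # clip range for w_t
    alpha: float = 0.7   # weight on the multiplicative branch
    beta: float = 0.3    # weight on the additive branch

    def lam(self, step: int) -> float:
        # ASSUMPTION: linear decay, then held at lam_end.
        frac = min(max(step, 0), self.lam_decay_steps) / self.lam_decay_steps
        return self.lam_start + frac * (self.lam_end - self.lam_start)

hp = RLSDHyperparams()
print([hp.lam(s) for s in (0, 25, 50, 200)])
```

Under this schedule the multiplicative branch reduces to plain $A$ after step 50, and in the hybrid form only the $\beta\Delta_t$ term keeps the teacher signal alive.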

3.4 What's new vs SDPO and RLSD

	SDPO	RLSD	TC-RLSD ours
modality	text (code/math)	multimodal	text (math)
teacher source	self	self	external cross-family (Qwen3-4B → Llama-3.2-3B)
feedback $f$	env errors / peers	gold solution	external model's NL critique of student's rollout
advantage form	$\Delta_t$ direct	$A\cdot((1-\lambda)+\lambda w_t)$	3 forms; ours = $\alpha A\,((1-\lambda)+\lambda w_t) + \beta\Delta_t$
handles $A=0$	yes	no	yes (additive term)
cross-family vocab	shared	shared	bridged via text — no shared logits

4. Experimental setup

4.1 Models & data

Student (policy): meta-llama/Llama-3.2-3B-Instruct
Critique generator (teacher): Qwen/Qwen3-4B-Instruct-2507 — standing vLLM OpenAI server; fresh critique per rollout, every step
Self-critique baseline (SDPO-style): same Llama-3.2-3B serving its own critique
Data: gsm8k_level1 (easy, ceiling test) / MATH-lighteval (hard, real test-bed)
Eval: validation accuracy. Original runs used mean@1 (single greedy sample); current runs use avg@32 ($n=32,\;T=0.6$) for a lower-variance metric.

4.2 Training config (faithful to RLSD paper + ReMax setup)

item	value
RL base	GRPO (`adv_estimator`: `grpo` / `rlsd` / `rlsd_add` / `rlsd_hybrid`)
advantage std-normalisation	on
lr	1e-6
entropy_coeff	1e-3
KL	0 (`use_kl_loss=False`)
train batch	1024
rollout n (group)	8
max prompt / response	1024 / 3072
dynamic_bsz, ppo_max_token/gpu	on, 24000
ppo_mini_batch	256
RLSD λ / ε_w	0.5→0 (50 steps) / 0.2
hybrid α / β	0.7 / 0.3
epochs	10
val_kwargs.n / temp	32 / 0.6 (avg@32)
save_freq	80 (final ckpt only)

4.3 Infrastructure

miyabi (aarch64 GH200, PBS scheduler). 16-node baselines (GRPO / RLSD-gold) at small-g. 17-node critique runs (1 server + 16 train) at medium-g.
Reproducible track: every run launched via hpc run -t miyabi → pins a git commit and materialises an isolated worktree+venv per job. No hpc submit debug-track for real experiments.
Repos:
- Project (active): oishikimchi97/tc-rlsd2 monorepo — ReMax verl + recipe/rlsd + hpc orchestration
- Project (legacy): oishikimchi97/tc-rlsd split repo, kept as backup

4.4 Critique feedback options (ablation axes)

granularity: diagnostic (whole-paragraph) / rubric (4-axis PASS/FAIL) / best_worst (two-line)
grade_info (verifier output shown to teacher): none / student (parsed answer only) / gold (reference in input only, leak-filtered)
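The two axes define a small ablation grid. A minimal sketch that enumerates it; the dictionary keys mirror the option names above, but the structure itself is hypothetical and not the project's launcher format.

```python
from itertools import product

GRANULARITY = ("diagnostic", "rubric", "best_worst")
GRADE_INFO = ("none", "student", "gold")

# One entry per (granularity, grade_info) combination.
grid = [
    {"granularity": g, "grade_info": i}
    for g, i in product(GRANULARITY, GRADE_INFO)
]

for cfg in grid:
    print(cfg)
```

A full sweep is nine critique configurations before any advantage-form choice is layered on top.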

Prompt design note. The original _WHOLE_REJUDGE prompt with "verdict may be a format artifact" hedge was found to cause lenient bias — Qwen defended real reasoning errors as format issues. Replaced with _WHOLE_INCORRECT + a brief factual note that grader parsing is regex-based (no rejudge instruction).

5. Results

5.1 gsm8k_level1, 10 epochs (mean@1 — early run, too easy → tied)

method	val acc peak	val acc final	critique coverage
GRPO baseline	0.366	0.36	—
TC-RLSD critique+gold	0.374	0.36	3–4 %

Essentially tied (+0.008 at peak, within noise). Cause: gsm8k is too easy. By epoch 10 the train reward hits 0.93, so almost no rollouts are wrong, critique coverage collapses to 3–4 %, $w_t \approx 1.0$, and the modulation is negligible. This is consistent with the $A=0$ failure mode the additive form targets.

5.2 MATH-lighteval 4-way (mean@1, apples-to-apples at 16 train GPUs, n=8)

method	val peak	val final	Δ-gap	critique cov.
GRPO (vanilla RLVR)	0.514	0.514	—	—
RLSD-gold (Apple replication)	0.518	0.518	—	1.0 (gold mode)
RLSD-self-critique (SDPO-style)	0.517	0.517	−0.029	0.149
TC-RLSD critique (Qwen-4B external)	0.525	0.525	−0.037	0.136

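The cross-family external teacher scores highest: +1.1 / +0.7 / +0.8 points over GRPO / RLSD-gold / RLSD-self-critique respectively, on harder data where the verifier stays informative. These are mean@1 numbers, so the margins should be read with the same caution as the +0.008 in 5.1. The larger-magnitude Δ-gap (−0.037 vs −0.029) is consistent with a more discriminating per-token teacher signal.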
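The margins follow directly from the final validation accuracies in the table; a short check, with the dictionary keys abbreviated from the method names:

```python
val_final = {
    "GRPO": 0.514,
    "RLSD-gold": 0.518,
    "RLSD-self-critique": 0.517,
    "TC-RLSD critique": 0.525,
}

ours = val_final["TC-RLSD critique"]
# Margin of the external-critique run over each other method, in accuracy points.
margins = {
    name: round((ours - acc) * 100, 1)
    for name, acc in val_final.items()
    if name != "TC-RLSD critique"
}
print(margins)
```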

5.3 Planned — 6-way comparison with the additive / hybrid forms (avg@32)

method	advantage form	hyperparameters
GRPO (baseline)	$A$	—
RLSD-gold	$A((1-\lambda)+\lambda w_t)$	gold mode, $\lambda=0.5\to0$
RLSD-self	$A((1-\lambda)+\lambda w_t)$	self critique
TC-RLSD-critique multiplicative	$A((1-\lambda)+\lambda w_t)$	external Qwen critique
TC-RLSD-add new	$(1-\lambda)A + \lambda\Delta_t$	external Qwen critique
TC-RLSD-hybrid new	$\alpha A((1-\lambda)+\lambda w_t) + \beta\Delta_t$	$\alpha=0.7,\;\beta=0.3$

Hypothesis: the additive forms outperform multiplicative on saturated groups (later in training, or on easier subsets) because the dense $\Delta_t$ term survives where $A=0$.

6. Key finding — verdict false-negatives + lenient-bias prompt

The rule reward for gsm8k / MATH is sensitive to #### / \boxed extraction → a correct answer in a malformed output is graded incorrect → the critique generator, told "this is wrong, find the error," hallucinates a non-existent error.
First fix attempt: grade_info=gold + _WHOLE_REJUDGE system prompt with "verdict may be a format artifact, re-judge." Worked for format false-negatives but introduced the opposite bias.
Lenient-bias bug: a Qwen probe on "x+5=12, find x²" with student answer "x=17 → x²=289" had Qwen defend the answer as "mathematically accurate; the error lies in grader's format parsing." Qwen ignored the verifier verdict and the gold reference because the hedge was too strong.
Final fix: drop the rejudge prompt. Keep _WHOLE_INCORRECT (or _WHOLE_CORRECT) verdict-conditioned prompt, just add a brief factual note that grader parsing is regex-based. Teacher uses judgment without an explicit override license.
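The format false-negative is easy to reproduce with a toy extractor. This is a sketch only: the regex and function names are hypothetical stand-ins for the actual rule reward, which is not shown in this note, and the example reuses the "x+5=12, find x²" probe with a correct student answer.

```python
import re

# Hypothetical extractor: takes the last \boxed{...} with no nested braces.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def extract_boxed(text):
    matches = BOXED.findall(text)
    return matches[-1].strip() if matches else None

def rule_reward(response, gold):
    pred = extract_boxed(response)
    return 1.0 if pred is not None and pred == gold else 0.0

well_formed = r"x = 7, so x^2 = \boxed{49}"
malformed = "x = 7, so x^2 = 49."   # correct maths, but no \boxed{} wrapper

print(rule_reward(well_formed, "49"))  # graded correct
print(rule_reward(malformed, "49"))    # graded incorrect: a format false-negative
```

A critique generator that is only told the second response "is wrong" has no real error to find, which is the situation the factual note about regex-based parsing is meant to cover.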

Repo: oishikimchi97/tc-rlsd2 (private monorepo). Reports / critique samples / metric extractions live under reports/.
