Turning RLVR's sparse success/failure reward into a dense, token-level training signal — by treating an external teacher's natural-language critique as a self-distillation target, with an additive teacher branch that survives even when the verifier signal collapses.
Two recent papers turn self-distillation into a way to enrich the RLVR signal. We position TC-RLSD as a generalisation of both.
| | SDPO (arXiv 2601.20802) | RLSD (arXiv 2604.03128, Apple) |
|---|---|---|
| Teacher | self, conditioned on feedback $f$ | self, conditioned on privileged context (gold) |
| Feedback $f$ | env errors / peer rollouts | gold solution (multimodal) |
| Advantage | $\hat A_t^{\text{SDPO}} = \Delta_t$ (per-token, dense) | $\hat A_t^{\text{RLSD}} = A \cdot ((1-\lambda)+\lambda w_t)$ (multiplicative) |
| Handles $A=0$ | yes (no verifier needed) | no (multiplicative on $A$) |
| Token-level shaping | direct log-ratio | sign-anchored magnitude reweighting |

where $\Delta_t = \text{sg}\big[\log \pi(y_t \mid x, f, y_{<t}) - \log \pi(y_t \mid x, y_{<t})\big]$ is the stop-gradient log-ratio between the feedback-conditioned and the unconditioned policy, $A$ is the sequence-level verifier advantage, and $w_t$ is RLSD's clipped per-token weight.

## 3. Proposed method: TC-RLSD

### 3.0 Method overview

The same policy weights $\theta$ are used in two roles: as the student, which scores each token given the prompt alone, and as the stop-gradient teacher, which scores the same token given the prompt plus the feedback $f$.

### 3.1 The three advantage forms (all in our codebase)

We implement the RLSD form, the SDPO+GRPO additive hybrid, and our combined form, all selectable via `algorithm.adv_estimator`.

#### Form 1: `rlsd` (multiplicative, RLSD paper)

$$\hat A_t = A \cdot \big((1-\lambda) + \lambda w_t\big)$$

Pros: sign-anchored (the verifier sets the direction, the teacher shapes magnitude only); leakage-free. Cons: dies when $A=0$ on saturated groups.

#### Form 2: `rlsd_add` (additive, SDPO+GRPO hybrid)

$$\hat A_t = (1-\lambda) A + \lambda \Delta_t$$

Pros: a linear combination of two valid advantage estimators (the gradient is linear in the advantage, so summing advantages equals summing policy gradients). It survives $A=0$ via the $\lambda\Delta_t$ term, which gives a dense teacher signal even on saturated data. Cons: no sign-anchor for the teacher term.

#### Form 3: `rlsd_hybrid` (multiplicative + additive, ours)

$$\hat A_t = \alpha A m_t + \beta \Delta_t$$

Here $m_t$ is the multiplicative modulation factor on $A$. This form combines RLSD's verifier-anchored shaping with SDPO's dense fallback and recovers either pure form as a limit ($\beta \to 0$ leaves only the multiplicative branch, $\alpha \to 0$ only the additive one).

### 3.2 Why the additive term matters (RL theory)

Policy gradient is linear in advantage:

$$\nabla_\theta J(\theta) = \mathbb{E}\Big[\sum_t \hat A_t \, \nabla_\theta \log \pi(y_t \mid x, y_{<t})\Big]$$

So summing the verifier-grounded $A_{\text{GRPO}}$ (sparse, sequence-level) and the teacher-grounded $\Delta_t$ (dense, per-token) is the principled way to keep both signals.

#### Token-level vs sequence-level: a covariance view

The same idea in OPD vocabulary: a convex blend of the on-policy-distillation advantage and the RLVR advantage has a natural-gradient covariance with an RLVR part, $C_{\text{RLVR}}$, and an OPD part, $C_{\text{OPD}}$. $C_{\text{RLVR}}$ is rank-1 in token-feature space: every token shares the same sequence-level scalar $A(y)$, and only the summed gradient direction $s$ survives the outer product. $C_{\text{OPD}}$ keeps each $\phi_t\phi_t^{\!\top}$ with its own per-token coefficient $a_ta_t'$, so it is generically full-rank. Adding the two, which is exactly what `rlsd_add` / `rlsd_hybrid` do, strictly enriches the information geometry of the update.

For TC-RLSD, the verifier-grounded $A_{\text{GRPO}}$ plays the sequence-level role and the teacher-grounded $\Delta_t$ plays the per-token role. Summing them is the principled way to keep both signals.

### 3.3 Hyperparameters

| symbol | meaning | recommended |
|---|---|---|
| $\lambda$ | RLSD multiplicative-branch decay | $0.5 \to 0$ over 50 steps |
| $\varepsilon_w$ | $w_t$ clip range | 0.2 |
| $\alpha$ | hybrid weight on RLSD multiplicative branch | 0.7 |
| $\beta$ | hybrid weight on SDPO additive branch | 0.3 |

### 3.4 What's new vs SDPO and RLSD

| | SDPO | RLSD | TC-RLSD (ours) |
|---|---|---|---|
| modality | text (code/math) | multimodal | text (math) |
| teacher source | self | self | external cross-family (Qwen3-4B → Llama-3.2-3B) |
| feedback $f$ | env errors / peers | gold solution | external model's NL critique of student's rollout |
| advantage form | $\Delta_t$ direct | $A\cdot((1-\lambda)+\lambda w_t)$ | 3 forms; ours = $\alpha A m_t + \beta\Delta_t$ |
| handles $A=0$ | yes | no | yes (additive term) |
| cross-family vocab | shared | shared | bridged via text, no shared logits |

## 4. Experimental setup

### 4.1 Models & data

- Student: `meta-llama/Llama-3.2-3B-Instruct`
- Teacher: `Qwen/Qwen3-4B-Instruct-2507`, served by a standing vLLM OpenAI server; fresh critique per rollout, every step
- Data: `gsm8k_level1` (easy, ceiling test) / `MATH-lighteval` (hard, real test-bed)
- Metric: mean@1 (single greedy sample); current runs upgraded to avg@32 ($n=32,\;T=0.6$) for a variance-controlled metric.

### 4.2 Training config (faithful to RLSD paper + ReMax setup)

| item | value |
|---|---|
| RL base | GRPO (`adv_estimator`: `grpo` / `rlsd` / `rlsd_add` / `rlsd_hybrid`) |
| advantage std-normalisation | on |
| lr | 1e-6 |
| entropy_coeff | 1e-3 |
| KL | 0 (`use_kl_loss=False`) |
| train batch | 1024 |
| rollout n (group) | 8 |
| max prompt / response | 1024 / 3072 |
| dynamic_bsz, ppo_max_token/gpu | on, 24000 |
| ppo_mini_batch | 256 |
| RLSD λ / ε_w | 0.5→0 (50 steps) / 0.2 |
| hybrid α / β | 0.7 / 0.3 |
| epochs | 10 |
| val_kwargs.n / temp | 32 / 0.6 (avg@32) |
| save_freq | 80 (final ckpt only) |

### 4.3 Infrastructure

- small-g. 17-node critique runs (1 server + 16 train) at medium-g.
- `hpc run -t miyabi` → pins a git commit and materialises an isolated worktree+venv per job. No `hpc submit` debug-track for real experiments.
- `oishikimchi97/tc-rlsd2`: monorepo (ReMax verl + `recipe/rlsd` + hpc orchestration)
- `oishikimchi97/tc-rlsd`: split repo, kept as backup

### 4.4 Critique feedback options (ablation axes)

- Critique format: `diagnostic` (whole-paragraph) / `rubric` (4-axis PASS/FAIL) / `best_worst` (two-line)
- Grade information (`grade_info`): `none` / `student` (parsed answer only) / `gold` (reference in input only, leak-filtered)
- The earlier `_WHOLE_REJUDGE` prompt, with its "verdict may be a format artifact" hedge, was found to cause lenient bias: Qwen defended real reasoning errors as format issues. It was replaced with `_WHOLE_INCORRECT` plus a brief factual note that grader parsing is regex-based (no rejudge instruction).

## 5. Results

### 5.1 gsm8k_level1, 10 epochs (mean@1; early run, too easy → tied)

| method | val acc peak | val acc final | critique coverage |
|---|---|---|---|
| GRPO baseline | 0.366 | 0.36 | n/a |
| TC-RLSD critique+gold | 0.374 | 0.36 | 3–4 % |

Essentially tied (+0.008 at peak, within noise). Cause: gsm8k is too easy. By epoch 10 the train reward hits 0.93, so almost no rollouts are wrong → critique coverage collapses to 3–4 %, $w \approx 1.0$, and the modulation is negligible. This confirms the $A=0$ failure mode the additive form targets.

### 5.2 MATH-lighteval 4-way (mean@1, apples-to-apples at 16 train GPUs, n=8)

| method | val peak | val final | Δ-gap | critique cov. |
|---|---|---|---|---|
| GRPO (vanilla RLVR) | 0.514 | 0.514 | n/a | n/a |
| RLSD-gold (Apple replication) | 0.518 | 0.518 | n/a | 1.0 (gold mode) |
| RLSD-self-critique (SDPO-style) | 0.517 | 0.517 | −0.029 | 0.149 |
| TC-RLSD critique (Qwen-4B external) | 0.525 | 0.525 | −0.037 | 0.136 |

The cross-family external teacher wins on harder data where the verifier stays informative: +1.1 / +0.7 / +0.8 points over GRPO, RLSD-gold, and RLSD-self-critique respectively. The larger Δ-gap magnitude (−0.037 vs −0.029) also indicates a more discriminating per-token teacher signal.

### 5.3 Planned: 6-way comparison with the additive / hybrid forms (avg@32)

| method | advantage form | hyperparameters |
|---|---|---|
| GRPO (baseline) | $A$ | n/a |
| RLSD-gold | $A((1-\lambda)+\lambda w_t)$ | gold mode, $\lambda=0.5\to0$ |
| RLSD-self | $A((1-\lambda)+\lambda w_t)$ | self critique |
| TC-RLSD-critique | multiplicative $A((1-\lambda)+\lambda w_t)$ | external Qwen critique |
| TC-RLSD-add (new) | $(1-\lambda)A + \lambda\Delta_t$ | external Qwen critique |
| TC-RLSD-hybrid (new) | $\alpha A m_t + \beta\Delta_t$ | $\alpha=0.7,\;\beta=0.3$ |

Hypothesis: the additive forms outperform the multiplicative form on saturated groups (later in training, or on easier subsets) because the dense $\Delta_t$ term survives where $A=0$.

## 6. Key finding: verdict false-negatives + lenient-bias prompt

- Problem: the grader uses regex `####` / `\boxed` extraction → a correct answer in a malformed output is graded incorrect → the critique generator, told "this is wrong, find the error," hallucinates a non-existent error.
- First attempt: `grade_info=gold` + `_WHOLE_REJUDGE` system prompt with "verdict may be a format artifact, re-judge." This worked for format false-negatives but introduced the opposite bias.
- Fix: keep the `_WHOLE_INCORRECT` (or `_WHOLE_CORRECT`) verdict-conditioned prompt and just add a brief factual note that grader parsing is regex-based. The teacher uses its judgment without an explicit override license.

Repo: `oishikimchi97/tc-rlsd2` (private monorepo). Reports / critique samples / metric extractions live under `reports/`.
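
To make the three advantage forms of Section 3.1 concrete, here is a minimal sketch in plain Python. It is not the code in the repo: the function names are hypothetical, the per-token values are toy numbers, and it assumes that the hybrid modulation $m_t$ equals RLSD's factor $(1-\lambda)+\lambda w_t$, which the text above does not state explicitly. It only illustrates the $A=0$ behaviour discussed in Section 3.2: the multiplicative form vanishes on a saturated group while the additive and hybrid forms keep the teacher term.

```python
from typing import List


def rlsd_advantage(A: float, w: List[float], lam: float) -> List[float]:
    """Form 1 (multiplicative): A * ((1 - lam) + lam * w_t)."""
    return [A * ((1.0 - lam) + lam * w_t) for w_t in w]


def rlsd_add_advantage(A: float, delta: List[float], lam: float) -> List[float]:
    """Form 2 (additive): (1 - lam) * A + lam * delta_t."""
    return [(1.0 - lam) * A + lam * d_t for d_t in delta]


def rlsd_hybrid_advantage(A: float, m: List[float], delta: List[float],
                          alpha: float = 0.7, beta: float = 0.3) -> List[float]:
    """Form 3 (hybrid): alpha * A * m_t + beta * delta_t."""
    return [alpha * A * m_t + beta * d_t for m_t, d_t in zip(m, delta)]


# Toy saturated rollout: the group-normalised verifier advantage is zero.
A = 0.0
lam = 0.5
w = [1.1, 0.9, 1.0]          # hypothetical clipped per-token weights
delta = [0.2, -0.1, 0.05]    # hypothetical per-token teacher log-ratios

# Assumption: m_t is RLSD's multiplicative factor (not confirmed by the text).
m = [(1.0 - lam) + lam * w_t for w_t in w]

mult = rlsd_advantage(A, w, lam)        # every entry is zero when A == 0
add = rlsd_add_advantage(A, delta, lam)
hyb = rlsd_hybrid_advantage(A, m, delta)

print("rlsd       :", mult)
print("rlsd_add   :", add)
print("rlsd_hybrid:", hyb)
```

Under the same assumption on $m_t$, calling `rlsd_hybrid_advantage` with `alpha=1.0, beta=0.0` reproduces `rlsd_advantage`, which is one way to read "recovers either pure form as a limit".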