SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

     

[Figure: SCOPE overview]


🔥 News

📖 Overview

On-Policy Distillation (OPD) alleviates alignment gaps by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality.

Existing OPD Limitations:

  • Diversity degradation: Correct paths are reinforced equally, reducing exploration at the capability boundary
  • Rectification inefficiency: Noisy teacher signals mislead incorrect trajectories

SCOPE Solution: We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths.


📝 Abstract

On-Policy Distillation (OPD) alleviates alignment gaps by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. For correct trajectories, it applies student-perplexity-weighted MLE to concentrate reinforcement on low-confidence samples at the capability boundary rather than over-reinforcing already mastered ones. Both paths employ group-level normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts. Extensive experiments on six reasoning benchmarks show that SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating its consistent effectiveness.


πŸ† Key Contributions

  • Empirical analysis of signal quality heterogeneity in OPD: Shows that teacher perplexity reliably predicts corrective capability on incorrect trajectories, while student perplexity identifies capability-boundary samples among correct ones.

  • The SCOPE dual-path adaptive framework: Routes rollouts by correctness, directing incorrect trajectories to teacher-perplexity-weighted OPD and correct trajectories to student-perplexity-weighted MLE.

  • Extensive experimental validation: Achieves 11.42% Avg@32 and 7.30% Pass@32 relative improvement over baselines on six reasoning benchmarks.


📖 Method

SCOPE Framework

SCOPE is a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths:

| Path | Trajectories | Method | Objective |
| --- | --- | --- | --- |
| Student Path | Correct (Ω_c) | Perplexity-weighted MLE | Reinforce unconventional valid paths at capability boundary |
| Teacher Path | Incorrect (Ω_w) | Perplexity-weighted KL distillation | Filter out context-induced noise, prioritize reliable guidance |

Weight Formulation

Student-guided weight (for correct trajectories Ω_c):

$$
w_i^{stu} = \frac{\text{PPL}_S(y_i|x)^{1/\tau}}{\sum_{j \in \Omega_c} \text{PPL}_S(y_j|x)^{1/\tau}}
$$

Amplifies "unconventional valid paths" at the capability boundary by up-weighting correct trajectories with high student perplexity.

Teacher-guided weight (for incorrect trajectories Ω_w):

$$
w_i^{tea} = \frac{\text{PPL}_T(y_i|x)^{-1/\tau}}{\sum_{j \in \Omega_w} \text{PPL}_T(y_j|x)^{-1/\tau}}
$$

Filters "context-induced noise" by down-weighting high teacher perplexity instances.
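Both weights can be read as a softmax over log-perplexity within a group, which is a convenient form for numerically stable computation. The first two lines below follow directly from the two definitions above. The last line is the standard length-normalized perplexity and is an assumption here, since this page does not spell out how PPL is computed; the symbol \pi_M (the token distribution of model M) is introduced only for this illustration.

```latex
\begin{aligned}
w_i^{stu} &= \frac{\exp\left(\tfrac{1}{\tau}\log \text{PPL}_S(y_i|x)\right)}{\sum_{j \in \Omega_c} \exp\left(\tfrac{1}{\tau}\log \text{PPL}_S(y_j|x)\right)}, \\
w_i^{tea} &= \frac{\exp\left(-\tfrac{1}{\tau}\log \text{PPL}_T(y_i|x)\right)}{\sum_{j \in \Omega_w} \exp\left(-\tfrac{1}{\tau}\log \text{PPL}_T(y_j|x)\right)}, \\
\log \text{PPL}_M(y|x) &= -\frac{1}{|y|}\sum_{t=1}^{|y|} \log \pi_M(y_t|x, y_{<t}), \quad M \in \{S, T\}.
\end{aligned}
```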

Key Insight

Within each prompt's trajectory group, SCOPE applies group-level perplexity-based normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts.
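The group-level normalization can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names `sequence_ppl` and `group_weights` are hypothetical, perplexity is assumed to be length-normalized, and the toy perplexity values are made up.

```python
import numpy as np

def sequence_ppl(token_logprobs):
    """Perplexity of one response from its per-token log-probs (assumed length-normalized)."""
    token_logprobs = np.asarray(token_logprobs, dtype=np.float64)
    return float(np.exp(-token_logprobs.mean()))

def group_weights(ppls, tau=1.0, ppl_positive=True):
    """Normalize PPL^(+1/tau) or PPL^(-1/tau) over one prompt's non-empty trajectory group."""
    ppls = np.asarray(ppls, dtype=np.float64)
    sign = 1.0 if ppl_positive else -1.0
    logits = sign * np.log(ppls) / tau
    logits = logits - logits.max()  # stabilizes exp(); the shift cancels in the ratio
    w = np.exp(logits)
    return w / w.sum()

# Toy group for a single prompt (made-up values):
student_ppl_correct = [1.2, 2.0, 4.0]  # correct rollouts, scored by the student
teacher_ppl_wrong = [1.5, 6.0]         # incorrect rollouts, scored by the teacher

w_stu = group_weights(student_ppl_correct, tau=1.0, ppl_positive=True)
w_tea = group_weights(teacher_ppl_wrong, tau=1.0, ppl_positive=False)
```

Because each group is normalized separately, the weights within a group always sum to 1, so a prompt whose rollouts all have high perplexity does not dominate an easier prompt.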

Overall Objective

The combined SCOPE objective jointly optimizes:

$$
\mathcal{L}_{SCOPE} = \sum_{i \in \Omega_c} w_i^{stu} \cdot \mathcal{L}_{MLE} + \sum_{i \in \Omega_w} w_i^{tea} \cdot \mathcal{L}_{OPD}
$$
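The routing and the weighted sum for one prompt group might look like the sketch below. It assumes the per-trajectory MLE and OPD losses are already available as scalars; in actual training they would be framework tensors, and whether gradients flow through the weights is not stated on this page. The names `scope_group_loss` and `_normalized` are hypothetical.

```python
import numpy as np

def _normalized(ppl, exponent):
    # PPL**exponent normalized to sum to 1; an empty group yields an empty array
    if ppl.size == 0:
        return ppl
    logits = exponent * np.log(ppl)
    w = np.exp(logits - logits.max())
    return w / w.sum()

def scope_group_loss(is_correct, student_ppl, teacher_ppl, mle_loss, opd_loss, tau=1.0):
    """Combine the student path (correct rollouts) and teacher path (incorrect rollouts)."""
    correct = np.asarray(is_correct, dtype=bool)
    student_ppl, teacher_ppl, mle_loss, opd_loss = (
        np.asarray(a, dtype=np.float64)
        for a in (student_ppl, teacher_ppl, mle_loss, opd_loss)
    )
    # Student path: higher student perplexity -> larger weight on the MLE loss
    w_stu = _normalized(student_ppl[correct], 1.0 / tau)
    # Teacher path: higher teacher perplexity -> smaller weight on the OPD (KL) loss
    w_tea = _normalized(teacher_ppl[~correct], -1.0 / tau)
    student_term = float(np.sum(w_stu * mle_loss[correct]))
    teacher_term = float(np.sum(w_tea * opd_loss[~correct]))
    return student_term + teacher_term

# Four rollouts for one prompt (made-up values); rollouts 0 and 2 are correct.
loss = scope_group_loss(
    is_correct=[True, False, True, False],
    student_ppl=[2.0, 9.0, 6.0, 9.0],
    teacher_ppl=[9.0, 1.0, 9.0, 4.0],
    mle_loss=[1.0, 0.0, 2.0, 0.0],
    opd_loss=[0.0, 3.0, 0.0, 1.0],
)
```

In this sketch a group in which every rollout is correct (or every rollout is incorrect) contributes only one of the two terms.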


📊 Main Results

Mathematical Reasoning (Teacher: Skywork-OR1-Math-7B → Student: DeepSeek-R1-Distill-Qwen-1.5B)

| Benchmark | Avg@32 | Pass@32 | vs OPD |
| --- | --- | --- | --- |
| AIME24 | 42.7 | 77.9 | +6.22% |
| AIME25 | 30.4 | 50.9 | +5.19% |
| AMC23 | 80.9 | 97.2 | +6.59% |
| MATH500 | 89.8 | 97.9 | +0.90% |
| Minerva | 37.8 | 55.1 | +8.31% |
| Olympiad | 49.7 | 70.9 | +10.69% |

Key findings:

  • 11.42% average relative improvement in Avg@32 over competitive baselines
  • 7.30% average relative improvement in Pass@32 over competitive baselines
  • +5.54% average improvement over standard OPD across benchmarks
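For readers unfamiliar with the metrics, the sketch below shows how Avg@k, Pass@k, and a relative improvement are commonly computed. The definitions are the usual ones and are an assumption here, since this page does not define them; the function names and the toy correctness matrix are hypothetical and are not the reported results.

```python
import numpy as np

def avg_at_k(correct):
    """Mean accuracy over all k samples of all problems; `correct` has shape (num_problems, k)."""
    return float(np.asarray(correct, dtype=np.float64).mean())

def pass_at_k(correct):
    """Fraction of problems with at least one correct sample among the k drawn."""
    return float(np.asarray(correct, dtype=bool).any(axis=1).mean())

def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# Toy example: 3 problems, k = 4 samples each (1 = correct rollout).
correct = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
avg4 = avg_at_k(correct)    # 5 correct samples out of 12
pass4 = pass_at_k(correct)  # 2 of 3 problems solved at least once
gain = relative_improvement(55.0, 50.0)
```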

⚡ Quick Start

1. Install Dependencies

pip install -r requirements.txt
pip install -e .  # install verl itself

2. Deploy VLLM Service

bash deploy_vllm.sh

Key configurations in deploy_vllm.sh:

| Parameter | Description | Default |
| --- | --- | --- |
| model_name_or_path | Model path | ./Models/Skywork-OR1-7B |
| served_model_name | Model name in API | Skywork-OR1-7B |
| --api-key | API authentication key | xxx (must match verl/utils/api_interface.py) |

3. Configure Experiment Scripts

Set the following in run_experiment_distill_1_5b.sh:

TEACHER_MODEL_NAME=Skywork-OR1-7B  # Must match served_model_name in deploy_vllm.sh
IP_POOL="['xx.xxx.x.xx','...']"    # VLLM service node IP list

API Key Consistency: The --api-key in deploy_vllm.sh must match the api_key in verl/utils/api_interface.py.
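As an illustration of how these settings fit together, the sketch below turns an IP_POOL string into base URLs and builds the authentication header that an OpenAI-compatible vLLM server expects. It does not reproduce verl/utils/api_interface.py: the helper names are hypothetical, the IP addresses are placeholders, and port 8000 is only vLLM's default, so adjust it to whatever deploy_vllm.sh actually uses.

```python
import ast

def build_teacher_endpoints(ip_pool, port=8000):
    """Parse an IP_POOL string such as "['10.0.0.1','10.0.0.2']" into base URLs."""
    ips = ast.literal_eval(ip_pool)  # safely evaluates the Python-style list literal
    if not isinstance(ips, (list, tuple)) or not all(isinstance(ip, str) for ip in ips):
        raise ValueError("IP_POOL must be a list of strings")
    return [f"http://{ip}:{port}/v1" for ip in ips]

def auth_headers(api_key):
    """Header carrying the key; it must equal the --api-key value given to the server."""
    return {"Authorization": f"Bearer {api_key}"}

# Placeholder values for illustration only.
endpoints = build_teacher_endpoints("['10.0.0.1','10.0.0.2']")
headers = auth_headers("xxx")
```

A request sent with a key that differs from the server's --api-key is rejected, which is why the two values have to be kept in sync.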

4. Run Training

bash run_experiment_distill_1_5b.sh

🔧 Training Parameters

Model Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| POLICY_MODEL_PATH | Student model path | DeepSeek-R1-Distill-Qwen-1.5B |
| TEACHER_MODEL_NAME | Teacher model name (as registered in VLLM) | Skywork-OR1-7B |
| IP_POOL | VLLM service node IP list | ['xx.xxx.x.xx','...'] |

Data Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| TRAIN_DATA | Training data path | ./verl-distillation-ori/data/deepmath_new/deepmath_new_train.parquet |
| VAL_DATA | Validation data path | ./verl-distillation-ori/data/aime/test.parquet |
| MAX_PROMPT_LENGTH | Max prompt length | 2048 |
| MAX_RESPONSE_LENGTH | Max response length | 12288 |

SCOPE Dual-Path Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| USE_SCOPE_DUAL_PATH_WEIGHTING | Enable SCOPE dual-path weighting | True |
| SCOPE_TAU | Weight temperature parameter | 1 |
| SCOPE_USE_SEQ_WEIGHTS | Use sequence-level weights | True |
| USE_STUDENT_PATH_WEIGHTS | Use student path weights | True |
| USE_TEACHER_PATH_WEIGHTS | Use teacher path weights | True |
| STUDENT_PATH_PPL_POSITIVE | Student path: higher PPL → higher weight | True |
| TEACHER_PATH_PPL_POSITIVE | Teacher path: higher PPL → lower weight | False |
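One plausible reading of these settings is that each `*_PPL_POSITIVE` flag selects the sign of the perplexity exponent and SCOPE_TAU scales it. The sketch below illustrates that reading with the defaults from the table. The mapping is an assumption drawn from the parameter descriptions, not from the training code, and `path_weights` is a hypothetical helper.

```python
import numpy as np

# Defaults copied from the table above.
config = {
    "SCOPE_TAU": 1.0,
    "STUDENT_PATH_PPL_POSITIVE": True,
    "TEACHER_PATH_PPL_POSITIVE": False,
}

def path_weights(ppls, tau, ppl_positive):
    """Group-normalized weights with exponent +1/tau (positive) or -1/tau (negative)."""
    exponent = (1.0 if ppl_positive else -1.0) / tau
    w = np.asarray(ppls, dtype=np.float64) ** exponent
    return w / w.sum()

ppls = [1.0, 3.0]  # made-up perplexities for two rollouts of one prompt
stu = path_weights(ppls, config["SCOPE_TAU"], config["STUDENT_PATH_PPL_POSITIVE"])
tea = path_weights(ppls, config["SCOPE_TAU"], config["TEACHER_PATH_PPL_POSITIVE"])
flat = path_weights(ppls, 1000.0, True)  # a large tau flattens the weights toward uniform
```

Under this reading, a smaller SCOPE_TAU sharpens the weight distribution within a group, while a very large value makes the weighting nearly uniform.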

🤝 Acknowledgements

This work builds on verl and the on-policy distillation paradigm. We thank their authors for their contributions to the research community.

🔗 Citation

If you find our work useful, please consider citing:

@article{scope2026,
  title={SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting},
  author={Zheng, Binbin and Ma, Xing and Liang, Yiheng and Ruan, Jingqing and Fu, Xiaoliang and Lin, Kepeng and Zhu, Benchang and Zeng, Ke and Cai, Xunliang},
  journal={arXiv preprint arXiv:2604.10688},
  year={2026}
}

📝 License

This project is licensed under the MIT License. See the LICENSE file for details.
