Checkpoints for "Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models" https://arxiv.org/abs/2410.18252
Shengyi Costa Huang
vwxyzjn
TL;DR summarization checkpoints
These checkpoints were trained as part of https://arxiv.org/abs/2403.17031 and are taken from https://wandb.ai/costa-huang/tldr_summarize/reports/Release--Vmlldzo3MT
- cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr
  Text Generation • Updated • 2.73k
- cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr
  Text Classification • Updated • 1.82k
- cleanrl/EleutherAI_pythia-2.8b-deduped__sft__tldr
  Text Generation • Updated • 4
- cleanrl/EleutherAI_pythia-2.8b-deduped__reward__tldr
  Text Classification • Updated • 6
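The repository names in this list appear to follow the pattern `<org>/<base org>_<base model>__<stage>__<task>`. The sketch below splits a name into those parts and, for the `sft` checkpoints only, loads the model with the `transformers` library. The naming pattern is inferred from the four names listed here, not from any documented convention, and `Checkpoint`, `parse_repo_id`, and `load_sft_checkpoint` are hypothetical helper names introduced for illustration. Loading the `reward` checkpoints is left out because the model class they require is not stated on this page.

```python
from dataclasses import dataclass

# Repository ids copied from the list above.
REPO_IDS = [
    "cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr",
    "cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr",
    "cleanrl/EleutherAI_pythia-2.8b-deduped__sft__tldr",
    "cleanrl/EleutherAI_pythia-2.8b-deduped__reward__tldr",
]


@dataclass(frozen=True)
class Checkpoint:
    repo_id: str     # full Hub repository id
    org: str         # organization hosting the checkpoint
    base_model: str  # base model the checkpoint was trained from
    stage: str       # "sft" or "reward" in the names listed above
    task: str        # "tldr" in the names listed above


def parse_repo_id(repo_id: str) -> Checkpoint:
    """Split a repo id, assuming the inferred `org/base__stage__task` pattern."""
    org, name = repo_id.split("/", 1)
    base, stage, task = name.split("__")
    # Assumption: the first underscore stands in for the "/" of the base model id.
    base_model = base.replace("_", "/", 1)
    return Checkpoint(repo_id, org, base_model, stage, task)


def load_sft_checkpoint(repo_id: str):
    """Load an `sft` checkpoint as a causal language model (downloads weights)."""
    ckpt = parse_repo_id(repo_id)
    if ckpt.stage != "sft":
        raise ValueError(f"expected an sft checkpoint, got stage {ckpt.stage!r}")
    # Imported lazily so that parsing works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    return tokenizer, model


if __name__ == "__main__":
    for repo_id in REPO_IDS:
        print(parse_repo_id(repo_id))
```

Calling `load_sft_checkpoint(REPO_IDS[0])` would download the 1B SFT weights from the Hub, so it is kept out of the `__main__` block, which only parses the names.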
Async RLHF Paper Checkpoints
lm-human-preference-details
RLOO / PPOv2 TL;DR summarize checkpoints