RLFromScratch

This repo implements Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO) from scratch in PyTorch, without relying on off-the-shelf libraries like TRL or VERL.
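
For orientation, the DPO objective is compact enough to write out directly. The sketch below is a minimal, illustrative version (the function and tensor names are assumptions, not this repo's API); it assumes the per-sequence log-probabilities have already been summed over response tokens for the chosen and rejected completions.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss: push the policy to prefer the chosen completion over the
    rejected one, measured relative to a frozen reference model.
    (Illustrative sketch; names are hypothetical, not this repo's code.)"""
    # Log-ratios of policy vs. reference for each completion
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin), averaged over the batch
    logits = beta * (chosen_logratios - rejected_logratios)
    loss = -F.logsigmoid(logits).mean()
    # Preference accuracy: fraction of pairs where the policy already prefers "chosen"
    accuracy = (logits > 0).float().mean()
    return loss, accuracy
```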

Why this repo

To open the black box. We unpack the training details (masking, KL penalties, scheduling, and evaluation) so you can see exactly how these algorithms work in practice.
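
As an illustration of those pieces, here is a minimal GRPO-style loss with group-normalized advantages, a response mask so prompt and padding tokens do not contribute to the loss, and a per-token KL penalty against a frozen reference model. This is a simplified single-update form without PPO-style ratio clipping, and the names and shapes are assumptions rather than this repo's exact implementation.

```python
import torch

def grpo_loss(policy_logps, ref_logps, rewards, response_mask, kl_coef=0.04):
    """policy_logps, ref_logps: (G, T) per-token log-probs for a group of G
    completions of the same prompt; rewards: (G,); response_mask: (G, T) with
    1.0 for response tokens and 0.0 for prompt/padding.
    (Illustrative sketch; not this repo's exact code.)"""
    # Group-relative advantage: normalize rewards within the group of G samples
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)          # (G,)
    # Per-token policy-gradient term, weighted by the sequence-level advantage
    pg = -advantages.unsqueeze(1) * policy_logps                               # (G, T)
    # Unbiased per-token KL estimate to the reference policy (the "k3" estimator)
    log_ratio = ref_logps - policy_logps
    kl = torch.exp(log_ratio) - log_ratio - 1.0                                # (G, T)
    per_token = pg + kl_coef * kl
    # Mask out prompt/padding tokens, average over response tokens, then over the group
    loss = (per_token * response_mask).sum(dim=1) / response_mask.sum(dim=1).clamp(min=1)
    return loss.mean()
```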

Quick results

  • GRPO on Llama-3.2-1B-Instruct (GSM8K): ~10% → ~23% accuracy in 1 epoch.
  • DPO on Llama-3.2-1B using Tiny-Safe-Pair (safe-pair-data): ~50% → ~60% preference accuracy in 3 epochs.

Both evaluation pipelines are included.
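
For context on how the GSM8K number is typically scored: extract the final numeric answer from each generation and compare it with the number after `####` in the reference solution. A rough sketch of that general scheme, assuming this convention rather than describing the repo's exact pipeline:

```python
import re

def extract_answer(text):
    """Pull the last number out of a generation or a '#### 42'-style reference."""
    matches = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def gsm8k_accuracy(generations, references):
    """Exact-match accuracy on final numeric answers (illustrative sketch)."""
    correct = 0
    for gen, ref in zip(generations, references):
        pred = extract_answer(gen)
        gold = extract_answer(ref.split("####")[-1])
        correct += int(pred is not None and gold is not None and abs(pred - gold) < 1e-6)
    return correct / max(len(generations), 1)
```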

Training setup

The scripts default to multi-GPU training with PyTorch DDP, and can be easily adapted to a single GPU by adjusting the launch command and disabling distributed initialization.
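
A common pattern for that, sketched below with hypothetical names (the script name and environment-variable checks are assumptions, not necessarily what this repo does): launch with `torchrun` for multi-GPU runs, and only initialize the process group when more than one process is present.

```python
# Multi-GPU launch (example): torchrun --nproc_per_node=4 train_grpo.py
# Single-GPU launch (example): python train_grpo.py
import os
import torch
import torch.distributed as dist

def setup_distributed():
    """Initialize DDP only when launched with torchrun; otherwise run single-GPU."""
    if int(os.environ.get("WORLD_SIZE", "1")) > 1:
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        return local_rank, dist.get_world_size()
    return 0, 1  # single-GPU fallback: rank 0, world size 1
```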

