How GRPO’s Relative Rewards Work
Group Relative Policy Optimization (GRPO) computes a relative "advantage" for each output by comparing its reward to the mean reward of a group of outputs sampled for the same prompt, typically normalized by the group's standard deviation. This group-based baseline removes the need for a separate learned value function (critic model), which makes reinforcement-learning fine-tuning of Large Language Models more memory-efficient and more stable.
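To make the mechanics concrete, here is a minimal sketch of the group-relative advantage computation. The function name `grpo_advantages` and the `eps` stabilizer are illustrative choices, not part of any particular library; the sketch assumes one scalar reward per sampled completion.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages for one prompt's sampled outputs.

    rewards: shape (G,), one scalar reward per completion in the group.
    Returns: shape (G,), each reward standardized against the group.
    """
    mean = rewards.mean()
    std = rewards.std()
    # Each output's advantage is its reward relative to the group-mean
    # baseline, scaled by the group's spread. `eps` (an illustrative
    # guard, not from the original algorithm) avoids division by zero
    # when all rewards in the group are identical.
    return (rewards - mean) / (std + eps)

# Example: four completions sampled for the same prompt.
rewards = torch.tensor([0.2, 0.9, 0.4, 0.5])
print(grpo_advantages(rewards))
# Outputs above the group mean get positive advantages,
# outputs below it get negative ones.
```

Because the baseline is just the statistics of the sampled group, no critic network ever has to be trained or stored, which is where the memory savings come from.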