Generative Verifiers: Reward Modeling as Next-Token Prediction
Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, Rishabh Agarwal
International Conference on Learning Representations (ICLR), 2025
Reward models work better when trained with next-token prediction, and they benefit from chain-of-thought reasoning, too.