EMA Policy Gradient: Taming Reinforcement Learning for LLMs with EMA Anchor and Top-k KL
Lunjun Zhang, Jimmy Ba
arXiv 2026
Top-k KL flexibly interpolates between exact and sampled KL, while remaining unbiased at any k.
I am a CS PhD candidate in the Machine Learning Group at University of Toronto, advised by Professor Jimmy Ba.
I was a student researcher at Google DeepMind for over a year (2024-2025), working on LLM reasoning, post-training, and evaluation.
I was an early employee at the self-driving startup Waabi from 2021 to 2024, studying under Raquel Urtasun.
I studied Engineering Science at the University of Toronto, and interned at Vector Institute, Mila, and Uber ATG.
Contact: Email / Google Scholar / LinkedIn / Twitter
I am broadly interested in building general-purpose agents in the digital and physical worlds, with a focus on recursive self-improvement.
I currently work on improving various aspects of language model reasoning and agentic capabilities.
Previously, I worked on unsupervised learning of perception, prediction, and planning in robotics.
Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, Rishabh Agarwal
International Conference on Learning Representations (ICLR), 2025
Reward models are better with next-token prediction and chain-of-thought, too.
Lunjun Zhang, Yuwen Xiong, Ze Yang, Sergio Casas, Rui Hu, Raquel Urtasun
International Conference on Learning Representations (ICLR), 2024
[Paper] [Proceedings] [Poster] [Website]
A foundation model for self-driving that explicitly reasons in both 3D space and time.
Lunjun Zhang, Anqi Joyce Yang, Yuwen Xiong, Sergio Casas, Bin Yang, Mengye Ren, Raquel Urtasun
Conference on Computer Vision and Pattern Recognition (CVPR), 2023
[Paper] [Proceedings] [Poster] [Website]
Self-supervised, scalable object discovery in the wild.
Lunjun Zhang, Ge Yang, Bradly Stadie
International Conference on Machine Learning (ICML), 2021 (Long Talk)
[Paper] [Proceedings] [Poster] [Website] [Code]
Unsupervised long-horizon planning via graph-structured world models.