ACM AI AI Scholars Fellowship Application  ·  2025

MLP Bipedal Walker: PPO Training Writeup

Technical writeup of a reinforcement learning project training a PPO agent on the BipedalWalker-v3 environment over 1 million timesteps, covering curriculum learning, GAE tuning, gSDE, observation normalization, and ablation study results.

Prior Experience & Timeline

Prior to this project, I had been a fellow in the fall technical track of AI Safety @ UCLA, where I was introduced to the intuition and mathematics behind RL and given sample exercises to practice implementation. Beyond this, I had neither studied nor implemented any deep learning or other RL models.

I spent a total of 10 hours on this project, broken up over several days. The first three hours covered reviewing requirements, studying PPO and its clipped surrogate objective (Schulman et al., 2017), Stable-Baselines3 (Raffin et al., 2021), Generalized Advantage Estimation (Schulman et al., 2016), and the reward structure and observation/action spaces of BipedalWalker-v3. Hours 4–6 covered ablation experiments; hours 7–8 focused on scaling up training runs toward solving the environment; hours 9–10 finalized conclusions and researched next steps.

Techniques & Justifications

Key Issues Encountered

Although increasing the entropy coefficient and raising the GAE lambda each seemed to improve mean reward on its own, testing them in combination yielded suboptimal results. Both changes inject variance into PPO updates, the entropy bonus through noisier action selection and the higher lambda through noisier advantage estimation, and together they introduced large instability into the learning signal. For a contact-rich task like learning to walk, this quickly becomes untenable.
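As a rough sketch of the knobs involved (the values below are placeholders, not the settings from my ablations), both the entropy bonus and the GAE lambda are single arguments to Stable-Baselines3's PPO constructor:

```python
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("BipedalWalker-v3")

# Placeholder values, not my tuned settings: raising ent_coef adds noise to
# action selection, while raising gae_lambda propagates longer-horizon return
# information into the advantage estimates. Each helped alone; combined, they
# destabilized training.
model = PPO(
    "MlpPolicy",
    env,
    ent_coef=0.01,    # entropy bonus coefficient (SB3 default: 0.0)
    gae_lambda=0.98,  # GAE lambda (SB3 default: 0.95)
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
```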

gSDE, and in particular the combination of gSDE with a high GAE lambda, became more unstable than justifiable in later training, during what should be a policy-refinement rather than an exploratory phase. Increased stochasticity during trajectory generation, combined with amplified propagation of long-horizon reward information, degraded credit assignment and produced a policy that was both brittle and high in variance.
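For reference, gSDE is toggled through two PPO arguments in Stable-Baselines3 (Raffin et al., 2020); the configuration below is only illustrative, and the sampling frequency is an assumed value rather than one from my experiments:

```python
from stable_baselines3 import PPO

# Illustrative gSDE configuration; values are assumptions, not my tuned settings.
# use_sde=True switches exploration to generalized state-dependent noise, and
# sde_sample_freq controls how often that noise matrix is resampled.
model = PPO(
    "MlpPolicy",
    "BipedalWalker-v3",
    use_sde=True,
    sde_sample_freq=4,  # resample noise every 4 steps (-1 = once per rollout)
    gae_lambda=0.98,    # the high-lambda pairing that proved unstable late in training
    verbose=1,
)
```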

Parallel environment sampling increased throughput but caused the model to optimize too aggressively in late training, leading to brittleness as the policy drifted. I was too time-constrained to investigate fixes, but it remains a promising direction.
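For context, the parallel sampling followed the standard vectorized-environment pattern in Stable-Baselines3; the sketch below is illustrative, and the environment count is an assumption rather than my exact setting:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

if __name__ == "__main__":
    # Eight copies of the environment collect rollouts in separate processes
    # (the __main__ guard is required for subprocess start methods).
    vec_env = make_vec_env("BipedalWalker-v3", n_envs=8, vec_env_cls=SubprocVecEnv)

    model = PPO("MlpPolicy", vec_env, verbose=1)
    model.learn(total_timesteps=1_000_000)
```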

Conclusions & Next Steps

Overall, I conclude that curriculum-style scheduling of the exploration/exploitation and advantage/reward tradeoffs over the course of training is among the most crucial parts of training an MLP policy for this environment. Observation normalization unequivocally improved training stability. The effectiveness of gSDE seemed limited by the dominance of stability and reward-structure challenges rather than by insufficient exploration noise; similarly, parallel environment sampling seemed ill-suited to training a stable model given the scope of this project.
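For concreteness, the observation normalization referred to here is the kind provided by Stable-Baselines3's VecNormalize wrapper; a minimal sketch follows, with flag values assumed rather than taken from my exact configuration:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

# VecNormalize keeps a running mean/std of observations and rescales them
# before they reach the policy network.
vec_env = make_vec_env("BipedalWalker-v3", n_envs=1)
vec_env = VecNormalize(vec_env, norm_obs=True, norm_reward=False, clip_obs=10.0)

model = PPO("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=1_000_000)

# The running statistics must be saved with the model and reloaded (with
# updates frozen) when evaluating the trained policy.
vec_env.save("vecnormalize.pkl")
```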

Directions for further investigation include: correlating the number of optimization epochs with rollout length and other training data; exploiting parallel data collection with off-policy algorithms that use experience replay (Haarnoja et al., 2018; Fujimoto et al., 2018); and further tuning KL-based trust-region constraints to preserve training stability.
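As a rough sketch of the last two directions (the hyperparameter values here are assumptions, not results): Stable-Baselines3 makes the off-policy swap a one-line change, and PPO already exposes a KL-based early-stopping constraint via target_kl.

```python
from stable_baselines3 import PPO, SAC

# Off-policy alternative with experience replay (Haarnoja et al., 2018):
# SAC stores past transitions in a buffer, so data from many parallel
# environments is reused across many gradient updates rather than discarded.
sac_model = SAC("MlpPolicy", "BipedalWalker-v3", buffer_size=300_000, verbose=1)

# KL-based constraint in PPO: target_kl stops the optimization epochs early
# once the approximate KL divergence from the old policy exceeds the threshold.
ppo_model = PPO("MlpPolicy", "BipedalWalker-v3", target_kl=0.02, verbose=1)
```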

Works Cited

Farama Foundation. (2023). Gymnasium. https://gymnasium.farama.org

Fujimoto, S., van Hoof, H., and Meger, D. (2018). Addressing function approximation error in actor-critic methods. Proceedings of the 35th International Conference on Machine Learning (ICML).

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proceedings of the 35th International Conference on Machine Learning (ICML).

OpenAI. (2018). Spinning up in deep reinforcement learning. https://spinningup.openai.com

Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., and Dormann, N. (2021). Stable-Baselines3: reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268), 1–8.

Raffin, A., Kober, J., and Stulp, F. (2020). Smooth exploration for robotic reinforcement learning. arXiv:2005.05719.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. arXiv:1506.02438.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.