Diffusion Policy Policy Optimization




TL;DR: We introduce DPPO, an algorithmic framework and set of best practices for fine-tuning diffusion-based policies in continuous control and robot learning tasks. DPPO shows marked improvements over both diffusion-based and non-diffusion baselines across a variety of tasks, including sim-to-real transfer.



Approach overview

DPPO introduces a two-layer Diffusion Policy MDP, with the inner MDP representing the denoising process and the outer MDP representing the environment --- each step of the combined MDP has a Gaussian likelihood and can therefore be optimized with policy gradients. DPPO builds on Proximal Policy Optimization (PPO) and proposes a set of best practices, including modifications to the denoising schedule, to ensure fine-tuning efficiency and stability.
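To make the two-layer MDP concrete, below is a minimal sketch (not the released DPPO code) of how each denoising step can be treated as a Gaussian action whose likelihood feeds the standard PPO clipped surrogate. The names (`denoiser`, `denoise_logprob`, `dppo_ppo_loss`) and the batch layout are illustrative assumptions.

```python
# Minimal sketch of the two-layer Diffusion Policy MDP objective.
# Each denoising step is a Gaussian "action"; its log-likelihood is exact,
# so PPO's clipped surrogate applies per (env step, denoising step) pair.
import torch

def denoise_logprob(denoiser, obs, a_k, a_km1, k, sigma_k):
    """Gaussian log-likelihood of one denoising transition a_k -> a_{k-1}.

    The denoiser predicts the mean of the next (less noisy) action; the
    std sigma_k comes from the (possibly clipped) noise schedule.
    """
    mean = denoiser(obs, a_k, k)                      # predicted posterior mean
    dist = torch.distributions.Normal(mean, sigma_k)
    return dist.log_prob(a_km1).sum(-1)               # sum over action dims

def dppo_ppo_loss(denoiser, batch, clip_eps=0.2):
    """PPO clipped surrogate over all stored denoising steps.

    `batch` holds, for each (env step, denoising step) pair: obs, a_k (noisier
    action), a_km1 (sampled next action), k (step index), sigma_k (schedule
    std), old_logprob (at sampling time), and advantage (here, simply the
    environment-level advantage assigned to that env step's denoising steps).
    """
    new_logprob = denoise_logprob(
        denoiser, batch["obs"], batch["a_k"], batch["a_km1"],
        batch["k"], batch["sigma_k"],
    )
    ratio = torch.exp(new_logprob - batch["old_logprob"])
    adv = batch["advantage"]
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```

Because every transition in both layers has an exact Gaussian likelihood, no approximation of the diffusion model's marginal likelihood is needed; the advantage comes from the environment reward alone.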



Performance evaluation

DPPO yields consistent and marked improvements in training stability and final performance compared to other diffusion-based RL algorithms and to common policy parameterizations such as Gaussian and Gaussian Mixture. Most remarkably, DPPO achieves robust zero-shot sim-to-real transfer (using no real data) in a state-based, long-horizon assembly task, while the Gaussian policy shows a significant sim-to-real gap and consistently triggers hardware errors.



The BC-only policy tends to exhibit haphazard behavior, e.g., not ensuring the peg is properly inserted before loosening the grip. After RL fine-tuning, the DPPO policy exhibits more robust insertion behavior.



The DPPO policy shows robust recovery behavior: here, the peg is pushed away after a failed grasp, but the robot then relocates to the peg and drags it back to the proper location. Such behavior is not present in the expert demonstrations or in the BC-only policy.



DPPO solves the more challenging Square and Transport tasks from robomimic to >90% success rates using either state or pixel input and only sparse reward. To our knowledge, DPPO is the first RL algorithm to solve Transport to a >50% success rate. The final behavior is robust and smooth without any regularization or reward shaping during training.



In three multi-stage assembly tasks from Furniture-Bench, One-leg, Lamp, and Round-table, DPPO improves the success rate of pre-trained policies from 57% to 97%, 12% to 87%, and 1% to 86%, respectively, learning from only sparse reward.



Although the fine-tuned Gaussian policy can sometimes achieve a high success rate in simulation, e.g., in the Lamp task, its behavior is very jittery and unstable and thus unlikely to transfer well to the real world. We discuss the reason from the perspective of exploration below. We also find that adding an action penalty to smooth the behavior hinders fine-tuning.





Understanding DPPO's properties

Through investigative experiments, we find that DPPO engages in structured, on-manifold exploration around the expert data. The Gaussian policy generates less structured exploration noise (especially in M2), and the Gaussian Mixture policy exhibits narrower coverage. DPPO's structured exploration also leads to more natural behavior after fine-tuning, which in turn enables robust sim-to-real transfer.



DPPO preserves iterative action refinement through the denoising process and generates policies that are robust to perturbations in the dynamics and the initial state distribution. Such robustness is crucial for maintaining training stability and, notably, for allowing more extensive domain randomization in simulation to facilitate sim-to-real transfer.
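For intuition, here is a minimal sketch (with illustrative names and schedule, not the released implementation) of the iterative refinement that produces each executed action: the policy starts from pure noise and repeatedly denoises it, and the per-step stds double as the exploration noise that the PPO objective above optimizes over.

```python
# Minimal sketch of sampling one action by iterative denoising.
import torch

@torch.no_grad()
def sample_action(denoiser, obs, sigmas, action_dim):
    """Iteratively refine a noisy action into an executable one.

    sigmas[k] is the std injected after denoising step k, ordered from the
    first (noisiest) step to the last; keeping the final entries small makes
    the executed action nearly deterministic and smooth, while the earlier,
    noisier steps provide exploration.
    """
    a = torch.randn(action_dim)                  # start from pure noise
    for k, sigma_k in enumerate(sigmas):
        mean = denoiser(obs, a, k)               # predicted cleaner action
        a = mean + sigma_k * torch.randn_like(a) # refine, with injected noise
    return a                                     # executed in the environment
```

Keeping the later stds small preserves the refined, smooth actions at execution time, while the earlier steps supply the structured exploration discussed above.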


Citation