For residual learning that refines existing behavior, sample efficiency depends on two things: how much information each rollout returns, and how efficiently the learner uses that information. Reinforcement learning's standard scalar reward carries far less information than the directional task error, which reports not only how far a rollout missed the task but also in which direction. Random exploration compounds the problem by discarding whatever information each rollout does return.
Through residual learning with directional task-error supervision and a task error model that drives sample selection, we achieve stable three-, four-, and five-ball juggling on anthropomorphic Barrett WAM arms. Despite planning and controlling through a deliberately simple stack with idealized assumptions, the system converges from the second attempt: the first attempt drops, after which task error decreases monotonically without further failures. By comparison, humans typically need years of practice to juggle five balls.
We compare residual learners across two ternary axes, the directional information in the learning feedback and the commitment of the analytic prior, spanning Newton-style Jacobian updates, Composite Bayesian Optimization, and stochastic search methods. Both axes prove necessary: neither directional feedback nor an informative prior suffices alone, and the simplest method that combines them, a fixed-Jacobian Newton update, is the most reliable. The learned residual tolerates a misaligned analytic prior and degraded tracking, with only convergence speed affected. The bottleneck for residual learning on real robots is therefore the information content of the supervision signal and how the learner uses it, not the accuracy of the surrounding stack.
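As a rough illustration of why directional feedback combined with even a crude prior can be so sample efficient, the following minimal sketch runs a fixed-Jacobian Newton update on a toy two-dimensional problem and contrasts it with a random-search baseline that only sees a scalar cost. Everything here is an assumption made for illustration: `toy_task_error`, the matrices inside it, the identity prior `J_prior`, and the baseline are hypothetical stand-ins, not the juggling task, the robot stack, or the methods evaluated in the paper.

```python
import numpy as np


def toy_task_error(u):
    """Hypothetical stand-in for one rollout: returns a 2-D directional task error."""
    A = np.array([[1.2, 0.3], [-0.2, 0.9]])  # true sensitivity, unknown to the learner
    b = np.array([0.5, -0.4])                # offset the idealized prior does not model
    return A @ u + 0.05 * np.sin(u) - b      # mildly nonlinear in the residual u


def fixed_jacobian_newton(error_fn, J_prior, u0, n_attempts=10):
    """Update the residual with a Jacobian that is fixed in advance and never re-estimated."""
    J_inv = np.linalg.inv(J_prior)
    u = np.array(u0, dtype=float)
    history = []
    for _ in range(n_attempts):
        e = error_fn(u)                      # one rollout yields a full error vector
        history.append(float(np.linalg.norm(e)))
        u = u - J_inv @ e                    # Newton-style step using the prior Jacobian
    return u, history


def random_search(error_fn, u0, n_attempts=10, sigma=0.2, seed=0):
    """Baseline that only sees a scalar cost per rollout and perturbs at random."""
    rng = np.random.default_rng(seed)
    best_u = np.array(u0, dtype=float)
    best_cost = float(np.linalg.norm(error_fn(best_u)))
    history = [best_cost]
    for _ in range(n_attempts - 1):
        candidate = best_u + sigma * rng.standard_normal(best_u.shape)
        cost = float(np.linalg.norm(error_fn(candidate)))  # direction is discarded here
        if cost < best_cost:
            best_u, best_cost = candidate, cost
        history.append(best_cost)
    return best_u, history


J_prior = np.eye(2)  # idealized prior: deliberately not equal to the true sensitivity A
u_newton, newton_hist = fixed_jacobian_newton(toy_task_error, J_prior, np.zeros(2))
u_random, random_hist = random_search(toy_task_error, np.zeros(2))
print("Newton error per attempt:", np.round(newton_hist, 4))
print("Random search best cost :", np.round(random_hist, 4))
```

In this toy setting the identity prior is inaccurate but still close enough to the true sensitivity that the Newton-style error norm shrinks at every attempt, whereas the scalar-cost baseline can only keep its best random sample so far. The sketch says nothing about how large a prior mismatch the real system tolerates.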
Best-known residual policies juggling uniform three-, four-, and five-ball patterns on two Barrett WAM arms. Each panel advances attempt by attempt. For four and five balls, the first attempt drops and every subsequent attempt succeeds; for three balls, every attempt succeeds from the first.
The learned five-ball residual remains stable as the tracking controller's gains are scaled down to 75%, 50%, and 25% of nominal. The panels reduce the proportional gain (Kp), the derivative gain (Kd), and both together.
Proportional gain Kp at 100 / 75 / 50 / 25%.
Derivative gain Kd at 100 / 75 / 50 / 25%.
Both gains Kp & Kd at 100 / 75 / 50 / 25%.
Rotating the analytic prior away from its nominal orientation (0° to 90°). The residual learner absorbs increasingly misaligned priors, reaching a stable pattern from the second attempt across rotations.
@unpublished{ploeger2026residual,
title = {Task-Error Residual Learning for Real-Robot Five-Ball Juggling},
author = {Ploeger, Kai and Peters, Jan},
year = {2026},
note = {Under submission},
}