Smooth Operator

Human-like self-driving with reinforcement learning

Craig Quiter
Jun 15, 2020

Overview

We’ve been able to achieve human-like driving with deep reinforcement learning, something that has not been possible until now. We did this with the introduction of two simple techniques: 1) a training curriculum with graduated movement penalties and more importantly 2) a smooth action discretization that constrains actions without stifling exploration. Both of these techniques can be easily incorporated into any existing RL algorithm that supports discrete actions. We are also open sourcing the faster-than-realtime sim that was used for this work, deepdrive-zero, and the agents trained within it.

An unprotected left, negotiated smoothly with Deep RL.

Motivation

Applying deep RL to continuous action spaces in robotics has been fraught with problems inherent in the random exploration of large spaces, e.g. shakiness, large oscillations, and often seemingly random behaviors.

Unnaturally wild behaviors of deep RL humanoids

The source of these behaviors is the randomness used to train agents in RL. Starting out, agents try things completely at random. Based on how well those actions worked, they statistically do more of the “good” actions and less of the “bad”. For discrete agents, there are a finite number of things an agent can try at any given time (usually around 10–100). For continuous agents, however, the number of possible actions is vastly larger, i.e. the floating point numbers between -1 and 1 (around 1e9 possible distinct values). This makes the problem much more difficult. To compound the problem, the number of possible actions grows exponentially with the number of action dimensions (e.g. in self-driving there are three: steering, throttle, and brake, leading to roughly 1e27 possible actions per time step), and these actions must be chained together over several seconds, at several actions per second, to reach navigation goals. The large search space, and the agent’s random exploration of it while learning from scratch, results in agents flailing around and creates bad habits that persist seemingly no matter how much additional training we apply.
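
To put rough numbers on this, the snippet below just multiplies out the figures quoted above; the ~1e9 count of distinct floats per dimension is the same approximation used in the text, not an exact value:

```python
# Roughly 1e9 distinct floating point values per continuous action dimension,
# as approximated in the text above.
FLOATS_PER_DIM = 1e9

for dims, names in [(1, "steering"), (3, "steering, throttle, brake")]:
    per_step = FLOATS_PER_DIM ** dims
    print(f"{names}: ~{per_step:.0e} possible continuous actions per time step")

# A discrete agent, by contrast, chooses from a handful of options per step.
print("discrete agent: ~10-100 possible actions per time step")
```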

Figure 1: Sweeping behavior of a self-driving agent trained with RL rewarded via time-to-destination.

If we simply penalize jerky actions, agents learn that staying still is the best course of action, forcing them into bad local optima.

Curriculum

To partially overcome this, one useful technique is to introduce a curriculum that progressively increases movement penalties throughout training. This gives the agent a bias for action early in training, when movement penalties are low, and avoids the local optima where the agent stalls or ends the episode before reaching waypoints. Crucially, the agent continues to exhibit this bias for action even when movement penalties are drastically increased. The time to establish this bias is quite short compared to the total training time: e.g. just 5 minutes out of 13 total hours for the agents in the video in Figure 5. The exact hyperparameters and source can be found at the following links for phases 1, 2, and 3 of this curriculum.
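
A schedule like this can be expressed as a handful of phases with increasing penalty coefficients. The sketch below is purely illustrative; the step counts and coefficients are hypothetical placeholders, and the real values live in the linked phase 1, 2, and 3 configs:

```python
from dataclasses import dataclass

@dataclass
class CurriculumPhase:
    steps: int              # environment steps to train in this phase
    gforce_penalty: float   # hypothetical coefficient, see linked configs
    jerk_penalty: float     # hypothetical coefficient, see linked configs

# Penalties start near zero so the agent develops a bias for action,
# then are raised once it reliably reaches waypoints.
CURRICULUM = [
    CurriculumPhase(steps=50_000,    gforce_penalty=0.0,  jerk_penalty=0.0),
    CurriculumPhase(steps=2_000_000, gforce_penalty=0.01, jerk_penalty=0.01),
    CurriculumPhase(steps=2_000_000, gforce_penalty=0.01, jerk_penalty=0.02),
]

def penalties_for_step(step: int):
    """Return the (g-force, jerk) penalty coefficients active at a given step."""
    for phase in CURRICULUM:
        if step < phase.steps:
            return phase.gforce_penalty, phase.jerk_penalty
        step -= phase.steps
    return CURRICULUM[-1].gforce_penalty, CURRICULUM[-1].jerk_penalty
```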

The agents driving in the intersection below are the result of training the flailing agent from Figure 1 with such a curriculum.

Figure 2: Unprotected left with continuous-control agents, fine-tuned to reduce oscillations. The agents were trained with PPO via self-play.

You can see that while this policy is dramatically smoother than where we started, it is still far from something that could be deployed into a real car. So how can we do better? We want to allow the agent to explore its environment without developing suboptimal behaviors that are difficult to unlearn later on.

Enter — The Smooth Operator

Consider the following discrete action space for steering:

Figure 3: Steering actions and their domain in degrees

Discrete steering space

1 m/s² — Large turn: Steering angle that results in 1 m/s² of lateral acceleration. Essentially this is the largest comfortable turn the car can make at a given speed. Since this depends on speed, its domain spans the entire steering range, with larger angles returned at lower speeds and vice versa. The lack of precision in this action is made up for by combining it with the other, smaller actions below.

Decay: Slowly reduce the current steering angle (we use 10%). This allows the agent to gradually transition its heading toward some target without overcorrecting.

±1° — Small turn: Results in an in-lane maneuver at low speeds and a comfortable turn at medium to low speeds, i.e. ~15 kph. The idea is that most steering actions should be small adjustments, so we bias the action space with small angles.

±0.1° — Nudge: Another small angle which acts as a minor tracking adjustment at speeds below 200 kph.

Idle: Sets both steering and throttle to zero, i.e. coasts and goes straight. This is the final action which biases the agent towards steady state.

Notice that only one of the five steering actions, the 1 m/s² large turn, results in a large change to the agent’s trajectory. All the other actions return the agent to a steady state, in accordance with Newton’s first law.

Throttle follows a similar formulation, but with a smaller number of choices due to the relative simplicity of longitudinal control vs. lateral control.

Discrete throttle space

Faster: 1 m/s² acceleration — one quarter throttle

Slower: 1 m/s² deceleration — brake at 1 m/s²

Maintain: Make 1% changes to the throttle to move toward the previous speed

Idle: Set both steering and throttle to zero, i.e. coast and go straight

Figure 4: Throttle actions and their domain in m/s²
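
To make the two discretizations concrete, here is a minimal sketch of how they could map to low-level controls. The wheelbase, full-lock angle, and helper names are assumptions for illustration; the kinematic relation a_lat ≈ v²·tan(steer)/wheelbase is used to turn the 1 m/s² comfort limit into a steering angle, and the exact mapping used in deepdrive-zero may differ:

```python
import math

MAX_COMFORT_ACCEL = 1.0   # m/s², comfort limit from the action definitions above
SMALL_TURN_DEG = 1.0      # ±1° small turn
NUDGE_DEG = 0.1           # ±0.1° nudge
DECAY_RATE = 0.1          # reduce the current steering angle by 10%
WHEELBASE = 2.7           # m, illustrative vehicle parameter

def large_turn_angle(speed, direction, max_steer=math.radians(35.0)):
    """Steering angle giving ~1 m/s² of lateral acceleration at the current speed
    (kinematic bicycle approximation: a_lat ≈ v² * tan(steer) / wheelbase),
    capped at an illustrative full-lock angle."""
    angle = math.atan(MAX_COMFORT_ACCEL * WHEELBASE / max(speed, 0.1) ** 2)
    return direction * min(angle, max_steer)

def apply_steering_action(action, steer, speed):
    """Map a discrete steering action to a new steering angle in radians."""
    if action in ('LARGE_LEFT', 'LARGE_RIGHT'):
        return large_turn_angle(speed, +1 if action == 'LARGE_LEFT' else -1)
    if action == 'DECAY':
        return steer * (1 - DECAY_RATE)
    if action in ('SMALL_LEFT', 'SMALL_RIGHT'):
        sign = +1 if action == 'SMALL_LEFT' else -1
        return steer + sign * math.radians(SMALL_TURN_DEG)
    if action in ('NUDGE_LEFT', 'NUDGE_RIGHT'):
        sign = +1 if action == 'NUDGE_LEFT' else -1
        return steer + sign * math.radians(NUDGE_DEG)
    if action == 'IDLE':
        return 0.0  # idle also zeroes the throttle (see below)
    raise ValueError(action)

def apply_throttle_action(action, target_speed, speed):
    """Map a discrete throttle action to a longitudinal acceleration in m/s²."""
    if action == 'FASTER':
        return MAX_COMFORT_ACCEL              # ~one quarter throttle
    if action == 'SLOWER':
        return -MAX_COMFORT_ACCEL             # brake at 1 m/s²
    if action == 'MAINTAIN':
        return 0.01 * (target_speed - speed)  # small correction toward previous speed
    if action == 'IDLE':
        return 0.0
    raise ValueError(action)
```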

Using this action space enabled us to solve the two-vehicle unprotected left in deepdrive-zero. The unprotected left is widely considered to be one of the most difficult scenarios in self-driving (and human driving!), as it requires predicting and planning around a human-driven obstacle whose path we must cross at high velocity.

Here is a video covering the entire approach, including the use of both the curriculum and the smooth discretization. Qualitatively, the driving looks natural, something that to our knowledge has not been achieved with RL to date.

Figure 5: Training progression using the curriculum and smooth discretization. All training was performed on an 8-core i7 CPU with a multi-agent variation of PPO. The network is a two-layer MLP with 256 tanh units per hidden layer, trained with Adam.
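
For reference, a policy network of this shape takes only a few lines of PyTorch. This is an illustrative stand-in rather than the exact Spinning Up model definition, and the observation and action dimensions are made-up placeholders:

```python
import torch
import torch.nn as nn

class PolicyMLP(nn.Module):
    """Two hidden layers of 256 tanh units, mapping observations to logits
    over the discrete steering/throttle actions."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# Example: sample a discrete action from the policy (dimensions are illustrative).
policy = PolicyMLP(obs_dim=29, n_actions=22)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)  # Adam, as in the caption
logits = policy(torch.randn(1, 29))
action = torch.distributions.Categorical(logits=logits).sample()
```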

Generality

This discretization may seem specific to self-driving, but the underlying principles are quite general. In particular, preferring small actions is fundamental to physics, as it minimizes the energy of the system, which is also why it is a staple of traditional control. What we’ve discovered is that you can achieve this with deep RL by discretizing continuous actions to bias towards low-energy movements. This constrains the types of behaviors an agent will try and makes it much easier for the network to learn a viable policy.

G-force and Jerk

It’s important to note that we measured comfort and smoothness through g-force and jerk. Comfort is usually thought of in terms of g’s; however, jerk, the derivative of g-force, is perhaps an even more important metric when it comes to the design of comfortable human locomotion, and it is critical to the design of everything from trampolines to hospital beds to roller coasters.

Since our actions are limited to 1 m/s², the g-force encountered by the agent doesn’t change throughout training. However, we still need to penalize jerk in order to reduce swerving and lurching. The jerk curriculum simply consists of doubling jerk penalties in the final stage of training (see the end of the video for what happens when you do this). This relatively limited curriculum is only possible with the discretization; without it, we needed to include many rounds of increasing g-force and jerk penalties in the curriculum to achieve smoothness. The suboptimal agent shown in Figure 2 is an example of training under this longer curriculum.
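
To make the jerk term concrete, here is a sketch of one way to compute it from logged accelerations using finite differences. The coefficient values and the exact form of the penalty in deepdrive-zero are not specified here and are assumptions:

```python
import numpy as np

def jerk_penalty(accels: np.ndarray, dt: float, coeff: float = 1.0) -> float:
    """Penalize jerk, the time derivative of acceleration.

    accels: array of shape (T, 2) with longitudinal and lateral acceleration
            samples (m/s²) at each physics step.
    """
    jerk = np.diff(accels, axis=0) / dt                      # m/s³ per step
    return -coeff * float(np.mean(np.linalg.norm(jerk, axis=1)))

# Example: doubling the penalty coefficient in the final curriculum stage.
accels = np.cumsum(np.random.randn(100, 2) * 0.05, axis=0)   # toy trajectory
print(jerk_penalty(accels, dt=1 / 60, coeff=1.0))
print(jerk_penalty(accels, dt=1 / 60, coeff=2.0))
```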

Future work

There are certain situations where we’d want to perform more energetic actions than this action space allows, and we would therefore need to add emergency-type actions to the discrete space.

Also, strategy collapse occurs: the two agents, trained with self-play, end up converging on a limited set of behaviors because each knows what the other will do. And while the natural curriculum of self-play does expose agents to a wide variety of behaviors, we’d still want agent populations to include older versions of the network, and in general we’d want the training distribution to exceed the diversity of the behavior distribution that would be encountered in the real world.

Figure 6: Self-play setup for our agents. The same network is used, but with separate inputs for each agent.

A final thought for future work is that we could measure oscillations using FFTs to penalize them by their frequencies. Perhaps this would allow us to remove the jerk curriculum entirely.
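
A sketch of what a frequency-weighted penalty might look like, using NumPy's FFT; this is future work and purely illustrative, not something we trained with:

```python
import numpy as np

def oscillation_penalty(steering_history: np.ndarray, dt: float,
                        coeff: float = 1.0) -> float:
    """Penalize steering oscillations in proportion to their frequency.

    Takes the FFT of the steering-angle history and weights each component's
    magnitude by its frequency, so fast back-and-forth swerving costs more
    than slow, deliberate turns.
    """
    spectrum = np.fft.rfft(steering_history - steering_history.mean())
    freqs = np.fft.rfftfreq(len(steering_history), d=dt)  # Hz
    weighted = np.abs(spectrum) * freqs
    return -coeff * float(weighted.sum() / len(steering_history))

# Toy comparison: slow drift vs. rapid swerving of the same amplitude.
t = np.arange(0, 5, 1 / 60)
print(oscillation_penalty(0.1 * np.sin(2 * np.pi * 0.2 * t), dt=1 / 60))
print(oscillation_penalty(0.1 * np.sin(2 * np.pi * 3.0 * t), dt=1 / 60))
```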

All NN Driving

Self-driving software stacks typically consist of several sequential modules.

For simplicity, consider them to be:

  1. Perception — Creates a bird’s-eye-view representation of relevant objects around us using cameras, lidar, sonar, etc.
  2. Prediction — Predict where everything will be in the next 5 to 10 seconds, i.e. other cars, pedestrians, and cyclists
  3. Behavior planning — Decide whether we want to stop, turn, change lanes, etc…
  4. Path planning, motion planning, and control — Enact a plan that will avoid predicted obstacles and execute the desired behavior

Of these steps, neural nets are currently only responsible for perception and prediction. A post-perception deep reinforcement learning agent like the one shown could replace prediction, behavior planning, path and motion planning, and control, i.e. everything besides perception.

A self-driving system composed entirely of neural nets could be jointly trained and fine-tuned with supervised, self-supervised, and unsupervised techniques, enjoying the full benefits of software 2.0.

RL + IL

I want to be clear that I don’t think Deep RL should be solely responsible for post-perception, but rather that it’s an excellent choice for overcoming the difficulties of imitation learning, allowing the agent to learn from the results of its actions and form a more robust model of intuitive physics than could be achieved with imitation learning alone. Similar to humans, I think it will be vital for sequential decision-making systems to both learn on their own and learn from others, or in other words to combine imitation learning with reinforcement learning. In this way, we could make use of the vast amount of driving data being collected by self-driving prototypes and advanced driver-assistance systems today to ensure that models behave in ways similar to humans (using imitation learning), while at the same time forming a robust model of action-reaction dynamics using the random exploration inherent to reinforcement learning. Deep learning allows these training techniques to be combined in multiple ways, including joint training on multiple tasks, generative adversarial imitation learning to combine the imitation and RL objectives, and others.
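
One of the simplest forms this combination can take is adding a behavior-cloning term, computed on logged human driving, to the RL policy loss. The sketch below is a generic illustration of that idea, not a method from this project; `policy` is assumed to be a discrete policy network and the weighting is a made-up hyperparameter:

```python
import torch
import torch.nn.functional as F

def rl_plus_imitation_loss(policy, rl_loss, expert_obs, expert_actions,
                           imitation_weight=0.1):
    """Combine an RL policy loss with a behavior-cloning (imitation) term.

    rl_loss:        scalar policy loss from the RL algorithm (e.g. PPO)
    expert_obs:     batch of observations from logged human driving
    expert_actions: the discrete actions the human took in those states
    """
    bc_loss = F.cross_entropy(policy(expert_obs), expert_actions)
    return rl_loss + imitation_weight * bc_loss
```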

Targeted predictions

Current self-driving prediction systems attempt to predict the most likely future positions and time derivatives of all actors in the scene. This is much more computationally expensive than what learning control from perception does, which is to form predictions targeted specifically at ensuring the agent meets its objective. Eliminating the need to predict and attend to everything will vastly simplify self-driving stacks and the computation required to drive. Prediction is perhaps the hardest single step in self-driving, and learning it as its own module over-complicates the problem. We need to instead learn targeted predictions.

Tractable Driving Objectives

RL’s objective function need not be differentiable, linear, quadratic, or run on the car. This is in contrast to control methods like LQR and DDP, the common choices in current self-driving, which require locally linear or quadratic dynamics and locally quadratic cost. The lack of such constraints gives more control over the driving objective and reduces the human engineering effort required to modify the objective/cost function as requirements are added.
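
For instance, a driving objective in this setting can freely mix smooth shaping terms with hard, discrete events. The sketch below is purely illustrative and the `state` fields are hypothetical; it is not deepdrive-zero's actual reward function:

```python
def driving_reward(state) -> float:
    """Illustrative reward mixing smooth and non-differentiable terms.

    None of this needs to be differentiable, linear, or quadratic, and it
    never has to run on the car itself, only in the training loop.
    """
    reward = 0.0
    reward += 0.1 * state.progress_along_route   # smooth shaping term
    reward -= 0.01 * state.gforce                # comfort
    reward -= 0.02 * abs(state.jerk)
    if state.collided:                           # hard, discrete penalties
        reward -= 100.0
    if state.ran_red_light:
        reward -= 10.0
    if state.reached_destination:
        reward += 10.0
    return reward
```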

Experimental Process

I think the process of arriving at these results may be just as useful as, or even more useful than, the results themselves, so I will go over some of the highlights here.

Minimal Environments

One key to quickly making progress was to create minimal environments that embodied important aspects of the self-driving problem which could be trained in a matter of minutes. This not only sped up simulation, but also allowed us to use smaller models and to learn the environment more quickly.

We started with a very simple environment, perhaps the simplest you could think of: matching a real-valued input. This can be thought of as simulating matching a steering angle, should you know the exact angle ahead of time.

Figure 7: Simple match-input (a.k.a. learning the identity function) environment used to evaluate different algorithm implementations
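
A minimal version of this environment fits in a few lines using the (pre-0.26) Gym API. This is a sketch of the idea rather than the exact environment in the linked repo:

```python
import gym
import numpy as np
from gym import spaces

class MatchInputEnv(gym.Env):
    """Reward the agent for outputting the value it observes, i.e. for learning
    the identity function (a stand-in for matching a known steering angle)."""

    def __init__(self):
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        self.target = np.zeros(1, dtype=np.float32)

    def reset(self):
        self.target = np.random.uniform(-1.0, 1.0, size=1).astype(np.float32)
        return self.target

    def step(self, action):
        # Higher reward the closer the action is to the observed value.
        reward = -float(abs(float(action[0]) - float(self.target[0])))
        obs = self.reset()  # draw a new value to match each step
        return obs, reward, False, {}
```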

From there, we experimented with the ability of the net to handle discontinuous step functions (as a way to emulate bounded steering), before finally moving on to controlling a vehicle model. Check out this repo for the gym environments we used in these simple experiments.

The next step was to implement the simplest driving physics that was still realistic enough to be useful. In the autonomous vehicle industry, we have a great model for this, the bicycle model. It is often used within motion control methods to approximate optimal motion plans. We found that you could step a single bicycle model at 100x realtime in Python. Then, using Numba, we could run that same model at 500x realtime. Iterating at this speed allowed us to derive the curriculum that biased the agents towards action and got them smoothly reaching single waypoints.
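
For reference, a kinematic bicycle step is only a few lines. The sketch below uses an assumed wheelbase and omits the extra detail in deepdrive-zero's model, and decorating it with Numba's @njit is what provides the additional speedup described above:

```python
import math
from numba import njit

@njit(cache=True)
def bicycle_step(x, y, yaw, speed, steer, accel, dt, wheelbase=2.7):
    """One step of the kinematic bicycle model.

    x, y: position (m); yaw: heading (rad); speed: m/s;
    steer: front wheel angle (rad); accel: longitudinal acceleration (m/s²).
    """
    x += speed * math.cos(yaw) * dt
    y += speed * math.sin(yaw) * dt
    yaw += speed * math.tan(steer) / wheelbase * dt
    speed += accel * dt
    return x, y, yaw, speed

# Roll the model forward for one simulated second at 60 Hz.
state = (0.0, 0.0, 0.0, 10.0)  # x, y, yaw, speed
for _ in range(60):
    state = bicycle_step(*state, 0.05, 0.0, 1 / 60)  # steer, accel, dt
```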

Figure 8: Simple single waypoint environment

For the next step in complexity, we added a static obstacle. This required increasing the size of the network from roughly 4k parameters to 16k, using jerk penalties in addition to g-force penalties, and adding more penalty-increase steps to the curriculum.

Figure 9: Static obstacle environment

These environments were useful not only for the initial phase of development: anytime we wanted to experiment with a large change, we could go back and quickly test the change in the smaller environments.

Two inspirations for creating these environments (vs. using our existing full 3D Unreal Engine-based sim, Deepdrive) were the stories behind PPO, told by Pieter Abbeel, and DQN, told by its creator Volodymyr Mnih, both of which involved testing agents in simple environments to achieve their landmark results. John Schulman also discusses this, and more about RL experimentation, in his great talk here.

Figure 10: A custom environment, called catch, used for the development of DQN before Atari
Figure 11: The environment used for the development of PPO, an algorithm that beats top level players in Dota 2, and also the algorithm used for our agents

Documenting experiments

Another time-saving practice for us was to document experiments by keeping all hyperparameters and config for the experiment in a single Python file. This allowed us to easily refactor and rerun old experiments in addition to documenting the experimental process within git. You can see an example of our experimental run files here.
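
A run file in this style might look like the sketch below. The hyperparameter names and values are illustrative placeholders rather than the actual configs linked above, and the training entry point that would consume them is left out:

```python
"""Experiment: unprotected left, phase 2 of the movement-penalty curriculum.

Keeping every hyperparameter in this one file means the whole experiment,
config included, is documented by a single git commit and can be rerun later.
"""
HYPERPARAMS = dict(
    env_name='deepdrive_zero:UnprotectedLeft-v0',  # illustrative env id
    resume_from='checkpoints/phase1',              # warm-start from the prior phase
    seed=0,
    total_steps=2_000_000,
    hidden_sizes=(256, 256),
    learning_rate=3e-4,
    gforce_penalty=0.01,
    jerk_penalty=0.01,
)

if __name__ == '__main__':
    # The real run files pass these into the project's PPO training loop;
    # here we just print them so the sketch stays self-contained.
    from pprint import pprint
    pprint(HYPERPARAMS)
```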

Controlled experiments

Reducing the number of variables in your experiments is often at odds with maintaining clean code, especially with the long lead times involved in deep learning. The best time to refactor is after you have something working, but usually experiments are not working. So you are faced with either making hacky changes to existing code until things work, or refactoring and adding more variables to your experiment. This again is where the simple environments help a lot: you can quickly test to see the effect of a refactor and bisect to look for changes that have caused regressions.

Idea bugs

How do you know if your hypothesis is wrong when implementing bug-free software is essentially impossible? Well, you don’t, but the more ideas you try, the more bugs you’ll find and the better the underlying infrastructure code will get. Sometimes your hypothesis will be wrong, but so will the null hypothesis! In other words, the opposite of what you thought would happen is true. Embrace the possibility that you may have found something counterintuitive. But also be aware that just because the results are not what you expect doesn’t mean your hypothesis is wrong, as you likely have bugs, and those bugs could be interfering with your ability to test the hypothesis.

More experimental notes can be found here along with lots of things that didn’t work.

Impact statement

Self-driving is one of the most potentially beneficial applications of AI in the short to medium term, i.e. ~0–10 years. However, there are possible harmful uses of this technology in the form of job loss, weaponization, and individual misuse. We believe the overarching long-term concern is that AI be created in a way that prioritizes safety as machines become much smarter than humans in every way, and that we explore the multitude of concerns broached by the field of AGI safety. Self-driving vehicles are supercomputers on wheels and will be distributed in the billions, making them the most powerful computer most people will have physical access to. And as sensorimotor agents, they will perform more human types of computation than other computers. This sets up self-driving cars to be a potential forerunner to superhuman AI. As such, self-driving as a field should be considered an area where AGI safety is a primary concern. Luckily, safety is already the primary concern for self-driving, making this field perhaps the only one where monetary and safety interests are aligned.

Let’s go!

If you’re building self-driving cars and are interested in using deep reinforcement learning for post-perception, please get in touch with me at craig@deepdrive.io or @crizcraig on Twitter.

All code is also available on GitHub.

Thanks

This work would not have been possible without Christina Araiza, Oliver Cameron, Drew Gray, and Maik Ernst. Tremendous thanks to them for all their input, guidance, and advice during the course of this project.

Also many thanks to Isaac Kargar, who has verified similar reductions in training time and increases in smoothness with R2D2, PPO LSTM, DQN, and SAC (in addition to PPO). Also thanks to Farhan Hubble and Shantanu Wadnerkar for reviewing the code, finding and fixing bugs, and providing improvements to the simulation physics.

Also, a big shout-out to the creator of highway-env, Ed Leurent, whose intersection throttle discretization was the inspiration for the two-dimensional steering and throttle discretization we used in our agents. Also huge thanks to Joshua Achiam and OpenAI for the Spinning Up project, upon which our agents were built.

Also, thanks to Johannes Günther at Amii for discussions around this work.
