The Policy of Truth

This is the sixth part of “An Outsider’s Tour of Reinforcement Learning.” Part 7 is here. Part 5 is here. Part 1 is here.

Our first generic candidate for solving reinforcement learning is Policy Gradient. I find it shocking that Policy Gradient wasn’t ruled out as a bad idea in 1993. Policy gradient is seductive because it apparently lets one fine-tune a program to solve any problem without any domain knowledge. Of course, anything that makes such a claim must be too general for its own good. Indeed, if you dive into it, policy gradient is nothing more than random search dressed up in mathematical symbols and lingo.

I apologize in advance that this is one of the more notationally heavy posts. Policy Gradient makes excessive use of notation to fool us into thinking there is something deep going on. My guess is that part of the reason Policy Gradient remained a research topic is that people didn’t implement it, and the mathematics looked so appealing on its own. This makes it easy to lose sight of what would happen if the method actually got coded up. See if you can find the places where leaps of faith occur.

Adding abstraction until the problem is solved

Let’s start with the super general problem that people solve with policy gradient. Recall that a trajectory is a sequence of states $x_k$ and control actions $u_k$ generated by a dynamical system,

\[\tau_t = (u_1,…,u_{t-1},x_0,…,x_t) \,,\]

and a policy is a function, $\pi$, that takes a trajectory and outputs a new control action. Our goal remains to find a policy that maximizes the total reward after $L$ time steps.

In policy gradient, we fix our attention on parametric, randomized policies. The policy $\pi$ has a list of parameters to tune, $\vartheta$. And rather than returning a specific control action, we assume that $\pi$ is a probability distribution over actions. In practice, an action is chosen at each time step by sampling from this distribution $\pi$. You might ask, why are we sampling? That’s a great question! But let’s not get bogged down by reasonable questions and press on.

Let’s write $\pi_\vartheta$ to make the dependence on the parameters $\vartheta$ explicit. Since $\pi_\vartheta$ is a probability distribution, using $\pi_\vartheta$ as a policy induces a probability distribution over trajectories:

\[p(\tau;\vartheta) = \prod_{t=0}^{L-1} p(x_{t+1} \vert x_{t},u_{t}) \pi_\vartheta(u_t\vert \tau_t)\,.\]

Moreover, we can overload notation and define the reward of a trajectory to be

\[R(\tau) = \sum_{t=0}^{L-1} R_t(x_t,u_t)\,.\]

Then our optimization problem for reinforcement learning tidily becomes:

\[\begin{array}{ll} \mbox{maximize}_{\vartheta} & \mathbb{E}_{p(\tau;\vartheta)}[ R(\tau)] \end{array}\]

We can make this even cleaner by defining

\[J(\vartheta) := \mathbb{E}_{p(\tau;\vartheta)}[ R(\tau) ]\,.\]

Our goal in reinforcement learning can now be even more compactly written as

\[\begin{array}{ll} \mbox{maximize}_{\vartheta} & J(\vartheta)\,. \end{array}\]

Policy Gradient

Having set up the problem in tidy notation, we can now derive Policy Gradient with the following clever trick:

\[\begin{align*} \nabla_{\vartheta} J(\vartheta) &= \int R(\tau) \nabla_{\vartheta} p(\tau;\vartheta) d\tau\\ &= \int R(\tau) \left(\frac{\nabla_{\vartheta} p(\tau;\vartheta)}{p(\tau;\vartheta)}\right) p(\tau;\vartheta) d\tau\\ &= \int \left( R(\tau) \nabla_{\vartheta} \log p(\tau;\vartheta) \right) p(\tau;\vartheta)d\tau \\ &= \mathbb{E}_{p(\tau;\vartheta)}\left[ R(\tau) \nabla_{\vartheta} \log p(\tau;\vartheta) \right]\,. \end{align*}\]

This calculation reveals that the gradient of $J$ with respect to $\vartheta$ is the expected value of the function

\[G(\tau,\vartheta) = R(\tau) \nabla_{\vartheta} \log p(\tau;\vartheta)\]

Hence, if we sample a trajectory $\tau$ by running the policy $\pi_\vartheta$, we can compute $G(\tau,\vartheta)$, and it will be an unbiased estimate of the gradient of $J$. If we follow this direction, we are running stochastic gradient ascent on $J$.

What is even more magical is that the function $G(\tau,\vartheta)$ can be computed without knowing the equations that govern the dynamical system. To see this, note that

\[p(x_{t+1}|x_{t},u_{t})\]

is not a function of the parameter $\vartheta$. Hence,

\[\nabla_\vartheta \log p(\tau;\vartheta) = \sum_{t=0}^{L-1} \nabla_\vartheta \log \pi_\vartheta(u_t|\tau_t)\,.\]

These derivatives can be computed provided that $\pi_\vartheta$ is differentiable and you have the latest version of autograd installed.

To sum up, we have a fairly miraculous method that lets us solve an optimal control problem without knowing anything about the dynamics of the system (a code sketch follows the steps below):

  1. Choose some initial guess $\vartheta_0$ and stepsize sequence $\{\alpha_k\}$. Set $k=0$.
  2. Sample $\tau_k$ by running the simulator with policy $\pi_{\vartheta_k}$.
  3. Set $\vartheta_{k+1} = \vartheta_k + \alpha_k R(\tau_k) \sum_{t=0}^{L-1} \nabla_\vartheta \log \pi_{\vartheta_k}(u_{tk}\vert \tau_{tk})$, where $u_{tk}$ and $\tau_{tk}$ are the actions and partial trajectories along the sampled $\tau_k$.
  4. Increment $k=k+1$ and go to step 2.
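
To make the leaps of faith concrete, here is a minimal sketch of what this loop might look like in code, for the common special case of a Gaussian policy whose mean is a linear function of the current state. Everything specific below (the dimensions, the stand-in linear dynamics, the quadratic reward, the stepsize schedule) is an illustrative assumption, not something the algorithm prescribes:

```python
import numpy as np

# Illustrative problem sizes and a stand-in simulator; none of this is
# dictated by the algorithm itself.
d_x, d_u, L = 3, 2, 20             # state dimension, action dimension, horizon
A = 0.8 * np.eye(d_x)              # stand-in linear dynamics (unknown to the algorithm)
B = 0.1 * np.random.randn(d_x, d_u)
sigma = 0.1                        # exploration noise of the Gaussian policy

def reward(x, u):
    # a stand-in quadratic (LQR-style) reward
    return -(x @ x + u @ u)

def rollout(K):
    """Run one trajectory with the policy u_t = K x_t + sigma * noise."""
    x = np.ones(d_x)
    total_reward = 0.0
    score = np.zeros_like(K)       # accumulates sum_t grad_K log pi_K(u_t | x_t)
    for _ in range(L):
        u = K @ x + sigma * np.random.randn(d_u)
        total_reward += reward(x, u)
        # gradient of log N(u; K x, sigma^2 I) with respect to K
        score += np.outer(u - K @ x, x) / sigma**2
        x = A @ x + B @ u          # the update below never touches A or B directly
    return total_reward, score

K = np.zeros((d_u, d_x))                 # step 1: initial guess
for k in range(5000):                    # steps 2-4: sample, update, repeat
    alpha = 1e-5 / np.sqrt(k + 1)        # a small, diminishing stepsize (tuning this is part of the pain)
    R_tau, score = rollout(K)
    K = K + alpha * R_tau * score        # step 3: the policy gradient update
```

Note that the update in the last line only uses sampled actions, states, and rewards; the matrices defining the dynamics appear in the simulator but never in the update.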

The main appeal of policy gradient is that it is this easy. If you can efficiently sample from $\pi_\vartheta$, you can run this algorithm on essentially any problem. You can fly quadcopters, you can cool data centers, you can teach robots to open doors. The question becomes, of course, can you do this well? I think that a simple appeal to the Linearization Principle will make it clear that Policy Gradient is likely never the algorithm that you’d want to use.

Why are we using probabilistic policies again?

Before talking about linear models, let’s step back and consider a pure optimization setup. We added a bunch of notation to reinforcement learning so that at the end, it seemed like we were only aiming to maximize an unconstrained function. Let’s remove all of the dynamics and consider the one step optimal control problem. Given a function $R(u)$, I want to find the $u$ that makes this as large as possible. That is, I’d like to solve the optimization problem

\[\begin{array}{ll} \mbox{maximize}_u & R(u) \,. \end{array}\]

Now, bear with me for a second as I take a digression that might seem tangential. Any optimization problem like this is equivalent to an optimization over probability distributions on $u$:

\[\begin{array}{ll} \mbox{maximize}_{p(u)} & \mathbb{E}_p[R(u)] \end{array}\]

The equivalence goes like this: if $u_\star$ is the optimal solution, then we get the same reward by putting a Delta function at $u_\star$. Moreover, if $p$ is a probability distribution, it’s clear that the expected reward can never be larger than the maximal reward achievable by a fixed $u$. So we can either optimize over $u$ or we can optimize over distributions over $u$.
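
In symbols, the whole argument is one line:

\[\mathbb{E}_{p}[R(u)] = \int R(u)\, p(u)\, du \leq \max_{u} R(u) = \mathbb{E}_{\delta_{u_\star}}[R(u)]\,,\]

so the best distribution does exactly as well as the best point.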

Now here is the first logical jump in Policy Gradient. Rather than optimizing over the space of all probability distributions, we optimize over a parametric family $p(u;\vartheta)$. If this family contains all of the Delta functions, then the optimal value will coincide with that of the non-random optimization problem. But if the family does not contain the Delta functions, we only get a lower bound on the optimal reward, no matter how good a probability distribution we find. In this case, if you sample $u$ from the optimized policy, the expected reward will necessarily be suboptimal.

A major problem with this paradigm of optimization over distributions is that we have to balance many requirements for our family of distributions. We need probability distributions that are

  1. rich enough to approximate delta functions
  2. easy to search by gradient methods
  3. easy to sample from

That’s a lot of demands to place upon distributions, especially when your control actions take continuous values. For continuous actions, more often than not people will choose a family of Gaussian distributions so that

\[u_t = f_\vartheta(\tau_t) + g_t\,.\]

Here, $f_\vartheta$ is some nonlinear function parameterized by $\vartheta$, and $g_t$ is a Gaussian random vector. No parameterization like this contains the Delta functions. And it is not clear how much we lose by making such a parameterization, because we’re not allowed to model anything in reinforcement learning.
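
For concreteness, suppose (an illustrative choice) that $g_t \sim \mathcal{N}(0,\sigma^2 I)$. Then the term that enters the policy gradient is

\[\log \pi_\vartheta(u_t \vert \tau_t) = -\frac{\|u_t - f_\vartheta(\tau_t)\|^2}{2\sigma^2} + \mathrm{const}\,, \qquad \nabla_\vartheta \log \pi_\vartheta(u_t\vert \tau_t) = \frac{1}{\sigma^2}\, Df_\vartheta(\tau_t)^\top \big(u_t - f_\vartheta(\tau_t)\big)\,,\]

where $Df_\vartheta(\tau_t)$ is the Jacobian of $f_\vartheta(\tau_t)$ with respect to $\vartheta$. The $1/\sigma^2$ factor is worth noting: the only way this family gets close to a Delta function is by shrinking $\sigma$, and doing so blows up the score, a tension that reappears in the variance calculation below.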

It’s important at this point to reemphasize that there is no need for a randomized policy in the basic optimal control problem we have been studying, and there is certainly no need for one in the simple LQR problem. The probabilistic policy is a modeling choice, and one that is never better than a deterministic policy.

The super general REINFORCE algorithm

So it turns out that this Policy Gradient algorithm is in fact a general purpose method for finding stochastic gradients of rewards of the form

\[J(\vartheta):=\mathbb{E}_{p(u;\vartheta)}[R(u)]\]

The log-likelihood trick works in full generality here:

\[\begin{align*} \nabla_{\vartheta} J(\vartheta) &= \int R(u) \nabla_{\vartheta} p(u;\vartheta) du\\ &= \int R(u) \left(\frac{\nabla_{\vartheta} p(u;\vartheta)}{p(u;\vartheta)}\right) p(u;\vartheta) du\\ &= \int \left( R(u) \nabla_{\vartheta} \log p(u;\vartheta) \right) p(u;\vartheta)du \\ &= \mathbb{E}_{p(u;\vartheta)}\left[ R(u) \nabla_{\vartheta} \log p(u;\vartheta) \right]\,. \end{align*}\]

And hence the following is a general purpose algorithm for maximizing rewards with respect to parametric distributions (a code sketch follows the steps):

  1. Choose some initial guess $\vartheta_0$ and stepsize sequence $\{\alpha_k\}$. Set $k=0$.
  2. Sample $u_k$ i.i.d. from $p(u;\vartheta_k)$.
  3. Set $\vartheta_{k+1} = \vartheta_k + \alpha_k R(u_k) \nabla_{\vartheta} \log p(u_k;\vartheta_k)$.
  4. Increment $k=k+1$ and go to step 2.
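
Here is a sketch of this loop for about the simplest family you might pick: $p(u;\vartheta)$ a Gaussian with mean $\vartheta$ and fixed covariance $\sigma^2 I$. The quadratic reward below is my own choice of test function (it is the example analyzed in the next section); the algorithm only ever sees its values.

```python
import numpy as np

d = 10                                # dimension of the decision variable
sigma = 0.5                           # fixed exploration noise
z = np.ones(d)                        # maximizer of the toy reward below

def R(u):
    # Black-box reward: REINFORCE only uses its values, never its gradient.
    return -np.sum((u - z)**2)

theta = np.zeros(d)                            # step 1: initial guess
for k in range(100000):                        # steps 2-4
    alpha = 1e-3 / np.sqrt(k + 1)
    u = theta + sigma * np.random.randn(d)     # sample u ~ N(theta, sigma^2 I)
    score = (u - theta) / sigma**2             # grad_theta log p(u; theta)
    theta = theta + alpha * R(u) * score       # step 3: the REINFORCE update
```

Whether this finds a good $\vartheta$ in a reasonable number of samples is exactly the question taken up next.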

The algorithm in this form is called REINFORCE. It seems weird: we get a stochastic gradient, but the function we care about optimizing, $R$, is only accessed through function evaluations. We never compute gradients of $R$ itself. So is this algorithm any good?

It depends on what you are looking for. If you’re looking for something to compete with gradients, no. It’s a terrible algorithm. If you’re looking for an algorithm to compete with a finite difference approximation to $R$ then… it’s still a terrible algorithm. But the math is cute.

The thing is, the Linearization Principle suggests this algorithm should be discarded almost immediately. Let’s consider the most trivial example of LQR: $R(u) = -\|u-z\|^2$. Let $p(u;\vartheta)$ be a multivariate Gaussian with mean $\vartheta$ and covariance $\sigma^2 I$. What does policy gradient do? First, note that

\[\mathbb{E}_{p(u;\vartheta)} [R(u)]= -\|\vartheta-z\|^2 - \sigma^2 d\]
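
This is a two-line computation: write $u = \vartheta + \omega$ with $\omega \sim \mathcal{N}(0,\sigma^2 I)$, expand the square, and use the fact that the cross term vanishes in expectation while $\mathbb{E}\big[\|\omega\|^2\big] = \sigma^2 d$:

\[\mathbb{E}\left[-\|\vartheta+\omega-z\|^2\right] = -\|\vartheta-z\|^2 - 2\,\mathbb{E}[\omega]^\top(\vartheta-z) - \mathbb{E}\left[\|\omega\|^2\right] = -\|\vartheta-z\|^2 - \sigma^2 d\,.\]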

Obviously, the best thing to do would be to set $\vartheta=z$. Note that the expected reward is still off by $\sigma^2 d$ at this point, but at least this would find a good guess for $u$. Also, as a function of $\vartheta$, $J$ is strongly concave, and the most important thing to know is the expected norm of the stochastic gradient, as this controls the number of iterations. Now, if you start at $\vartheta=0$, then the stochastic gradient is

\[g=-\frac{\|\omega-z\|^2}{\sigma^2}\,\omega\,,\]

where $\omega$ is a normally distributed random vector with mean zero and covariance $\sigma^2 I$. The expected norm of this stochastic gradient is… gross. You need to compute 6th order moments, and that’s never fun. But if you grind through the details, you’ll see the expected norm is on the order of

\[O\left(\sigma d^{1.5} + \sigma^{-1} d^{0.5} \|z\|^2\right)\,.\]

That’s quite large! The scaling with dimension is rather troubling.
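
You don’t have to take the sixth-moment bookkeeping on faith; a few lines of Monte Carlo exhibit the same scaling. The choices of $\sigma$, $z$, and the sample size below are mine:

```python
import numpy as np

# Monte Carlo estimate of E||g|| for the stochastic gradient at theta = 0,
# with R(u) = -||u - z||^2 and p(u; theta) = N(theta, sigma^2 I).
sigma = 1.0
for d in [10, 100, 1000]:
    z = np.ones(d)
    omega = sigma * np.random.randn(5000, d)     # samples of omega ~ N(0, sigma^2 I)
    g = -(np.sum((omega - z)**2, axis=1, keepdims=True) / sigma**2) * omega
    print(d, np.linalg.norm(g, axis=1).mean())
```

With $\sigma = 1$ and $z$ the all-ones vector, both terms in the bound are of order $d^{1.5}$, and the printed estimates should grow accordingly as the dimension increases.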

Many people have analyzed the complexity of this method, and it is indeed not great: it depends strongly on the dimension of the search space and on the largest magnitude reward $B$. If the function values are noisy, then even for convex functions the convergence rate is $O\left((d^2B^2/T)^{1/3}\right)$, and this assumes you get the algorithm parameters exactly right. For strongly convex functions, you can possibly eke out a rate of $O\left((d^2B^2/T)^{1/2}\right)$, but this result is also rather fragile to the choice of parameters. Finally, note that just adding a constant offset to the reward dramatically slows down the algorithm. If you start with a reward function whose values are in $[0,1]$ and you subtract one million from each reward, this will increase the running time of the algorithm by a factor of a million, even though the ordering of the rewards amongst parameter values remains the same.

Note that matters only get worse as we bring in dynamics. The policy gradient update for LQR is very noisy, and its variance grows with the simulation length $L$. Moreover, the search for $\vartheta$ is necessarily nonconvex if one is searching for a simple static policy. While this could work in practice, we already face so many hurdles that it suggests we should look for an alternative.

How can people be claiming such success in RL?

Lots of papers have been applying policy gradient to all sorts of different settings and claiming crazy results, but I hope it is now clear that they are just dressing up random search in a clever outfit. When you end up with a bunch of papers showing that genetic algorithms are competitive with your method, this does not mean that genetic algorithms have suddenly advanced. It is far more likely that your method is a lousy implementation of random search.

Regardless, both genetic algorithms and policy gradient require an absurd number of samples. This is OK if you are willing to spend millions of dollars on AWS and never actually want to tune a physical system. But there must be a better way.

I don’t think I can overemphasize the point that policy gradient and RL are not magic. I’d go as far as to say that policy gradient and its derivatives are legitimately bad algorithms. In order to make them work well, you need lots of tricks. Algorithms which are hard to tune, hard to reproduce, and don’t outperform off-the-shelf genetic algorithms are bad algorithms.

We’ll come back to this many times in this series: for any application where policy gradient is successful, a dramatically simpler and more robust algorithm exists that will match or outperform it. It’s never a good idea, and I cannot for the life of me figure out why it is so popular.

In the next post, I’ll turn back to LQR and look at some other strategies that might be more successful than policy gradient.
