Solving Stochastic Optimal Control via Path Integrals
Utilizing the time-reversed Feynman–Kac equation
Preface
It is well known that an Optimal Control Problem (OCP) may be decomposed into smaller solvable sub-problems via the Principle of Optimality [1]. In a continuous-time setup, the “solved” sub-problem leads to a PDE called the Hamilton–Jacobi–Bellman (HJB) equation, which shapes the optimal cost-to-go function (readers may refer to this previous post). In this article we extend that idea to a class of stochastic systems and derive a solution algorithm inspired by the celebrated Path Integral interpretation [2]. The derivation in this post is heavily based on reference [3], which is well worth reading.
Stochastic dynamic systems
The stochastic systems discussed here are restricted to a form that can be viewed as deterministic dynamics perturbed by an exogenous noise. The noise is modeled as a Wiener process with zero drift. A few extra restrictions are also required, but we will keep them out of the derivation until they become relevant.
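Concretely, a sketch of the assumed dynamics in the spirit of [3] (here B denotes the input/noise matrix and ν the noise covariance; the original figures may use different symbols):

dx = f(x, t)\,dt + B(x, t)\,\big(u\,dt + d\xi\big), \qquad \big\langle d\xi\, d\xi^{\top} \big\rangle = \nu\, dt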
Stochastic Optimal Control Problem (SOCP)
Let’s start from some fixed point “xs” and run through a predetermined period “[ts, tf]” with a control signal “u(t)”. As in the deterministic case, we can define a total cost along the resulting trajectories. To obtain a tractable quantity, we must also take an expectation over trajectories sharing the same initial condition. In total, a SOCP can be formulated as the following problem:
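As a rough sketch (with φ a terminal cost and L a generic running cost; the quadratic-control restriction only enters later):

J(x_s, t_s) = \min_{u(\cdot)} \, \mathbb{E}\!\left[ \phi\big(x(t_f)\big) + \int_{t_s}^{t_f} L\big(x(t), u(t), t\big)\, dt \;\middle|\; x(t_s) = x_s \right]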
The hidden necessity of closed-loop execution in SOCP
At first glance, the preceding setup seems to be only a slight modification of the deterministic OCP, but there is a critical difference. Recall the ODE approaches to optimal control (essentially variational methods) such as the Euler–Lagrange equations or Pontryagin’s Maximum Principle [4, 5]: there, the control signal is interpreted and executed in an open-loop style except for a few special cases. That idea does not extend well to a stochastic system because of the unpredictable noise; in fact, in the presence of noise, the OCP with open-loop control is fundamentally different from the one with closed-loop control.
Optimal open-loop control solves for a single signal that is optimal in expectation; every “trial” uses an identical control signal.
Optimal closed-loop control solves for a strategy (a feedback policy) that is optimal in expectation; each “trial” may realize a different control signal.
We will solve for the closed-loop strategy in the following arguments; it is the natural outcome of a Dynamic Programming (Principle of Optimality) approach, even in the presence of stochastic noise.
Stochastic Hamilton–Jacobi–Bellman equation (S-HJB)
The derivation of the stochastic HJB is nearly identical to the deterministic one, apart from a more delicate interchange of the “expectation” and “min” operators. The equality between the 4th and the 5th equations follows from the optimal closed-loop interpretation stated above: the closed-loop control sequence can differ for each “trial” inside the expectation, each performing its best under a different stochastic outcome.
Assuming a smooth optimal cost-to-go function “J”, we can further expand the last equation under the rules of Itô calculus. Notice that the noise is modeled as a Wiener process.
Finally, we are rewarded with the stochastic version of the HJB equation. At this point, most numerical solvers could compute the entire optimal cost-to-go function from a given boundary condition and extract the associated closed-loop control at each point in state-time. Of course, this is usually infeasible with limited computational resources.
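For reference, under the dynamics above the S-HJB takes roughly the following form (a sketch in the notation introduced earlier; see [3] for the exact statement):

-\partial_t J = \min_{u} \left[ L(x, u, t) + \big(f + B u\big)^{\top} \partial_x J + \tfrac{1}{2}\operatorname{Tr}\!\big( B \nu B^{\top} \partial_x^{2} J \big) \right]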
Control input as a biased drift
A far more elegant approach reveals itself if the instantaneous cost and the noise are restricted as in the following equations.
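Roughly, the restrictions in [3] amount to a running cost that is quadratic in the control plus an arbitrary state cost, with the noise covariance tied to the control cost by a single scalar λ (a sketch in my notation):

L(x, u, t) = q(x, t) + \tfrac{1}{2}\, u^{\top} R\, u, \qquad \nu = \lambda\, R^{-1}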
The minimization inside the closed-loop S-HJB can now be carried out analytically, as shown.
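Carrying out the minimization over u explicitly, the optimal control is linear in the gradient of the cost-to-go (a sketch consistent with [3]):

u^{*}(x, t) = -R^{-1} B^{\top}\, \partial_x J(x, t)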
The key step is the exponential transform of the negative optimal cost-to-go function proposed in [3].
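In symbols, with λ the same scalar as in the restriction above:

\Psi(x, t) = \exp\!\big(-J(x, t)/\lambda\big) \quad\Longleftrightarrow\quad J(x, t) = -\lambda \log \Psi(x, t)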
Readers may interpret the transformed function over state-time as a kind of distribution function (the exponential of a negative quadratic term is a Gaussian distribution function). Replacing all terms in the S-HJB with the transformed function, the following linear equation is obtained, which surprisingly resembles the Feynman–Kac formula.
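As a sketch, the quadratic terms cancel under the restriction ν = λR⁻¹ and the transformed (backward) equation becomes linear in Ψ:

-\partial_t \Psi = -\frac{q}{\lambda}\, \Psi + f^{\top} \partial_x \Psi + \tfrac{1}{2}\operatorname{Tr}\!\big( B \nu B^{\top} \partial_x^{2} \Psi \big)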
As one may guess, the equation we are left with can be solved more efficiently than the original one. However, we still need to propagate the transformed cost-to-go function backward in time and are still required to solve an expensive PDE (obtaining a whole strategy over all states, which might not even be necessary).
The forward-evolving conjugate function
The linearity of the last equation prompts us to define a co-state “ρ” that evolves forward in time via a special propagator (the Hermitian conjugate) and preserves the inner product as a time-independent constant.
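In other words, ρ is defined so that its pairing with Ψ is conserved in time (a sketch):

\frac{d}{dt} \int \rho(x, t)\, \Psi(x, t)\, dx = 0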
The Hermitian conjugate can be found as follows.
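Working out the adjoint of the backward operator above gives, roughly,

\partial_t \rho = -\frac{q}{\lambda}\, \rho - \nabla \!\cdot\! \big( f \rho \big) + \tfrac{1}{2} \sum_{ij} \partial_{x_i} \partial_{x_j} \Big[ \big( B \nu B^{\top} \big)_{ij}\, \rho \Big]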
This is the Fokker–Planck equation for the forward-evolved probability density “ρ”, with an extra term that decreases the total probability along the time axis wherever the state cost “q” is positive. One should also notice that this propagator is completely free of the original control input “u”!
Path Integral approximation
Finally, we arrive at the key result: by solving or simulating the forward-evolving “probability density” governed by the dissipative Fokker–Planck equation above, the optimal cost-to-go can be obtained.
Starting from a concentrated distribution at the initial state, the inner product between the evolved distribution and the transformed termination cost is exactly the exponentially transformed optimal cost-to-go from that very starting point.
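In symbols (a sketch): initializing ρ(x, t_s) = δ(x − x_s) and using the terminal condition Ψ(x, t_f) = exp(−φ(x)/λ), the conserved inner product gives

\Psi(x_s, t_s) = \int \rho(x, t_f)\, e^{-\phi(x)/\lambda}\, dx, \qquad J(x_s, t_s) = -\lambda \log \Psi(x_s, t_s)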
This process can be approximated by sampling paths, i.e., a Path Integral with a real diffusion coefficient. Here the singular directions of the inverse noise covariance correspond to an effectively infinite “action”, or equivalently an infinitesimally small contribution, in a real-valued Path Integral. This is less of a concern if one simply samples the noise along the control input directions “B*dε”.
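Equivalently, the Feynman–Kac representation of the transformed cost suggests a plain Monte Carlo estimate over uncontrolled noisy rollouts (a sketch):

\Psi(x_s, t_s) = \mathbb{E}\!\left[ \exp\!\left( -\frac{1}{\lambda}\Big( \phi\big(x(t_f)\big) + \int_{t_s}^{t_f} q\big(x(\tau), \tau\big)\, d\tau \Big) \right) \right], \qquad dx = f\, dt + B\, d\xi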
Both the backward (Feynman–Kac) and the forward (Fokker–Planck) equations can be approximated in this way. The latter is preferred here because we are usually only interested in the optimal cost around the initial state, not over all possible state-time combinations.
Further improvements
The original Monte Carlo sampling of the Path Integrals is based on the passive part of the dynamics. A simple modification is to draw samples under a biased drift whenever a better prior is known. This is effectively the importance sampling technique from statistics, and it introduces an additional action term into the Path Integrals.
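As a sketch of that additional action term (my notation; this is the standard change-of-measure correction and may differ in form from the expression in [3]): sampling instead under dx = (f + B u_b) dt + B dξ with a chosen bias u_b, each path is reweighted so that

\Psi(x_s, t_s) = \mathbb{E}_{u_b}\!\left[ \exp\!\left( -\frac{1}{\lambda}\Big( \phi + \int q\, dt + \int \tfrac{1}{2}\, u_b^{\top} R\, u_b\, dt + \int u_b^{\top} R\, d\xi \Big) \right) \right]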
Example 1: Solving a double-slit problem
We will walk through the process of finding the optimal cost-to-go function via Path Integrals and compare it with the analytical solution for a one-dimensional double-slit problem [3]. This first-order system evolves over time with a single input and no passive dynamics (dx = 0 + u dt + dv), while the control task is to pass through a slit placed at a fixed time instant and terminate at the origin, with a quadratic terminal cost.
Our task here is to find the optimal cost-to-go at “ts” for every possible initial position “x”. We first approach this problem with the exact solution, solving the linearized S-HJB backward in time.
In the meantime, we can also show how the time-reversed Feynman–Kac equation can be solved through a diffusion-like process on the exponentially transformed cost “Ψ”.
Now we can try out the forward sampling that approximates the Path Integral, with and without importance sampling. The actual forward-sampled trajectories are shown below; the biased input is simply a straight line passing through one of the slits and heading towards the origin after t = 1.
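For readers who want to reproduce the flavor of this experiment, here is a minimal sketch of the plain (unbiased) forward sampler in Python; all numerical values (slit positions, horizon, noise level, terminal weight) are illustrative assumptions rather than the exact settings of [3].

```python
import numpy as np

# Monte Carlo (path-integral) estimate of the optimal cost-to-go for a 1-D
# double-slit problem.  All numbers below (slit positions, horizon, noise
# level, terminal weight) are illustrative assumptions, not the values in [3].

nu  = 1.0                 # noise variance: dxi ~ N(0, nu*dt)
lam = 1.0                 # temperature lambda (= nu * R for this scalar system)
dt  = 0.02
t_s, t_barrier, t_f = 0.0, 1.0, 2.0
slits   = [(-6.0, -4.0), (4.0, 6.0)]    # open intervals at t = t_barrier
n_paths = 20_000

def terminal_cost(x):
    # assumed quadratic terminal cost phi(x) = x^2 / 2
    return 0.5 * x ** 2

def cost_to_go(x0, seed=0):
    """Estimate J(x0, t_s) by rolling out uncontrolled diffusion paths."""
    rng = np.random.default_rng(seed)
    n_steps = int(round((t_f - t_s) / dt))
    barrier_step = int(round((t_barrier - t_s) / dt))
    x = np.full(n_paths, float(x0))
    alive = np.ones(n_paths, dtype=bool)
    for k in range(1, n_steps + 1):
        x += rng.normal(0.0, np.sqrt(nu * dt), n_paths)   # dx = dxi  (u = 0)
        if k == barrier_step:                              # barrier at t = 1
            through = np.zeros(n_paths, dtype=bool)
            for lo, hi in slits:
                through |= (x > lo) & (x < hi)
            alive &= through     # blocked paths get zero weight (infinite q)
    weights = np.where(alive, np.exp(-terminal_cost(x) / lam), 0.0)
    psi = weights.mean()         # Monte Carlo estimate of Psi(x0, t_s)
    return np.inf if psi == 0 else -lam * np.log(psi)

if __name__ == "__main__":
    for x0 in (-8.0, -5.0, 0.0, 5.0, 8.0):
        print(f"x0 = {x0:+5.1f}  ->  J(x0, ts) ~ {cost_to_go(x0):.3f}")
```

The barrier is implemented by assigning zero weight to any path that misses both slits at t = 1, which is the sampling analogue of an infinite state cost q at the barrier.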
The resulting approximated cost-to-go can be collectively presented in the figures below, each compared against the analytic (HJB) solution.
Example 2: Risk awareness in OCP
In this example, a similar first-order problem is given, with a longer duration of the “barrier”. Furthermore, one of the slits is narrow while the other is relatively wide. Readers may notice that in the high-variance case the sampled trajectories biased toward the narrower “slit” almost never reach the termination. This indicates a variance-dependent HJB solution (optimal cost-to-go), which is also depicted in the last figure.
The solution of the S-HJB suggests the riskier path in a low-variance environment, while alerting us to switch to the safer path once the variance rises beyond a certain degree.
References
[1] R. Bellman, Dynamic Programming. Princeton University Press, 1957.
[2] J. Townsend, A Modern Approach to Quantum Mechanics. University Science Books, 2012.
[3] H. J. Kappen, “Path integrals and symmetry breaking for optimal control theory”, Journal of Statistical Mechanics: Theory and Experiment, vol. 2005, no. 11, p. P11011, 2005.
[4] A. T. Fuller, “Bibliography of Pontryagin’s Maximum Principle”, Journal of Electronics and Control, vol. 15, no. 5, pp. 513–517, 1963.
[5] L. N. Hand and J. D. Finch, Analytical Mechanics. Cambridge University Press, 1998.