Solving Stochastic Optimal Control via Path Integrals
Utilizing the time-reversed Feynman–Kac equation
Preface
It is well known that an Optimal Control Problem (OCP) may be decomposed into smaller solvable sub-problems via the Principle of Optimality [1]. In a continuous-time setup, the “solved” sub-problem leads to a PDE called the Hamilton–Jacobi–Bellman (HJB) equation, which shapes the optimal cost-to-go function (readers may refer to this previous post). In this article we extend that idea to a class of stochastic systems and derive a solution algorithm inspired by the celebrated Path Integral interpretation [2]. The derivation in this post is heavily based on reference [3], which is well worth reading.
Stochastic dynamic systems
The stochastic systems discussed here are restricted to a form that can be viewed as deterministic dynamics perturbed by an exogenous noise. The noise is modeled as a Wiener process with zero drift. A few extra restrictions are also required, but we will keep them out of the derivation until they become relevant.
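Concretely, a sketch of the assumed dynamics in the spirit of [3] (here B denotes the input/noise matrix and ν the noise covariance; the original figures may use different symbols):

dx = f(x, t)\,dt + B(x, t)\,\big(u\,dt + d\xi\big), \qquad \big\langle d\xi\, d\xi^{\top} \big\rangle = \nu\, dt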
Stochastic Optimal Control Problem (SOCP)
Let’s start from some fixed point “xs” and run through a predetermined period “[ts, tf]” with a control signal “u(t)”. As in the deterministic case, we can define a total cost along the resulting trajectories. To obtain a tractable quantity, we must also take an expectation over trajectories sharing the same initial condition. In total, a SOCP can be formulated as the following problem:
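As a rough sketch (with φ a terminal cost and L a generic running cost; the quadratic-control restriction only enters later):

J(x_s, t_s) = \min_{u(\cdot)} \, \mathbb{E}\!\left[ \phi\big(x(t_f)\big) + \int_{t_s}^{t_f} L\big(x(t), u(t), t\big)\, dt \;\middle|\; x(t_s) = x_s \right]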
The hidden necessity of closed-loop execution in SOCP
At first glance, the preceding setup seems to be only a slight modification of the deterministic OCP, but there is a critical difference. Recall the ODE approaches to optimal control (essentially variational methods) such as the Euler–Lagrange equations or Pontryagin’s Maximum Principle [4, 5]: there, the control signal is interpreted and executed in an open-loop style except for a few special cases. That idea does not extend well to a stochastic system because of the unpredictable noise; in fact, in the presence of noise, the OCP with open-loop control is fundamentally different from the one with closed-loop control.
Optimal open-loop control solves for a single signal that is optimal in expectation; every “trial” uses an identical control signal.
Optimal closed-loop control solves for a strategy (a feedback policy) that is optimal in expectation; each “trial” may realize a different control signal.
We will solve for the closed-loop strategy in the following arguments; it is the natural outcome of a Dynamic Programming (Principle of Optimality) approach, even in the presence of stochastic noise.
Stochastic Hamilton–Jacobi–Bellman equation (S-HJB)
The derivation of the stochastic HJB is nearly identical to the deterministic one, apart from a more delicate interchange of the “expectation” and “min” operators. The equality between the 4th and the 5th equations follows from the optimal closed-loop interpretation stated above: the closed-loop control sequence can differ for each “trial” inside the expectation, each performing its best under a different stochastic outcome.
Assuming a smooth optimal cost-to-go function “J”, we can further expand the last equation under the rules of Itô calculus. Notice that the noise is modeled as a Wiener process.
Finally, we are rewarded with the stochastic version of the HJB equation. At this point, most numerical solvers could compute the entire optimal cost-to-go function from a given boundary condition and extract the associated closed-loop control at each point in state-time. Of course, this is usually infeasible with limited computational resources.
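For reference, under the dynamics above the S-HJB takes roughly the following form (a sketch in the notation introduced earlier; see [3] for the exact statement):

-\partial_t J = \min_{u} \left[ L(x, u, t) + \big(f + B u\big)^{\top} \partial_x J + \tfrac{1}{2}\operatorname{Tr}\!\big( B \nu B^{\top} \partial_x^{2} J \big) \right]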
Control input as a biased drift
A far more elegant approach reveals itself if the instantaneous cost and the noise are restricted as in the following equations.
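Roughly, the restrictions in [3] amount to a running cost that is quadratic in the control plus an arbitrary state cost, with the noise covariance tied to the control cost by a single scalar λ (a sketch in my notation):

L(x, u, t) = q(x, t) + \tfrac{1}{2}\, u^{\top} R\, u, \qquad \nu = \lambda\, R^{-1}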
The minimization inside the closed-loop S-HJB can now be carried out analytically, as shown.
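Carrying out the minimization over u explicitly, the optimal control is linear in the gradient of the cost-to-go (a sketch consistent with [3]):

u^{*}(x, t) = -R^{-1} B^{\top}\, \partial_x J(x, t)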
The key step is the exponential transform of the negative optimal cost-to-go function proposed in [3].
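In symbols, with λ the same scalar as in the restriction above:

\Psi(x, t) = \exp\!\big(-J(x, t)/\lambda\big) \quad\Longleftrightarrow\quad J(x, t) = -\lambda \log \Psi(x, t)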
Readers may interpret the transformed function over state-time as a kind of distribution function (the exponential of a negative quadratic term is a Gaussian distribution function). Replacing all terms in the S-HJB with the transformed function, the following linear equation is obtained, which surprisingly resembles the Feynman–Kac formula.
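As a sketch, the quadratic terms cancel under the restriction ν = λR⁻¹ and the transformed (backward) equation becomes linear in Ψ:

-\partial_t \Psi = -\frac{q}{\lambda}\, \Psi + f^{\top} \partial_x \Psi + \tfrac{1}{2}\operatorname{Tr}\!\big( B \nu B^{\top} \partial_x^{2} \Psi \big)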
As one may guess, the equation we are left with can be solved more efficiently than the original one. However, we still need to propagate the transformed cost-to-go function backward in time and are still required to solve an expensive PDE (obtaining a whole strategy over all states, which might not even be necessary).
The forward-evolving conjugate function
The linearity of the last equation prompts us to define a co-state “ρ” that evolves forward in time via a special propagator (the Hermitian conjugate) and preserves the inner product as a time-independent constant.
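In other words, ρ is defined so that its pairing with Ψ is conserved in time (a sketch):

\frac{d}{dt} \int \rho(x, t)\, \Psi(x, t)\, dx = 0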
The Hermitian conjugate can be found as follows.
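Working out the adjoint of the backward operator above gives, roughly,

\partial_t \rho = -\frac{q}{\lambda}\, \rho - \nabla \!\cdot\! \big( f \rho \big) + \tfrac{1}{2} \sum_{ij} \partial_{x_i} \partial_{x_j} \Big[ \big( B \nu B^{\top} \big)_{ij}\, \rho \Big]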
This is the Fokker–Planck equation for the forward-evolved probability density “ρ”, with an extra term that decreases the total probability along the time axis wherever the state cost “q” is positive. One should also notice that this propagator is completely free of the original control input “u”!
Path Integral approximation
Finally, we arrive at the key result: by solving or simulating the forward-evolving “probability density” governed by the dissipative Fokker–Planck equation above, the optimal cost-to-go can be obtained.
Starting from a concentrated distribution at the initial state, the inner product between the evolved distribution and the transformed termination cost is exactly the exponentially transformed optimal cost-to-go from that very starting point.
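In symbols (a sketch): initializing ρ(x, t_s) = δ(x − x_s) and using the terminal condition Ψ(x, t_f) = exp(−φ(x)/λ), the conserved inner product gives

\Psi(x_s, t_s) = \int \rho(x, t_f)\, e^{-\phi(x)/\lambda}\, dx, \qquad J(x_s, t_s) = -\lambda \log \Psi(x_s, t_s)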
This process can be approximated by sampling paths, i.e., a Path Integral with a real diffusion coefficient. Here the singular directions of the inverse noise covariance correspond to an effectively infinite “action”, or equivalently an infinitesimally small contribution, in a real-valued Path Integral. This is less of a concern if one simply samples the noise along the control input directions “B*dε”.
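Equivalently, the Feynman–Kac representation of the transformed cost suggests a plain Monte Carlo estimate over uncontrolled noisy rollouts (a sketch):

\Psi(x_s, t_s) = \mathbb{E}\!\left[ \exp\!\left( -\frac{1}{\lambda}\Big( \phi\big(x(t_f)\big) + \int_{t_s}^{t_f} q\big(x(\tau), \tau\big)\, d\tau \Big) \right) \right], \qquad dx = f\, dt + B\, d\xi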
Both the backward (Feynman–Kac) and the forward (Fokker–Planck) equations can be approximated in this way. The latter is preferred here because we are usually only interested in the optimal cost around the initial state, not over all possible state-time combinations.
Further improvements
The original Monte Carlo sampling of the Path Integrals is based on the passive part of the dynamics. A simple modification is to draw samples under a biased drift whenever a better prior is known. This is effectively the importance sampling technique from statistics, and it introduces an additional action term into the Path Integrals.
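As a sketch of that additional action term (my notation; this is the standard change-of-measure correction and may differ in form from the expression in [3]): sampling instead under dx = (f + B u_b) dt + B dξ with a chosen bias u_b, each path is reweighted so that

\Psi(x_s, t_s) = \mathbb{E}_{u_b}\!\left[ \exp\!\left( -\frac{1}{\lambda}\Big( \phi + \int q\, dt + \int \tfrac{1}{2}\, u_b^{\top} R\, u_b\, dt + \int u_b^{\top} R\, d\xi \Big) \right) \right]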
Example 1: Solving a double-slit problem
We will walk through the process of finding the optimal cost-to-go function via Path Integrals and compare it with the analytical solution for a one-dimensional double-slit problem [3]. This first-order system evolves over time with a single input and no passive dynamics (dx = 0 + u dt + dv), while the control task is to pass through a slit placed at a fixed time instant and terminate at the origin, with a quadratic terminal cost.
Our task here is to find the optimal cost-to-go at “ts” for every possible initial position “x”. We first approach this problem with the exact solution, solving the linearized S-HJB backward in time.
In the meantime, we can also show how the time-reversed Feynman–Kac equation can be solved through a diffusion-like process on the exponentially transformed cost “Ψ”.
Now we can try out the forward sampling that approximates the Path Integral, with and without importance sampling. The actual forward-sampled trajectories are shown below; the biased input is simply a straight line passing through one of the slits and heading towards the origin after t = 1.
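For readers who want to reproduce the flavor of this experiment, here is a minimal sketch of the plain (unbiased) forward sampler in Python; all numerical values (slit positions, horizon, noise level, terminal weight) are illustrative assumptions rather than the exact settings of [3].

```python
import numpy as np

# Monte Carlo (path-integral) estimate of the optimal cost-to-go for a 1-D
# double-slit problem.  All numbers below (slit positions, horizon, noise
# level, terminal weight) are illustrative assumptions, not the values in [3].

nu  = 1.0                 # noise variance: dxi ~ N(0, nu*dt)
lam = 1.0                 # temperature lambda (= nu * R for this scalar system)
dt  = 0.02
t_s, t_barrier, t_f = 0.0, 1.0, 2.0
slits   = [(-6.0, -4.0), (4.0, 6.0)]    # open intervals at t = t_barrier
n_paths = 20_000

def terminal_cost(x):
    # assumed quadratic terminal cost phi(x) = x^2 / 2
    return 0.5 * x ** 2

def cost_to_go(x0, seed=0):
    """Estimate J(x0, t_s) by rolling out uncontrolled diffusion paths."""
    rng = np.random.default_rng(seed)
    n_steps = int(round((t_f - t_s) / dt))
    barrier_step = int(round((t_barrier - t_s) / dt))
    x = np.full(n_paths, float(x0))
    alive = np.ones(n_paths, dtype=bool)
    for k in range(1, n_steps + 1):
        x += rng.normal(0.0, np.sqrt(nu * dt), n_paths)   # dx = dxi  (u = 0)
        if k == barrier_step:                              # barrier at t = 1
            through = np.zeros(n_paths, dtype=bool)
            for lo, hi in slits:
                through |= (x > lo) & (x < hi)
            alive &= through     # blocked paths get zero weight (infinite q)
    weights = np.where(alive, np.exp(-terminal_cost(x) / lam), 0.0)
    psi = weights.mean()         # Monte Carlo estimate of Psi(x0, t_s)
    return np.inf if psi == 0 else -lam * np.log(psi)

if __name__ == "__main__":
    for x0 in (-8.0, -5.0, 0.0, 5.0, 8.0):
        print(f"x0 = {x0:+5.1f}  ->  J(x0, ts) ~ {cost_to_go(x0):.3f}")
```

The barrier is implemented by assigning zero weight to any path that misses both slits at t = 1, which is the sampling analogue of an infinite state cost q at the barrier.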
The resulting approximated cost-to-go can be collectively presented in the figures below, each compared against the analytic (HJB) solution.
Example 2: Risk awareness in OCP
In this example, a similar first-order problem is given, with a longer duration of the “barrier”. Furthermore, one of the slits is narrow while the other is relatively wide. Readers may notice that in the high-variance case the sampled trajectories biased toward the narrower “slit” almost never reach the termination. This indicates a variance-dependent HJB solution (optimal cost-to-go), which is also depicted in the last figure.
The solution of the S-HJB suggests the riskier path in a low-variance environment, while alerting us to switch to the safer path once the variance rises beyond a certain degree.
References
[1] R. Bellman, Dynamic Programming. Princeton University Press, 1957.
[2] J. Townsend, A Modern Approach to Quantum Mechanics. University Science Books, 2012.
[3] H. J. Kappen, “Path integrals and symmetry breaking for optimal control theory”, Journal of Statistical Mechanics: Theory and Experiment, vol. 2005, no. 11, p. P11011, 2005.
[4] A. T. Fuller, “Bibliography of Pontryagin’s Maximum Principle”, Journal of Electronics and Control, vol. 15, no. 5, pp. 513–517, 1963.
[5] L. N. Hand and J. D. Finch, Analytical Mechanics. Cambridge University Press, 1998.