How to be optimally wrong: understanding Kalman filters
Let’s say you are tracking a drone with a radar. You measure its position with the sensor, but the measurement is noisy and doesn’t capture the true position. You have another approach: model the motion of the drone. But your model is not perfect either, due to environmental factors such as wind. You now have two imperfect approaches, so how do you go about tracking the motion of the drone?
The Kalman filter solves this problem by blending Bayesian updating with the interplay between your physical model and your observations.
A high level overview of a Kalman filter
A Kalman filter is an algorithm that uses measurements observed over time to estimate the state of a physical system while accounting for model uncertainty and noisy measurements. It does this by performing Bayesian updating, weighting the predicted state from a physically motivated model against each incoming measurement. The weights are determined by the uncertainty associated with each one: if the uncertainty in your measurement is huge compared to that of your predicted state, the Kalman filter will update mainly using the predicted state, not the measurement.
Kalman filtering was invented by Rudolf E. Kalman in 1960 [1].
The Kalman filtering algorithm can be summarised by five core equations:
- Determine the predicted state at time k \(\begin{equation} \mathbf{x}_{k\vert k-1} = F \cdot \mathbf{x}_{k-1\vert k-1} + B \mathbf{u}_k \end{equation}\)
- Determine the predicted covariance \(\begin{equation} \Sigma_{k|k-1} = F \cdot \Sigma_{k-1|k-1} \cdot F^T + Q \end{equation}\)
- Determine the Kalman gain \(\begin{equation} K_k = \Sigma_{k|k-1}\cdot H^T (H\Sigma_{k|k-1}H^T + R)^{-1} \end{equation}\)
- Update the state estimate \(\begin{equation} \mathbf{x}_{k\vert k} = \mathbf{x}_{k\vert k-1} + K_k (\mathbf{z}_k - H \mathbf{x}_{k|k-1}) \end{equation}\)
- Update the state covariance: \(\begin{equation} \Sigma_{k\vert k} = (I-K_k\cdot H)\cdot \Sigma_{k\vert k-1} \end{equation}\)
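Assuming NumPy, the five equations can be sketched as a single predict/update function. The names, shapes, and argument conventions here are illustrative, not a reference implementation:

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R, B=None, u=None):
    """One predict/update cycle of a linear Kalman filter.

    x : prior state estimate, shape (n,)
    P : prior state covariance, shape (n, n)
    z : new measurement, shape (m,)
    """
    # Predict: propagate the state and its covariance through the model
    x_pred = F @ x
    if B is not None and u is not None:
        x_pred = x_pred + B @ u
    P_pred = F @ P @ F.T + Q

    # Kalman gain: weighs the prediction against the measurement by uncertainty
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)

    # Update: correct the prediction with the measurement residual
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

Each call consumes one measurement and returns the posterior mean and covariance, which feed straight back in as the prior for the next step.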
The physical model and predicted state
State space formalism
The Kalman filter is built on top of the state-space formalism. The idea is pretty simple: we take all the physical quantities in the system we care about (e.g. position and velocity) and collect them into a vector. As a result, we define the state vector \(\begin{equation} \mathbf{x} = \begin{pmatrix}x\\ v \end{pmatrix} \end{equation}\)
where \(x\) is the object’s position and \(v\) is its velocity. This vector lives in position-velocity space, a two-dimensional grid with position on one axis and velocity on the other. This is the state space, and it is quite useful: it is a small set of variables from which you can track the system, and predict its future, without having to store the system’s entire history.
The set of equations that we use to predict what happens in the future comprises our first two Kalman equations. The first equation is what you predict the system to be at the next time step \(\begin{equation} \mathbf{x}_{k\vert k-1} = F \cdot \mathbf{x}_{k-1\vert k-1} + B \mathbf{u}_k. \end{equation}\)
This is a statement of: given the state of the system at step \(k-1\), what is the system at step \(k\)? (This is why you see the subscript \(k \vert k-1\) in the equation.) The second Kalman equation is a statement about the uncertainty of the system, and how it propagates forward in your model.
These two equations are about writing down a physical model of your system and predicting what the system will be at the next step.
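To make this concrete, here is a minimal NumPy sketch of the prediction step for a one-dimensional constant-velocity model; the time step, noise values, and initial state are all assumptions for illustration:

```python
import numpy as np

dt = 1.0                           # time step between measurements (assumed)

# State transition for constant velocity: x -> x + v*dt, v -> v
F = np.array([[1.0, dt],
              [0.0, 1.0]])

x = np.array([0.0, 2.0])           # state: position 0, velocity 2
Sigma = np.diag([1.0, 0.25])       # current uncertainty in position and velocity
Q = 0.05 * np.eye(2)               # process noise, e.g. unmodelled wind

# The first two Kalman equations (no control input B*u in this sketch)
x_pred = F @ x
Sigma_pred = F @ Sigma @ F.T + Q

print(x_pred)                      # [2. 2.] -- position advanced by v*dt
```

Note that the predicted position uncertainty grows: the velocity uncertainty leaks into the position through \(F\), and \(Q\) adds on top.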
From high-school physics, we know that the position after one time step is \(\begin{equation} x_1 = x_0 + v\Delta t \end{equation}\) while the velocity stays constant. In matrix form, this corresponds to the state transition matrix \(\begin{equation} F = \begin{pmatrix} 1 & \Delta t \\ 0 & 1 \end{pmatrix} \end{equation}\) and the first two Kalman equations for this model read
\[\begin{equation} \mathbf{x}_{k\vert k-1} = F \cdot \mathbf{x}_{k-1\vert k-1} + B \mathbf{u}_k \end{equation}\] \[\begin{equation} \Sigma_{k|k-1} = F \cdot \Sigma_{k-1|k-1} \cdot F^T + Q \end{equation}\] The measurement \(\mathbf{z}_k\) is related to the true state through the measurement matrix \(H\) and the measurement noise \(\sigma_k\) \[\begin{equation} \mathbf{z}_k = H\cdot \mathbf{x}_k + \sigma_k \end{equation}\]
Bayesian updating
Bayes’ theorem to Bayesian updating
The Kalman filter is an optimal method of Bayesian updating under the assumption that everything is Gaussian. The heart of Bayesian inference is Bayes’ theorem
\[\begin{equation} \text{posterior} \propto \text{likelihood} \cdot \text{prior} \end{equation}\]and let’s rewrite it in mathematical notation
\[\begin{equation} p(x_k|z_{0:k}) \propto p(z_k \vert x_k) \cdot p(x_k). \end{equation}\]The first assumption we will make is that our likelihood and prior are Gaussian. Hence, we can define the two as
\[\begin{equation} p(x_k) = C \exp{\left( -\frac{(x-\mu_1)^2}{2\sigma_1^2}\right)} \end{equation}\]and
\[\begin{equation} p(z_k|x_k) = C \exp{\left(-\frac{(x-\mu_2)^2}{2\sigma_2^2}\right)}. \end{equation}\]Now, we wish to multiply these together to determine the posterior
\[\begin{equation} p(x_k|z_{0:k}) \propto \exp{\left(-\frac{1}{2}\left(\frac{(x-\mu_1)^2}{\sigma_1^2}+\frac{(x-\mu_2)^2}{\sigma_2^2}\right)\right)} \end{equation}\]and complete the square
\[\begin{equation} \frac{(x-\mu_1)^2}{\sigma_1^2} + \frac{(x-\mu_2)^2}{\sigma^2_2} = \frac{\sigma^2_2(x-\mu_1)^2+\sigma^2_1(x-\mu_2)^2}{\sigma^2_1 \sigma^2_2} \end{equation}\]after expanding the numerator, collecting terms, and dropping the terms independent of \(x\) (they get absorbed into the proportionality constant), we get
\[\begin{equation} =\frac{(\sigma^2_1+\sigma^2_2)\left(x-\frac{\sigma^2_2\mu_1+\sigma^2_1\mu_2}{\sigma_1^2+\sigma^2_2}\right)^2}{\sigma_1^2\sigma^2_2} \end{equation}\]Thus, we arrive at our posterior (which is also Gaussian distributed)
\[\begin{equation} p(x_k|z_{0:k}) \propto \exp{\left(-\frac{(x-\mu_{\text{new}})^2}{2\sigma_{\text{new}}^2}\right)} \end{equation}\]with the new mean
\[\begin{equation} \mu_{\text{new}} = \frac{\sigma^2_2 \mu_1+\sigma^2_1\mu_2}{\sigma^2_1 + \sigma^2_2} \end{equation}\]and new variance
\[\begin{equation} \sigma^2_{\text{new}} = \frac{\sigma^2_1\sigma^2_2}{\sigma^2_1 + \sigma^2_2} \end{equation}\]
Rewriting into Kalman filter form
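Before rewriting, it is worth sanity-checking the fused mean and variance formulas numerically; the values below are purely illustrative:

```python
# Fusing two Gaussian beliefs about the same quantity, following the
# posterior mean/variance formulas derived above.
mu1, var1 = 5.0, 4.0   # prior: mean 5, variance 4
mu2, var2 = 6.0, 1.0   # likelihood (measurement): mean 6, variance 1

mu_new = (var2 * mu1 + var1 * mu2) / (var1 + var2)
var_new = (var1 * var2) / (var1 + var2)

print(mu_new)   # 5.8 -- pulled toward the more certain measurement
print(var_new)  # 0.8 -- smaller than either input variance
```

The fused estimate always sits between the two means, closer to the one with smaller variance, and the fused variance is always smaller than both inputs.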
Let’s rewrite the posterior mean to be
\[\begin{equation} \mu_{\text{new}} = \frac{\sigma^2_2 \mu_1+\sigma^2_1\mu_2}{\sigma^2_1 + \sigma^2_2} = \mu_1 + \frac{\sigma^2_1 (\mu_2 - \mu_1)}{\sigma^2_1 + \sigma^2_2} \end{equation}\]let’s define
\[\begin{equation}K = \frac{\sigma^2_1}{\sigma^2_1 + \sigma^2_2}\end{equation}\]Thus, we arrive at the 4th equation
\[\begin{equation} \mu_{\text{new}} = \mu_1 + K(\mu_2 - \mu_1) \end{equation}\]where it can be rewritten in the more familiar form
\[\begin{equation} \mathbf{x}_{k\vert k} = \mathbf{x}_{k\vert k-1} + K(\mathbf{z}_k - H\cdot \mathbf{x}_{k|k-1}) \end{equation}\]and the 5th equation
\[\begin{equation} \sigma^2_{\text{new}} = (1-K) \sigma^2_1 \end{equation}\]where it can be rewritten in the more familiar form
\[\begin{equation} \Sigma_{k\vert k} = (I-KH) \Sigma_{k\vert k-1} \end{equation}\]As we can see, the 4th and 5th equations of the Kalman filter are Bayes’ theorem combined with a special property of Gaussians: the product of two Gaussians is a Gaussian. This allows the filter to be recursive, as the posterior will always have the same form as the prior; hence you have a simple recursive loop where you only need a mean and a variance.
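The recursive loop is easiest to see in scalar form. A minimal sketch, with made-up measurements:

```python
# The 4th and 5th equations in scalar form. mu1/var1 play the role of the
# prediction, mu2/var2 the measurement; all numbers are illustrative.
def scalar_update(mu1, var1, mu2, var2):
    K = var1 / (var1 + var2)           # scalar Kalman gain
    mu_new = mu1 + K * (mu2 - mu1)     # 4th equation: correct the prediction
    var_new = (1 - K) * var1           # 5th equation: shrink the uncertainty
    return mu_new, var_new

# The recursion only ever carries a mean and a variance
mu, var = 5.0, 4.0
for z in [6.0, 5.5, 6.2]:              # three noisy measurements, variance 1
    mu, var = scalar_update(mu, var, z, 1.0)
print(var)                             # the variance shrinks with every update
```

Because the posterior is again Gaussian, the pair `(mu, var)` after one update is exactly the prior for the next, which is the whole recursion.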
The Kalman gain and minimising the mean squared error
When we use the Kalman filter to predict and update at each time step, we need some metric to define the optimal estimate. Since we are working with Gaussians, we will use the mean squared error (MSE). As the variance of a Gaussian is the expected squared deviation from the mean, minimising the MSE is equivalent to minimising the spread of your posterior Gaussian, which is the same as shrinking the uncertainty in your predictions. We define the estimation error
\[\begin{equation} e_k = x_k - \mathbf{x}_{k\vert k} \end{equation}\]using the 4th equation \(\begin{equation} e_k = x_k - \mathbf{x}_{k\vert k-1} - K(\mathbf{z}_k - H\cdot \mathbf{x}_{k|k-1}) \end{equation}\)
and substituting in \(z_k\)
\[\begin{equation} = x_k - \mathbf{x}_{k\vert k-1} - K(H\cdot \mathbf{x}_k + \sigma_k - H\cdot \mathbf{x}_{k|k-1}) \end{equation}\] \[\begin{equation} = e_{k\vert k-1} - K (H e_{k\vert k-1} + \sigma_k) \end{equation}\] \[\begin{equation} = (I-KH)e_{k\vert k-1} - K\sigma_k \end{equation}\]The full covariance is defined as
\[\begin{equation} \Sigma_{k\vert k} = E[e_k e^T_k] = E[((I-KH)e_{k\vert k-1} - K\sigma_k)((I-KH)e_{k\vert k-1} - K\sigma_k)^T] \end{equation}\]When we expand this product, the cross terms vanish, since the prediction error \(e_{k\vert k-1}\) and measurement error \(\sigma_k\) are uncorrelated.
\[\begin{equation} = (I-KH)E[e_{k|k-1}e_{k|k-1}^T](I-KH)^T + KE[\sigma_k\sigma_k^T]K^T \end{equation}\] \[\begin{equation} = (I-KH)\Sigma_{k|k-1}(I-KH)^T+KRK^T \end{equation}\]and once we expand this
\[\begin{equation} = \Sigma_{k|k-1} - KH\Sigma_{k|k-1} - \Sigma_{k|k-1}H^TK^T + K(H\Sigma_{k|k-1}H^T+R)K^T \end{equation}\]The MSE is written as
\[\begin{equation} \text{MSE} = E\left[||x_k - \mathbf{x}_{k\vert k} ||^2\right] = \text{tr}(\Sigma_{k\vert k}) \end{equation}\]Let’s minimise it
\[\begin{equation} \frac{d}{dK}\text{tr}(\Sigma_{k|k}) = 0 \end{equation}\]Apply the derivative to each term in \(\Sigma_{k|k}\).
\[\begin{equation} \frac{d}{dK}\text{tr}(\Sigma_{k|k-1}) = 0 \end{equation}\] \[\begin{equation} \frac{d}{dK}\text{tr}(-KH\Sigma_{k|k-1}) = - (H\Sigma_{k|k-1})^T= -\Sigma_{k|k-1}H^T \end{equation}\] \[\begin{equation} \frac{d}{dK}\text{tr}(-\Sigma_{k|k-1}H^T K^T) = -\Sigma_{k|k-1}H^T \end{equation}\] \[\begin{equation} \frac{d}{dK}\text{tr}(K(H\Sigma_{k|k-1}H^T+R)K^T) = 2K(H\Sigma_{k|k-1}H^T + R) \end{equation}\]Thus
\[\begin{equation} \frac{d}{dK}\text{tr}(\Sigma_{k|k}) = -2\Sigma_{k|k-1}H^T +2K(H\Sigma_{k|k-1}H^T + R) = 0 \end{equation}\]Now we solve for K,
\[\begin{equation} 2\Sigma_{k|k-1}H^T = 2K(H\Sigma_{k|k-1}H^T + R) \end{equation}\] \[\begin{equation} K= \Sigma_{k|k-1}H^T(H\Sigma_{k|k-1}H^T + R)^{-1} \end{equation}\]and we have arrived at the third Kalman equation.
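We can check the optimality numerically. Using the Joseph form of the posterior covariance (which holds for any gain, not just the optimal one), perturbing the derived gain should only ever increase the trace. A small NumPy sketch with made-up covariances:

```python
import numpy as np

# Illustrative numbers for a 2-state, 1-measurement problem (assumed values)
Sigma_pred = np.array([[2.0, 0.3],
                       [0.3, 1.0]])    # predicted covariance
H = np.array([[1.0, 0.0]])             # we only measure the first state
R = np.array([[0.5]])                  # measurement noise covariance

# The gain derived above
K_opt = Sigma_pred @ H.T @ np.linalg.inv(H @ Sigma_pred @ H.T + R)

def posterior_trace(K):
    # Joseph-form posterior covariance, valid for ANY gain K,
    # so we can compare the MSE (its trace) across different gains
    A = np.eye(2) - K @ H
    return np.trace(A @ Sigma_pred @ A.T + K @ R @ K.T)

# Perturbing the gain in any direction can only increase the MSE
print(posterior_trace(K_opt) <= posterior_trace(K_opt + 0.1))  # True
print(posterior_trace(K_opt) <= posterior_trace(K_opt - 0.1))  # True
```

Since the trace is quadratic in \(K\) with a positive-definite quadratic term, the stationary point we found is indeed the minimum.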
Conclusion
The Kalman filter is Bayesian updating specialised to linear models with Gaussian noise: the physical model supplies a prediction, the measurement supplies a correction, and the Kalman gain, derived by minimising the mean squared error, tells you exactly how to weight the two.
References
- A New Approach to Linear Filtering and Prediction Problems, Kalman, R. E. (1960)
- An Introduction to the Kalman Filter, Welch and Bishop (2006)
- Least-squares Estimation: from Gauss to Kalman, Sorenson, H. W. (1970)
