Exponentially smoothed quantiles
The problem
Imagine you have a sensor somewhere, perhaps one recording temperatures. You see the following output from this sensor:
Prelude: Exponential smoothing (mean)
There is this trick in statistics that is as beautifully simple as it is powerful. Exponential smoothing, or the exponentially weighted moving average, is a simple formula that lets you turn a noisy time series into a smooth one by moving each point closer to the mean of the recent points. The method is dead easy to implement and computationally about as cheap as it gets. You define a smoothing parameter $\alpha$ between 0 (infinite smoothing) and 1 (no smoothing), and then the noisy series $x_t$ is smoothed into the series $s_t$ by the formula: \[ s_{t+1} = \alpha x_{t+1} + (1 - \alpha) s_t \] \[ s_0 = x_0 \] Here is what that looks like on our noisy sensor data:
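In code the update really is a one-liner. Here is a minimal Python sketch (the function name is mine, not from any library):

```python
def exp_smooth(xs, alpha):
    """Exponentially weighted moving average of a series xs."""
    s = xs[0]                              # s_0 = x_0
    out = [s]
    for x in xs[1:]:
        s = alpha * x + (1 - alpha) * s    # s_{t+1} = alpha*x_{t+1} + (1-alpha)*s_t
        out.append(s)
    return out
```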
You see that too high an $\alpha$ leaves you with too much noise, while too low an $\alpha$ removes most of the signal. However, even with medium smoothing (the red line), you can see that the smoothed line lags behind the actual pattern. That's because the smoothing is based on the past, and doesn't account for the direction in which the data is moving. We can add an estimator, $b_t$, for the direction in which the data is moving to reduce the lag. This is called the Holt smoother: \[ s_{t+1} = \alpha x_{t+1} + (1 - \alpha) (s_t + b_t) \] \[ s_0 = x_0 \] \[ b_{t+1} = \beta (s_{t+1} - s_t) + (1 - \beta) b_t \] \[ b_0 = 0 \] And as you can see, the Holt smoother follows the actual pattern much more closely.
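The trend estimator adds only two lines to the sketch above (again, the naming is mine):

```python
def holt_smooth(xs, alpha, beta):
    """Holt (double exponential) smoothing: level s plus trend b."""
    s, b = xs[0], 0.0                          # s_0 = x_0, b_0 = 0
    out = [s]
    for x in xs[1:]:
        s_prev = s
        s = alpha * x + (1 - alpha) * (s + b)  # smooth towards level + trend
        b = beta * (s - s_prev) + (1 - beta) * b
        out.append(s)
    return out
```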
Why does exponential smoothing work?
Let's see what we can learn from the Holt smoother to build a quantile estimator. First, notice that exponential smoothing (ignore the trend for now) smooths towards the mean of the data. I can think of two ways to show this.
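One way is to rewrite the smoothing update: $\alpha x_{t+1} + (1 - \alpha) s_t = s_t + \alpha (x_{t+1} - s_t)$, which is exactly a gradient-descent step with step size $\alpha$ on the squared-error loss $\tfrac{1}{2}(x_{t+1} - s)^2$, whose minimizer is the mean. A quick numerical check of this equivalence (a sketch; the function names are mine):

```python
def smooth_update(s, x, alpha):
    """The exponential smoothing update in its usual form."""
    return alpha * x + (1 - alpha) * s

def gradient_step(s, x, alpha):
    """Gradient-descent step on the loss 0.5*(x - s)**2."""
    grad = -(x - s)              # d/ds of 0.5*(x - s)**2
    return s - alpha * grad

# Both forms give the same result:
assert abs(smooth_update(3.0, 7.0, 0.2) - gradient_step(3.0, 7.0, 0.2)) < 1e-12
```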
So maybe we can just replace the sum of squared errors with a loss function that is minimized by a quantile, use its gradient, and we're done?
Applying the same principle to quantiles
There is a snag though: the quantile loss function is not differentiable when $s$ equals any of the data points, so we can't always compute the gradient. Thankfully there is a solution to this problem: we can use a subgradient of the loss function. Stepping against a subgradient, like stepping against the gradient, moves the estimate towards the minimizer, but unlike the "real" gradient, it might not be the "best" direction. As our subgradient we will pick: \[\begin{align} -q &\quad \text{ if } s < x_i \\ 1-q &\quad \text{ if } s > x_i \\ 0 &\quad \text{ if } s = x_i. \\ \end{align}\] You can now see another problem introduced by the subgradient: the subgradient doesn't become smaller when $s$ is close to $x_i$. As a result, an update step can overshoot: say $s$ is just a little bit smaller than $x_i$, then an update should bring $s$ only a little bit closer to $x_i$. But the update will increase $s$ by $q$, which is potentially larger than the distance between $s$ and $x_i$. This can lead to instability in the optimization process. So we will make one more amendment to the final update rule: \[ s_{t+1} = s_t + \alpha \cdot \begin{cases} \min(q, x_{t+1} - s_t) & \text{ if } s_t < x_{t+1} \\ \max(q-1, x_{t+1} - s_t) & \text{ if } s_t > x_{t+1} \\ 0 & \text{ if } s_t = x_{t+1}. \\ \end{cases} \] Let's see what it looks like in practice:
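The clipped update rule translates directly into code. Here is a sketch of a single update step (my own naming, not from the original):

```python
def quantile_step(s, x, q, alpha):
    """One clipped-subgradient update of the running q-quantile estimate s."""
    if s < x:
        step = min(q, x - s)       # clip so we never overshoot past x
    elif s > x:
        step = max(q - 1, x - s)   # step is negative; clip from below
    else:
        step = 0.0
    return s + alpha * step
```

For $q > 0.5$ the estimate reacts more strongly to data points above it than below it, which is what pushes it towards the upper part of the distribution.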
Yikes. For the median ($q=0.5$) we see a lag similar to the one we saw for the mean, but the other quantiles lag dramatically more, especially when the data is moving away from the quantile. If you think about it, this makes sense. The algorithm works by reacting strongly to data points on one side of the distribution, while hardly reacting to data points on the other side. Let's add trend components to each of the quantiles as well, in the same way we did for the mean:
Oh man. What is going on here? While the trend component seems to have helped with the median, the more extreme quantiles are just not sensitive enough to data points on one side of the distribution, and the trend component refuses to reverse. But what if we use the trend component that we calculated on the mean as the trend component for the quantiles? This trend component is sensitive to data points on both sides of the distribution, and should more easily reverse when the data starts to move away from the quantile.
Finally! We have a solution that hardly lags at all, it's easy to implement, and we have not had to make any assumptions about the data. I'm going to skip the formal verification of the statistical properties of this solution. Instead I'll note that because our construction is based on online gradient descent, we can rely on standard convergence results for online gradient descent to show that our algorithm will converge to the true quantile. Let me close off with the final solution in formula form: \begin{align} s_0 &= x_0 & \text{the exponential quantile}\\ b_0 &= 0 & \text{the trend}\\ m_0 &= x_0 & \text{the mean (for calculating the trend)}\\ e_t &= s_t + b_t & \text{the expected update for the quantile}\\ s_{t+1} &= e_t + \alpha \cdot \begin{cases} \min(q, x_{t+1} - e_t) & \text{ if } e_t < x_{t+1} \\ \max(q-1, x_{t+1} - e_t) & \text{ if } e_t > x_{t+1} \\ 0 & \text{ if } e_t = x_{t+1} \\ \end{cases}\\ m_{t+1} &= \alpha x_{t+1} + (1 - \alpha) (m_t + b_t) \\ b_{t+1} &= \beta (m_{t+1} - m_t) + (1 - \beta) b_t \\ \end{align}
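Putting it all together, here is a sketch of the full recursion in Python: a quantile estimate, a mean estimate, and a single trend that is computed on the mean but shared with the quantile (names and structure are my own):

```python
def smooth_quantile(xs, q, alpha, beta):
    """Exponentially smoothed q-quantile of xs, with a trend shared from the mean."""
    s = m = xs[0]                  # quantile and mean estimates start at x_0
    b = 0.0                        # trend starts at 0
    out = [s]
    for x in xs[1:]:
        e = s + b                  # expected quantile before seeing x
        if e < x:
            s = e + alpha * min(q, x - e)
        elif e > x:
            s = e + alpha * max(q - 1, x - e)
        else:
            s = e
        m_prev = m                 # Holt update on the mean drives the trend
        m = alpha * x + (1 - alpha) * (m + b)
        b = beta * (m - m_prev) + (1 - beta) * b
        out.append(s)
    return out
```

To track several quantiles at once, you would run one quantile estimate per value of $q$ while computing the mean and trend only once.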