Exponentially smoothed quantiles
The problem
Imagine you have a sensor somewhere, perhaps one recording temperatures. You see the following output from this sensor:
Prelude: Exponential smoothing (mean)
There is this trick in statistics that is as beautifully simple as it is powerful. Exponential smoothing, or the exponentially weighted moving average, is a simple formula that lets you turn a noisy time series into a smooth one by moving each point closer to the mean of the recent points. The method is dead easy to implement and computationally about as cheap as it gets. You define a smoothing parameter $\alpha$ between 0 (infinite smoothing) and 1 (no smoothing), and then the noisy series $x_t$ is smoothed into the series $s_t$ by the formula: \[ s_{t+1} = \alpha x_{t+1} + (1 - \alpha) s_t \] \[ s_0 = x_0 \] Here is what that looks like on our noisy sensor data:
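In code the update really is a one-liner. Here is a minimal Python sketch (the function name is mine, not from any library):

```python
def exp_smooth(xs, alpha):
    """Exponentially weighted moving average of a series xs."""
    s = xs[0]                              # s_0 = x_0
    out = [s]
    for x in xs[1:]:
        s = alpha * x + (1 - alpha) * s    # s_{t+1} = alpha*x_{t+1} + (1-alpha)*s_t
        out.append(s)
    return out
```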
You see that too high an $\alpha$ leaves you with too much noise, while too low an $\alpha$ removes most of the signal. However, even with medium smoothing (the red line), you can see that the smoothed line lags behind the actual pattern. That's because the smoothing is based on the past, and doesn't account for the direction in which the data is moving. We can add an estimator, $b_t$, for the direction in which the data is moving to reduce the lag. This is called the Holt smoother: \[ s_{t+1} = \alpha x_{t+1} + (1 - \alpha) (s_t + b_t) \] \[ s_0 = x_0 \] \[ b_{t+1} = \beta (s_{t+1} - s_t) + (1 - \beta) b_t \] \[ b_0 = 0 \] And as you can see, the Holt smoother follows the actual pattern much more closely.
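The trend estimator adds only two lines to the sketch above (again, the naming is mine):

```python
def holt_smooth(xs, alpha, beta):
    """Holt (double exponential) smoothing: level s plus trend b."""
    s, b = xs[0], 0.0                          # s_0 = x_0, b_0 = 0
    out = [s]
    for x in xs[1:]:
        s_prev = s
        s = alpha * x + (1 - alpha) * (s + b)  # smooth towards level + trend
        b = beta * (s - s_prev) + (1 - beta) * b
        out.append(s)
    return out
```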
Why does exponential smoothing work?
Let's see what we can learn from the Holt smoother to build a quantile estimator. First, notice that exponential smoothing (ignore the trend for now) smooths towards the mean of the data. I can think of two ways to show this.
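One way is to rewrite the smoothing update: $\alpha x_{t+1} + (1 - \alpha) s_t = s_t + \alpha (x_{t+1} - s_t)$, which is exactly a gradient-descent step with step size $\alpha$ on the squared-error loss $\tfrac{1}{2}(x_{t+1} - s)^2$, whose minimizer is the mean. A quick numerical check of this equivalence (a sketch; the function names are mine):

```python
def smooth_update(s, x, alpha):
    """The exponential smoothing update in its usual form."""
    return alpha * x + (1 - alpha) * s

def gradient_step(s, x, alpha):
    """Gradient-descent step on the loss 0.5*(x - s)**2."""
    grad = -(x - s)              # d/ds of 0.5*(x - s)**2
    return s - alpha * grad

# Both forms give the same result:
assert abs(smooth_update(3.0, 7.0, 0.2) - gradient_step(3.0, 7.0, 0.2)) < 1e-12
```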
So maybe we can just replace the sum of squared errors with a loss function that is minimized by a quantile, use its gradient, and we're done?
Applying the same principle to quantiles
There is a snag though: the quantile loss function is not differentiable when $s$ equals any of the data points, so we can't always compute the gradient. Thankfully there is a solution to this problem: we can use a subgradient of the loss function. Stepping against a subgradient, like stepping against the gradient, moves the estimate towards the minimizer, but unlike the "real" gradient, it might not be the "best" direction. As our subgradient we will pick: \[\begin{align} -q &\quad \text{ if } s < x_i \\ 1-q &\quad \text{ if } s > x_i \\ 0 &\quad \text{ if } s = x_i. \\ \end{align}\] You can now see another problem introduced by the subgradient: the subgradient doesn't become smaller when $s$ is close to $x_i$. As a result, an update step can overshoot: say $s$ is just a little bit smaller than $x_i$, then an update should bring $s$ only a little bit closer to $x_i$. But the update will increase $s$ by $q$, which is potentially larger than the distance between $s$ and $x_i$. This can lead to instability in the optimization process. So we will make one more amendment to the final update rule: \[ s_{t+1} = s_t + \alpha \cdot \begin{cases} \min(q, x_{t+1} - s_t) & \text{ if } s_t < x_{t+1} \\ \max(q-1, x_{t+1} - s_t) & \text{ if } s_t > x_{t+1} \\ 0 & \text{ if } s_t = x_{t+1}. \\ \end{cases} \] Let's see what it looks like in practice:
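The clipped update rule translates directly into code. Here is a sketch of a single update step (my own naming, not from the original):

```python
def quantile_step(s, x, q, alpha):
    """One clipped-subgradient update of the running q-quantile estimate s."""
    if s < x:
        step = min(q, x - s)       # clip so we never overshoot past x
    elif s > x:
        step = max(q - 1, x - s)   # step is negative; clip from below
    else:
        step = 0.0
    return s + alpha * step
```

For $q > 0.5$ the estimate reacts more strongly to data points above it than below it, which is what pushes it towards the upper part of the distribution.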
Yikes. For the median ($q=0.5$) we see a lag similar to the one we saw for the mean, but the other quantiles lag dramatically more, especially when the data is moving away from the quantile. If you think about it, this makes sense. The algorithm works by reacting strongly to data points on one side of the distribution, while hardly reacting to data points on the other side. Let's add trend components to each of the quantiles as well, in the same way we did for the mean:
Oh man. What is going on here? While the trend component seems to have helped with the median, the more extreme quantiles are just not sensitive enough to data points on one side of the distribution, and the trend component refuses to reverse. But what if we use the trend component that we calculated on the mean as the trend component for the quantiles? This trend component is sensitive to data points on both sides of the distribution, and should more easily reverse when the data starts to move away from the quantile.
Finally! We have a solution that hardly lags at all, it's easy to implement, and we have not had to make any assumptions about the data. I'm going to skip the formal verification of the statistical properties of this solution. Instead I'll note that because our construction is based on online gradient descent, we can rely on standard convergence results for online gradient descent to show that our algorithm will converge to the true quantile. Let me close off with the final solution in formula form: \begin{align} s_0 &= x_0 & \text{the exponential quantile}\\ b_0 &= 0 & \text{the trend}\\ m_0 &= x_0 & \text{the mean (for calculating the trend)}\\ e_t &= s_t + b_t & \text{the expected update for the quantile}\\ s_{t+1} &= e_t + \alpha \cdot \begin{cases} \min(q, x_{t+1} - e_t) & \text{ if } e_t < x_{t+1} \\ \max(q-1, x_{t+1} - e_t) & \text{ if } e_t > x_{t+1} \\ 0 & \text{ if } e_t = x_{t+1} \\ \end{cases}\\ m_{t+1} &= \alpha x_{t+1} + (1 - \alpha) (m_t + b_t) \\ b_{t+1} &= \beta (m_{t+1} - m_t) + (1 - \beta) b_t \\ \end{align}
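Putting it all together, here is a sketch of the full recursion in Python: a quantile estimate, a mean estimate, and a single trend that is computed on the mean but shared with the quantile (names and structure are my own):

```python
def smooth_quantile(xs, q, alpha, beta):
    """Exponentially smoothed q-quantile of xs, with a trend shared from the mean."""
    s = m = xs[0]                  # quantile and mean estimates start at x_0
    b = 0.0                        # trend starts at 0
    out = [s]
    for x in xs[1:]:
        e = s + b                  # expected quantile before seeing x
        if e < x:
            s = e + alpha * min(q, x - e)
        elif e > x:
            s = e + alpha * max(q - 1, x - e)
        else:
            s = e
        m_prev = m                 # Holt update on the mean drives the trend
        m = alpha * x + (1 - alpha) * (m + b)
        b = beta * (m - m_prev) + (1 - beta) * b
        out.append(s)
    return out
```

To track several quantiles at once, you would run one quantile estimate per value of $q$ while computing the mean and trend only once.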