
Calculating the median of a time series

  • code
  • data structures
  • problem solving

The other day I needed to work out how to find the mean, standard deviation and median values of a time series of data. In this case I am dealing with server response times. I'm not talking about a few data points here; I'm talking about 20,000 points per second for a week. That's 1,728,000,000 (1.7 billion) data points per day streaming in real time. Don't ask…

The mean and standard deviation are straightforward: both can be maintained as running totals as the data streams in. The median ended up being a bit more difficult, as my initial loop-and-monitor idea was quickly blown out of the water once I gave it some thought. A quick sort was no good either, because the data was coming in too quickly and the data set was too big.
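
As an illustration of the running calculations (the post does not show this code, so the approach below is my own assumption), Welford's online algorithm keeps the mean and variance up to date with constant memory and O(1) work per data point:

```javascript
// Sketch of a running mean / standard deviation using Welford's online
// algorithm. Constant memory, O(1) work per incoming value.
class RunningStats {
  constructor() {
    this.n = 0;      // number of samples seen so far
    this.mean = 0;   // running mean
    this.m2 = 0;     // sum of squared differences from the current mean
  }

  push(x) {
    this.n += 1;
    const delta = x - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (x - this.mean);
  }

  stddev() {
    // Population standard deviation; use (this.n - 1) for the sample version.
    return this.n > 0 ? Math.sqrt(this.m2 / this.n) : 0;
  }
}

const stats = new RunningStats();
[120, 95, 210, 87].forEach((ms) => stats.push(ms));
console.log(stats.mean, stats.stddev());
```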

Sampling

Firstly, this is a ridiculous amount of data that will present numerous problems if we take it all on board as raw data. Whether we are dealing with 20,000 data points per second or 50, we will still get an excellent insight into the data we are receiving. In fact, even one data point per second would be satisfactory for a week's worth of data.

Without doing any regression analysis, we can assume the data points recorded within each second will most likely display high autocorrelation. Logically, if the server is under heavy load then all response times, not just a few, will tend to be higher. Thus, they are not independent of one another.

There is a diminishing return from recording more data points. As Markowitz observed for portfolio diversification, the benefit of adding more items falls away quickly past around 20. Given the assumed correlated nature of server response times, this scenario will be no different.

For these reasons, I am going to reduce the sampling rate to just 1 sample per second, which brings the data set down to 604,800 data points for the week, more than enough to get an insight into the median. Assuming each number recorded takes up 4 bytes of memory, that is around 2.4 MB, compared to roughly 48 GB if all 20,000 samples per second were kept. (This does not include the memory used for the object which stores the number and its references.) Obviously this amounts to a huge reduction in both memory and CPU usage.
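
For illustration only, and not from the original post, downsampling to one reading per second could be as simple as keeping the first value seen in each one-second bucket (the onResponseTime hook below is a hypothetical entry point):

```javascript
// Minimal sketch: keep only the first response time observed in each
// one-second bucket and discard the rest.
const samples = [];        // at most 604,800 entries for a week
let currentSecond = null;

function onResponseTime(timestampMs, responseTimeMs) {
  const second = Math.floor(timestampMs / 1000);
  if (second !== currentSecond) {
    currentSecond = second;
    samples.push(responseTimeMs);
  }
  // All other readings within the same second are dropped.
}
```

Taking a random reading within each second rather than the first would avoid any bias towards the start of the second, but the idea is the same.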

Write speed / read speed

20,000 points per second will require fast write speed, ideally O(1). Lagging behind will lead to a bottleneck and, inevitably, problems. That said, even though I am sampling the data I still want writes to be performant.

On the other hand, reading from the structure is less critical and some wait time is acceptable. Let’s strive for this to be as fast as possible whilst respecting the priority given to write speed.

The options

Initially I thought of simply implementing a merge sort: insert the values into an array and sort it when calling for the median value. But merge sort and heap sort cost Θ(n log n) even in the worst case (and quicksort can degrade to O(n²)), and paying that on every median query is not satisfactory for a dataset this size. Move on.
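
For contrast, here is roughly what that discarded approach looks like; every call to median() re-sorts the whole array:

```javascript
// The discarded approach: O(1) insert, but every median query pays
// O(n log n) to sort the whole array again.
const values = [];

function insert(x) {
  values.push(x);
}

function median() {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}
```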

Far more interesting is a priority queue backed by a heap. The performance is far better and it should handle the requirements nicely.

Priority queue

As it turns out, I think I found a neat solution for storing the data. Rather than trying to manage the whole thing in a single structure, it makes sense to use two priority queues, each one acting as either the left or the right side of the median. Conceptually, one queue (the left) returns its maximum value and the other (the right) returns its minimum value. To keep the median accessible, the two queues keep their sizes balanced, never differing by more than one element, as part of the structure.
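
Here is a minimal sketch of that idea, with naive sorted arrays standing in for real heaps so that it stays self-contained (the names MedianTracker and SimpleHeap are mine, not the post's); the coordination logic is what matters:

```javascript
// Conceptual sketch of the two-queue median. For readability the "heaps"
// are naive sorted arrays; a real implementation would use proper heaps
// (see the next section), but the coordination logic is identical.
class SimpleHeap {
  constructor(isMax) {
    this.isMax = isMax;
    this.items = []; // kept sorted ascending
  }
  size() { return this.items.length; }
  push(x) {
    // Insert keeping ascending order (O(n) here; a real heap is O(log n) or better).
    let i = 0;
    while (i < this.items.length && this.items[i] < x) i += 1;
    this.items.splice(i, 0, x);
  }
  peek() { return this.isMax ? this.items[this.items.length - 1] : this.items[0]; }
  pop() { return this.isMax ? this.items.pop() : this.items.shift(); }
}

class MedianTracker {
  constructor() {
    this.lower = new SimpleHeap(true);  // max side: left of the median
    this.upper = new SimpleHeap(false); // min side: right of the median
  }
  push(x) {
    if (this.lower.size() === 0 || x <= this.lower.peek()) {
      this.lower.push(x);
    } else {
      this.upper.push(x);
    }
    // Rebalance so the two sides never differ by more than one element.
    if (this.lower.size() > this.upper.size() + 1) {
      this.upper.push(this.lower.pop());
    } else if (this.upper.size() > this.lower.size() + 1) {
      this.lower.push(this.upper.pop());
    }
  }
  median() {
    if (this.lower.size() === 0 && this.upper.size() === 0) return undefined;
    if (this.lower.size() === this.upper.size()) {
      return (this.lower.peek() + this.upper.peek()) / 2;
    }
    return this.lower.size() > this.upper.size() ? this.lower.peek() : this.upper.peek();
  }
}
```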

The heap

There are a number of ways to implement a priority queue, but given the size of the structure we need one that is both memory and CPU efficient. Ideally, coding the structure should also be as simple as possible.

Heaps provide a nice way of creating an efficient priority queue. Based on the Priority Queue page on Wikipedia there are several structures available that would suit this problem quite nicely. See below:

Operation      Binary     Binomial   Fibonacci  Pairing    Brodal     Rank-pairing  Strict Fibonacci
find-min       Θ(1)       Θ(log n)   Θ(1)       Θ(1)       Θ(1)       Θ(1)          Θ(1)
delete-min     Θ(log n)   Θ(log n)   Θ(log n)   Θ(log n)   Θ(log n)   Θ(log n)      Θ(log n)
insert         Θ(log n)   Θ(1)       Θ(1)       Θ(1)       Θ(1)       Θ(1)          Θ(1)
decrease-key   Θ(log n)   Θ(log n)   Θ(1)       Θ(log n)   Θ(1)       Θ(1)          Θ(1)
merge          Θ(n)       Θ(log n)   Θ(1)       Θ(1)       Θ(1)       Θ(1)          Θ(1)

The only operations I am interested in for this problem are

  • find-min (find-max)
  • delete-min (delete-max)
  • insert

The other operations are irrelevant for solving this problem.

Whichever way you look at it, we are going to have to deal with worst-case Θ(log n) time for deleting the minimum (maximum) value from the heap. However, in our case we are only ever removing the top node and rebalancing the tree, so that cost is acceptable.

The Binary and Binomial heaps require a trade-off on one or other of the required operations. Although simpler to implement, that is not satisfactory. The Rank-pairing, Fibonacci and Brodal heaps would do the job, but their coding complexity is too high. In this particular scenario I chose the pairing heap: it gives O(1) insert and find-min with amortised O(log n) delete-min, while remaining comparatively simple to implement.
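
Since the pairing heap carries the implementation, here is a minimal min-ordered sketch of the structure (my own illustration, not the code linked below); a max-ordered version simply flips the comparison in meld:

```javascript
// Minimal min-pairing-heap: insert and find-min are O(1); delete-min is
// amortised O(log n) via the classic two-pass pairing of the children.
function makeNode(value) {
  return { value, children: [] };
}

// Merge two heaps: the larger root becomes a child of the smaller one.
function meld(a, b) {
  if (!a) return b;
  if (!b) return a;
  if (a.value <= b.value) {
    a.children.push(b);
    return a;
  }
  b.children.push(a);
  return b;
}

function insert(root, value) {
  return meld(root, makeNode(value));
}

function findMin(root) {
  return root ? root.value : undefined;
}

function deleteMin(root) {
  if (!root) return null;
  const kids = root.children;
  // First pass: meld the children in pairs, left to right.
  const paired = [];
  for (let i = 0; i < kids.length; i += 2) {
    paired.push(i + 1 < kids.length ? meld(kids[i], kids[i + 1]) : kids[i]);
  }
  // Second pass: meld the paired heaps together, right to left.
  let merged = null;
  for (let i = paired.length - 1; i >= 0; i -= 1) {
    merged = meld(paired[i], merged);
  }
  return merged;
}
```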

Tying it together

With the data structure chosen, it is now time to implement the solution. In this case I used JavaScript, but as I am not using any external libraries the ideas are the same regardless of the language.

This link shows a JavaScript implementation of what the min/max pair of heaps could look like. Obviously there is a lot more work to be done before something like this could be used in the real world.

https://jsfiddle.net/jordant/np87qctf/embedded/

From the code contained within the fiddle we can see that there is a StatsCollector class which contains two heaps. Think of them as the left and right sides of the distribution, with the median in the middle. The root of the max heap is the maximum value contained in that heap; the root of the min heap is the minimum value contained in that heap. As items are added to the StatsCollector, we add them to the correct heap and then rebalance. All the StatsCollector is doing is coordinating this process of rebalancing.
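
For a concrete feel of that process, here is how the earlier MedianTracker sketch could be exercised (again, the API names are mine and not the fiddle's):

```javascript
const tracker = new MedianTracker();
[42, 7, 19, 88, 3].forEach((ms) => tracker.push(ms));
console.log(tracker.median()); // 19
```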

Conclusion

This has been an interesting exercise that went beyond the data structures underlying the solution. In this post I've talked about some of the things I think about when I set out to build a piece of software. It wasn't just the underlying data structure; there was also the question of how much is actually gained by recording every single data point. By considering the problem from the data structures through to the real-world practicalities, what seemed like a complex problem ended up with a fairly simple solution.