Saturday, July 21, 2012

running median problem

I found this interesting problem from interview street.

problem was to maintain running median from stream of numbers.

more formally,  following is full problem from interview street.


The median of M numbers is defined as the middle number after sorting them in order, if M is odd or the average number of the middle 2 numbers (again after sorting) if M is even. You have an empty number list at first. Then you can add or remove some number from the list. For each add or remove operation, output the median of numbers in the list.

Example : For a set of m = 5 numbers, { 9, 2, 8, 4, 1 } the median is the third number in sorted set { 1, 2, 4, 8, 9 } which is 4. Similarly for set of m = 4, { 5, 2, 10, 4 }, the median is the average of second and the third element in the sorted set { 2, 4, 5, 10 } which is (4+5)/2 = 4.5  
Constraints:
0 < n <= 100,000

1. naive

If n is small enough, problem is simple.
each insert/delete sort entire elements
this require n * nlogn(n operation x cost for sorting).

2. using two multiset


Thing is only few elements are affected when insert/delete operation happen.
solution using two multiset s1, s2.

invariant

1) s1 is supposed to maintain 0 ~ n/2th order statistic.
2) s2 is supposed to maintain n/2 ~ nth order statistic.
3) s1.size()  and s2.size() only differ at most 1.

to hold invariant 1, 2
we compare x(current value to insert or delete) with maximum of s1 and minimum of s2 to decide where to put x.

to hold invariant 3
we need re-balance step
this can be done in constant time by
1) moving minimum of s2 to s1
2) moving maximum of s1 to s

Solution

once 3 invariant is meet, then calculating median as problem description is trivial.

1) s1.size == s2.size(s1.size + s2.size is even), then return (maximum of s1 + minimum of s2) / 2.0
one tip we should take care of integer overflow when we average two value(32bit integer) since adding two 32bit integer can cause integer overflow.
2) otherwise if s1.size > s2.size, return maximum of s1. s1.size < s2.size, return minimum of s2.

finally checking if we can delete x is trivial, since multiset provide find method in O(logN) complexity.

just good enough pass testcases.

3. using skip list

I was wondering what about more generic cases.
few google search gives me link to skip list.
for this problem, we exactly need indexable skip list since we keep need to extract median as input comes.
obviously this problem can be solved by indexable skip list.
here is my simple implementation.

TODO
want to take a look at how redis use skip list and understand why skip list is better than self-balance-tree in distributed environment.


check out code here