295 Find Median from Data Stream
The median is the middle value in an ordered integer list. If the size of the list is even, there is no middle value, so the median is the mean of the two middle values.

For example:
[2,3,4], the median is 3
[2,3], the median is (2 + 3) / 2 = 2.5

Design a data structure that supports the following two operations:
* void addNum(int num) - Add an integer from the data stream to the data structure.
* double findMedian() - Return the median of all elements so far.

Example:
addNum(1)
addNum(2)
findMedian() -> 1.5
addNum(3)
findMedian() -> 2

Follow up:
1. If all integer numbers from the stream are between 0 and 100, how would you optimize it?
2. If 99% of all integer numbers from the stream are between 0 and 100, how would you optimize it?

https://www.youtube.com/watch?v=60xnYZ21Ir0

/////
    new PriorityQueue<>(Collections.reverseOrder());

///// smart
    public void addNum(int num) {
        max.offer(num);
        min.offer(max.poll());
        if (max.size() < min.size()) {
            max.offer(min.poll());
        }
    }

/////
    import java.util.Collections;
    import java.util.PriorityQueue;

    class MedianFinder {
        // maxHeap holds the smaller half, minHeap holds the larger half
        PriorityQueue<Integer> maxHeap = new PriorityQueue<>(Collections.reverseOrder());
        PriorityQueue<Integer> minHeap = new PriorityQueue<>();

        /** initialize your data structure here. */
        public MedianFinder() {
        }

        public void addNum(int num) {
            maxHeap.offer(num);
            // restore ordering: nothing in maxHeap may exceed the top of minHeap
            if (!minHeap.isEmpty() && maxHeap.peek() > minHeap.peek()) {
                minHeap.offer(maxHeap.poll());
            }
            // restore sizes: maxHeap holds the same count as minHeap, or one more
            if (minHeap.size() > maxHeap.size()) {
                maxHeap.offer(minHeap.poll());
            }
            if (maxHeap.size() - minHeap.size() > 1) {
                minHeap.offer(maxHeap.poll());
            }
        }

        public double findMedian() {
            if (maxHeap.size() == minHeap.size()) {
                return (maxHeap.peek() + minHeap.peek()) / 2.0; // 2.0 instead of 2
            } else {
                return maxHeap.peek();
            }
        }
    }
    import java.util.Collections;
    import java.util.PriorityQueue;

    class MedianFinder {
        // max queue is always larger than or equal in size to min queue
        PriorityQueue<Integer> min = new PriorityQueue<>();
        PriorityQueue<Integer> max = new PriorityQueue<>(1000, Collections.reverseOrder());

        // Adds a number into the data structure.
        public void addNum(int num) {
            max.offer(num);
            min.offer(max.poll());
            if (max.size() < min.size()) {
                max.offer(min.poll());
            }
        }

        // Returns the median of current data stream
        public double findMedian() {
            if (max.size() == min.size()) {
                return (max.peek() + min.peek()) / 2.0;
            } else {
                return max.peek();
            }
        }
    }
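For the first follow-up question, one possible approach (not part of the notes above) is to keep a count per value instead of heaps: with every input in [0, 100], 101 buckets are enough, and the median is found by walking the buckets to the middle rank. The following is a minimal sketch under that assumption; the class name `BoundedMedianFinder` and its helper `valueAtRank` are hypothetical.

```java
// Sketch for follow-up 1: assumes every num satisfies 0 <= num <= 100.
class BoundedMedianFinder {
    private final int[] counts = new int[101]; // counts[v] = occurrences of v
    private int size = 0;

    // O(1): bump the bucket for this value.
    public void addNum(int num) {
        if (num < 0 || num > 100) {
            throw new IllegalArgumentException("expected 0..100, got " + num);
        }
        counts[num]++;
        size++;
    }

    // Scans at most 101 buckets per lookup, so constant time.
    public double findMedian() {
        if (size == 0) {
            throw new IllegalStateException("no elements added yet");
        }
        int lowRank = (size - 1) / 2; // 0-based rank of the lower middle
        int highRank = size / 2;      // 0-based rank of the upper middle
        return (valueAtRank(lowRank) + valueAtRank(highRank)) / 2.0;
    }

    // Returns the value at the given 0-based rank in sorted order.
    private int valueAtRank(int rank) {
        int seen = 0;
        for (int v = 0; v <= 100; v++) {
            seen += counts[v];
            if (seen > rank) {
                return v;
            }
        }
        throw new IllegalStateException("rank out of range: " + rank);
    }

    public static void main(String[] args) {
        BoundedMedianFinder finder = new BoundedMedianFinder();
        finder.addNum(1);
        finder.addNum(2);
        System.out.println(finder.findMedian()); // 1.5
        finder.addNum(3);
        System.out.println(finder.findMedian()); // 2.0
    }
}
```

For the second follow-up, a common extension of the same idea is to add two extra counters for values below 0 and above 100, relying on the median still falling inside the bucketed range.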
Approach #3 Two Heaps! [Accepted]
Intuition
The above two approaches gave us some valuable insights on how to tackle this problem. Concretely, one can infer two things:
- If we could maintain direct access to median elements at all times, then finding the median would take a constant amount of time.
- If we could find a reasonably fast way of adding numbers to our containers, additional penalties incurred could be lessened.
But perhaps the most important insight, which is not readily observable, is the fact that we only need a consistent way to access the median elements. Keeping the entire input sorted is not a requirement.
Well, if only there were a data structure which could handle our needs.
As it turns out there are two data structures for the job:
- Heaps (or Priority Queues)
- Self-balancing Binary Search Trees (we'll talk more about them in Approach #4)
Heaps are a natural ingredient for this dish! Adding an element to a heap takes logarithmic time. Heaps also give direct access to the maximal/minimal element in a group.
Suppose we maintain two heaps in the following way:
- A max-heap to store the smaller half of the input numbers
- A min-heap to store the larger half of the input numbers
This gives access to median values in the input: they comprise the top of the heaps!
Wait, what? How?
If the following conditions are met:
- Both the heaps are balanced (or nearly balanced)
- The max-heap contains all the smaller numbers while the min-heap contains all the larger numbers
then we can say that:
- All the numbers in the max-heap are smaller than or equal to the top element of the max-heap (let's call it $x$)
- All the numbers in the min-heap are larger than or equal to the top element of the min-heap (let's call it $y$)
Then $x$ and/or $y$ are smaller than (or equal to) almost half of the elements and larger than (or equal to) the other half. That is the definition of median elements.
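To make the claim concrete, the sketch below fills the two heaps by hand from an input that is already split into a smaller and a larger half, then reads the median off the heap tops only. The class name `HeapTopsDemo`, the method `medianFromTops`, and the sample values are made up for illustration; the method assumes the smaller half holds the same number of elements as the larger half, or exactly one more.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class HeapTopsDemo {
    // Precondition (assumed, not checked): every value in smallerHalf is <= every
    // value in largerHalf, and smallerHalf has the same size or one more element.
    static double medianFromTops(List<Integer> smallerHalf, List<Integer> largerHalf) {
        PriorityQueue<Integer> maxHeap = new PriorityQueue<>(Collections.reverseOrder());
        PriorityQueue<Integer> minHeap = new PriorityQueue<>();
        maxHeap.addAll(smallerHalf); // top is the largest of the smaller half
        minHeap.addAll(largerHalf);  // top is the smallest of the larger half
        if (maxHeap.size() == minHeap.size()) {
            // even count: the two tops are the two middle values
            return (maxHeap.peek() + minHeap.peek()) / 2.0;
        }
        // odd count: the extra element on top of the max-heap is the middle value
        return maxHeap.peek();
    }

    public static void main(String[] args) {
        // Sorted input would be [1, 2, 3, 4, 5]; the middle value is 3.
        System.out.println(medianFromTops(Arrays.asList(3, 1, 2), Arrays.asList(5, 4)));
        // Sorted input would be [1, 2, 3, 4]; the middle values are 2 and 3.
        System.out.println(medianFromTops(Arrays.asList(2, 1), Arrays.asList(4, 3)));
    }
}
```

Neither half needs to be sorted: only the two tops are ever inspected.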
This leads us to a huge point of pain in this approach: balancing the two heaps!
Algorithm
- Two priority queues:
  - A max-heap lo to store the smaller half of the numbers
  - A min-heap hi to store the larger half of the numbers
- The max-heap lo is allowed to store, at worst, one more element than the min-heap hi. Hence if we have processed $k$ elements:
  - If $k = 2n + 1 \quad (\forall \, n \in \mathbb{Z})$, then lo is allowed to hold $n + 1$ elements, while hi can hold $n$ elements.
  - If $k = 2n \quad (\forall \, n \in \mathbb{Z})$, then both heaps are balanced and hold $n$ elements each.
  This gives us the nice property that when the heaps are perfectly balanced, the median can be derived from the tops of both heaps. Otherwise, the top of the max-heap lo holds the legitimate median.
- Adding a number num:
  - Add num to max-heap lo. Since lo received a new element, we must do a balancing step for hi. So remove the largest element from lo and offer it to hi.
  - The min-heap hi might end up holding more elements than the max-heap lo after the previous operation. We fix that by removing the smallest element from hi and offering it to lo.
  The above step ensures that we do not disturb the nice little size property we just mentioned.
A little example will clear this up! Say we take input from the stream [41, 35, 62, 4, 97, 108]. The run-through of the algorithm looks like this:
Adding number 41
MaxHeap lo: [41]           // MaxHeap stores the largest value at the top (index 0)
MinHeap hi: []             // MinHeap stores the smallest value at the top (index 0)
Median is 41
=======================
Adding number 35
MaxHeap lo: [35]
MinHeap hi: [41]
Median is 38
=======================
Adding number 62
MaxHeap lo: [41, 35]
MinHeap hi: [62]
Median is 41
=======================
Adding number 4
MaxHeap lo: [35, 4]
MinHeap hi: [41, 62]
Median is 38
=======================
Adding number 97
MaxHeap lo: [41, 35, 4]
MinHeap hi: [62, 97]
Median is 41
=======================
Adding number 108
MaxHeap lo: [41, 35, 4]
MinHeap hi: [62, 97, 108]
Median is 51.5
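The run-through can be replayed with a small driver. This is a sketch that follows the two balancing steps from the Algorithm section, using the names lo and hi from the text; the class name `TraceMedianFinder` and the helper `medians` are hypothetical, and the stream uses 4 as its fourth value, as in the trace.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class TraceMedianFinder {
    // lo: max-heap for the smaller half; hi: min-heap for the larger half
    private final PriorityQueue<Integer> lo = new PriorityQueue<>(Collections.reverseOrder());
    private final PriorityQueue<Integer> hi = new PriorityQueue<>();

    void addNum(int num) {
        lo.offer(num);         // step 1: add to lo ...
        hi.offer(lo.poll());   // ... and pass its largest element on to hi
        if (lo.size() < hi.size()) {
            lo.offer(hi.poll()); // step 2: hi grew too large, give one back
        }
    }

    // Assumes at least one number has been added.
    double findMedian() {
        if (lo.size() > hi.size()) {
            return lo.peek();
        }
        return (lo.peek() + hi.peek()) / 2.0;
    }

    // Feeds the stream one number at a time and records the median after each add.
    static List<Double> medians(int[] stream) {
        TraceMedianFinder finder = new TraceMedianFinder();
        List<Double> result = new ArrayList<>();
        for (int num : stream) {
            finder.addNum(num);
            result.add(finder.findMedian());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [41.0, 38.0, 41.0, 38.0, 41.0, 51.5]
        System.out.println(medians(new int[] {41, 35, 62, 4, 97, 108}));
    }
}
```

The driver records only the medians, not the heap contents, because `PriorityQueue` does not expose its elements in sorted order.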
Complexity Analysis
- Time complexity: $O(5 \cdot \log n) + O(1) \approx O(\log n)$.
  - At worst, there are three heap insertions and two heap deletions from the top. Each of these takes about $O(\log n)$ time.
  - Finding the median takes constant $O(1)$ time since the tops of heaps are directly accessible.
- Space complexity: $O(n)$ linear space to hold input in containers.
posted on 2018-11-08 02:21 猪猪🐷