C++ Review: STL (Part 2): Choosing Between the Associative Container set and a Sorted vector - Painful

1. What set promises

First, here is how the C++ standard describes set:

A set is a kind of associative container that supports unique keys (contains at most one
of each key value) and provides for fast retrieval of the keys themselves. Set supports 
bidirectional iterators

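Two facts about set are worth keeping in mind:

set gives explicit performance guarantees: set::find and set::insert both take O(log N) time. So if you really need insertion and lookup to stay within O(log N), set may be a good choice.
set is usually implemented as a red-black tree, which carries extra space and time overhead. Every node has to store a color flag plus pointers to its children and its parent; insertion requires rebalancing the tree; lookup and traversal have to chase pointers.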
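
To make the two guarantees concrete, here is a minimal sketch of set::insert and set::find. The helper names insert_unique and contains are made up for this illustration and are not part of the standard library.

```cpp
#include <cassert>
#include <set>
#include <utility>

// Hypothetical helper: returns true if the key was newly inserted,
// false if an equal key was already stored. O(log N).
bool insert_unique(std::set<int>& s, int key)
{
    std::pair<std::set<int>::iterator, bool> result = s.insert(key);
    assert(*result.first == key);   // the iterator points at the stored key either way
    return result.second;
}

// Hypothetical helper: membership test through set::find. O(log N).
bool contains(const std::set<int>& s, int key)
{
    return s.find(key) != s.end();
}
```

Because keys are unique, a second insert of the same value leaves the set unchanged and reports false.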

2. Another option: binary search

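A red-black tree is not the only data structure that offers O(log N) search. The obvious alternative is binary_search over a sorted sequence. That algorithm also performs O(log N) comparisons, and all it asks of the data structure is that it be a sorted sequence supporting a few required operations; with random-access iterators, as in vector, the whole search is O(log N). Its constant factor is also smaller than set's. A set uses more memory, is less cache-friendly, and may cause more page faults.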
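
binary_search only answers yes or no. When the position of the element is needed, std::lower_bound is the usual building block. The sketch below assumes a vector<int> sorted in ascending order; find_sorted is a hypothetical name chosen for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: returns the index of key in a sorted vector, or -1 if absent.
// std::lower_bound needs O(log N) comparisons on random-access iterators.
long find_sorted(const std::vector<int>& v, int key)
{
    // Precondition check; it is O(N) but only runs when assertions are enabled.
    assert(std::is_sorted(v.begin(), v.end()));
    std::vector<int>::const_iterator it = std::lower_bound(v.begin(), v.end(), key);
    if (it == v.end() || key < *it)
        return -1;                       // first element not less than key is not key
    return static_cast<long>(it - v.begin());
}
```

The test `key < *it` is what separates "found" from "found the insertion point".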

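Facts speak louder than words. The following code compares the performance of set's lookup against binary search. The build environment is WOW64, Release configuration.

Note: a real program must check whether the element was actually found; that check is omitted here. The push_back calls could also be replaced with range member functions, which may perform better.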

#include <iostream>
#include <set>
#include <vector>
#include <algorithm>
#include <windows.h>
int main()
{
  using namespace std;
  const int MAX_ELEMENT = 1000000;
  const int MAX_SIZE = MAX_ELEMENT+1;
  const int MAX_TIMES = 10000000;
  set<int> intSet;
  vector<int> intVect;
  intVect.reserve(MAX_SIZE);
  for (int i = 0; i < MAX_SIZE; ++i) 
  {
    intSet.insert(i);
    intVect.push_back(i);    // inserting in increasing order leaves intVect already sorted
  }
  DWORD st1, st2;
  st1 = GetTickCount();
  for (int i = 0; i < MAX_TIMES; ++i) 
  {
    intSet.find(MAX_ELEMENT);
    intSet.find(MAX_ELEMENT/2);
    intSet.find(MAX_ELEMENT/15);
  }
  st2 = GetTickCount();
  cout << "intSet.find(...):\t" << (st2 - st1) << "ms" << endl;
  st1 = GetTickCount();
  for (int i = 0; i < MAX_TIMES; ++i) 
  {
    binary_search(intVect.begin(), intVect.end(), MAX_ELEMENT);
    binary_search(intVect.begin(), intVect.end(), MAX_ELEMENT/2);
    binary_search(intVect.begin(), intVect.end(), MAX_ELEMENT/15);
  }
  st2 = GetTickCount();
  cout << "binary_search(...):\t" << (st2 - st1) << "ms" << endl;
  return 0;
}
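
The listing above relies on GetTickCount, which is Windows-only, and it discards every lookup result. An optimizing compiler may be allowed to drop calls whose results are never used, so a more defensive harness counts the hits. The following is only a sketch under those assumptions, using std::chrono from C++11; BenchResult, run_lookups and fill_both are hypothetical names, not part of the original benchmark.

```cpp
#include <algorithm>   // std::binary_search, for callers of run_lookups
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <set>
#include <vector>

struct BenchResult {          // hypothetical result type for this sketch
    std::size_t hits;         // number of lookups that found their key
    long long   millis;       // elapsed wall-clock time in milliseconds
};

// Calls lookup(key) for every key, `rounds` times over, and counts the hits
// so that each lookup has an observable effect.
template <typename Lookup>
BenchResult run_lookups(Lookup lookup, const std::vector<int>& keys, int rounds)
{
    assert(rounds >= 0);
    typedef std::chrono::steady_clock clock_type;
    std::size_t hits = 0;
    const clock_type::time_point start = clock_type::now();
    for (int r = 0; r < rounds; ++r)
        for (std::size_t k = 0; k < keys.size(); ++k)
            if (lookup(keys[k]))
                ++hits;
    const clock_type::time_point stop = clock_type::now();
    BenchResult result;
    result.hits = hits;
    result.millis = static_cast<long long>(
        std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count());
    return result;
}

// Fills both containers with 0..n-1 using range operations instead of a push_back loop.
void fill_both(std::size_t n, std::vector<int>& vec, std::set<int>& s)
{
    vec.resize(n);
    std::iota(vec.begin(), vec.end(), 0);        // 0, 1, ..., n-1: already sorted
    s = std::set<int>(vec.begin(), vec.end());   // range construction
}
```

A caller would pass one lambda that wraps intSet.find and another that wraps std::binary_search, then compare the two millis values. Equal hits counts confirm that both containers answered the same queries.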

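Screenshot of the run:

As you can see, because of caching, page faults, and other factors, binary search and set lookup differ greatly in performance. A profiler can give more detailed information. Below is a screenshot from Task Manager; the program is STL.exe*32.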

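3. Summary: three choices

1. Use set when the number of elements may grow large enough (that is, N is large enough that the difference between log N and N really matters), elements are inserted in random order, and insertions and lookups are interleaved so that the next operation cannot be predicted.

2. Use sorted_vector when you need fast lookup and traversal but insertion performance matters little; or when the elements are inserted once up front and then sorted, with binary searches done afterwards; or when memory is tight; or when you are sure that lookups are almost never interleaved with insertions and deletions; or when elements arrive "almost sorted", so that the extra cost of each insertion is small.

3. Use a hash-table-based set: if the hash function is good enough and the table is sized appropriately, it usually provides constant-time lookup.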
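
Choices 2 and 3 can be sketched briefly. The first function shows the "insert everything first, then sort once" pattern for a plain vector. The second assumes a C++11 standard library, where std::unordered_set is the hash-based set. Both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_set>
#include <vector>

// Choice 2: sort once and drop duplicates, O(N log N) in total,
// instead of N separate O(N) insertions into an always-sorted vector.
void sort_and_dedupe(std::vector<int>& v)
{
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
    assert(std::is_sorted(v.begin(), v.end()));
}

// Choice 3: hash-based lookup, O(1) on average and O(N) in the worst case.
bool contains_hashed(const std::unordered_set<int>& s, int key)
{
    return s.find(key) != s.end();
}
```

After sort_and_dedupe the vector is ready for std::binary_search or std::lower_bound. Note that unordered_set gives up ordered traversal, which set and a sorted vector both provide.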

4. A simple sorted_vector implementation

#pragma once
#include <vector>
#include <algorithm>
#include <functional>
template <typename T, typename Pred = std::less<T> > class sorted_vector
{
public:
  typedef typename std::vector<T>::iterator    iterator;
  typedef typename std::vector<T>::const_iterator  const_iterator;
  typedef typename std::vector<T>::size_type    size_type;
  iterator      begin()      { return sort_vect.begin();    }
  const_iterator    begin()  const  { return sort_vect.cbegin();  }
  iterator      end()      { return sort_vect.end();    }
  const_iterator    end()  const  { return sort_vect.cend();    }
  void reserve(size_type sz)      { sort_vect.reserve(sz);    }
  //Other wrapper help methods
  //...
  //Reserve res_size up front to avoid reallocations where possible
  explicit sorted_vector(size_type res_size, const Pred& p = Pred()) : sort_vect(), pred(p) 
  {
    sort_vect.reserve(res_size);
  }
  //Note: duplicates in [first, last) are kept; only insert() enforces uniqueness
  template <typename InputIterator>
  sorted_vector(InputIterator first, InputIterator last, const Pred& p = Pred())
    : sort_vect(first, last), pred(p)
  {
    std::sort(begin(), end(), pred);
  }
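  //This container is always sorted
  //O(N): the search is O(logN), but shifting the elements after the insertion point is linear
  //Returns an iterator to the inserted element, or to the equivalent element already present
  iterator insert(const T& elem)
  {
    iterator it = std::lower_bound(begin(), end(), elem, pred);
    if (it == end() || pred(elem, *it))
      it = sort_vect.insert(it, elem);    //insert invalidates the old iterator; use the one it returns
    return it;
  }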
  //The elements in the container should not be modified!!
  //O(logN)
  const_iterator find(const T& elem) const 
  {
    const_iterator it = std::lower_bound(begin(), end(), elem, pred);
    return it == end() || pred(elem, *it) ? end() :  it;
  }
private:
  std::vector<T> sort_vect;
  Pred pred;
};
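
The class above offers no way to remove an element. As a sketch of how the same lower_bound pattern extends to removal, here are two free functions on a plain std::vector<int> sorted with operator<. The names insert_sorted and erase_sorted are hypothetical and not part of the class.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical free function: inserts key if absent, keeping v sorted and unique.
// Returns true if the key was inserted. Search is O(log N); shifting elements is O(N).
bool insert_sorted(std::vector<int>& v, int key)
{
    std::vector<int>::iterator it = std::lower_bound(v.begin(), v.end(), key);
    if (it != v.end() && !(key < *it))
        return false;                    // an equivalent key is already present
    v.insert(it, key);
    assert(std::is_sorted(v.begin(), v.end()));
    return true;
}

// Hypothetical free function: removes key if present.
// Returns true if an element was erased. Also O(N) because of the shift.
bool erase_sorted(std::vector<int>& v, int key)
{
    std::vector<int>::iterator it = std::lower_bound(v.begin(), v.end(), key);
    if (it == v.end() || key < *it)
        return false;                    // key is not in the vector
    v.erase(it);
    return true;
}
```

Both operations move every element after the affected position, which is why the summary recommends a sorted vector only when modifications are rare.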

The test code is below; it is a small modification of the earlier benchmark:

sorted_vector<int> sortVect(MAX_SIZE);
  for (int i = 0; i < MAX_SIZE; ++i) 
  {
    intSet.insert(i);
    intVect.push_back(i);    // inserting in increasing order leaves intVect already sorted
    sortVect.insert(i);
  }
  cout << *(sortVect.find(MAX_ELEMENT/2)) << endl;    // safe only because MAX_ELEMENT/2 was inserted above
  DWORD st1, st2;
  st1 = GetTickCount();
  for (int i = 0; i < MAX_TIMES; ++i) 
  {
    sortVect.find(MAX_ELEMENT);
    sortVect.find(MAX_ELEMENT/2);
    sortVect.find(MAX_ELEMENT/15);
  }
  st2 = GetTickCount();    // stop the timer before printing
  cout << "sortVect.find(...):\t" << (st2 - st1) << "ms" << endl;

Screenshot of the run:

References: Effective STL; Why you shouldn't use set

Posted on 2011-08-19 22:44 by Painful