1 - Mr.He多多指教

1. Table Design

1.1 Pre-Creating Regions

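By default, HBase creates a single region when a table is created. During a data import, every HBase client writes to that one region, and it is split only once it has grown large enough. One way to speed up bulk writes is to pre-create a number of empty regions, so that data written to HBase is load-balanced across the cluster according to the region boundaries.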

Interview question: how do you solve load-balancing and data-skew problems? Answer: pre-splitting the table into regions.

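import java.io.IOException;
import java.math.BigInteger;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableExistsException;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Creates the table pre-split at the given keys; returns false if it already exists.
// (logger is assumed to be a logger field of the enclosing class.)
public static boolean createTable(HBaseAdmin admin, HTableDescriptor table, byte[][] splits)
    throws IOException {
  try {
    admin.createTable(table, splits);
    return true;
  } catch (TableExistsException e) {
    logger.info("table " + table.getNameAsString() + " already exists");
    return false;
  }
}

// Divides the hex key range [startKey, endKey] into numRegions equal parts and
// returns the numRegions-1 boundaries as 16-digit, zero-padded hex strings.
public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
  byte[][] splits = new byte[numRegions - 1][];
  BigInteger lowestKey = new BigInteger(startKey, 16);
  BigInteger highestKey = new BigInteger(endKey, 16);
  BigInteger range = highestKey.subtract(lowestKey);
  BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
  // The first split key sits one increment above the start of the range.
  lowestKey = lowestKey.add(regionIncrement);
  for (int i = 0; i < numRegions - 1; i++) {
    BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
    byte[] b = String.format("%016x", key).getBytes();
    splits[i] = b;
  }
  return splits;
}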

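Estimate the total data volume in advance and pick a target size per region, for example 500 MB. Dividing the total by the region size gives the number of regions needed.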
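
The estimate is a ceiling division. A minimal sketch, assuming both sizes are given in megabytes; `RegionCountEstimate` and `regionsNeeded` are illustrative names, not part of HBase, and the 100 GB figure is an assumed example.

```java
class RegionCountEstimate {
    /** Regions needed to hold totalMb of data at regionMb per region, rounded up. */
    public static long regionsNeeded(long totalMb, long regionMb) {
        if (totalMb < 0 || regionMb <= 0) {
            throw new IllegalArgumentException("totalMb must be >= 0 and regionMb > 0");
        }
        // Integer ceiling division: a partially filled last region still counts.
        return (totalMb + regionMb - 1) / regionMb;
    }

    public static void main(String[] args) {
        // Assumed example: 100 GB (102400 MB) of data at 500 MB per region.
        System.out.println(regionsNeeded(102_400, 500)); // prints 205
    }
}
```

The result can be passed as `numRegions` to `getHexSplits`.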

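This can still leave the data unbalanced. The IP address range is divided evenly, but IP addresses do not visit the site evenly. Some IPs visit constantly and others never do, so the region that holds a frequently visiting IP ends up with far more access-log rows than the rest.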
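
The skew can be reproduced with a toy model: split a key space into equal-width regions, then count how many rows land in each. A minimal sketch, assuming two-digit hex strings as row keys instead of real IP addresses; `RegionSkewDemo` and its methods are hypothetical names. A row belongs to the region whose start key is the largest split key not greater than the row key.

```java
import java.util.Arrays;

class RegionSkewDemo {
    /** Index of the region holding rowKey: the number of split keys that are <= rowKey. */
    public static int regionIndex(String[] sortedSplits, String rowKey) {
        int pos = Arrays.binarySearch(sortedSplits, rowKey);
        // Exact match: a split key is the first key of the next region.
        // No match: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    /** Counts rows per region for the given split keys. */
    public static int[] rowsPerRegion(String[] sortedSplits, String[] rowKeys) {
        int[] counts = new int[sortedSplits.length + 1];
        for (String key : rowKeys) {
            counts[regionIndex(sortedSplits, key)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Four equal-width regions over a two-hex-digit key space.
        String[] splits = {"40", "80", "c0"};
        // Most traffic comes from a few keys near the start of the range.
        String[] rows = {"0a", "0a", "0b", "0c", "0c", "0c", "85", "f1"};
        System.out.println(Arrays.toString(rowsPerRegion(splits, rows))); // prints [6, 0, 1, 1]
    }
}
```

Equal-width boundaries put six of the eight rows into the first region, which is the imbalance described above.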

Sample the data first, then choose the pre-split points from the sample.
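
One way to do this is to sort a sample of row keys and take the split keys at evenly spaced positions, so each region receives roughly the same number of sampled rows. A minimal sketch under that assumption; `SampledSplits` and `splitPoints` are hypothetical names, and keys are ASCII strings so that Java string order matches byte order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SampledSplits {
    /** Picks up to numRegions-1 split keys at evenly spaced positions of the sorted sample. */
    public static List<String> splitPoints(List<String> sampleKeys, int numRegions) {
        if (numRegions < 1 || sampleKeys.isEmpty()) {
            throw new IllegalArgumentException("need numRegions >= 1 and a non-empty sample");
        }
        List<String> sorted = new ArrayList<>(sampleKeys);
        Collections.sort(sorted);
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            // Position of the i-th quantile boundary in the sorted sample.
            int idx = (int) ((long) i * sorted.size() / numRegions);
            String key = sorted.get(idx);
            // Skip repeats so the split keys stay strictly increasing.
            if (splits.isEmpty() || !splits.get(splits.size() - 1).equals(key)) {
                splits.add(key);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Skewed sample: most keys share the prefix "0a".
        List<String> sample = Arrays.asList(
            "0a01", "0a02", "0a03", "0a04", "0a05", "0a06", "7f01", "c001");
        System.out.println(splitPoints(sample, 4)); // prints [0a03, 0a05, 7f01]
    }
}
```

With these boundaries each of the four regions holds two of the eight sampled keys. The returned strings would still need converting to `byte[][]` before being passed to `createTable`.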

Manually split regions that have grown too large.

The earlier a region comes in row-key order, the sooner it is searched.

Caching: a cache speeds up queries but slows down inserts, updates, and deletes, because each of those operations has to be applied to the cache as well.
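
The trade-off shows up in a write-through cache, where every write touches two structures while a cached read touches only one. A minimal sketch with in-memory maps standing in for both the database and the cache; `WriteThroughCache` is a hypothetical class, not any library's API.

```java
import java.util.HashMap;
import java.util.Map;

class WriteThroughCache {
    private final Map<String, String> store = new HashMap<>(); // stands in for the database
    private final Map<String, String> cache = new HashMap<>();
    private int storeReads = 0;

    /** Reads from the cache first and falls back to the store on a miss. */
    public String get(String key) {
        String value = cache.get(key);
        if (value != null) {
            return value; // fast path: no store access
        }
        storeReads++;
        value = store.get(key);
        if (value != null) {
            cache.put(key, value);
        }
        return value;
    }

    /** Insert or update: both the store and the cache must be written. */
    public void put(String key, String value) {
        store.put(key, value);
        cache.put(key, value);
    }

    /** Delete: both the store and the cache must be cleared. */
    public void delete(String key) {
        store.remove(key);
        cache.remove(key);
    }

    /** Number of reads that had to go to the store. */
    public int storeReads() {
        return storeReads;
    }
}
```

After `put("a", "1")`, a call to `get("a")` is served from the cache without reading the store, while `put` and `delete` each perform two writes.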

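Hibernate's second-level cache has one drawback: it sits too close to the bottom layer. A cache that low in the stack is slow from the user's point of view, because every request still has to pass through the complex business logic and the presentation-layer wrapping above it.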

The best place for a cache is the client. A cache placed anywhere lower cannot meet our requirements to the fullest.

Cache as close to the user as possible.

Major compaction: merges all the StoreFiles under every Store of every region.

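It consumes a great deal of system resources and by default runs once every 24 hours (newer HBase releases default to 7 days). Change it to be triggered manually.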

posted on 2017-04-11 15:23 by Mr.He多多指教