Exploring how to generate Parquet files (queryable from both Impala and Hive) (Part 1)
https://my.oschina.net/skyim/blog/479159
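1. I won't go over Parquet's advantages here (columnar storage and good compression). For background on columnar storage, see the link below.
2. The focus here is the storage format used in our project.
3. First, create a table in Hive with the following statement: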
create external table parquet_example (
basketid bigint,
productid bigint,
quantity int,
price float,
totalbasketvalue float
) stored as parquet location '/user/hive/warehouse/parquet_example';
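The column types in this table have to line up with the Parquet message schema that the writer later passes to MessageTypeParser.parseMessageType. The sketch below derives that schema string from the column list. The class HiveToParquetSchema and its methods are hypothetical helpers, not part of Hive or Parquet, and only the three Hive types used in this table are mapped. Marking every field "required" follows the schema string used in this post.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HiveToParquetSchema {

    // Hypothetical helper: maps the Hive types used in parquet_example
    // to Parquet primitive type names. Other Hive types are not handled.
    public static String parquetType(String hiveType) {
        switch (hiveType.toLowerCase()) {
            case "bigint": return "int64";
            case "int":    return "int32";
            case "float":  return "float";
            default:
                throw new IllegalArgumentException("unmapped Hive type: " + hiveType);
        }
    }

    // Builds a Parquet message definition with every column marked "required".
    public static String buildSchema(String messageName, Map<String, String> columns) {
        StringBuilder sb = new StringBuilder("message ").append(messageName).append(" { ");
        for (Map.Entry<String, String> col : columns.entrySet()) {
            sb.append("required ")
              .append(parquetType(col.getValue()))
              .append(' ')
              .append(col.getKey())
              .append("; ");
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the columns in declaration order.
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("basketid", "bigint");
        columns.put("productid", "bigint");
        columns.put("quantity", "int");
        columns.put("price", "float");
        columns.put("totalbasketvalue", "float");
        System.out.println(buildSchema("basket", columns));
    }
}
```

Run as is, this prints the same message definition that the writer code in step 6 hard-codes, which makes it easy to check that the field names and types match the table columns.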
(Screenshot of the CREATE TABLE statement running in Hive.)

4. In the web interface, parquet_example now appears at the bottom left of the Hive page.

5. To see the table in Impala, first run the following statement in Impala:
INVALIDATE METADATA
6. Now the main task is to write Parquet files into the table:
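import java.io.IOException;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Writable;

// Assumption: the pre-Apache "parquet.*" packages shipped with older Hive releases.
// Newer releases use "org.apache.parquet.*" instead.
import parquet.hadoop.ParquetWriter;
import parquet.hadoop.metadata.CompressionCodecName;
import parquet.schema.MessageType;
import parquet.schema.MessageTypeParser;

public class BasketWriter {
    public static void main(String[] args) throws IOException {
        // "yyyy" is the calendar year; "YYYY" (week year) can be off by one around New Year.
        DateFormat dateFormat = new SimpleDateFormat("yyyyMMddHHmmss");
        new BasketWriter().generateBasketData("part_" + dateFormat.format(new Date()));
    }

    private void generateBasketData(String outFilePath) throws IOException {
        // Parquet schema matching the columns of the Hive table parquet_example.
        final MessageType schema = MessageTypeParser.parseMessageType(
            "message basket { required int64 basketid; required int64 productid; "
            + "required int32 quantity; required float price; required float totalbasketvalue; }");
        Configuration config = new Configuration();
        DataWritableWriteSupport.setSchema(schema, config);
        // HDFS path of the output file, inside the table's location.
        Path outDirPath = new Path("hdfs://192.168.0.80/user/hive/warehouse/parquet_example/" + outFilePath);
        ParquetWriter writer = new ParquetWriter(outDirPath, new DataWritableWriteSupport() {
            @Override
            public WriteContext init(Configuration configuration) {
                // Set the Hive schema property if the Configuration handed in does not carry it.
                if (configuration.get(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA) == null) {
                    configuration.set(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA, schema.toString());
                }
                return super.init(configuration);
            }
        }, CompressionCodecName.SNAPPY, 256 * 1024 * 1024, 100 * 1024); // Snappy, 256 MB block size, 100 KB page size
        int numBaskets = 1000000;
        Random numProdsRandom = new Random();
        Random quantityRandom = new Random();
        Random priceRandom = new Random();
        Random prodRandom = new Random();
        for (int i = 0; i < numBaskets; i++) {
            // Each basket gets between 7 and 29 product rows.
            int numProdsInBasket = numProdsRandom.nextInt(30);
            numProdsInBasket = Math.max(7, numProdsInBasket);
            // Basket total between 10 and 100, repeated on every row of the basket.
            float totalPrice = priceRandom.nextFloat();
            totalPrice = (float) Math.max(0.1, totalPrice) * 100;
            for (int j = 0; j < numProdsInBasket; j++) {
                // One row: basketid, productid, quantity, price, totalbasketvalue.
                Writable[] values = new Writable[5];
                values[0] = new LongWritable(i);
                values[1] = new LongWritable(prodRandom.nextInt(200000));
                values[2] = new IntWritable(quantityRandom.nextInt(10));
                values[3] = new FloatWritable(priceRandom.nextFloat());
                values[4] = new FloatWritable(totalPrice);
                ArrayWritable value = new ArrayWritable(Writable.class, values);
                writer.write(value);
            }
        }
        writer.close();
    }
}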
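The output file name is built from a SimpleDateFormat timestamp. In those patterns, lowercase y is the calendar year while uppercase Y is the week-based year, and the two disagree for a few days around New Year. A minimal sketch of the difference follows. The class FileNamePattern and its format method are illustrative names only, and the locale and time zone are pinned to Locale.US and UTC so the result does not depend on the machine.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

public class FileNamePattern {

    // Formats noon (UTC) of the given date with the given pattern.
    // Locale and time zone are fixed so the output is reproducible.
    public static String format(String pattern, int year, int month, int day) {
        TimeZone utc = TimeZone.getTimeZone("UTC");
        Calendar cal = new GregorianCalendar(utc, Locale.US);
        cal.clear();
        cal.set(year, month, day, 12, 0, 0);
        SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.US);
        fmt.setTimeZone(utc);
        return fmt.format(cal.getTime());
    }

    public static void main(String[] args) {
        // 29 December 2014 already belongs to the first week of 2015 under US week rules.
        System.out.println("part_" + format("yyyyMMddHHmmss", 2014, Calendar.DECEMBER, 29));
        // prints part_20141229120000
        System.out.println("part_" + format("YYYYMMddHHmmss", 2014, Calendar.DECEMBER, 29));
        // prints part_20151229120000
    }
}
```

For file names that should sort by the real date, the lowercase pattern is the one to use.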
7. We can now see the data that was written.

8. The written data can now be queried from Hive or Impala.
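A simple sanity check for this step is the row count. The writer emits Math.max(7, nextInt(30)) rows for each of its 1,000,000 baskets, so a count(*) on the table should land near the expected value rather than at an exact number. The sketch below works out that expectation. It assumes the writer ran once into an otherwise empty table directory, and ExpectedRowCount is an illustrative class name.

```java
public class ExpectedRowCount {

    // Expected value of Math.max(floor, k) when k is uniform over 0 .. bound-1,
    // which is what Random.nextInt(bound) produces.
    public static double expectedProductsPerBasket(int floor, int bound) {
        long sum = 0;
        for (int k = 0; k < bound; k++) {
            sum += Math.max(floor, k);
        }
        return (double) sum / bound;
    }

    public static void main(String[] args) {
        int numBaskets = 1000000;
        double perBasket = expectedProductsPerBasket(7, 30); // 463 / 30
        System.out.printf("about %.2f rows per basket, about %.0f rows in total%n",
                perBasket, perBasket * numBaskets);
    }
}
```

With the values from the writer this comes to roughly 15.43 rows per basket, or about 15.4 million rows in total. A count far from that suggests a partial write or more than one file in the directory.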


The source code is available at:
https://github.com/wangxuehui/writeparquet/
