How Hive handles data with multi-character delimiters

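Problem description:

In a big-data project for a stability-maintenance requirement, the test data supplied by the customer uses a multi-character delimiter ('|#'). Pig fails outright with an error on this data, and Hive recognizes only the first character of the delimiter.

Because the data set is fairly large (160 GB), replacing the delimiter with a single character in the text files is not practical. Two solutions to the problem are given below.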

Sample data

110|#警察

120|#医院
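To make the symptom concrete, here is a minimal standalone Java sketch of what happens when a reader honors only the first delimiter character. It does not call Hive. The class and method names are hypothetical, and treating the first character '|' as the whole delimiter is an assumption based on the behavior described above.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Hive.
class DelimiterSplitDemo {

    // Split on '|' only, mimicking a reader that honors just the
    // first character of the two-character delimiter "|#" (assumption).
    static String[] splitOnFirstCharOnly(String line) {
        return line.split(Pattern.quote("|"), -1);
    }

    // Split on the full two-character delimiter "|#".
    static String[] splitOnFullDelimiter(String line) {
        return line.split(Pattern.quote("|#"), -1);
    }

    public static void main(String[] args) {
        String line = "110|#警察";
        // The stray '#' stays glued to the second column.
        System.out.println(Arrays.toString(splitOnFirstCharOnly(line))); // [110, #警察]
        // The intended result.
        System.out.println(Arrays.toString(splitOnFullDelimiter(line))); // [110, 警察]
    }
}
```

The first call shows the kind of polluted second column one would expect from a single-character split. The second call shows the result that both solutions aim for.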

Solution 1: use RegexSerDe, the serialization/deserialization (SerDe) class bundled with Hive

add jar /home/cup/software/……/hive-contrib-0.10.0-cdh4.4.0.jar;

create table test (
  id string,
  name string
)
partitioned by (c_day string)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties (
  'input.regex' = '([^\\|#]*)\\|#([^\\|#]*)',
  'output.format.string' = '%1$s%2$s'
)
stored as textfile;

load data local inpath '/……/test.txt' overwrite into table test partition(c_day = '20141027');

select * from test;

110 警察 20141027

120 医院 20141027
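Before loading 160 GB, it can help to check the pattern against a few sample lines outside Hive. The sketch below is a standalone Java check, not the SerDe itself. The class and method names are hypothetical. It assumes that Hive hands the pattern to java.util.regex after unescaping the string literal, and that a row counts only when the whole line matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for trying the pattern on sample lines; not part of Hive.
class RegexSerDeCheck {

    // Same pattern as 'input.regex'; the Java string literal needs the
    // same doubled backslashes as the Hive string literal.
    static final Pattern INPUT_REGEX = Pattern.compile("([^\\|#]*)\\|#([^\\|#]*)");

    // Returns one entry per capture group, or null when the whole
    // line does not match the pattern.
    static String[] parse(String line) {
        Matcher m = INPUT_REGEX.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] columns = new String[m.groupCount()];
        for (int i = 0; i < columns.length; i++) {
            columns[i] = m.group(i + 1);
        }
        return columns;
    }

    public static void main(String[] args) {
        String[] ok = parse("110|#警察");
        System.out.println(ok[0] + " " + ok[1]);     // 110 警察
        // A field that itself contains '#' is rejected by the character class.
        System.out.println(parse("A#1|#x") == null); // true
    }
}
```

The second call points at a limitation of this particular pattern: the character class [^\|#] excludes a lone '|' or '#' inside a field, so such lines do not match at all. If the real data can contain those characters, the pattern would need to be loosened.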

==========================================================

Solution 2: override the corresponding InputFormat and OutputFormat methods
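The post gives no implementation for this solution. One common way to apply the idea is to have a custom record reader rewrite each line before Hive's default SerDe sees it, replacing '|#' with a single-character delimiter. The sketch below shows only that per-line rewrite as plain Java, with the Hadoop classes left out so it runs standalone. The class name, the method name, and the choice of \u0001 as the target delimiter are assumptions for illustration.

```java
// Hypothetical sketch of the per-line rewrite that a custom record reader
// could apply; it deliberately omits the Hadoop InputFormat plumbing.
class MultiDelimiterRewrite {

    // The two-character delimiter used in the source files.
    static final String SOURCE_DELIMITER = "|#";

    // Ctrl-A, Hive's default field delimiter (chosen here as an assumption).
    static final String TARGET_DELIMITER = "\u0001";

    // Literal (non-regex) replacement of every "|#" in the line.
    static String rewrite(String line) {
        return line.replace(SOURCE_DELIMITER, TARGET_DELIMITER);
    }

    public static void main(String[] args) {
        String rewritten = rewrite("110|#警察");
        // Splitting on the single-character delimiter now yields two clean fields.
        String[] fields = rewritten.split(TARGET_DELIMITER, -1);
        System.out.println(fields.length);                // 2
        System.out.println(fields[0] + " " + fields[1]);  // 110 警察
    }
}
```

In a real implementation this logic would presumably sit where the reader hands each line to Hive, and the table would be declared with the matching single-character delimiter. The rewrite happens at read time, so the 160 GB of source files stay unchanged on disk.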

posted @ 2016-01-06 15:12 blfshiye
