Post archive for June 8, 2015 - lishouguang

June 8, 2015

Summary: 1. REST Request URI: curl -XGET http://vm1:9200/customer/external/_search?q=*&pretty 2. REST Request Body 1) Query settings: curl -XPOST http://vm1:9200/customer/external/_search?pretty -d '{query: {match_all: {}},size... Read more

posted @ 2015-06-08 14:51 lishouguang Views(493) Comments(0) Recommended(0)

Common Elasticsearch commands

Summary: REST access pattern for Elasticsearch: curl -X<REST Verb> <Node>:<Port>/<Index>/<Type>/<ID> 1. Start: [es@vm1 bin]$ ./elasticsearch --cluster.name myes --node.name node1 2. Check cluster health: [es@vm1 ~]$ curl http://vm1:9200/_cat/health?v epoch timestamp cluster status node... Read more

posted @ 2015-06-08 14:50 lishouguang Views(9600) Comments(0) Recommended(0)

Flume Channel Selector

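Summary: Flume can implement fan-in and fan-out based on the Channel Selector. Events from the same data source are distributed to different destinations, as shown in the figure below. A channel selector can be defined on the source: a1.sources=r1 ... a1.channels=c1 c2 ... a1.sources.r1.selector.type=multiplexing a1.sources.r1.selector.header=t... Read more

posted @ 2015-06-08 14:48 lishouguang Views(1160) Comments(0) Recommended(0)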
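The multiplexing selector mentioned in the summary above routes each event to one or more channels according to the value of a single header. The following is a minimal Python sketch of that routing idea, not Flume code. The header name `type`, the mapping values, and the default channel are hypothetical, because the configuration in the summary is truncated.

```python
def select_channels(headers, header_name, mapping, default):
    """Return the list of channel names an event should be written to."""
    value = headers.get(header_name)
    # Events whose header value is missing or unmapped go to the default channels.
    return mapping.get(value, default)


# Hypothetical mapping, for illustration only.
MAPPING = {"a": ["c1"], "b": ["c2"], "ab": ["c1", "c2"]}
DEFAULT = ["c1"]

if __name__ == "__main__":
    for event_headers in ({"type": "a"}, {"type": "ab"}, {}):
        print(event_headers, "->", select_channels(event_headers, "type", MAPPING, DEFAULT))
```

Mapping one header value to several channels is what gives the fan-out behaviour: the same event is copied into every channel in the list.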

Flume component summary 2

Summary: Component Interface / Type Alias / Implementation Class: org.apache.flume.Channel, memory, org.apache.flume.channel.MemoryChannel; org.apache.flume.Channel, jdbc, org.apache.flume.channel.jdbc.JdbcChannel; org.apache.flu... Read more

posted @ 2015-06-08 14:46 lishouguang Views(379) Comments(0) Recommended(0)

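HDFS Sink tips

Summary: 1. File rolling policy. In the HDFS Sink, rolling a file means generating a file: the current file is closed and a new one is created. The rolling policy is controlled by the following properties: hdfs.rollInterval rolls files based on a time interval; the default is 30, i.e. a file is rolled every 30 seconds, and 0 disables this policy. hdfs.rollSize rolls files based on file size; the default is 1024, i.e. when the file exceeds 1024 bytes the current file is closed and a new one is created, and 0 disables this policy. hdfs.... Read more

posted @ 2015-06-08 14:44 lishouguang Views(3503) Comments(0) Recommended(0)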
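The two rolling properties named in the summary above can be read as independent triggers, either of which closes the current file. A minimal Python sketch of that decision, assuming the defaults quoted in the summary (30 seconds, 1024 bytes); the function name and the exact threshold comparison are illustrative assumptions, not Flume's implementation.

```python
def should_roll(elapsed_seconds, bytes_written, roll_interval=30, roll_size=1024):
    """Decide whether to close the current file and open a new one.

    A value of 0 disables the corresponding policy, as described for
    hdfs.rollInterval and hdfs.rollSize.
    """
    by_time = roll_interval > 0 and elapsed_seconds >= roll_interval
    by_size = roll_size > 0 and bytes_written >= roll_size
    return by_time or by_size


if __name__ == "__main__":
    print(should_roll(31, 100))                       # time trigger fires
    print(should_roll(5, 2048))                       # size trigger fires
    print(should_roll(1000, 100, roll_interval=0))    # time policy disabled
```

With the defaults, a slow stream produces many tiny files because the time trigger fires first, which is why these values are usually tuned together.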

Spooling Directory Source tips

Summary: 1. Keep the file's original name: a1.sources=r1 a1.sinks=k1 a1.sources.r1.type=spooldir .... a1.sources.r1.basenameHeader=true a1.sources.r1.basenameHeaderKey=basename ..... a1.sinks.k1.type=hdfs a1... Read more

posted @ 2015-06-08 14:42 lishouguang Views(1208) Comments(0) Recommended(0)

Log4J Appender - sending Log4J log output to an agent's source

Summary: Content that a project logs through log4j is also sent to Flume. 1. Flume side: the Flume agent is configured as follows: a1.sources=s1 a1.sinks=k1 a1.channels=c1 a1.sources.s1.channels=c1 a1.sinks.k1.channel=c1 a1.sources.s1.type=avro a1.sources.s1.bi... Read more

posted @ 2015-06-08 14:41 lishouguang Views(639) Comments(0) Recommended(0)

Flume Source examples

Summary: Flume Source examples. Avro Source: listens on an Avro port and receives event streams from external Avro clients. Paired with the Avro Sink of an upstream agent, it can form a multi-tier topology. a1.sources=s1 a1.sinks=k1 a1.channels=c1 a1.sources.s1.channels=c1 a1.sinks.k1.channel=c1 a1.sou... Read more

posted @ 2015-06-08 14:38 lishouguang Views(1910) Comments(0) Recommended(0)

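Flume component summary: source, sink, channel

Summary: Flume Source. Source type / Description: Avro Source: supports the Avro protocol (actually Avro RPC), built in. Thrift Source: supports the Thrift protocol, built in. Exec Source: produces data on standard output from a Unix command. JMS Source: reads data from a JMS system (queues, topics); tested with ActiveMQ. Spooling Directory Source: monitors data changes in a specified directory... Read more

posted @ 2015-06-08 14:35 lishouguang Views(4757) Comments(0) Recommended(1)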

Flume use cases; Flume compared with Kafka

Summary: Is Flume a good fit for your problem? If you need to ingest textual log data into Hadoop/HDFS then Flume is the right fit for your problem, full stop. For other use cases, here are some guidelines: Flum... Read more

posted @ 2015-06-08 14:33 lishouguang Views(8408) Comments(0) Recommended(0)

Hive lateral view explode

Summary: select 'hello', x from dual lateral view explode(array(1,2,3,4,5)) vt as x The result is: hello 1, hello 2, hello 3, hello 4, hello 5 Read more

posted @ 2015-06-08 14:29 lishouguang Views(279) Comments(0) Recommended(0)
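The query in the summary above joins each input row with every element of the exploded array. A minimal Python sketch of the same row expansion, for intuition only; the function name is hypothetical and this is not how Hive executes the query.

```python
def lateral_view_explode(row, array):
    """Pair one input row (a tuple) with each element of the array."""
    return [row + (x,) for x in array]


if __name__ == "__main__":
    for out_row in lateral_view_explode(("hello",), [1, 2, 3, 4, 5]):
        print(*out_row)
```

Run as a script, this prints the same five rows listed in the summary, one per array element.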

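Top N per group in Hive

Summary: Starting with version 0.11.0, Hive added the row_number, rank, and dense_rank analytic functions, which can be used to query the top values within each group after sorting. Syntax: row_number() over ([partition by col1] [order by col2]) rank() over ([partition by col1] [o... Read more

posted @ 2015-06-08 14:27 lishouguang Views(11084) Comments(1) Recommended(1)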
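The row_number approach in the summary above numbers the rows inside each partition after sorting and then keeps the rows whose number is at most N. A minimal Python sketch of that logic; the column names and sample rows are hypothetical, and ties are broken by input order here, which Hive does not guarantee.

```python
from itertools import groupby
from operator import itemgetter


def top_n_per_group(rows, group_key, order_key, n, descending=True):
    """Emulate row_number() over (partition by group_key order by order_key) <= n."""
    result = []
    ordered = sorted(rows, key=itemgetter(group_key))
    for _, group in groupby(ordered, key=itemgetter(group_key)):
        members = sorted(group, key=itemgetter(order_key), reverse=descending)
        for row_number, row in enumerate(members, start=1):
            if row_number <= n:
                result.append({**row, "rn": row_number})
    return result


if __name__ == "__main__":
    # Hypothetical sample data.
    scores = [
        {"dept": "a", "score": 1},
        {"dept": "a", "score": 3},
        {"dept": "b", "score": 2},
        {"dept": "a", "score": 2},
    ]
    for row in top_n_per_group(scores, "dept", "score", n=2):
        print(row)
```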

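Hive locks

Summary: Hive with ZooKeeper supports locking. There are two kinds of locks, shared and exclusive; when concurrency support is enabled in Hive, locking is enabled automatically. 1) Queries take a shared lock; shared locks can be held multiple times and concurrently. 2) Operations that modify a table take an exclusive lock, which blocks other queries and modifications. 3) Locks can be applied to partitions. 1. Edit hive-site.xml with the following configuration: hive.zookeeper.quorum zk1,zk2,zk3 hive.suppo... Read more

posted @ 2015-06-08 14:25 lishouguang Views(2969) Comments(0) Recommended(0)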
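The shared and exclusive rules in the summary above amount to a small compatibility check: any number of shared locks can coexist, while an exclusive lock requires that nothing else is held. A minimal Python sketch of that rule; the names are hypothetical and this does not model Hive's ZooKeeper-based lock manager.

```python
SHARED, EXCLUSIVE = "shared", "exclusive"


def can_acquire(requested, held):
    """Return True if a lock in mode `requested` is compatible with the modes in `held`."""
    if not held:
        return True
    # Only shared locks can coexist with other (shared) locks.
    return requested == SHARED and all(mode == SHARED for mode in held)


if __name__ == "__main__":
    print(can_acquire(SHARED, [SHARED, SHARED]))   # concurrent queries
    print(can_acquire(EXCLUSIVE, [SHARED]))        # a modification waits for running queries
    print(can_acquire(SHARED, [EXCLUSIVE]))        # a query waits for a modification
```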

Creating a Hive table that points to an HBase table

Summary: create [external] table t1(id int, value string) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with serdeproperties('hbase.columns.mapping'=':key,f:name'); To create a table that points to an existing HBase table, you need to use exte... Read more

posted @ 2015-06-08 14:23 lishouguang Views(338) Comments(0) Recommended(0)

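Hive Serde - CSV, TSV

Summary: CSV: hive-0.14.0 has built-in support for a CSV Serde; earlier versions need a third-party jar (https://github.com/ogrodnek/csv-serde). Suppose there is a text file a.csv (data exported from a database is usually in this format) with the following content: [hive@vm1 ~]$ more a.csv '1','zhangsan','20','beijing,shanghai,sha... Read more

posted @ 2015-06-08 14:21 lishouguang Views(2006) Comments(0) Recommended(0)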

Custom UDF

Summary: 1. Write the UDF class: package hive.udf; import org.apache.hadoop.hive.ql.exec.Description; import org.apache.hadoop.hive.ql.exec.UDF; /** * Documentation for the UDF * name is the name of the UDF * value is what desc function xx prints * extended is what desc functio... Read more

posted @ 2015-06-08 14:18 lishouguang Views(649) Comments(0) Recommended(0)

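Storing Hive data as SequenceFile

Summary: SequenceFile stores data in binary form. It can be compressed, and the compressed data remains splittable, so MapReduce can process it. The example below stores a Hive table's data as SequenceFile with compression enabled. set hive.exec.compress.output=true; # compress the MapReduce output set mapreduce.output.fileoutputformat.compress.codec... Read more

posted @ 2015-06-08 14:15 lishouguang Views(869) Comments(0) Recommended(0)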

alter table

Summary: Rename a table: alter table t1 rename to t2; Add a partition: alter table t1 add if not exists partition(xx=yy) location '/xx'; Add multiple partitions: alter table t1 add if not exists partition(x1=y1) location '/x1' partition(x2=y2) location '/x... Read more

posted @ 2015-06-08 14:12 lishouguang Views(761) Comments(0) Recommended(0)

describe command

Summary: describe can be abbreviated as desc. Tables: desc t1; desc t1 column1; desc extended t1; desc formatted t1; Databases: desc database test; Partitions: desc formatted t1 partition(xx=yy); Functions: desc function xx; desc function extended xx; Read more

posted @ 2015-06-08 14:10 lishouguang Views(680) Comments(0) Recommended(0)

show command

Summary: Databases: show databases; Tables: show tables; show tables in xxdb; show tables 'a*'; tblproperties: show tblproperties t1; Partitions: show partitions t1; show partitions t1 partition(xx=yy); Functions: show functions; Read more

posted @ 2015-06-08 14:08 lishouguang Views(328) Comments(0) Recommended(0)

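Hive bucketed tables

Summary: Bucketed tables. 1) Buckets divide the data at a finer granularity and make certain queries more efficient. 2) When data is saved, the hash of the bucketing column is taken modulo the number of buckets, and each row is placed in the corresponding bucket (file). 1. Definition: create table b1(id int, name string) clustered by (id) into 4 buckets; 2. Loading data: 1) Loading with load data succeeds and the data can be queried, but it is not bucketed. 2) i... Read more

posted @ 2015-06-08 14:07 lishouguang Views(509) Comments(0) Recommended(0)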
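The bucketing rule in the summary above (hash of the bucketing column modulo the number of buckets) can be sketched in a few lines of Python. This assumes that the hash of an int column is the integer value itself and that the hash is masked to a non-negative number before the modulo; both are stated here as assumptions about Hive's behaviour, and the function names are hypothetical.

```python
def bucket_for(key_hash, num_buckets):
    """Map a hash value to a bucket index (assumed: non-negative hash mod bucket count)."""
    return (key_hash & 0x7FFFFFFF) % num_buckets


def assign_buckets(ids, num_buckets=4):
    """Group integer ids by bucket, as for `clustered by (id) into 4 buckets`."""
    buckets = {b: [] for b in range(num_buckets)}
    for i in ids:
        # Assumption: for an int column the hash is the value itself.
        buckets[bucket_for(i, num_buckets)].append(i)
    return buckets


if __name__ == "__main__":
    for bucket, members in assign_buckets(range(1, 9)).items():
        print(bucket, members)
```

Because rows with the same id always land in the same bucket file, sampling and joins on the bucketing column can read only the relevant files.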

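Hive is schema-on-read

Summary: Hive processes big data, so it does not validate data when it is saved into a table; validation happens when the data is read, and values that do not match the format are set to NULL. The advantage of schema-on-read is that loading data is fast. Traditional databases such as MySQL and Oracle are schema-on-write: data that does not match the format cannot be written. Read more

posted @ 2015-06-08 14:06 lishouguang Views(1851) Comments(0) Recommended(1)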
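The schema-on-read behaviour described in the summary above can be illustrated with a small parser that accepts any stored line and applies the schema only when reading, turning values that do not fit into NULL. A minimal Python sketch; the function name, delimiter, and sample lines are hypothetical.

```python
def read_row(line, schema, delimiter=","):
    """Parse one stored line against a schema (a list of type constructors).

    Values that are missing or cannot be converted become None,
    mirroring the NULL that Hive returns for malformed fields.
    """
    fields = line.split(delimiter)
    row = []
    for index, cast in enumerate(schema):
        try:
            row.append(cast(fields[index]))
        except (ValueError, IndexError):
            row.append(None)
    return row


if __name__ == "__main__":
    schema = [int, str]
    for stored_line in ("1,zhangsan", "abc,lisi", "7"):
        print(read_row(stored_line, schema))
```

Nothing is rejected at write time, which is why loading is fast; the cost is that bad data only shows up as NULL when it is queried.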

Hive command-line options

Summary: 1. hive -h: show help. 2. hive -h hiveserverhost -p port: connect to a remote Hive server. 3. hive --define a=1 --hivevar b=1 --hiveconf hive.cli.print.current.db=true: see "Setting variables in Hive". 4. hive -e "show tables"; runs a HiveQL statement directly. h... Read more

posted @ 2015-06-08 14:05 lishouguang Views(962) Comments(0) Recommended(0)

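Setting variables in Hive

Summary: hive --define --hivevar --hiveconf set 1. The hivevar namespace holds user-defined variables. hive -d name=zhangsan; hive --define name=zhangsan; hive -d a=1 -d b=2 has the same effect as hivevar: hive --hivevar a=1 --hivevar b=2. When referencing a variable in the hivevar namespace, the variable name may be prefixed with hivev... Read more

posted @ 2015-06-08 14:04 lishouguang Views(15055) Comments(0) Recommended(0)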
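The variable references in the summary above are resolved by textual substitution before a statement runs. A minimal Python sketch of that substitution for the hivevar namespace, assuming the `${hivevar:name}` and `${name}` reference forms; the function name is hypothetical, and other namespaces such as hiveconf are deliberately left untouched.

```python
import re

# Matches ${name} and ${hivevar:name}; references to other namespaces do not match.
_REFERENCE = re.compile(r"\$\{(?:hivevar:)?(\w+)\}")


def substitute(sql, hivevars):
    """Replace hivevar references in `sql` with their values."""
    def replace(match):
        name = match.group(1)
        # Unknown variables are left as written.
        return hivevars.get(name, match.group(0))
    return _REFERENCE.sub(replace, sql)


if __name__ == "__main__":
    variables = {"a": "1", "b": "2"}  # as if started with: hive --hivevar a=1 --hivevar b=2
    print(substitute("select ${hivevar:a}, ${b} from t1", variables))
```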

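Hive sorting: order by, sort by, distribute by, cluster by

Summary: order by: order by performs a global sort and is affected by hive.mapred.mode. Using order by has some restrictions: 1. In strict mode (hive.mapred.mode=strict), order by must be used together with limit (?). Reason: when executing order by, Hive uses a single re... Read more

posted @ 2015-06-08 14:03 lishouguang Views(567) Comments(0) Recommended(0)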

Common Hive configuration

Summary: 1. Set Hive's root directory on HDFS: hive.metastore.warehouse.dir /hive 2. Set the location of the Derby database files (pin the location of the Derby data): javax.jdo.option.ConnectionURL jdbc:derby:;databaseName=/usr/local/bigdata/hive-0.14.0/metastore_db;c... Read more

posted @ 2015-06-08 14:02 lishouguang Views(328) Comments(0) Recommended(0)

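Hive sorting: order by, sort by, distribute by, cluster by

Summary: order by: order by performs a global sort and is affected by hive.mapred.mode. Using order by has some restrictions: 1. In strict mode (hive.mapred.mode=strict), order by must be used together with limit (?). Reason: when executing order by, Hive uses a single reducer; if the result set is large, that reducer will struggle, so you must... Read more

posted @ 2015-06-08 12:37 lishouguang Views(477) Comments(0) Recommended(0)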

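Configuring Hive to store metadata in MySQL (metastore)

Summary: By default Hive stores its metadata in a Derby database. Derby is relatively niche and allows only one open session at a time, so it is usually replaced with MySQL. 1. Edit these settings in conf/hive-site.xml: javax.jdo.option.ConnectionURL jdbc:mysql://hadoop1:3306/hive?createDatabaseIfNotExist=true ja... Read more

posted @ 2015-06-08 12:20 lishouguang Views(349) Comments(0) Recommended(0)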

.hivehistory

Summary: The current user's home directory contains a .hivehistory file that records the Hive commands the user has run, for example: [hadoop@hadoop1 hive-0.14]$ cat ~/.hivehistory show databases; quit; quit; create table pokes(foo int, bar string); load data local inpath 'examples/files/kv1... Read more

posted @ 2015-06-08 12:19 lishouguang Views(494) Comments(0) Recommended(0)

.hiverc

Summary: When you use the Hive CLI, it reads the .hiverc script, so you can put your own presets in it, for example: set hive.cli.print.current.db=true; set hive.cli.print.header=true; .hiverc can be placed in ~ (the Linux user's home directory), $HIVE_HOME/conf, or $HIVE_HOME/bin. Read more

posted @ 2015-06-08 12:17 lishouguang Views(547) Comments(0) Recommended(0)

Installing Hive

Summary: 1. Download and extract Hive. 2. Edit the files under conf: 1) Remove the .template suffix from all files. 2) Copy hive-default.xml to hive-site.xml and empty the contents of hive-site.xml: 3) Edit hive-env.sh: export JAVA_HOME=~/java/jdk1.6.0_45 export HADOOP_HOME=~/hadoop-2.2.... Read more

posted @ 2015-06-08 12:15 lishouguang Views(156) Comments(0) Recommended(0)

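Setting up a Kafka development environment

Summary: The Kafka version is kafka_2.10-0.8.2.1. 1. Maven project: configure the Kafka dependency in pom.xml (groupId org.apache.kafka, artifactId kafka_2.10, version 0.8.2.1). 2. Plain Java project: the required jars are as follows: Read more

posted @ 2015-06-08 12:03 lishouguang Views(436) Comments(0) Recommended(0)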

Kafka consumer example in Java

Summary: Implementing a Kafka consumer in Java ... Read more

posted @ 2015-06-08 12:00 lishouguang Views(21625) Comments(0) Recommended(0)

Kafka producer example in Java

Summary: Implementing a Kafka producer in Java: package com.lisg.kafkatest; import java.util.Propertie... Read more

posted @ 2015-06-08 11:59 lishouguang Views(10261) Comments(0) Recommended(0)

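Kafka cluster deployment

Summary: Three machines: vm1, vm2, vm3. 1. Deploy a ZooKeeper cluster: assume a ZooKeeper cluster is already deployed: zk1, zk2, zk3. 2. Download and extract Kafka: tar -xzvf kafka_2.10-0.8.2.1.tgz 3. Edit config/server.properties on vm1: b... Read more

posted @ 2015-06-08 11:57 lishouguang Views(508) Comments(0) Recommended(0)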

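Introduction to Kafka - from the official site

Summary: Introduction. Kafka is a distributed, partitioned, replicated commit log service. It uses a unique design and provides the functionality of a messaging system. First, some basic messaging terminology: Kafka maintains feeds of messages in categories called topics. Processes that publish messages to a Kafka topic are called producers. Processes that subscribe to topics and fetch messages are called consu... Read more

posted @ 2015-06-08 11:55 lishouguang Views(1776) Comments(0) Recommended(1)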
