随笔分类 - BigData
摘要:Logstash 6.6.2 官方:https://www.elastic.co/products/logstash 一 简介 Centralize, Transform & Stash Your Data Logstash is an open source, server-side data p
阅读全文
摘要:应用一:kafka数据同步到kudu 1 准备kafka topic # bin/kafka-topics.sh --zookeeper $zk:2181/kafka -create --topic test_sync --partitions 2 --replication-factor 2 WA
阅读全文
摘要:hadoop.security.authentication: Kerberos -> Simple hadoop.security.authorization: true -> false dfs.datanode.address: -> from 1004 (for Kerberos) to 5
阅读全文
摘要:presto 0.217 官方:http://prestodb.github.io/ 一 简介 Presto is an open source distributed SQL query engine for running interactive analytic queries against
阅读全文
摘要:应用一:mysql数据增量同步到kafka 1 准备mysql测试表 mysql> create table test_sync(id int not null auto_increment, name varchar(32), description varchar(64), create_tim
阅读全文
摘要:问题:spark中如果有两个DataFrame(或者DataSet),DataFrameA依赖DataFrameB,并且两个DataFrame都进行了cache,将DataFrameB unpersist之后,DataFrameA的cache也会失效,官方解释如下: When invalidatin
阅读全文
摘要:1 compress & mr hive默认的execution engine是mr hive> set hive.execution.engine;hive.execution.engine=mr 所以针对mr的优化就是hive的优化,比如压缩和临时目录 mapred-site.xml <prop
阅读全文
摘要:tpc 官方:http://www.tpc.org/ 一 简介 The TPC is a non-profit corporation founded to define transaction processing and database benchmarks and to disseminat
阅读全文
摘要:hive 2.3.4 on spark 2.4.0 Hive on Spark provides Hive with the ability to utilize Apache Spark as its execution engine. set hive.execution.engine=spar
阅读全文
摘要:1 hive # kadmin.local -q 'ktadd -k /tmp/hive3.keytab -norandkey hive/server03@TEST.COM'# kinit -kt /tmp/hive3.keytab hive/server03@TEST.COM# klist # b
阅读全文
摘要:部署方式:docker+airflow+mysql+LocalExecutor 使用airflow的docker镜像 https://hub.docker.com/r/puckel/docker-airflow 使用默认的sqlite+SequentialExecutor启动: $ docker r
阅读全文
摘要:定期清理cloudera manager server的磁盘空间 1 停止Service Monitor和Host Monitor 2 删除日志 # /bin/rm /var/lib/cloudera-host-monitor/ts/*/partition*/* -rf# /bin/rm /var/
阅读全文
摘要:建表语句 CREATE EXTERNAL TABLE `my_lzo_table`(`something` string)ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS INPUTFORMAT 'com.hadoop.mapred.D
阅读全文
摘要:HiBench 7官方:https://github.com/intel-hadoop/HiBench 一 简介 HiBench is a big data benchmark suite that helps evaluate different big data frameworks in te
阅读全文
摘要:1 下载解压 https://spark.apache.org/downloads.html $ wget http://mirrors.shu.edu.cn/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz $ tar xvf spark
阅读全文
摘要:hive建表后直接将数据文件拷贝到table目录下,select * 可以查到数据,但是select count(1) 一直返回0,这个是因为hive中有个配置 hive.stats.autogather=true Enables automated gathering of table-level
阅读全文
摘要:hadoop部署参考:https://www.cnblogs.com/barneywill/p/10428098.html 1 拷贝到所有服务器上并解压 # ansible all-servers -m copy -a 'src=/src/path/to/apache-hive-2.3.4-bin.
阅读全文
摘要:环境:3结点集群 192.168.0.1192.168.0.2192.168.0.3 1 配置root用户服务期间免密登录 参考:https://www.cnblogs.com/barneywill/p/10271679.html 2 安装ansible 参考:https://www.cnblogs
阅读全文
摘要:Cluster OS&Platform redhat/centos7, docker, mesos, cloudera manager(cdh) Checklist 1 check user & password & network reachability, make sure everythin
阅读全文
摘要:kerberos5-1.17 官方:https://kerberos.org/ 一 简介 The Kerberos protocol is designed to provide reliable authentication over open and insecure networks wher
阅读全文

浙公网安备 33010602011771号