BigData - Post Category (Page 5) - 匠人先生

Big Data Basics: Logstash (1) Introduction, Installation, and Usage

Summary: Logstash 6.6.2. Official site: https://www.elastic.co/products/logstash I. Introduction: Centralize, Transform & Stash Your Data. Logstash is an open source, server-side data p...

posted @ 2019-03-19 16:12 匠人先生 Views (410) Comments (0) Recommended (0)

Big Data Basics: Flume (2) Application: kafka-kudu

Summary: Application 1: syncing Kafka data to Kudu. 1 Prepare the Kafka topic # bin/kafka-topics.sh --zookeeper $zk:2181/kafka -create --topic test_sync --partitions 2 --replication-factor 2 WA...

posted @ 2019-03-16 17:43 匠人先生 Views (1398) Comments (1) Recommended (0)

Experience Sharing (40): Disabling Kerberos on HDFS

Summary: hadoop.security.authentication: Kerberos -> Simple; hadoop.security.authorization: true -> false; dfs.datanode.address: -> from 1004 (for Kerberos) to 5...

posted @ 2019-03-15 22:39 匠人先生 Views (718) Comments (0) Recommended (0)

Big Data Basics: Presto (1) Introduction, Installation, and Usage

Summary: presto 0.217. Official site: http://prestodb.github.io/ I. Introduction: Presto is an open source distributed SQL query engine for running interactive analytic queries against...

posted @ 2019-03-14 12:11 匠人先生 Views (3389) Comments (0) Recommended (0)

Big Data Basics: Logstash (2) Application: mysql-kafka

Summary: Application 1: incrementally syncing MySQL data to Kafka. 1 Prepare the MySQL test table: mysql> create table test_sync(id int not null auto_increment, name varchar(32), description varchar(64), create_tim...

posted @ 2019-03-13 22:41 匠人先生 Views (746) Comments (0) Recommended (1)

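Experience Sharing (39): Cascading Behavior of spark cache unpersist

Summary: Problem: in Spark, if there are two DataFrames (or Datasets) where DataFrameA depends on DataFrameB and both have been cached, then after DataFrameB is unpersisted, DataFrameA's cache is invalidated as well. The official explanation is as follows: When invalidatin...

posted @ 2019-03-13 17:52 匠人先生 Views (1626) Comments (0) Recommended (0)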

Big Data Basics: Hive (5) Performance Tuning

Summary: 1 compress & mr. Hive's default execution engine is mr: hive> set hive.execution.engine; hive.execution.engine=mr. So tuning mr is tuning Hive, for example compression and the temporary directory. mapred-site.xml <prop...

posted @ 2019-03-12 20:38 匠人先生 Views (3744) Comments (0) Recommended (0)

Big Data Basics: Benchmark (2) TPC-DS

Summary: tpc. Official site: http://www.tpc.org/ I. Introduction: The TPC is a non-profit corporation founded to define transaction processing and database benchmarks and to disseminat...

posted @ 2019-03-05 22:55 匠人先生 Views (6649) Comments (1) Recommended (1)

Big Data Basics: Hive (5) hive on spark

Summary: hive 2.3.4 on spark 2.4.0. Hive on Spark provides Hive with the ability to utilize Apache Spark as its execution engine. set hive.execution.engine=spar...

posted @ 2019-03-05 18:42 匠人先生 Views (4215) Comments (0) Recommended (0)

Big Data Basics: Kerberos (2) Accessing hive, impala, and hdfs

Summary: 1 hive # kadmin.local -q 'ktadd -k /tmp/hive3.keytab -norandkey hive/server03@TEST.COM' # kinit -kt /tmp/hive3.keytab hive/server03@TEST.COM # klist # b...

posted @ 2019-03-02 15:02 匠人先生 Views (647) Comments (0) Recommended (0)

Operations Basics: Docker (5) Deploying airflow with docker

Summary: Deployment mode: docker+airflow+mysql+LocalExecutor. Uses the airflow docker image https://hub.docker.com/r/puckel/docker-airflow. Starting with the default sqlite+SequentialExecutor: $ docker r...

posted @ 2019-03-01 10:59 匠人先生 Views (3739) Comments (0) Recommended (0)

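Experience Sharing (37): Cleaning Up Disk Space in CM

Summary: Periodically clean up disk space on the cloudera manager server. 1 Stop Service Monitor and Host Monitor. 2 Delete the logs: # /bin/rm /var/lib/cloudera-host-monitor/ts/*/partition*/* -rf # /bin/rm /var/...

posted @ 2019-02-27 09:45 匠人先生 Views (1904) Comments (0) Recommended (1)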

Experience Sharing (35): LZO Format Support

Summary: Table creation statement: CREATE EXTERNAL TABLE `my_lzo_table`(`something` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS INPUTFORMAT 'com.hadoop.mapred.D...

posted @ 2019-02-26 18:24 匠人先生 Views (2145) Comments (0) Recommended (1)

Big Data Basics: Benchmark (1) HiBench

Summary: HiBench 7. Official site: https://github.com/intel-hadoop/HiBench I. Introduction: HiBench is a big data benchmark suite that helps evaluate different big data frameworks in te...

posted @ 2019-02-26 11:45 匠人先生 Views (1952) Comments (0) Recommended (1)

Big Data Basics: Spark (9) Spark Deployment Modes: yarn/mesos

Summary: 1 Download and extract: https://spark.apache.org/downloads.html $ wget http://mirrors.shu.edu.cn/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz $ tar xvf spark...

posted @ 2019-02-25 18:37 匠人先生 Views (1041) Comments (0) Recommended (1)

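Experience Sharing (33): hive select count Returns 0

Summary: After creating a Hive table and copying the data files directly into the table directory, select * returns the data, but select count(1) always returns 0. This is caused by a Hive setting: hive.stats.autogather=true. Enables automated gathering of table-level...

posted @ 2019-02-25 15:51 匠人先生 Views (3570) Comments (1) Recommended (1)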

Big Data Basics: Hive (3) Minimal Portable Deployment

Summary: For Hadoop deployment, see: https://www.cnblogs.com/barneywill/p/10428098.html 1 Copy to all servers and extract: # ansible all-servers -m copy -a 'src=/src/path/to/apache-hive-2.3.4-bin....

posted @ 2019-02-25 11:26 匠人先生 Views (276) Comments (0) Recommended (1)

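Big Data Basics: Hadoop (2) Minimal Portable Deployment of hdfs and yarn

Summary: Environment: 3-node cluster 192.168.0.1 192.168.0.2 192.168.0.3. 1 Configure passwordless login between servers for the root user, see: https://www.cnblogs.com/barneywill/p/10271679.html 2 Install ansible, see: https://www.cnblogs...

posted @ 2019-02-25 11:14 匠人先生 Views (332) Comments (0) Recommended (1)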

Big Data Basics: Cluster Setup

Summary: Cluster OS&Platform: redhat/centos7, docker, mesos, cloudera manager(cdh). Checklist: 1 check user & password & network reachability, make sure everythin...

posted @ 2019-02-22 23:33 匠人先生 Views (318) Comments (0) Recommended (0)

Big Data Basics: Kerberos (1) Introduction, Installation, and Usage

Summary: kerberos5-1.17. Official site: https://kerberos.org/ I. Introduction: The Kerberos protocol is designed to provide reliable authentication over open and insecure networks wher...

posted @ 2019-02-19 00:35 匠人先生 Views (1289) Comments (0) Recommended (1)

Thinking in BigData
