Post category - spark

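Summary: https://mp.weixin.qq.com/s/bGXhC9hvDj4lzK7wYYHGDg Currently, we use Filebeat to watch the directories where logs are written, collect the new logs, ship them to a Logstash cluster, and feed them into a Kafka topic; Spark Streaming then parses them in real time and writes the parsed results to a Redis cache Read more
posted @ 2019-02-03 22:47 papering Views(300) Comments(0) Recommended(0)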
Summary: Spark: statistical analysis of real-time visitors Read more
posted @ 2018-05-28 12:52 papering Views(164) Comments(0) Recommended(0)
Summary: Transforming key/value RDDs. groupBy aggregates by key. Sample data: fruit,apple vegetable,cucumber fruit,cherry vegetable,bean fruit,banana vegetable,pepper sc.textFile("D:\\LearnSpark\\win\ Read more
posted @ 2018-05-26 11:01 papering Views(205) Comments(0) Recommended(0)
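The key-based grouping described in the entry above can be sketched in plain Python. This is only a rough illustration, not the post's Spark code: it uses the standard library alone, the six sample lines come from the summary, and the helper name `group_by_key` is hypothetical.

```python
from collections import defaultdict

# Sample "key,value" lines taken from the summary above.
LINES = [
    "fruit,apple",
    "vegetable,cucumber",
    "fruit,cherry",
    "vegetable,bean",
    "fruit,banana",
    "vegetable,pepper",
]


def group_by_key(lines):
    """Split each line into (key, value) and collect the values per key.

    Hypothetical helper: it mimics the result shape of grouping a
    key/value RDD by key, but runs locally with no Spark involved.
    """
    groups = defaultdict(list)
    for line in lines:
        key, value = line.split(",", 1)
        groups[key].append(value)
    return dict(groups)


if __name__ == "__main__":
    for key, values in group_by_key(LINES).items():
        print(key, values)
```

In Spark itself the grouping is distributed across partitions and involves a shuffle; the sketch only shows what ends up under each key.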
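Summary: When part of an RDD's data is lost, Spark uses the recorded lineage to find that RDD's parent RDD and its further ancestors. Recomputing only the ancestor RDDs that the RDD depends on is enough to recover it. Directed Acyclic Graph (DAG): the process of building the RDD directed acyclic graph is to keep taking the series of RDD trans Read more
posted @ 2018-05-23 20:03 papering Views(384) Comments(0) Recommended(0)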
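The lineage-based recovery described in the entry above can be illustrated with a small local sketch. It is not Spark's implementation: the class `LineageNode` and its methods are hypothetical names, and the sketch assumes that each node has at most one parent and that transformations are deterministic.

```python
class LineageNode:
    """Hypothetical stand-in for an RDD.

    Each node records how it was derived (its parent and a
    transformation), so lost data can be rebuilt by recomputation.
    """

    def __init__(self, parent=None, transform=None, source=None):
        self.parent = parent        # upstream node, or None for a source node
        self.transform = transform  # function applied to the parent's data
        self.source = source        # raw input, used only by source nodes
        self.cache = None           # materialized data, may be lost

    def compute(self):
        """Return this node's data, recomputing from lineage if needed."""
        if self.cache is not None:
            return self.cache
        if self.parent is None:
            data = list(self.source)
        else:
            # Walk up the lineage: the parent recomputes itself if needed.
            data = self.transform(self.parent.compute())
        self.cache = data
        return data

    def lose_data(self):
        """Simulate losing the materialized data of this node."""
        self.cache = None


if __name__ == "__main__":
    base = LineageNode(source=[1, 2, 3])
    doubled = LineageNode(parent=base, transform=lambda xs: [x * 2 for x in xs])
    print(doubled.compute())
    doubled.lose_data()
    print(doubled.compute())  # rebuilt from the recorded lineage
```

Storing the derivation rather than a replica of the data is the design choice being illustrated: recovery costs a recomputation instead of extra storage.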
Summary: Spark join: broadcasting the ad features Read more
posted @ 2018-05-20 22:26 papering Views(214) Comments(0) Recommended(0)
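Summary: Hadoop: iteration is expensive, because every iteration launches a complete MapReduce job. Spark: the primary goal is to avoid excessive network and disk I/O overhead during computation. Resilient Distributed Datasets http://www.cs.cmu.edu/~pavlo/courses/fall2013/static Read more
posted @ 2018-05-19 07:38 papering Views(210) Comments(0) Recommended(0)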
Summary: [Spark Performance Tuning] Chapter 4: Inside JVM Memory Usage and Configuration in Spark Shuffle - 無情 - 博客园 https://www.cnblogs.com/jcchoiling/p/6494652.html Read more
posted @ 2017-12-04 19:23 papering Views(154) Comments(0) Recommended(0)
Summary: http://192.168.2.51:4041 http://hadoop1:8088/proxy/application_1512362707596_0006/executors/ Executors Executors Executors Show Additional Metrics Sum Read more
posted @ 2017-12-04 13:13 papering Views(433) Comments(0) Recommended(0)
Summary: http://192.168.2.51:4040/executors/ http://192.168.2.51:4040/executors/ ssh://root@192.168.2.51:22/usr/bin/python -u /root/.pycharm_helpers/pydev/pyde Read more
posted @ 2017-12-03 21:50 papering Views(441) Comments(0) Recommended(0)
Summary: Fix for the ClosedChannelException reported when submitting a job with Spark on YARN_Server Applications_Linux公社 - Linux system portal site http://www.linuxidc.com/Linux/2017-01/140068.htm <property> <name>yarn.nodeman Read more
posted @ 2017-12-03 21:37 papering Views(1163) Comments(0) Recommended(0)
Summary: Apache Hadoop 2.9.0 – YARN Commands http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YarnCommands.html Read more
posted @ 2017-12-01 14:28 papering Views(163) Comments(0) Recommended(0)
Summary: https://spark.apache.org/docs/latest/cluster-overview.html Read more
posted @ 2017-12-01 13:41 papering Views(167) Comments(0) Recommended(0)
Summary: https://databricks.com/blog/2014/08/14/mining-graph-data-with-spark-at-alibaba-taobao.html Read more
posted @ 2017-11-29 12:15 papering Views(195) Comments(0) Recommended(0)
Summary: http://spark.apache.org/docs/latest/sql-programming-guide.html Read more
posted @ 2017-11-24 08:53 papering Views(134) Comments(0) Recommended(0)
Summary: https://github.com/mongodb/mongo-spark Read more
posted @ 2017-11-23 20:32 papering Views(176) Comments(0) Recommended(0)
Summary: [hadoop@hadoop1 bin]$ ./spark-shell --packages org.mongodb.spark:mongo-spark-connector_2.10-2.2.1 Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven C... Read more
posted @ 2017-11-23 20:18 papering Views(1876) Comments(0) Recommended(0)
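The coordinate in the command above has only two colon-separated fields, whereas `--packages` expects the three-part form `groupId:artifactId:version`. The sketch below is a hedged local check of that shape only. The function name `split_maven_coordinate` is hypothetical, and the three-part coordinate used in the demo is an assumed correction that has not been checked against any repository.

```python
def split_maven_coordinate(coordinate):
    """Split a coordinate into (groupId, artifactId, version).

    Hypothetical helper: it checks only the three-part shape and does
    not verify that the artifact exists anywhere.
    """
    parts = coordinate.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            "expected groupId:artifactId:version, got: " + coordinate
        )
    return tuple(parts)


if __name__ == "__main__":
    # Coordinate exactly as typed in the command above: two fields only.
    as_typed = "org.mongodb.spark:mongo-spark-connector_2.10-2.2.1"
    # Assumed correction: a colon instead of a hyphen before the version.
    assumed_fix = "org.mongodb.spark:mongo-spark-connector_2.10:2.2.1"

    for candidate in (as_typed, assumed_fix):
        try:
            print(split_maven_coordinate(candidate))
        except ValueError as err:
            print("rejected:", err)
```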
Summary: Performance optimization notes http://www.mongoing.com/wp-content/uploads/2016/08/MDBSH2016/TJ_MongoDB+Spark.pdf MongoDB + Spark: a complete big data solution | MongoDB中文社区 http://www.mongoing Read more
posted @ 2017-11-23 17:09 papering Views(768) Comments(0) Recommended(0)
Summary: Start Hadoop: cd /usr/local/hadoop/hadoop $hadoop namenode -format # format the namenode before starting $./sbin/start-all.sh Check whether startup succeeded: [hadoop@hadoop1 hadoop]$ jps 16855 NodeManager 16999 Jps 16090 NameNode 16570 Resource... Read more
posted @ 2017-11-23 16:40 papering Views(239) Comments(0) Recommended(0)
Summary: https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/network https://github.com/apache/spark/blob/master/core/src/main/sca Read more
posted @ 2017-11-20 19:39 papering Views(447) Comments(0) Recommended(0)
Summary: Apache Spark is built around a distributed collection of immutable Java Virtual Machine (JVM) objects called Resilient Distributed Datasets (RDDs for Read more
posted @ 2017-11-20 17:12 papering Views(247) Comments(0) Recommended(0)