Posts by category - spark
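Summary: https://mp.weixin.qq.com/s/bGXhC9hvDj4lzK7wYYHGDg We currently use Filebeat to watch the directories where logs are written and collect the new log entries, ship them to a Logstash cluster, and feed them into a Kafka topic; Spark Streaming then parses them in real time and writes the parsed results to a Redis cache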
Summary: Transforming key/value RDDs. groupBy: aggregate by key. fruit,apple vegetable,cucumber fruit,cherry vegetable,bean fruit,banana vegetable,pepper sc.textFile("D:\\LearnSpark\\win\
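The grouping described above can be sketched without a cluster. This is a minimal plain-Python illustration, not the post's own code (which is truncated in the summary): it assumes the six records are separate `key,value` lines and reproduces the per-key grouping that a group-by-key step over the resulting pairs would yield. The helper name `group_by_key` is hypothetical.

```python
from collections import defaultdict

# The six sample records from the summary, assumed to be one "key,value" pair per line.
LINES = [
    "fruit,apple",
    "vegetable,cucumber",
    "fruit,cherry",
    "vegetable,bean",
    "fruit,banana",
    "vegetable,pepper",
]


def group_by_key(lines):
    """Split each line into (key, value) and collect the values per key."""
    groups = defaultdict(list)
    for line in lines:
        key, value = line.split(",", 1)
        groups[key].append(value)
    return dict(groups)


grouped = group_by_key(LINES)
for key, values in grouped.items():
    print(key, values)
```

In Spark the same shape would come from mapping each line to a pair and grouping on the key; the values within a group are not guaranteed to keep input order there, whereas this sketch keeps it.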
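Summary: When part of an RDD's data is lost, Spark uses the recorded lineage to find that RDD's parent RDD and its further ancestors. Recomputing the upstream RDDs that the RDD depends on is enough to recover it. Directed Acyclic Graph (DAG): building the directed acyclic graph of RDDs means repeatedly taking the series of RDD transformations in the Spark code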
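The recovery idea can be illustrated with a toy model. The sketch below is plain Python and only an assumption-laden analogy: `ToyRDD` is a hypothetical class, not a Spark API, and it recomputes a whole dataset, whereas Spark recomputes only the lost partitions. Each object records its parent and the function that derives it, so a lost result can be rebuilt by walking back up the chain.

```python
class ToyRDD:
    """Toy stand-in for an RDD: remembers its parent and how it is derived."""

    def __init__(self, parent=None, fn=None, data=None):
        self.parent = parent   # upstream ToyRDD, or None for a source dataset
        self.fn = fn           # transformation applied to the parent's rows
        self._source = data    # only set for source datasets
        self._cache = None     # materialised rows; None means "lost or not yet computed"

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def compute(self):
        """Return the rows, rebuilding them from the lineage if they are missing."""
        if self._cache is None:
            if self.parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self.fn(self.parent.compute())
        return self._cache

    def lose(self):
        """Simulate losing the materialised rows."""
        self._cache = None


source = ToyRDD(data=[1, 2, 3, 4])
doubled = source.map(lambda x: x * 2)
big = doubled.filter(lambda x: x > 4)

first = big.compute()
big.lose()
doubled.lose()
recovered = big.compute()  # rebuilt by replaying map and filter from the source
print(first, recovered)
```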
Summary: spark join: broadcast the ad features
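The summary gives no code, so the following is only a plain-Python sketch of the general idea behind a broadcast (map-side) join: the small table of ad features is made available to every partition, and each partition joins its own rows by dictionary lookup instead of shuffling the large dataset. All names and sample values (`AD_FEATURES`, `PARTITIONS`, `join_partition`) are hypothetical.

```python
# Hypothetical small dimension table: ad_id -> feature vector.
AD_FEATURES = {"ad1": [0.1, 0.9], "ad2": [0.7, 0.3]}

# Hypothetical large fact data, already split into partitions of (user_id, ad_id).
PARTITIONS = [
    [("u1", "ad1"), ("u2", "ad2")],
    [("u3", "ad1"), ("u4", "ad9")],  # ad9 has no features and is dropped (inner join)
]


def join_partition(rows, features):
    """Map-side join: look each ad_id up in the local copy of the small table."""
    return [(user, ad, features[ad]) for user, ad in rows if ad in features]


# In Spark the small table would be shipped to the executors once, for example
# as a broadcast variable; here every partition simply reads the same dict.
joined = [row for part in PARTITIONS for row in join_partition(part, AD_FEATURES)]
for row in joined:
    print(row)
```

This only pays off when the broadcast side is small enough to fit comfortably in each executor's memory.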
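Summary: Hadoop: iteration is expensive, because every iteration launches a complete MapReduce job. Spark: its primary goal is to avoid excessive network and disk I/O overhead during computation. Resilient Distributed Datasets http://www.cs.cmu.edu/~pavlo/courses/fall2013/static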
Summary: [Spark Performance Tuning] Chapter 4: Inside JVM Memory Usage and Configuration in Spark Shuffle - 無情 - 博客园 https://www.cnblogs.com/jcchoiling/p/6494652.html
Summary: http://192.168.2.51:4041 http://hadoop1:8088/proxy/application_1512362707596_0006/executors/ Executors Executors Executors Show Additional Metrics Sum
Summary: http://192.168.2.51:4040/executors/ http://192.168.2.51:4040/executors/ ssh://root@192.168.2.51:22/usr/bin/python -u /root/.pycharm_helpers/pydev/pyde
Summary: Fixing the ClosedChannelException raised when submitting a job with Spark on Yarn_Server Applications_Linux公社 - Linux system portal http://www.linuxidc.com/Linux/2017-01/140068.htm <property> <name>yarn.nodeman
Summary: Apache Hadoop 2.9.0 – YARN Commands http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YarnCommands.html
Summary: https://spark.apache.org/docs/latest/cluster-overview.html
Summary: https://databricks.com/blog/2014/08/14/mining-graph-data-with-spark-at-alibaba-taobao.html
Summary: http://spark.apache.org/docs/latest/sql-programming-guide.html
Summary: https://github.com/mongodb/mongo-spark
Summary: [hadoop@hadoop1 bin]$ ./spark-shell --packages org.mongodb.spark:mongo-spark-connector_2.10-2.2.1 Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Provided Maven C...
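The exception text is cut off in the summary, but it appears to concern the form of the Maven coordinate given to `--packages`, which Spark expects as `groupId:artifactId:version`. The sketch below is a hypothetical plain-Python check of that three-part form (it is not Spark's own validation code). The "fixed" coordinate, with a colon before the version, is an assumption about what was intended rather than something the summary confirms.

```python
def parse_maven_coordinate(coordinate):
    """Split 'groupId:artifactId:version' into its parts, or raise ValueError."""
    parts = coordinate.split(":")
    if len(parts) != 3 or not all(part.strip() for part in parts):
        raise ValueError("expected 'groupId:artifactId:version', got: " + coordinate)
    return tuple(parts)


# Coordinate exactly as typed in the summary: only one colon, so two parts.
as_typed = "org.mongodb.spark:mongo-spark-connector_2.10-2.2.1"
# Assumed intended form: version separated by a colon instead of a hyphen.
assumed_fix = "org.mongodb.spark:mongo-spark-connector_2.10:2.2.1"

try:
    parse_maven_coordinate(as_typed)
    typed_ok = True
except ValueError as err:
    typed_ok = False
    print("rejected:", err)

fixed_parts = parse_maven_coordinate(assumed_fix)
print("accepted:", fixed_parts)
```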
Summary: Performance optimization notes http://www.mongoing.com/wp-content/uploads/2016/08/MDBSH2016/TJ_MongoDB+Spark.pdf MongoDB + Spark: A Complete Big Data Solution | MongoDB中文社区 http://www.mongoing
Summary: Start Hadoop: cd /usr/local/hadoop/hadoop $hadoop namenode -format # format the namenode before starting $./sbin/start-all.sh Check whether startup succeeded: [hadoop@hadoop1 hadoop]$ jps 16855 NodeManager 16999 Jps 16090 NameNode 16570 Resource...
Summary: https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/network https://github.com/apache/spark/blob/master/core/src/main/sca
Summary: Apache Spark is built around a distributed collection of immutable Java Virtual Machine (JVM) objects called Resilient Distributed Datasets (RDDs for
