Apache Spark - Post Category - fxjwind

SparkSQL (3.1.1) Source Code Analysis

Summary: Entry point, sql /** * Executes a SQL query using Spark, returning the result as a `DataFrame`. * This API eagerly runs DDL/DML commands, but not for SELECT que ...

posted @ 2021-05-20 15:54 fxjwind Views(468) Comments(0) Recommended(0)

Spark SQL: Relational Data Processing in Spark （SIGMOD’15）

Summary: Introduction Big data applications require a mix of processing techniques, data sources and storage formats. The earliest systems designed for these w ...

posted @ 2021-05-11 17:43 fxjwind Views(416) Comments(0) Recommended(0)

Structured Streaming Programming Guide

Summary: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html http://www.slideshare.net/databricks/a-deep-dive-into-structured-streaming Structured Streaming is a scalable and ...

posted @ 2016-08-24 16:42 fxjwind Views(810) Comments(0) Recommended(0)

Spark 2.0

Summary: Apache Spark 2.0: Faster, Easier, and Smarter http://blog.madhukaraphatak.com/categories/spark-two/ https://amplab.cs.berkeley.edu/technical-preview-of-apache-spark-2-0-easier-faster-and-smarter/ ...

posted @ 2016-05-27 14:52 fxjwind Views(697) Comments(0) Recommended(0)

Tuning Spark

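Summary: https://spark.apache.org/docs/1.2.1/tuning.html Data Serialization: serialization is a key performance factor for any distributed system. Spark uses Java serialization by default, which is relatively inefficient; Kryo serialization is recommended instead, since it is faster and more compact than Java serialization. Spark uses the Twitter chill library (Kry...

posted @ 2015-04-21 19:52 fxjwind Views(1002) Comments(1) Recommended(0)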
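As a configuration sketch of the Kryo recommendation in the Tuning Spark entry above: the snippet below switches the serializer through `SparkConf` and registers one class with Kryo. The `Click` record type, the application name, and the `local[2]` master are hypothetical choices made only for illustration; it assumes Spark 1.2 or later on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record type, present only to show class registration.
case class Click(userId: Long, url: String)

object KryoConfigSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo-config-sketch") // hypothetical application name
      .setMaster("local[2]")            // assumption: a local test run
      // Replace the default Java serializer with Kryo.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes lets Kryo avoid writing full class names per object.
      .registerKryoClasses(Array(classOf[Click]))

    val sc = new SparkContext(conf)
    try {
      // repartition forces a shuffle, so the records pass through the serializer.
      val clicks = sc.parallelize(Seq(Click(1L, "/a"), Click(2L, "/b")))
      println(clicks.repartition(2).count())
    } finally {
      sc.stop()
    }
  }
}
```

The same setting can also be supplied outside the code, for example through `spark-defaults.conf`, which keeps the serializer choice out of the application itself.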

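Spark MLlib - Decision Tree Source Code Analysis

Summary: http://spark.apache.org/docs/latest/mllib-decision-tree.html We start with decision trees because they are simple and commonly used, and current boosting and random forest methods are often built on them. For the decision tree algorithm itself, see the earlier blog post; it is essentially a greedy algorithm in which each split makes the data as ordered as possible. So how is ordered or disordered defined? Disorder, node impurity ...

posted @ 2014-12-08 14:32 fxjwind Views(6810) Comments(0) Recommended(0)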
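To make the notion of node impurity in the decision tree entry above concrete, here is a minimal sketch of the two common impurity measures for classification (Gini and entropy) and of the impurity decrease a greedy split would maximize. The abstract is truncated before it names a measure, so the choice of these two is an assumption; `ImpuritySketch` and its method names are hypothetical and not MLlib's API.

```scala
object ImpuritySketch {
  // Gini impurity of a node given per-class label counts: 1 - sum(p_k^2).
  def gini(counts: Seq[Int]): Double = {
    val total = counts.sum.toDouble
    if (total == 0) 0.0
    else 1.0 - counts.map { c => val p = c / total; p * p }.sum
  }

  // Entropy (in bits) of a node given per-class label counts: -sum(p_k * log2(p_k)).
  def entropy(counts: Seq[Int]): Double = {
    val total = counts.sum.toDouble
    if (total == 0) 0.0
    else counts.filter(_ > 0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2)
    }.sum
  }

  // Impurity decrease of a binary split: parent impurity minus the
  // size-weighted impurity of the two children.
  def gain(impurity: Seq[Int] => Double, left: Seq[Int], right: Seq[Int]): Double = {
    val parent = left.zip(right).map { case (l, r) => l + r }
    val n = parent.sum.toDouble
    if (n == 0) 0.0
    else impurity(parent) -
      (left.sum / n) * impurity(left) -
      (right.sum / n) * impurity(right)
  }
}
```

For a node with class counts `Seq(5, 5)`, `gini` returns 0.5; splitting it into two pure children `Seq(5, 0)` and `Seq(0, 5)` removes all impurity, so the gain equals the parent's impurity. A greedy builder would evaluate `gain` for each candidate split and keep the largest.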

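Spark Streaming Source Code Analysis – Checkpoint

Summary: Persistence: Streaming does nothing special here. A DStream is ultimately scheduled as jobs over each of its RDDs, so persistence is handled per RDD in the usual Spark way. The difference is that a stream is unbounded, so clearing has to be considered: in clearMetadata, expired RDDs are deleted and unpersisted at the same time. The special case is NetworkInputDStream, ...

posted @ 2014-03-12 15:30 fxjwind Views(3426) Comments(0) Recommended(0)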

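Spark Streaming Source Code Analysis – JobScheduler

Summary: First, the whole path of a job from being generated to being executed. JobGenerator periodically fires a GenerateJobs event, and each job is really a SparkContext.runJob launched for one RDD in the DStream; stream processing is simulated by calling runJob on every RDD in the DStream. //StreamingContext.scala private[streaming] val schedule...

posted @ 2014-03-10 17:02 fxjwind Views(1494) Comments(0) Recommended(0)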

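Spark Streaming Source Code Analysis – InputDStream

Summary: NetworkInputDStream is not truly streaming: data that has been read is not processed directly but is first written to blocks, and later RDDs read from those blocks to continue processing. This is how the stream is discretized. NetworkInputDStream encapsulates the logic (the Receiver thread) that reads data from the source and puts it into blocks. Something is also needed to manage NetworkInputDStream, ...

posted @ 2014-03-07 18:08 fxjwind Views(2161) Comments(4) Recommended(1)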

Spark Streaming Source Code Analysis – DStream

Summary: A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous sequence of RDDs (of the same type) representing a continuous stream of data. A DStream is essentially a discretized stream: it discretizes the stream into...

posted @ 2014-03-06 18:15 fxjwind Views(2846) Comments(0) Recommended(1)
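The idea of discretizing a stream, as described in the DStream entry above, can be illustrated without Spark. The sketch below is a toy model, not Spark Streaming's implementation: it groups timestamped events into fixed-width batches, each batch standing in for one RDD of the sequence. `Event`, `discretize`, and the millisecond batch interval are hypothetical names, and timestamps are assumed to be non-negative.

```scala
object DiscretizeSketch {
  // Hypothetical event type: a timestamp in milliseconds plus a payload.
  final case class Event(timeMs: Long, value: String)

  // Group events into fixed-width time batches. Returns (batchStartMs, payloads)
  // pairs ordered by batch start; each pair plays the role of one RDD.
  def discretize(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[String])] = {
    require(batchMs > 0, "batch interval must be positive")
    events
      .groupBy(e => (e.timeMs / batchMs) * batchMs) // start of the enclosing batch
      .toSeq
      .sortBy(_._1)
      .map { case (start, es) => (start, es.sortBy(_.timeMs).map(_.value)) }
  }
}
```

With a 1000 ms interval, events at 0, 999, 1000 and 2500 ms land in three batches starting at 0, 1000 and 2000 ms. Intervals that contain no events are simply omitted in this sketch.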

Spark Streaming Programming Guide

Summary: Reference: http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html Overview: Spark Streaming supports many kinds of stream input, such as Kafka, Flume, Twitter, ZeroMQ or plain old TCP sockets, and transform operations can be applied to them; the final data is stored in...

posted @ 2014-02-21 18:19 fxjwind Views(2685) Comments(0) Recommended(0)

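Spark Source Code Analysis -- Actual Task Execution

Summary: The example in Spark Source Code Analysis – SparkContext was analyzed only as far as sc.runJob. So how is it finally executed? The DAGScheduler splits the job into Stages, wraps them as tasksets, and submits them to the TaskScheduler; they then wait to be scheduled and finally run on an Executor. val sc = new SparkContext(……) val textFile = sc.textFile("READ...

posted @ 2014-01-21 16:38 fxjwind Views(3164) Comments(6) Recommended(0)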

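Spark Source Code Analysis – Summary Index

Summary: http://jerryshao.me/categories.html#architecture-ref http://blog.csdn.net/pelick/article/details/17222873 To understand Spark's design, the first link is sufficient; to get an overview of the overall structure of the Spark source code, the second also works. ALL Spark Source Code Analysis – SparkContext Spark Source Code Anal...

posted @ 2014-01-16 14:29 fxjwind Views(3957) Comments(0) Recommended(0)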

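Spark Source Code Analysis – Shuffle

Summary: See "详细探究Spark的shuffle实现" (A Detailed Look at Spark's Shuffle Implementation), which clearly explains how the current design came about. Hadoop: Hadoop's approach is that on the mapper side, whenever the memory buffer is nearly full, the in-memory data is first divided by partition and each partition is stored as a small file, so repeated spills of the buffer produce a large number of small files. Everything Hadoop does after that, up until reduce, is really...

posted @ 2014-01-16 11:34 fxjwind Views(7839) Comments(0) Recommended(2)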
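The mapper-side spill behavior described in the Shuffle entry above can be modeled in a few lines. This is a toy sketch, not Hadoop's or Spark's code: a bounded buffer is filled with records, and each time it fills, its contents are divided by partition and every non-empty partition becomes one small "file". `SpillSketch`, the hash partitioning rule, and the tuple representation of a file are assumptions made for illustration.

```scala
object SpillSketch {
  // One spilled file: (spill index, partition id, records written to it).
  type SpillFile = (Int, Int, Seq[(String, Int)])

  // Assumed partitioner: non-negative hash of the key modulo the partition count.
  private def partitionOf(key: String, numPartitions: Int): Int =
    ((key.hashCode % numPartitions) + numPartitions) % numPartitions

  // Fill a buffer of bufferSize records; on every fill, split the buffer by
  // partition and emit one small file per non-empty partition.
  def spill(records: Seq[(String, Int)], numPartitions: Int, bufferSize: Int): Seq[SpillFile] = {
    require(numPartitions > 0 && bufferSize > 0, "sizes must be positive")
    records.grouped(bufferSize).zipWithIndex.flatMap { case (buffer, spillIdx) =>
      buffer
        .groupBy { case (key, _) => partitionOf(key, numPartitions) }
        .toSeq
        .sortBy(_._1)
        .map { case (partition, recs) => (spillIdx, partition, recs) }
    }.toList
  }
}
```

Four records with two partitions yield four files when the buffer holds two records but only two files when it holds four, which is the small-file growth the abstract points at: the file count scales with the number of spills times the number of partitions.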

Spark Source Code Analysis – SparkEnv

Summary: SparkEnv is created in two places. Because it contains many important modules, such as BlockManager, SparkEnv is important. On the Driver side, SparkEnv is created when SparkContext is initialized // Create the Spark execution environment (cache, map output tracker, etc) ...

posted @ 2014-01-13 10:54 fxjwind Views(2639) Comments(10) Recommended(0)

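Spark Source Code Analysis – Checkpoint

Summary: Checkpoint (CP) steps: 1. First, if an RDD needs to be checkpointed, call RDD.checkpoint() to mark it. As the code comment says, the mark must be set before the job is executed (the reason is covered later), and it is best to persist the RDD, otherwise its contents have to be recomputed when the CP file is written. Once an RDD has been checkpointed, all of its dependencies are cleared, because the RDD can now be read directly from the file and there is no need to keep the...

posted @ 2014-01-10 18:24 fxjwind Views(3377) Comments(7) Recommended(0)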
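The first checkpoint step in the entry above (mark before the job runs, and persist to avoid recomputation) might look like the following from user code. This is a sketch under assumptions: the checkpoint directory, application name, and `local[2]` master are hypothetical, and Spark must be on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("checkpoint-sketch") // hypothetical application name
      .setMaster("local[2]")           // assumption: a local test run
    val sc = new SparkContext(conf)
    try {
      // A checkpoint directory must be set before any RDD is marked.
      sc.setCheckpointDir("/tmp/spark-checkpoint-sketch") // hypothetical path

      val squares = sc.parallelize(1 to 100).map(x => x * x)

      // Persist first so writing the checkpoint does not recompute the RDD.
      squares.persist()
      // Only marks the RDD; nothing is written until a job runs on it.
      squares.checkpoint()

      // The first action runs the job and then materializes the checkpoint.
      squares.count()

      // After checkpointing, the lineage is replaced by the checkpoint data.
      println(squares.isCheckpointed)
      println(squares.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

Comparing `toDebugString` before and after the action is a quick way to see the dependency chain being cut, which is the clearing of dependencies the abstract describes.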

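Spark Source Code Analysis – BlockManager

Summary: See "Spark源码分析之-Storage模块" (Spark Source Code Analysis: the Storage Module). On storage: why does Spark need a storage module? To cache RDDs. A defining feature of Spark is that an RDD can be cached in memory or on disk; an RDD is made up of partitions, which correspond to blocks. So the storage module implements persistence of RDDs in memory and on disk. First, every node has a Bloc...

posted @ 2014-01-10 11:19 fxjwind Views(4982) Comments(2) Recommended(0)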

Spark Source Code Analysis – BlockManagerMaster&Slave

Summary: BlockManagerMaster merely maintains a set of interfaces to BlockManagerMasterActor; everything is obtained from BlockManagerMasterActor via tell and askDriverWithReply. A class of rather marginal value. private[spark] class BlockManagerMaster(var driverActor: ActorRef) ex...

posted @ 2014-01-10 11:03 fxjwind Views(2493) Comments(2) Recommended(0)

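Spark Source Code Analysis -- BlockStore

Summary: BlockStore: an abstract interface class. The key get and put operations each come in two versions: serialized (putBytes, getBytes) and non-serialized (putValues, getValues). putValues returns a PutResult, whose data may be an Iterator or a ByteBuffer. private[spark] case class PutResult(size: Long, data: Either...

posted @ 2014-01-09 17:48 fxjwind Views(1298) Comments(0) Recommended(0)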

Spark Source Code Analysis – Executor

Summary: ExecutorBackend: a very simple interface. package org.apache.spark.executor /** * A pluggable interface used by the Executor to send updates to the cluster scheduler. */ private[spark] trait ExecutorBackend { def s...

posted @ 2014-01-07 16:52 fxjwind Views(1942) Comments(0) Recommended(0)
