Spark notes
1. Opening Spark from IPython
from pyspark import SparkContext
sc = SparkContext('local', 'pyspark')
To use pyspark from an IPython Notebook, see http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/
2.
Spark's internal data flow is a DAG: nodes are RDDs and edges are transformations. Transformations are lazy; they only execute, in a single pass, when an action is called.
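This lazy behavior can be mimicked in plain Python with generators (an analogy to illustrate the idea, not the pyspark API; the names `trace` and `log` are invented for the example): building the pipeline does nothing, and work only happens when a terminal call like `list()` consumes it, just as transformations only run when an action is called.

```python
# Plain-Python analogy for Spark's lazy evaluation (not the pyspark API).
log = []

def trace(x):
    log.append(x)  # record when an element is actually processed
    return x * 2

doubled = (trace(x) for x in range(3))  # "transformation": nothing runs yet
assert log == []                        # still lazy, no element processed

result = list(doubled)                  # "action": triggers execution
assert result == [0, 2, 4]
assert log == [0, 1, 2]                 # elements were processed only now
```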
3. Transformations
integer_RDD = sc.parallelize(range(10), 3)  # split into 3 partitions
integer_RDD.collect()                       # show all elements
integer_RDD.glom().collect()                # show elements grouped by partition

text_RDD = sc.textFile("~")  # create from a local file or an HDFS file
pairs_RDD = text_RDD.flatMap(lambda line: line.split()) \
                    .map(lambda word: (word, 1))
# flatMap: return a new RDD by first applying a function to all elements
# of this RDD, and then flattening the results.
# map: return a new RDD by applying a function to each element of this RDD.
# So flatMap is a map followed by a flattening of the results.

wordcounts_RDD = pairs_RDD.reduceByKey(lambda a, b: a + b)  # merge values that share a key

filter(func)                             # keep only elements where func is true
sample(withReplacement, fraction, seed)  # take a random fraction of the data
coalesce(numPartitions)                  # merge partitions down to numPartitions
repartition(numPartitions)               # like coalesce, but shuffles all data, so it can increase or decrease the number of partitions

groupByKey()  # (K, V) pairs => (K, iterable of all V)
for k, v in pairs_RDD.groupByKey().collect():
    print("Key:", k, ", Values:", list(v))
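The flatMap → map → reduceByKey word-count pipeline above can be simulated in plain Python (a sketch of the semantics, not the pyspark API; the sample `lines` data is invented) to see why flatMap flattens while map does not:

```python
from itertools import chain
from collections import defaultdict

lines = ["to be or", "not to be"]

# flatMap: map each line to a list of words, then flatten the lists
words = list(chain.from_iterable(line.split() for line in lines))
# map: turn each word into a (word, 1) pair
pairs = [(word, 1) for word in words]
# reduceByKey(lambda a, b: a + b): sum the counts for each key
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

assert words == ["to", "be", "or", "not", "to", "be"]  # flattened, not nested
assert dict(counts) == {"to": 2, "be": 2, "or": 1, "not": 1}
```

With map instead of flatMap, `words` would be the nested `[["to", "be", "or"], ["not", "to", "be"]]`, and the (word, 1) pairing would fail.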
4. Actions
collect()                 # copy all elements to the driver
take(n)                   # copy the first n elements to the driver
reduce(func)              # aggregate elements with func (takes 2 elements, returns 1)
saveAsTextFile(filename)  # save to a local file or HDFS
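reduce(func) has the same contract as Python's functools.reduce: func takes two elements and returns one, and is applied repeatedly until a single value remains (a plain-Python illustration of the semantics; note that Spark additionally expects func to be commutative and associative, since partitions are combined in arbitrary order):

```python
from functools import reduce

# RDD.reduce(lambda a, b: a + b) over the elements 1..10 behaves like:
total = reduce(lambda a, b: a + b, range(1, 11))
assert total == 55  # sum of 1..10
```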
5.
RDD.cache()  # keep the RDD in memory
config = sc.broadcast({"order": 3, "filter": True})  # read-only variable shipped once to every worker
accum = sc.accumulator(0)  # shared counter: workers add to it, the driver reads it

def test_accum(x):  # the original definition is cut off here; a minimal reconstruction:
    accum.add(x)

sc.parallelize(range(10)).foreach(test_accum)
accum.value  # the sum accumulated across all workers
