Spark Notes

1. Launching Spark in IPython

from pyspark import SparkContext
sc = SparkContext('local', 'pyspark')

To use PySpark in an IPython notebook, see http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/

2. Lazy evaluation
Spark's internal data flow forms a DAG: nodes are RDDs and edges are transformations. Transformations are lazy; they only execute, all at once, when an action is called.
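Lazy evaluation can be illustrated without Spark at all: Python generators behave the same way. This is only an analogy, not PySpark code:

```python
def double_all(xs):
    # like a transformation: describes the work without doing it
    for x in xs:
        yield x * 2

gen = double_all(range(4))   # nothing has executed yet
result = list(gen)           # like an action: forces the whole pipeline to run
print(result)                # [0, 2, 4, 6]
```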
3. Transformations
integer_RDD = sc.parallelize(range(10), 3)  # split into 3 partitions
integer_RDD.collect()  # return all elements
integer_RDD.glom().collect()  # show elements grouped by partition
text_RDD = sc.textFile("~")  # create from a local file or an HDFS file
pairs_RDD = text_RDD.flatMap(lambda line: line.split()).map(lambda word: (word, 1))
# flatMap: return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
# map: return a new RDD by applying a function to each element of this RDD.
# So flatMap is a map followed by flattening the results.
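The difference between map and flatMap can be seen with plain Python lists (a pure-Python analogy, not RDD code):

```python
lines = ["to be or", "not to be"]

# map: one output element per input element, so a list of lists
mapped = [line.split() for line in lines]
print(mapped)     # [['to', 'be', 'or'], ['not', 'to', 'be']]

# flatMap: apply the same function, then flatten one level
flattened = [word for line in lines for word in line.split()]
print(flattened)  # ['to', 'be', 'or', 'not', 'to', 'be']
```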
wordcounts_RDD = pairs_RDD.reduceByKey(lambda a, b: a + b)  # combine values that share the same key
filter(func)  #keep only elements where func is true
sample(withReplacement, fraction, seed)  #get a random data fraction
coalesce(numPartitions)  #merge partitions to reduce them to numPartitions
groupByKey   #(K, V) pairs => (K, iterable of all V)
for k, v in pairs_RDD.groupByKey().collect():
    print("Key:", k, ", Values:", list(v))
repartition(numPartitions)  #similar to coalesce, shuffles all data to increase or decrease number of partitions to numPartitions
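The semantics of groupByKey and reduceByKey can be sketched with a plain Python dict (an analogy only; real RDDs do this in parallel across partitions):

```python
pairs = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]

# groupByKey: (K, V) pairs -> (K, list of all V)
grouped = {}
for k, v in pairs:
    grouped.setdefault(k, []).append(v)
print(grouped)  # {'to': [1, 1], 'be': [1, 1], 'or': [1], 'not': [1]}

# reduceByKey: fold the values of each key with the given function
counts = {k: sum(vs) for k, vs in grouped.items()}
print(counts)   # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In real Spark code, reduceByKey is usually preferred over groupByKey for aggregations, because it combines values within each partition before shuffling.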

4. Actions
collect()  #copy all elements to the driver
take(n)   #copy first n elements
reduce(func)   #aggregate elements with func (takes 2 elements, returns 1)
saveAsTextFile(filename)  #save to local file or HDFS
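reduce(func) behaves like Python's functools.reduce, repeatedly combining two elements into one (pure-Python analogy; Spark additionally requires func to be commutative and associative so partitions can be reduced independently):

```python
from functools import reduce

values = [1, 2, 3, 4]
total = reduce(lambda a, b: a + b, values)  # ((1 + 2) + 3) + 4
print(total)  # 10
```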

5. Caching, broadcast variables, accumulators

RDD.cache()  # keep the RDD in memory
config = sc.broadcast({"order": 3, "filter": True})  # ship a read-only value to every worker once
accum = sc.accumulator(0)
def test_accum(x):  # the original definition was cut off; a typical body just adds to the accumulator
    accum.add(x)


posted @ 2016-05-10 14:19  zhangm215