Spark Notes

1. Launching Spark in IPython

from pyspark import SparkContext
sc = SparkContext('local', 'pyspark')

To use PySpark in an IPython notebook, see http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/

2. Lazy evaluation
Spark's internal data flow forms a DAG: nodes are RDDs and edges are transformations. Transformations are lazy; they only execute, all at once, when an action is called.
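Lazy evaluation can be illustrated without Spark at all: Python generators behave the same way. This is only an analogy, not PySpark code:

```python
def double_all(xs):
    # like a transformation: describes the work without doing it
    for x in xs:
        yield x * 2

gen = double_all(range(4))   # nothing has executed yet
result = list(gen)           # like an action: forces the whole pipeline to run
print(result)                # [0, 2, 4, 6]
```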
3. Transformations
integer_RDD = sc.parallelize(range(10), 3)  # split into 3 partitions
integer_RDD.collect()  # return all elements
integer_RDD.glom().collect()  # show elements grouped by partition
text_RDD = sc.textFile("~")  # create from a local file or an HDFS file
pairs_RDD = text_RDD.flatMap(lambda line: line.split()).map(lambda word: (word, 1))
# flatMap: return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
# map: return a new RDD by applying a function to each element of this RDD.
# So flatMap is a map followed by flattening the results.
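The difference between map and flatMap can be seen with plain Python lists (a pure-Python analogy, not RDD code):

```python
lines = ["to be or", "not to be"]

# map: one output element per input element, so a list of lists
mapped = [line.split() for line in lines]
print(mapped)     # [['to', 'be', 'or'], ['not', 'to', 'be']]

# flatMap: apply the same function, then flatten one level
flattened = [word for line in lines for word in line.split()]
print(flattened)  # ['to', 'be', 'or', 'not', 'to', 'be']
```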
wordcounts_RDD = pairs_RDD.reduceByKey(lambda a, b: a + b)  # combine values that share the same key
filter(func)  #keep only elements where func is true
sample(withReplacement, fraction, seed)  #get a random data fraction
coalesce(numPartitions)  #merge partitions to reduce them to numPartitions
groupByKey   #(K, V) pairs => (K, iterable of all V)
for k, v in pairs_RDD.groupByKey().collect():
    print("Key:", k, ", Values:", list(v))
repartition(numPartitions)  #similar to coalesce, shuffles all data to increase or decrease number of partitions to numPartitions
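The semantics of groupByKey and reduceByKey can be sketched with a plain Python dict (an analogy only; real RDDs do this in parallel across partitions):

```python
pairs = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]

# groupByKey: (K, V) pairs -> (K, list of all V)
grouped = {}
for k, v in pairs:
    grouped.setdefault(k, []).append(v)
print(grouped)  # {'to': [1, 1], 'be': [1, 1], 'or': [1], 'not': [1]}

# reduceByKey: fold the values of each key with the given function
counts = {k: sum(vs) for k, vs in grouped.items()}
print(counts)   # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In real Spark code, reduceByKey is usually preferred over groupByKey for aggregations, because it combines values within each partition before shuffling.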

4. Actions
collect()  #copy all elements to the driver
take(n)   #copy first n elements
reduce(func)   #aggregate elements with func (takes 2 elements, returns 1)
saveAsTextFile(filename)  #save to local file or HDFS
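reduce(func) behaves like Python's functools.reduce, repeatedly combining two elements into one (pure-Python analogy; Spark additionally requires func to be commutative and associative so partitions can be reduced independently):

```python
from functools import reduce

values = [1, 2, 3, 4]
total = reduce(lambda a, b: a + b, values)  # ((1 + 2) + 3) + 4
print(total)  # 10
```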

5. Caching, broadcast variables, accumulators

RDD.cache()  # keep the RDD in memory
config = sc.broadcast({"order": 3, "filter": True})  # ship a read-only value to every worker once
accum = sc.accumulator(0)
def test_accum(x):  # the original definition was cut off; a typical body just adds to the accumulator
    accum.add(x)


posted @ 2016-05-10 14:19  zhangm215