07 Spark RDD Programming: Comprehensive Example, English Word Frequency Count
1. Implement the word frequency count yourself with PySpark.
>>> lines = sc.textFile('file:///home/hadoop/cipintongji.txt')
>>> words = lines.flatmap(lambda line: line.lower().split())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'RDD' object has no attribute 'flatmap'
>>> words = lines.flatMap(lambda line: line.lower().split())
>>> words.collect()
['1.the', 'british', 'economy', 'the', 'united', 'kingdom', 'is', 'a', 'major', 'developed', 'capitalist', 'economy.', 'it', 'is', 'the', "world's", 'sixth', 'largest', 'by', 'nominal', 'gdp', 'and', 'the', 'seventh', 'largest', 'by', 'purchasing', 'power', 'parity.[1]', 'it', 'is', 'the', 'third', 'largest', 'economy', 'in', 'europe', 'after', "germany's", 'and', "france's", 'in', 'nominal', 'terms,', 'and', 'the', 'third', 'largest', 'after', "germany's", 'and', "russia's", 'in', 'terms', 'of', 'purchasing', 'power', 'parity.[1]', 'its', 'gdp', 'ppp', 'per', 'capita', 'is', 'the', '18th', 'highest', 'in', 'the', 'world.[1]', 'the', 'united', 'kingdom', 'is', 'also', 'a', 'member', 'of', 'the', 'g8,', 'the', 'commonwealth', 'of', 'nations,', 'the', 'organisation', 'for', 'economic', 'co-operation', 'and', 'development,', 'the', 'world', 'trade', 'organisation,', 'and', 'the', 'european', 'union.', 'the', 'uk', 'was', 'the', 'first', 'country', 'in', 'the', 'world', 'to', 'industrialise', 'in', 'the', '18th', 'and', '19th', 'centuries,', 'and', 'for', 'much', 'of', 'the', '19th', 'century', 'possessed', 'a', 'predominant', 'role', 'in', 'the', 'global', 'economy.', 'however,', 'by', 'the', 'late', '19th', 'century,', 'the', 'second', 'industrial', 'revolution', 'in', 'the', 'united', 'states', 'meant', 'the', 'us', 'had', 'begun', 'to', 'challenge', "britain's", 'role', 'as', 'the', 'leader', 'of', 'the', 'global', 'economy.', 'the', 'extensive', 'war', 'efforts', 'of', 'both', 'world', 'wars', 'in', 'the', '20th', 'century', 'and', 'the', 'dismantlement', 'of', 'the', 'british', 'empire', 'also', 'weakened', 'the', 'uk', 'economy', 'in', 'global', 'terms,', 'and', 'by', 'that', 'time', 'britain', 'had', 'been', 'superseded', 'by', 'the', 'united', 'states', 'as', 'the', 'chief', 'player', 'in', 'the', 'global', 'economy.', 'at', 'the', 'start', 'of', 'the', '21st', 'century', 'however,', 'the', 'uk', 'still', 'possesses', 'a', 'significant', 'role', 'in', 'the', 
'global', 'economy,', 'due', 'to', 'its', 'large', 'gross', 'domestic', 'product', 'and', 'the', 'financial', 'importance', 'that', 'its', 'capital,', 'london,', 'possesses', 'in', 'the', 'world.', 'the', 'united', 'kingdom', 'is', 'one', 'of', 'the', "world's", 'most', 'globalised', 'countries.', 'the', 'capital,', 'london', '(see', 'economy', 'of', 'london),', 'is', 'a', 'major', 'financial', 'centre', 'for', 'international', 'business', 'and', 'commerce', 'and', 'is', 'one', 'of', 'three', '"command', 'centres"', 'for', 'the', 'global', 'economy', '(along', 'with', 'new', 'york', 'city', 'and', 'tokyo).[4]', 'the', 'british', 'economy', 'is', 'made', 'up', '(in', 'descending', 'order', 'of', 'size)', 'of', 'the', 'economies', 'of', 'england,', 'scotland,', 'wales', 'and', 'northern', 'ireland.', 'in', '1973,', 'the', 'uk', 'acceded', 'to', 'the', 'european', 'economic', 'community', 'which', 'is', 'now', 'known', 'as', 'the', 'european', 'union', 'after', 'the', 'ratification', 'of', 'the', 'treaty', 'of', 'maastricht', 'in', '1993.', '(from', 'wikipedia,', 'the', 'free', 'encyclopedia)']
>>> words_ = words.map(lambda s : (s,1))
>>> words_.collect()
[('1.the', 1), ('british', 1), ('economy', 1), ('the', 1), ('united', 1), ('kingdom', 1), ('is', 1), ('a', 1), ('major', 1), ('developed', 1), ('capitalist', 1), ('economy.', 1), ('it', 1), ('is', 1), ('the', 1), ("world's", 1), ('sixth', 1), ('largest', 1), ('by', 1), ('nominal', 1), ('gdp', 1), ('and', 1), ('the', 1), ('seventh', 1), ('largest', 1), ('by', 1), ('purchasing', 1), ('power', 1), ('parity.[1]', 1), ('it', 1), ('is', 1), ('the', 1), ('third', 1), ('largest', 1), ('economy', 1), ('in', 1), ('europe', 1), ('after', 1), ("germany's", 1), ('and', 1), ("france's", 1), ('in', 1), ('nominal', 1), ('terms,', 1), ('and', 1), ('the', 1), ('third', 1), ('largest', 1), ('after', 1), ("germany's", 1), ('and', 1), ("russia's", 1), ('in', 1), ('terms', 1), ('of', 1), ('purchasing', 1), ('power', 1), ('parity.[1]', 1), ('its', 1), ('gdp', 1), ('ppp', 1), ('per', 1), ('capita', 1), ('is', 1), ('the', 1), ('18th', 1), ('highest', 1), ('in', 1), ('the', 1), ('world.[1]', 1), ('the', 1), ('united', 1), ('kingdom', 1), ('is', 1), ('also', 1), ('a', 1), ('member', 1), ('of', 1), ('the', 1), ('g8,', 1), ('the', 1), ('commonwealth', 1), ('of', 1), ('nations,', 1), ('the', 1), ('organisation', 1), ('for', 1), ('economic', 1), ('co-operation', 1), ('and', 1), ('development,', 1), ('the', 1), ('world', 1), ('trade', 1), ('organisation,', 1), ('and', 1), ('the', 1), ('european', 1), ('union.', 1), ('the', 1), ('uk', 1), ('was', 1), ('the', 1), ('first', 1), ('country', 1), ('in', 1), ('the', 1), ('world', 1), ('to', 1), ('industrialise', 1), ('in', 1), ('the', 1), ('18th', 1), ('and', 1), ('19th', 1), ('centuries,', 1), ('and', 1), ('for', 1), ('much', 1), ('of', 1), ('the', 1), ('19th', 1), ('century', 1), ('possessed', 1), ('a', 1), ('predominant', 1), ('role', 1), ('in', 1), ('the', 1), ('global', 1), ('economy.', 1), ('however,', 1), ('by', 1), ('the', 1), ('late', 1), ('19th', 1), ('century,', 1), ('the', 1), ('second', 1), ('industrial', 1), ('revolution', 1), ('in', 1), 
('the', 1), ('united', 1), ('states', 1), ('meant', 1), ('the', 1), ('us', 1), ('had', 1), ('begun', 1), ('to', 1), ('challenge', 1), ("britain's", 1), ('role', 1), ('as', 1), ('the', 1), ('leader', 1), ('of', 1), ('the', 1), ('global', 1), ('economy.', 1), ('the', 1), ('extensive', 1), ('war', 1), ('efforts', 1), ('of', 1), ('both', 1), ('world', 1), ('wars', 1), ('in', 1), ('the', 1), ('20th', 1), ('century', 1), ('and', 1), ('the', 1), ('dismantlement', 1), ('of', 1), ('the', 1), ('british', 1), ('empire', 1), ('also', 1), ('weakened', 1), ('the', 1), ('uk', 1), ('economy', 1), ('in', 1), ('global', 1), ('terms,', 1), ('and', 1), ('by', 1), ('that', 1), ('time', 1), ('britain', 1), ('had', 1), ('been', 1), ('superseded', 1), ('by', 1), ('the', 1), ('united', 1), ('states', 1), ('as', 1), ('the', 1), ('chief', 1), ('player', 1), ('in', 1), ('the', 1), ('global', 1), ('economy.', 1), ('at', 1), ('the', 1), ('start', 1), ('of', 1), ('the', 1), ('21st', 1), ('century', 1), ('however,', 1), ('the', 1), ('uk', 1), ('still', 1), ('possesses', 1), ('a', 1), ('significant', 1), ('role', 1), ('in', 1), ('the', 1), ('global', 1), ('economy,', 1), ('due', 1), ('to', 1), ('its', 1), ('large', 1), ('gross', 1), ('domestic', 1), ('product', 1), ('and', 1), ('the', 1), ('financial', 1), ('importance', 1), ('that', 1), ('its', 1), ('capital,', 1), ('london,', 1), ('possesses', 1), ('in', 1), ('the', 1), ('world.', 1), ('the', 1), ('united', 1), ('kingdom', 1), ('is', 1), ('one', 1), ('of', 1), ('the', 1), ("world's", 1), ('most', 1), ('globalised', 1), ('countries.', 1), ('the', 1), ('capital,', 1), ('london', 1), ('(see', 1), ('economy', 1), ('of', 1), ('london),', 1), ('is', 1), ('a', 1), ('major', 1), ('financial', 1), ('centre', 1), ('for', 1), ('international', 1), ('business', 1), ('and', 1), ('commerce', 1), ('and', 1), ('is', 1), ('one', 1), ('of', 1), ('three', 1), ('"command', 1), ('centres"', 1), ('for', 1), ('the', 1), ('global', 1), ('economy', 1), ('(along', 1), 
('with', 1), ('new', 1), ('york', 1), ('city', 1), ('and', 1), ('tokyo).[4]', 1), ('the', 1), ('british', 1), ('economy', 1), ('is', 1), ('made', 1), ('up', 1), ('(in', 1), ('descending', 1), ('order', 1), ('of', 1), ('size)', 1), ('of', 1), ('the', 1), ('economies', 1), ('of', 1), ('england,', 1), ('scotland,', 1), ('wales', 1), ('and', 1), ('northern', 1), ('ireland.', 1), ('in', 1), ('1973,', 1), ('the', 1), ('uk', 1), ('acceded', 1), ('to', 1), ('the', 1), ('european', 1), ('economic', 1), ('community', 1), ('which', 1), ('is', 1), ('now', 1), ('known', 1), ('as', 1), ('the', 1), ('european', 1), ('union', 1), ('after', 1), ('the', 1), ('ratification', 1), ('of', 1), ('the', 1), ('treaty', 1), ('of', 1), ('maastricht', 1), ('in', 1), ('1993.', 1), ('(from', 1), ('wikipedia,', 1), ('the', 1), ('free', 1), ('encyclopedia)', 1)]
>>> tongji = words_.reduceByKey(lambda a,b:(a+b))
>>> tongji.collect()
[('seventh', 1), ('weakened', 1), ('was', 1), ('organisation,', 1), ('united', 5), ('second', 1), ('europe', 1), ('country', 1), ('power', 2), ('for', 4), ('community', 1), ('its', 3), ('economy', 6), ('1993.', 1), ('predominant', 1), ('business', 1), ('the', 51), ('centuries,', 1), ('large', 1), ('it', 2), ('both', 1), ('first', 1), ('start', 1), ('tokyo).[4]', 1), ('made', 1), ('18th', 2), ('significant', 1), ('19th', 3), ('acceded', 1), ('dismantlement', 1), ('possessed', 1), ('gdp', 2), ('after', 3), ('(in', 1), ('chief', 1), ('had', 2), ('capita', 1), ('century', 3), ('and', 15), ('commonwealth', 1), ('by', 5), ('efforts', 1), ('product', 1), ('capitalist', 1), ('industrialise', 1), ('economy,', 1), ('a', 5), ('three', 1), ('third', 2), ('centre', 1), ('britain', 1), ('uk', 4), ('21st', 1), ('(see', 1), ('also', 2), ('in', 15), ('order', 1), ('industrial', 1), ('1.the', 1), ('terms', 1), ('countries.', 1), ('wikipedia,', 1), ('meant', 1), ('with', 1), ('much', 1), ('highest', 1), ('economic', 2), ('importance', 1), ('world.[1]', 1), ('g8,', 1), ('leader', 1), ('world', 3), ("world's", 2), ('late', 1), ('superseded', 1), ('century,', 1), ("britain's", 1), ('us', 1), ('york', 1), ('due', 1), ('1973,', 1), ('ppp', 1), ('economies', 1), ('england,', 1), ('kingdom', 3), ('new', 1), ('member', 1), ('london,', 1), ('that', 2), ('20th', 1), ('city', 1), ('descending', 1), ('union.', 1), ('been', 1), ('possesses', 2), ('london),', 1), ('of', 16), ('states', 2), ('terms,', 2), ('scotland,', 1), ('financial', 2), ('one', 2), ('gross', 1), ("france's", 1), ('(along', 1), ('role', 3), ('union', 1), ('extensive', 1), ('northern', 1), ('(from', 1), ('empire', 1), ('begun', 1), ('organisation', 1), ('global', 6), ('wales', 1), ('war', 1), ('"command', 1), ('co-operation', 1), ('centres"', 1), ('largest', 4), ('known', 1), ('economy.', 4), ('globalised', 1), ('which', 1), ('purchasing', 2), ('domestic', 1), ('ratification', 1), ('player', 1), ('challenge', 1), ('per', 1), 
('treaty', 1), ('british', 3), ('sixth', 1), ('parity.[1]', 2), ('wars', 1), ('still', 1), ('major', 2), ('however,', 2), ('capital,', 2), ('commerce', 1), ('european', 3), ('developed', 1), ('time', 1), ('world.', 1), ("russia's", 1), ('at', 1), ('development,', 1), ('most', 1), ('nations,', 1), ('now', 1), ('revolution', 1), ("germany's", 2), ('nominal', 2), ('trade', 1), ('up', 1), ('is', 10), ('encyclopedia)', 1), ('free', 1), ('international', 1), ('size)', 1), ('to', 4), ('london', 1), ('as', 3), ('maastricht', 1), ('ireland.', 1)]
>>> 201806120060 wenjiaqing
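Because `split()` only breaks on whitespace, the result above counts 'economy', 'economy.' and 'economy,' as three different words, and tokens such as '1.the' and 'parity.[1]' keep their punctuation. Below is a minimal sketch of one possible cleanup step, written in plain Python so it can be checked without a cluster. The names `tokenize` and `count_words` are hypothetical, and the regular expression is only an assumption about what should count as a word (letters and digits, with internal apostrophes or hyphens kept).

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters/digits, optionally joined by
# an internal apostrophe or hyphen (e.g. "world's", "co-operation").
WORD_PATTERN = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")


def tokenize(line):
    """Lower-case a line and return its words with surrounding punctuation removed."""
    return WORD_PATTERN.findall(line.lower())


def count_words(lines):
    """Count word occurrences across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


if __name__ == "__main__":
    sample = [
        "1.The British economy",
        "It is a major developed capitalist economy.",
    ]
    # Print words from most to least frequent.
    for word, count in count_words(sample).most_common():
        print(word, count)
```

In the session above, `tokenize` could in principle replace the lambda passed to `flatMap` (that is, `lines.flatMap(tokenize)`), since `flatMap` accepts any function that returns an iterable. That variant was not run as part of this exercise.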



2. Compare the strengths, weaknesses, and suitable scenarios of programming under different computing frameworks.
–Python
–MapReduce
–Hive
–Spark
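MapReduce consists of two essential phases, Map and Reduce. Map applies when each element of the data needs its own transformation, for example truncating a value, filtering, or any other per-element conversion; these element-by-element transformations are what Map refers to. Reduce is the aggregation of elements, combining many elements into one, for example computing a sum.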
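The two phases described above can be sketched in plain Python using word counting as the task. This is an illustration of the model only, not of how Hadoop implements it: the function names are hypothetical, and the intermediate grouping step (`shuffle_phase`) is added here as an assumption so that the reduce step has values to aggregate per key.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Group the emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: aggregate the many values of each key into one value (a sum)."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in grouped.items()
    }


def word_count(lines):
    """Run the three steps in order on an iterable of text lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["the uk economy", "The global economy"]))
```

The lambda `a + b` in `reduce_phase` plays the same role as the one passed to `reduceByKey` in the PySpark session in part 1.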
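MapReduce was the core of Hadoop 1.0, and Spark has gradually been replacing it. So why is MapReduce still in use? Because many existing applications still depend on it. It is not a standalone component; it has become an integral part of other tools in the ecosystem, such as Pig and Hive.
Although MapReduce greatly simplified big data analysis, user demands grew as big data needs and usage patterns expanded:
1. More complex multi-pass processing (for example iterative computation, ML, graph);
2. Low-latency interactive queries (for example ad-hoc queries).
The architecture of the MapReduce computing model makes both kinds of application inherently slow, so users urgently needed a faster computing model to make up for MapReduce's inherent shortcomings.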
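Spark emerged to make up for these shortcomings. Some of Spark's advantages:
1. Instead of scheduling each job independently, Spark can organize all the interdependent jobs into a graph and schedule them together, which is faster.
2. Intermediate data is kept in memory wherever possible, so Spark is often described as an in-memory iterative computing framework.
3. Spark provides a richer set of operators, which makes operations more convenient.
4. An easier API: it supports Python, Scala, and Java.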
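Spark can also express MapReduce. Here, though, MapReduce is not a single algorithm: Spark provides a map stage and a reduce stage, with many operators available in each. Examples are map, flatMap, filter, and keyBy in the map stage, and reduceByKey, sortByKey, mean, groupBy, and sortBy in the reduce stage.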
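Hive is arguably the de facto standard for big data warehousing. Hive can access data on HDFS and HBase, and Impala, Spark SQL, and Presto can all read data from a warehouse built with Hive. In general, Hive is still used for batch jobs, while Impala, Spark SQL, or Presto are widely used for hot queries that feed data displays.
Hive provides three access interfaces: CLI, Web UI, and HiveServer2.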
