2.安装Spark与Python练习

一、安装Spark　

1.检查基础环境hadoop.jdk

　java -version #查看jdk的安装以及版本

jps  #查看当前环境的节点

start-all.sh  #开启hadoop，因为我在本次实验中使用的是分布式模式，所以节点分别在master上和slave1，同时因为在配置环境变量中进行过配置，start-all.sh直接开启的就是hadoop

4.配置文件

cd /usr/local/spark/conf  #进入spark的conf目录下
sudo gedit spark-env.sh  #编辑这个配置文件

5.环境变量

gedit ~/.basnrc #修改该配置文件如下图所示，并且对应上如下图在自己电脑上该目录下的zip

source ~/.bashrc #修改后配置生效

6.试运行python代码

cd /usr/local/spark  #进入spark目录下
sudo gedit wordcount.txt  #创建一个wordcount.txt文本文件，文章内容随便添加

gedit test1.py  #编辑一个py文件,代码如下所示

path='/usr/local/spark/wordcount.txt'  #文本路径
with open(path) as f:
    #读取文本
    text=f.read()
    text = text.lower()  #把所有字母都变成小写，便于统计
    #将文本中特殊字符替换为空格
    for ch in '!"@#$%^&*()+,-./:;<=>?@[\\]_`~{|}':
        text=text.replace(ch," ")
    
    
words = text.split()
#将文本用空格分隔
stop_words = ['so','out','all','for','of','to','on','in','if','by','under','it','at','into','with','about','i','am','are','is','a','the','and','that','before','her','she','my','be','an','from','would','me','got']
lenwords=len(words)  #停用词
afterwords=[]

# 按词频大小排序
for i in range(lenwords):
    z=1
    for j in range(len(stop_words)):
    
        #避开停用词
        if words[i]==stop_words[j]:
            continue
        else:
            if z==len(stop_words):
                afterwords.append(words[i])
                break
            z=z+1
            continue
            
#统计每个单词出现个数
counts = {}
for word in afterwords:
    counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1],reverse=True)
i=1

# 打印出现次数前15 的单词
while i<=15:
    word,count = items[i-1]
    print("{0:<20}{1}".format(word,count))
    i=i+1

./bin/spark-submit test.py   #提交该目录下的test.py给spark运行

posted on 2022-03-04 14:02 花开花落亦南浔丶阅读(30) 评论(0) 收藏举报