Configuring a Spark Development Environment in PyCharm
1. Download and install the JDK
2. Download and install Python
3. Download Hadoop
4. Download winutils.exe and place it in the hadoop\bin directory
5. Install pyspark (py4j is pulled in as a dependency): pip install -U -i https://pypi.tuna.tsinghua.edu.cn/simple pyspark
6. Test the environment in PyCharm
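Before running the full program below, a quick sanity check that the install is importable (pyspark.__version__ is a standard attribute of the package):

import pyspark
print(pyspark.__version__)  # should print the installed version, e.g. 2.x.x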
import os

# Point Spark at the local JDK and Hadoop; raw strings avoid
# backslash-escape problems in Windows paths (adjust to your machine).
os.environ['JAVA_HOME'] = r"C:\Program Files\Java\jdk1.8.0_191"
os.environ['HADOOP_HOME'] = r"E:\software\hadoop"
from pyspark import SparkConf, SparkContext

# spark.driver.host must be an address the executors can reach back to
# (here, the Windows machine running this driver).
conf = (SparkConf().setMaster("spark://192.168.56.102:7077")
        .set("spark.driver.host", "10.88.16.213")
        .set("spark.executor.memory", "512m")
        .setAppName("PythonWordCount"))
sc = SparkContext(conf=conf)
links = sc.parallelize(["A", "B", "C", "D"])
# flatMap flattens each (dest, 1) tuple into two elements; map keeps the pair
C = links.flatMap(lambda dest: (dest, 1)).count()
D = links.map(lambda dest: (dest, 1)).count()
print(C)  # 8
print(D)  # 4
c = links.flatMap(lambda dest: (dest, 1)).collect()
d = links.map(lambda dest: (dest, 1)).collect()
print(c)  # ['A', 1, 'B', 1, 'C', 1, 'D', 1]
print(d)  # [('A', 1), ('B', 1), ('C', 1), ('D', 1)]
Note the spark.driver.host and spark.executor.memory settings: if you are running the driver on Windows and memory is tight, configure both explicitly; spark.executor.memory defaults to 1 GB.
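If you only want to smoke-test Spark on the Windows machine itself, a local master sidesteps the driver-host question entirely. A minimal sketch (local[*] is the standard local master string; the app name here is made up):

from pyspark import SparkConf, SparkContext

# Everything runs in one local JVM, so no spark.driver.host is needed.
local_conf = SparkConf().setMaster("local[*]").setAppName("LocalSmokeTest")
sc = SparkContext(conf=local_conf)
print(sc.parallelize(range(10)).sum())  # expect 45
sc.stop()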
The following error may appear when connecting to Hive:
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are:
Solution:
Create /tmp/hive at the root of the drive the program runs from, then execute the following commands:
%HADOOP_HOME%\bin\winutils.exe ls \tmp\hive
%HADOOP_HOME%\bin\winutils.exe chmod 777 \tmp\hive
%HADOOP_HOME%\bin\winutils.exe ls \tmp\hive
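Once the permissions are fixed, you can verify Hive connectivity from Python. A minimal sketch, assuming Spark 2.x (SparkSession.builder with enableHiveSupport) and the environment variables set as above:

from pyspark.sql import SparkSession

# enableHiveSupport() wires the session to the Hive metastore and uses
# the /tmp/hive scratch directory whose permissions were fixed above.
spark = (SparkSession.builder
         .appName("HiveSmokeTest")
         .enableHiveSupport()
         .getOrCreate())
spark.sql("SHOW DATABASES").show()
spark.stop()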