Spark Shell (Spark SQL) on k8s
We don't have Zeppelin deployed on our k8s cluster, so running ad-hoc queries with spark-shell/spark-sql is inconvenient, especially on large datasets. The steps below show how to run a pod on k8s and start spark-shell/spark-sql inside it, which makes querying data much easier.
(Of course, if your local machine has a fixed IP, or you can use a DDNS/tunneling service such as 花生壳 (Oray), you can run spark-shell/spark-sql in client mode directly against the k8s cluster.)
#step1 create a pod as spark-client
cat <<EOF >spark-client.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: spark-client
  name: spark-client
spec:
  containers:
  - name: spark-client
    image: student2021/spark:301p
    imagePullPolicy: Always
    securityContext:
      allowPrivilegeEscalation: false
      runAsUser: 0
    command:
    - sh
    - -c
    - "exec tail -f /dev/null"
  restartPolicy: Never
  serviceAccount: spark
EOF
kubectl apply -n spark-job -f spark-client.yaml
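Before exec-ing into the pod, it's worth confirming it is actually up; a quick check (assuming the spark-job namespace used above):

# the pod should report Running before you exec into it
kubectl -n spark-job get pod spark-client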
#step2 enter spark-client pod and run spark-shell or spark-sql
kubectl -n spark-job exec -it spark-client -- sh
export SPARK_USER=spark
# In client mode the executors must be able to reach back to the driver,
# so resolve this pod's own IP from /etc/hosts to use as spark.driver.host
driver_host=$(cat /etc/hosts | grep spark-client | cut -f 1)
echo $driver_host
### --master must point at the k8s API server (k8s://https://<host>:<port>);
### adjust localhost:18080 to your environment
/opt/spark/bin/spark-shell --conf spark.jars.ivy=/tmp/.ivy \
  --master k8s://localhost:18080 \
  --deploy-mode client \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.container.image=student2021/spark:301p \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.driver.pod.name=spark-client \
  --conf spark.executor.instances=4 \
  --conf spark.executor.memory=4g \
  --conf spark.driver.memory=4g \
  --conf spark.driver.host=${driver_host} \
  --conf spark.driver.port=14040
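Once the shell comes up, Spark requests executor pods from the API server. From another terminal you can confirm they were scheduled; the spark-role=executor label below is set by Spark on k8s itself, and the namespace matches spark.kubernetes.namespace:

# executor pods created by Spark carry the label spark-role=executor
kubectl -n spark get pods -l spark-role=executor

spark-sql works the same way: replace /opt/spark/bin/spark-shell with /opt/spark/bin/spark-sql and keep the same --master and --conf flags.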
### If you want to use a headless service (to give the driver a stable DNS name), run:
kubectl -n spark-job expose pod spark-client --port=14040 --type=ClusterIP --cluster-ip=None
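With the headless service in place the driver has a stable in-cluster DNS name, so it can be used instead of the IP scraped from /etc/hosts; a minimal sketch, assuming the service was created in the spark-job namespace:

# headless services resolve to <service>.<namespace>.svc.cluster.local
driver_host=spark-client.spark-job.svc.cluster.local
# then start spark-shell with --conf spark.driver.host=${driver_host} as above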