SparkShell(sparkSql) on k8s

There is no Zeppelin deployed on the k8s cluster, so running ad-hoc queries with spark-shell/spark-sql is inconvenient, especially when the data volume is large. Below is how to run a pod on k8s and then launch spark-shell/spark-sql inside that pod, which makes querying data easy.

(Of course, if your local machine has a fixed IP, or you can use a tunneling service such as Oray "peanut shell", you can run spark-shell/spark-sql in client mode directly against the k8s cluster to request resources.)

#step1 create a pod to act as the spark client
cat <<EOF >spark-client.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: spark-client
  name: spark-client
spec:
  containers:
  - name: spark-client
    image: student2021/spark:301p
    imagePullPolicy: Always
    securityContext:
        allowPrivilegeEscalation: false
        runAsUser: 0
    command:
      - sh
      - -c
      - "exec tail -f /dev/null"
  restartPolicy: Never
  serviceAccount: spark
EOF 
kubectl apply -n spark-job -f spark-client.yaml
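Before exec-ing into the pod it can help to block until it is actually Ready; a minimal sketch using kubectl wait:

```shell
# Wait (up to 120s) for the spark-client pod to become Ready before exec-ing in
kubectl -n spark-job wait --for=condition=Ready pod/spark-client --timeout=120s
kubectl -n spark-job get pod spark-client
```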
#step2 enter the spark-client pod and run spark-shell or spark-sql
kubectl -n spark-job exec -it spark-client -- sh
export SPARK_USER=spark
driver_host=$(cat /etc/hosts|grep spark-client|cut -f 1)
echo $driver_host
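The `grep | cut` pipeline above relies on the pod's /etc/hosts being tab-delimited (which is how kubelet writes it). A self-contained sketch of the same parsing against sample data (the IPs here are hypothetical):

```shell
# Sample of what a pod's /etc/hosts looks like; \t marks the tab delimiter
# that `cut -f 1` (whose default delimiter is tab) depends on.
driver_host=$(printf '127.0.0.1\tlocalhost\n10.1.2.3\tspark-client\n' \
  | grep spark-client | cut -f 1)
echo "$driver_host"   # the pod IP, here 10.1.2.3
```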
/opt/spark/bin/spark-shell --conf spark.jars.ivy=/tmp/.ivy \
--master k8s://https://kubernetes.default.svc \
--deploy-mode client \
--conf spark.kubernetes.namespace=spark-job \
--conf spark.kubernetes.container.image=student2021/spark:301p \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.driver.pod.name=spark-client \
--conf spark.executor.instances=4 \
--conf spark.executor.memory=4g \
--conf spark.driver.memory=4g \
--conf spark.driver.host=${driver_host} \
--conf spark.driver.port=14040
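The title also mentions spark-sql; the same image and service account should work by swapping the binary. A sketch (the master URL is shown as the in-cluster API server address, an assumption, and `show databases;` is just a sample statement):

```shell
# Same client-mode setup, but with the spark-sql CLI and an inline statement
/opt/spark/bin/spark-sql --conf spark.jars.ivy=/tmp/.ivy \
  --master k8s://https://kubernetes.default.svc \
  --deploy-mode client \
  --conf spark.kubernetes.namespace=spark-job \
  --conf spark.kubernetes.container.image=student2021/spark:301p \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.driver.pod.name=spark-client \
  --conf spark.driver.host=${driver_host} \
  --conf spark.driver.port=14040 \
  -e "show databases;"
```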
###if you want to use a headless service for the driver instead, run the following
kubectl -n spark-job expose pod spark-client --port=14040 --type=ClusterIP --cluster-ip=None
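With a headless service in place, the driver can be addressed by a stable DNS name instead of the pod IP parsed from /etc/hosts; a sketch (assuming the service is named spark-client in the spark-job namespace):

```shell
# Hypothetical: point spark.driver.host at the headless service's DNS name,
# which resolves directly to the pod IP for a headless (clusterIP: None) service
driver_host=spark-client.spark-job.svc.cluster.local
echo "$driver_host"
```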

 


posted on 2021-04-28 14:46 by tneduts
