This post details how to deploy Spark in Standalone mode to a Kubernetes cluster running locally on Minikube.
Dependencies:
- Docker v19.03.8
- Minikube v1.12.1
- Spark v3.0.0
- Hadoop v3.3.0
Minikube
Minikube is a tool used to run a single-node Kubernetes cluster locally.
Follow the official Install Minikube guide to install it along with a hypervisor (like VirtualBox or HyperKit) to manage virtual machines, and kubectl to deploy and manage apps on Kubernetes.
By default, the Minikube VM is configured to use 1GB of memory and 2 CPU cores. This is not sufficient for Spark jobs, so be sure to increase the memory in your Docker client (for HyperKit) or directly in VirtualBox. Then, when you start Minikube, pass the memory and CPU options to it:
$ minikube start --vm-driver=hyperkit --memory 8192 --cpus 4

or

$ minikube start --memory 8192 --cpus 4
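Once Minikube is up, it's worth a quick sanity check that the node is healthy before moving on. These are standard Minikube and kubectl commands:

$ minikube status
$ kubectl get nodes

You should see a single node in the Ready state.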
Docker
Next, let's build a custom Docker image for Spark 3.0.0, designed for Spark Standalone mode.
Dockerfile:
# base image
FROM java:openjdk-8-jdk

# define spark and hadoop versions
ENV SPARK_VERSION=3.0.0
ENV HADOOP_VERSION=3.3.0

# download and install hadoop
RUN mkdir -p /opt && \
    cd /opt && \
    curl http://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz | \
        tar -zx hadoop-${HADOOP_VERSION}/lib/native && \
    ln -s hadoop-${HADOOP_VERSION} hadoop && \
    echo Hadoop ${HADOOP_VERSION} native libraries installed in /opt/hadoop/lib/native

# download and install spark
RUN mkdir -p /opt && \
    cd /opt && \
    curl http://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz | \
        tar -zx && \
    ln -s spark-${SPARK_VERSION}-bin-hadoop2.7 spark && \
    echo Spark ${SPARK_VERSION} installed in /opt

# add scripts and update spark default config
ADD common.sh spark-master spark-worker /
ADD spark-defaults.conf /opt/spark/conf/spark-defaults.conf
ENV PATH $PATH:/opt/spark/bin
You can find the above Dockerfile along with the Spark config file and scripts in the spark-kubernetes repo on GitHub.
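In case you're wondering what those scripts do: they're thin wrappers that start Spark's standalone daemons in the foreground so each container stays alive. Here's a rough sketch of the shape of /spark-master, assuming the file layout and ports from the Dockerfile above; the real script lives in the repo:

#!/bin/bash
# sketch of /spark-master -- details here are assumptions, see the repo

# shared setup, copied into the image by the Dockerfile's ADD line
. /common.sh

# run the standalone master in the foreground on the ports the
# Deployment below exposes (7077 for Spark, 8080 for the web UI)
/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master \
    --host spark-master \
    --port 7077 \
    --webui-port 8080

The worker script is analogous, launching org.apache.spark.deploy.worker.Worker pointed at spark://spark-master:7077.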
Build the image:
$ eval $(minikube docker-env)
$ docker build -f docker/Dockerfile -t spark-hadoop:3.0.0 ./docker
If you don't want to spend the time building the image locally, feel free to use my pre-built Spark image from Docker Hub: mjhea0/spark-hadoop:3.0.0.
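If you use the pre-built image, pull it inside the Minikube Docker daemon and retag it so the manifests below, which reference spark-hadoop:3.0.0, still resolve:

$ eval $(minikube docker-env)
$ docker pull mjhea0/spark-hadoop:3.0.0
$ docker tag mjhea0/spark-hadoop:3.0.0 spark-hadoop:3.0.0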
View:
$ docker image ls spark-hadoop

REPOSITORY     TAG      IMAGE ID       CREATED          SIZE
spark-hadoop   3.0.0    8f3ccdadd795   11 minutes ago   911MB
Spark Master
spark-master-deployment.yaml:
kind: Deployment
apiVersion: apps/v1
metadata:
  name: spark-master
spec:
  replicas: 1
  selector:
    matchLabels:
      component: spark-master
  template:
    metadata:
      labels:
        component: spark-master
    spec:
      containers:
        - name: spark-master
          image: spark-hadoop:3.0.0
          command: ["/spark-master"]
          ports:
            - containerPort: 7077
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
spark-master-service.yaml:
kind: Service
apiVersion: v1
metadata:
  name: spark-master
spec:
  ports:
    - name: webui
      port: 8080
      targetPort: 8080
    - name: spark
      port: 7077
      targetPort: 7077
  selector:
    component: spark-master
Create the Spark master Deployment and start the Services:
$ kubectl create -f ./kubernetes/spark-master-deployment.yaml
$ kubectl create -f ./kubernetes/spark-master-service.yaml
Verify:
$ kubectl get deployments

NAME           READY   UP-TO-DATE   AVAILABLE   AGE
spark-master   1/1     1            1           12s

$ kubectl get pods

NAME                            READY   STATUS    RESTARTS   AGE
spark-master-6c4469fdb6-rs642   1/1     Running   0          6s
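You can also tail the pod's logs to confirm the standalone master actually came up (the pod name is the one from the kubectl get pods output above):

$ kubectl logs spark-master-6c4469fdb6-rs642

The exact lines vary, but the master should log the spark:// URL it is listening on.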
Spark Workers
spark-worker-deployment.yaml:
kind: Deployment
apiVersion: apps/v1
metadata:
  name: spark-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      component: spark-worker
  template:
    metadata:
      labels:
        component: spark-worker
    spec:
      containers:
        - name: spark-worker
          image: spark-hadoop:3.0.0
          command: ["/spark-worker"]
          ports:
            - containerPort: 8081
          resources:
            requests:
              cpu: 100m
Create the Spark worker Deployment:
$ kubectl create -f ./kubernetes/spark-worker-deployment.yaml
Verify:
$ kubectl get deployments

NAME           READY   UP-TO-DATE   AVAILABLE   AGE
spark-master   1/1     1            1           92s
spark-worker   2/2     2            2           6s

$ kubectl get pods

NAME                            READY   STATUS    RESTARTS   AGE
spark-master-6c4469fdb6-rs642   1/1     Running   0          114s
spark-worker-5d4bdd44db-p2q8v   1/1     Running   0          28s
spark-worker-5d4bdd44db-v4d84   1/1     Running   0          28s
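Because the workers run as a plain Deployment, resizing the cluster is a one-liner. For example, to go from two workers to three:

$ kubectl scale deployment spark-worker --replicas=3

Each new pod should register itself with the master automatically, and the extra worker will then show up in the master's web UI.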
Ingress
Did you notice that we exposed the Spark web UI on port 8080? To access it from outside the cluster, let's configure an Ingress object. Note that the networking.k8s.io/v1beta1 Ingress API used below was removed in Kubernetes v1.22, so on a newer cluster you'd need to port the manifest to networking.k8s.io/v1; it works as-is with the Kubernetes version that ships with Minikube v1.12.
minikube-ingress.yaml:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: minikube-ingress
  annotations:
spec:
  rules:
    - host: spark-kubernetes
      http:
        paths:
          - path: /
            backend:
              serviceName: spark-master
              servicePort: 8080
Enable the Ingress addon:
$ minikube addons enable ingress
Create the Ingress object:
$ kubectl apply -f ./kubernetes/minikube-ingress.yaml
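Verify:

$ kubectl get ingress

The spark-kubernetes host should be listed, eventually with the Minikube IP as its address.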
Next, you need to update your /etc/hosts file to route requests from the host we defined, spark-kubernetes, to the Minikube instance.
Add an entry to /etc/hosts:
$ echo "$(minikube ip) spark-kubernetes" | sudo tee -a /etc/hosts
Test it out in the browser at http://spark-kubernetes/. You should see the Spark master web UI, with both workers registered.
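If you'd rather check from the terminal first, the same request with curl should return the master UI's HTML; the grep pattern here is just a convenient marker to look for:

$ curl -s http://spark-kubernetes/ | grep -i 'spark master'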
Test
To test, run the PySpark shell from the master container:
$ kubectl get pods

NAME                            READY   STATUS    RESTARTS   AGE
spark-master-6c4469fdb6-rs642   1/1     Running   0          10m
spark-worker-5d4bdd44db-p2q8v   1/1     Running   0          8m42s
spark-worker-5d4bdd44db-v4d84   1/1     Running   0          8m42s

$ kubectl exec spark-master-6c4469fdb6-rs642 -it -- pyspark
Then run the following code after the PySpark prompt appears:
# the PySpark shell already provides a SparkContext as `sc`,
# so there's no need to create a new one
words = 'the quick brown fox jumps over the lazy dog the quick brown fox jumps over the lazy dog'
seq = words.split()
data = sc.parallelize(seq)
counts = data.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b).collect()
dict(counts)
sc.stop()
You should see:
{'brown': 2, 'lazy': 2, 'over': 2, 'fox': 2, 'dog': 2, 'quick': 2, 'the': 4, 'jumps': 2}
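As a final sanity check, you can also submit a non-interactive job against the cluster. The SparkPi example that ships with the Spark distribution works well here; the jar path assumes the /opt/spark layout from the Dockerfile above:

$ kubectl exec spark-master-6c4469fdb6-rs642 -it -- \
    spark-submit \
      --master spark://spark-master:7077 \
      --class org.apache.spark.examples.SparkPi \
      /opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar 10

Somewhere in the driver output you should see a line like "Pi is roughly 3.14...".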
That's it!
You can find the scripts in the spark-kubernetes repo on GitHub. Cheers!