Joom Spark Platform
Joom Spark Platform is a ready-to-use Spark on Kubernetes setup. In a few minutes, you can deploy essential components such as Spark Operator and Hive Metastore, and start your first job.
Joom Spark Platform is presently available for AWS EKS as an AWS Marketplace product. Before using it, you need to subscribe (for free) to get access.
Getting Started
Prerequisites
Make sure you have an AWS EKS cluster, and you have the kubectl command configured to access it. You will also need the eksctl and helm tools installed. If you don’t have a cluster yet, it might be easier to use QuickLaunch, as described later.
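To confirm the tools are in place, you can check that each one responds:
aws --version
kubectl version --client
eksctl version
helm version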
Subscription
To obtain the Joom Spark Platform, you need to subscribe (for free) on AWS Marketplace.
Installation
We need to create a namespace and a service account. For initial testing, create a service account with read-only access to S3:
kubectl create namespace spark
eksctl create iamserviceaccount \
--name spark \
--namespace spark \
--cluster <ENTER_YOUR_CLUSTER_NAME_HERE> \
--attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
--approve \
--override-existing-serviceaccounts
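eksctl should annotate the new service account with the IAM role it created. To confirm, inspect the account and look for an eks.amazonaws.com/role-arn annotation:
kubectl -n spark describe serviceaccount spark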
Use Helm 3.8.0 or later, and log in to the Helm registry:
aws ecr get-login-password --region us-east-1 | helm registry login \
--username AWS --password-stdin 709825985650.dkr.ecr.us-east-1.amazonaws.com
Then, install the Helm chart:
helm install --namespace spark \
joom-spark-platform \
oci://709825985650.dkr.ecr.us-east-1.amazonaws.com/joom/joom-spark-platform \
--version 1.0.0
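You can confirm that the release was created:
helm list --namespace spark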
It might take a couple of minutes for all the components to start up. Run
kubectl -n spark get pods
and make sure all pods are ready before proceeding.
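Alternatively, to block until all pods are ready rather than polling by hand, you can run:
kubectl -n spark wait --for=condition=Ready pods --all --timeout=300s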
The first Spark job
Obtain the example Spark application manifest and apply it:
aws s3 cp s3://joom-analytics-cloud-public/examples/minimal/minimal.yaml minimal.yaml
kubectl apply -f minimal.yaml
Finally, watch the output of the Spark job:
kubectl -n spark logs demo-minimal-driver -f
You should see a Spark session starting, and a test dataframe printed. If you get an error that the pod does not exist, try again in a few seconds.
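If the example manifest defines a SparkApplication (the Spark Operator’s custom resource, which the demo-minimal-driver pod name suggests), you can also check the job’s overall state:
kubectl -n spark get sparkapplications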
Spark jobs that write data
Most likely, you want your Spark jobs to write some data. First, you need to decide what S3 bucket to use. For testing, a new bucket might be the best idea.
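For example, you can create a fresh test bucket with the AWS CLI (the name below is a placeholder; S3 bucket names must be globally unique):
aws s3 mb s3://<ENTER_YOUR_BUCKET_NAME_HERE> --region <region>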
We need to give the necessary permissions to the service account. First, delete it:
eksctl delete iamserviceaccount --name spark --namespace spark --cluster <ENTER_YOUR_CLUSTER_NAME_HERE>
Wait a couple of minutes and create it again with S3 write access:
eksctl create iamserviceaccount --name spark --namespace spark \
--cluster <ENTER_YOUR_CLUSTER_NAME_HERE> \
--attach-policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess \
--approve \
--override-existing-serviceaccounts
Obtain a Spark application manifest:
aws s3 cp s3://joom-analytics-cloud-public/examples/minimal/minimal-write.yaml minimal-write.yaml
In the file, set the DATA_BUCKET environment variable to the name of your bucket. Then, apply the manifest:
kubectl apply -f minimal-write.yaml
and view the logs:
kubectl -n spark logs demo-minimal-write-driver -f
In the logs, you will see that test data is written to S3 and registered in the Hive metastore.
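To double-check, you can list the contents of your bucket; the exact paths depend on the example job:
aws s3 ls s3://<ENTER_YOUR_BUCKET_NAME_HERE> --recursive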
If you made it this far, congratulations! At this point, you can write your own Spark jobs, put them in your own S3 buckets, and modify the manifests to use them.
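For example, assuming you have a PySpark script named my-job.py (a hypothetical name), you could upload it to your bucket and reference its S3 path in the manifest:
aws s3 cp my-job.py s3://<ENTER_YOUR_BUCKET_NAERE>/jobs/my-job.py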
Getting Started with QuickLaunch
If you don’t yet have an EKS cluster, or if you want to safely experiment in a separate environment, AWS Marketplace provides the QuickLaunch functionality, which creates a cluster with the Joom Spark Platform already installed.
To use it, after subscribing, select the “Helm Chart” fulfillment option, and then, on the “Launch” page, select “Launch on a new EKS cluster with QuickLaunch”. Follow the prompts to name the cluster and provide other information, and then wait until it is created.
Then, make sure you have the AWS CLI installed, as well as the kubectl and eksctl tools.
Connect to the cluster by running:
aws eks update-kubeconfig --name <cluster-name> --region <region>
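and check that the cluster is reachable:
kubectl get nodes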
Create a service account with S3 read permissions:
eksctl create iamserviceaccount \
--name spark \
--namespace spark \
--cluster <ENTER_YOUR_CLUSTER_NAME_HERE> \
--attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
--approve \
--override-existing-serviceaccounts
After that, you can proceed to running your first Spark job, as documented in the “The first Spark job” section above.
Talk to us
The Joom Spark Platform is free and has no formal support, but we’d be happy to discuss your experience, help where we can, and talk about data engineering and Spark in general. Feel free to schedule a meeting.