Running Spark on a Kubernetes cluster is already pretty easy, so it is unclear what value this is adding. Controlling cost is the hard part. You may only need a cluster for 1 hour per day for a nightly aggregation job. Kubernetes clusters are not easy to provision and de-provision, so you end up paying for a cluster 24 hours a day and using it for only 1 hour. If someone comes up with a way to pay for pre-provisioned Kubernetes clusters only for the duration you use them, that would be interesting.
Thanks for the feedback! It's possible to run Spark on Kubernetes using just open-source tools - in fact our platform builds upon and contributes to many of these tools. But it's not easy enough, in our humble opinion: you need to build a decent level of expertise in Spark and k8s just to get started, and even more to keep it operational/stable/cost-efficient/secure in the long term.
Regarding costs: by autoscaling the cluster size and minimising our service footprint, the fixed cost of using our platform is around $100/month, which is negligible compared to the cost of most big data projects. We have some ideas on how to drive this fixed cost to zero, and to offer a free hosted version of our platform too. It's on the roadmap!
The problem being solved here is resource tuning, which is a problem you will eventually encounter as your data org grows big. Specifically, in our case the authors of our Spark jobs understand the data modelling well but might not know how to tweak the Spark parameters to optimize execution. As mentioned in the post, even if you do know what you're doing, the process is long and time-consuming. So I definitely see the value add here.
If you need ephemeral Spark clusters, Dataproc on GCP will give that to you; there's probably a similar service in AWS and Azure.
>> Controlling cost is the hard part. You may only need a cluster for 1 hour per day for a nightly aggregation job. Kubernetes clusters are not easy to provision and de-provision, so you end up paying for a cluster 24 hours a day and using it for only 1 hour.
What is the benefit of using Kubernetes to deploy Spark jobs then? Is that approach meant to achieve independence from the hardware?
I'm asking because that is fairly trivial to achieve with a provider like AWS: you can build a CloudFormation template (or use the AWS API or the web UI) to launch AWS EMR clusters with specific hardware and run any Spark jars, and you can use services like Data Pipeline or Glue to schedule and/or automate the whole process. So you can use AWS services to set up a schedule that will periodically spin up a cluster with whatever machines you need to run a Spark app and decommission it as soon as it's done.
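For anyone curious what that looks like concretely, here's a rough sketch of the RunJobFlow request body for such a transient cluster (the cluster name, jar path, class name, instance types, and release label below are all illustrative placeholders, not anything from the setup described above):

```python
# Request body for EMR's RunJobFlow API (e.g. passed to boto3's emr client).
# All names and paths are placeholders; instance types and sizes are assumptions.
job_flow = {
    "Name": "nightly-aggregation",
    "ReleaseLabel": "emr-5.30.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "MasterInstanceType": "m4.xlarge",
        "SlaveInstanceType": "m4.4xlarge",
        "InstanceCount": 9,
        # Auto-terminate the cluster once all steps have finished:
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    "Steps": [{
        "Name": "run-spark-app",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--class", "com.example.Main",
                     "s3://my-bucket/jobs/app.jar"],
        },
    }],
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}
# With real credentials: boto3.client("emr").run_job_flow(**job_flow)
```

Because `KeepJobFlowAliveWhenNoSteps` is false, the cluster tears itself down as soon as the Spark step completes, so you only pay for the job's actual runtime.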
In this case, the EMR cluster comes with the myriad Hadoop tools and services (plus Spark and other relevant software) preinstalled and ready to use. And most relevant Spark settings are already optimized for the cluster's hardware - but not for the Spark app itself, which is what this solution seems to address.
We usually run a tiny EC2 instance with Airflow on it to spin up spot-market instances right-sized to the job, then map them to EMR templates to initiate the Spark cluster and submit the job. This is the most cost-effective way I've seen. It is limited to batch, and you need to set an upper bound for the spot bid plus bid-failure logic (fall back to on-demand instances, or wait until the next run attempt), but in practice it has seldom failed to secure these instances - a handful of times over the last 3 years.
To give you an idea, we run an 8x m4.4xlarge job every hour and it costs less than $800/mo including S3 and egress of the output data. On-demand pricing to keep that cluster up persistently would be about $4900/mo.
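As a back-of-the-envelope check on those numbers (the ~$0.84/hr on-demand rate for m4.4xlarge is my assumption; actual prices vary by region):

```python
# Rough sanity check of persistent on-demand cost vs. the observed spot-based cost.
HOURS_PER_MONTH = 730          # average hours in a month
instances = 8
on_demand_rate = 0.84          # assumed $/hr per m4.4xlarge (region-dependent)

persistent_monthly = instances * on_demand_rate * HOURS_PER_MONTH
observed_monthly = 800         # spot + S3 + data transfer, per the comment above

savings = 1 - observed_monthly / persistent_monthly
print(f"persistent: ${persistent_monthly:,.0f}/mo, savings: {savings:.0%}")
```

That puts the persistent-cluster cost right around the $4900/mo figure, i.e. the spot-based approach saves on the order of 80%+.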
So, to OP: great platform, but your real value contribution for large users (the ones with budget) would be any cost optimization features you could build in.
PS: the k8s spark-submit feature is amazingly easy and highly recommended for beginners - set up k8s using Rancher and spark-submit your way to data-devops bliss.
Seconded - I'm doing this as well with Airflow and EMR. Instance fleets make the fallback logic to on-demand instances super easy (you set the price + time allowed for trying spot, and then the on-demand instances you want to fall back to).