Apache Spark is an analytics engine and parallel computation framework with Scala, Python and R interfaces. It can load data directly from disk, memory and other data storage technologies such as Amazon S3 and the Hadoop Distributed File System. Starting in Spark 2.3.0, Spark has an experimental option to run clusters managed by Kubernetes; prior to that, you could run Spark using Hadoop YARN, Apache Mesos, or in a standalone cluster. The goal of the community effort behind this feature (SPARK-18278) was to make Kubernetes a first-class cluster manager for Spark, alongside Spark Standalone, YARN, and Mesos: to Kubernetes, a Spark application is an ordinary application like any other, managed by a single resource manager.

In this article, we demonstrate with a standard benchmark that the performance of Kubernetes has caught up with that of Apache Hadoop YARN. In particular, we compare the performance of shuffle between YARN and Kubernetes, and give you critical tips to make shuffle performant when running Spark on Kubernetes. For a deeper dive, you can also watch our session at Spark Summit 2020, "Running Apache Spark on Kubernetes: Best Practices and Pitfalls", or check out our posts "Setting up, Managing & Monitoring Spark on Kubernetes" and "The Pros and Cons of Running Apache Spark on Kubernetes". Full disclosure: we are Data Mechanics, so we are biased in favor of Spark on Kubernetes - and indeed we are convinced that Spark on Kubernetes is the future of Apache Spark.

When running on Kubernetes, the Spark driver creates executors which themselves run in pods within the Kubernetes cluster, whether you submit your application with spark-submit or connect from a client-mode session.
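To make this concrete, here is a minimal sketch of attaching a PySpark session to a Kubernetes cluster in client mode; with spark-submit in cluster mode you would pass the same settings as --conf flags. The API server address, container image, and resource sizes are illustrative placeholders, not the exact configuration we used for the benchmark.

```python
from pyspark.sql import SparkSession

# Connect to the Kubernetes API server; Spark creates executor pods on demand.
spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.example.com:443")  # hypothetical API server
    .appName("tpcds-benchmark")
    .config("spark.kubernetes.container.image", "example.com/spark:3.0.0")  # hypothetical image
    .config("spark.kubernetes.namespace", "spark")
    .config("spark.executor.instances", "8")  # fixed resources: no autoscaling in our benchmark
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)
```

From Kubernetes' point of view, the driver and executors are ordinary pods, which is what lets a Spark application share a cluster with any other containerized workload.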
The TPC-DS benchmark consists of two things: data and queries. The data is synthetic and can be generated at different scales. It is skewed - meaning that some partitions are much larger than others - so as to represent real-world situations (for example, many more sales in July than in January). As a result, the queries have different resource requirements: some have a high CPU load, while others are IO-intensive.

We gave a fixed amount of resources to YARN and Kubernetes, with no autoscaling, and we used standard persistent disks (the standard non-SSD remote storage in GCP) to run the TPC-DS. We ran each query 5 times and reported the median duration. Fixing the resources also keeps the comparison simple: in general you must weigh duration against cost (what is best between a query that lasts 10 hours and costs $10 and a 1-hour, $200 query?), but with fixed resources the cost of a query is directly proportional to its duration, so comparing durations is enough.
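As a sketch of this protocol, here is roughly how the queries can be timed in PySpark, assuming the TPC-DS tables are already registered as views and each query lives in its own .sql file. The directory layout and the collect() action are illustrative (many TPC-DS queries end with LIMIT 100, so collecting results is cheap); this is not our exact harness.

```python
import time
from pathlib import Path
from statistics import median

RUNS = 5  # we ran each query 5 times and reported the median duration

def run_benchmark(spark, query_dir="tpcds/queries"):  # hypothetical layout
    durations = {}
    for sql_file in sorted(Path(query_dir).glob("*.sql")):
        query = sql_file.read_text()
        timings = []
        for _ in range(RUNS):
            start = time.monotonic()
            spark.sql(query).collect()  # force full execution of the query
            timings.append(time.monotonic() - start)
        durations[sql_file.stem] = median(timings)
    return durations
```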
The plot below shows the performance of all TPC-DS queries for Kubernetes and YARN. Overall, they show very similar performance: it looks like YARN has the upper hand by a small margin, about 5% on average. Since we ran each query only 5 times, the 5% difference is not statistically significant. And in general, a 5% difference is small compared to other gains you can make, for example by making smart infrastructure choices (instance types, cluster sizes, disk choices), by optimizing your Spark configurations (number of partitions, memory management, shuffle tuning), or by upgrading from Spark 2.4 to Spark 3.0! Our results indicate that Kubernetes has caught up with YARN: there are no significant performance differences between the two anymore. This is a big deal for Spark on Kubernetes. It means that if you need to decide between the two schedulers for your next project, you should focus on other criteria than performance (read "The Pros and Cons of Running Apache Spark on Kubernetes" for our take on it).

Most long queries of the TPC-DS benchmark are shuffle-heavy. Shuffle performance depends on network throughput for machine-to-machine data exchange, and on disk I/O speed, since shuffle blocks are written to the disk on the map side and fetched from there on the reduce side. When the amount of shuffled data is high, shuffle becomes the dominant factor in query duration; in this zone, there is a clear correlation between shuffle and performance. Whichever scheduler you choose, tuning the infrastructure is therefore key so that the exchange of shuffle data is as fast as possible.

Here are simple but critical recommendations for when your Spark app suffers from long shuffle times. First, be careful with your choice of disks. We used standard persistent disks; these are not co-located with the instances, so any I/O operations on them count towards your instance network limit caps and are generally slower, while local SSDs have been benchmarked to be the fastest option. Second, be careful with disk sizing, as the throughput of remote disks typically scales with their size. In the plot below, we illustrate the impact of a bad choice of disks: it shows the increase in the duration of the different queries when reducing the disk size from 500GB to 100GB. Duration is 4 to 6 times longer for shuffle-heavy queries! Finally, there is a little configuration gotcha when running Spark on Kubernetes: to let Spark use a mounted disk (such as a local SSD) for its shuffle files, you must mount it as a hostPath volume whose name starts with spark-local-dir-.
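A minimal sketch of that configuration, assuming Spark 3.x, submission to Kubernetes as in the earlier snippet, and a local SSD exposed on each node at /mnt/disks/ssd0; the volume name and paths are illustrative. In cluster mode, pass the same keys as --conf flags to spark-submit.

```python
from pyspark.sql import SparkSession

# Mount a node-local SSD into the executors as a hostPath volume. Because the
# volume name starts with "spark-local-dir-", Spark uses it for local storage
# (shuffle blocks, spills) instead of the default emptyDir volume.
prefix = "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1"

spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.example.com:443")        # same hypothetical cluster as above
    .config("spark.kubernetes.container.image", "example.com/spark:3.0.0")
    .config(f"{prefix}.mount.path", "/tmp/spark-local-dir")    # path inside the executor pod
    .config(f"{prefix}.mount.readOnly", "false")
    .config(f"{prefix}.options.path", "/mnt/disks/ssd0")       # path on the Kubernetes node
    .getOrCreate()
)
```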
While running our benchmarks, we have also learned a great deal about the performance improvements in the newly born Spark 3.0: it brings substantial performance improvements over Spark 2.4, and we'll show these in a future blog post. This work is also our first step towards building Data Mechanics Delight, the new and improved Spark UI.

To sum up: with a standard benchmark, we have demonstrated that Kubernetes has caught up with YARN in terms of performance, and we have shared with you what we consider the most important I/O and shuffle optimizations so you can reproduce our results and be successful with Spark on Kubernetes. We hope you will find this useful! At Data Mechanics, we run Spark on a dedicated Kubernetes cluster provisioned within each customer's cloud account, with intuitive user interfaces, dynamic optimizations, and custom integrations.