Making Spark Fly: NVIDIA Accelerates World’s Most Popular Data Analytics Platform

NVIDIA GPU acceleration comes to Apache Spark 3.0.
by Erik Pounds

The world’s most popular data analytics application, Apache Spark, now offers revolutionary GPU acceleration to its more than half a million users through the general availability release of Spark 3.0.

Databricks provides the leading cloud-based enterprise Spark platform, run on over a million virtual machines every day. At the Spark + AI Summit today, Databricks announced that Databricks Runtime 7.0 for Machine Learning features GPU-accelerator aware scheduling with Spark 3.0, developed in collaboration with NVIDIA and other community members.
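For readers curious what GPU-aware scheduling looks like from the user's side, here is a minimal sketch of the Spark 3.0 resource properties that request GPUs for executors and tasks. The property names are Spark's own; the helper functions, the amounts and the discovery script path are illustrative assumptions, not part of Databricks Runtime or any NVIDIA tooling.

```python
# Sketch only: builds the Spark 3.0 properties that ask the scheduler for GPUs.
# The helper names, default amounts and script path are hypothetical.

def gpu_scheduling_conf(gpus_per_executor=1, gpus_per_task=1,
                        discovery_script="/path/to/getGpusResources.sh"):
    """Return Spark properties requesting GPU resources for executors and tasks."""
    if gpus_per_task < 1 or gpus_per_executor < gpus_per_task:
        raise ValueError("need 1 <= gpus_per_task <= gpus_per_executor")
    return {
        # How many GPUs each executor should be allocated.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # How many GPUs each task needs before it can be scheduled.
        "spark.task.resource.gpu.amount": str(gpus_per_task),
        # Script the executor runs to report which GPUs it can see.
        "spark.executor.resource.gpu.discoveryScript": discovery_script,
    }


def to_conf_flags(conf):
    """Turn a property dict into a flat list of spark-submit --conf flags."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


conf = gpu_scheduling_conf(gpus_per_executor=2, gpus_per_task=1)
print(" ".join(to_conf_flags(conf)))
```

With these settings the scheduler only places a task on an executor that still has a free GPU to hand it, which is the core idea behind accelerator-aware scheduling.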

Google Cloud recently announced the availability of a Spark 3.0 preview on Dataproc image version 2.0, noting the powerful NVIDIA GPU acceleration that’s now possible thanks to the collaboration of the open source community. We’ll be hosting a webinar with Google Cloud on July 16 to dive into these exciting new capabilities for data scientists.

In addition, the new open source RAPIDS Accelerator for Apache Spark is now available to accelerate ETL (extract, transform, load) and data transfers to boost analytics performance from end to end, without any code changes.

Faster performance on Spark not only means faster insights, but also reduced costs since enterprises can complete workloads using less infrastructure.

Accelerated Data Analytics: Scientific Computing Makes Sense of AI

Spark is increasingly in the news for good reason.

Data is essential to helping organizations navigate shifting opportunities and possible threats. But to do so, they need to decipher the critical clues hidden in their data.

Organizations add to their heaps of information every time a customer clicks on a website, hosts a call with customer support or generates a daily sales report. With the rise of AI, data analytics has become critical to helping companies spot trends and stay ahead of changing markets.

Until recently, data analytics relied on small datasets to glean historical insights. That data was highly structured, processed through ETL and stored in traditional data warehouses.

ETL often becomes a bottleneck for data scientists working on AI-based predictions and recommendations. Estimated to take up 70-90 percent of a data scientist’s time, ETL slows down workflows and ties up sought-after talent on the most mundane part of their work.

When a data scientist is waiting for ETL, they’re not retraining their models to gain better business intelligence. Traditional CPU infrastructure can’t scale efficiently to accommodate these workloads, which often causes costs to balloon.

With GPU-accelerated Spark, ETL no longer spells trouble. Industries such as healthcare, entertainment, energy, finance, retail and many others can now cost-effectively accelerate their data analytics insights.

The Power of Parallel Processing for Data Analytics

GPU parallel processing allows computers to work on many operations at the same time. In a data center, these capabilities scale out massively to support complex data analytics projects. With more organizations leveraging AI and machine learning tools, parallel processing has become critical for accelerating data-heavy analytics and the ETL pipelines that drive these workloads.

Consider a retailer seeking to predict what to stock for next season. It would need to examine recent sales as well as last year’s data. A savvy data scientist might add weather models to this analysis to see what impact a wet or dry season would have on the results. They may also integrate sentiment analysis data to assess what trends are most popular this year.

With so many sources of data to analyze, speed is critical to modeling the impact that different variables might have on sales. This is where analytics moves into machine learning, and where GPUs become essential.

RAPIDS Accelerator Supercharges Apache Spark 3.0

As data scientists shift from using traditional analytics to AI applications that better model complex market demands, CPU-based processing can’t keep up without compromising either speed or cost. The growing adoption of AI in analytics has created the need for a new framework to process data quickly and cost-efficiently with GPUs.

The new RAPIDS Accelerator for Apache Spark connects the Spark distributed computing framework to the powerful RAPIDS cuDF library to enable GPU acceleration of Spark DataFrame and Spark SQL operations. The RAPIDS Accelerator also speeds up Spark Shuffle operations by finding the fastest path to move data between Spark nodes.
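To make the "no code changes" idea concrete, here is a rough sketch of how an existing Spark job might be launched with the accelerator switched on. The plugin class and the `spark.rapids.sql.enabled` property come from the RAPIDS Accelerator; the helper function, the jar file names and the `etl_job.py` script are placeholders assumed for illustration, and real jar names carry version numbers.

```python
import shlex

# Sketch only: assembles a spark-submit command that loads the RAPIDS
# Accelerator as a Spark plugin. Jar names and the app script are placeholders.

def rapids_submit_command(app, jars, extra_conf=None):
    """Build a spark-submit argument list with the RAPIDS plugin enabled."""
    conf = {
        # Registers the accelerator with Spark's plugin mechanism.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Lets the plugin run supported SQL and DataFrame operations on the GPU.
        "spark.rapids.sql.enabled": "true",
    }
    conf.update(extra_conf or {})

    cmd = ["spark-submit", "--jars", ",".join(jars)]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)  # the application itself is passed through unchanged
    return cmd


command = rapids_submit_command(
    "etl_job.py",                           # hypothetical existing Spark job
    ["rapids-4-spark.jar", "cudf.jar"],     # placeholder jar names
)
print(shlex.join(command))
```

Everything GPU-specific lives in the launch configuration, so the DataFrame and SQL code inside the job stays exactly as it was written for CPUs.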

Visit the GitHub page to access the RAPIDS Accelerator for Apache Spark.

Watch Spark 3.0 sprint on GPUs in this video demo:

To learn more about the Spark 3.0 release, visit the Apache Software Foundation.

Data scientists can learn more about Spark 3.0 in our free Spark 3.0 e-book.