Scale Python with Dask on GPUs


Scale Python with Ease

The pandas, NumPy, and scikit-learn packages are efficient, intuitive, and widely trusted, but they weren't designed to scale beyond a single machine. Dask is an open-source tool that scales Python packages to multiple machines. Developed by core contributors to NumPy, pandas, scikit-learn, and Jupyter, Dask is freely available and deployed in production across numerous Fortune 500 companies.

Why Use Dask?

Scalable

Code needs to be written only once; it can then be deployed locally or to a multi-node cluster using comfortable Pythonic syntax, with no code rewrites or retraining required.

Resilient

Dask handles the failure of worker nodes gracefully, and it is elastic, able to take advantage of new nodes added on the fly.

Simple to Implement

Dask requires no configuration and no setup. Adding even a single machine to a computation introduces very little cognitive overhead.

Getting Started

Getting started with Dask is easy. The project is well supported by many tutorials, quick-start guides, and cheat sheets.

Working with Apache Spark?

Dask interoperates with Apache Spark and its ecosystem, though there are some fundamental differences between the two.

Get an Overview

Listen to the TalkPython podcast "Parallelizing Computation with Dask."

See How Dask Works

Watch presentations from the Dask YouTube channel.

Access Dask Docs

See the latest documentation from Dask.

Check Out Dask Tutorials

Explore Dask tutorials on GitHub, and see Dask code examples on dask.org and Binder.

Dask on GPUs

Dask can distribute data and computation over multiple GPUs, either in the same system or in a multi-node cluster. Dask integrates with RAPIDS cuDF, XGBoost, and RAPIDS cuML for GPU-accelerated data analytics and machine learning.
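
As a minimal sketch, the dask-cuda package can start one Dask worker per GPU on a single machine (the cluster options are left at their defaults here):

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Launch one Dask worker for each GPU visible on this machine
cluster = LocalCUDACluster()

# Connect a client; subsequent Dask work is scheduled across the GPUs
client = Client(cluster)
print(client)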

Dataframe and ETL Integration

The RAPIDS cuDF library provides a GPU-backed dataframe class that replicates the popular pandas API. It includes extremely high-performance functions to load CSV, JSON, ORC, Parquet and other file formats directly into GPU memory, eliminating one of the key bottlenecks in many data processing tasks. cuDF includes a variety of other functions supporting GPU-accelerated ETL, such as data subsetting, transformations, one-hot encoding, and more. The RAPIDS team maintains a dask-cudf library that includes helper methods to integrate Dask and cuDF.
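
For illustration, here is a hedged sketch of loading CSV data into GPU memory with dask-cudf (the file path and column names are hypothetical):

import dask_cudf

# Read a partitioned CSV dataset directly into GPU memory,
# one cuDF partition per chunk
df = dask_cudf.read_csv("data/transactions-*.csv")

# pandas-style ETL runs on the GPU; compute() triggers execution
totals = df.groupby("customer_id")["amount"].sum().compute()
print(totals.head())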

XGBoost Integration

XGBoost, the popular open source machine learning library for gradient boosting, now includes integrated support for Dask. Users can partition data across nodes using Dask’s standard data structures, build a DMatrix on each GPU using xgboost.dask.create_worker_dmatrix, and then launch training through xgboost.dask.run. See the xgboost.dask documentation or the Dask+XGBoost GPU example code for more details.

Note: This support is currently available in custom builds, and it is expected to be included in the next official release of XGBoost after 0.90. New users should check out the 10 Minutes to Dask-XGBoost guide to get started quickly.
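
The function names above belong to the pre-release integration; in the stable xgboost.dask API that shipped in releases after 0.90, the equivalent flow looks roughly like this sketch (the toy data and parameters are assumptions for illustration):

import dask.array as da
import xgboost as xgb
from dask.distributed import Client

client = Client()  # connect to a running Dask cluster

# Hypothetical toy data, partitioned across workers
X = da.random.random((1000, 20), chunks=(250, 20))
y = da.random.randint(0, 2, size=1000, chunks=250)

# Build a distributed DMatrix, then launch training across the cluster
dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(
    client,
    {"objective": "binary:logistic", "tree_method": "hist"},
    dtrain,
    num_boost_round=10,
)
booster = output["booster"]  # the trained model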

Integration With Other Machine Learning Algorithms

For other machine learning work on GPUs, the dask-cuml library provides a bridge to the RAPIDS cuML package. RAPIDS cuML implements popular machine learning algorithms, including clustering, dimensionality reduction, and regression, with high-performance GPU-based implementations offering speedups of up to 100x over CPU-based approaches. cuML replicates the scikit-learn API, so it integrates well with projects like Dask that include scikit-learn support. Currently, dask-cuml supports distributed clustering and regression algorithms, with new algorithms being added over time.
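
As a single-GPU illustration of the scikit-learn-style API that dask-cuml distributes, here is a minimal sketch (the toy data and cluster count are hypothetical):

import cudf
from cuml.cluster import KMeans

# A small toy dataframe resident in GPU memory
df = cudf.DataFrame({"x": [1.0, 1.5, 10.0, 10.5],
                     "y": [1.0, 1.5, 10.0, 10.5]})

# The same fit/labels pattern as scikit-learn, executed on the GPU
model = KMeans(n_clusters=2)
model.fit(df)
print(model.labels_)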

Example Notebooks

The RAPIDS Notebooks Extended repository includes several end-to-end examples that use Dask for distributed, GPU-accelerated computation. Here are a few from the collection to get started with.

The Linear Regression with Dask+cuML notebook shows a simple example of how to get started with distributed machine learning. Go to notebook

The end-to-end mortgage example notebook uses Fannie Mae data to predict mortgage delinquency (a classification problem). Go to notebook

The NYC Taxi End-to-End notebook uses trip data to predict New York City taxi fares (a regression problem). Go to notebook

Use Cases

Dask is widely and routinely used, running on everything from laptops to thousand-machine clusters in-house, in the cloud, and on high-performance computing (HPC) supercomputers. Its ability to process hundreds of terabytes of data efficiently makes it a powerful tool in three key areas. See how Dask is being used across industries by reading stories from other Dask users and exploring specific examples of their work.

Retail

Data science and DevOps teams in the retail industry use Dask to scale their pipelines, taking pandas and machine learning workloads to terabytes of data with ease. The Dask interface makes it easy to load terabytes of tabular data, transform the data in parallel with libraries like pandas or RAPIDS cuDF, and pass it off to machine learning training libraries at scale, as the sketch below illustrates. See how one major retailer is using RAPIDS and Dask to generate more accurate forecasting models.
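
A minimal sketch of that pattern with dask.dataframe (the dataset path and column names are hypothetical):

import dask.dataframe as dd

# Lazily load a large, partitioned tabular dataset
df = dd.read_parquet("s3://bucket/sales/*.parquet")

# Familiar pandas-style transformations, executed in parallel
per_store = df.groupby("store_id")["sales"].sum()

# Trigger the distributed computation
print(per_store.compute().head())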

HPC

Dask makes it easy to quickly analyze large, multi-dimensional datasets. It provides the same interactivity and ease of development as a system like Spark but is much better aligned with scientific processing, offering native code execution, direct integration with schedulers like SLURM and PBS, and data models that fit multi-dimensional and spatial workloads. It's also well tuned for high-performance networks and accelerator hardware.

Financial Services

Many financial institutions have large, complex codebases that encode significant business logic, but they now need parallelism. Their systems are more complex than just “a big pandas dataframe” or “a big NumPy array.” These groups use Dask’s lower-level APIs (Delayed, Futures) to add task scheduling and parallelism as a lightweight way to scale out their systems without costly rewrites.
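
As a minimal sketch, existing functions can be wrapped with dask.delayed to build a parallel task graph without restructuring the code (the functions below are stand-ins for real business logic):

import dask

@dask.delayed
def load(i):
    # Stand-in for loading one slice of data
    return list(range(i))

@dask.delayed
def process(data):
    # Stand-in for existing business logic
    return sum(data)

# Build a lazy task graph; nothing has executed yet
partials = [process(load(i)) for i in range(4)]
total = dask.delayed(sum)(partials)

# Execute the whole graph in parallel
print(total.compute())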

Dask Libraries

Dask provides advanced parallelism for data science, enabling performance at scale for popular Python tools. Below are the key components of the Dask ecosystem that make it unique.

User Interfaces

Dask collections, including DataFrame, Array, Delayed, and Futures, provide the underlying parallel computing machinery to scale workloads. Each comes with a purpose-built set of parallel algorithms and a familiar programming style.
Learn more
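
For example, the Array collection mirrors the NumPy interface while chunking work for parallel execution (the array size and chunking below are arbitrary):

import dask.array as da

# A 10,000 x 10,000 array split into 1,000 x 1,000 chunks
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# NumPy-style operations build a lazy, parallel computation
y = (x + x.T).mean(axis=0)

# compute() executes the chunks in parallel
print(y[:5].compute())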

Machine Learning

Dask-ML provides scalable machine learning in Python using Dask with popular machine learning libraries, such as scikit-learn.
Learn more
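
As a hedged sketch, Dask-ML's estimators follow the familiar scikit-learn fit/predict interface while operating on chunked Dask arrays (the toy data is an assumption):

import dask.array as da
from dask_ml.linear_model import LogisticRegression

# Hypothetical toy data stored as chunked Dask arrays
X = da.random.random((1000, 5), chunks=(200, 5))
y = (X.sum(axis=1) > 2.5).astype(int)

# scikit-learn-style training, scaled across the chunks
clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict(X[:5]).compute())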

Parallel Execution

Dask uses task graphs to optimize and execute work in parallel. After Dask generates task graphs, it executes them on parallel hardware with a distributed task scheduler.
Learn more
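
To make the idea concrete, every lazy operation extends a task graph that a scheduler later executes (the toy computation is arbitrary):

import dask.array as da

# Build a lazy computation; nothing runs yet
x = da.ones((4, 4), chunks=(2, 2))
y = (x + 1).sum()

# The collection carries a task graph keyed by task name
print(len(dict(y.__dask_graph__())))  # number of tasks

# compute() hands the graph to a scheduler for parallel execution
print(y.compute())  # 32.0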

Deployment

Dask integrates with existing cluster management tooling like Kubernetes and YARN (Hadoop/Spark), as well as HPC schedulers like SLURM, PBS, and LSF, enabling workloads to scale elastically as computational demand bursts and subsides.
Learn more
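
For instance, here is a hedged sketch of launching Dask workers through a SLURM scheduler with the dask-jobqueue package (the queue name and resource sizes are assumptions):

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Each SLURM job requests 8 cores and 32 GB on a hypothetical "normal" queue
cluster = SLURMCluster(cores=8, memory="32GB", queue="normal")

# Submit 10 worker jobs to the scheduler
cluster.scale(jobs=10)

client = Client(cluster)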

Visual Interactivity

The Dask interactive dashboard gives real-time feedback during computations, pointing users to hot spots in their code. Its views of load balancing, communication, thread-level profiling, and more help users exploit their hardware more effectively.
Learn more
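
The dashboard comes up automatically when a distributed client starts; a minimal sketch:

from dask.distributed import Client

# Starting a client launches a local cluster and its dashboard
client = Client()

# Open this URL in a browser to watch tasks, memory, and workers live
print(client.dashboard_link)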

Get Started with Dask