Scale and Accelerate
Recommender Systems on GPUs

NVIDIA Merlin is an open source library designed to accelerate recommender systems on NVIDIA GPUs. It enables data scientists, machine learning engineers, and researchers to build high-performing recommenders at scale. Merlin includes tools to address common ETL, training, and inference challenges. Each stage of the Merlin pipeline is optimized to support hundreds of terabytes of data, which is all accessible through easy-to-use APIs. With Merlin, better predictions and increased click-through rates are within reach.

Accelerated ETL

As the ETL component of the Merlin ecosystem, NVTabular is a feature engineering and preprocessing library for tabular data. It is designed to quickly and easily manipulate the terabyte-scale datasets used to train deep learning based recommender systems. NVTabular uses RAPIDS' Dask-cuDF to perform GPU-accelerated transformations.
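
A minimal sketch of an NVTabular workflow, assuming a click-log-style dataset (the column names and paths here are hypothetical; Categorify, FillMissing, and Normalize are standard NVTabular ops):

```python
# Hypothetical column layout for a click-log dataset.
CATEGORICAL_COLUMNS = ["user_id", "item_id"]
CONTINUOUS_COLUMNS = ["price", "age"]
LABEL_COLUMNS = ["click"]

def preprocess(input_path: str, output_path: str) -> None:
    # Imported lazily so the sketch can be read without the GPU stack installed.
    import nvtabular as nvt

    # Encode categoricals as contiguous integer ids; fill missing values and
    # z-normalize continuous features. ">>" chains ops into a computation graph.
    cat_features = CATEGORICAL_COLUMNS >> nvt.ops.Categorify()
    cont_features = (
        CONTINUOUS_COLUMNS >> nvt.ops.FillMissing() >> nvt.ops.Normalize()
    )

    workflow = nvt.Workflow(cat_features + cont_features + LABEL_COLUMNS)

    # nvt.Dataset iterates over Parquet files in GPU-sized chunks, so the
    # full dataset never has to fit in memory at once.
    dataset = nvt.Dataset(input_path, engine="parquet")
    workflow.fit_transform(dataset).to_parquet(output_path)
```

The fitted workflow records the statistics (category mappings, means, standard deviations) it computed, so the same transformation can later be applied at inference time.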

Read more about NVTabular’s features

Accelerated Training

When training deep learning recommender system models, data loading can be a bottleneck. Merlin accelerates training with data loaders, built on RAPIDS' cuDF and Dask-cuDF, that read Parquet files asynchronously. These loaders can speed up existing TensorFlow and PyTorch training pipelines, or be used with HugeCTR to train deep learning recommender systems written in CUDA C++.
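
As a sketch of how such a loader plugs into a Keras training loop (the column names are hypothetical, and KerasSequenceLoader is taken from NVTabular's TensorFlow data loader API; check the current docs for the exact import path in your release):

```python
BATCH_SIZE = 65536  # very large batches are typical for GPU data loading

def make_loader(parquet_paths):
    # Imported lazily: requires NVTabular and TensorFlow with GPU support.
    from nvtabular.loader.tensorflow import KerasSequenceLoader

    # The loader reads Parquet chunks asynchronously on the GPU and can be
    # passed to model.fit() like any other Keras data sequence.
    return KerasSequenceLoader(
        parquet_paths,
        batch_size=BATCH_SIZE,
        label_names=["click"],            # hypothetical target column
        cat_names=["user_id", "item_id"], # hypothetical categorical columns
        cont_names=["price", "age"],      # hypothetical continuous columns
        shuffle=True,
    )
```

Because batches are assembled on the GPU rather than in host memory, the data loader keeps the GPU fed instead of leaving it idle between batches.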

Read more about accelerated training

Accelerated Inference

Both NVTabular and HugeCTR support Triton Inference Server for GPU-accelerated inference. Triton Inference Server is open source inference serving software that simplifies deploying trained AI models from any framework to production at scale. An NVTabular ETL workflow and a trained deep learning model can be deployed to production together in only a few steps.
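
Concretely, serving a model with Triton amounts to placing the exported model and a config.pbtxt file in a model repository directory. A hedged sketch (the model name, backend, and batch size here are illustrative, not Merlin defaults):

```
# Hypothetical Triton model repository layout:
#   models/
#   └── deepfm/
#       ├── config.pbtxt          <- the file below
#       └── 1/
#           └── model.savedmodel/ <- exported model
name: "deepfm"
platform: "tensorflow_savedmodel"
max_batch_size: 65536
```

Triton is then started with `tritonserver --model-repository=/models` and serves the model over HTTP and gRPC.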

Read more about inference from examples

Getting Started

It is easy to get started with Merlin. There are many examples and blog posts to reference.

Try Now Online

Try on Kaggle with:
GPU-accelerated ETL with NVTabular
Accelerated training pipelines in PyTorch and FastAI

Try Our Notebook Examples

NVTabular and HugeCTR both provide collections of examples based on a variety of publicly available recommender system datasets. Check out the NVTabular notebooks and HugeCTR notebooks.

Pull Our Docker Container

Merlin publishes Docker containers with the latest release pre-installed on NVIDIA's NGC catalog. Pull a container and try out Merlin yourself.
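
For illustration, pulling and running one of the Merlin images might look like the following (the image name follows the NGC catalog's merlin namespace, but the exact tag varies by release, so treat it as a placeholder):

```
# Pull a Merlin image from NGC (replace <release-tag> with a current tag).
docker pull nvcr.io/nvidia/merlin/merlin-tensorflow:<release-tag>

# Run with GPU access; publish port 8888 if you start JupyterLab inside.
docker run --gpus all --rm -it -p 8888:8888 \
    nvcr.io/nvidia/merlin/merlin-tensorflow:<release-tag>
```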

See The Latest Docs

Access the current installation instructions, guides, and tutorials in the latest documentation for NVTabular and HugeCTR.

Read Our Blogs

Learn more about recommender systems and Merlin on our Blog.

Accelerate ETL with NVTabular


NVTabular scales ETL across multiple GPUs and nodes. It processes the Criteo 1TB Click Logs dataset, the largest publicly available recommendation dataset (1.3TB of uncompressed click logs and roughly 4 billion samples), in 13.8 minutes on a single GPU and 1.9 minutes on eight GPUs. By comparison, the original NumPy script takes 5 days (7,200 minutes) and an optimized Spark cluster takes 3 hours (180 minutes), so NVTabular delivers a 13x to 95x speedup over the Spark baseline.
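
A quick back-of-the-envelope check of those speedup figures against the optimized Spark baseline:

```python
# Reported processing times for the Criteo 1TB Click Logs benchmark.
numpy_minutes = 7200.0     # original NumPy script (5 days)
spark_minutes = 180.0      # optimized Spark cluster (3 hours)
one_gpu_minutes = 13.8     # NVTabular, single GPU
eight_gpu_minutes = 1.9    # NVTabular, eight GPUs

# The quoted speedups are relative to the optimized Spark baseline.
speedup_one_gpu = spark_minutes / one_gpu_minutes      # ~13x
speedup_eight_gpus = spark_minutes / eight_gpu_minutes # ~95x
print(round(speedup_one_gpu), round(speedup_eight_gpus))  # 13 95
```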

Read more in our blogpost

Accelerate DL Training with HugeCTR


MLPerf is a consortium of AI leaders from academia, research labs, and industry whose mission is to "build fair and useful benchmarks" that provide unbiased evaluations of training and inference. HugeCTR on DGX A100 is the fastest commercial solution available to train Facebook's Deep Learning Recommendation Model (DLRM) on 4TB of data. It finishes training in 3.33 minutes and is 13.5x faster than the best CPU-only solution.
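
A quick check of what the 13.5x figure implies for the CPU-only baseline:

```python
hugectr_minutes = 3.33   # HugeCTR on DGX A100, MLPerf DLRM training
speedup = 13.5           # reported advantage over the best CPU-only result

# Implied CPU-only training time: roughly 45 minutes.
cpu_only_minutes = hugectr_minutes * speedup
print(cpu_only_minutes)
```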

Read more in our blogpost

Get Started with NVIDIA Merlin