GPU Versions and Other Supplementary Material

This page collects information about GPU ports and other supplementary material in a single place for the Tier-1 benchmarks. The information here is meant to help vendors better understand how various applications scale and how they have been ported to GPUs in the past. All information and code on this site is provided as is and there is no warranty that it will work as advertised.

CORAL 2 Memory Requirement Estimates (.xlsx)


There is no "official" GPU version of Nekbone; however, there is a GPU branch here:

This version and the CORAL2 code are on different branches, so they differ in more than just the GPU code, and the GPU implementation is not optimal. GPU code from the GPU branch could likely be ported to the CORAL/OpenMP branch without large-scale effort; the optimization work required is primarily for the local_grad3 and local_grad3_t kernels.


The "experimental" version of the HACC CUDA implementation can be found at the following link:

The README file has not yet been updated; this will be done later.


The base benchmark version of QMCPack already contains GPU support.


Summary slides for LAMMPS presented in deep-dives

The output script from the benchmark run on Sequoia

Benchmarking data for LAMMPS ReaxFF


There is no GPU support in AMG. There is GPU support in hypre, which is the parent code to AMG. GPU code is included in hypre release v. 2.13.0. Full GPU support on the solve cycle requires a few special settings:

HYPRE_BoomerAMGSetRelaxType (<>, 18);


It also needs to be configured in a special way. Both configure lines are for P8+Pascal systems at LLNL.

./configure --with-nvcc CFLAGS="-O2 -qmaxmem=-1 -I /usr/local/cuda/include" CXXFLAGS="-O2 -qmaxmem=-1 -I /usr/local/cuda/include"

To also enable OpenMP:

./configure --with-nvcc --enable-persistent --with-openmp --enable-hopscotch CFLAGS="-O2 -qmaxmem=-1 -qsmp=omp -I /usr/local/cuda/include" CXXFLAGS="-O2 -qmaxmem=-1 -qsmp=omp -I /usr/local/cuda/include" LDFLAGS="-qsmp=omp"
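The two configure invocations above share most of their flags, so one way to keep them consistent is to assemble them from common pieces. The following is a minimal sketch, not part of the hypre distribution: the variable names (CUDA_INC, BASE_FLAGS, OMP_FLAG, GPU_CMD, OMP_CMD) are illustrative assumptions, and the include path is the one quoted above, which may need to be changed on other systems.

```shell
#!/bin/sh
# Sketch only: build both configure command lines from shared pieces.
# All variable names here are illustrative, not defined by hypre.
CUDA_INC="/usr/local/cuda/include"   # path quoted in the text; adjust per system
BASE_FLAGS="-O2 -qmaxmem=-1"         # compiler flags common to both variants
OMP_FLAG="-qsmp=omp"                 # OpenMP flag used in the second variant

# GPU-only variant
GPU_FLAGS="${BASE_FLAGS} -I ${CUDA_INC}"
GPU_CMD="./configure --with-nvcc CFLAGS=\"${GPU_FLAGS}\" CXXFLAGS=\"${GPU_FLAGS}\""

# GPU + OpenMP variant
OMP_FLAGS="${BASE_FLAGS} ${OMP_FLAG} -I ${CUDA_INC}"
OMP_CMD="./configure --with-nvcc --enable-persistent --with-openmp --enable-hopscotch CFLAGS=\"${OMP_FLAGS}\" CXXFLAGS=\"${OMP_FLAGS}\" LDFLAGS=\"${OMP_FLAG}\""

# Print the commands for inspection rather than running them directly.
printf '%s\n' "$GPU_CMD"
printf '%s\n' "$OMP_CMD"
```

The script only prints the commands. To run one, something like `eval "$GPU_CMD"` from the directory that contains the configure script would be needed, after checking that the printed line matches the intended build.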

Hypre currently requires unified memory to work correctly.

While there are some differences between AMG and AMG2013, most of the underlying code and algorithms are similar, and the AMG2013 references provide more details for people interested in the performance and scalability of the code.


In addition to the information about Kripke that can be found on the Kripke website, including an overview paper of the code, we provide some supplementary material.

A CUDA port of Kripke can be found on GitHub. The code is a research variant and may be hard to work with and understand. It is provided as is.


The Quicksilver benchmark code contains a CUDA port and an OpenMP 4.5 GPU port. Instructions to build these ports are given in the makefile. Paths may need to be adjusted on some systems. Unified memory is assumed.

The code is of late-beta quality with no currently known bugs, though there is no promise that none exist. Additionally, performance is likely sub-optimal, as little work has been done to tune the code for GPUs.

A paper showing performance on modern hardware, discussing how representative Quicksilver is of its parent code Mercury, and describing the changes needed to port the original version of Quicksilver to GPUs can be found here. The related slides are here; the second half contains performance data.


A paper describing Pennant, how it has been parallelized for various architectures (including GPUs), and performance results can be found here. Here is the source code for a single-node GPU/CUDA implementation.

Big Data Analytics Suite

An open source port of the algorithms in this suite can be found here.