CORAL-2 Benchmarks

Introduction

The CORAL-2 benchmarks contained within this site represent their state as of early December 2017. Over the next few months the list of applications will change. While we expect most of the changes to be additions, it is possible that we will remove applications or change their categories.

For now, links to the benchmark applications' websites are provided. Once we have measured and released a reference figure of merit (FOM) for an application, we will include a static source distribution of that application on this site along with a document describing its benchmarking procedure. Reference FOMs will be run on either Sequoia or Titan.

The CORAL-2 RFP page and the updated FOM spreadsheet can be found here.

See GPU Versions and Other Supplementary Material for more information.

Questions?

For questions about the benchmarks and other information on this site, please contact coral2-benchmarks [at] llnl [dot] gov.

Tier-1 Benchmark Information

Scalable Science Benchmarks

For each benchmark below, the entry lists approximate lines of code, the parallelism exercised (MPI, OpenMP/Pthreads, GPU), the implementation languages, and a brief description.

HACC

Lines of code: 35,000
Parallelism: MPI, OpenMP/Pthreads, GPU
Languages: C++
The Hardware Accelerated Cosmology Code (HACC) framework uses N-body techniques to simulate the formation of structure in collisionless fluids under the influence of gravity in an expanding universe. It depends on an external FFT library and is typically compute limited, achieving 13.92 petaflops (69.2% of machine peak) on Sequoia.
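HACC's production solver splits gravity into an FFT-based long-range part and a short-range part; as a rough illustration of the N-body character of the workload only, here is a minimal direct-sum force kernel (the Particle layout, softening parameter, and OpenMP threading are illustrative assumptions, not HACC's actual data structures):

    // Minimal sketch of a direct-sum N-body gravity kernel (not HACC's actual
    // solver). Illustrates the compute-bound character of the problem.
    #include <cmath>
    #include <vector>

    struct Particle { double x, y, z, mass; };

    // Accumulate the gravitational acceleration on each particle from all others.
    void accumulate_forces(const std::vector<Particle>& p,
                           std::vector<double>& ax,
                           std::vector<double>& ay,
                           std::vector<double>& az,
                           double softening)
    {
        const std::size_t n = p.size();
        #pragma omp parallel for
        for (std::size_t i = 0; i < n; ++i) {
            double fx = 0.0, fy = 0.0, fz = 0.0;
            for (std::size_t j = 0; j < n; ++j) {
                if (j == i) continue;
                const double dx = p[j].x - p[i].x;
                const double dy = p[j].y - p[i].y;
                const double dz = p[j].z - p[i].z;
                const double r2 = dx * dx + dy * dy + dz * dz + softening;
                const double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
                fx += p[j].mass * dx * inv_r3;
                fy += p[j].mass * dy * inv_r3;
                fz += p[j].mass * dz * inv_r3;
            }
            ax[i] = fx; ay[i] = fy; az[i] = fz;
        }
    }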

Nekbone

Lines of code: 48,000
Parallelism: MPI, GPU
Languages: Fortran, C
Nekbone is a mini-app derived from the Nek5000 CFD code, a high-order, incompressible Navier-Stokes solver based on the spectral element method. The conjugate gradient solve is compute intensive and involves small messages and frequent allreduces.
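To illustrate why the conjugate gradient solve involves small messages and frequent allreduces, here is a minimal sketch of one distributed CG iteration; the apply_operator routine is a hypothetical stand-in (here a 1D Laplacian) for Nekbone's spectral-element operator, which is written in Fortran:

    // Sketch of one distributed conjugate-gradient iteration. The two dot
    // products per iteration each require a small MPI_Allreduce, which is the
    // communication pattern the Nekbone description refers to.
    #include <mpi.h>
    #include <vector>

    double dot(const std::vector<double>& a, const std::vector<double>& b)
    {
        double local = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) local += a[i] * b[i];
        double global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return global;  // small (8-byte) reduction, issued every iteration
    }

    // Hypothetical stand-in for the spectral-element operator: a 1D Laplacian.
    void apply_operator(const std::vector<double>& p, std::vector<double>& Ap)
    {
        const std::size_t n = p.size();
        for (std::size_t i = 0; i < n; ++i) {
            const double left  = (i > 0)     ? p[i - 1] : 0.0;
            const double right = (i + 1 < n) ? p[i + 1] : 0.0;
            Ap[i] = 2.0 * p[i] - left - right;
        }
    }

    // One CG step: x, r, p are the usual solution, residual, and search vectors.
    void cg_step(std::vector<double>& x, std::vector<double>& r,
                 std::vector<double>& p, std::vector<double>& Ap, double& rsq)
    {
        apply_operator(p, Ap);
        const double alpha = rsq / dot(p, Ap);      // allreduce #1
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        const double rsq_new = dot(r, r);           // allreduce #2
        const double beta = rsq_new / rsq;
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rsq = rsq_new;
    }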

QMCPACK

Lines of code: 200,000
Parallelism: MPI, OpenMP/Pthreads, GPU
Languages: C, C++
QMCPACK is a many-body ab initio quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids. It is written primarily in C++, and its use of template metaprogramming is known to stress compilers. When run in production, the code is memory bandwidth sensitive, while still needing thread efficiency to realize good performance.

LAMMPS

Lines of code: 500,000
Parallelism: MPI, OpenMP/Pthreads, GPU
Languages: C, C++
LAMMPS is a classical molecular dynamics code. Performance limiters will depend on the problem chosen and could include compute, memory bandwidth, network bandwidth, and network latency.
 
Throughput Benchmarks

AMG

Lines of code: 65,000
Parallelism: MPI, OpenMP/Pthreads, GPU
Languages: C
AMG is a parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids. AMG is memory-access bound, generates many small messages, and stresses memory and network latency.
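As a rough sketch of why an algebraic multigrid cycle is memory-access bound and latency sensitive on its coarse levels, here is a minimal recursive V-cycle; the smooth, restrict_residual, prolong_and_correct, and coarse_solve helpers are hypothetical placeholders, not the AMG benchmark's routines:

    // Minimal sketch of a multigrid V-cycle. Each level does a little
    // arithmetic over a lot of memory, which is why AMG is memory-access
    // bound; the coarse levels hold very little work, so their halo
    // exchanges become many small, latency-sensitive messages.
    #include <vector>

    struct Level {
        std::vector<double> x;   // solution on this level
        std::vector<double> b;   // right-hand side on this level
        std::vector<double> r;   // residual on this level
    };

    void smooth(Level& lvl);                                     // e.g. a few Jacobi sweeps (hypothetical)
    void restrict_residual(const Level& fine, Level& coarse);    // r_fine -> b_coarse (hypothetical)
    void prolong_and_correct(const Level& coarse, Level& fine);  // x_fine += P * x_coarse (hypothetical)
    void coarse_solve(Level& lvl);                               // direct solve on the coarsest grid (hypothetical)

    void v_cycle(std::vector<Level>& levels, std::size_t l)
    {
        if (l + 1 == levels.size()) {      // coarsest level
            coarse_solve(levels[l]);
            return;
        }
        smooth(levels[l]);                 // pre-smoothing
        restrict_residual(levels[l], levels[l + 1]);
        v_cycle(levels, l + 1);            // recurse on the coarser grid
        prolong_and_correct(levels[l + 1], levels[l]);
        smooth(levels[l]);                 // post-smoothing
    }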

Kripke

 

Lines of code: 4,000
Parallelism: MPI, OpenMP/Pthreads, GPU
Languages: C++
Kripke is a structured deterministic (Sn) transport mini-app written using RAJA. It contains wavefront algorithms that stress memory latency and/or bandwidth, as well as network latency.
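A minimal sketch of the wavefront dependency pattern, assuming a simple 2D structured grid and a stand-in update formula; Kripke's actual sweeps are 3D, per-direction and per-group, and are written with RAJA:

    // Minimal sketch of a 2D wavefront ("sweep") update on a structured grid.
    // Each cell depends on its upwind (left and below) neighbors, so cells on
    // the same anti-diagonal can be updated in parallel while successive
    // diagonals must be processed in order.
    #include <algorithm>
    #include <vector>

    void sweep(std::vector<double>& psi, int nx, int ny)
    {
        auto at = [&](int i, int j) -> double& { return psi[j * nx + i]; };
        // March over anti-diagonals d = i + j.
        for (int d = 0; d < nx + ny - 1; ++d) {
            const int i_lo = std::max(0, d - (ny - 1));
            const int i_hi = std::min(d, nx - 1);
            #pragma omp parallel for
            for (int i = i_lo; i <= i_hi; ++i) {
                const int j = d - i;
                const double upwind_x = (i > 0) ? at(i - 1, j) : 0.0;
                const double upwind_y = (j > 0) ? at(i, j - 1) : 0.0;
                at(i, j) = 0.5 * (upwind_x + upwind_y);  // stand-in for the transport update
            }
        }
    }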

Quicksilver

 

Lines of code: 10,000
Parallelism: MPI, OpenMP/Pthreads, GPU
Languages: C++
Quicksilver is a Monte Carlo transport benchmark with multi-group cross-section lookups. It stresses memory latency and branch-heavy control flow, and contains one large kernel that is thousands of lines long.
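A minimal sketch of the kind of particle-history loop such a code runs, with hypothetical flattened cross-section tables; the data-dependent lookups and event branching are what stress memory latency and branch-heavy control flow (this is not Quicksilver's actual kernel):

    // Minimal sketch of a Monte Carlo particle-history loop with multi-group
    // cross-section lookups. The table indices depend on each particle's
    // current energy group and material, so the loads are data dependent
    // (memory-latency bound) and the event selection is branch heavy.
    #include <cstdint>
    #include <random>
    #include <vector>

    struct Particle { int group; int material; bool alive; };

    void track_particles(std::vector<Particle>& particles,
                         const std::vector<double>& absorb_xs,   // [material][group], flattened
                         const std::vector<double>& scatter_xs,  // [material][group], flattened
                         int num_groups,
                         std::uint64_t seed)
    {
        std::mt19937_64 rng(seed);
        std::uniform_real_distribution<double> uniform(0.0, 1.0);

        for (Particle& p : particles) {
            while (p.alive) {
                const std::size_t idx =
                    static_cast<std::size_t>(p.material) * num_groups + p.group;
                const double sigma_a = absorb_xs[idx];    // data-dependent lookup
                const double sigma_s = scatter_xs[idx];   // data-dependent lookup
                const double total = sigma_a + sigma_s;
                if (uniform(rng) * total < sigma_a) {
                    p.alive = false;                      // absorption ends the history
                } else if (p.group + 1 < num_groups) {
                    ++p.group;                            // downscatter to the next group
                } else {
                    p.alive = false;                      // thermal cutoff (simplified)
                }
            }
        }
    }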

PENNANT

 

Lines of code: 3,300
Parallelism: MPI, OpenMP/Pthreads, GPU
Languages: C++
PENNANT is a mini-app for hydrodynamics on general unstructured meshes in 2D (arbitrary polygons). It makes heavy use of indirect addressing and irregular memory access patterns.
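A minimal sketch of the gather/scatter indirection an unstructured-mesh code performs, assuming hypothetical side-to-node connectivity arrays rather than PENNANT's actual data structures:

    // Minimal sketch of indirect addressing on an unstructured mesh: each mesh
    // "side" gathers values from its two endpoint nodes through an index array
    // and scatters a contribution back. The node indices are irregular, so the
    // loads and stores are cache unfriendly.
    #include <vector>

    void accumulate_side_forces(const std::vector<int>& side_node1,   // node index of first endpoint
                                const std::vector<int>& side_node2,   // node index of second endpoint
                                const std::vector<double>& node_pressure,
                                std::vector<double>& node_force)
    {
        const std::size_t num_sides = side_node1.size();
        for (std::size_t s = 0; s < num_sides; ++s) {
            const int n1 = side_node1[s];
            const int n2 = side_node2[s];
            // Gather through the index arrays (irregular reads).
            const double p_avg = 0.5 * (node_pressure[n1] + node_pressure[n2]);
            // Scatter back to the nodes (irregular writes; a real code must
            // avoid write conflicts when threading this loop).
            node_force[n1] += p_avg;
            node_force[n2] -= p_avg;
        }
    }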
 
Data Science and Deep Learning Benchmarks

Big Data Analytic Suite

Lines of code: 640
The big data analytic suite contains the K-Means observation label, PCA, and SVM benchmarks.

Deep Learning Suite

Lines of code: 1,100
The deep learning suite contains: convolutional neural networks (CNNs) that comprise convolutional layers followed by fully connected layers; an LSTM recurrent neural network (RNN) architecture that remembers values over arbitrary intervals for temporal and time-series prediction; and distributed training code for classification on the ImageNet data set at scale. Finally, the CANDLE benchmark codes implement deep learning architectures that are relevant to problems in cancer. These architectures address problems at different biological scales, specifically the molecular, cellular, and population scales. We will use two diverse benchmark problems: a) P1B1, a sparse autoencoder that compresses an expression profile into a low-dimensional vector, and b) P3B1, a multi-task deep neural network for data extraction from clinical reports.
 
Skeleton Benchmarks

CORAL MPI Benchmarks

Lines of code: 1,000
Parallelism: MPI
Languages: C
Subsystem functionality and performance tests. A collection of independent MPI benchmarks to measure various aspects of MPI performance, including interconnect messaging rate, latency, aggregate bandwidth, and collective latencies.
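A minimal sketch of one such measurement, a two-rank ping-pong latency test; the message size, iteration count, and output are illustrative choices, not the suite's actual parameters:

    // Minimal ping-pong latency sketch between ranks 0 and 1. The actual CORAL
    // MPI benchmarks also measure messaging rate, aggregate bandwidth, and
    // collective latencies.
    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iterations = 1000;
        const int msg_size = 8;            // small message: latency dominated
        std::vector<char> buf(msg_size);

        MPI_Barrier(MPI_COMM_WORLD);
        const double start = MPI_Wtime();
        for (int i = 0; i < iterations; ++i) {
            if (rank == 0) {
                MPI_Send(buf.data(), msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf.data(), msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf.data(), msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf.data(), msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        const double elapsed = MPI_Wtime() - start;

        if (rank == 0)
            std::printf("one-way latency: %.3f us\n",
                        0.5 * elapsed / iterations * 1.0e6);

        MPI_Finalize();
        return 0;
    }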

Memory Benchmarks

 

Lines of code: 1,500
Parallelism: OpenMP/Pthreads
Languages: C
Memory subsystem functionality and performance tests. A collection of STREAM and STRIDE memory benchmarks to measure the memory subsystem under a variety of memory access patterns.
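A minimal sketch of two access patterns of the kind this suite measures, a STREAM-style triad and a strided read; the array sizes and the stride are left to the caller and are illustrative only:

    // Two memory-access patterns: a STREAM-style triad (contiguous, bandwidth
    // bound) and a strided read (exercises cache-line utilization).
    #include <vector>

    // STREAM triad: a[i] = b[i] + scalar * c[i]
    void triad(std::vector<double>& a, const std::vector<double>& b,
               const std::vector<double>& c, double scalar)
    {
        #pragma omp parallel for
        for (std::size_t i = 0; i < a.size(); ++i)
            a[i] = b[i] + scalar * c[i];
    }

    // Strided sum: only one element per `stride` is touched, so effective
    // bandwidth drops as the stride grows beyond a cache line.
    double strided_sum(const std::vector<double>& x, std::size_t stride)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+ : sum)
        for (std::size_t i = 0; i < x.size(); i += stride)
            sum += x[i];
        return sum;
    }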

ML/DL micro-benchmark Suite

Sparse and dense convolutions, FFT, double-, single-, and half-precision GEMM, and other machine/deep learning math algorithms not included in other CORAL benchmark suites.
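As a reference point for what the GEMM micro-benchmarks time, here is a minimal naive double-precision GEMM; a real micro-benchmark would call a tuned BLAS or vendor library and also cover single and half precision:

    // Minimal reference sketch of double-precision GEMM, C += A * B, with
    // row-major square matrices. This naive triple loop only illustrates the
    // operation being measured, not how it should be implemented for speed.
    #include <vector>

    void dgemm_naive(const std::vector<double>& A,
                     const std::vector<double>& B,
                     std::vector<double>& C, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k) {
                const double a = A[i * n + k];
                for (int j = 0; j < n; ++j)
                    C[i * n + j] += a * B[k * n + j];
            }
    }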

I/O Suite

   

 

 

Contains a file system metadata benchmark and an interleaved-or-random I/O benchmark, used for testing the performance of parallel file systems and burst buffers using various interfaces and access patterns.

CLOMP

Parallelism: OpenMP/Pthreads
Languages: C
Threading benchmark focused on evaluating the performance overheads of threading.
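A minimal sketch of estimating OpenMP parallel-for overhead by comparing a small serial loop against the same loop in a parallel region; CLOMP's actual methodology and figure of merit are more involved, so this is illustrative only:

    // Time the same small loop serially and inside a parallel region, and
    // attribute the difference to threading overhead. The work is deliberately
    // small so that overhead dominates.
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main()
    {
        const int n = 10000;
        const int trials = 1000;
        std::vector<double> a(n, 1.0);

        double t0 = omp_get_wtime();
        for (int t = 0; t < trials; ++t)
            for (int i = 0; i < n; ++i) a[i] = a[i] * 1.0000001 + 0.5;
        const double serial = omp_get_wtime() - t0;

        t0 = omp_get_wtime();
        for (int t = 0; t < trials; ++t) {
            #pragma omp parallel for
            for (int i = 0; i < n; ++i) a[i] = a[i] * 1.0000001 + 0.5;
        }
        const double threaded = omp_get_wtime() - t0;

        std::printf("checksum %.3f; serial %.6f s, threaded %.6f s, overhead per region ~ %.2f us\n",
                    a[0], serial, threaded, (threaded - serial) / trials * 1.0e6);
        return 0;
    }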

Pynamic

Lines of code: 12,000
Parallelism: MPI
Languages: Python, C++
Subsystem functionality and performance test. Dummy application that closely models the footprint of an important Python-based multi-physics ASC code.
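A minimal sketch of the dynamic-loading behavior Pynamic models, timing dlopen over many shared libraries; the library names and count here are hypothetical, and Pynamic itself generates Python extension modules and imports them through an MPI-enabled Python interpreter:

    // Open a large number of shared libraries and time how long the loads take.
    // The naming scheme below is a hypothetical placeholder.
    #include <dlfcn.h>
    #include <chrono>
    #include <cstdio>
    #include <string>

    int main()
    {
        const int num_libs = 100;
        const auto start = std::chrono::steady_clock::now();

        for (int i = 0; i < num_libs; ++i) {
            const std::string name = "./libdummy_" + std::to_string(i) + ".so";
            void* handle = dlopen(name.c_str(), RTLD_NOW | RTLD_GLOBAL);
            if (!handle) {
                std::fprintf(stderr, "failed to load %s: %s\n", name.c_str(), dlerror());
                continue;
            }
            // A real test would also resolve and call symbols from each library.
        }

        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        std::printf("loaded %d libraries in %.3f s\n", num_libs, elapsed.count());
        return 0;
    }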

RAJA Performance Suite

Lines of code: 2,000
Parallelism: OpenMP/Pthreads, GPU
Languages: C++
The RAJA Performance Suite is designed to explore the performance of loop-based computational kernels of the sort found in HPC applications. In particular, it is used to assess, monitor, and compare the runtime performance of kernels implemented using RAJA against variants implemented directly in standard or vendor-supported parallel programming models. Each kernel in the suite appears in multiple RAJA and non-RAJA variants using parallel programming models such as OpenMP and CUDA.
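A minimal sketch of a single kernel in two of the variant styles the suite compares, a raw OpenMP loop and the same loop written with RAJA::forall under an OpenMP execution policy; the daxpy kernel and the policy choice are illustrative:

    // One kernel, two variants: a baseline OpenMP loop and a RAJA version whose
    // execution policy is a template parameter (it could be swapped for a CUDA
    // policy without changing the loop body).
    #include "RAJA/RAJA.hpp"

    // Baseline variant: raw OpenMP loop.
    void daxpy_openmp(double* y, const double* x, double a, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }

    // RAJA variant: same kernel body expressed with RAJA::forall.
    void daxpy_raja(double* y, const double* x, double a, int n)
    {
        RAJA::forall<RAJA::omp_parallel_for_exec>(
            RAJA::RangeSegment(0, n),
            [=](int i) { y[i] += a * x[i]; });
    }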

Tier-2 Benchmark Information

Scalable Science Benchmarks

ACME summary
(Note: ACME has been renamed E3SM)

Parallelism: MPI, OpenMP/Pthreads, GPU
Languages: Fortran
ACME is a high-resolution climate simulation code for the entire Earth system, containing five major components for the atmosphere, ocean, land surface, sea ice, and land ice, along with a coupler. Performance limiters will be network latency, memory bandwidth, kernel launch overheads on accelerators, and accelerator data transfer latency.
VPIC

Lines of code: 90,000
Parallelism: MPI, OpenMP/Pthreads
Languages: C++
VPIC (Vector Particle-In-Cell) is a general-purpose particle-in-cell simulation code for modeling kinetic plasmas. It employs a second-order, explicit, leapfrog algorithm to update charged particle positions and velocities in order to solve the relativistic kinetic equation for each species in the plasma, along with a full Maxwell description for the electric and magnetic fields evolved via a second-order finite-difference time-domain (FDTD) solve.
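A minimal sketch of a leapfrog particle push, simplified to a non-relativistic, electric-field-only update with an illustrative structure-of-arrays layout; VPIC's actual push is relativistic (a Boris-type rotation) and gathers fields from the FDTD grid:

    // Leapfrog push: velocities and positions are staggered in time, so the
    // velocity is advanced a full step from the field, then the position is
    // advanced with the new velocity.
    #include <vector>

    struct ParticleSoA {
        std::vector<double> x, v;   // 1D position and velocity for simplicity
        std::vector<double> E;      // electric field interpolated to each particle
    };

    void leapfrog_push(ParticleSoA& p, double charge_over_mass, double dt)
    {
        const std::size_t n = p.x.size();
        #pragma omp parallel for
        for (std::size_t i = 0; i < n; ++i) {
            p.v[i] += charge_over_mass * p.E[i] * dt;  // v^(n-1/2) -> v^(n+1/2)
            p.x[i] += p.v[i] * dt;                     // x^n       -> x^(n+1)
        }
    }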
 
Throughput Benchmarks
AMG

Lines of code: 65,000
Parallelism: MPI, OpenMP/Pthreads
Languages: C
The AMG solve is included in the Tier-1 problem. The setup phase and a time-dependent problem are included here, as they stress systems differently.

Laghos

Lines of code: 2,000+ (plus a dependency on MFEM)
Parallelism: MPI, GPU
Languages: C++
Laghos solves the time-dependent Euler equations of compressible gas dynamics in a moving Lagrangian frame using unstructured high-order finite element spatial discretization and explicit high-order time stepping. It is built on top of a general discretization library (MFEM) and supports two modes: full assembly, where performance is limited by the data motion in a conjugate gradient (CG) solve, and partial assembly, where performance depends mostly on small dense matrix operations and the CG solve communication.

LAMMPS

Lines of code: 500,000
Parallelism: MPI, OpenMP/Pthreads, GPU
Languages: C, C++
One LAMMPS problem is included in Scalable Science. However, many potentials are of interest, and a second one is included here for reference.
 
Data Science and Deep Learning Benchmarks

Parallel Integer Sort

 

Lines of code: 2,000
Parallelism: MPI
Languages: Fortran, C
The BigSort benchmark sorts a large number of 64-bit integers (from 0 to T) in parallel. In particular, the total size of the data set can exceed the aggregate memory size of all nodes. The goal is to exercise and study a computer system's memory hierarchy performance when it comes to big data management. The emphasis here is on I/O, all-to-all communication, and integer operations.
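A minimal sketch of the all-to-all pattern such a sort exercises, bucketing keys by range and exchanging them with MPI_Alltoallv before a local sort; the key range, per-rank counts, and in-memory simplification (no I/O or out-of-core handling) are illustrative assumptions:

    // Parallel integer sort by key-range bucketing and an all-to-all exchange.
    #include <mpi.h>
    #include <algorithm>
    #include <random>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank = 0, nprocs = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const long long T = 1LL << 32;           // keys lie in [0, T)
        const long long keys_per_rank = 1 << 20; // illustrative local count
        const long long bucket_width = T / nprocs;

        // Generate random local keys.
        std::mt19937_64 rng(12345 + rank);
        std::vector<long long> local(keys_per_rank);
        for (auto& k : local) k = static_cast<long long>(rng() % T);

        // Bin keys by destination rank (key-range partitioning).
        std::vector<std::vector<long long>> buckets(nprocs);
        for (long long k : local)
            buckets[std::min<long long>(k / bucket_width, nprocs - 1)].push_back(k);

        // Exchange how many keys each rank will receive from every other rank.
        std::vector<int> send_counts(nprocs), recv_counts(nprocs);
        for (int r = 0; r < nprocs; ++r)
            send_counts[r] = static_cast<int>(buckets[r].size());
        MPI_Alltoall(send_counts.data(), 1, MPI_INT,
                     recv_counts.data(), 1, MPI_INT, MPI_COMM_WORLD);

        // Build displacements and flatten the send buffer.
        std::vector<int> sdispls(nprocs, 0), rdispls(nprocs, 0);
        for (int r = 1; r < nprocs; ++r) {
            sdispls[r] = sdispls[r - 1] + send_counts[r - 1];
            rdispls[r] = rdispls[r - 1] + recv_counts[r - 1];
        }
        std::vector<long long> send_buf;
        send_buf.reserve(keys_per_rank);
        for (int r = 0; r < nprocs; ++r)
            send_buf.insert(send_buf.end(), buckets[r].begin(), buckets[r].end());
        std::vector<long long> recv_buf(rdispls[nprocs - 1] + recv_counts[nprocs - 1]);

        // All-to-all data exchange, then a local sort finishes the job.
        MPI_Alltoallv(send_buf.data(), send_counts.data(), sdispls.data(), MPI_LONG_LONG,
                      recv_buf.data(), recv_counts.data(), rdispls.data(), MPI_LONG_LONG,
                      MPI_COMM_WORLD);
        std::sort(recv_buf.begin(), recv_buf.end());

        MPI_Finalize();
        return 0;
    }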
Havoq

Massively parallel graph analysis algorithms for computing triangles, edges, and vertices. Emphasizes load imbalance and irregular random memory accesses.
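A minimal shared-memory sketch of triangle counting over a CSR graph with sorted adjacency lists; the irregular, data-dependent indexing into the adjacency arrays is what produces the random memory accesses mentioned above (this is not Havoq's distributed algorithm):

    // Triangle counting on a graph in CSR form with sorted adjacency lists:
    // for each edge (u, v) with u < v, intersect the neighbor lists of u and v.
    // Each triangle is found once per edge, i.e. three times, so the total is
    // divided by three at the end.
    #include <cstdint>
    #include <vector>

    std::uint64_t count_triangles(const std::vector<std::int64_t>& row_offsets,
                                  const std::vector<std::int64_t>& neighbors)
    {
        const std::int64_t num_vertices =
            static_cast<std::int64_t>(row_offsets.size()) - 1;
        std::uint64_t triangles = 0;

        #pragma omp parallel for reduction(+ : triangles)
        for (std::int64_t u = 0; u < num_vertices; ++u) {
            for (std::int64_t e = row_offsets[u]; e < row_offsets[u + 1]; ++e) {
                const std::int64_t v = neighbors[e];
                if (v <= u) continue;  // consider each edge once
                // Intersect the sorted neighbor lists of u and v.
                std::int64_t a = row_offsets[u], b = row_offsets[v];
                while (a < row_offsets[u + 1] && b < row_offsets[v + 1]) {
                    if (neighbors[a] == neighbors[b]) { ++triangles; ++a; ++b; }
                    else if (neighbors[a] < neighbors[b]) ++a;
                    else ++b;
                }
            }
        }
        return triangles / 3;
    }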