CORAL-2 Benchmarks

Introduction

The CORAL-2 benchmarks contained within this site represent their state as of early December 2017. Over the next few months the list of applications will change. While we expect most of the changes to be additions, it is possible that we will remove applications or change their categories.

For now, links to the benchmark applications' websites are provided. Once we have measured and released a reference figure of merit (FOM) for an application, we will include a static source distribution of that application on this site along with a document describing its benchmarking procedure. Reference FOMs will be run on either Sequoia or Titan.

The CORAL-2 RFP page and the updated FOM spreadsheet can be found here.

See GPU Versions and Other Supplementary Material for more information.

Questions?

For questions about the benchmarks and other information on this site, please contact coral2-benchmarks [at] llnl [dot] gov.

Tier-1 Benchmark Information

Scalable Science Benchmarks

For each benchmark below, the entry lists approximate lines of code, the parallelism exercised (MPI, OpenMP/Pthreads, GPU), the implementation languages, and a brief description.

HACC

Lines of code: 35,000
Parallelism: MPI, OpenMP/Pthreads, GPU
Languages: C++
The Hardware Accelerated Cosmology Code (HACC) framework uses N-body techniques to simulate the formation of structure in collisionless fluids under the influence of gravity in an expanding universe. It depends on an external FFT library and is typically compute limited, achieving 13.92 petaflops (69.2% of machine peak) on Sequoia.
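HACC's production solver splits gravity into an FFT-based long-range part and a short-range part; as a rough illustration of the N-body character of the workload only, here is a minimal direct-sum force kernel (the Particle layout, softening parameter, and OpenMP threading are illustrative assumptions, not HACC's actual data structures):

    // Minimal sketch of a direct-sum N-body gravity kernel (not HACC's actual
    // solver). Illustrates the compute-bound character of the problem.
    #include <cmath>
    #include <vector>

    struct Particle { double x, y, z, mass; };

    // Accumulate the gravitational acceleration on each particle from all others.
    void accumulate_forces(const std::vector<Particle>& p,
                           std::vector<double>& ax,
                           std::vector<double>& ay,
                           std::vector<double>& az,
                           double softening)
    {
        const std::size_t n = p.size();
        #pragma omp parallel for
        for (std::size_t i = 0; i < n; ++i) {
            double fx = 0.0, fy = 0.0, fz = 0.0;
            for (std::size_t j = 0; j < n; ++j) {
                if (j == i) continue;
                const double dx = p[j].x - p[i].x;
                const double dy = p[j].y - p[i].y;
                const double dz = p[j].z - p[i].z;
                const double r2 = dx * dx + dy * dy + dz * dz + softening;
                const double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
                fx += p[j].mass * dx * inv_r3;
                fy += p[j].mass * dy * inv_r3;
                fz += p[j].mass * dz * inv_r3;
            }
            ax[i] = fx; ay[i] = fy; az[i] = fz;
        }
    }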

Nekbone

Lines of code: 48,000
Parallelism: MPI, GPU
Languages: Fortran, C
Nekbone is a mini-app derived from the Nek5000 CFD code, a high-order, incompressible Navier-Stokes solver based on the spectral element method. The conjugate gradient solve is compute intensive and involves small messages and frequent allreduces.
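To illustrate why the conjugate gradient solve involves small messages and frequent allreduces, here is a minimal sketch of one distributed CG iteration; the apply_operator routine is a hypothetical stand-in (here a 1D Laplacian) for Nekbone's spectral-element operator, which is written in Fortran:

    // Sketch of one distributed conjugate-gradient iteration. The two dot
    // products per iteration each require a small MPI_Allreduce, which is the
    // communication pattern the Nekbone description refers to.
    #include <mpi.h>
    #include <vector>

    double dot(const std::vector<double>& a, const std::vector<double>& b)
    {
        double local = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) local += a[i] * b[i];
        double global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return global;  // small (8-byte) reduction, issued every iteration
    }

    // Hypothetical stand-in for the spectral-element operator: a 1D Laplacian.
    void apply_operator(const std::vector<double>& p, std::vector<double>& Ap)
    {
        const std::size_t n = p.size();
        for (std::size_t i = 0; i < n; ++i) {
            const double left  = (i > 0)     ? p[i - 1] : 0.0;
            const double right = (i + 1 < n) ? p[i + 1] : 0.0;
            Ap[i] = 2.0 * p[i] - left - right;
        }
    }

    // One CG step: x, r, p are the usual solution, residual, and search vectors.
    void cg_step(std::vector<double>& x, std::vector<double>& r,
                 std::vector<double>& p, std::vector<double>& Ap, double& rsq)
    {
        apply_operator(p, Ap);
        const double alpha = rsq / dot(p, Ap);      // allreduce #1
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        const double rsq_new = dot(r, r);           // allreduce #2
        const double beta = rsq_new / rsq;
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rsq = rsq_new;
    }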

QMCPACK

Lines of code: 200,000
Parallelism: MPI, OpenMP/Pthreads, GPU
Languages: C, C++
QMCPACK is a many-body ab initio quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids. It is written primarily in C++, and its use of template metaprogramming is known to stress compilers. When run in production, the code is memory bandwidth sensitive, while still needing thread efficiency to realize good performance.

LAMMPS

Lines of code: 500,000
Parallelism: MPI, OpenMP/Pthreads, GPU
Languages: C, C++
LAMMPS is a classical molecular dynamics code. Performance limiters will depend on the problem chosen and could include compute, memory bandwidth, network bandwidth, and network latency.
 
Throughput Benchmarks

AMG

Lines of code: 65,000
Parallelism: MPI, OpenMP/Pthreads, GPU
Languages: C
AMG is a parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids. AMG is memory-access bound, generates many small messages, and stresses memory and network latency.
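As a rough sketch of why an algebraic multigrid cycle is memory-access bound and latency sensitive on its coarse levels, here is a minimal recursive V-cycle; the smooth, restrict_residual, prolong_and_correct, and coarse_solve helpers are hypothetical placeholders, not the AMG benchmark's routines:

    // Minimal sketch of a multigrid V-cycle. Each level does a little
    // arithmetic over a lot of memory, which is why AMG is memory-access
    // bound; the coarse levels hold very little work, so their halo
    // exchanges become many small, latency-sensitive messages.
    #include <vector>

    struct Level {
        std::vector<double> x;   // solution on this level
        std::vector<double> b;   // right-hand side on this level
        std::vector<double> r;   // residual on this level
    };

    void smooth(Level& lvl);                                     // e.g. a few Jacobi sweeps (hypothetical)
    void restrict_residual(const Level& fine, Level& coarse);    // r_fine -> b_coarse (hypothetical)
    void prolong_and_correct(const Level& coarse, Level& fine);  // x_fine += P * x_coarse (hypothetical)
    void coarse_solve(Level& lvl);                               // direct solve on the coarsest grid (hypothetical)

    void v_cycle(std::vector<Level>& levels, std::size_t l)
    {
        if (l + 1 == levels.size()) {      // coarsest level
            coarse_solve(levels[l]);
            return;
        }
        smooth(levels[l]);                 // pre-smoothing
        restrict_residual(levels[l], levels[l + 1]);
        v_cycle(levels, l + 1);            // recurse on the coarser grid
        prolong_and_correct(levels[l + 1], levels[l]);
        smooth(levels[l]);                 // post-smoothing
    }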

Kripke

 

Lines of code: 4,000
Parallelism: MPI, OpenMP/Pthreads, GPU
Languages: C++
Kripke is a structured deterministic (Sn) transport mini-app written using RAJA. It contains wavefront algorithms that stress memory latency and/or bandwidth, as well as network latency.
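A minimal sketch of the wavefront dependency pattern, assuming a simple 2D structured grid and a stand-in update formula; Kripke's actual sweeps are 3D, per-direction and per-group, and are written with RAJA:

    // Minimal sketch of a 2D wavefront ("sweep") update on a structured grid.
    // Each cell depends on its upwind (left and below) neighbors, so cells on
    // the same anti-diagonal can be updated in parallel while successive
    // diagonals must be processed in order.
    #include <algorithm>
    #include <vector>

    void sweep(std::vector<double>& psi, int nx, int ny)
    {
        auto at = [&](int i, int j) -> double& { return psi[j * nx + i]; };
        // March over anti-diagonals d = i + j.
        for (int d = 0; d < nx + ny - 1; ++d) {
            const int i_lo = std::max(0, d - (ny - 1));
            const int i_hi = std::min(d, nx - 1);
            #pragma omp parallel for
            for (int i = i_lo; i <= i_hi; ++i) {
                const int j = d - i;
                const double upwind_x = (i > 0) ? at(i - 1, j) : 0.0;
                const double upwind_y = (j > 0) ? at(i, j - 1) : 0.0;
                at(i, j) = 0.5 * (upwind_x + upwind_y);  // stand-in for the transport update
            }
        }
    }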

Quicksilver

 

Lines of code: 10,000
Parallelism: MPI, OpenMP/Pthreads, GPU
Languages: C++
Quicksilver is a Monte Carlo transport benchmark with multi-group cross-section lookups. It stresses memory latency and branch-heavy control flow, and contains one large kernel that is thousands of lines long.
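A minimal sketch of the kind of particle-history loop such a code runs, with hypothetical flattened cross-section tables; the data-dependent lookups and event branching are what stress memory latency and branch-heavy control flow (this is not Quicksilver's actual kernel):

    // Minimal sketch of a Monte Carlo particle-history loop with multi-group
    // cross-section lookups. The table indices depend on each particle's
    // current energy group and material, so the loads are data dependent
    // (memory-latency bound) and the event selection is branch heavy.
    #include <cstdint>
    #include <random>
    #include <vector>

    struct Particle { int group; int material; bool alive; };

    void track_particles(std::vector<Particle>& particles,
                         const std::vector<double>& absorb_xs,   // [material][group], flattened
                         const std::vector<double>& scatter_xs,  // [material][group], flattened
                         int num_groups,
                         std::uint64_t seed)
    {
        std::mt19937_64 rng(seed);
        std::uniform_real_distribution<double> uniform(0.0, 1.0);

        for (Particle& p : particles) {
            while (p.alive) {
                const std::size_t idx =
                    static_cast<std::size_t>(p.material) * num_groups + p.group;
                const double sigma_a = absorb_xs[idx];    // data-dependent lookup
                const double sigma_s = scatter_xs[idx];   // data-dependent lookup
                const double total = sigma_a + sigma_s;
                if (uniform(rng) * total < sigma_a) {
                    p.alive = false;                      // absorption ends the history
                } else if (p.group + 1 < num_groups) {
                    ++p.group;                            // downscatter to the next group
                } else {
                    p.alive = false;                      // thermal cutoff (simplified)
                }
            }
        }
    }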

PENNANT

 

Lines of code: 3,300
Parallelism: MPI, OpenMP/Pthreads, GPU
Languages: C++
PENNANT is a mini-app for hydrodynamics on general unstructured meshes in 2D (arbitrary polygons). It makes heavy use of indirect addressing and irregular memory access patterns.
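A minimal sketch of the gather/scatter indirection an unstructured-mesh code performs, assuming hypothetical side-to-node connectivity arrays rather than PENNANT's actual data structures:

    // Minimal sketch of indirect addressing on an unstructured mesh: each mesh
    // "side" gathers values from its two endpoint nodes through an index array
    // and scatters a contribution back. The node indices are irregular, so the
    // loads and stores are cache unfriendly.
    #include <vector>

    void accumulate_side_forces(const std::vector<int>& side_node1,   // node index of first endpoint
                                const std::vector<int>& side_node2,   // node index of second endpoint
                                const std::vector<double>& node_pressure,
                                std::vector<double>& node_force)
    {
        const std::size_t num_sides = side_node1.size();
        for (std::size_t s = 0; s < num_sides; ++s) {
            const int n1 = side_node1[s];
            const int n2 = side_node2[s];
            // Gather through the index arrays (irregular reads).
            const double p_avg = 0.5 * (node_pressure[n1] + node_pressure[n2]);
            // Scatter back to the nodes (irregular writes; a real code must
            // avoid write conflicts when threading this loop).
            node_force[n1] += p_avg;
            node_force[n2] -= p_avg;
        }
    }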
 
Data Science and Deep Learning Benchmarks

Big Data Analytic Suite

Lines of code: 640
The big data analytic suite contains the K-Means observation label, PCA, and SVM benchmarks.

Deep Learning Suite

Lines of code: 1,100
The deep learning suite contains: convolutional neural networks (CNNs) that comprise convolutional layers followed by fully connected layers; an LSTM recurrent neural network (RNN) architecture that remembers values over arbitrary intervals for temporal and time-series prediction; and distributed training code for classification on the ImageNet data set at scale. Finally, the CANDLE benchmark codes implement deep learning architectures that are relevant to problems in cancer. These architectures address problems at different biological scales, specifically the molecular, cellular, and population scales. We will use two diverse benchmark problems: a) P1B1, a sparse autoencoder that compresses an expression profile into a low-dimensional vector, and b) P3B1, a multi-task deep neural network for data extraction from clinical reports.
 
Skeleton Benchmarks

CORAL MPI Benchmarks

Lines of code: 1,000
Parallelism: MPI
Languages: C
Subsystem functionality and performance tests. A collection of independent MPI benchmarks to measure various aspects of MPI performance, including interconnect messaging rate, latency, aggregate bandwidth, and collective latencies.
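A minimal sketch of one such measurement, a two-rank ping-pong latency test; the message size, iteration count, and output are illustrative choices, not the suite's actual parameters:

    // Minimal ping-pong latency sketch between ranks 0 and 1. The actual CORAL
    // MPI benchmarks also measure messaging rate, aggregate bandwidth, and
    // collective latencies.
    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iterations = 1000;
        const int msg_size = 8;            // small message: latency dominated
        std::vector<char> buf(msg_size);

        MPI_Barrier(MPI_COMM_WORLD);
        const double start = MPI_Wtime();
        for (int i = 0; i < iterations; ++i) {
            if (rank == 0) {
                MPI_Send(buf.data(), msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf.data(), msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf.data(), msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf.data(), msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        const double elapsed = MPI_Wtime() - start;

        if (rank == 0)
            std::printf("one-way latency: %.3f us\n",
                        0.5 * elapsed / iterations * 1.0e6);

        MPI_Finalize();
        return 0;
    }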

Memory Benchmarks

 

Lines of code: 1,500
Parallelism: OpenMP/Pthreads
Languages: C
Memory subsystem functionality and performance tests. A collection of STREAM and STRIDE memory benchmarks to measure the memory subsystem under a variety of memory access patterns.
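A minimal sketch of two access patterns of the kind this suite measures, a STREAM-style triad and a strided read; the array sizes and the stride are left to the caller and are illustrative only:

    // Two memory-access patterns: a STREAM-style triad (contiguous, bandwidth
    // bound) and a strided read (exercises cache-line utilization).
    #include <vector>

    // STREAM triad: a[i] = b[i] + scalar * c[i]
    void triad(std::vector<double>& a, const std::vector<double>& b,
               const std::vector<double>& c, double scalar)
    {
        #pragma omp parallel for
        for (std::size_t i = 0; i < a.size(); ++i)
            a[i] = b[i] + scalar * c[i];
    }

    // Strided sum: only one element per `stride` is touched, so effective
    // bandwidth drops as the stride grows beyond a cache line.
    double strided_sum(const std::vector<double>& x, std::size_t stride)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+ : sum)
        for (std::size_t i = 0; i < x.size(); i += stride)
            sum += x[i];
        return sum;
    }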

ML/DL micro-benchmark Suite

Sparse and dense convolutions, FFT, double-, single-, and half-precision GEMM, and other machine/deep learning math algorithms not included in other CORAL benchmark suites.
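As a reference point for what the GEMM micro-benchmarks time, here is a minimal naive double-precision GEMM; a real micro-benchmark would call a tuned BLAS or vendor library and also cover single and half precision:

    // Minimal reference sketch of double-precision GEMM, C += A * B, with
    // row-major square matrices. This naive triple loop only illustrates the
    // operation being measured, not how it should be implemented for speed.
    #include <vector>

    void dgemm_naive(const std::vector<double>& A,
                     const std::vector<double>& B,
                     std::vector<double>& C, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k) {
                const double a = A[i * n + k];
                for (int j = 0; j < n; ++j)
                    C[i * n + j] += a * B[k * n + j];
            }
    }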

I/O Suite

   

 

 

Contains a file system metadata benchmark and an interleaved-or-random I/O benchmark, used for testing the performance of parallel file systems and burst buffers using various interfaces and access patterns.

CLOMP

Parallelism: OpenMP/Pthreads
Languages: C
Threading benchmark focused on evaluating the performance overheads of threading.
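A minimal sketch of estimating OpenMP parallel-for overhead by comparing a small serial loop against the same loop in a parallel region; CLOMP's actual methodology and figure of merit are more involved, so this is illustrative only:

    // Time the same small loop serially and inside a parallel region, and
    // attribute the difference to threading overhead. The work is deliberately
    // small so that overhead dominates.
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main()
    {
        const int n = 10000;
        const int trials = 1000;
        std::vector<double> a(n, 1.0);

        double t0 = omp_get_wtime();
        for (int t = 0; t < trials; ++t)
            for (int i = 0; i < n; ++i) a[i] = a[i] * 1.0000001 + 0.5;
        const double serial = omp_get_wtime() - t0;

        t0 = omp_get_wtime();
        for (int t = 0; t < trials; ++t) {
            #pragma omp parallel for
            for (int i = 0; i < n; ++i) a[i] = a[i] * 1.0000001 + 0.5;
        }
        const double threaded = omp_get_wtime() - t0;

        std::printf("checksum %.3f; serial %.6f s, threaded %.6f s, overhead per region ~ %.2f us\n",
                    a[0], serial, threaded, (threaded - serial) / trials * 1.0e6);
        return 0;
    }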

Pynamic

Lines of code: 12,000
Parallelism: MPI
Languages: Python, C++
Subsystem functionality and performance test. Dummy application that closely models the footprint of an important Python-based multi-physics ASC code.
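A minimal sketch of the dynamic-loading behavior Pynamic models, timing dlopen over many shared libraries; the library names and count here are hypothetical, and Pynamic itself generates Python extension modules and imports them through an MPI-enabled Python interpreter:

    // Open a large number of shared libraries and time how long the loads take.
    // The naming scheme below is a hypothetical placeholder.
    #include <dlfcn.h>
    #include <chrono>
    #include <cstdio>
    #include <string>

    int main()
    {
        const int num_libs = 100;
        const auto start = std::chrono::steady_clock::now();

        for (int i = 0; i < num_libs; ++i) {
            const std::string name = "./libdummy_" + std::to_string(i) + ".so";
            void* handle = dlopen(name.c_str(), RTLD_NOW | RTLD_GLOBAL);
            if (!handle) {
                std::fprintf(stderr, "failed to load %s: %s\n", name.c_str(), dlerror());
                continue;
            }
            // A real test would also resolve and call symbols from each library.
        }

        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        std::printf("loaded %d libraries in %.3f s\n", num_libs, elapsed.count());
        return 0;
    }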

RAJA Performance Suite

Lines of code: 2,000
Parallelism: OpenMP/Pthreads, GPU
Languages: C++
The RAJA Performance Suite is designed to explore the performance of loop-based computational kernels of the sort found in HPC applications. In particular, it is used to assess, monitor, and compare the runtime performance of kernels implemented using RAJA against variants implemented directly in standard or vendor-supported parallel programming models. Each kernel in the suite appears in multiple RAJA and non-RAJA variants using parallel programming models such as OpenMP and CUDA.
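A minimal sketch of a single kernel in two of the variant styles the suite compares, a raw OpenMP loop and the same loop written with RAJA::forall under an OpenMP execution policy; the daxpy kernel and the policy choice are illustrative:

    // One kernel, two variants: a baseline OpenMP loop and a RAJA version whose
    // execution policy is a template parameter (it could be swapped for a CUDA
    // policy without changing the loop body).
    #include "RAJA/RAJA.hpp"

    // Baseline variant: raw OpenMP loop.
    void daxpy_openmp(double* y, const double* x, double a, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }

    // RAJA variant: same kernel body expressed with RAJA::forall.
    void daxpy_raja(double* y, const double* x, double a, int n)
    {
        RAJA::forall<RAJA::omp_parallel_for_exec>(
            RAJA::RangeSegment(0, n),
            [=](int i) { y[i] += a * x[i]; });
    }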

Tier-2 Benchmark Information

Scalable Science Benchmarks

ACME summary
(Note: ACME has been renamed E3SM)

Parallelism: MPI, OpenMP/Pthreads, GPU
Languages: Fortran
ACME is a high-resolution climate simulation code for the entire Earth system, containing five major components for the atmosphere, ocean, land surface, sea ice, and land ice, along with a coupler. Performance limiters will be network latency, memory bandwidth, kernel launch overheads on accelerators, and accelerator data transfer latency.
VPIC

Lines of code: 90,000
Parallelism: MPI, OpenMP/Pthreads
Languages: C++
VPIC (Vector Particle-In-Cell) is a general-purpose particle-in-cell simulation code for modeling kinetic plasmas. It employs a second-order, explicit, leapfrog algorithm to update charged particle positions and velocities in order to solve the relativistic kinetic equation for each species in the plasma, along with a full Maxwell description for the electric and magnetic fields evolved via a second-order finite-difference time-domain (FDTD) solve.
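A minimal sketch of a leapfrog particle push, simplified to a non-relativistic, electric-field-only update with an illustrative structure-of-arrays layout; VPIC's actual push is relativistic (a Boris-type rotation) and gathers fields from the FDTD grid:

    // Leapfrog push: velocities and positions are staggered in time, so the
    // velocity is advanced a full step from the field, then the position is
    // advanced with the new velocity.
    #include <vector>

    struct ParticleSoA {
        std::vector<double> x, v;   // 1D position and velocity for simplicity
        std::vector<double> E;      // electric field interpolated to each particle
    };

    void leapfrog_push(ParticleSoA& p, double charge_over_mass, double dt)
    {
        const std::size_t n = p.x.size();
        #pragma omp parallel for
        for (std::size_t i = 0; i < n; ++i) {
            p.v[i] += charge_over_mass * p.E[i] * dt;  // v^(n-1/2) -> v^(n+1/2)
            p.x[i] += p.v[i] * dt;                     // x^n       -> x^(n+1)
        }
    }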
 
Throughput Benchmarks
AMG

Lines of code: 65,000
Parallelism: MPI, OpenMP/Pthreads
Languages: C
The AMG solve is included in the Tier-1 problem. The setup phase and a time-dependent problem are included here, as they stress systems differently.

Laghos

Lines of code: 2,000+ (plus a dependency on MFEM)
Parallelism: MPI, GPU
Languages: C++
Laghos solves the time-dependent Euler equations of compressible gas dynamics in a moving Lagrangian frame using unstructured high-order finite element spatial discretization and explicit high-order time stepping. It is built on top of a general discretization library (MFEM) and supports two modes: full assembly, where performance is limited by the data motion in a conjugate gradient (CG) solve, and partial assembly, where performance depends mostly on small dense matrix operations and the CG solve communication.

LAMMPS

Lines of code: 500,000
Parallelism: MPI, OpenMP/Pthreads, GPU
Languages: C, C++
One LAMMPS problem is included in Scalable Science. However, many potentials are of interest, and a second one is included here for reference.
 
Data Science and Deep Learning Benchmarks

Parallel Integer Sort

 

Lines of code: 2,000
Parallelism: MPI
Languages: Fortran, C
The BigSort benchmark sorts a large number of 64-bit integers (from 0 to T) in parallel. In particular, the total size of the data set can exceed the aggregate memory size of all nodes. The goal is to exercise and study a computer system's memory hierarchy performance when it comes to big data management. The emphasis here is on I/O, all-to-all communication, and integer operations.
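A minimal sketch of the all-to-all pattern such a sort exercises, bucketing keys by range and exchanging them with MPI_Alltoallv before a local sort; the key range, per-rank counts, and in-memory simplification (no I/O or out-of-core handling) are illustrative assumptions:

    // Parallel integer sort by key-range bucketing and an all-to-all exchange.
    #include <mpi.h>
    #include <algorithm>
    #include <random>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank = 0, nprocs = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const long long T = 1LL << 32;           // keys lie in [0, T)
        const long long keys_per_rank = 1 << 20; // illustrative local count
        const long long bucket_width = T / nprocs;

        // Generate random local keys.
        std::mt19937_64 rng(12345 + rank);
        std::vector<long long> local(keys_per_rank);
        for (auto& k : local) k = static_cast<long long>(rng() % T);

        // Bin keys by destination rank (key-range partitioning).
        std::vector<std::vector<long long>> buckets(nprocs);
        for (long long k : local)
            buckets[std::min<long long>(k / bucket_width, nprocs - 1)].push_back(k);

        // Exchange how many keys each rank will receive from every other rank.
        std::vector<int> send_counts(nprocs), recv_counts(nprocs);
        for (int r = 0; r < nprocs; ++r)
            send_counts[r] = static_cast<int>(buckets[r].size());
        MPI_Alltoall(send_counts.data(), 1, MPI_INT,
                     recv_counts.data(), 1, MPI_INT, MPI_COMM_WORLD);

        // Build displacements and flatten the send buffer.
        std::vector<int> sdispls(nprocs, 0), rdispls(nprocs, 0);
        for (int r = 1; r < nprocs; ++r) {
            sdispls[r] = sdispls[r - 1] + send_counts[r - 1];
            rdispls[r] = rdispls[r - 1] + recv_counts[r - 1];
        }
        std::vector<long long> send_buf;
        send_buf.reserve(keys_per_rank);
        for (int r = 0; r < nprocs; ++r)
            send_buf.insert(send_buf.end(), buckets[r].begin(), buckets[r].end());
        std::vector<long long> recv_buf(rdispls[nprocs - 1] + recv_counts[nprocs - 1]);

        // All-to-all data exchange, then a local sort finishes the job.
        MPI_Alltoallv(send_buf.data(), send_counts.data(), sdispls.data(), MPI_LONG_LONG,
                      recv_buf.data(), recv_counts.data(), rdispls.data(), MPI_LONG_LONG,
                      MPI_COMM_WORLD);
        std::sort(recv_buf.begin(), recv_buf.end());

        MPI_Finalize();
        return 0;
    }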
Havoq

Massively parallel graph analysis algorithms for computing triangles, edges, and vertices. Emphasizes load imbalance and irregular random memory accesses.
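A minimal shared-memory sketch of triangle counting over a CSR graph with sorted adjacency lists; the irregular, data-dependent indexing into the adjacency arrays is what produces the random memory accesses mentioned above (this is not Havoq's distributed algorithm):

    // Triangle counting on a graph in CSR form with sorted adjacency lists:
    // for each edge (u, v) with u < v, intersect the neighbor lists of u and v.
    // Each triangle is found once per edge, i.e. three times, so the total is
    // divided by three at the end.
    #include <cstdint>
    #include <vector>

    std::uint64_t count_triangles(const std::vector<std::int64_t>& row_offsets,
                                  const std::vector<std::int64_t>& neighbors)
    {
        const std::int64_t num_vertices =
            static_cast<std::int64_t>(row_offsets.size()) - 1;
        std::uint64_t triangles = 0;

        #pragma omp parallel for reduction(+ : triangles)
        for (std::int64_t u = 0; u < num_vertices; ++u) {
            for (std::int64_t e = row_offsets[u]; e < row_offsets[u + 1]; ++e) {
                const std::int64_t v = neighbors[e];
                if (v <= u) continue;  // consider each edge once
                // Intersect the sorted neighbor lists of u and v.
                std::int64_t a = row_offsets[u], b = row_offsets[v];
                while (a < row_offsets[u + 1] && b < row_offsets[v + 1]) {
                    if (neighbors[a] == neighbors[b]) { ++triangles; ++a; ++b; }
                    else if (neighbors[a] < neighbors[b]) ++a;
                    else ++b;
                }
            }
        }
        return triangles / 3;
    }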