PRESTA MPI Bandwidth and Latency Benchmark v1.1 README File
***********************************************************


Table of contents
=================

 o Code Description 
 o Benchmark Test Summary
 o Benchmark Test Parameters
 o Benchmark Test Descriptions
 o Building Instructions 
 o Files in this Distribution


Code Description 
=================

This benchmark consists of 5 individual tests intended to evaluate the
performance of inter-process communication with MPI.  The provided 
applications test inter-process communication latency and bandwidth for 
standard MPI message passing operations as well as the MPI-2 RMA/one-sided 
operations.  Additionally, collective operation performance is measured for 
MPI_Allreduce in one test and several collective operations in the final
application.

These benchmarks are written in C with MPI message-passing primitives.


Benchmark Test Summary
======================

This section identifies each included test and defines the measurements
provided by each test.


com : 

This test measures total unidirectional and bidirectional inter-process 
bandwidth for pairs of MPI processes, varying both message size and 
number of concurrently communicating process pairs.  Intra-SMP or 
inter-SMP communication, as well as bandwidth out of node and bisectional 
bandwidth, can be measured with proper allocation of the processes.  


laten : 

This test measures latency between MPI processes by measuring the time to 
execute MPI_Send/MPI_Recv ping-pong communication between MPI process pairs,
varying the number of concurrently communicating process pairs.  


rma :

This test measures MPI-2 RMA bandwidth and epoch length for both
unidirectional and bidirectional access patterns.  The number of
operations per epoch can be specified by the user.  


allred : 

This test measures the MPI_Allreduce operation in a simulated
application environment with alternating computation and communication 
phases. 


globalop :

This test, also referred to as the Global-Op benchmark, times MPI 
operation loops with and without a simulated compute phase for 
MPI_Barrier, MPI_Reduce, MPI_Bcast, MPI_Allreduce, and 
MPI_Reduce/MPI_Bcast.



Benchmark Test Parameters
=========================

Operations Per Time Sample :

Timing results are obtained in each of these tests through the use of
the MPI_Wtime function.  To address variation in MPI_Wtime resolution,
all of the tests, with the exception of globalop, allow the
specification of the number of operations to be performed between
MPI_Wtime calls as a command-line argument.  The number of MPI_Wtime
clock ticks per measurement is provided in the output of each test.
This mechanism is provided as a direct means of tuning the test
behavior to the granularity of the MPI_Wtime function.  

Note that for the com, laten, and rma tests, each time sample will 
include the overhead of performing a Barrier operation by the included 
butterfly barrier.  This default behavior is provided as a means of 
ensuring concurrent communication during the test.  Use of the '-n' 
command-line argument will suppress the use of the barrier.  When using 
the default behavior, the effect of the barrier can be minimized by 
using larger numbers of operations per sample.


Participating Processes and Allocation : 

For the com, laten, and rma tests, the allocation of concurrently
communicating process pairs and the number of these pairs can be
specified in several ways.  The intent of the test is to measure
communication for the following configurations:

1) Communication within a SMP, increasing the number of concurrently
communicating process pairs.

2) Communication between two SMPs, increasing the number of concurrently
communicating process pairs.

3) Communication between multiple SMPs, increasing the number of SMPs
with concurrently communicating process pairs.  

4) Communication between multiple SMPs, with active processes on each SMP,
increasing the number of communicating processes on each SMP.


The following command-line flags provide the same functionality across
the com, laten, and rma tests:

    -f [process count source file]
    -p [allocate processes: c(yclic) or b(lock)]   default=b
    -t [processes per SMP]
    -i : print process pair information            default=false
    -r : partner processes with nearby rank        default=false


Process pair partnering is done by default by selecting 1 process from each
half of the rank ids in MPI_COMM_WORLD.  Pair partners are identified by:

  (MPI_COMM_WORLD Rank ID + Size of MPI_COMM_WORLD/2) % Size of MPI_COMM_WORLD

However, if the number of processes per SMP is provided, the
processes can be paired by nearest off-SMP rank through the use of
the '-r' flag.

The default behavior of the tests is to allocate process pairs
incrementally based on rank, beginning with 1 process pair.  For
example, an iteration for 4 processes of a 16 process job with 8
processes per SMP would include the rank pairs (0,8) and (1,9).  Test
configuration #3 mentioned above uses this allocation pattern and
begins with 2*processes-per-SMP active processes.

Since test configuration #4 increases the number of active processes
per SMP, the participating processes must be identified in a cyclic
manner.  Specifying the number of processes per SMP and the '-p c'
option results in cyclic allocation of the active processes. For
example, an iteration for 8 processes of a 32 process job with 8
processes per SMP would include rank pairs (0,16), (8,24), (1,17),
and (9,25).  Test configuration #4 starts with the number of
communicating processes equal to the number of SMPs and increases the
process count by the number of SMPs each iteration.

The process counts can also be specified in a source file with the
'-f' option.  Each count must be on a separate line.


Message Sizes :

The com and rma tests vary message size based on the command-line arguments
message size start, message size stop and message size factor.  Start and
stop are specified in bytes.  The message size factor is the value by which 
the current message size is multiplied to derive the message size for the 
next iteration. 


Benchmark Test Descriptions 
===========================

com :

The "com" test is intended to illustrate at what point the interconnect between
communicating MPI processes becomes saturated for both unidirectional
and bidirectional communication.

As indicated above, com iterates over a range of message sizes for each set
of communicating process pairs.  The entire size/pair iteration is performed
initially for unidirectional and then bidirectional communication.  The 
bandwidth and average operation time are calculated based on the longest time 
sample of any of the participating processes.

Intra- and inter-SMP performance can be evaluated based on process allocation.

The com test requires that the 'number of operations between
measurements' argument (-o) be specified.

com : MPI interprocess communication bandwidth benchmark

  syntax: com [OPTION]...

    -b [message start size]                        default=32
    -e [message stop  size]                        default=8388608
    -f [process count source file]
    -o [number of operations between measurements]
    -p [allocate processes: c(yclic) or b(lock)]   default=b
    -s [message size increase factor]              default=2
    -t [processes per SMP]
    -h : print use information
    -i : print process pair information            default=false
    -n : do not use barrier within measurement     default=barrier used
    -r : partner processes with nearby rank        default=false


laten :

The application "laten" measures inter-process latency by performing 
ping-pong communication using MPI_Send and MPI_Recv.

As indicated above, the laten test varies the number of concurrently 
communicating process pairs for each measurement.  Communication latency is 
determined based on the longest process pair execution of the specified 
number of 0 byte MPI_Send/MPI_Recv operations performed.  The average time 
of a single MPI_Send/MPI_Recv iteration is calculated and divided by two.

Intra- and inter-SMP performance can be evaluated based on process allocation.

The laten test requires that the 'number of operations between
measurements' argument (-o) be specified.

laten : MPI interprocess communication latency benchmark

  syntax: laten [OPTION]...

    -f [process count source file]
    -o [number of operations between measurements]
    -p [allocate processes: c(yclic) or b(lock)]   default=b
    -t [processes per SMP]
    -h : print use information
    -i : print process pair information            default=false
    -n : do not use barrier within measurement     default=barrier used
    -r : partner processes with nearby rank        default=false

rma : 

The application "rma" measures the performance of unidirectional and 
bidirectional communication using the MPI-2 RMA operations MPI_Put
and MPI_Get.  

This test varies message size and concurrent communicating process pairs
in the same manner as the com test.  Note that each process pair performs
operations on a window specific to the two processes.

Note that this test has large memory requirements: for 128 operations
per epoch with operation sizes of 8MB for a bidirectional test, the memory 
requirement for 1 process would be 128 * 8 * 2 = 2048 MB.  For multiple
concurrently communicating process pairs, this can quickly lead to excessive
paging.

The rma test requires that the 'number of operations per epoch'
argument (-o) and the 'number of epochs between measurements' (-c)
argument values be specified.

rma : MPI interprocess communication benchmark for one-sided operations

  syntax: rma [OPTION]...

    -b [message start size]                        default=32
    -c [number of epochs between measurements]
    -e [message stop  size]                        default=8388608
    -f [process count source file]
    -o [number of operations per epoch]
    -p [allocate processes: c(yclic) or b(lock)]   default=b
    -s [message size increase factor]              default=2
    -t [processes per SMP]
    -h : print use information
    -i : print process pair information            default=false
    -n : do not use barrier within measurement     default=barrier used
    -r : partner processes with nearby rank        default=false

allred :

The application "allred" measures the MPI_Allreduce operation in a
simulated application-like context.  A series of computation operations
are timed followed by a series of timed compute/Allreduce iterations.
The computation-only time is subtracted from the compute/Allreduce
time, resulting in the MPI_Allreduce operation time.  The Allreduce 
operations use a fixed operation size of 4 MPI_DOUBLEs.

This application takes 3 arguments: the number of compute iterations, the 
number of loop iterations between time samples, and the total number of 
samples taken.  

To avoid overlap of Allreduce operations, the "Time between Allreduce"
field should be at least as great as the "Op mean" value.  This can be
tuned with the number of compute iterations argument.  If the "Op mean"
results of this test appear inconsistent, the sample and stddev/mean values 
can be examined to determine whether inconsistent data is being gathered.


globalop :

The application "globalop" measures the time to complete loops 
of several MPI collective operations, including MPI_Barrier, MPI_Reduce, 
MPI_Bcast, MPI_Allreduce, and MPI_Reduce/MPI_Bcast.  The operation loops 
are timed with and without a call to a simulated compute-phase function.


Building Instructions
=====================

Configure the Makefile to use the appropriate commands and flags as required
to compile MPI applications on the target system.


File in this Distribution
========================

Makefile
README
README.html
README.rfp
README.rfp.html
allred.c
com.c
globalop.c
laten.c
rma.c
util.c
util.h


Last modified on January 31, 2001 by Chris Chambreau
For information contact:
Chris Chambreau -- chcham@llnl.gov 

UCRL-CODE-2001-028