
BlueGene/L Basics

1. Accounts | 2. Access | 3. File Systems |
4. Compilers | 5. Libraries | 6. TotalView Debugger |
7. Performance Tools | 8. Memory Constraints |
9. How to Launch and Manage Jobs |
10. BGL Environment Variables | 11. Mapping Tasks |
12. Known Problems | 13. Help | 14. Resources

1. Accounts

Contact LC Support (lc-support@llnl.gov or 925-422-4533) to request a login account. For the large BGL system on the SCF, the login nodes are reached via the alias bgl; ssh'ing to bgl puts you on one of the individual front-end nodes, bglN, where N is a number (currently 1, 3-6, 8-11). Currently, there is no BGL system available on the OCF.

2. Access

Note that you must use SSH port 922 to connect to BGL (or any system on the LLNL "yellow" network) from offsite. If you are using port 922 and still having trouble connecting to BGL with SSH, please contact the LC Hotline (call 925-422-4531 or e-mail lc-hotline@llnl.gov). For more information about SSH and SCP, see (on the unclassified network) https://computing.llnl.gov/?set=access&page=index#logging-in1.

Jabber

System administrators communicate about system activity through the open-source instant messaging platform Jabber. Jabber documentation can be found at https://computation-int/icc/conference/.

Separate "conference rooms" are used for BGL systems on the SCF and OCF. For the SCF system, use bgl@conference.llnl.gov; currently, there is no BGL system available on the OCF. After you are connected to the chat.llnl.gov Jabber server, enter conference.llnl.gov in the 'server' box and bgl in the 'room' box of your Jabber client.


3. File Systems

User home directories are NFS-mounted. Please do not do any parallel I/O to the file system containing your home directory. It can be used to launch your BGL job, write the job log file(s), etc., but for any sort of parallel I/O for your BGL job, use Lustre.

The Lustre file systems, named /p/lscratch1 and /p/lscratch3, should be used for parallel I/O. We previously warned against having executables reside on Lustre file systems: the underlying issues regarding locking around mmap operations have been addressed, and we have not observed subsequent problems.

SLIC (slic.llnl.gov) is a cluster that mounts the Lustre file systems of BGL. It was designed and constructed solely for the purpose of offloading data to storage. To move data from the BGL system, users run HTAR sessions on the SLIC nodes, either by running HTAR directly or by running it under the Hopper graphical user interface.

Note: This performance guidance does not account for contention of the storage resources from other LC systems.

When compiling a code on Linux to make use of Large File Support (i.e., files larger than 2 GB), some action is required by the user. The quick summary is that you need to compile with the following defines:

-D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64

Without these defines, you will see the 2 GB limitation on BGL systems. For more details, see /usr/local/docs/Large_Files.txt.
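For example, a C code that needs to write files larger than 2 GB could be compiled as follows (the source and executable names here are only placeholders):

mpxlc -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -O -o bigio bigio.c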


4. Compilers

For details on compiler versions, see Compilers Currently Installed on LC Platforms.

IBM Fortran (mpxlf, mpxlf90, mpxlf95)
IBM C (mpxlc, mpcc)
IBM C++ (mpxlC, mpCC)
GNU C (mpgcc)
GNU C++ (mpg++, mpc++)

There are two families of compilers for BGL: GNU and XL. The compiler drivers listed above are found in /usr/local/bin.

Concurrent with the installation of the Release 3 system software, the C 7.0 and Fortran 9.1 compilers are no longer supported. The XL drivers listed above now employ the C 8.0 and Fortran 10.1 compilers. Users who used direct paths to the compilers instead of the above driver scripts will need to modify those paths to point to the new compilers, which are now found under

/opt/ibmcmp/xlf/bg/10.1/
/opt/ibmcmp/vac/bg/8.0/
/opt/ibmcmp/vacpp/bg/8.0/

The following information is available online about the current XL compiler installation:

/opt/ibmcmp/xlf/bg/10.1/README
/opt/ibmcmp/vac/bg/8.0/README
/opt/ibmcmp/vacpp/bg/8.0/README

These are the README files for Fortran, C, and C/C++ respectively. These files contain useful errata to the existing XL compiler documentation.

/opt/ibmcmp/xlf/bg/10.1/doc/en_US
/opt/ibmcmp/vac/bg/8.0/doc/en_US
/opt/ibmcmp/vacpp/bg/8.0/doc/en_US

These directories contain documentation in HTML and PDF format for Fortran, C, and C/C++, respectively.

In addition to the default IBM compilers, patches sometimes become available. When they do, you can access the patched versions of the compilers in two ways. First, the most stable recent patch can be accessed by prefixing the IBM compiler name with new: newmpxlf, newmpxlc, newmpxlC, etc. When possible, an earlier version of the compiler will be available by prefixing the IBM compiler name with old: oldmpxlf, oldmpxlc, oldmpxlC, etc.

Which version of the compiler/patch these scripts point to will change over time. If you want to make sure you are accessing a specific version/patch, you can instead make use of the scripts in /usr/local/tools/compilers/ibm. For example, /usr/local/tools/compilers/ibm/mpxlC-8.0.0.1 is the base version of the 8.0 xlC compiler, and /usr/local/tools/compilers/ibm/mpxlC-8.0.0.1a is the first patch to that release. (Subsequent patches will use the letters b, c, etc.) These specific compiler versions will never change or disappear as long as those versions of the compilers are still available.

Both the GNU and XL compilers are cross compilers. Although one thinks of BGL as a Linux machine, the compute nodes only run a small kernel, and all libraries are statically linked in with the application, so BGL executables will only run on BGL nodes, not on a Linux box such as a front-end node.

Because cross compilers are used, special paths to the compiler binaries, include files, and libraries are required. To hide these details, scripts have been provided in /usr/local/bin and /usr/local/tools/compilers/ibm to make compiling look similar to the way it is done on LLNL's large IBM SP systems, such as Purple. Only a subset of the compiler names available on the IBM SPs is provided for BGL; they are listed above. Note that all the compiler names begin with "mp": it is assumed that all BGL programs use MPI. The compilers are only accessible from the front end nodes.

When doing mixed-language programming, it is necessary to know what libraries are loaded for each language so you can successfully link the whole program. You can obtain all the gory details by building simple test programs in the needed languages with the -v option to the compiler scripts. For example,

   % mpxlf -v -o hello hello.f
   env LD_LIBRARY_PATH=/opt/ibmcmp/xlf/bg/10.1//lib /opt/ibmcmp/xlf/bg/10.1/bin/blrts_xlf -v -o hello hello.f -I/bgl/BlueLight/ppcfloor/bglsys/include -L/bgl/BlueLight/ppcfloor/bglsys/lib -lmpich.rts -lmsglayer.rts -lrts.rts -ldevices.rts
   exec: export(export,XL_CONFIG=/etc/opt/ibmcmp/xlf/bg/10.1/xlf.cfg:blrts_xlf,NULL)
   exec: /opt/ibmcmp/xlf/bg/10.1/exe/xlfentry(/opt/ibmcmp/xlf/bg/10.1/exe/xlfentry,hello.f,/var/tmp/user/F823318eOO516,/var/tmp/user/F823318eOO516F.lst,xlfsmsg.cat,xlfmsg.cat,hello.f,OSVAR(bgl.9.0),32,NOZEROSIZE,SAVE,ALIAS(intptr),POSITION(appendold),XLF90(noautodealloc,nosignedzero),XLF77(intarg,intxor,persistent,noleadzero,gedit77,noblankpad,oldboz,softeof),BGL,DEBUG(nblrl),ARCH(440d),TUNE(440),CACHE(level(1),type(i),size(32),line(32),assoc(64),cost(8)),CACHE(level(1),type(d),size(32),line(32),assoc(64),cost(8)),CACHE(level(2),type(c),size(4096),line(128),assoc(8),cost(40)),GNU_VERSION(3.4.3),-I/bgl/BlueLight/ppcfloor/bglsys/include,WSTREAMS(/var/tmp/user/F823318U4obi8h1,/var/tmp/user/F823318U4obi8b1,/var/tmp/user/F823318U4obi8s1),DEFMSG(/opt/ibmcmp/xlf/bg/10.1/msg/en_US),-I/opt/ibmcmp/xlf/bg/10.1/include,NULL)
** test === End of Compilation 1 ===
   exec: export(export,XL_FRONTEND=/opt/ibmcmp/xlf/bg/10.1/exe/xlfentry,NULL)
   exec: export(export,XL_ASTI=/opt/ibmcmp/xlf/bg/10.1/exe/xlfhot,NULL)
   exec: export(export,XL_BACKEND=/opt/ibmcmp/xlf/bg/10.1/exe/xlfcode,NULL)
   exec: export(export,XL_LINKER=/bgl/BlueLight/ppcfloor/blrts-gnu/powerpc-bgl-blrts-gnu/bin/ld,NULL)
   exec: export(export,XL_DIS=/opt/ibmcmp/xlf/bg/10.1/exe/dis,NULL)
   exec: export(export,XL_BOLT=/opt/ibmcmp/xlf/bg/10.1/exe/bolt.blrts,NULL)
   exec: /opt/ibmcmp/xlf/bg/10.1/exe/xlfhot(/opt/ibmcmp/xlf/bg/10.1/exe/xlfhot,/var/tmp/user/F823318U4obi8h1,/var/tmp/user/F823318U4obi8h2,/var/tmp/user/F823318U4obi8b1,/var/tmp/user/F823318U4obi8b2,/var/tmp/user/F823318U4obi8s1,/var/tmp/user/F823318U4obi8s2,/var/tmp/user/F823318eOO516,/var/tmp/user/F823318eOO516A.lst,-qdebug=nblrl,NULL)
   exec: export(export,XL_FRONTEND=/opt/ibmcmp/xlf/bg/10.1/exe/xlfentry,NULL)
   exec: export(export,XL_ASTI=/opt/ibmcmp/xlf/bg/10.1/exe/xlfhot,NULL)
   exec: export(export,XL_BACKEND=/opt/ibmcmp/xlf/bg/10.1/exe/xlfcode,NULL)
   exec: export(export,XL_LINKER=/bgl/BlueLight/ppcfloor/blrts-gnu/powerpc-bgl-blrts-gnu/bin/ld,NULL)
   exec: export(export,XL_DIS=/opt/ibmcmp/xlf/bg/10.1/exe/dis,NULL)
   exec: export(export,XL_BOLT=/opt/ibmcmp/xlf/bg/10.1/exe/bolt.blrts,NULL)
   exec: /opt/ibmcmp/xlf/bg/10.1/exe/xlfcode(/opt/ibmcmp/xlf/bg/10.1/exe/xlfcode,-qdebug=nblrl,/var/tmp/user/F823318U4obi8h2,/var/tmp/user/F823318U4obi8b2,hello.o,/var/tmp/user/F823318eOO516B.lst,/var/tmp/user/F823318U4obi8s2,NULL)
   1501-510 Compilation successful for file hello.f.
   exec: /bgl/BlueLight/ppcfloor/blrts-gnu/powerpc-bgl-blrts-gnu/bin/ld(/bgl/BlueLight/ppcfloor/blrts-gnu/powerpc-bgl-blrts-gnu/bin/ld,--eh-frame-hdr,-Qy,-melf32ppcblrts,-L/bgl/BlueLight/ppcfloor/bglsys/lib,/bgl/BlueLight/ppcfloor/blrts-gnu/lib/gcc/powerpc-bgl-blrts-gnu/3.4.3/../../../../powerpc-bgl-blrts-gnu/lib/crt1.o,/bgl/BlueLight/ppcfloor/blrts-gnu/lib/gcc/powerpc-bgl-blrts-gnu/3.4.3/../../../../powerpc-bgl-blrts-gnu/lib/crti.o,/bgl/BlueLight/ppcfloor/blrts-gnu/lib/gcc/powerpc-bgl-blrts-gnu/3.4.3/crtbeginT.o,-L/opt/ibmcmp/xlsmp/bg/1.6/blrts_lib,-L/opt/ibmcmp/xlmass/bg/4.3/blrts_lib,-L/opt/ibmcmp/xlf/bg/10.1/blrts_lib,-L/bgl/BlueLight/ppcfloor/blrts-gnu/lib/gcc/powerpc-bgl-blrts-gnu/3.4.3,-L/bgl/BlueLight/ppcfloor/blrts-gnu/lib/gcc/powerpc-bgl-blrts-gnu/3.4.3/../../../../powerpc-bgl-blrts-gnu/lib,-static,-melf32ppcblrts,-o,hello,hello.o,-lmpich.rts,-lmsglayer.rts,-lrts.rts,-ldevices.rts,-dynamic-linker,/lib/ld.so.1,-lxlf90,-lxlopt,-lxlomp_ser,-lxl,-lxlfmath,-lm,-lc,-lgcc,/bgl/BlueLight/ppcfloor/blrts-gnu/lib/gcc/powerpc-bgl-blrts-gnu/3.4.3/crtsavres.o,/bgl/BlueLight/ppcfloor/blrts-gnu/lib/gcc/powerpc-bgl-blrts-gnu/3.4.3/crtend.o,/bgl/BlueLight/ppcfloor/blrts-gnu/lib/gcc/powerpc-bgl-blrts-gnu/3.4.3/../../../../powerpc-bgl-blrts-gnu/lib/crtn.o,NULL)
   unlink: hello.o

This means that if Fortran objects are among those you are linking into an executable via mpxlC, you will need to add the following to your link line:

-L/opt/ibmcmp/xlsmp/bg/1.6/blrts_lib -L/opt/ibmcmp/xlf/bg/10.1/blrts_lib -lxlf90 -lxlopt -lxlomp_ser -lxl -lxlfmath

(Sometimes you can get by with a subset of these libraries.)
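For example, a link step for a hypothetical C++ main program that also contains Fortran objects might look like the following (the object file names are placeholders):

mpxlC -o mixed main.o fsub.o -L/opt/ibmcmp/xlsmp/bg/1.6/blrts_lib -L/opt/ibmcmp/xlf/bg/10.1/blrts_lib -lxlf90 -lxlopt -lxlomp_ser -lxl -lxlfmath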

For archiving objects and running the C-preprocessor, additional scripts to hide the verbose path and name have been created in /usr/local/bin.

Use the following to access ar, ranlib, and cpp:

GNU ar            bglar
GNU ranlib        bglranlib
C preprocessor    bglcpp
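For example, to create and index a static archive from BGL object files (the library and object names here are placeholders):

bglar rcs libmystuff.a sub1.o sub2.o
bglranlib libmystuff.a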

As an alternative to the scripts, you can use the compilers directly by putting the paths to the compiler binaries, include files, and libraries into your makefiles:

BGL_ROOT /bgl/BlueLight/ppcfloor/bglsys
CC_XL /opt/ibmcmp/vac/bg/8.0/bin/blrts_xlc
CXX_XL /opt/ibmcmp/vacpp/bg/8.0/bin/blrts_xlC
F90_XL /opt/ibmcmp/xlf/bg/10.1/bin/blrts_xlf90
F95_XL /opt/ibmcmp/xlf/bg/10.1/bin/blrts_xlf95
LIBS -L$(BGL_ROOT)/lib -lmpich.rts -lmsglayer.rts -lrts.rts -ldevices.rts
CFLAGS -O -qarch=440 -I$(BGL_ROOT)/include
FFLAGS -O -qarch=440 -I$(BGL_ROOT)/include
AR /bgl/BlueLight/ppcfloor/blrts-gnu/bin/powerpc-bgl-blrts-gnu-ar
RANLIB /bgl/BlueLight/ppcfloor/blrts-gnu/bin/powerpc-bgl-blrts-gnu-ranlib
CPP /opt/ibmcmp/vacpp/bg/8.0/bin/blrts_xlC -E -I$(BGL_ROOT)/include
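For illustration only, a minimal makefile fragment wired up from these definitions might look like the following (the target and source names are hypothetical):

BGL_ROOT = /bgl/BlueLight/ppcfloor/bglsys
CC_XL    = /opt/ibmcmp/vac/bg/8.0/bin/blrts_xlc
CFLAGS   = -O -qarch=440 -I$(BGL_ROOT)/include
LIBS     = -L$(BGL_ROOT)/lib -lmpich.rts -lmsglayer.rts -lrts.rts -ldevices.rts

# hypothetical application built from a single C source file
myapp: myapp.o
	$(CC_XL) -o myapp myapp.o $(LIBS)

myapp.o: myapp.c
	$(CC_XL) $(CFLAGS) -c myapp.c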

Note: This approach will only provide access to the default IBM compilers, not any patch updates. It is recommended that the compiler scripts in /usr/local/bin or /usr/local/tools/compilers/ibm be used instead.

Compiler Flags

As of Release 3 of the BGL system software, there is now more complete information from IBM on using BGL systems and the XL compilers. The "Blue Gene/L: Application Development" redbook is available online at http://www.redbooks.ibm.com/cgi-bin/searchsite.cgi?query=blue+gene, and a copy can also be found in /usr/local/docs/BGL_ApplicationDevelopment.pdf. The document "Using the XL Compilers for Blue Gene" may be found at http://www-1.ibm.com/support/docview.wss?uid=swg27007895&aid=1 and also as bg_using_xl_compilers.pdf in /opt/ibmcmp/vacpp/bg/8.0/doc/en_US/pdf and /opt/ibmcmp/xlf/bg/10.1/doc/en_US/pdf. In addition, all the other XL compiler manuals may also be found in those two directories. An older document providing information on choosing good compiler flags is /usr/local/docs/BlueGeneOptimizationTips.html.

But there are some other important things for you to know about compiling for BGL, and we describe those here.

The XL compiler versions available are C 8.0 and Fortran 10.1. By default, these compilers use -qarch=440d, which means they generate floating point code for the "double hummer" floating point units. Ideally, that's what you want to do, but this does not always produce the fastest code. You should experiment with both -qarch=440 and -qarch=440d to see which setting provides the best performance.

In addition, it is wise to use the -g flag to store symbolic information in the binary. This symbolic information is needed to translate the hex addresses provided in the stack traceback section of corefiles into source line numbers. Address translation is done with the addr2line utility or via the getstack script

% getstack <corefile> <executable>

which runs addr2line for you on each hex address in the stack traceback section of the corefile.

The XL compilers contain the BGL-specific Dual FPU SIMDization capability. To enable SIMDization, use -qarch=440d -qtune=440 and either -qhot, -O4, or -O5. To have SIMDization diagnostics sent to stdout, use -qdebug=diagnostic. If you employ interprocedural analysis (-ipa, -O4 or -O5), all optimization flags must appear in both compile and link commands because optimization is deferred to link time. In this case, your best bet is to only use -qdebug=diagnostic when linking since the SIMDization messages emitted before linking can be wrong.

Sometimes the compiler will transform one loop into several loops, which can make the SIMDization messages confusing. Use the -qreport flag to have the compiler generate a listing file that shows how loops are transformed. The loop id numbers in the listing file correspond to the loop numbers in the SIMDization messages; matching them up will help you figure out what was really SIMDized and what was not.
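For example, a compile line that enables SIMDization and requests both the diagnostic messages and the transformation report might look like the following (the source file name is a placeholder):

mpxlf90 -O3 -qarch=440d -qtune=440 -qhot -qreport -qdebug=diagnostic -c kernel.f90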

The diagnostic messages provide some information as to how well the compiler is doing at generating double hummer code. But if you want to see the details, you need to look at the generated assembler. Unfortunately, you can't just use the -S compiler flag to emit assembler, because that output will not include double hummer instructions. To see the assembler with the double hummer instructions, you have three options:

  1. Use the XL compiler flags -qlist -qsource. This will cause the compiler to generate a .lst file for each source file you compile, and the .lst file will contain an annotated assembler listing.
  2. Use 'objdump -d' on an object file to translate the object code into assembler. For BGL objects, you need to use the BGL-specific version of objdump, /bgl/BlueLight/ppcfloor/blrts-gnu/bin/powerpc-bgl-blrts-gnu-objdump
  3. Use TotalView to disassemble your code from within a debugging session.


5. Libraries

The availability of tuned libraries is quite limited at this point (LLNL does not have a license for ESSL). What is available resides in /bgl/local/lib. The libraries currently provided are:

Library                     Description
libblas440.a                Basic BLAS, no double hummer
liblapack440.a              Basic LAPACK, no double hummer
libscalapack.a (with libblacs.a, libblacsCinit.a, libblacsF77init.a)
                            Basic ScaLAPACK, no double hummer
libmpitrace.a               MPI profiling wrappers for C/C++/Fortran
libmpitrace_c.a             MPI profiling wrappers for C/C++ (for backwards compatibility)
libmpitrace_f.a             MPI profiling wrappers for Fortran (for backwards compatibility)
libmpihpm.a                 MPI profiling wrappers, plus hardware counters
libmpihpm_c.a               MPI profiling wrappers, plus hardware counters for C/C++ (for backwards compatibility)
libmpihpm_f.a               MPI profiling wrappers, plus hardware counters for Fortran (for backwards compatibility)
lib_exit.a                  Adds a backtrace to exit() and to the MPI error handler for debugging; provides the traceback() routine (void traceback(void)); replaces libtraceback.a and libmpi_traceback_errors.a
libmemmon.a                 Memory use reporting routines
libstackmonitor.a           Earlier version of libmemmon.a
libdgemm_sc_rel3.rts.a      Double-hummer optimized single-core dgemm (CO or VN modes)
libdgemm_dc_rel3.rts.a      Double-hummer optimized dual-core dgemm (CO mode only)

In addition, an untuned FFTW 2.1.5 can be found in /bgl/local/bglfftwgel-2.1.5 on bgl.
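For example, a Fortran code could be linked against the untuned LAPACK and BLAS libraries as follows (the program and object names are placeholders):

mpxlf90 -o solver solver.o -L/bgl/local/lib -llapack440 -lblas440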

Timing Routine

A high precision hardware timer is available on the compute nodes. The following code sample shows how to write a routine for measuring wall-clock time:

/*----------------------------------------------------------*/
/* elapsed-time timing functions */
/*----------------------------------------------------------*/

#include <rts.h>              /* rts_get_timebase(), rts_get_personality() */
#include <bglpersonality.h>   /* BGLPersonality (if not already pulled in by rts.h) */

#define WTIME(TB)        ((TB) = rts_get_timebase())
#define TCONV(TB1,TB2)   (seconds_per_cycle*((double) ((TB1) - (TB2))))

double second()
{
  static int first = 1;
  static unsigned long long tb0;
  static double seconds_per_cycle = 1.4285714285714285714e-9;  /* 700 MHz default */
  unsigned long long tb;

  if (first) {
    BGLPersonality personality;

    /* Query the node's actual clock frequency instead of assuming 700 MHz. */
    rts_get_personality(&personality, sizeof(personality));
    seconds_per_cycle = 1.0/((double) personality.clockHz);

    first = 0;
    WTIME(tb0);               /* elapsed time is measured from the first call */
  }

  WTIME(tb);
  return TCONV(tb, tb0);      /* cycles since the first call, converted to seconds */
}

Memmon

The libmemmon.a library provides an API to track memory usage in an application. It works by testing the stack and heap locations at the entry and exit of all subroutines and optionally printing out that information. In addition, the library will provide warnings if the heap space becomes low and will exit if the stack overwrites the heap.

To have libmemmon memory checks inserted into your code, compile using the following flags:

for the XL compilers     -qdebug=function_trace
for the GNU compilers    -finstrument-functions

and then link in libmemmon.a.

You can customize when libmemmon prints out the memory state to specific subsets of your code and MPI tasks through the following calls:

void memmon_trace_on(int *rank_p)

Enables printing on each task whose MPI rank equals *rank_p, or on all tasks if *rank_p is -1. Pass some other value (e.g., -3) to leave the current printing setting unchanged.

void memmon_trace_off(int *rank_p)

Disables printing on each task whose MPI rank equals *rank_p, or on all tasks if *rank_p is -1. Pass some other value (e.g., -3) to leave the current printing setting unchanged.

void memmon_print_usage()

Prints the current state of the stack and heap on all tasks. This is in addition to the information printed at subroutine entry and exit.
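The sketch below shows how these calls might be used from C to enable printing on MPI rank 0 around a region of interest; it is only a usage sketch based on the descriptions above, not a complete program.

/* libmemmon entry points, as described above */
void memmon_trace_on(int *rank_p);
void memmon_trace_off(int *rank_p);
void memmon_print_usage(void);

void check_memory_in_solver(void)
{
  int target = 0;              /* ask libmemmon to enable printing on MPI rank 0 */

  memmon_trace_on(&target);

  /* ... call the routines whose memory behavior is of interest ... */

  memmon_print_usage();        /* report the current stack/heap state on all tasks */

  memmon_trace_off(&target);   /* turn printing back off for rank 0 */
}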


6. TotalView Debugger

The TotalView parallel debugger is provided on BlueGene systems. Our license allows users to scale the tool up to the full size of the BGL system. Its practical scalability limit, however, has been tested to be about 4K compute nodes (although users have reported some success cases up to 8K compute nodes).

Most of the relevant commands are exported from /usr/local/bin:

totalview TotalView graphical user interface
totalviewcli TotalView command line interface
mpirun MPI job launch command (see How to Launch and Manage Jobs)
batchxterm A script to pop up an xterm window from which a user can perform debug sessions interactively.

On BlueGene systems, users must use the debugger with the batchxterm script in order to debug parallel jobs. batchxterm performs setup for the parallel execution environment and pops up an xterm window onto the user's display.

For more information about batchxterm, simply type in batchxterm on a BlueGene/L front-end node.

Usage Example:

On a front-end node, verify that the X setting is properly configured by invoking a simple X application, such as xclock.

To get an xterm window from which parallel jobs can be run on 512 compute nodes of bgl, a user may type the following command:

login-prompt> batchxterm $DISPLAY bgl 512 60 '-q pbatch'

The batchxterm arguments specify, respectively,

<display> <machine name> <num of compute nodes> <session mins>

followed by an additional msub argument (see man msub).

Subsequently, at the resulting batch xterm prompt, the following command would execute codeX at the requested scale:

bxterm-prompt> mpirun -verbose 1 -exe `pwd`/codeX -cwd `pwd`

To launch a parallel job under TotalView's control,

bxterm-prompt> totalview mpirun -a -verbose 1 -exe `pwd`/codeX -cwd `pwd`

This command will pop up two graphical TotalView windows. Clicking the GO button on the bigger of the two will launch the debug target (codeX) and begin the parallel debugging session.

To attach TotalView to a running job, first run the application in the background. Then, the following command attaches the debugger to the running application:

bxterm-prompt> totalview -pid <mpirunPid> mpirun

Improvising a History Window

A standard TotalView trick to keep a history of what has gone on in a window is to dump a pane's contents onto stdout using Save Pane To File and choosing Append to File, where the file is stdout. /dev/stdout is not defined in SuSE 9, so you need to use /proc/self/fd/1 (file descriptor 1 of the current process, aka standard out) as the file on BGL.

Floating Point Exception Debugging

TotalView supports floating point exception debugging on BlueGene. With the following XL compiler options, an offending task will generate a SIGFPE UNIX signal when it hits one of the listed exceptions. When a process is under the tool's control and generates a SIGFPE signal, it stops immediately. Then, the user can examine its process context using TotalView to perform a root cause analysis.

XL compiler's floating point exception trap options (see man xlf):

-qflttrap=<suboption1>[:...:<suboptionN>] | -qnoflttrap

Determines what types of floating-point exception conditions to detect at run time. The program receives a SIGFPE signal when the corresponding exception occurs.

The suboptions are:

enable Turn on checking for the specified exceptions in the main program.
imprecise Only check for the specified exceptions on subprogram entry and exit.
inexact Detect and trap on floating-point inexact, if exception checking is enabled.
invalid Detect and trap on floating-point invalid operations.
nanq Detect and trap all quiet not-a-number (NaN) values.
overflow Detect and trap on floating-point overflow.
underflow Detect and trap on floating-point underflow.
zerodivide Detect and trap on floating-point division by zero.

A recommended option set is

-qflttrap=enable:inexact:invalid:nanq:overflow:underflow:zerodivide
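For example, a Fortran code might be built for floating point exception debugging like this (the source file name is a placeholder; keep -g so TotalView can map the fault back to source):

mpxlf90 -g -O2 -qflttrap=enable:inexact:invalid:nanq:overflow:underflow:zerodivide -o fpcheck fpcheck.f90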

Known TotalView Issues On BlueGene:

1. TotalView/mpirun doesn't work in the background (e.g., bxterm-prompt> totalview mpirun -a -exe ... & won't work)

2. There is a problem in BGL's mpirun command that affects TotalView when debugging Virtual Node Mode (-mode VN) jobs. To work around this, specify an alternate mpirun_be for now:

bxterm-prompt> totalview mpirun -a -backend /usr/local/rbin/mpirun_be_20080501
-verbose 1 -mode VN -exe `pwd`/codeX -cwd `pwd`

7. Performance Tools

Mpitrace | Stackmonitor | mpiP | TAU | Valgrind | HPC Toolkit | PAPI

Performance tools on BGL are pretty limited compared to other systems. mpiP, which profiles MPI usage, has been ported to BGL, and so has TAU, which allows one to profile and trace applications, and Valgrind, which detects memory leaks.

Another set of performance tools is the HPC Toolkit from IBM's Advanced Computing Technology Center (ACTC). This pre-release software contains a library (hpm) for using the hardware performance monitors on BGL, a library (mp_profiler) and GUI viewer (peekperf) for profiling and tracing MPI performance, and a tool (Xprofiler) for profiling entire applications.

Finally, through the alphaWorks Web site, IBM Research has made available the Task Layout Optimizer for Blue Gene. This is an online service that takes as input a communications matrix you generate via the mpitrace library and upload to the site. The Mapping Service then computes and returns to you an optimized mapping file for your program (see section 11, Mapping Tasks, for instructions on using mapping files). The Task Layout Optimizer for Blue Gene can be found at http://www.alphaworks.ibm.com/tech/bglmap.

Online, there are some notes on how the performance tools that came from IBM (mpitrace and the HPC Toolkit) were used on the code GTC. This document provides some excellent examples. It can also be found in /usr/local/docs/PerformanceToolsandOptimizationforBlueGene.pdf.

Mpitrace

The first library, mpitrace, allows you to profile and trace MPI calls in your programs, and it now has a new feature to allow subroutine-level profiling of the rest of your application as well. It is found in /bgl/local/lib. The library is implemented via wrappers around MPI calls using the MPI profiling interface, as follows:

int MPI_Send(...) {
   start_timing();
   PMPI_Send(...);
   stop_timing();
   log_the_event();
}

There is one combined wrapper set for Fortran and C:

libmpitrace.a : wrappers for MPI only
libmpihpm.a   : wrappers for MPI plus selected hardware counts

The libmpihpm.a library is the same as the libmpitrace.a library, with the addition that hardware counters are started in the wrapper for MPI_Init() and stopped in the wrapper for MPI_Finalize(). On BG/L, access to the hardware counters is via the bgl_perfctr interface, so you have to also link with -lbgl_perfctr.rts. If you don't need hardware counter values, it is simplest to use the wrappers in libmpitrace.a. Only a few selected counters have been enabled in the libmpihpm.a libraries, including floating-point operations, load/store operations, L3 hits and misses, and torus packet counts.

The wrappers can be used in two modes. The default mode is to collect only a timing summary. The timing summary is collected for the entire run of the application and cannot be restricted to a subset of the program. (If you need a timing summary for just a subset of the program, use mpiP instead. It is similar to mpi_trace.) The other mode is to collect both a timing summary and a time-history of MPI calls suitable for graphical display. To save the time-history, you can set an environment variable on the mpirun command line:

mpirun ... -env TRACE_ALL_EVENTS=yes

Note: Setting environment variables in the shell is not sufficient to make them known at runtime on BGL; they must be set via the -env option to mpirun.

This will save a record of all MPI events after MPI_Init() until the application completes, or until the trace buffer is full. You can also control time-history measurement within the application (but not the timing summary) by calling routines to start/stop tracing:

Fortran Syntax

call trace_start()
do work + mpi ...
call trace_stop()

C Syntax

void trace_start(void);
void trace_stop(void);

trace_start();
do work + mpi ...
trace_stop();

C++ Syntax

extern "C" void trace_start(void);
extern "C" void trace_stop(void);

trace_start();
do work + mpi ...
trace_stop();

When using trace_start()/trace_stop(), don't set TRACE_ALL_EVENTS.

When event tracing is enabled, the wrappers save a time-stamped record of every MPI call for graphical display. This adds some overhead, about 1-2 microseconds per call. The event-tracing method uses a small buffer in memory (up to about 30,000 events per task), and so it is best suited for short-running applications or time-stepping codes run for just a few steps. The "traceview" utility can be used to display the tracefile recorded with event-tracing mode.

When saving MPI event records, it is easy to generate trace files that are just too large to visualize. To cut down on the data volume, the default behavior when you set TRACE_ALL_EVENTS=yes is to save event records from MPI tasks 0-255, or for all MPI processes if there are 256 or fewer processes in MPI_COMM_WORLD. That should be enough to provide a good visual record of the communication pattern. If you want to save data from all tasks, you have to set TRACE_ALL_TASKS=yes. To provide more control, you can set MAX_TRACE_RANK=#. For example, if you set MAX_TRACE_RANK=2048, you will get trace data from 2048 tasks, 0-2047, provided you actually have at least 2048 tasks in your job. By using the time-stamped trace feature selectively, both in time (trace_start/trace_stop) and by MPI rank, you can get good insight into the MPI performance of very large complex parallel applications.

In summary mode (the default) you normally get just the total elapsed time in each MPI routine, the total communication time, and some information on message-size distributions. You can get MPI profiling information that associates elapsed time in MPI routines with the instruction address in the application by setting an environment variable on the mpirun command line:

mpirun ... -env PROFILE_BY_CALLER=yes

This option adds some overhead because it has to do a traceback to identify the code location for each MPI call, but it provides some extra information that can be very useful. Normally you get both the address associated with the location of the MPI call, and also the parent in the call chain. In some cases there may be deeply nested layers on top of MPI, and you may need to profile higher up the call chain. You can do this by setting another environment variable. For example, setting TRACEBACK_LEVEL=2 tells the library to save addresses starting not with the location of the MPI call (level = 1), but from the parent in the call chain (level = 2). To use this profiling option, compile and link with -g; and then use the addr2line utility to translate from instruction address to source file and line number.

In either mode there is an option to collect information about the number of hops for point-to-point communication on the torus. This feature can be enabled by setting an environment variable:

mpirun ... -env TRACE_SEND_PATTERN=yes

When this variable is set, the wrappers keep track of how many bytes are sent to each task, and a matrix is written during MPI_Finalize which lists how many bytes were sent from each task to all other tasks. This matrix can be used as input to external utilities (such as the Task Layout Optimizer for Blue Gene found at http://www.alphaworks.ibm.com/tech/bglmap) that can generate efficient mappings of MPI tasks onto torus coordinates. The wrappers also provide the average number of hops for all flavors of MPI_Send. The wrappers do not track the message-traffic patterns in collective calls, such as MPI_Alltoall. Only point-to-point send operations are tracked.


Another option available in either mode is subroutine-level profiling of the rest of your application. To use this option, you must compile using the flag:

-qdebug=function_trace

This instruments each compiled routine with calls to special routines upon entry to and exit from every function. A simple "flat profile" can be obtained by starting a timer upon entry and stopping it upon exit for each routine compiled with -qdebug=function_trace.

To enable the feature, set the environment variable FLAT_PROFILE=yes when you submit your job:

mpirun ... -env FLAT_PROFILE=yes

Note: When using the FLAT_PROFILE feature, every subroutine compiled with -qdebug=function_trace will be timed. This can add considerable overhead to your program if there are many calls to small subroutines.

To use mpitrace, just add the trace library to your link line (and compile with -qdebug=function_trace if using the FLAT_PROFILE feature). If you are using the compilation scripts, they will ensure that the mpitrace library comes before the MPI library in the link order. If you are linking with the BGL libraries explicitly, you will need to ensure this yourself with a small change in the makefile for the linking step:

Example: Application with MPI routines called from Fortran only

TRACELIB = -L/bgl/local/lib -lmpitrace_f
LIBS = $(TRACELIB) -L$(BGL_ROOT)/lib -lmpich.rts -lmsglayer.rts \
-lrts.rts -ldevices.rts

Example: Application with MPI routines called from C/C++ only

TRACELIB = -L/bgl/local/lib -lmpitrace_c
LIBS = $(TRACELIB) -L$(BGL_ROOT)/lib -lmpich.rts -lmsglayer.rts \
-lrts.rts -ldevices.rts

Example: Application with MPI routines called from both Fortran and C

TRACELIB = -L/bgl/local/lib -lmpitrace
LIBS = $(TRACELIB) -L$(BGL_ROOT)/lib -lmpich.rts -lmsglayer.rts \
-lrts.rts -ldevices.rts

Run the application as you normally would. The wrapper for MPI_Finalize() writes the timing summaries in files called mpi_profile.taskid. The mpi_profile.0 file is special: it contains a timing summary from each task. More detailed information is saved from the task with the minimum, maximum, and median times spent in MPI calls. The flat profile option writes task profiles in files called flat_profile.taskid. The event-tracing mode produces the timing summary files and a binary trace file events.trc. The traceview utility can be used to display the events.trc file in graphical form. To use the event-tracing mode, add -g for both compilation and linking: this is needed to identify the source file and line number for each event.

Trace records contain the starting and ending time for each MPI call and also the parent and grandparent instruction addresses. You can associate instruction addresses with source-file and line-number information using the addr2line utility:

addr2line -e your.executable hex_instruction_address

The -g option can be used along with optimization, but sometimes the actual parent or grandparent may be off a line or two—still adequate to completely identify the event. The "parent address" is the return address after the MPI call. This is typically the next executable statement after the called MPI routine, so don't expect a perfect line-by-line source-code association with your MPI calls. Remember to use -g for both compilation and linking for source-code association.

The event trace file is binary, and so it is sensitive to byte order. While BGL is big endian, your visualization workstation is probably little endian (x86). The trace files are written in little endian format by default. If you use a big endian system for graphical display (examples are Apple OS/X, AIX p-series workstations, etc.), you have two options. You can set an environment variable

mpirun ... -env SWAP_BYTES=no

when you run your job. This will result in a trace file in big endian format. Alternatively you can use the swapbytes utility to convert the trace file from little endian to big endian.


Stackmonitor

The second library, libstackmonitor.a, allows you to track how much stack and heap your program uses. This can be done manually via a user-level API, or semi-automatically using the -pg compiler flag. libstackmonitor.a is found in /bgl/local/lib.

The API consists of the following two functions:

void set_stack_size(void);
void print_stack_size(void);

When the user calls set_stack_size(), the routine logs the current stack address, and keeps track of the minimum stack address. The print_stack_size() routine will print the maximum size of the stack, as well as how much heap has been used:

                          ...
stdout[51]: MPI task 51: max stack = 228.6 KB, max heap = 2068.0 KB
              ...

This print routine should be called from MPI codes while MPI is active: after MPI_Init() and before MPI_Finalize(). The maximum stack and heap used in each task will be printed.

Link with libstackmonitor.a to use these routines:

-L/bgl/local/lib -lstackmonitor

Because it may not be practical to add calls into every subroutine by hand, it would help to have a compiler-generated method to check the stack upon entry into every routine. There is not a direct way to do this, but by co-opting one of the compiler's profiling features, this can be approximated. If you compile with -pg or -p, the compiler generates a call to _mcount upon entry into each routine. libstackmonitor.a leverages this by providing a replacement for _mcount that records the current level of the stack, like a call to set_stack_size(). Unfortunately, _mcount is implemented as a piece of assembler that does not behave like a normal function call. One consequence is that with the _mcount entry, the base address of the stack, rather than the bottom of the stack, is recorded, so the reported stack size misses the stack space used by the routine that is deepest in the call stack.

With this caveat, you can use the _mcount entry as a semi-automatic way of inserting set_stack_size() calls to your program by adding -pg or -p when you compile, but linking with libstackmonitor.a instead of -pg or -p. You still need to add a call to print_stack_size() to get the information out where you can see it.
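For example, a C code could report its memory high-water marks at the end of each time step as sketched below; this is only a usage sketch based on the API described above.

/* libstackmonitor entry points, as described above */
void set_stack_size(void);
void print_stack_size(void);

void time_step(void)
{
  set_stack_size();       /* record the current stack depth on entry */

  /* ... do the work for this step ... */

  /* Print the maximum stack and heap used so far; call this between
     MPI_Init() and MPI_Finalize(). */
  print_stack_size();
}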


mpiP

The mpiP lightweight MPI profiling library is installed in /usr/local/tools/mpiP. Application MPI profile data is provided in a text report generated from within the application's call to MPI_Finalize. An advantage of mpiP over mpi_trace is that one can collect profile data for subsections of a program (see below). Use of mpiP on BlueGene systems requires relinking with the mpiP library, for example:

mpxlc -O2 -g -o com com.o util.o -L/usr/local/tools/mpiP/lib -lmpiP

mpiP exhibits two quirks on BGL:

1) If you run a Fortran job, the mpiP output file is named Unknown.NNN.M.Q.mpiP (where NNN, M, and Q are numbers). Normally, 'Unknown' would be the name of your executable, but the mpiP developers have not yet figured out how to obtain this information from BGL Fortran.

2) The call sites in the output file are given by their hex addresses. Use the mpip-insert-src script (from /usr/local/tools/mpiP/bin) to insert names (mpiP can't do this because it can't fork on BGL):

mpip-insert-src mdcask Unknown.128.0.1.mpiP > mdcask.128.0.1.mpiP

Be sure to compile with -g (in addition to whatever optimization you use) so mpip-insert-src can do its work.

To profile just a subset of your program, in addition to linking in the mpiP library as above, do the following:
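A sketch based on mpiP's standard MPI_Pcontrol() convention is shown below (treat the details as an assumption to verify against the mpiP documentation): profiling is disabled at startup by passing the -o flag in the MPIP environment variable (set via mpirun -env on BGL), and then toggled around the region of interest with MPI_Pcontrol().

#include <mpi.h>

void profile_region_of_interest(void)
{
  /* With MPIP="-o" set at job launch, mpiP starts with profiling disabled. */
  MPI_Pcontrol(1);     /* turn mpiP profiling on for this region */

  /* ... communication-heavy section to be profiled ... */

  MPI_Pcontrol(0);     /* turn mpiP profiling back off */
}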

For more information on mpiP, please see http://mpip.sourceforge.net/

TAU

On BGL systems, TAU provides profiling and tracing performance analysis capabilities for C, C++, Fortran, and MPI applications. It can report wall-clock time or generate trace data. Data collection instrumentation can be inserted by hand using the TAU API or with an automatic instrumentor. TAU is installed on BlueGene platforms in /usr/local/tools/tau. Documentation can be found in /usr/local/tools/tau/doc. Examples that require some configuration can be found in /usr/local/tools/tau/examples. For more information on TAU, please see http://www.cs.uoregon.edu/research/tau/home.php/.

Valgrind

Valgrind's Memcheck is a heavyweight memory-correctness tool. memcheck_link adds link options to create a Memcheck-instrumented BGL executable. To use it, simply precede your normal BGL link line with memcheck_link. For example,

% memcheck_link mpcc -g -o testmpi_mc testmpi.c

Then run the Memcheck-instrumented executable on the BGL system to check it for memory errors (it will run slowly). Task 0 of the Memcheck-instrumented executable will print:

valgrind MPI wrappers active

Memcheck output is written to executable_name.#[.seq].mc for each MPI task #. A .seq number may be added to avoid output file name conflicts.

Use memcheckview (e.g., memcheckview testmpi_mc.0.mc) to view the output files. See https://computing.llnl.gov/code/memcheck/ for how to interpret Memcheck's output.

HPC Toolkit

The HPC Toolkit consists of the following tools:

hpm

A library for using the hardware performance monitors on BGL. It is similar to the library of the same name that has been available on AIX for several years, but there is no hpmcount utility. The pre-release version works well.

mp_profiler

A library for profiling and tracing MPI calls. It is similar to mpitrace, but is not yet as useful. At this point, you are better off using mpitrace.

peekperf

A GUI to view the profiling and trace output from mp_profiler. The pre-release version is quite buggy.

Xprofiler

A GUI to display gprof profiles, including function call graphs and time spent in each source line. Compile and link with -g -pg, and BGL runs will generate gmon.out files that can be viewed with xprofiler. Its only limitation is that it has no real support for parallelism: gmon.out files can be viewed individually or their results coalesced; comparing the results from different tasks requires bringing up separate xprofiler GUIs for each task.

The HPC Toolkit can be found in /usr/local/hpct_bgl. There is a separate subdirectory for each tool containing documentation, binaries and/or libraries, and examples.

PAPI

PAPI is an application programming interface that allows a user process to access the hardware performance counters available on contemporary microprocessors.

On BGL, version 2 has been ported and installed in /usr/local/tools/papi. Its library, libpapi.rts.a, is in /usr/local/tools/papi/lib, and its header files, f77papi.h, f90papi.h, fpapi.h, papi.h, and papiStdEventDefs.h, reside in /usr/local/tools/papi/include. Documentation on how to use the API is available at http://icl.cs.utk.edu/projects/papi/wiki/Main_Page. Work is under way to port version 3 (PAPI 3.x) to this machine as well.

On this architecture, PAPI fetches native hardware events through the native performance counter layer (/bgl/BlueLight/ppcfloor/bglsys/lib/libbgl_perfctr.rts.a) and converts them into its predefined event set. The set of currently available predefined events is as follows:

PAPI_L3_TCM Level 3 cache misses
PAPI_L3_LDM Level 3 load misses
PAPI_L3_STM Level 3 store misses
PAPI_FMA_INS FMA instructions completed
PAPI_TOT_CYC Total cycles (Timebase register (null))
PAPI_L2_DCH Level 2 data cache hits
PAPI_L2_DCA Level 2 data cache accesses
PAPI_L3_TCH Level 3 total cache hits
PAPI_FML_INS Floating point multiply instructions
PAPI_FAD_INS Floating point add instructions
PAPI_BGL_OED Floating point Oedipus operations
PAPI_BGL_TS_32B 32B chunks sent in any torus link
PAPI_BGL_TS_FULL CLOCKx2 cycles with no torus token (accum)
PAPI_BGL_TR_DPKT Data packets sent on any tree channel
PAPI_BGL_TR_FULL CLOCKx2 cycles with tree receiver full (accum)

In addition to the predefined events, PAPI also allows users to access native events directly. Users who desire finer granularity in measuring hardware events should consider using the native event access mode. The native event set is contained in the header file /bgl/BlueLight/ppcfloor/bglsys/include/bgl_perfctr_events.h.
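As an illustration, the sketch below uses PAPI's high-level counter interface to read two of the predefined events around a compute kernel. Treat the exact calls as an assumption to verify against the installed PAPI documentation, and link with libpapi.rts.a (and the underlying bgl_perfctr library) as appropriate.

#include <stdio.h>
#include "papi.h"

void count_kernel_events(void)
{
  int events[2] = { PAPI_FMA_INS, PAPI_TOT_CYC };
  long_long values[2];                    /* PAPI's 64-bit counter type */

  if (PAPI_start_counters(events, 2) != PAPI_OK) {
    fprintf(stderr, "PAPI_start_counters failed\n");
    return;
  }

  /* ... compute kernel to be measured ... */

  if (PAPI_stop_counters(values, 2) == PAPI_OK)
    printf("FMA instructions = %lld, total cycles = %lld\n",
           values[0], values[1]);
}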


8. Memory Constraints on BGL

Nodes typically have only 512 MB of physical memory, and slightly less than this is available to user programs. For example, a simple program written to measure available memory was able to allocate 508 1-MB buffers per task. In virtual node mode, the same program was able to allocate 252 1-MB buffers per task.

There are now some nodes with 1024 MB of physical memory. These are generally placed in a separate pool (called phighmem) so they can be specifically requested. As with the low memory nodes, up to about 4 MB of memory is used by the run-time system, leaving about 1020 MB of memory available to application tasks in coprocessor mode, and 508 MB in virtual node mode.


9. How to Launch and Manage Jobs

Jobs on bgl are run in batch mode using Moab (see https://computing.llnl.gov/jobs/moab for information on using Moab). There are generally two batch pools on bgl: plowmem, which contains nodes with 512 MB of physical memory, and phighmem, with nodes containing 1024 MB of memory. In addition, there is a pdebug pool for small, short-running jobs (usually consisting of high memory nodes), and there may be special project-related pools at various times. Users should contact the LC Hotline (x2-4531) if they have problems determining the proper banks to use.

BlueGene systems have several unique features that make for a few differences in how Moab operates there. On bgl, only a multiple of 512 nodes may be allocated. A Moab request that does not match one of the allowed sizes, for example,

msub -l nodes=100 ...

will wait in the queue indefinitely.

The argument of the -l nodes=<val> option to msub (or the -ln option to the psub wrapper) refers to the number of compute nodes you want your job to run on. With versions of SLURM prior to 1.1, this referred to the number of base partitions. A base partition contains 512 BlueGene compute nodes and is connected in an 8x8x8 three-dimensional torus. This is also known as a midplane.

On bgldev, you can request a portion of the midplane:

msub -l nodes=32 ...

You must, however, specify a node count supported by the hardware (32, 128, 512, or 1024 for bgldev).

When using msub, one can use 'k' as a shorthand for 1024:

msub -l nodes=2k ...

The 'k' shorthand, however, is not supported by the psub wrapper.

BGL-specific options can be passed to SLURM via the new msub --slurm option. The --slurm option must be the last msub option on the command line. All options after that are passed directly to Slurm's sbatch command. For example,

msub <program> -l nodes=512 --slurm --linux-image=<path>

When using the psub wrapper, one used to be able to pass BGL-specific attributes via the -bgl psub option, but that option is no longer supported; msub --slurm must now be used instead.

BGL's pdebug partition has been configured to allow job submissions using only the msub (or psub) command. We have configured the BGL/uBGL pdebug partition to disallow job submissions directly to SLURM using the sbatch command. This allows us to schedule the pdebug nodes based on fair-share instead of simple FIFO. This breaks with the policy in place on other LC machines with pdebug partitions due to the large resources allocated to pdebug on BGL and uBGL. Note that when submitting a pdebug job you must use the msub -q pdebug option (or psub -pool debug).

A sample batch script is shown below. Here, mpirun is a BGL-specific command to run jobs; its syntax is different than the mpirun found on other systems (see below):

#!/bin/csh
#MSUB -l nodes=1024
#MSUB -q plowmem
cd /home/jdoe/job1
mpirun -verbose 1 -exe /home/jdoe/job1/a.out -cwd /home/jdoe/job1
echo 'Done'
date

Examining Batch Job Status

Job status can be checked using mshow(1), squeue(1), pstat(1), ju(1), or jr(1).

Killing Batch Jobs

Terminating jobs that have been submitted to Moab should be done using mjobctl(1). Please do not use scancel(1).

For more information on using the batch system, see: https://computing.llnl.gov/tutorials/moab/.

SLURM

Although batch jobs on BGL are now launched via Moab, SLURM is still the underlying job scheduler for BGL. As a result, many SLURM commands still provide useful information on BGL, even though srun may no longer be used. Details of the full BGL implementation can be found at https://computing.llnl.gov/linux/slurm/bluegene.html.

You can see what partitions have been configured via SLURM's smap(1) command:

% smap -cDb
Thu May 15 15:43:34 2008
PARTITION BG_BLOCK STATE USER CONN NODE_USE NODES BP_LIST
debug RMP15My153102251 READY slurm TORUS COPROCESS 1024 bgldev[000x001]

Note that submitting a batch job to a partition size that does not exist will result in the batch job sitting in the queue indefinitely, so it is useful to check on the available partitions before submitting a job.

You can see which partitions are currently running jobs via the blocks_in_use command (not a SLURM command, however):

% blocks_in_use

Block Owner ST MO Allocate Time Jobid Start Time
RMP15My153102251 slurm I C 05-15-15:32:16 NONE n/a

You can also use smap(1) to see a curses generated text-based graph of the jobs running on the machine or what the current partitions are. By default, smap will take over the terminal window it runs in, so a convenient way to use it is in its own window:

% xterm -geometry 120x30 -e /usr/bin/smap -Dj -i10 >& /dev/null &

This invocation will show the jobs currently running on the system (-Dj), but you can type b to see the current blocks (i.e., partitions), and j to return to a jobs view (see the smap(1) man page for more details). In order to display properly, smap needs to run in a window wider than 80 characters; 112 characters is the minimum required for BGL.

smap can also be invoked without the curses interface by using the -c option.

Two additional SLURM commands, squeue(1) and scontrol(1), can also be used to get information on the jobs running on the system. squeue lets you customize which fields are printed out, and you can sort them however you want. By default, it does not print the BGL-specific fields. Below is an example showing how to print the bglblock info, including whether the block is in co-processor or virtual node mode. You can also use an environment variable to control the squeue output format.

% squeue
   JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
   65630 debug jobscrip spelce1 R 2:46:18 64 bgl[000x333]

% squeue -o "%.7i %.9P %.8j %.8u %.2t %.9M %.6D %s %R"
   JOBID PARTITION NAME USER ST TIME NODES CONNECT ROTATE NODE_USE GEOMETRY PART_ID NODELIST(REASON)
   65630 debug jobscrip spelce1 R 2:47:25 64 nav yes coproces 0x0x0 RMP115 bgl[000x333]

And here is an example of using scontrol to get detailed information about a job:

% scontrol show job
JobId=65630 UserId=spelce1(45613) GroupId=spelce1(45613)
  Name=jobscript_SLURM JobState=RUNNING
  Priority=4294901666 Partition=debug BatchFlag=1
  AllocNode:Sid=bgl1:4299 TimeLimit=1440
  StartTime=03/21-07:07:38 EndTime=03/22-07:07:38
  NodeList=bgl[000x333] NodeListIndicies=0,63,-1
  ReqProcs=64 MinNodes=0 Shared=0 Contiguous=0
  MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
  Dependency=0 Account=(null) Reason=None
  ReqNodeList=(null) ReqNodeListIndicies=-1
  ExcNodeList=(null) ExcNodeListIndicies=-1
  Connection=nav Rotate=yes NodeUse=coprocessor Geometry=0x0x0 Part_ID=RMP115

Please refer to the smap(1), squeue(1) and scontrol(1) man pages for more information.

Mpirun

mpirun is the command used inside the batch script to run a job on the partition allocated via SLURM. There is no IBM-distributed man page for mpirun, but documentation on mpirun can be found in

/usr/local/doc/mpirun_manual3.html
/usr/local/doc/mpirun_manual3.pdf

The basic syntax for the mpirun command(s) is

mpirun [-h] [-mode <CO/VN>] [-np <number of tasks>] [-verbose <0-4>]
   [-mapfile <file name>] -exe <executable> [-cwd <directory>]
   [-args "<list of arguments>"] [-env "<env. vars>"]

For options such as -args and -env for which you may want to supply multiple values, simply put those values between double quotes:

mpirun ... -env "TRACE_ALL_EVENTS=yes SWAP_BYTES=no"

There are several things you need to be aware of when using mpirun. In particular, stdin is not available to the application running on the compute nodes.

There is a workaround for Fortran, however, that only requires a small modification to the code. At the start of the program add the two lines:

close(5)
open(5)

Then create a file with the data that would have been entered via stdin and either name it fort.5 or softlink the file to fort.5 (ln -s fort.5 filename).

Here is an example of a batch script containing two mpirun invocations:

mpirun -verbose 1 -exe /bgl/test/dd3d -cwd /bgl/test -args cn.128
mpirun -verbose 1 -exe /bgl/test/hello -cwd /bgl/test

Note that the mpirun <executable> must be an absolute pathname (starting with "/").

RAS Events

Sometimes, jobs will fail to run due to a RAS event. A RAS event can indicate any number of problems, either hardware or software. The system administrators currently monitor the system regularly for a known subset of "fatal" RAS errors that always indicate bad hardware according to IBM. Additionally, the sysadmins are developing a knowledge base of other RAS errors that also seem to point to bad hardware. Besides these, the other main categories of RAS errors are typically transient hardware errors (no action required unless a pattern of these against a particular node is observed) and software-related panics from the compute node kernel.

At one of the BGL consortium meetings, IBM mentioned that there is very little protection of the hardware in the compute node kernel. As a result, an errant user application can potentially stomp on registers or other areas of "system space" and cause a node to panic. Expect to see some number of RAS events due to this sort of thing.

If your job fails due to a RAS event and you would like further analysis into the nature of the failure, please contact LC Hotline and open a ticket. Please provide the jobid and any relevant error messages you received.


10. BGL Environment Variables

Several environment variables are available to control the algorithms used by the BGL MPI implementation, as well as some cache behavior. These are documented in Appendix D of the Blue Gene/L: Application Development Redbook, BGL_ApplicationDevelopment.pdf, but we present the basics here.

(Note, too, that Release 3 supports several new system calls. They are documented in Chapter 3 and Appendix E of the Redbook.)

Use the -env "<env. vars>" option to mpirun to set them. For example,

mpirun -verbose 1 -env "BGL_APP_L1_SWOA=1" -exe <executable> ...

The environment variable BGL_APP_L1_SWOA is used to control the store-without-allocate (SWOA) feature of the L1 cache. It is useful when the application performs writes to cache lines that are not present in the L1 cache. By default (BGL_APP_L1_SWOA=0), the processor first loads the cache line into L1 and performs a read-modify-write. When SWOA is enabled (BGL_APP_L1_SWOA=1), the write is echoed down to the L3 cache and the L3 performs a read-modify-write, thus avoiding pollution of the L1 cache. Some codes can see a big performance boost with SWOA enabled.

By default, if your job encounters an L1 parity error, it will die. For runs on 64k nodes, the mean time between such errors is about 6 hours. You can eliminate nearly all machine checks due to these parity errors by enabling L1 write-through on a per-job basis. With the previous system software, Release 2, this incurred a performance penalty of 10 to 40%. With the Release 3 software, the penalty is advertised as being much less, but we have not yet quantified that.

To enable write-through, set the environment variable BGL_APP_L1_WRITE_THROUGH=1 in your job.  For example,

mpirun -verbose 1 -mode CO -env BGL_APP_L1_WRITE_THROUGH=1
   -exe /home/bertsch2/src/bgl_hello/mpi_hello
   -cwd /home/bertsch2/src/bgl_hello -args 1

The default setting is BGL_APP_L1_WRITE_THROUGH=0 (or unset), which means no write-through, and your job will die if it encounters an L1 parity error.

The environment variable BGLMPI_RZV (or its synonyms BGLMPI_RVZ and BGLMPI_EAGER) is used to set the threshold between the eager and rendezvous messaging protocols. The value you supply is the threshold in bytes:

mpirun -verbose 1 -env "BGLMPI_RZV=1000" -exe <executable> ...

There are a few new environment variables supported by the Release 3 software. Setting BGLMPI_COPRO_SENDS=1 causes sends, as well as receives, to be handled by the co-processor when running in CO mode. Prior to Release 3, the co-processor only handled receives, which meant that only receives could be overlapped with computation done by the main processor. This new setting might produce better performance for some applications.

In Release 3, interrupt-driven communications support was added to allow one-sided communications (see Chapter 7 of the Blue Gene/L: Application Development Redbook for more information on one-sided communications). The BGLMPI_INTERRUPT environment variable controls the use of interrupts for communications. It has four settings:

    Y: Turn on both send and receive interrupts
    N: Turn off both send and receive interrupts
    S: Turn on only send interrupts
    R: Turn on only receive interrupts

Some codes will benefit from enabling interrupts for receives, sends, or both, but you'll need to experiment to find out what is best for your code.

The other environment variables can turn on/off certain features, such as forcing specific collective operations to use the MPICH implementation rather than the BGL-optimized version.

If you want to disable all IBM BGL MPI optimizations, this can now be done with the environment variable BGLMPI_COLLECTIVE_DISABLE. Just set this environment variable to 1 and your code will revert to MPICH algorithms for all MPI routines.
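
For example (a sketch; substitute your own executable):

mpirun -verbose 1 -env "BGLMPI_COLLECTIVE_DISABLE=1" -exe <executable> ...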

If you want to modify the behavior of the various collectives individually, use the following environment variables. Their general form is BGLMPI_{COLLECTIVE}={GI,TREE,TORUS,MPICH}. Currently, if you choose a collective implementation and it cannot be used, the collective will fail over to the MPICH version. If you use the default settings, however, the network hierarchy is used. For example, BGLMPI_ALLREDUCE=TREE will attempt to use the tree for reductions; if the tree is not available for the communicator used, the MPICH implementation is used instead. If the environment variable is left unset, allreduce will try TREE first and fail over to the torus-optimized implementation. Below, we show the details just for ALLREDUCE.

1. BGLMPI_ALLREDUCE={TREE, TORUS, MPICH}

TORUS:TREE Tree-optimized implementation for MPI_COMM_WORLD, torus for rectangular communicators. This is the default.

TORUS This chooses the torus optimized implementation of allreduce.

TREE Tree optimized implementation of allreduce.

TREE:d:f:i:2:TORUS:d:f:i:2 Allows the user to selectively control the implementation based on datatype: d=double, f=float, i=integer, 2=double (experimental code). For example, to reduce doubles on the tree and floats on the torus, you would set BGLMPI_ALLREDUCE=TREE:d:TORUS:f (see the example commands after this list).

MPICH Use mpich implementation rather than BGL-optimized algorithms.

2. BGLMPI_BARRIER={GI, TREE, TORUS, MPICH}

GI is the default (works on 32, 64, 512, and midplane multiples). TREE works on MPI_COMM_WORLD; TORUS works on rectangular communicators.

3. BGLMPI_BCAST={TREE, TORUS}

4. BGLMPI_REDUCE={TREE, TORUS, MPICH}

5. BGLMPI_ALLTOALL={TORUS:{n}, MPICH}

n is the packet inject parameter (default 3).

6. BGLMPI_ALLTOALLV={TORUS:{n}, MPICH}

n is the packet inject parameter (default 3).

7. BGLMPI_ALLGATHER={TORUS:{n}, MPICH}

n is the packet inject parameter (default 3).

8. BGLMPI_ALLGATHERV={TORUS:{n}, MPICH}

n is the packet inject parameter (default 3).
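
As sketches of how these variables are set on the mpirun command line (the specific values are illustrative), the datatype-split allreduce from item 1 and an alltoall with a larger packet inject parameter would look like:

mpirun -verbose 1 -env "BGLMPI_ALLREDUCE=TREE:d:TORUS:f" -exe <executable> ...

mpirun -verbose 1 -env "BGLMPI_ALLTOALL=TORUS:4" -exe <executable> ...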

Top

11. Mapping Tasks

The main network on BGL for point-to-point communication is the torus network. To get the most benefit from the torus network, it is useful to have good communication locality. On BGL a block of nodes is initialized at boot time, and the default mapping is to use (x,y,z) order to lay out MPI tasks onto nodes in the block. This mapping may not be ideal, so the MPI implementation provides two mechanisms to alter this.

The first is the environment variable BGLMPI_MAPPING. Its default setting, BGLMPI_MAPPING=XYZT, maps MPI tasks to the first CPU of each node in x,y,z order and then, if virtual node mode is used, to the second CPU of each node, again in x,y,z order. You can change this by setting BGLMPI_MAPPING to the order you want. For example, in virtual node mode, BGLMPI_MAPPING=TXYZ maps consecutive pairs of tasks to the two CPUs of each node, filling in the nodes in x,y,z order.
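
A sketch of a virtual node mode launch using this mapping (substitute your own executable):

mpirun -verbose 1 -mode VN -env "BGLMPI_MAPPING=TXYZ" -exe <executable> ...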

The second mechanism is a mapping file that explicitly specifies the torus coordinates for each MPI task. The format of the mapping file is:

x0 y0 z0 t0
x1 y1 z1 t1
x2 y2 z2 t2
...

where MPI task 0 is mapped to torus coordinates x0,y0,z0 using processor t0 on that node. The processor number, t0, is always 0 for co-processor mode, and would be either 0 or 1 for virtual node mode. There is one line in the mapping file for each MPI task, in MPI order.

The mapping file is specified via the -mapfile <file name> argument to mpirun.
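
As a sketch, a mapping file for a hypothetical 2x2x1 co-processor-mode block with four MPI tasks, placing consecutive tasks along x first, would contain:

0 0 0 0
1 0 0 0
0 1 0 0
1 1 0 0

and would be used as follows (the file name my.map is illustrative):

mpirun -verbose 1 -mode CO -mapfile my.map -exe <executable> ...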

Mappings can be checked at runtime using routines and data structures defined in rts.h. For example, the following code shows how to determine each task's location in the torus:

#include <mpi.h>
#include <rts.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    BGLPersonality personality;
    int my_torus_x, my_torus_y, my_torus_z;
    int torus_x_size, torus_y_size, torus_z_size;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Query this node's "personality," which contains its torus
       coordinates and the dimensions of the torus. */
    rts_get_personality(&personality, sizeof(personality));
    my_torus_x = personality.xCoord;
    my_torus_y = personality.yCoord;
    my_torus_z = personality.zCoord;
    torus_x_size = personality.xSize;
    torus_y_size = personality.ySize;
    torus_z_size = personality.zSize;

    printf("task %d is at torus coordinates (%d,%d,%d) in a %dx%dx%d torus\n",
           rank, my_torus_x, my_torus_y, my_torus_z,
           torus_x_size, torus_y_size, torus_z_size);

    MPI_Finalize();
    return 0;
}

Choosing an optimum task mapping can be difficult. Through its alphaWorks Web site, however, IBM Research has made available the Task Layout Optimizer for Blue Gene. This online service takes as input a communications matrix that you generate with the mpitrace library and upload to the site; it then computes and returns an optimized mapping file that you can pass to your program via the -mapfile argument to mpirun. The Task Layout Optimizer for Blue Gene can be found at http://www.alphaworks.ibm.com/tech/bglmap.

Top

12. Known Problems

1. With Release 3 of the BGL system software, the C 7.0 and Fortran 9.1 compilers are no longer supported. Binaries built using the Release 2 software will run. If you recompile under Release 3, however, the entire application, including all its libraries, needs to be rebuilt. If you see an error such as

undefined reference to `__ctype_toupper'

or

Error: 1498 undefined reference to '__ctype_48'

that indicates that you are mixing Release 3 and Release 2 objects. One way this can happen is if your makefile hard-codes the path to the Fortran libraries when linking C/C++ and Fortran code together, and that path has not been updated to point to the new Fortran 10.1 libraries.

If the Release 2 object turns out to be a system library and there is no Release 3 version, please contact the LC Hotline at lc-hotline@llnl.gov or (925) 422-4531 to open a ticket.

2. With Release 3 of the BGL system software, malloc() sets errno to ENOSYS [38] for a successful allocation, rather than leaving it alone or setting it to 0. Although this is not a violation of errno conventions (errno is only meaningful after a system call has failed, and malloc() sets errno to ENOMEM [12] in that case), this behavior can cause problems with the way some codes do error checking.
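
The practical implication is to test the pointer returned by malloc() rather than errno. A minimal sketch:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* On BGL Release 3, errno may be ENOSYS even after a successful
       allocation, so check the returned pointer, not errno. */
    double *buf = malloc(1000000 * sizeof(double));
    if (buf == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    free(buf);
    return 0;
}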

3. Some users have noticed that sends issued via MPI_Issend don't make progress as fast as they should. Setting BGLMPI_INTERRUPTS= works around this problem.

4. By default, if your job encounters an L1 parity error, it will die. For full-system runs, the mean time between such errors is about 6 hours. At the expense of a 10 to 40% performance penalty (smaller under the Release 3 software; see Section 10), you can eliminate nearly all machine checks due to these parity errors by enabling L1 write-through on a per-job basis. To enable write-through, set the environment variable BGL_APP_L1_WRITE_THROUGH=1 in your job. You do this on the mpirun command line.

Example:

mpirun -verbose 1 -mode CO -env BGL_APP_L1_WRITE_THROUGH=1
   -exe /home/bertsch2/src/bgl_hello/mpi_hello
   -cwd /home/bertsch2/src/bgl_hello -args 1

The default setting is BGL_APP_L1_WRITE_THROUGH=0 (or unset), which means no write-through, and your job will die if it encounters an L1 parity error.

5. mpirun does not read from standard input. Programs that expect input data via redirection of stdin, for example,

a.out < input

need to be modified to read the input data from a named file.
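
For a C code, a minimal sketch of such a modification (the file name input.dat is illustrative):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n;

    /* Read from an explicitly named file instead of stdin, since mpirun
       does not forward standard input to the compute nodes. */
    FILE *in = fopen("input.dat", "r");
    if (in == NULL) {
        perror("input.dat");
        return EXIT_FAILURE;
    }
    if (fscanf(in, "%d", &n) == 1)
        printf("read %d\n", n);
    fclose(in);
    return 0;
}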

There is a workaround for Fortran, however, that requires only a small modification to the code. At the start of the program, add the two lines:

close(5)
open(5)

Then create a file containing the data that would have been entered via stdin, and either name it fort.5 or soft-link it to fort.5 (ln -s filename fort.5, where filename is your data file).

6. When compiling, if you see the errors:

    "/bgl/BlueLight/ppcfloor/bglsys/include/mpicxx.h", line 26.2: 1540-0859 (S) #error directive: "SEEK_SET is #defined but must not be for the C++ binding of MPI".

    "/bgl/BlueLight/ppcfloor/bglsys/include/mpicxx.h", line 30.2: 1540-0859 (S) #error directive: "SEEK_CUR is #defined but must not be for the C++ binding of MPI".

    "/bgl/BlueLight/ppcfloor/bglsys/include/mpicxx.h", line 35.2: 1540-0859 (S) #error directive: "SEEK_END is #defined but must not be for the C++ binding of MPI".

you have hit a known bug in the MPI-2 standard. You can see documentation on this in the MPICH FAQ at
http://wiki.mcs.anl.gov/mpich2/index.php/Frequently_Asked_Questions.

The underlying problem is that both stdio.h and the MPI C++ interface use SEEK_SET, SEEK_CUR, and SEEK_END.

There are a few ways to work around this problem; these are the standard approaches described in the MPICH FAQ (a sketch follows the list):

    Compile with -DMPICH_SKIP_MPICXX, which prevents mpi.h from including the C++ bindings (mpicxx.h) at all. This is appropriate only if your code does not use the MPI C++ interface.

    #undef SEEK_SET, SEEK_CUR, and SEEK_END after including stdio.h (or iostream) but before including mpi.h.

    Reorder your includes so that mpi.h comes before stdio.h or iostream.
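
A minimal sketch of the #undef approach (the source excerpt is illustrative):

#include <stdio.h>

/* The MPI C++ binding forbids these macros, so remove the stdio.h
   definitions before pulling in mpi.h. */
#undef SEEK_SET
#undef SEEK_CUR
#undef SEEK_END

#include <mpi.h>

Alternatively, skip the C++ bindings entirely by adding the define on the compile line (the file name mycode.C is illustrative):

% mpxlC -DMPICH_SKIP_MPICXX -c mycode.C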

So why did this problem suddenly appear/disappear? Periodically, IBM refreshes the MPICH base used in the BGL drivers, and sometimes Argonne changes its code. Prior to the Release 2 BGL driver, mpi.h used to have two checks before including mpicxx.h:

#if defined(HAVE_MPI_CXX) && !defined(MPICH_SKIP_MPICXX)
#include "mpicxx.h"
#endif

But in the MPICH base used for the Release 2 driver, this has been changed to simply

#if !defined(MPICH_SKIP_MPICXX)
#include "mpicxx.h"
#endif

Argonne has been asked why they made this change, and perhaps the problem will be fixed in the future. In the meantime, you'll need to use one of the workarounds above.

7. When linking, you may encounter many undefined reference errors for _nss_files_* routines. For example:

/bgl/BlueLight/ppcfloor/blrts-gnu/powerpc-bgl-blrts-gnu/lib/libc.a(nsswitch.o):
In function `__nss_database_lookup':
/bgl/gnu255/gnu/glibc-2.2.5/nss/nsswitch.c:115: undefined reference to `_nss_files_getaliasent_r'
/bgl/gnu255/gnu/glibc-2.2.5/nss/nsswitch.c:115: undefined reference to `_nss_files_endaliasent'
/bgl/gnu255/gnu/glibc-2.2.5/nss/nsswitch.c:116: undefined reference to `_nss_files_setaliasent'
/bgl/gnu255/gnu/glibc-2.2.5/nss/nsswitch.c:116: undefined reference to `_nss_files_getaliasbyname_r'
/bgl/gnu255/gnu/glibc-2.2.5/nss/nsswitch.c:117: undefined reference to `_nss_files_getetherent_r'
/bgl/gnu255/gnu/glibc-2.2.5/nss/nsswitch.c:115: undefined reference to `_nss_files_endetherent'
/bgl/gnu255/gnu/glibc-2.2.5/nss/nsswitch.c:115: undefined reference to `_nss_files_setetherent'
/bgl/gnu255/gnu/glibc-2.2.5/nss/nsswitch.c:119: undefined reference to `_nss_files_getgrent_r'
/bgl/gnu255/gnu/glibc-2.2.5/nss/nsswitch.c:119: undefined reference to `_nss_files_endgrent'
/bgl/gnu255/gnu/glibc-2.2.5/nss/nsswitch.c:119: undefined reference to `_nss_files_setgrent'
                                    .
                                    .
                                    .

BG/L uses static linking, and this can cause some problems with the GNU libraries that were designed for dynamic linking. For static linking, you need to explicitly link against the nss libraries:

% mpxlC ... -lnss_files -lnss_dns -lresolv -lc -lnss_files -lnss_dns -lresolv

The repetition is not a mistake: you need to list the libraries twice.

8. We previously warned against having executables reside on Lustre file systems. The underlying issues, regarding locking around mmap operations, have since been addressed, and we have not observed subsequent problems.

9. TotalView with mpirun does not work in the background (e.g., bxterm-prompt> totalview mpirun -a -exe ... & will not work).

10. There is a problem in BGL's mpirun command that affects TotalView when debugging Virtual Node Mode (-mode VN) jobs. To work around this, specify an alternate mpirun_be for now:

bxterm-prompt> totalview mpirun -a -backend /usr/local/rbin/mpirun_be_20080501
-verbose 1 -mode VN -exe `pwd`/codeX -cwd `pwd`

11. The mysterious error message:

lost contact with control node <N>. Connection reset by peer

which sometimes shows up just before your job terminates and before any output from your code, is really an indication that the -cwd path in your mpirun command was invalid.

12. The mysterious error message:

RAS event: KERNEL FATAL: rts tree/torus link training failed:
wanted: A B C X+ X- Y+ Y- Z+ Z- got: B C X+ X- Y+ Y- Z+ Z-

is an indication of faulty hardware. Please contact the LC Hotline for assistance: lc-hotline@llnl.gov or (925) 422-4531.

13. If you encounter the error message:

172.16.128.10:7000: Connection refused

please contact the LC Hotline for assistance, lc-hotline@llnl.gov or (925) 422-4531, so that the problem can be investigated.

14. The standard cpp invoked by xlf may not work as you expect. The example below shows that it does not strip out C-style comments (the -d flag causes the preprocessed file to be written to Ffoo.f), which can cause the compile to fail:

% cat foo.F
#
# /* this is a comment */
x = 1
do 11 i = 1,3
write(6,*) x
11 continue
stop
end

% mpxlf -d foo.F
"Ffoo.f", line 1.1: 1515-017 (S) Label contains characters that are not
permitted. Label is ignored.
"Ffoo.f", line 2.24: 1515-019 (S) Syntax is incorrect.
** _main === End of Compilation 1 ===
1501-511 Compilation failed for file foo.F.

% cat Ffoo.f
#
# /* this is a comment */
x = 1
do 11 i = 1,3
write(6,*) x
11 continue
stop
end

The solution is to compile as follows:

% xlf -d -WF,-qlanglvl=stdc89 foo.F

so that the C89 version of the C preprocessor is used, rather than the default extended mode preprocessor.

15. The operating system on BGL supports only a single process per core, which means that system functions such as fork, exec, and system are not supported. Although system() is not itself a system call, it relies on fork() and exec() via glibc, so it is unavailable as well.

BGL does not provide the same support for gethostname() and getlogin() as Linux provides.

Calls to usleep() are not supported.

A list of supported and unsupported calls can be found in the file syscalls, located in /usr/local/docs on BGL.

Top

13. Help

LC Hotline (8:00 a.m.–noon, 1:00–4:45 p.m. Pacific time, Monday–Friday). Telephone: (925) 422-4531; e-mail: lc-hotline@llnl.gov

LC Operations (24 hours/day, 7 days/week support). Telephone: (925) 422-0484

E-mail Reflector Lists

bgl-status@lists.llnl.gov Only the Hotline and a restricted number of system administrators can post here; the list includes everyone who has an account on BGL.
bgl-apps-users@llnl.gov A communication list for the early BGL application developers.

14. Resources

Top


This page last modified on June 2, 2010
For more information, contact lc-hotline@llnl.gov or telephone (925) 422-4531
Page maintained by lc-webers@llnl.gov

UCRL-WEB-229388