While the term co-design as used in the context of DOE exascale efforts is relatively new, the concept is not. LLNL has used supercomputing to benefit our mission since its founding in 1952, and has worked closely with vendors throughout those 60+ years while fielding numerous first-of-a-kind machines.
A Selection of Historical Examples of Co-design-like Activities at LLNL
The following examples are by no means exhaustive. For details and gaps, readers are encouraged to explore the documents referenced above. This list highlights a few important areas where LLNL has worked closely with the larger community (vendors, academia) on co-design-like activities through the decades, in rough reverse chronological order:
- The IBM BlueGene line of systems started as a concept within IBM to use low-powered cores (and lots of them) on a 3D torus mesh interconnect. It was realized as a product in BlueGene/L, the first of which was delivered to LLNL alongside the ASCI Purple Power5-based system. That BG/L system subsequently spent 3.5 years at the top of the TOP500 list, displacing the Japanese Earth Simulator and demonstrating effective scaling of MPI on over 100,000 processors for the first time. The designs of BlueGene/L and its descendants, BlueGene/P and BlueGene/Q, were greatly influenced by input from both LLNL and Argonne National Laboratory throughout their design cycles.
- The ASCI Purple system design was informed by close interactions between the code developers and IBM staff to understand LLNL’s needs in terms of memory access, memory size, and CPU features for best performance of our applications. For example, the Power5 CPU on ASCI Purple introduced a new instruction to perform faster divides after LLNL proxy applications of the time indicated how much benefit this could provide over the Power4 CPU. This resulted in a 1.8x performance improvement in key applications with only a 1.25x increase in clock speed.
- In the early 2000s, LLNL spearheaded the use of Linux clusters at scales typically reserved for custom proprietary solutions. Some of LLNL’s early work in this space included co-design of parallel file systems and system software. The MCR cluster entered the TOP500 list at #5 in November 2002, and Linux clusters have dominated the list ever since.
- Co-design of open source solutions around commodity Linux clusters continues to this day, with the Tri-lab Linux Capacity Cluster (TLCC2) and Tripod Operating System Software (TOSS) efforts supplying stable, cost-effective computing for all but the most extreme calculations across the NNSA complex. LLNL works closely with Red Hat on the operating system, as well as with Whamcloud¹ (Lustre), the OpenFabrics Alliance (InfiniBand), and SchedMD (Slurm) on future designs of HPC system software.
- The TotalView debugger contains a number of features developed at LLNL in partnership with the various vendors who have owned TotalView through the years. Early in TotalView’s lifetime, LLNL negotiated a source code license to ensure important features were developed and deployed. In exchange, TotalView gained the insight of LLNL’s experience on some of the largest platforms in the world.
- LTSS (Livermore Time Sharing System) was a complete operating system for the early Cray machines that was developed at LLNL. Cray adopted LTSS for deployment of its systems outside of LLNL until it developed its own COS and UNICOS operating systems and environment.
- The Livermore Fortran Kernels (more commonly known as the “Livermore Loops”) were developed in the late 1970s and 1980s. They were one of the original proxy apps, and were used by virtually every vendor of the day to gain insight into how DOE scientists developed software and to assess the performance of their own vector processors, the dominant supercomputer processor design of the day.
- LLNL developed stacklib, an application-relevant library of common vector operations that took advantage of the hardware instruction cache to optimize performance. It used the hardware on early CDC and Cray computers so effectively that it became a de facto standard library, distributed with those machines for years.
- Digital Equipment Corporation PDP-6 systems that were used to run the LLNL storage archive had LLNL-developed hardware modifications that DEC formally adopted in the PDP-10 system released in 1966.
- LLNL was an early and strong supporter of open standards, from the first Fortran standards committee through Fortran 90, MPI, and OpenMP. We’ve long insisted that working toward standards in concert with the vendor community is best for the growth and innovation of the entire community.
Other examples of vendor relationships include:
- HPSS is an industry-standard scalable archival system that started out as a CRADA between LLNL and IBM, eventually expanding to include many other DOE labs and leading to successful deployment at sites worldwide.
- The TMDS was an early graphical terminal whose design was largely influenced by LLNL users. In particular, George Michael was very involved in some of those vendor interactions.
- Other detailed examples of early LLNL system software development and hardware interactions that impacted vendor systems are too numerous to list, but include the introduction of high-level languages for application and system software development, timesharing systems for application development, innovations such as checkpoint/restart, early compiler loop and code optimizations, and online access for applications to archival files using the IBM Photostore.
Detailed Historical Perspectives
George Michael, a senior manager in LLNL Computations and one of the founding fathers of supercomputing, captured the long and rich history he helped make happen in an article titled “An Oral and Pictorial History of Large Scale Scientific Computing as it Occurred at Lawrence Livermore National Laboratory.” He retired from LLNL in 1991 and passed away in 2008. Anyone interested in the history of supercomputing as seen through the eyes of Lawrence Livermore should spend time reading that article and the other largely interview-based articles at www.computer-history.info.
William (Bill) Lokke, an LLNL physicist and winner of the 1975 E.O. Lawrence Award for “… original and creative computer calculations of nuclear weapon outputs and the development of methods of calculating radiation opacities…”, wrote a retrospective on the early history of computing at LLNL covering 1952 through the 1960s, outlining the deep co-design collaborations that existed between the labs and computing vendors in that era.
¹ Whamcloud is now the High Performance Data Division at Intel.