ASC Sequoia Request for Proposals

Proposal due date is Thursday, August 21, 2008

The ASC Sequoia RFP has been approved by DOE/NNSA for release and publication on this Web site on July 16, 2008. Interested offerors are advised to discard all preceding draft RFP documents that were published on this Web site prior to July 16, 2008. DOE/NNSA directed changes to the draft RFP, and those changes are reflected in RFP documents now available on this Web site. Interested offerors are advised to base their proposal responses on RFP documents now available on this Web site and any subsequent RFP amendments.

Interested offerors must submit all communication (questions, comments, etc.) about the ASC Sequoia RFP to the LLNS Contract Administrator, Gary Ward, whose contact information is provided below. The ASC Sequoia market discussion phase is complete; therefore, interested offerors are no longer permitted direct communication with the ASC Sequoia technical community except for regular business activities that do not pertain to the ASC Sequoia RFP or as otherwise directed by the LLNS Contract Administrator.

Interested offerors are advised to monitor this Web site for potential ASC Sequoia RFP amendments and other ASC Sequoia RFP information updates. LLNS may notify interested offerors (who have previously contacted LLNS and expressed an interest in the ASC Sequoia RFP) of updated ASC Sequoia RFP information via e-mail; however, LLNS is under no obligation to do so. It is the responsibility of all interested offerors to monitor this Web site for current ASC Sequoia RFP information.

ASC Sequoia Benchmark questions—and only benchmark-related questions—may be submitted via e-mail to

LLNS Contract Administrator
Contact Information

Gary Ward
Supply Chain Management Department
Lawrence Livermore National Laboratory
Phone - (925) 423-5952
E-mail -

ASC Sequoia RFP Components

Electronic formats of ASC Sequoia RFP components are available. Adobe Acrobat Reader, which is free to download, is needed to read the PDF files.

The entire RFP is contained in the following Microsoft Windows ZIP file [zip] - Revised August 4, 2008

RFP Amendment 1 (August 4, 2008)

RFP Amendment 2 (August 6, 2008)

RFP Amendment 3 (August 6, 2008)

Benchmark Codes

Summary of Changes (Pre-RFP Release)

Q & A

August 15, 2008

Q1: MPI Allreduce Latency: mpiBench_Allreduce on the "Functionality tests" page has runs for Message size (bytes) equal to 1, 2 and 4. However, in the output files all timings start from 8. We tried to use -b 0 and -b 1 ("-b <byte> Beginning message size in bytes (default 0)") to change the beginning message size to 1, but with no success. Analysis of the source code showed that it is impossible to get a beginning message size of 1 without source code changes (see lines 1014 and 1043 of the mpi_Bench_Allreduce.c file). So, is it OK to leave the fields for message sizes 1, 2 and 4 empty?
A1: The Allreduce tests are set to use double floating-point datatypes. All output in mpiBench displays bytes (not elements). For reductions then, you can only get byte counts that are some multiple of 8 (the size of one double): 0, 8, 16, 32, ....
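
As an illustration of the answer above, the following sketch lists the byte counts that are reachable when every message holds a whole number of doubles. It is not taken from mpiBench: the function names are hypothetical, and the doubling progression is an assumption inferred from the 0, 8, 16, 32 sequence in the answer.

```python
DOUBLE_BYTES = 8  # size of one double, as stated in the answer above


def reduction_message_sizes(max_bytes):
    """Byte counts for messages of 0, 1, 2, 4, ... doubles, up to max_bytes.

    The doubling of the element count is an assumption for illustration.
    """
    sizes = [0]
    count = 1
    while count * DOUBLE_BYTES <= max_bytes:
        sizes.append(count * DOUBLE_BYTES)
        count *= 2
    return sizes


def is_reachable(nbytes):
    """A byte count is reachable only if it is a whole number of doubles."""
    return nbytes % DOUBLE_BYTES == 0
```

Under these assumptions, message sizes of 1, 2 and 4 bytes never appear because none of them is a multiple of 8.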

Q2: The MPI Charts page has links to a nonexistent file, SequoiaBenchmarkResults_v1.0-cmc.xls.
A2: The path included with the benchmark is a full directory path. The spreadsheet and associated charts and data can be accessed by modifying this path to reflect the location of the data on your local system.

August 12, 2008

Q1: Section 3.4.8, User Task Connectivity API (TR-1), states:
Offeror may provide APIs to establish connectivity for application programs to make use of the CN interconnect for their communications. Use of the API may be restricted so that users must not have the ability to gain access to other jobs' communications.
Is LLNL referring to APIs to allow access to the HSN (High Speed Network) and/or API to receive information (e.g., topology) regarding the network or placement of job on the network?
A1: Section 3.4.8 is a subsection of Section 3.4 "System Resource Management (SRM)." As such, this "User Task Connectivity API" is an API LLNS needs in order to port SLURM to the proposed system. This API will be used by SLURM to wire up the MPI comms for the job being launched by SLURM.

Q2: Both the PEPPI response (System overview) and the SOW (e.g., 2.2 Major System Components) ask for detailed block diagrams, feeds-and-speeds, etc. Does LLNL have a preference for which location contains such details, or can one section refer to the diagrams in another section in order to avoid duplication?
A2: Yes, please avoid duplication. Put the diagrams where they make the most logical sense to improve the narrative.

Q3: Regarding UMTmk, the SequoiaBenchmarkResults_v1.0.xls requires WC Time and CPU Time to be provided for omega(1:3) > 0 and omega(1:3) < 0, but no document describes what omega(1:3) means.
A3: The references to omega have been removed from the latest (v1.1) Sequoia Benchmark Results spreadsheet.

Q4: Regarding Phloem, it's not clear how to extract the required numbers for all MPI subtests ("Functionality Tests" page in SequoiaBenchmarkResults_v1.0.xls) from the benchmark output logs. The procedure isn't described anywhere.
A4: The procedure is:
1) Read relevant SOW sections to get an idea of the information of interest.
2) Read README.sow to identify benchmarks to run and expected configurations.
3) Read benchmark READMEs and examine example run script make targets "commands" and "run" (these make targets, unfortunately, were not well documented).
4) Generate appropriate benchmark output.
5) Examine MPI Data worksheet in SBR workbook.
6) Import appropriate benchmark data into MPI Data worksheet based on data section instructions (e.g., MPI_DATA:A9 contains "Import com results for the process count with the greatest Max Bandwidth") and example format.
7) For the "Functionality Test" worksheet results, the vendor is expected to examine the benchmark results and identify the appropriate value.

Q5: Regarding pynamic, the SOW defines a time comparison between pynamic-pyMPI and pyMPI as the result of the benchmark, but it does not define the form in which this comparison should be presented (should it be a ratio of the timing between these two runs?). The format of the result spreadsheet for pynamic does not fit this requirement either.
A5: Output for the pyMPI run has been added to the latest (v1.1) Sequoia Benchmark Results spreadsheet.

Q6: Regarding pynamic, the SOW does not define how to fill columns G and H (Cold Start Time and Warm Start Time) in the result spreadsheet for pynamic. The benchmark logs don't provide these timings either.
A6: "Cold Start" is the first invocation of Pynamic when the disk buffer cache has not loaded the DLLs. "Warm Start" is a subsequent invocation when the DLLs are in the disk buffer cache. This is described in the latest Pynamic description document on the Sequoia benchmark Web page.
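
The cold/warm distinction above can be sketched as timing two back-to-back invocations of the same command. This is only an illustration: the function names are hypothetical, the command passed in is a stand-in for the actual Pynamic launch line, and the first run is a true cold start only if the DLLs are not already in the disk buffer cache, which this sketch does not enforce.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


def cold_and_warm_times(argv):
    """Time two consecutive invocations of the same command.

    The first value is a cold-start time only if the files the command
    loads were not cached beforehand (an assumption left to the caller).
    """
    cold = time_command(argv)
    warm = time_command(argv)
    return cold, warm


# Stand-in command; replace with the real benchmark launch line.
cold, warm = cold_and_warm_times([sys.executable, "-c", "pass"])
```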

Q7: Regarding pynamic, the SOW does not define the core count that should be used to run the test. We tried 8 and 64 (1 node and 8 nodes with 8 ppn in each case) and got significantly different timings between these two.
A7: The Offeror may provide pynamic scaling runs as they deem appropriate.

Q8: (Reference: Attachment 2, SOW, Page 77, Section, CDTI Efficiency (TR-2)) This requirement states that the latency for basic operations including a memory/register read/write may not exceed 200 microseconds. Could LLNS please clarify at which point the latency of the operations should be measured? The latency could be measured at one of two places: (1) Is it at the lowest level, where the tool daemons are running on the IO/compute node? An example of a low-level operation is the debugger daemon running on the compute node calling ptrace() or using the /proc file system. Another example of a low-level operation is the debugger daemon running on the IO node communicating with the compute node control interface. (2) Is it at the highest level, where the tool front-ends are running on the login node? An example of a high-level operation is the debugger front end running on the login node making a memory/register read/write request to the debugger daemon running on the IO node.
A8: The CDTI latency of operations should be measured from the source of the request to the response receipt of that request. For a user initiated request (e.g., print a variable value), the latency is measured from the time the tool front-end running on the login node calls the CDTI to the point the data is returned to the tool front-end (round trip). For an interrupt servicing request originating on the CN (e.g., conditional watch point), the latency is measured from the time the interrupt occurs on the CN until the ION Daemon processes (not the FE on the LN) the request and the CN receives the response.
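
For the user-initiated case, the round-trip measurement described above might be sketched as follows. The `request` callable is a hypothetical stand-in for whatever front-end call issues the read/write; it is not part of any CDTI interface, and averaging over repeated calls is an assumption made for illustration.

```python
import time

TARGET_US = 200.0  # latency target quoted in the question, in microseconds


def round_trip_us(request, repeats=1000):
    """Mean time in microseconds from issuing request() to its return."""
    start = time.perf_counter()
    for _ in range(repeats):
        request()
    elapsed = time.perf_counter() - start
    return elapsed / repeats * 1e6


def meets_target(latency_us, target_us=TARGET_US):
    """True when the measured latency does not exceed the target."""
    return latency_us <= target_us
```

A caller would pass the real request function to `round_trip_us` and compare the result with `meets_target`.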

August 8, 2008

Q1: In Section 9.4.2 of the Sequoia SOW document, a calculation process is defined to calculate the aggregate sustained performance figure of merit (FOM) for Sequoia. This calculation uses weights defined in Table 9.2 of the SOW to be applied to the achieved application FOMs to obtain the aggregate sustained performance FOM for the system (S). If this calculation method is applied to a system running an application set achieving the Purple FOMs for AMG, IRS, UMT and SPhot (6 copies each), and 20 times the BGL FOM for LAMMPS as reported in Table 9.3, that system would deliver a sustained performance of 20.0e15, the target Sequoia sustained system performance. However, the Sequoia Benchmark Results Spreadsheet version 1.0 posted on the benchmark website uses a different set of weights and a different calculation. In the Spreadsheet, weight factors equal to 1/8 of the weight factors defined in Table 9.2 of the SOW document are used. Please confirm that the correct weight factors to use in the aggregate sustained FOM calculation are the weight factors defined in Table 9.2 of the SOW document.
A1: The correct weight factors for the Sequoia marquee benchmarks are in Table 9.2 of the SOW document. A new version of the Sequoia Benchmark Results spreadsheet will be released shortly to reflect the correct application weight factors.
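
To show why the choice of weight factors matters, the sketch below computes an aggregate FOM as a weighted sum and repeats the calculation with every weight divided by 8. The weights and FOM values are placeholders chosen for illustration; they are not the values in Table 9.2 or Table 9.3 of the SOW, and the function name is hypothetical.

```python
def aggregate_fom(weights, foms):
    """Weighted sum of per-application figures of merit."""
    return sum(weights[app] * foms[app] for app in weights)


# Placeholder inputs only; real weights are in SOW Table 9.2 and real
# FOMs come from the benchmark runs.
example_weights = {"AMG": 2.0, "IRS": 4.0, "UMT": 8.0, "SPhot": 16.0, "LAMMPS": 32.0}
example_foms = {"AMG": 1.0, "IRS": 1.0, "UMT": 1.0, "SPhot": 1.0, "LAMMPS": 1.0}

s_table = aggregate_fom(example_weights, example_foms)

# Same calculation with every weight scaled by 1/8, as in spreadsheet v1.0.
eighth_weights = {app: w / 8 for app, w in example_weights.items()}
s_eighth = aggregate_fom(eighth_weights, example_foms)
```

Because the sum is linear in the weights, scaling all of them by 1/8 scales the aggregate result by 1/8 as well.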

August 6, 2008

Q1: SOW Section 6.4 calls out two on-site people. Sections 8.3.6 and 8.3.10 indirectly indicate three on-site people. How many on-site people are required?
A1: The number of on-site people required is two for the duration of the subcontract. Refer to RFP Amendment 2.

Q2: PEPPI Section 7.2 second paragraph calls for Offeror to price a different software maintenance model than that described in SOW Sections 6.0 (overview) and 6.3 (SW). Which is correct?
A2: SOW Sections 6.0 and 6.3 are correct. Refer to RFP Amendment 2 for a modification to Section 7.2 of the Proposal Evaluation and Proposal Preparation Instructions document, which clarifies that: "Software Maintenance pricing should be based on 7x12 (0800-2000 Pacific time) with one hour response time for all systems proposed starting with system installation through acceptance. From acceptance to five (5.0) years after acceptance Software Maintenance pricing should include an electronic trouble reporting and tracking mechanism available to LLNS 24x7 and periodic software fixes and updates based on LLNS priorities."

August 4, 2008

Q1: Dawn interface to Federated Network. The architecture drawing for ASC Dawn (Figure 1-6 in 02_SequoiaSOW_V06.doc) shows the federated network switch as being 10 Gb Ethernet. Thus the network connections from the ION, LN, and SN are shown as 10 GbE. However, Section 4.5.2 requests IB4X QDR or 100 GbE for the ION connection. Similarly, Section 4.6.3 requests IB4X QDR or 100 GbE for the LN configuration, and Section 4.7.4 requests the same for the SN. Please clarify the type (10 GbE vs other) of network connection requirements for the ION (Section 4.5.2), LN (Section 4.6.3), and SN (Section 4.7.4) to the Federated Switch.
A1: 10 Gb/s Ethernet with 10 GBase-SR Optics is preferred for Dawn.

Q2: Dawn Peak+Sustained. Section 4.1.1 requests system performance of at least M = P + S = 40.0. P + S = 40 is the Sequoia requirement. Please clarify the M = P + S for the Dawn system.
A2: (Revised August 6, 2008) Offeror is correct in its assessment of the M = P + S target requirement for Dawn. After careful analysis, LLNS's Target Requirement for Section 4.1.1 is:
4.1.1. Dawn System Performance (TR-1)
The Dawn system performance may be at least M = P + S = 1.0, where P is the peak of the system as defined in Section 2.1 and S is the weighted figure of merit for five applications and is defined in Section 9.4.2.

Section 9.4.2, Sequoia Execution Requirements, is modified to describe the target Dawn run-time requirements as follows:

For Dawn, Offeror may run one copy of each IDC benchmark (as described in Section 9.3) with 8,192 MPI tasks. Simultaneous with these IDC runs, Offeror may run LAMMPS utilizing the remaining cores in the system with 32,000 atoms/MPI task. The Dawn sustained performance is the aggregate weighted FOM sum of the 4 IDC runs and the LAMMPS run.
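
The task accounting in the paragraph above can be sketched as follows. The total core count is a hypothetical input, the function name is illustrative, and one MPI task per core is an assumption; the SOW should be consulted for the actual mapping.

```python
IDC_RUNS = 4                   # one copy of each of the 4 IDC benchmarks
TASKS_PER_IDC_RUN = 8192       # MPI tasks per IDC run
ATOMS_PER_LAMMPS_TASK = 32000  # LAMMPS problem size per MPI task


def dawn_run_layout(total_cores):
    """Split cores between the IDC runs and LAMMPS.

    Assumes one MPI task per core, which is an illustrative assumption.
    """
    idc_tasks = IDC_RUNS * TASKS_PER_IDC_RUN
    if total_cores < idc_tasks:
        raise ValueError("not enough cores for the four IDC runs")
    lammps_tasks = total_cores - idc_tasks
    return {
        "idc_tasks": idc_tasks,
        "lammps_tasks": lammps_tasks,
        "lammps_atoms": lammps_tasks * ATOMS_PER_LAMMPS_TASK,
    }


# Hypothetical system size, used only to exercise the function.
layout = dawn_run_layout(131072)
```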

July 31, 2008

Q1a: (Reference: Attachment 2, SOW, Page 102, Section 7.0, Facilities Requirements) "An existing facility, portions of the West and East computer floors in the LLNL B453, will be used for siting the Dawn (west end of East floor) and Sequoia (east end of West floor). See Figure 7-1. Today, the B453 building has approximately 2x125’ x 195’ = 47,500 ft2 and 15 MW (7.5 MW for the West floor and 7.5 MW for the East floor) of power for computing systems and peripherals and associated cooling available for this purpose. Prior to the deployment of Sequoia the B453 building will be upgraded to 30 MW total (15.0 MW for the West floor and 15.0 MW for the East Floor). The Purple system will be retired after Dawn is deployed, but before Sequoia is deployed. This will leave approximately 15.0 MW available for Sequoia from the West computer floor. There is approximately 5.0 MW available for Dawn. Facilities modifications to provide the necessary power and cooling for Dawn and Sequoia will need to be accomplished prior to rack delivery. It is therefore essential that Offeror make available to LLNS detailed and accurate (not grossly conservative overestimates) site requirements for the Dawn system at proposal submission time. Less accurate power and cooling estimates for Sequoia at proposal submission (but not grossly conservative overestimates) will be of substantial value as well. LLNS will be responsible for supplying the external elements of the power, cooling, and cable management systems."

System    208V Power   480V Power   Cooling (Tons)   Floor Space
Dawn      3.5 MW       1.5 MW       2,000            9,000 ft2
Sequoia                15.0 MW      6,000            15,000 ft2

Are the 5 MW and 15 MW power figures for Dawn and Sequoia for IT infrastructure only, or should cooling power (pumps, etc.) also be within this envelope?
A1a: Yes. If cooling power (pumps, etc.) is required on the computer floor (for water-cooled solutions), then this power will come from wall panels on the B453 computer room floor. This will take away power that the Sequoia system racks (IT infrastructure) could use.

Q1b: Referring to the last sentence above: "LLNS will be responsible for supplying the external elements of the power, cooling, and cable management systems." If the offeror supplies racks with a closed loop, liquid cooling system (either water or refrigerant), will LLNS also supply heat exchangers and pumps to transfer heat from the rack cooling systems to LLNS-supplied chilled water? At what feedwater rate and temperature can LLNS supply chilled water?
A1b: (Revised August 5, 2008) No. Offeror needs to supply heat exchangers and pumps to transfer heat from the rack cooling systems to LLNS-supplied chilled water. The maximum feedwater rate is nearly 12,000 gpm and is provided at an adjustable setpoint of 41 to 45°F with approximately ±1°F variation at each point. This feedwater supply is shared with the underfloor air handlers, and they are currently using 1,500 gpm to cool air for the 4.8-MW Purple, BG/L, and other capacity Linux clusters (several MW) in B453. Purple will be retired before Sequoia is installed. So, depending on the heat load in B453 requiring cooling from the underfloor air handlers, some amount less than 12,000 gpm will actually be available.

Q2a: (Reference: Attachment 2, SOW, Page 105, Section 7.3, Rack Height and Weight (TR-1) "System racks will not be taller than 84” high (48U) and not place an average weight load of more than 250 lbs/ft2 over the entire footprint of the system, including hot and cold isles. If Offeror proposes a rack configuration that weighs more than 250 lbs/ft2 over the footprint of the rack, then Offeror will indicate how this weight can be redistributed over more area to achieve a load less than 250 lbs/ft2.")
Is the 84" rack height limit a strict requirement for the datacenter itself or would taller racks (98.5"), to be delivered horizontally and brought up in the datacenter itself, be considered, as they would fit in the elevator?
A2a: No, the 84" height limit is not a strict requirement; it is a target (TR-1) requirement. Yes, 98.5" racks would be considered. However, be advised that the ceiling in B453 is just short of 10' (119") and that the airflow in B453 is from under floor through the racks and exits through plenums in the ceiling. Also recall that the racks need to be mounted on IsoBases and that adds about 3". This leaves less than 17.5" of clearance between the top of the rack and the ceiling for airflow. Minimum height required for fire safety is 18", but we can fuzz this about 1" or so.
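
The clearance figure in the answer above follows from simple subtraction, sketched below. The function name is hypothetical, and the 3" IsoBase height is the approximate value quoted in the answer, so the result is an estimate rather than a measured clearance.

```python
CEILING_IN = 119.0     # B453 ceiling height, inches
ISOBASE_IN = 3.0       # approximate IsoBase height from the answer above
FIRE_MINIMUM_IN = 18.0  # minimum clearance quoted for fire safety


def ceiling_clearance_in(rack_height_in):
    """Space left between the top of a mounted rack and the ceiling."""
    return CEILING_IN - ISOBASE_IN - rack_height_in


tall_rack = ceiling_clearance_in(98.5)     # proposed taller rack
target_rack = ceiling_clearance_in(84.0)   # TR-1 target height
shortfall = FIRE_MINIMUM_IN - tall_rack    # gap to the 18" minimum
```

With these inputs a 98.5" rack leaves 17.5", which is 0.5" under the quoted 18" minimum and within the roughly 1" of latitude the answer mentions.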

Q2b: What is the datacenter ceiling height?
A2b: 119”

Q3: We have a general question regarding file systems available to compute nodes (CN), referred to in Sections 2.2.1, I/O Subsystem Architecture (TR-1) pages 42-42, Section 2.5 I/O Node Requirements (TR-1) page 47, and Section 3.1.2, Function Shipping From LWK (TR-1), page 60.
Is direct access to all site-wide Lustre and NFS file systems from each individual CN a requirement, or would job launch staging to the high performance Lustre file system be an acceptable solution?
A3: No. We intend to mount all Lustre and NFS file systems on ION and provide access to them from user applications running on the CN associated with that ION via the LWK function shipping facility. User applications running on the CN access these various and numerous file systems via standard POSIX file system interfaces (e.g., Open, Close, Read, Write, and IOCTL).

Q4: (Reference: Attachment 2, SOW, Page 57, Section 2.9.3, LN & SN High IOPS RAID (TR-2)) Section 2.9.3 mentions that you want 50TB for /tmp and /var/tmp (using the EXT3 File system) on each Login Node. But, EXT3 is currently limited to 16TB max per file system, and 2TB file size max. This means only 16TB each could be used for /tmp and /var/tmp. Is the remaining 18TB planned for something else, or do we need to switch to a file system that can serve up all of the space in just /tmp and /var/tmp?
A4: Another Linux file system such as XFS or EXT4 is a reasonable alternative.
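
The arithmetic behind the question can be sketched as follows. The function name is hypothetical, and treating the 50 TB as split evenly across the two mount points, each a separate file system, is an assumption made to reproduce the questioner's figures.

```python
def usable_capacity_tb(requested_tb, mount_points, max_fs_tb):
    """Return (usable, left_over) in TB for a per-file-system size cap.

    Assumes the requested space is split evenly across mount points.
    """
    per_fs = min(requested_tb / mount_points, max_fs_tb)
    usable = per_fs * mount_points
    return usable, requested_tb - usable


# 50 TB across /tmp and /var/tmp with the 16 TB limit cited in the question.
usable, left_over = usable_capacity_tb(50, 2, 16)
```

Under this assumption 32 TB is usable and 18 TB is left over, matching the question; a file system with a larger size limit removes the cap.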

July 30, 2008

Q1: (Reference: Attachment 2, SOW, Page 27, Section 1.4, ASC Software Development Environment) Since LLNL has site licenses for TotalView and various commercial compilers, is the offeror expected to propose additional licensing pricing to add seats to the current site licenses or would the new systems fall under your current licensing?
A1: We have 15 seats for Intel, 10 seats for PGI, 10 seats for PathScale compilers on the SCF. We can add seats to our existing contracts if Offeror proposes one or more of these compiler sets. If Offeror proposes some other compiler suite, then Offeror should provide 15 seat licenses. We have one seat for TotalView at 8,208-way parallelism, 4 seats for 1,032-way parallelism, and 4 seats for 128-way parallelism (all for x86-64 platforms). We also have TotalView licenses for Purple (Power5) and BlueGene/L (PowerPC), but these will terminate soon. For Dawn and Sequoia, Offeror should provide sufficient token licenses so that we can use one token per compute node in the system plus eight. The Sequoia token licenses can augment the Dawn licenses (i.e., Offeror can count the Dawn tokens in the Sequoia total).

Q2: (Reference: Attachment 2, SOW, Page 102, Section 7.0, Facilities Requirements, Para. 2, 4th line from bottom, "LLNS would prefer system layouts with less than 3’ for HOT isles and less than 4’ for HOT isles.") Should this be corrected to read as follows? "LLNS would prefer system layouts with less than 3’ for HOT isles and less than 4’ for COLD isles."
A2: Yes.

Q3: (Reference: Attachment 2, SOW, Page 102, Section 7.0, Facilities Requirements, "Power will be provided to racks by under floor electrical outlets supplied by LLNS to Offeror's specifications. Circuit breakers are available in wall panels that can be modified to Offeror's specifications. All other cables must be contained in cable trays supplied by LLNS to Offeror's specifications. Straight point-to-point cable runs can NOT be assumed. LLNS will provide floor tile cut to Offeror's specifications.") Does LLNS want the "other cables" below the floor or above the racks?
A3: We have fielded systems in B453 with all cables below the floor and systems with all cables below the floor except for the interconnect cables, which were installed above the racks. We have found that having the interconnect cables installed above the racks makes system debug and maintenance easier to accomplish. Systems with interconnect cables installed above the racks are acceptable if the cables are sufficiently supported (row to row) and can be hidden from view with appropriate rack extensions after the system is accepted.

July 29, 2008

Q1: (Reference: Attachment 1.1: Sample R&D Subcontract (B571534), Page 9, Article 15 - General Provisions, Para. D, Cost Accounting Standards.) This interested Offeror is exempt from Cost Accounting Standards under government contracts. Accordingly, and considering that a firm fixed-price subcontract is contemplated, an exception is considered appropriate. How would such an exception be regarded by LLNS?
A1: If an interested Offeror qualifies for an exemption (as provided by the Federal Acquisition Regulation) and is selected for award, then the resulting R&D subcontract will not include CAS related requirements. Interested Offerors who believe an exemption applies should indicate so in their proposals and explain their rationale. LLNS will consider the rationale and determine if an exemption applies.

Q2: (Reference: Attachment 1.2, Sample Build Subcontract (B563020), Page 9, Article 15 - Special Terms and Conditions for Sequoia, Para. 4.) In conjunction with Question No. 1, this Offeror, in lieu of cost data, including "overhead recovery charges," would agree to disclose the actual price(s) paid to its supplier for memory. How would such an exception be regarded by LLNS?
A2: LLNS' intent is to: (1) know the actual price the selected Offeror will pay its vendor(s) for memory, (2) know the corresponding price for memory the selected Offeror proposes to build into the total fixed price of the subcontract, and (3) ensure that the resulting price for memory and total fixed price of the subcontract are fair and reasonable. If an interested Offeror will not apply any markup or burden or other additional charges to the memory price it proposes to build into the total fixed price of the subcontract, then disclosure of the actual price it will pay its memory vendor(s) is considered sufficient.

Q3: (Reference: Attachment 2, Draft SOW, Page 30, Section 1.6 ASC Sequoia Operations, Para 4: "Hardware maintenance services may be required around the clock, with two hour response time during the hours of 8:00 a.m. through 5:00 p.m., Monday through Friday (excluding Laboratory holidays), and less than four hours response time otherwise.") Does the above maintenance requirement apply to LLNL "1st line" maintenance personnel? Please clarify.
A3: This is mistaken. Ignore it. Section 1.0 is for background information.

Q4a: (Reference: Attachment 2, Draft SOW, Page 92, Section 6.0 Integrated System Features (TR-1), Para. 2, Sentence 6: "Thus, LLNS requires an on-site parts cache of all FRUs and a small system of fully functional hot-spare nodes of each node type.") Is the hot spares system(s)/clusters situated inside or outside the vault?
A4a: Inside the vault.
Q4b: Is the on-site spares cache situated inside or outside the vault?
A4b: Inside the Q area, but outside the vault.

Q5: (Reference #1: Attachment 2, Draft SOW, Page 98, Section 6.2 Hardware Maintenance (TR-1), Para 1, 5th sentence: ..."maintenance personnel must obtain DOE P clearances for repair actions at LLNL and be escorted during repair actions. USA Citizenship for maintenance personnel is highly preferred because it takes at least 30 days to obtain VTR access for foreign nationals." Reference #2: Attachment 2, Draft SOW, Page 33, Section 1.6.1, Para. 2, "In order to provide adequate support and interface back to the selected Offeror’s development and support organization, on-site (i.e., resident at LLNL), Q-cleared personnel are needed." Reference #3: Attachment 2, Draft SOW, Page 102, Section 7.0 Facilities, Para 6, 9th sentence: "All on-site personnel will require to be DOE Q-cleared or Q-clearable. It will be extremely difficult to provide LLNL site access to foreign nationals.") What is the clearance and citizenship requirement for maintenance personnel (on-site and "backline") and what is the clearance requirement for on-site analyst(s)? Please clarify.
A5: On-site folks (system admin and applications support persons) should be capable of obtaining Q clearances. Others need not have clearances. However, US citizenship is preferred.

Q6: (Reference: Attachment 2, Draft SOW, Page 98, Section 6.2 Hardware Maintenance (TR-1), Para. 2, "During the period from the start of system installation through acceptance, Offeror support for hardware will be 12 hour a day, seven days a week (0800-2000 Pacific Time Zone), with one hour response time.") Does the one-hour response time refer to telephone response or on-site response? Please clarify.
A6: On-site response time. It is in the best interest of the partnership to get through acceptance quickly and efficiently.

Q7: (Reference: Attachment 2, Draft SOW, Page 101, 7.0 Facilities Requirements, Para. 5: "On-site space will be provided for personnel and equipment storage.") Does the equipment storage area include space for on-site FRU spares and product shipments, prior to installation?
A7: Yes.

Q8: (Reference: Attachment 3: PEPPI, Page 37 Section 1.42, Section 2: Small Business Subcontracting Plans) As allowed under FAR 52.219-9, this Offeror has a company-wide Master Subcontracting Plan. Accordingly, this Offeror intends to utilize the Master Subcontracting Plan in fulfillment of the D&E Subcontract and, as a part of the subcontract, spell out this Offeror's goals under the subcontract. Is this acceptable in lieu of Attachment 9: Model Small Business Subcontracting Plan?
A8: A company wide small business subcontracting plan may be submitted for the build subcontract. An individual plan (i.e., RFP Attachment 9 Model Small Business Subcontracting Plan) should be submitted for the R&D subcontract. If (based on the nature of the proposed R&D) there are no opportunities for subcontracting with small businesses, then Offeror should indicate so in its proposal.

Q9: (Reference: Attachment 5: General Provisions for Fixed Price Supplies and Services, Pages 5, 6 and 7 of 8 Clauses Incorporated By Reference.) Regardless of contract value, not all of the FAR and DEAR clauses listed are mandatory for the acquisition of commercial products and services under a firm fixed-price subcontract. Considering that this Offeror will be a subcontractor under LLNS' prime contract with DOE, this Offeror's approach, which has been accepted in other proposals with other DOE labs, has been to include an exceptions matrix and provide rationale regarding applicability. How would this approach be regarded by LLNS?
A9: LLNS will not disqualify a proposal from the evaluation process based on proposed exceptions to FAR and DEAR clauses. However, interested Offerors are cautioned to carefully consider and limit proposed exceptions to FAR and DEAR requirements, as this could, upon selection for award, create significant delays and barriers to subcontract award. It will be in the best interest of the partnership to expeditiously conclude contract negotiations by minimizing discussions on unimportant or irrelevant issues. Keep in mind that LLNS considers the build subcontract to be commercial, and the R&D subcontract non-commercial. If an interested Offeror has an alternate opinion, its proposal should indicate so and explain the rationale.

Q10: (Reference: Attachment 6: General Provisions for Commercial Supplies and Services, Pages 5, 6 and 7 of 8 Clauses Incorporated By Reference.) Regardless of contract value, not all of the FAR and DEAR clauses listed are mandatory for the acquisition of commercial products and services under a firm fixed-price subcontract. Considering that this Offeror will be a subcontractor under LLNS' prime contract with DOE, this Offeror's approach, which has been accepted in other proposals with other DOE labs, has been to include an exceptions matrix and provide rationale regarding applicability. How would this approach be regarded by LLNS?
A10: Refer to above Answer No. 9.

Q11: (Reference: Attachment 7, Site Services Requirements, Page 1, B Cleanup: "The Subcontractor shall, at all times, keep the premises and adjoining premises where the work is performed free from accumulations of waste material or rubbish caused by its employees or work of any of its lower-tier subcontractors; and at the completion of the work, the Subcontractor shall remove all rubbish from and about the building and all of its and its lower-tier subcontractor's tools, scaffolding, and surplus materials and shall leave the work area "broom clean" or its equivalent, unless more exactly specified.") Product will ship to LLNL facility using a combination of crates, pallets, boxes and other packaging materials. Where exactly does the vendor dispose of these materials, and what, if any, restrictions apply?
A11: The intent of the above language is that vendors performing work on-site at LLNL must maintain a clean and orderly work place, and clean up after themselves once on-site work is complete. The language does not apply to shipping/packaging material received at LLNL. LLNS is customarily responsible for disposing of shipping/packaging material. With that said, LLNS is receptive to discussions about the selected offeror disposing of shipping/packaging material.

July 28, 2008

Q1: How is the CN aggregate link bandwidth defined?
A1: The peak aggregate link bandwidth is defined in Section 2.3 (third bullet). The delivered aggregate off-node bandwidth target requirement is scaled as 80% of the peak.

Q2: How is the minimum bi-section bandwidth defined?
A2: The peak minimum bisection bandwidth is defined in Section 2.3 (fourth bullet). The delivered minimum bisection bandwidth target requirement is scaled as 80% of the peak.

July 24, 2008

Q1: Should suppliers respond to Section 1.7 relative to the numbers of connections specified in Figure 1.5 and 1.6? What is the significance of those numbers to LLNS?
A1: The material in SOW Section 1 is background information to help potential Offerors understand the overall, high-level ASC Sequoia context and programmatic objectives. It is for informational LLNS->Offeror purposes so that the actual requirements in the SOW have some context and don't appear to come out of thin air. In the PEPPI document, Section 3.1 (page 12) indicates that the Offeror should replace the LLNS background text with Offeror's "System(s) Overview." Details about what should be in the response to SOW Section 1 are in the PEPPI Section 3.1 and subsections (including the always popular "System Architecture Summary Matrix").

Q2: Should suppliers respond to Section 1.7 relative to TOE devices? Is LLNS using TOE devices today in production? Is LLNS using iWarp on its federated 10 GbE network for Dawn?
A2: No. See above. SAN and External network requirements are in SOW Section 2.3 (delivered bandwidth) with options to expand it in 2.12.1 and 2.12.4. In addition, the types of connections are specified in Section 2.5.3 for ION and 2.6.3 for LN and 2.7.4 for SN. The network protocols required to run over these interfaces are in SOW Section 3.1.8, 3.1.9 and 3.1.10.

Q3: In Section 2.6.2, does each LN have to have approximately 55 TB locally mounted disk? Does LLNS envisage that the total /tmp and /var/tmp disk capacity would be on the order of 0.5 PB to 2 PB?
A3: In SOW Section 2.9 we describe the target IO Subsystem Architecture for Sequoia. Diagram 2-1 shows a single high-performance (IOPs) and highly reliable shared pool of RAID disk (attached to the LLNS-provided SAN) to be supplied by Offeror for use on the SN and LN as "locally mounted disk resource." This assumes the Offeror's proposed solution can boot from this remote RAID disk resource. For the LN, the aggregate amount of shared disk that is required is specified in SOW Section 2.6.2. For the SN, the aggregate amount of shared disk that is required is specified in SOW Section 2.7.3. Thus the minimum amount of disk required in the shared RAID pool is the sum of these two aggregate requirements.

July 22, 2008

Q1: For 2.1.6 "Broadcast Delivered Latency," what is meant by ping pong latency on a set of tasks? Is it expected that the broadcast latency for a set of end processes should be comparable to the ping pong latency between any two processes in the same set of tasks?
A1: This is section 2.8.6, not 2.1.6. The intent of this requirement is to target the interconnect to have the capability to do broadcasts as quickly as is reasonably possible. The fastest round trip reasonably possible between the source of the broadcast and any particular receiver of the broadcast is the ping pong latency between those two MPI tasks assuming the broadcast and the ping pong operations are implemented on the same network. Thus, LLNS' requirement for broadcast is cast in terms of the delivered ping pong latency from the source of the broadcast to each receiver of the broadcast.

Q2: For 2.1.10 "Cluster Wide High Resolution Event Sequencing," please clarify what you mean by the global interrupt network.
A2: This is section 2.8.10, not 2.1.10. It is assumed that "all the real-time clocks in the system are synchronized using the 'global interrupt network'." If Offeror provides a different mechanism for global clock synchronization or calls it something other than "global interrupt network," then this requirement applies to that mechanism or that mechanism named something else. For the sake of discussion in a generically worded statement of work, LLNS will generically refer to this capability as the "global interrupt network."

Q3: We have downloaded the new STRIDE benchmark tar file and have read the instructions. We have not found the files script.cache and runit that are referenced in the summary file for STRIDE. Can you please provide?
A3: Two files, runit and stride.cache, were inadvertently omitted from the STRIDE benchmark v1.0. The files have been added to the STRIDE v1.1 tar file, which is now available on the ASC Sequoia Benchmark Codes Web site.

July 17, 2008

Q1: Will LLNL be acquiring the isolation bases, or does the contractor need to plan for this in the bid?
A1: The ISOBASE needs to be part of the Offeror's bid.

July 15, 2008

Q1: In Section 6 you ask to "identify the number of full time maintenance personnel dedicated to servicing the system." Is this the number of LLNL personnel needed to support the systems or an addition of the contractor's engineers needed to be on-site in addition to LLNL personnel?
A1: LLNS assumes you are reading from Section 3.6 of the PEPPI document. This is referring to Offeror's personnel, not LLNS personnel. This could be confusing because LLNS has specified a "self-maintenance" hardware and software maintenance requirement. However, LLNS does require two on-site analysts in Section 6.4 of the SOW. Also, LLNS requires in SOW Section 6.2, "During the period from the start of system installation through acceptance, Offeror support for hardware will be 12 hours a day, seven days a week (0800-2000 Pacific Time Zone), with one hour response time." Therefore, on-site personnel are necessary prior to acceptance. So during different periods in the lifetime of the system, different levels of on-site support are specified (as targets). Offerors may choose to propose an alternative maintenance policy and on-site staffing levels. The request in the PEPPI is for completeness in the proposal response to make sure LLNS understands the proposed level of support and the people power required to execute it.

July 10, 2008

Q1: The LAMMPS Web site no longer contains the LAMMPS version requested in the LAMMPS README. The README states that we should use 22 June 2007. The LAMMPS Web site is continually updated, and the current version is 21 May 2008. Can you provide the 22 June 2007 version, or should we use the 21 May 2008 version?
A1: The LAMMPS 22 June 2007 version has been posted to the Sequoia Benchmarks Web site at

July 1, 2008

Q1: First of all, I'd like to correlate each of the items in RFP/SOW Section 2.8 with the test code. Can you fill in this table so that we may start benchmarking the tests targeted for the Sequoia RFP/SOW requirements.
A1: The mapping from SOW items to benchmarks is supplied in the phloem/README.sow file. The README.sow does not mention torustest in the 2.8.4 requirement section. LLNL will correct this.

Sequoia RFP/SOW Item | Test Code and Invocation/Input Parameters
2.8.1 Messaging Rate | sqmr
2.8.2 Delivery Latency | Presta com
2.8.3 Off Node Aggregate Delivered Bandwidth | linktest
2.8.4 MPI Task Placement Delivered Bandwidth Variation | torustest
2.8.5 Delivered Minimum Bi-Section Bandwidth | Presta com
2.8.6 Broadcast Delivered Latency | mpibench
2.8.7 AllReduce Delivered Latency | mpibench
2.8.8 Hardware Bit Error Rate | N/A
2.8.9 Global Barriers Network Delivered Latency | mpibench
2.8.10 Cluster Wide High Resolution Event Sequencing | N/A
2.8.11 Security | N/A

Q2: We understand that the ground rules for providing results from running the Sequoia Benchmarks prohibit any modification to the source. However, in the interest of obtaining very accurate results to help satisfy RFP/SOW performance targets, we have reviewed the Phloem benchmarks and have made the following observations:

Q2a: All tests had a common problem: every benchmark stores the return value of getopt() in a char ch. As a result, the tests never actually ran; we got the USAGE message every time. The return value is an int, so the codes should change char ch to int ch for the tests to work. In most cases, the tests should also consider the physical coordinates of a root node or neighboring nodes and should not rely solely on the logical rank from MPI_Comm_rank().
A2a: Yes, this affects the linktest and torustest benchmarks. LLNL can make this change.
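
One likely explanation for the behavior reported in Q2a, assuming a platform where plain char is unsigned: getopt() signals the end of the options by returning -1, and storing that int in an unsigned 8-bit char turns it into 255, which never compares equal to -1. The Python sketch below only simulates that C conversion with ctypes; it is not benchmark code, and the helper name is hypothetical.

```python
import ctypes

GETOPT_DONE = -1  # getopt() returns -1 when no options remain


def stored_value(return_value, c_type):
    """Simulate assigning a C int return value to a variable of c_type."""
    return c_type(return_value).value


# "char ch" on a platform where char is unsigned (assumption for this sketch).
as_unsigned_char = stored_value(GETOPT_DONE, ctypes.c_ubyte)
# "int ch", the change requested in Q2a.
as_int = stored_value(GETOPT_DONE, ctypes.c_int)

# With the unsigned char, the end-of-options test can never succeed.
loop_can_terminate_with_char = (as_unsigned_char == GETOPT_DONE)
loop_can_terminate_with_int = (as_int == GETOPT_DONE)
```

Ordinary option characters such as 't' fit in either type, so only the end-of-options sentinel is affected in this model.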

Q2b: Since the primary purpose of these tests is to measure the physical limits and performance of the system network hardware, the tests should be written in a way that considers the network details and topologies. Yet all tests are generic and unaware of the network topology, so the logical to physical mapping plays an important role here. Unfortunately, the way the tests are written will cause performance to be estimated incorrectly (overestimated or underestimated).
A2b: It is the vendor's responsibility to handle the logical to physical mapping (e.g., using a node mapping file) and to demonstrate that the mapping they choose corresponds to the rules of the benchmark. Note that some of the benchmarks are designed to measure both the best and worst case performance scenarios.

Q2c: LinkTest: This is a test that measures max utilization of all links of a single compute node (CN). It measures the aggregate bandwidth from a root CN of num_cores tasks to a set of neighboring ranks such that each neighbor rank resides on a single other node. Problematic Issues:

Q2c1: The test seems to be using a logical root node, which is the middle rank of NP. Based on that, the intra-neighborhood (within the node) and the inter-neighborhood are defined. Unfortunately, this won't work correctly because the logical ranks are used to define the neighborhood. As a result, one would think that the tasks assigned to the root node's cores reside on the actual root node, yet they may be on a physically different node due to the default mapping of a given platform.
A2c1: It is the vendor's responsibility to handle the logical to physical mapping and to demonstrate that the mapping they choose corresponds to the rules of the benchmark.
Q2c2: The measurement is not pure as it includes the latency for the Barrier operation. The test should consider timing only the communication operations to measure the bandwidth.
A2c2: LLNL made a conscious decision to include the barriers because that ensures the measurement captures any failure to overlap all of the communication (either way can be inaccurate, but this way is our preference). See Accurately Measuring MPI Broadcasts in a Computational Grid [PDF].
Q2c3: The test hangs with option -t 4 -T 4 -n 1 -N1 (which means, run with four tasks on root CN and four neighbors).
A2c3: We tested the code, and this configuration works on LLNL machines. However, we have found a bug in the tests during the result reporting phase, which we will correct. If this is not the problem you are seeing, please run the code with these parameters plus -v and send us the corresponding output file.

Q2d: TorusTest: This test measures the maximal utilization of all links in and out of all CN nodes in a partition for stencil communication patterns. In this benchmark, each rank talks to 26 other ranks. This is like a "gossiping" communication pattern. The test provides a tool to generate the 26 neighbors per task. Problematic Issues:

Q2d1: I think the way the neighbors are generated can cause many unbalanced hot spots in the communication pattern. In other words, some slave ranks appear more frequently as neighbors than others. As more ranks attempt to gossip with such slaves, these slaves would cause node contention, which will spoil the measurement. Remember, we are measuring link utilization here, not node utilization.
Suggestions: I suggest using either a random neighbor list per task or some other method that results in a fair assignment of slave ranks in the gossiping communication.
A2d1: The intention of this benchmark is to measure the best and worst case performance (i.e., variance and topology dependence) of the system. Again, each benchmark has a configuration file to handle mapping. It is the vendor's responsibility to handle the logical to physical mapping and to demonstrate that the mapping they choose corresponds to the rules of the benchmark.
The included test generation tool produces a configuration file for a 3D torus communication in which each node communicates with its 26 neighbors. When used together with an optimal embedding of the 3D torus into the machine's network topology (using a vendor provided mapping scheme) this is intended to provide the best possible configuration. If not, the vendor is free to choose different communication partners following the rules stated in the SOW. However, a random selection of neighbors will not match these rules.
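
To make the 26-neighbor pattern described above concrete, here is a minimal Python sketch that enumerates the neighbors of each coordinate in a logical 3D torus with wraparound. It is not the generation tool shipped with the benchmark; the function names and the 4 x 4 x 4 example size are hypothetical, and the embedding of logical coordinates onto physical nodes remains the vendor's responsibility.

```python
from collections import Counter
from itertools import product


def torus_neighbors(coord, dims):
    """Return the 26 neighbors of coord in a 3D torus of size dims.

    Assumes every dimension is at least 3; smaller dimensions make
    some wrapped neighbors coincide.
    """
    x, y, z = coord
    nx, ny, nz = dims
    neighbors = []
    for dx, dy, dz in product((-1, 0, 1), repeat=3):
        if (dx, dy, dz) == (0, 0, 0):
            continue  # skip the node itself
        neighbors.append(((x + dx) % nx, (y + dy) % ny, (z + dz) % nz))
    return neighbors


def neighbor_counts(dims):
    """Count how often each coordinate appears in some neighbor list."""
    counts = Counter()
    for coord in product(*(range(n) for n in dims)):
        counts.update(torus_neighbors(coord, dims))
    return counts


example_dims = (4, 4, 4)  # hypothetical partition shape
example = torus_neighbors((0, 0, 0), example_dims)
```

In this logical pattern every coordinate appears in the same number of neighbor lists, so any imbalance seen in practice would come from how logical coordinates are mapped onto the machine.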

Q2e: MPIBench: This tests the performance of collectives: barrier, bcast and allreduce. Problematic Issues:

Q2e1: The Bcast test includes the barrier in the measurement, which can spoil the result, especially for small-message latency.
Suggestions: Consider removing the barrier from the measurement and use max time.
A2e1: Same answer as A2c2. LLNL made a conscious decision to include the barriers because that ensures the measurement captures any failure to overlap all of the communication (either way can be inaccurate, but this way is our preference). See Accurately Measuring MPI Broadcasts in a Computational Grid [PDF].

Q2f: MPIGraph: This tests the health and scalability of a high-performance interconnect while subjecting it to a heavy load. This is useful to detect hardware and software problems in a system, such as slow nodes, links, switches, or contention in switch routing. The test uses a logical ring pattern on NP ranks (0 -> 1 -> 2 -> 3 ... -> NP - 1 -> 0). In each of the NP - 1 steps, each rank sends to dst = rank + distance and receives from src = rank - distance (such that 1 <= distance <= NP - 1). Problematic Issues:

Q2f1: The construction of the logical ring is topology-unaware and can result in many messages occupying a single channel (link) leading to extreme and unfair congestion. That is, the test thinks it is performing a communication pattern such that two messages are traversing a given link in a given direction whereas in reality, four or even more are occupying that link in that direction.
Suggestions: Consider the torus topology when constructing the logical ring.
A2f1: Again, we really do provide the flexibility for the user to adjust the mapping. It is the vendor's responsibility to handle the logical to physical mapping and to demonstrate that the mapping they choose corresponds to the rules of the benchmark.
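
The logical ring pattern described in Q2f can be sketched as follows. This is an illustration in Python, not the MPIGraph source; the function name is hypothetical, and the modulo wraparound on dst and src is an assumption implied by the ring (0 -> 1 -> ... -> NP - 1 -> 0).

```python
def ring_schedule(num_ranks):
    """Return one entry per step: (distance, [(rank, dst, src), ...]).

    There are NP - 1 steps, with distance running from 1 to NP - 1.
    """
    schedule = []
    for distance in range(1, num_ranks):
        step = [
            (rank, (rank + distance) % num_ranks, (rank - distance) % num_ranks)
            for rank in range(num_ranks)
        ]
        schedule.append((distance, step))
    return schedule


# Hypothetical example with NP = 4 ranks.
example_schedule = ring_schedule(4)
```

In each step every rank is the destination of exactly one send. How many of those messages share a physical link depends on the logical to physical mapping, which is the point raised in Q2f1 and A2f1.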

Q2g: SQMR: This tests the messaging rate of CN. Problematic Issues:

Q2g1: Because it does not consider the physical topology, the coordinates of ranks, or the mapping, the test constructs a CORE_COMM communicator on the assumption that all cores physically reside on the reference compute node performing the messaging rate test. This, unfortunately, results in overestimation of the messaging rate.
A2g1: It is the vendor's responsibility to handle the logical to physical mapping and to demonstrate that the mapping they choose corresponds to the rules of the benchmark.
Q2g2: There is a bug in the benchmark when reporting the performance numbers. The sum of latencies is divided twice by num_cores, which gives unrealistically good numbers.
A2g2: This is a minor bug for a result ("time": average measurement time) that is not of primary interest. The "time" output column is indeed divided twice by num_cores, but the average messaging rate, calculated from the "sum" value, is not. LLNL will remove the second division by num_cores of the time value to address this.
Q2g3: The test repeats the measurement num_iter times for a given message size. However, num_iter is not fixed; it starts at, say, 100 and is divided down to 2. For large messages, running only two iterations was giving unrealistic results.
Suggestions: (1) Keep the num_iter fixed for all messages. (2) Consider the torus topology when constructing the CORE_COMM communicator and neighbors.
A2g3: The default number of iterations for the benchmark starts with 4096 iterations for a 0B message size and runs up to 26 iterations for 4MB messages. If the user changes the number of iterations from the command line, they may encounter message rates for large messages that are lower than what would be measured with a larger number of iterations. It is the user's responsibility to provide default benchmark results, as well as any improved results generated with parameter modifications.
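
The reporting bug acknowledged in A2g2 can be illustrated with a small Python sketch. The function names and sample values are hypothetical and do not come from the SQMR source; the sketch only shows the effect of dividing a summed time by num_cores twice instead of once.

```python
def reported_time_buggy(sum_of_times, num_cores):
    """Average time as described in Q2g2: divided by num_cores twice."""
    return sum_of_times / num_cores / num_cores


def reported_time_fixed(sum_of_times, num_cores):
    """Average time with the second division removed."""
    return sum_of_times / num_cores


# Hypothetical values: four cores whose measured times sum to 8.0 units.
buggy = reported_time_buggy(8.0, 4)
fixed = reported_time_fixed(8.0, 4)
```

The buggy value is smaller than the true average by a factor of num_cores, which matches the overly favorable "time" column described in Q2g2.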

June 17, 2008

Q1: Section 2.3, p. 43, Paragraph 1, states, "Node Interconnect Aggregate Link Bandwidth computation: Intra-cluster network link bandwidth is peak speed at which user data can be moved bi-directionally to/from a compute node over a single active network link. It is calculated by taking the MHz rating of the link times the width in bytes of that link minus the overhead associated with link error protection and addressing. The node interconnect aggregate link bandwidth is the sum over all active compute node links in the system of the node interconnect link bandwidths. Passive standby network interfaces and links for failover may not be counted." Is the "active network link" synonymous with "active compute node link"?
A1: Yes.

Q2: Does LLNL really mean the sum over all links in the system, or the sum over links in a node? Thus, would the B:F number be calculated by taking the bi-directional BW in and out of each link to the node, summed over all links to that node, and then divided by the FLOPS in the node? For example, for a homogeneous system with dual-channel IB coming into every compute node and all compute nodes being dual-socket nodes, the vendor calculation would be: Node Inter. Agg. Link BW B:F = (sum of achievable BW over both IB channels)/(sum of flops of both sockets on the node).
A2: This text is attempting to define the network bandwidth in+out of a compute node and then sum that over all compute nodes. That sum is the B. F is the peak of all the compute nodes. If the system is homogeneous, this B:F is the same as a single node's link bandwidth divided by that node's F.
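
The calculation described in the SOW text quoted in Q1 and in the answer above can be sketched in Python as follows. All function names and numbers are hypothetical, and modeling the protection and addressing overhead as a simple fraction of the raw link rate is an assumption made only for this illustration.

```python
def link_bandwidth_mb_s(clock_mhz, width_bytes, overhead_fraction):
    """Peak user-data bandwidth of one active link, in MB/s.

    MHz rating times width in bytes, less the overhead
    (modeled here as a fraction of the raw rate).
    """
    return clock_mhz * width_bytes * (1.0 - overhead_fraction)


def bytes_to_flops(nodes):
    """B:F for a system given as a list of (active_link_bandwidths, peak_flops).

    B sums the bandwidth of every active link on every compute node;
    standby or failover links must be left out of the lists.
    F sums the peak FLOP/s of every compute node.
    """
    total_bandwidth = sum(sum(links) for links, _ in nodes)
    total_flops = sum(flops for _, flops in nodes)
    return total_bandwidth / total_flops


# Hypothetical homogeneous system: two active links per node, 8 nodes.
link = link_bandwidth_mb_s(1000.0, 4, 0.2) * 1.0e6  # bytes per second
node = ([link, link], 1.0e11)                       # 100 GFLOP/s peak per node
single_node_ratio = bytes_to_flops([node])
system_ratio = bytes_to_flops([node] * 8)
```

For a homogeneous system the two ratios agree, which is the point made in the answer above.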

Q3: LLNL plans to do the first level of maintenance. So the supplier would do a second level and provide software licensing. Is this to be included in the budget of $214.5M?
A3: Hardware and software maintenance is included in the budget of $214.5M for five years. Software licensing is also included in this budget. We expect the supplier to provide a spare parts cache and RMA service and to back us up when we have major problems. Also, the supplier is responsible for system maintenance between delivery and acceptance.

Q4: The IRS worksheet in the Sequoia benchmark results spreadsheet on your Web page has a place to put in results for a "Sequential" IRS benchmark result. We are not sure what that IRS run should be. We do not see any references to a sequential test in the IRS benchmark instructions. Is it a single core test? I have attached a PDF of the IRS worksheet within the spreadsheet for your reference.
A4: The spreadsheet entry is for a sequential, single processor run. It should be noted in the instructions as well. LLNS will update the instructions.
