

## **RED-SEA overview**

Pedro J. García & Jesús Escudero-Sahuquillo (UCLM)



This project has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 955776. The JU receives support from the European Union's Horizon 2020 research and innovation programme and France, Greece, Germany, Spain, Italy, Switzerland.







### We are one of the "SEA" projects

3 complementary projects addressing Exascale challenges in a Modular Supercomputing Architecture (MSA) context

- In line with several HW/SW Exascale projects funded under previous European programmes
- Funded by the EuroHPC 2019-1 call focused on SW and applications
  - The EuroHPC Joint Undertaking targets Exascale computers in Europe in 2023-24
  - These should contain as many European components as possible
- Coordinated with other on-going European projects, particularly the European Processor Initiative

DEEP-SEA: DEEP Software for Exascale Architectures

- Better manage and program compute and memory heterogeneity
- Targets easier programming for Modular Supercomputers
- Continuation of the DEEP projects series

IO-SEA: Input/Output Software for Exascale Architectures

- Improve I/O and data management in large-scale systems
- Builds upon results of SAGE1-2 projects and MAESTRO

Munich, 16/01/2024

RED-SEA: Network Solution for Exascale Architectures



- Develop European network solution
- Focus on BXI (Bull eXascale Interconnect)



### **RED-SEA** motivation

- At Exascale, the **interconnect can become the bottleneck** 
  - Number of components and their heterogeneity is increasing, requirements are diverse
- Crucial aspects for the network:
  - Scalability, reliability: scale beyond 100K nodes while preserving key performance and reliability
  - Sustainability, HPC/datacenter convergence
    - integrate Internet Protocol (IP), Ethernet and RoCE (RDMA over Converged Ethernet) traffic over the HPC interconnect, at low latency and high message rates
  - **Throughput & bandwidth**: ×4 BW and message rate for each endpoint of the network
    - ×2 link frequency (up to 200 Gb/s) and ×2 network interfaces per process (multi-rail)
  - Congestion control, quality of service, isolation, protection, sharing: partition existing HPC system into multiple (private) clouds
  - Programmability, latency: configure the network offload engine, enable compute-in-network, better latency and energy efficiency.
- Overall goal: extend and optimize BXI interconnect for Exascale
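
The ×4 endpoint target follows from multiplying the two ×2 factors listed above. The sketch below spells out that arithmetic; the 100 Gb/s single-rail baseline is an assumption inferred from "×2 link frequency (up to 200 Gb/s)", not a figure stated on the slide.

```python
# Baseline values are ASSUMPTIONS inferred from the "x2" factors on the slide.
BASE_LINK_GBPS = 100  # assumed: half of the 200 Gb/s target link rate
BASE_RAILS = 1        # assumed: single-rail baseline endpoint


def endpoint_bandwidth_gbps(link_gbps, rails):
    """Aggregate injection bandwidth of one endpoint that uses `rails` NICs."""
    return link_gbps * rails


baseline = endpoint_bandwidth_gbps(BASE_LINK_GBPS, BASE_RAILS)
target = endpoint_bandwidth_gbps(2 * BASE_LINK_GBPS, 2 * BASE_RAILS)
speedup = target / baseline

print(f"baseline: {baseline} Gb/s, target: {target} Gb/s, gain: x{speedup:g}")
```

The same product applies to message rate only if each rail sustains the per-link rate independently, which is an idealisation.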



### **RED-SEA objectives**








### The four pillars of RED-SEA research

| Pillar | Focus | Partner |
|--------|-------|---------|
| Architecture, co-design and performance | Optimizing the fit with the other EuroHPC projects and with the EPI processors | Istituto Nazionale di Fisica Nucleare (INFN) |
| High-performance Ethernet | Development of a high-performance, low-latency, seamless bridge with Ethernet | Atos |
| Efficient network resource management | Including congestion management and Quality-of-Service targets while sharing the platform across applications and users | Universitat Politècnica de València |
| Endpoint functions and reliability | End-to-end enhancements to network services, from programming models to reliability & security and to in-network compute | FORTH |





BXI is the HPC fabric, consisting of two discrete hardware components (a BXI NIC and a BXI switch) plus the BXI fabric manager.



HPC is part of the continuum of computing workflows

### **RED-SEA: methodology for Co-Design Activity**

#### Application portfolio

- NEST: simulator for spiking neural network models
- LAMMPS: molecular dynamics engine with a focus on materials modelling
- SOM: Self-Organizing Maps, artificial neural networks used in the context of unsupervised ML

#### Benchmark portfolio

- GSAS: Global Shared Address Space environment, which provides a shared-memory abstraction model to distributed applications
- DAW: stresses the NI capabilities at scale and the QoS capabilities of the interconnect
- LinkTest: scalable benchmark for point-to-point communications
- PCVS: validation engine designed to evaluate the offloading capabilities of high-speed networks

#### Collection and Analysis of MPI Network Traces generated by applications

- VEF traces + DIBONA (12 nodes, 768 ARM cores, BXI interconnect)
- Requirements for the applications and co-design recommendations
- Simulator as reference to support the design and implementation of novel IPs proposed in the project
  - Network traces feed the project simulators
  - Extrapolation of the behaviour at large scales (up to 100K nodes)
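
As a rough picture of what the simulators consume, the sketch below aggregates point-to-point messages into a traffic matrix. The record layout (source rank, destination rank, bytes) and the sample values are hypothetical and chosen only for illustration; this is not the VEF trace format.

```python
from collections import defaultdict

# Hypothetical point-to-point records: (source_rank, destination_rank, bytes).
# NOT the VEF format; it only illustrates the kind of data a network trace carries.
records = [(0, 1, 4096), (0, 1, 4096), (1, 2, 1024), (2, 0, 512), (0, 2, 2048)]


def traffic_matrix(records):
    """Sum the bytes exchanged for every (source, destination) pair."""
    matrix = defaultdict(int)
    for src, dst, nbytes in records:
        matrix[(src, dst)] += nbytes
    return dict(matrix)


def heaviest_pair(matrix):
    """Return the ((source, destination), bytes) entry carrying the most traffic."""
    return max(matrix.items(), key=lambda item: item[1])


matrix = traffic_matrix(records)
print(matrix)
print(heaviest_pair(matrix))
```

A matrix like this is one simple way to spot hot spots before replaying the same traffic on a larger simulated topology.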









Final 3-SEA-projects workshop: RED-SEA overview


### **RED-SEA: Hardware Testbeds**

| Testbed | Features | Outcome | Availability date | Remote access |
|---------|----------|---------|-------------------|---------------|
| DIBONA | 4 blades; 768 Arm v8 cores (12 nodes)<br>OS: RHEL 8.4<br>Memory: 256 GB per node (16 x 16 GB DDR4 @ 2666 MT/s) | <ul><li>Analysis of BXI 1.3</li><li>Network traces of applications</li><li>Benchmarks</li></ul> | 16 November 2021 | YES |
| DEEPcluster | 2 CN + BXI switch | T1.2: ParTec | Q4 2021 | YES |
| ExaNeSt | 64 Arm cores; 16 QFDB; 4 mezzanines | Prototype of FORTH RDMA + congestion management | Q4 2021 | NO |
| INFN-dev | Alveo boards (U50; U200; U280), PCIe Gen3/Gen4<br>I/O: 100 Gb/s (APElink; BXI-link)<br>ExaNet protocol compliant | <ul><li>Prototype of APEnetX</li><li>Debug &amp; development of INFN WP3 and WP4 IPs</li></ul> | Q3 2021<br>APEnet v6 (0.1): Q4 2022 | NO |
| TGCC KNL | 828 nodes (276 blades)<br>Intel(R) Xeon Phi(TM) CPU 7250<br>96 GB of memory (6 x 16 GB) + 16 GB MCDRAM<br>OS: RHEL 7.9; interconnect: BXI v1.2 | VEF traces / BXI traces | Now (only to CEA partner and subject to quota availability) | YES (up to 14/11/22) |
| INTI-BXI | nodes (AMD Rome); 2 x 64 cores/node<br>Memory: 240 GB/node<br>4 BXI NICs/node | WP4 – T4.5<br>multirail | Q1 2022 | NO (only to CEA) |





### **RED-SEA: Simulators (I)**

| Simulator<br>(partner) | Features | Tasks involved |
|------------------------|----------|----------------|
| COSSIM<br>(EXAPSYS) | <ul><li>Current:<ul><li>Processing in ARM, RISC-V (work-in-progress/eProcessor), Intel (deprecated)</li><li>Network topologies, routing algorithms, switches, etc. are those supported by OMNeT++</li><li>The main change to OMNeT++ concerns the INET packages, which have been adapted to support full IP, Linux-compatible packets (e.g. including payload)</li></ul></li><li>RED-SEA:<ul><li>NIC architectural model, with several implementation details needed</li><li>Interconnection scheme of CPU with NIC</li></ul></li></ul> | <ul><li>T1.4:<ul><li>MPI packets generated in COSSIM can be integrated in SAURON (VEF traces)</li><li>Identify whether COSSIM can be connected to SAURON instead of plain OMNeT++</li><li>From WP2, get a NIC design compatible with gem5</li></ul></li></ul> |
| SAURON<br>(UCLM) | <ul><li>Current:<ul><li><u>Network topologies</u>: Fat-trees, Dragonflies, Slim-flies, KNS, etc.</li><li><u>Routing algorithms</u>: deterministic (D-mod-K, DESTRO), oblivious (VLB), and adaptive (PAR, UGAL, Fully, ARNs, etc.)</li><li><u>Switch buffer organizations</u> (input-queued, virtual output queues, etc.)</li><li>Congestion management and QoS models</li><li>Compatible with the VEF Traces Framework</li></ul></li><li>RED-SEA:<ul><li>BXI3 architecture (NIC and switch)</li><li>Protocols designed in WP3 and WP4</li></ul></li></ul> | <ul><li>T1.4:<ul><li>Migration to OMNeT++ 6.0</li><li>Exploring connection with COSSIM</li></ul></li><li>All the tasks in WP3: modeling new network management proposals</li><li>T4.1: modeling e2e protocols</li></ul> |
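
To make the deterministic D-mod-K entry in the table more concrete, here is a minimal sketch of one common textbook formulation on a k-ary n-tree, where the port chosen at switch level `l` is the `l`-th base-k digit of the destination identifier. The node and port numbering, and the function names, are assumptions made for illustration; this is not taken from SAURON's code.

```python
def digit(value, k, position):
    """Base-k digit of `value` at `position` (0 = least significant)."""
    return (value // k**position) % k


def nca_level(src, dst, k):
    """Lowest switch level (0 = leaf switch) whose subtree holds both end nodes."""
    level = 0
    while src // k ** (level + 1) != dst // k ** (level + 1):
        level += 1
    return level


def dmodk_route(src, dst, k):
    """Return (up_ports, down_ports) for a packet from `src` to `dst`.

    Assumed numbering: end nodes are 0..k**n - 1 and every switch has k up
    ports and k down ports numbered 0..k-1.
    """
    top = nca_level(src, dst, k)
    # Ascending: the up port at level l depends only on the destination.
    up_ports = [digit(dst, k, level) for level in range(top)]
    # Descending: follow the destination digits from the top level to the leaf.
    down_ports = [digit(dst, k, level) for level in range(top, -1, -1)]
    return up_ports, down_ports


print(dmodk_route(5, 27, 4))
print(dmodk_route(4, 6, 4))
```

Because the port choice depends only on the destination, all packets for a given end node share one path down from each top-level switch, which is what makes the scheme deterministic.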



### **RED-SEA: Simulators (II)**

| Simulator<br>(partner) | Features | Tasks involved |
|------------------------|----------|----------------|
| DQN_SIM<br>(INFN) | <ul><li>Current:<ul><li>Simulation models developed from scratch using the OMNeT++ 5.4 framework</li><li>N-dim torus topology</li><li>Modelled after the APEnet RDMA network architecture: data-link layer (buffers, virtual channels), network layer (VCT switching; deterministic (DOR), oblivious (random) and adaptive (*ch, DQN-Routing) routing), transport layer (packet definition, network interface)</li><li>Interface between OMNeT++ and the Ray distributed execution framework, used to obtain routing actions from the Deep Q-Network reinforcement learning agent</li></ul></li><li>RED-SEA:<ul><li>Port the models to the SAURON framework in order to assess DQN scalability and performance under realistic traffic conditions</li><li>Study the application of the DQN adaptive routing algorithm to other topologies and/or network architectures</li></ul></li></ul> |  |
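
For readers unfamiliar with the deterministic DOR baseline named in the table, the following is a minimal sketch of dimension-order routing on an N-dimensional torus. The coordinate representation, the tie-break towards the positive direction, and the function name are assumptions for illustration and are not taken from DQN_SIM.

```python
def dor_torus_route(src, dst, dims):
    """List the nodes visited from `src` to `dst`, fixing one dimension at a time.

    `src`, `dst` and `dims` are equal-length tuples; `dims[i]` is the ring size
    of dimension i. Each ring is crossed in its shorter direction, and ties go
    to the positive direction (an assumption of this sketch).
    """
    current = list(src)
    path = [tuple(current)]
    for axis, size in enumerate(dims):
        forward_hops = (dst[axis] - current[axis]) % size
        step = 1 if forward_hops <= size - forward_hops else -1
        while current[axis] != dst[axis]:
            current[axis] = (current[axis] + step) % size
            path.append(tuple(current))
    return path


print(dor_torus_route((0, 0), (3, 1), (4, 4)))
```

In the 4x4 example the first dimension is resolved through the wrap-around link, since one hop backwards is shorter than three hops forwards.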



### **Questions?**


