European Exascale System Interconnect and Storage
Innovation and Technology

The ExaNeSt consortium is pursuing several lines of research that push beyond the current state of the art, yielding technical innovations in interconnects, storage, applications, and packaging/cooling. These innovations are combined in the experimental testbeds that the consortium develops and studies. Highlights per thematic area are listed below.

Interconnects

Address translation/transformation technology:

  • Translation of virtual-to-physical packet-destination addresses in the Network Interface using the System MMU component of ARMv8 SoCs, which can “walk” the user-process page tables. Infiniband-based solutions, by contrast, rely on a table of “registered” memory regions. This technology facilitates zero-copy RDMA data transfers.
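
To illustrate what "walking" a page table means, here is a minimal sketch of a multi-level lookup. It is a toy model, not the System MMU implementation: the 4 KiB page size, the 9-bit index width, the two-level depth, and the use of nested dictionaries as tables are all assumptions made for illustration.

```python
PAGE_SHIFT = 12   # 4 KiB pages (assumed for illustration)
INDEX_BITS = 9    # 512 entries per table level (assumed)
LEVELS = 2        # toy depth; real ARMv8 tables can be deeper


def walk(root, vaddr):
    """Translate a virtual address by walking nested dict 'page tables'.

    Intermediate levels map an index to the next table (a dict);
    the last level maps an index to a physical frame number (an int).
    """
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    vpn = vaddr >> PAGE_SHIFT
    table = root
    for level in reversed(range(LEVELS)):
        index = (vpn >> (level * INDEX_BITS)) & ((1 << INDEX_BITS) - 1)
        if index not in table:
            raise LookupError(f"translation fault at level {level}")
        table = table[index]
    # After the loop, 'table' holds the frame number of the final level.
    return (table << PAGE_SHIFT) | offset


# Hypothetical mapping: virtual page (1, 2) -> physical frame 0x80.
root = {1: {2: 0x80}}
vaddr = (((1 << INDEX_BITS) | 2) << PAGE_SHIFT) | 0x34
paddr = walk(root, vaddr)
```

Because the lookup follows the process's own tables, any mapped user buffer can be a transfer target without first being entered into a separate registration table.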

Optical networking technologies:

  • Basic optical structures (e.g. 2x2 photonic switches).
  • Photonics network technology model that includes circuit switching and Wavelength Division Multiplexing.

Switching technologies:

  • A multipath AXI crossbar that can be exploited to handle contention and provide fault tolerance.
  • A table-less ToR switch that overcomes the limitations of expensive CAM tables.
  • A multi-objective optimization-based topological framework that can be used to select the best topology based on several metrics of interest.

Routing technologies:

  • Ultra-low-latency, fully parametric, multi-channel network router for direct (Torus mesh) interconnection. Packet-level resilience features are embedded in the lite network protocol (an EDAC SECDED code for control words and a CRC code for the data payload), with negligible impact on network performance.
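
The two protection schemes named above can be sketched as follows. This is an illustration only: the 4-bit control word, the extended Hamming(8,4) layout, and the use of CRC-32 from `zlib` are assumptions, since the actual word widths and polynomials of the router protocol are not stated here.

```python
import zlib


def secded_encode(nibble):
    """Encode 4 data bits as an extended Hamming(8,4) codeword (list of 8 bits).

    Index 0 is the overall parity bit; indices 1..7 are Hamming positions.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) & 1  # make the parity of all 8 bits even
    return code


def secded_decode(code):
    """Return (nibble, status): corrects one flipped bit, detects two."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_odd = sum(code) & 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        code[syndrome] ^= 1  # syndrome 0 means the parity bit itself flipped
        status = "corrected"
    else:
        return None, "double-error"  # even parity but non-zero syndrome
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status


def payload_crc(payload):
    """CRC-32 of the payload; detects corruption but cannot correct it."""
    return zlib.crc32(payload)


word = secded_encode(0b1011)
word[5] ^= 1                      # inject a single bit error
recovered, status = secded_decode(word)
```

The split reflects a common design choice: short control words get correction so that a single flipped bit does not force a retransmission, while the larger payload only gets detection.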

Storage

Extensions to the BeeGFS parallel file system:

  • Support for mirroring and failover to increase the availability of metadata, and of the file system in general, in case of node failure. These mechanisms have also been integrated with the BeeGFS-on-demand (BeeOND) functionality, which supports the on-demand setup of a file system instance on top of SSD storage space, to be used as a per-job cache layer.

Database replication technology:

  • A query-log-based database replication feature in the MonetDB in-memory database system which, unlike traditional replication schemes, can be used not only for backup and disaster recovery, but also for distributing growing workloads.
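
The general idea of replaying a query log can be sketched as below. This is not the MonetDB feature itself: the example uses Python's built-in `sqlite3` as a stand-in, and the class names `LoggedPrimary` and `Replica` are hypothetical.

```python
import sqlite3


class LoggedPrimary:
    """Primary database that records every executed write statement in order."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.log = []  # ordered list of (sql, params)

    def execute(self, sql, params=()):
        self.db.execute(sql, params)
        self.log.append((sql, tuple(params)))


class Replica:
    """Replica that rebuilds its state by replaying the primary's log."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.applied = 0  # number of log entries already replayed

    def catch_up(self, log):
        for sql, params in log[self.applied:]:
            self.db.execute(sql, params)
        self.applied = len(log)

    def query(self, sql, params=()):
        return self.db.execute(sql, params).fetchall()


primary = LoggedPrimary()
primary.execute("CREATE TABLE t (x INTEGER)")
primary.execute("INSERT INTO t VALUES (?)", (1,))
primary.execute("INSERT INTO t VALUES (?)", (2,))

replicas = [Replica(), Replica()]
for replica in replicas:
    replica.catch_up(primary.log)

# Read-only queries can now be spread across the replicas.
totals = [replica.query("SELECT SUM(x) FROM t") for replica in replicas]
```

Once a replica has caught up it can answer read-only queries on its own, which is how a replication log can serve workload distribution as well as recovery.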

Low-latency I/O path technologies for Linux:

  • DMAP: a low-latency memory-mapped storage access path in the Linux kernel. This implementation has been designed as a replacement for the mmap() infrastructure in Linux, and is being used with the MonetDB in-memory database system from MDBS, in performance evaluation experiments with the TPC-H analytical queries.
  • Iris: separation of the control and data planes in the common I/O path, to minimize system software overheads. Iris utilizes processor virtualization features to provide a fast path for protected accesses to a key-value store.
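
For readers unfamiliar with memory-mapped I/O, the sketch below shows the stock mmap()-style access pattern that DMAP is designed to replace, using Python's standard `mmap` module. It does not show DMAP's own interface, which is not described here. The helper name `mapped_roundtrip` is hypothetical, and the payload must be non-empty because an empty file cannot be mapped.

```python
import mmap
import os
import tempfile


def mapped_roundtrip(payload):
    """Write payload through a memory mapping, then read it back from the file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")
        # Pre-size the file: a mapping cannot grow the file it covers.
        with open(path, "wb") as f:
            f.write(b"\0" * len(payload))
        with open(path, "r+b") as f:
            with mmap.mmap(f.fileno(), 0) as view:
                # Stores into the mapping are plain memory writes, with no
                # read()/write() system call per access.
                view[:len(payload)] = payload
                view.flush()  # ask the OS to write dirty pages back
        with open(path, "rb") as f:
            return f.read()


result = mapped_roundtrip(b"hello")
```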

Virtualization for HPC:

  • API remoting infrastructure for virtual machines running with the KVM hypervisor in Linux. This infrastructure is a novel approach aimed at enriching KVM-enabled virtual machines with RDMA engines and optimized intra-node VM-to-VM data exchange.

Monitoring and Benchmarking Tools:

  • A complete monitoring architecture for the ExaNeSt system environment, covering both compute-nodes and the parallel file system, and combining several performance measurement collection and visualization tools.
  • A checkpoint/restart simulation suite, for use in evaluation experiments of ExaNeSt prototypes.
  • A performance monitoring and analysis tool ("Marvin") to profile a MonetDB database server at the relational algebra level.
  • A new benchmark (AirTraffic) for distributed statistical analysis databases.
  • A tool suite for automating evaluation experiments, with load-stress and fault-injection scenarios.

Applications

Run-time environment:

  • Complete software environment for HPC applications on experimental ARMv8-based platforms (application codes together with all essential libraries, e.g. the GNU Scientific Library, HDF5, cfitsio, and fftw).
  • Complete software environment for the MonetDB in-memory database on experimental ARMv8-based platforms.

Workload characterization:

  • Collection of network traces and complete MPI profiling of scientific applications (using SCALASCA or by instrumenting the codes directly).
  • Porting and validation of the Allinea debugging and profiling tools on experimental ARMv8-based platforms, including the introduction of a new tool to profile and account for the usage (bytes exchanged, time spent, number of calls) of MPI API functions.

Application scaling improvements:

  • Extraction of OpenCL kernels for acceleration using FPGA resources.
  • Re-engineering methodology for HPC applications aiming towards Exascale.

Packaging/Cooling

Compute-Node for dense deployments:

  • Quad-FPGA Daughter Board, Unimem-capable building block: 4 coherence islands with a total of 4x4 ARM A53 cores, with FPGA resources available for accelerators, SSD storage, and high-speed serial links for building scalable network topologies.

Cooling for high-density modular data centers:

  • Hot Water Cooling (>50 °C), Heat Capture Functions.
  • 100 kW cabinets (small footprint), 1.0x PUE, >0.9 ERF Potential.
  • 48 V DC Power Distribution.
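
For reference, the two efficiency metrics quoted above are commonly defined as follows. The symbols are chosen here for illustration: $E_{\text{facility}}$ is the total energy drawn by the data center, $E_{\text{IT}}$ the energy consumed by the IT equipment, and $E_{\text{reused}}$ the energy (for example captured heat) that is reused outside the data center.

```latex
\mathrm{PUE} = \frac{E_{\text{facility}}}{E_{\text{IT}}} \geq 1,
\qquad
\mathrm{ERF} = \frac{E_{\text{reused}}}{E_{\text{facility}}} \in [0, 1]
```

Under these definitions, a PUE close to 1.0 means almost all facility energy reaches the IT equipment, and an ERF above 0.9 means more than 90% of the facility energy is reused. Reading "1.0x" as a PUE between 1.00 and 1.09 is an assumption.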

 
