Technology

Hardware

TIDE will focus heavily on graphics processing units (GPUs), keeping the CSU science drivers in mind.

Specifically, TIDE will add 18 GPU nodes, 6 CPU nodes, and 3 storage nodes. Seventeen GPU nodes will each contain 4x NVIDIA L40 48GB accelerators, and a single GPU node will host 4x NVIDIA A100 80GB accelerators, adding a total of 72 GPUs to Nautilus.

The storage nodes will add approximately 240 TB of usable storage for science driver workloads.

Nautilus as a Control Plane for TIDE

The National Research Platform (NRP) Nautilus hypercluster is a distributed Kubernetes cluster forming a collective pool of computational resources.

TIDE will leverage the Nautilus model of “bring your own resource” (BYOR) by adding the hardware to the fabric. CSU users and science drivers will receive preferential scheduling on TIDE nodes and will be able to burst onto other Nautilus CPUs and GPUs as needed. While most Nautilus resources are available to all users, some give priority to the projects or groups that contributed the systems or provided funding for specific hardware.

The TIDE nodes will be tainted with PreferNoSchedule, so workloads that do not carry an explicit TIDE toleration are discouraged, but not prevented, from scheduling on them; this ensures up to 20% of the resources remain available to other Nautilus users. Making the nodes available for other Nautilus workloads also means the Open Science Grid (OSG) can make use of the resources through the OSG Nautilus integration, utilizing spare cycles that would otherwise go unused, in support of the Partnership to Advance Throughput Computing (PATh) project.
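
As a minimal sketch of this mechanism using the Kubernetes Python client, a soft taint on a TIDE node and the matching toleration a TIDE workload would declare might look like the following; the node name and taint key are hypothetical placeholders, not the actual TIDE configuration.

    from kubernetes import client, config

    # Load credentials from the local kubeconfig (use load_incluster_config()
    # when running inside the cluster).
    config.load_kube_config()
    core = client.CoreV1Api()

    # Patch a soft taint onto a TIDE node (this simple patch replaces any
    # existing taints). PreferNoSchedule discourages, but does not forbid,
    # pods without a matching toleration, which is what leaves headroom for
    # other Nautilus and OSG workloads.
    tide_taint = client.V1Taint(
        key="nautilus.io/tide",      # hypothetical taint key
        value="true",
        effect="PreferNoSchedule",
    )
    core.patch_node("tide-gpu-01", {"spec": {"taints": [tide_taint]}})

    # A TIDE workload declares the matching toleration in its pod spec so the
    # soft taint does not count against it during scheduling.
    tide_toleration = client.V1Toleration(
        key="nautilus.io/tide",
        operator="Equal",
        value="true",
        effect="PreferNoSchedule",
    )

Because the taint effect is PreferNoSchedule rather than NoSchedule, untolerated pods are only deprioritized rather than excluded, which is what keeps the spare capacity usable by the wider Nautilus community.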

Software

Nautilus leverages open-source software and containerization technologies, orchestrated with Kubernetes, to allow workloads to be executed both interactively, e.g. via JupyterHub, and in a more batch-oriented fashion.

Containers provide portable execution environments “packaged” with the required software, with data made available via data services. Nautilus makes use of open-source monitoring, measurement, and management tools.

In order to facilitate easy access to TIDE resources, a managed JupyterHub instance will provide a set of curated software container images covering common use cases. CSU users will use their campus credentials to access JupyterHub.
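
As an illustration of how such a hub might be wired together with the KubeSpawner and CILogon-style campus single sign-on, a configuration fragment could look like the following; the authenticator choice, image names, taint key, and GPU sizing are assumptions for the sketch, not the actual TIDE deployment.

    # jupyterhub_config.py fragment (illustrative only; the authenticator,
    # image names, taint key, and GPU sizing are placeholders).
    c = get_config()  # noqa: F821  (provided by JupyterHub at load time)

    # Campus single sign-on, e.g. through the CILogon OAuthenticator.
    c.JupyterHub.authenticator_class = "cilogon"

    # Tolerate the soft TIDE taint so user notebook pods can land on TIDE nodes.
    c.KubeSpawner.tolerations = [
        {
            "key": "nautilus.io/tide",
            "operator": "Equal",
            "value": "true",
            "effect": "PreferNoSchedule",
        }
    ]

    # A spawn-time menu of common container images.
    c.KubeSpawner.profile_list = [
        {
            "display_name": "Scientific Python (CPU only)",
            "kubespawner_override": {
                "image": "quay.io/jupyter/scipy-notebook:latest",
            },
        },
        {
            "display_name": "PyTorch (1x GPU)",
            "kubespawner_override": {
                "image": "quay.io/jupyter/pytorch-notebook:cuda12-latest",
                "extra_resource_limits": {"nvidia.com/gpu": "1"},
            },
        },
    ]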

perfSONAR is used to continuously monitor connectivity between Nautilus nodes. Tools such as Grafana and Prometheus are used for reporting and alerting.
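
As one example of how such monitoring data can be consumed, the Prometheus HTTP API can be queried programmatically; the endpoint URL, metric name, and label names below are placeholders and will depend on the actual exporter configuration.

    import requests

    # Placeholder Prometheus endpoint; the real monitoring endpoint, metric,
    # and label names depend on the deployment.
    PROM_URL = "https://prometheus.example.org"

    def instant_query(promql: str) -> list:
        """Run an instant PromQL query and return the result vector."""
        resp = requests.get(
            f"{PROM_URL}/api/v1/query",
            params={"query": promql},
            timeout=30,
        )
        resp.raise_for_status()
        payload = resp.json()
        if payload.get("status") != "success":
            raise RuntimeError(f"Prometheus query failed: {payload}")
        return payload["data"]["result"]

    # Example: average GPU utilization per node over the last 10 minutes,
    # assuming DCGM-exporter style metric and label names.
    query = "avg by (node) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[10m]))"
    for series in instant_query(query):
        node = series["metric"].get("node", "unknown")
        utilization = float(series["value"][1])
        print(f"{node}: {utilization:.1f}% GPU utilization")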

Network

Network connectivity to the 23 CSU campuses is provided by the Corporation for Education Network Initiatives in California (CENIC).

CENIC’s California Research and Education Network (CalREN) is a multi-tiered, advanced network serving most research and education institutions in the state. CalREN operates CalREN-DC for general-use enterprise network traffic and CalREN-HPR for research-related network traffic.

Nautilus requires a Science DMZ network architecture to integrate into the fabric. SDSU has implemented a Science DMZ architecture that isolates research traffic from instructional and enterprise traffic, using 2×10 Gb CENIC CalREN-HPR uplinks.

  • For traffic not destined for CalREN-HPR, a 20 Gb interconnect is used to direct traffic over SDSU’s 100 Gb CENIC CalREN-DC connection.