
H100 NVL vs. SXM5: NVIDIA's Supercomputing GPUs

- Team Vast

September 11, 2024
Cloud GPU Rentals · GPU Rental Platform · Vast.ai

In the high-end GPU market, NVIDIA's H100 accelerator family stands out as some of the most advanced hardware available today. Whether you're training AI models or deploying them, running scientific simulations, or performing large-scale data analytics, the H100 lineup delivers the performance these enormous workloads demand.

But which accelerator is right for you? Today we're looking at two options: the NVIDIA H100 NVL and the H100 SXM5. Let's dive into the differences and explore which one might be the better fit for your project needs.

The H100 NVL: Built for AI Inference at Scale

Announced in 2023, the NVIDIA H100 NVL is an interesting variant of the H100 PCIe card. At first glance, it's exactly what it looks like: two H100 PCIe cards shipped pre-bridged together.

This massive beast of a card spans four slots – with a configurable TDP of 700-800W to match its size. Since it's essentially two H100s joined together, host communication is handled by a pair of PCIe 5.0 x16 interfaces, one per GPU. Three NVLink bridges connect the two H100 GPUs directly, delivering 600GB/s of bidirectional bandwidth – about 4.5x what a single PCIe 5.0 x16 link offers.
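
As a quick sanity check on that ratio, here's a back-of-the-envelope calculation in Python. It assumes raw PCIe 5.0 x16 link rates of ~64GB/s per direction (~128GB/s bidirectional) with no protocol overhead:

```python
# Back-of-the-envelope: NVLink bridge bandwidth vs. a PCIe 5.0 x16 link.
# Assumes raw link rates with no protocol overhead.

nvlink_bidir_gbs = 600         # GB/s across the three NVLink bridges
pcie5_x16_bidir_gbs = 2 * 64   # GB/s for one PCIe 5.0 x16 link (~64 GB/s each way)

ratio = nvlink_bidir_gbs / pcie5_x16_bidir_gbs
print(f"NVLink vs. PCIe 5.0 x16: ~{ratio:.1f}x")  # ~4.7x raw, in line with the quoted ~4.5x
```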

NVIDIA states that the H100 NVL is capable of 2.4x to 2.6x the performance of two separate H100 PCIe cards (at least when it comes to FP8 and FP16 workloads) – likely in part because it uses faster HBM3 memory rather than HBM2e. Plus, unlike the standard H100 PCIe, each GPU on the H100 NVL has all six stacks of HBM memory enabled instead of just five. The result is a 17.5% per-GPU memory increase – and a total of 188GB of VRAM (2 x 94GB).
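
The arithmetic behind those memory figures is straightforward to check:

```python
# The standard H100 PCIe ships with five of six HBM stacks enabled (80 GB);
# the H100 NVL enables all six, for 94 GB per GPU.
pcie_gb = 80
nvl_per_gpu_gb = 94

print(f"Per-GPU increase: {(nvl_per_gpu_gb - pcie_gb) / pcie_gb:.1%}")  # 17.5%
print(f"Total NVL VRAM: {2 * nvl_per_gpu_gb} GB")                       # 188 GB
```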

A few other standout features include a ~5.1Gbps HBM3 memory clock, a 6144-bit memory bus, and 2 x 3.9TB/s of memory bandwidth.

According to NVIDIA, the H100 NVL is ideal for inferencing large language models (LLMs):

"These GPUs work as one to deploy large language models and GPT models from anywhere from five billion parameters to 200 [billion]," explained Ian Buck, NVIDIA's VP of Accelerated Computing.

So in addition to being able to handle various high-performance computing (HPC) tasks, the H100 NVL is specifically optimized for deploying already-trained neural networks at scale.
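
To see why the NVL's capacity matters for inference, consider a rough weights-only sizing sketch. The numbers below are simplifications under stated assumptions (2 bytes per parameter at FP16, 1 byte at FP8) and ignore the KV cache, activations, and runtime overhead, which all add to the real footprint:

```python
# Weights-only memory estimate for serving an LLM within the NVL's 188 GB.
# Ignores KV cache, activations, and framework overhead (all add more).

def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate VRAM needed just to hold the model weights."""
    return params_billions * bytes_per_param  # 1B params x N bytes = N GB

NVL_VRAM_GB = 188

for params in (70, 175):
    for precision, nbytes in (("FP16", 2), ("FP8", 1)):
        need = weights_gb(params, nbytes)
        verdict = "fits" if need <= NVL_VRAM_GB else "does not fit"
        print(f"{params}B @ {precision}: ~{need:.0f} GB -> {verdict}")
```

A 175B-parameter model, for example, doesn't fit in 188GB at FP16 (~350GB of weights alone) but does at FP8 (~175GB) – roughly consistent with the "up to 200 billion parameters" framing.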

The H100 SXM5: Heavyweight for AI Training

SXM technology is a form factor and interconnect standard primarily used for GPUs in large-scale AI applications and data center environments. Instead of relying on PCIe slots like traditional GPUs, SXM GPUs are directly socketed into the motherboard. This allows for faster, high-bandwidth connections as well as better power delivery and cooling solutions – crucial for high-end GPUs, particularly in dense server environments.

Ultimately, this setup helps ensure that SXM GPUs perform efficiently even under the most intense workloads. For instance, NVIDIA's H100 SXM5 is designed for AI training first and foremost – a powerhouse GPU that is especially well suited to training foundational models.

With 528 tensor cores, 80 GB of HBM3 memory, and a memory bandwidth of 3.35 TB/s, the SXM5 is designed for the most demanding machine learning and deep learning tasks. It also benefits from NVLink interconnects that allow up to 900 GB/s of GPU-to-GPU bandwidth, making it ideal for multi-GPU configurations in large-scale AI model training.
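
If you want to confirm what hardware you've actually been allocated on a rented instance, a minimal PyTorch sketch like this can help (reported device names and totals vary by driver and system; `nvidia-smi topo -m` will additionally show the NVLink topology):

```python
import torch

# Inventory the GPUs PyTorch can see on a multi-GPU node.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.0f} GB")

# Peer-to-peer access between GPU 0 and GPU 1.
# NVLink-connected pairs report True (PCIe P2P can as well).
if torch.cuda.device_count() >= 2:
    print("P2P 0<->1:", torch.cuda.can_device_access_peer(0, 1))
```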

Like the H100 NVL, the SXM5 features HBM3 memory. However, the SXM5's form factor is a bit more restrictive. With each unit pulling 700W, running these systems requires a lot of power and cooling, which can pose a challenge for some datacenters. Most colocation racks have a power capacity of 6-10kW, meaning systems with four or more SXM5s can push infrastructure to its limits.

(The 700-800W H100 NVL is comparatively easier to accommodate. A single-socket, dual H100 NVL configuration – with four GH100 dies – would likely require around 2.5kW.)
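
Those power figures are easy to ballpark. The host overhead below is an assumption for illustration – CPUs, NICs, fans, and storage vary widely by system:

```python
# Rough server power budgets from the TDP figures quoted above.
SXM5_TDP_W = 700
NVL_TDP_W = 800          # top of the NVL's 700-800W configurable range
HOST_OVERHEAD_W = 900    # assumed CPU/NIC/cooling/storage overhead

print(f"4x SXM5 server:       ~{(4 * SXM5_TDP_W + HOST_OVERHEAD_W) / 1000:.1f} kW")  # ~3.7 kW
print(f"Dual H100 NVL server: ~{(2 * NVL_TDP_W + HOST_OVERHEAD_W) / 1000:.1f} kW")   # ~2.5 kW
```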

A direct comparison of specs would be helpful in deciding which accelerator best suits your needs, especially when balancing power efficiency, infrastructure requirements, and the performance demands of your workloads.

Which GPU is Right for You?

Here's an outline of some of the features and specs of the H100 NVL and the SXM5, as well as the H100 PCIe for good measure:

| Feature | H100 NVL | H100 PCIe | H100 SXM5 |
| --- | --- | --- | --- |
| Architecture | Hopper | Hopper | Hopper |
| Memory Clock | ~5.1 Gbps HBM3 | 3.2 Gbps HBM2e | 5.23 Gbps HBM3 |
| Memory Bus Width | 6144-bit | 5120-bit | 5120-bit |
| Memory Bandwidth | 2 x 3.9 TB/s | 2 TB/s | 3.35 TB/s |
| FP32 CUDA Cores | 2 x 16896 | 14592 | 16896 |
| Tensor Cores | 2 x 528 | 456 | 528 |
| Boost Clock | 1.98 GHz | 1.75 GHz | 1.98 GHz |
| VRAM | 2 x 94 GB (188 GB) | 80 GB | 80 GB |
| INT8 Tensor | 2 x 1980 TOPS | 1513 TOPS | 1980 TOPS |
| FP16 Tensor | 2 x 990 TFLOPS | 756 TFLOPS | 990 TFLOPS |
| FP32 Tensor | 2 x 495 TFLOPS | 378 TFLOPS | 495 TFLOPS |
| FP64 Tensor | 2 x 67 TFLOPS | 51 TFLOPS | 67 TFLOPS |
| Interconnect | NVLink 4 (600 GB/s) | NVLink 4 (600 GB/s) | NVLink 4, 18 Links (900 GB/s) |
| GPU | 2 x GH100 | GH100 | GH100 |
| Transistor Count | 2 x 80B | 80B | 80B |
| TDP | 700-800W | 350W | 700W |
| Interface | 2 x PCIe 5.0 (Quad Slot) | PCIe 5.0 (Dual Slot) | SXM5 |

The choice between the NVIDIA H100 NVL and SXM5 ultimately depends on your specific project requirements. The NVL is the more versatile and accessible option: it drops into standard PCIe servers and pairs its 188GB of VRAM with inference-oriented performance. The SXM5, meanwhile, offers the highest interconnect bandwidth and scalability for demanding multi-GPU training workloads.

If you prioritize flexibility and ease of integration, the NVL could be the better choice. However, if you need the highest possible performance and are willing to invest in a more specialized HGX- or DGX-class setup, the SXM5 is your ideal solution. Carefully consider your project's needs, budget, and infrastructure before making a decision.
