NVIDIA Introduces NVSHMEM 3.0 with Enhanced GPU Communication Features

Jessie A Ellis
Sep 07, 2024 08:39

NVIDIA’s NVSHMEM 3.0 provides multi-node support, ABI backward compatibility, and CPU-assisted InfiniBand GPU Direct Async, enhancing GPU communication.

NVIDIA has announced the release of NVSHMEM 3.0, the latest version of its parallel programming interface designed to facilitate efficient and scalable communication for NVIDIA GPU clusters. This update, part of NVIDIA Magnum IO and based on OpenSHMEM, aims to enhance application portability and compatibility across various platforms, according to the NVIDIA Technical Blog.

New Features and Interface Support

NVSHMEM 3.0 introduces several new features, including multi-node, multi-interconnect support, host-device ABI backward compatibility, and CPU-assisted InfiniBand GPU Direct Async (IBGDA).

Multi-Node, Multi-Interconnect Support

The new version supports connectivity between multiple GPUs within a node over P2P interconnects, such as NVIDIA NVLink/PCIe, and across nodes using RDMA interconnects like InfiniBand and RDMA over Converged Ethernet (RoCE). This enhancement includes platform support for multiple racks of NVIDIA GB200 NVL72 systems connected via RDMA networks.
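
The intra-node versus inter-node split described above can be pictured with a small sketch. This is not NVSHMEM code: `GpuLocation` and `interconnect_class` are hypothetical names invented for illustration, and the sketch assumes that two GPUs on the same node always use a P2P link while GPUs on different nodes always use RDMA. Real topology discovery is considerably more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuLocation:
    """Hypothetical description of where a GPU lives in the cluster."""
    node: str       # node identifier, e.g. a hostname (assumed)
    local_gpu: int  # GPU index within that node


def interconnect_class(src: GpuLocation, dst: GpuLocation) -> str:
    """Return the class of interconnect the article associates with a GPU pair."""
    if src == dst:
        # Same GPU: no interconnect is involved.
        return "local"
    if src.node == dst.node:
        # Same node: P2P interconnect such as NVLink or PCIe.
        return "P2P (NVLink/PCIe)"
    # Different nodes: RDMA interconnect such as InfiniBand or RoCE.
    return "RDMA (InfiniBand/RoCE)"
```

For example, `interconnect_class(GpuLocation("node0", 0), GpuLocation("node1", 0))` falls into the RDMA class, while two GPUs on `node0` fall into the P2P class.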

Host-Device ABI Backward Compatibility

NVSHMEM 3.0 introduces backward compatibility across minor versions, allowing applications linked to an older version of NVSHMEM to run on systems with newer versions. This feature facilitates smoother updates and reduces the need to recompile applications with every new release.
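
As a rough illustration of what this rule means for an administrator, the sketch below checks whether an application could run against an installed library without recompiling. The function names are hypothetical and not part of NVSHMEM. The rule it encodes is an assumption drawn from the description above: the major version must match, and the installed minor version must be the same or newer than the one the application was linked against.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int


def parse_version(text: str) -> Version:
    """Parse a 'major.minor[.patch]' string; any patch level is ignored."""
    parts = text.split(".")
    return Version(int(parts[0]), int(parts[1]))


def can_run_without_recompiling(linked: str, installed: str) -> bool:
    """Assumed rule: same major version, installed minor >= linked minor."""
    app = parse_version(linked)
    lib = parse_version(installed)
    return app.major == lib.major and lib.minor >= app.minor
```

Under these assumptions, an application linked against 3.0 would pass the check on a system with 3.1 installed, while the reverse (linked against 3.1, running on 3.0) would not.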

CPU-Assisted InfiniBand GPU Direct Async

The newest launch additionally helps CPU-assisted IBGDA, which divides management airplane tasks between the GPU and CPU. This strategy helps enhance IBGDA adoption on non-coherent platforms and relaxes administrative-level configuration constraints in large-scale clusters.

Non-Interface Support and Minor Enhancements

NVSHMEM 3.0 includes minor enhancements and non-interface support, such as:

Object-Oriented Programming Framework for Symmetric Heap

This version introduces an object-oriented programming (OOP) framework to manage different kinds of symmetric heaps, including static and dynamic device memory. The OOP framework simplifies the extension to advanced features and improves data encapsulation.
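
To make the idea concrete, here is a minimal sketch of how one base class could hide the difference between a static and a dynamic heap. It does not reflect NVSHMEM's internal design: the class names, the bump-style offset allocation, and the chunked growth policy are all assumptions chosen for illustration.

```python
from abc import ABC, abstractmethod


class SymmetricHeap(ABC):
    """Hypothetical base class that hands out byte offsets into a heap."""

    def __init__(self) -> None:
        self._used = 0  # bytes handed out so far (simple bump allocation)

    @property
    def used(self) -> int:
        return self._used

    @abstractmethod
    def _ensure_capacity(self, needed: int) -> None:
        """Make room for `needed` total bytes, or raise MemoryError."""

    def allocate(self, size: int) -> int:
        """Reserve `size` bytes and return the offset of the reservation."""
        if size <= 0:
            raise ValueError("size must be positive")
        self._ensure_capacity(self._used + size)
        offset = self._used
        self._used += size
        return offset


class StaticHeap(SymmetricHeap):
    """Capacity is fixed when the heap is created."""

    def __init__(self, capacity: int) -> None:
        super().__init__()
        self.capacity = capacity

    def _ensure_capacity(self, needed: int) -> None:
        if needed > self.capacity:
            raise MemoryError("static heap exhausted")


class DynamicHeap(SymmetricHeap):
    """Capacity grows on demand in fixed-size chunks."""

    def __init__(self, chunk: int) -> None:
        super().__init__()
        self.chunk = chunk
        self.capacity = 0

    def _ensure_capacity(self, needed: int) -> None:
        while self.capacity < needed:
            self.capacity += self.chunk
```

Callers only see `allocate`, so a new heap kind can be added by implementing `_ensure_capacity` without touching existing code, which is the sort of extensibility and encapsulation benefit the paragraph above describes.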

Performance Improvements and Bug Fixes

NVSHMEM 3.0 brings various performance improvements and bug fixes, including improvements in IBGDA setup, block-scoped on-device reductions, system-scoped atomic memory operation (AMO), and team management.

Summary

The release of NVSHMEM 3.0 marks a significant upgrade in NVIDIA’s parallel programming interface. Key features such as multi-node multi-interconnect support, host-device ABI backward compatibility, and CPU-assisted IBGDA aim to enhance GPU communication and application portability. Administrators and developers can now update to newer versions of NVSHMEM without disrupting existing applications, ensuring smoother transitions and better performance in large-scale GPU clusters.

Image source: Shutterstock
