NVIDIA Introduces NVSHMEM 3.0 with Enhanced GPU Communication Features


Jessie A Ellis


Sep 07, 2024 08:39

NVIDIA’s NVSHMEM 3.0 offers multi-node support, ABI backward compatibility, and CPU-assisted InfiniBand GPU Direct Async, enhancing GPU communication.

NVIDIA has announced the release of NVSHMEM 3.0, the latest version of its parallel programming interface designed to facilitate efficient and scalable communication for NVIDIA GPU clusters. This update, part of NVIDIA Magnum IO and based on OpenSHMEM, aims to enhance application portability and compatibility across various platforms, according to the NVIDIA Technical Blog.

New Features and Interface Support

NVSHMEM 3.0 introduces several new features, including multi-node, multi-interconnect support, host-device ABI backward compatibility, and CPU-assisted InfiniBand GPU Direct Async (IBGDA).

Multi-Node, Multi-Interconnect Support

The new version supports connectivity between multiple GPUs within a node over P2P interconnects, such as NVIDIA NVLink/PCIe, and across nodes using RDMA interconnects like InfiniBand and RDMA over Converged Ethernet (RoCE). This enhancement includes platform support for multiple racks of NVIDIA GB200 NVL72 systems connected through RDMA networks.
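
The split between intra-node and inter-node paths described above can be illustrated with a small C++ sketch. It is not NVSHMEM code and uses none of its API: the `Pe` struct, the `Path` enum, and `select_path` are hypothetical names, and the sketch assumes that "same node" is the only condition for using a P2P interconnect.

```cpp
#include <cassert>

// Hypothetical description of a processing element (PE): its global rank and
// the index of the node that hosts its GPU. These names are illustrative only.
struct Pe {
    int rank;
    int node;
};

// The two interconnect classes named in the article, plus the trivial case.
enum class Path { Self, IntraNodeP2P, InterNodeRdma };

// Choose the interconnect class for a transfer from src to dst.
// Assumption: PEs on the same node can always reach each other over P2P.
inline Path select_path(const Pe& src, const Pe& dst) {
    assert(src.rank >= 0 && dst.rank >= 0);  // ranks are non-negative
    if (src.rank == dst.rank) return Path::Self;
    if (src.node == dst.node) return Path::IntraNodeP2P;  // NVLink or PCIe
    return Path::InterNodeRdma;                           // InfiniBand or RoCE
}

// Human-readable label for logging or diagnostics.
inline const char* path_name(Path p) {
    switch (p) {
        case Path::Self:          return "self";
        case Path::IntraNodeP2P:  return "P2P (NVLink/PCIe)";
        case Path::InterNodeRdma: return "RDMA (InfiniBand/RoCE)";
    }
    return "unknown";
}
```

For example, `select_path(Pe{0, 0}, Pe{1, 0})` yields `Path::IntraNodeP2P`, while `select_path(Pe{0, 0}, Pe{8, 1})` yields `Path::InterNodeRdma`. Real topology detection is more involved than a node-index comparison, so this should be read only as a picture of the two-tier layout.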

Host-Device ABI Backward Compatibility

NVSHMEM 3.0 introduces backward compatibility across minor versions, allowing applications linked to an older version of NVSHMEM to run on systems with newer versions. This feature facilitates smoother updates and reduces the need for recompiling applications with each new release.
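
The compatibility rule can be written down as a simple version check. The following C++ sketch is an illustration rather than NVSHMEM's actual mechanism: `Version` and `is_abi_compatible` are hypothetical names, and the sketch assumes the guarantee holds only within a single major version, which is one reading of "across minor versions".

```cpp
#include <cassert>

// Hypothetical version record; the field names are illustrative only.
struct Version {
    int major_ver;
    int minor_ver;
};

// Returns true if an application linked against `linked` may run on a system
// where `installed` is present.
// Assumption: compatibility holds only when the major versions match and the
// installed minor version is at least as new as the one linked against.
constexpr bool is_abi_compatible(Version linked, Version installed) {
    return installed.major_ver == linked.major_ver &&
           installed.minor_ver >= linked.minor_ver;
}
```

Under this assumed rule, an application linked against 3.0 runs on a system with 3.1, but one linked against 3.1 is not expected to run on 3.0, and nothing is promised across a major version change.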

CPU-Assisted InfiniBand GPU Direct Async

The latest release also supports CPU-assisted IBGDA, which divides control plane responsibilities between the GPU and CPU. This approach helps improve IBGDA adoption on non-coherent platforms and relaxes administrative-level configuration constraints in large-scale clusters.

Non-Interface Support and Minor Enhancements

NVSHMEM 3.0 includes minor enhancements and non-interface support, such as:

Object-Oriented Programming Framework for Symmetric Heap

This version introduces an object-oriented programming (OOP) framework to manage different kinds of symmetric heaps, including static and dynamic device memory. The OOP framework simplifies the extension to advanced features and improves data encapsulation.
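
To show what an object-oriented treatment of several heap kinds can look like, here is a minimal C++ sketch. It does not reproduce NVSHMEM's internal classes: `SymmetricHeap`, `StaticHeap`, and `DynamicHeap` are hypothetical names, and the heaps track byte offsets only instead of real device memory.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical base class: the shared bump-allocation logic lives here, and
// each heap kind only decides whether and how its capacity can grow.
class SymmetricHeap {
public:
    virtual ~SymmetricHeap() = default;

    // Reserve `bytes` and write the starting offset to `offset_out`.
    // Returns false if the request cannot be satisfied.
    bool allocate(std::size_t bytes, std::size_t& offset_out) {
        if (bytes == 0) return false;
        if (used_ + bytes > capacity() && !grow(used_ + bytes)) return false;
        offset_out = used_;
        used_ += bytes;
        return true;
    }

    virtual std::size_t capacity() const = 0;
    std::size_t used() const { return used_; }

protected:
    // Try to raise capacity to at least `required` bytes.
    virtual bool grow(std::size_t required) = 0;

private:
    std::size_t used_ = 0;  // encapsulated: subclasses cannot modify it
};

// Fixed-size heap reserved up front; it never grows.
class StaticHeap final : public SymmetricHeap {
public:
    explicit StaticHeap(std::size_t capacity) : capacity_(capacity) {}
    std::size_t capacity() const override { return capacity_; }

protected:
    bool grow(std::size_t) override { return false; }

private:
    std::size_t capacity_;
};

// Heap that extends itself in fixed-size chunks on demand.
class DynamicHeap final : public SymmetricHeap {
public:
    explicit DynamicHeap(std::size_t chunk) : chunk_(chunk) { assert(chunk > 0); }
    std::size_t capacity() const override { return capacity_; }

protected:
    bool grow(std::size_t required) override {
        while (capacity_ < required) capacity_ += chunk_;
        return true;
    }

private:
    std::size_t chunk_;
    std::size_t capacity_ = 0;
};
```

In this sketch a new heap kind is added by subclassing and overriding `capacity` and `grow`, which is one way to read the article's point about easier extension and better encapsulation.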

Performance Improvements and Bug Fixes

NVSHMEM 3.0 brings various performance improvements and bug fixes, including enhancements in IBGDA setup, block-scoped on-device reductions, system-scoped atomic memory operation (AMO), and team management.

Summary

The release of NVSHMEM 3.0 marks a significant upgrade in NVIDIA’s parallel programming interface. Key features such as multi-node multi-interconnect support, host-device ABI backward compatibility, and CPU-assisted IBGDA aim to enhance GPU communication and application portability. Administrators and developers can now update to newer versions of NVSHMEM without disrupting existing applications, ensuring smoother transitions and better performance in large-scale GPU clusters.
