NVIDIA Unveils NCCL 2.22 with Enhanced Memory Efficiency and Faster Initialization


Caroline Bishop
Sep 21, 2024 13:38

NVIDIA introduces NCCL 2.22, focusing on memory efficiency, faster initialization, and cost estimation for improved HPC and AI applications.


The NVIDIA Collective Communications Library (NCCL) has released its latest version, NCCL 2.22, bringing significant enhancements aimed at optimizing memory usage, accelerating initialization times, and introducing a cost estimation API. These updates are crucial for high-performance computing (HPC) and artificial intelligence (AI) applications, according to the NVIDIA Technical Blog.

Release Highlights

NVIDIA Magnum IO NCCL is designed to optimize inter-GPU and multi-node communication, which is essential for efficient parallel computing. Key features of the NCCL 2.22 release include:


  • Lazy Connection Establishment: Delays the creation of connections until they are needed, significantly reducing GPU memory overhead.
  • New API for Cost Estimation: A new API helps optimize compute and communication overlap or research the NCCL cost model.
  • Optimizations for ncclCommInitRank: Eliminates redundant topology queries, speeding up initialization by up to 90% for applications creating multiple communicators.
  • Support for Multiple Subnets with IB Router: Adds support for communication in jobs spanning multiple InfiniBand subnets, enabling larger DL training jobs.

Features in Detail

Lazy Connection Establishment

NCCL 2.22 introduces lazy connection establishment, which significantly reduces GPU memory usage by delaying the creation of connections until they are actually needed. This is particularly beneficial for applications with a narrow usage pattern, such as running the same algorithm repeatedly, since connections for unused algorithms are never allocated. The feature is enabled by default but can be disabled by setting NCCL_RUNTIME_CONNECT=0.
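For example, an application that prefers the previous eager behavior can opt out via the environment, using the variable named in the release notes:

```shell
# Disable lazy connection establishment (enabled by default in NCCL 2.22),
# restoring eager connection setup at communicator creation time.
export NCCL_RUNTIME_CONNECT=0
```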

New Cost Model API

The new API, ncclGroupSimulateEnd, allows developers to estimate the time required for operations, aiding in the optimization of compute and communication overlap. While the estimates may not perfectly align with reality, they provide a useful guideline for performance tuning.
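As an illustration, here is a minimal sketch of estimating the cost of an all-reduce with the simulation API. The helper name and surrounding setup are illustrative, not from the article, and the exact ncclSimInfo_t fields should be checked against the NCCL 2.22 headers:

```c
#include <nccl.h>

/* Sketch only: assumes an initialized communicator `comm`, a CUDA stream
 * `stream`, and device buffers `sendbuf`/`recvbuf` holding `count` floats. */
float estimate_allreduce_time(ncclComm_t comm, cudaStream_t stream,
                              const float* sendbuf, float* recvbuf,
                              size_t count) {
  ncclSimInfo_t sim = NCCL_SIM_INFO_INITIALIZER;
  ncclGroupStart();
  /* Enqueue the collective to be simulated rather than executed. */
  ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
  /* Close the group with the simulation call instead of ncclGroupEnd:
   * NCCL fills in an estimated execution time without launching the ops. */
  ncclGroupSimulateEnd(&sim);
  return sim.estimatedTime;
}
```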

Initialization Optimizations

To minimize initialization overhead, the NCCL team has introduced several optimizations, including lazy connection establishment and intra-node topology fusion. These improvements can reduce ncclCommInitRank execution time by up to 90%, making it significantly faster for applications that create multiple communicators.
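As a sketch of the scenario that benefits, consider a rank that creates two communicators over the same set of GPUs, for example one per parallelism dimension. The helper and unique-ID plumbing are illustrative, not from the article:

```c
#include <nccl.h>

/* Sketch: both calls must discover the machine topology; in NCCL 2.22,
 * eliminating redundant topology queries makes the repeated
 * ncclCommInitRank calls up to 90% faster than in earlier releases.
 * Assumes id1/id2 were produced by ncclGetUniqueId on rank 0 and
 * broadcast to all ranks beforehand. */
void init_two_comms(int nranks, int rank, ncclUniqueId id1, ncclUniqueId id2,
                    ncclComm_t* dataComm, ncclComm_t* modelComm) {
  ncclCommInitRank(dataComm, nranks, id1, rank);
  ncclCommInitRank(modelComm, nranks, id2, rank);
}
```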

New Tuner Plugin Interface

The new tuner plugin interface (v3) provides a per-collective 2D cost table, reporting the estimated time needed for operations. This allows external tuners to optimize algorithm and protocol combinations for better performance.

Static Plugin Linking

For convenience and to avoid loading issues, NCCL 2.22 supports static linking of network or tuner plugins. Applications can specify this by setting NCCL_NET_PLUGIN or NCCL_TUNER_PLUGIN to STATIC_PLUGIN.
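In practice this is a one-line configuration per plugin, using the variables named above:

```shell
# Tell NCCL 2.22 that the network and tuner plugins are statically linked
# into the application binary rather than loaded via dlopen at runtime.
export NCCL_NET_PLUGIN=STATIC_PLUGIN
export NCCL_TUNER_PLUGIN=STATIC_PLUGIN
```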

Group Semantics for Abort or Destroy

NCCL 2.22 introduces group semantics for ncclCommDestroy and ncclCommAbort, allowing multiple communicators to be destroyed simultaneously. This feature aims to prevent deadlocks and improve user experience.
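A minimal sketch of the new group semantics, assuming comms holds n previously created communicators (the helper name is illustrative):

```c
#include <nccl.h>

/* Sketch: tear down several communicators in one group. With NCCL 2.22's
 * group semantics, the destroy calls are batched between ncclGroupStart
 * and ncclGroupEnd, avoiding the ordering deadlocks that destroying
 * interdependent communicators one at a time could cause. */
void destroy_comms(ncclComm_t* comms, int n) {
  ncclGroupStart();
  for (int i = 0; i < n; i++)
    ncclCommDestroy(comms[i]);
  ncclGroupEnd();
}
```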

IB Router Support

With this release, NCCL can operate across different InfiniBand subnets, enhancing communication for larger networks. The library automatically detects and establishes connections between endpoints on different subnets, using FLID for higher performance and adaptive routing.

Bug Fixes and Minor Updates

The NCCL 2.22 release also includes several bug fixes and minor updates:

  • Support for the allreduce tree algorithm on DGX Google Cloud.
  • Logging of NIC names in IB async errors.
  • Improved performance of registered send and receive operations.
  • Added infrastructure code for NVIDIA Trusted Computing Solutions.
  • Separate traffic class for IB and RoCE control messages to enable advanced QoS.
  • Support for PCI peer-to-peer communications across partitioned Broadcom PCI switches.

Summary

The NCCL 2.22 release introduces several significant features and optimizations aimed at improving performance and efficiency for HPC and AI applications. The improvements include a new tuner plugin interface, support for static linking of plugins, and enhanced group semantics to prevent deadlocks.

Image source: Shutterstock
