NVIDIA Unveils NCCL 2.22 with Enhanced Memory Efficiency and Faster Initialization


Caroline Bishop
Sep 21, 2024 13:38

NVIDIA introduces NCCL 2.22, focusing on memory efficiency, faster initialization, and cost estimation for improved HPC and AI applications.


The NVIDIA Collective Communications Library (NCCL) has released its latest version, NCCL 2.22, bringing significant enhancements aimed at optimizing memory usage, accelerating initialization times, and introducing a cost estimation API. These updates are crucial for high-performance computing (HPC) and artificial intelligence (AI) applications, according to the NVIDIA Technical Blog.

Release Highlights

NVIDIA Magnum IO NCCL is designed to optimize inter-GPU and multi-node communication, which is essential for efficient parallel computing. Key features of the NCCL 2.22 release include:


  • Lazy Connection Establishment: Delays the creation of connections until they are needed, significantly reducing GPU memory overhead.
  • New API for Cost Estimation: A new API helps optimize compute and communication overlap or research the NCCL cost model.
  • Optimizations for ncclCommInitRank: Eliminates redundant topology queries, speeding up initialization by up to 90% for applications creating multiple communicators.
  • Support for Multiple Subnets with IB Router: Adds support for communication in jobs spanning multiple InfiniBand subnets, enabling larger DL training jobs.

Features in Detail

Lazy Connection Establishment

NCCL 2.22 introduces lazy connection establishment, which significantly reduces GPU memory usage by delaying the creation of connections until they are actually needed. This is particularly beneficial for applications with a narrow usage pattern, such as running the same algorithm repeatedly, since connections for unused algorithms are never allocated. The feature is enabled by default but can be disabled by setting NCCL_RUNTIME_CONNECT=0.
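For example, an application that prefers the previous eager behavior can opt out via the environment, using the variable named in the release notes:

```shell
# Disable lazy connection establishment (enabled by default in NCCL 2.22),
# restoring eager connection setup at communicator creation time.
export NCCL_RUNTIME_CONNECT=0
```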

New Cost Model API

The new API, ncclGroupSimulateEnd, allows developers to estimate the time required for operations, aiding in the optimization of compute and communication overlap. While the estimates may not perfectly align with reality, they provide a useful guideline for performance tuning.
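As an illustration, here is a minimal sketch of estimating the cost of an all-reduce with the simulation API. The helper name and surrounding setup are illustrative, not from the article, and the exact ncclSimInfo_t fields should be checked against the NCCL 2.22 headers:

```c
#include <nccl.h>

/* Sketch only: assumes an initialized communicator `comm`, a CUDA stream
 * `stream`, and device buffers `sendbuf`/`recvbuf` holding `count` floats. */
float estimate_allreduce_time(ncclComm_t comm, cudaStream_t stream,
                              const float* sendbuf, float* recvbuf,
                              size_t count) {
  ncclSimInfo_t sim = NCCL_SIM_INFO_INITIALIZER;
  ncclGroupStart();
  /* Enqueue the collective to be simulated rather than executed. */
  ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
  /* Close the group with the simulation call instead of ncclGroupEnd:
   * NCCL fills in an estimated execution time without launching the ops. */
  ncclGroupSimulateEnd(&sim);
  return sim.estimatedTime;
}
```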

Initialization Optimizations

To minimize initialization overhead, the NCCL team has introduced several optimizations, including lazy connection establishment and intra-node topology fusion. These improvements can reduce ncclCommInitRank execution time by up to 90%, making it significantly faster for applications that create multiple communicators.
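As a sketch of the scenario that benefits, consider a rank that creates two communicators over the same set of GPUs, for example one per parallelism dimension. The helper and unique-ID plumbing are illustrative, not from the article:

```c
#include <nccl.h>

/* Sketch: both calls must discover the machine topology; in NCCL 2.22,
 * eliminating redundant topology queries makes the repeated
 * ncclCommInitRank calls up to 90% faster than in earlier releases.
 * Assumes id1/id2 were produced by ncclGetUniqueId on rank 0 and
 * broadcast to all ranks beforehand. */
void init_two_comms(int nranks, int rank, ncclUniqueId id1, ncclUniqueId id2,
                    ncclComm_t* dataComm, ncclComm_t* modelComm) {
  ncclCommInitRank(dataComm, nranks, id1, rank);
  ncclCommInitRank(modelComm, nranks, id2, rank);
}
```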

New Tuner Plugin Interface

The new tuner plugin interface (v3) provides a per-collective 2D cost table, reporting the estimated time needed for operations. This allows external tuners to optimize algorithm and protocol combinations for better performance.

Static Plugin Linking

For convenience and to avoid loading issues, NCCL 2.22 supports static linking of network or tuner plugins. Applications can specify this by setting NCCL_NET_PLUGIN or NCCL_TUNER_PLUGIN to STATIC_PLUGIN.
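In practice this is a one-line configuration per plugin, using the variables named above:

```shell
# Tell NCCL 2.22 that the network and tuner plugins are statically linked
# into the application binary rather than loaded via dlopen at runtime.
export NCCL_NET_PLUGIN=STATIC_PLUGIN
export NCCL_TUNER_PLUGIN=STATIC_PLUGIN
```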

Group Semantics for Abort or Destroy

NCCL 2.22 introduces group semantics for ncclCommDestroy and ncclCommAbort, allowing multiple communicators to be destroyed simultaneously. This feature aims to prevent deadlocks and improve user experience.
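A minimal sketch of the new group semantics, assuming comms holds n previously created communicators (the helper name is illustrative):

```c
#include <nccl.h>

/* Sketch: tear down several communicators in one group. With NCCL 2.22's
 * group semantics, the destroy calls are batched between ncclGroupStart
 * and ncclGroupEnd, avoiding the ordering deadlocks that destroying
 * interdependent communicators one at a time could cause. */
void destroy_comms(ncclComm_t* comms, int n) {
  ncclGroupStart();
  for (int i = 0; i < n; i++)
    ncclCommDestroy(comms[i]);
  ncclGroupEnd();
}
```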

IB Router Support

With this release, NCCL can operate across different InfiniBand subnets, enhancing communication for larger networks. The library automatically detects and establishes connections between endpoints on different subnets, using FLID for higher performance and adaptive routing.

Bug Fixes and Minor Updates

The NCCL 2.22 release also includes several bug fixes and minor updates:

  • Support for the allreduce tree algorithm on DGX Google Cloud.
  • Logging of NIC names in IB async errors.
  • Improved performance of registered send and receive operations.
  • Added infrastructure code for NVIDIA Trusted Computing Solutions.
  • Separate traffic class for IB and RoCE control messages to enable advanced QoS.
  • Support for PCI peer-to-peer communications across partitioned Broadcom PCI switches.

Summary

The NCCL 2.22 release introduces several significant features and optimizations aimed at improving performance and efficiency for HPC and AI applications. The improvements include a new tuner plugin interface, support for static linking of plugins, and enhanced group semantics to prevent deadlocks.

Image source: Shutterstock
