NVIDIA NVLink and NVSwitch Enhance Large Language Model Inference


Felix Pinkston
Aug 13, 2024 07:49

NVIDIA’s NVLink and NVSwitch technologies boost large language model inference, enabling faster and more efficient multi-GPU processing.


Large language models (LLMs) are expanding rapidly, necessitating increased computational power for processing inference requests. To meet real-time latency requirements and serve a growing number of users, multi-GPU computing is essential, according to the NVIDIA Technical Blog.

Benefits of Multi-GPU Computing

Even if a large model fits within a single state-of-the-art GPU’s memory, the rate at which tokens are generated depends on the total compute power available. Combining the capabilities of multiple cutting-edge GPUs makes real-time user experiences possible. Techniques like tensor parallelism (TP) allow for fast processing of inference requests, optimizing both user experience and cost by carefully selecting the number of GPUs for each model.
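To make the TP idea concrete, below is a minimal NumPy sketch that simulates a row-parallel split of one linear layer across hypothetical GPUs; the sizes and the shard-then-sum pattern are illustrative assumptions, not NVIDIA's implementation.

```python
import numpy as np

# Minimal sketch of tensor parallelism (TP) for one linear layer,
# simulated on CPU with NumPy. Each shard stands in for one GPU.
# All sizes are illustrative, not taken from any specific model.
hidden = 1024          # layer width (illustrative)
tp_degree = 4          # number of GPUs the layer is split across

rng = np.random.default_rng(0)
x = rng.standard_normal((1, hidden)).astype(np.float32)  # one token's activations
W = rng.standard_normal((hidden, hidden)).astype(np.float32)

# Row-parallel split: each GPU holds hidden/tp_degree rows of W plus the
# matching slice of x, computes a partial product, and the partials are
# then summed across GPUs (the all-reduce step).
W_shards = np.split(W, tp_degree, axis=0)
x_shards = np.split(x, tp_degree, axis=1)

partials = [xs @ ws for xs, ws in zip(x_shards, W_shards)]  # one matmul per GPU
y_tp = np.sum(partials, axis=0)  # stands in for an NCCL all-reduce over NVLink

# The sharded result matches the single-device computation.
assert np.allclose(y_tp, x @ W, atol=1e-3)
print("TP result matches single-device matmul")
```

On real hardware, that final summation is a collective operation carried over NVLink, which is exactly the traffic discussed in the next section.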

Multi-GPU Inference: Communication-Intensive

Multi-GPU TP inference involves splitting each model layer’s calculations across multiple GPUs. The GPUs must communicate extensively, sharing results before proceeding to the next model layer. Keeping this communication fast is critical: while they wait for data, Tensor Cores sit idle. For instance, a single query to Llama 3.1 70B may require up to 20 GB of data transfer per GPU, highlighting the need for a high-bandwidth interconnect.
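A back-of-envelope calculation shows where a figure like that can come from. In the sketch below, the hidden size and layer count are Llama 3.1 70B's published dimensions, while the fp16 precision, the two-all-reduces-per-layer pattern, and the 8K-token prompt are illustrative assumptions.

```python
# Rough estimate of TP all-reduce traffic for one long-context request.
hidden_size = 8192        # Llama 3.1 70B hidden dimension
num_layers = 80           # Llama 3.1 70B transformer layers
bytes_per_elem = 2        # fp16 activations (assumption)
allreduces_per_layer = 2  # after attention and after the MLP (typical TP pattern)
prompt_tokens = 8192      # long-context prompt (assumption)

per_token = hidden_size * bytes_per_elem * allreduces_per_layer * num_layers
total = per_token * prompt_tokens

print(f"per-token all-reduce volume: {per_token / 1e6:.2f} MB")  # ~2.62 MB
print(f"prompt-phase volume:         {total / 1e9:.1f} GB")      # ~21.5 GB
```

Under these assumptions, an 8K-token prompt lands in the same ballpark as the roughly 20 GB per-GPU figure quoted above.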

NVSwitch: Key for Fast Multi-GPU LLM Inference

Effective multi-GPU scaling requires GPUs with high per-GPU interconnect bandwidth and fast connectivity between every pair of GPUs. NVIDIA Hopper architecture GPUs, equipped with fourth-generation NVLink, can communicate at 900 GB/s. When combined with NVSwitch, every GPU in a server can communicate at this speed simultaneously, ensuring non-blocking communication. Systems like NVIDIA HGX H100 and H200, featuring multiple NVSwitch chips, provide significant bandwidth, enhancing overall performance.
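For a rough sense of what 900 GB/s buys, the sketch below times a ring all-reduce across an eight-GPU node. The 2(N-1)/N traffic factor is the standard ring all-reduce cost; treating the quoted bandwidth as fully achievable is a simplifying assumption, since real-world collective efficiency is lower.

```python
# Rough timing model for a ring all-reduce on an NVSwitch-connected node.
def allreduce_time_ms(message_gb: float, n_gpus: int, bw_gb_s: float) -> float:
    # Each GPU moves 2 * (N - 1) / N times the message size in a ring all-reduce.
    traffic_gb = 2 * (n_gpus - 1) / n_gpus * message_gb
    return traffic_gb / bw_gb_s * 1e3

for msg_gb in (0.1, 1.0, 10.0):
    print(f"{msg_gb:5.1f} GB all-reduce on 8 GPUs @ 900 GB/s: "
          f"{allreduce_time_ms(msg_gb, 8, 900):6.2f} ms")
```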

Performance Comparisons

Without NVSwitch, GPUs must split their bandwidth across multiple point-to-point connections, so communication slows as more GPUs are involved. For example, a point-to-point architecture provides only 128 GB/s of bandwidth between any pair of GPUs, whereas NVSwitch offers 900 GB/s. This difference substantially impacts overall inference throughput and user experience. Tables in the original blog illustrate the bandwidth and throughput benefits of NVSwitch over point-to-point connections.
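Applying those two figures to the roughly 20 GB per-GPU transfer from the Llama 3.1 70B example gives a feel for the gap. This is pure bandwidth arithmetic; real systems overlap some communication with compute.

```python
# Time to move ~20 GB per GPU at point-to-point vs. NVSwitch speeds.
transfer_gb = 20
for label, bw_gb_s in (("point-to-point", 128), ("NVSwitch", 900)):
    print(f"{label:>15}: {transfer_gb / bw_gb_s * 1e3:6.1f} ms")
# Prints roughly: point-to-point 156.2 ms, NVSwitch 22.2 ms.
```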

Future Innovations

NVIDIA continues to innovate with NVLink and NVSwitch technologies to push real-time inference performance boundaries. The upcoming NVIDIA Blackwell architecture will feature fifth-generation NVLink, doubling speeds to 1,800 GB/s. Additionally, new NVSwitch chips and NVLink switch trays will enable larger NVLink domains, further enhancing performance for trillion-parameter models.

The NVIDIA GB200 NVL72 system, connecting 36 NVIDIA Grace CPUs and 72 NVIDIA Blackwell GPUs, exemplifies these advancements. This system allows all 72 GPUs to function as a single unit, achieving 30x faster real-time trillion-parameter inference compared to previous generations.

