NVIDIA Enhances RDMA Performance with DOCA GPUNetIO


NVIDIA has unveiled new capabilities for its DOCA GPUNetIO library, enabling GPU-accelerated Remote Direct Memory Access (RDMA) for real-time inline GPU packet processing. This enhancement leverages technologies such as GPUDirect RDMA and GPUDirect Async, allowing a CUDA kernel to directly communicate with the network interface card (NIC), bypassing the CPU. This update aims to improve GPU-centric applications by reducing latency and CPU utilization, according to the NVIDIA Technical Blog.

Enhanced RDMA Functionality

Previously, DOCA GPUNetIO, along with DOCA Ethernet and DOCA Flow, was used for packet transmissions over the Ethernet transport layer. The latest update, DOCA 2.7, introduces a new set of APIs that enable RDMA communications directly from a GPU CUDA kernel using RoCE or InfiniBand transport layers. This development allows for high-throughput, low-latency data transfers by enabling the GPU to control the data path of the RDMA application.
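
At a high level, the CPU still creates and connects the RDMA object during setup and then hands a GPU handle to a CUDA kernel, which drives the data path from that point on. The sketch below illustrates that split. The function names follow the DOCA 2.7 RDMA/GPUNetIO naming but are reproduced from memory; treat them as approximate, and consult the DOCA GPUNetIO documentation for the exact signatures.

```cuda
// Host-side setup sketch (approximate DOCA API names, illustration only).
// Error handling, device opening, and peer connection details are omitted.
struct doca_gpu *gpu;                  // GPU device handle
struct doca_rdma *rdma;                // RDMA context managed by the CPU
struct doca_gpu_dev_rdma *rdma_gpu;    // device-side handle usable in CUDA kernels

doca_gpu_create("0000:17:00.0", &gpu);                       // GPU PCIe address (example value)
doca_rdma_create(doca_device, &rdma);                        // doca_device: an opened ConnectX device
doca_ctx_set_datapath_on_gpu(doca_rdma_as_ctx(rdma), gpu);   // run the data path on the GPU
doca_ctx_start(doca_rdma_as_ctx(rdma));
// ... exchange connection details with the remote peer and connect the RDMA object ...
doca_rdma_get_gpu_handle(rdma, &rdma_gpu);

// Hypothetical kernel that posts RDMA operations from the GPU (see the sketches below).
rdma_kernel<<<1, 256>>>(rdma_gpu /*, local and remote buffer handles */);
```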

RDMA GPU Data Path

RDMA allows direct access between the main memory of two hosts without involving the operating system, cache, or storage. This is achieved by registering and sharing a local memory area with the remote host, enabling high-throughput and low-latency data transfers. The process involves three fundamental steps: local configuration, exchange of information, and data path execution.
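
For reference, the same three steps look roughly like this on the CPU with the standard IB Verbs API, which the performance comparison below uses as a baseline. This is a minimal sketch with error handling omitted; the remote_info structure and the out-of-band exchange of its contents are placeholders for illustration.

```cuda
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

// Step 1: local configuration -- register a local memory area with the NIC.
struct ibv_mr *local_configuration(struct ibv_pd *pd, void *buf, size_t len)
{
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}

// Step 2: exchange of information -- send the buffer address and rkey to the
// remote peer over any out-of-band channel (e.g. a TCP socket or RDMA CM).
struct remote_info { uint64_t addr; uint32_t rkey; };

// Step 3: data path execution -- post an RDMA write to the registered area of
// the remote host. On the CPU this is an ibv_post_send() with IBV_WR_RDMA_WRITE.
int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len,
               const struct remote_info *peer)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)buf;
    sge.length = (uint32_t)len;
    sge.lkey   = mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = peer->addr;
    wr.wr.rdma.rkey        = peer->rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```

With GPUNetIO, steps 1 and 2 remain on the CPU at setup time; only step 3, the data path, moves into a CUDA kernel.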

With the new GPUNetIO RDMA functions, the application can manage the RDMA data path on the GPU, executing the data path step with a CUDA kernel instead of the CPU. This reduces latency and frees up CPU cycles, allowing the GPU to be the main controller of the application.
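
Concretely, the data path step becomes a device-side call inside a CUDA kernel rather than a verbs call on the CPU. The kernel below is a minimal sketch of that idea; the doca_gpu_dev_* names follow the DOCA GPUNetIO device API, but their exact signatures and threading rules are assumptions here and should be checked against the DOCA documentation.

```cuda
// Minimal data-path sketch (approximate DOCA GPUNetIO device API names).
// Each thread enqueues one RDMA write; thread 0 then commits the batch to the
// NIC and waits for completion, all without CPU involvement.
__global__ void rdma_write_kernel(struct doca_gpu_dev_rdma *rdma_gpu,
                                  struct doca_gpu_buf_arr *local_buf_arr,
                                  struct doca_gpu_buf_arr *remote_buf_arr,
                                  size_t msg_size)
{
    struct doca_gpu_buf *lbuf, *rbuf;

    // Each CUDA thread selects its own local/remote buffer pair.
    doca_gpu_dev_buf_get_buf(local_buf_arr, threadIdx.x, &lbuf);
    doca_gpu_dev_buf_get_buf(remote_buf_arr, threadIdx.x, &rbuf);

    // Enqueue an RDMA write of msg_size bytes to the remote buffer.
    doca_gpu_dev_rdma_write_strong(rdma_gpu, rbuf, 0 /* remote offset */,
                                   lbuf, 0 /* local offset */,
                                   msg_size, 0 /* immediate */,
                                   DOCA_GPU_RDMA_WRITE_FLAG_NONE);

    __syncthreads();

    // A single thread pushes the queued writes to the NIC and waits for them.
    if (threadIdx.x == 0) {
        doca_gpu_dev_rdma_commit_strong(rdma_gpu);
        doca_gpu_dev_rdma_wait_all(rdma_gpu);
    }
}
```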

Performance Comparison

NVIDIA has conducted performance comparisons between GPUNetIO RDMA functions and IB Verbs RDMA functions using the perftest microbenchmark suite. The tests were executed on a Dell R750 machine with an NVIDIA H100 GPU and a ConnectX-7 network card. The results show that DOCA GPUNetIO RDMA performance is comparable to IB Verbs perftest, with both methods achieving similar peak bandwidth and elapsed times.

For the performance tests, parameters were set to 1 RDMA queue, 2,048 iterations, and 512 RDMA writes per iteration, with message sizes ranging from 64 to 4,096 bytes. When the number of queues was increased to four, both implementations reached up to 16 GB/s in peak bandwidth, demonstrating the scalability and efficiency of the new GPUNetIO RDMA functions.

Benefits and Applications

The architectural choice of offloading RDMA data path control to the GPU offers several benefits:

  • Scalability: Capable of managing multiple RDMA queues in parallel.
  • Parallelism: High degree of parallelism, with several CUDA threads working simultaneously (see the sketch after this list).
  • Lower CPU Utilization: Platform-independent performance with minimal CPU involvement.
  • Reduced Bus Transactions: Fewer internal bus transactions, as the CPU is no longer responsible for data synchronization.
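
As a rough illustration of the scalability and parallelism points, a GPU-controlled data path can dedicate one CUDA block to each RDMA queue, so a single kernel launch drives several queues in parallel. As before, the doca_gpu_dev_* calls are approximate names, and the per-queue handle array is a hypothetical structure prepared on the host for this sketch.

```cuda
// Scalability sketch: one CUDA block per RDMA queue, all queues progressing in
// parallel within one kernel launch. API names/signatures are approximate.
__global__ void multi_queue_kernel(struct doca_gpu_dev_rdma **rdma_queues, // hypothetical: one handle per queue
                                   struct doca_gpu_buf_arr *local_buf_arr,
                                   struct doca_gpu_buf_arr *remote_buf_arr,
                                   size_t msg_size, int writes_per_queue)
{
    struct doca_gpu_dev_rdma *rdma_gpu = rdma_queues[blockIdx.x];
    struct doca_gpu_buf *lbuf, *rbuf;

    // Threads of this block cooperatively enqueue the writes for their queue.
    for (int i = threadIdx.x; i < writes_per_queue; i += blockDim.x) {
        int buf_idx = blockIdx.x * writes_per_queue + i;  // per-queue buffer slice
        doca_gpu_dev_buf_get_buf(local_buf_arr, buf_idx, &lbuf);
        doca_gpu_dev_buf_get_buf(remote_buf_arr, buf_idx, &rbuf);
        doca_gpu_dev_rdma_write_strong(rdma_gpu, rbuf, 0, lbuf, 0, msg_size, 0,
                                       DOCA_GPU_RDMA_WRITE_FLAG_NONE);
    }
    __syncthreads();

    // One thread per block commits its queue to the NIC and waits for completion.
    if (threadIdx.x == 0) {
        doca_gpu_dev_rdma_commit_strong(rdma_gpu);
        doca_gpu_dev_rdma_wait_all(rdma_gpu);
    }
}
```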

This update is particularly beneficial for network applications where data processing occurs on the GPU, enabling more efficient and scalable solutions. For more details, visit the NVIDIA Technical Blog.

Image source: Shutterstock
