NVIDIA Enhances RDMA Performance with DOCA GPUNetIO

NVIDIA has unveiled new capabilities for its DOCA GPUNetIO library, enabling GPU-accelerated Remote Direct Memory Access (RDMA) for real-time inline GPU packet processing. This enhancement leverages technologies such as GPUDirect RDMA and GPUDirect Async, allowing a CUDA kernel to communicate directly with the network interface card (NIC), bypassing the CPU. This update aims to improve GPU-centric applications by reducing latency and CPU utilization, according to the NVIDIA Technical Blog.

Enhanced RDMA Functionality

Previously, DOCA GPUNetIO, along with DOCA Ethernet and DOCA Flow, was used for packet transmissions over the Ethernet transport layer. The latest update, DOCA 2.7, introduces a new set of APIs that enable RDMA communications directly from a GPU CUDA kernel using RoCE or InfiniBand transport layers. This development allows for high-throughput, low-latency data transfers by enabling the GPU to control the data path of the RDMA application.

RDMA GPU Data Path

RDMA allows direct access between the main memory of two hosts without involving the operating system, cache, or storage. This is achieved by registering and sharing a local memory area with the remote host, enabling high-throughput and low-latency data transfers. The process involves three fundamental steps: local configuration, exchange of information, and data path execution.
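The three steps above can be illustrated with a toy, in-process model. This is only a sketch of the concept: it uses no RDMA hardware and no DOCA or IB Verbs calls, and the names `Host`, `MemoryRegion`, `register`, `describe`, `remote_write`, and `rdma_write` are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass, field
import secrets


@dataclass
class MemoryRegion:
    """A registered buffer plus the key a peer needs to access it (toy model)."""
    buffer: bytearray
    rkey: int


@dataclass
class Host:
    name: str
    regions: dict = field(default_factory=dict)  # rkey -> MemoryRegion

    # Step 1: local configuration. Register a buffer for remote access.
    def register(self, size: int) -> MemoryRegion:
        region = MemoryRegion(bytearray(size), secrets.randbits(32))
        self.regions[region.rkey] = region
        return region

    # Step 2: exchange of information. Only this small descriptor is shared
    # with the peer, typically out of band (for example over a TCP socket).
    @staticmethod
    def describe(region: MemoryRegion) -> dict:
        return {"rkey": region.rkey, "length": len(region.buffer)}

    # Step 3: data path. Stands in for the NIC placing bytes straight into
    # registered memory; the key and bounds are checked before writing.
    def remote_write(self, rkey: int, offset: int, payload: bytes) -> None:
        region = self.regions.get(rkey)
        if region is None:
            raise PermissionError("unknown rkey")
        if offset < 0 or offset + len(payload) > len(region.buffer):
            raise ValueError("write outside registered region")
        region.buffer[offset:offset + len(payload)] = payload


def rdma_write(target: Host, descriptor: dict, offset: int, payload: bytes) -> None:
    """Initiator side of a one-sided write, using only the exchanged descriptor."""
    target.remote_write(descriptor["rkey"], offset, payload)


if __name__ == "__main__":
    server = Host("server")
    region = server.register(16)             # step 1
    descriptor = Host.describe(region)       # step 2
    rdma_write(server, descriptor, 0, b"hello")  # step 3
    print(bytes(region.buffer[:5]))
```

In a real deployment, steps one and two stay on the CPU; only the third step is the part that can be moved elsewhere.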

With the new GPUNetIO RDMA functions, an application can run its RDMA data path on the GPU, executing the data path step with a CUDA kernel instead of on the CPU. This reduces latency and frees up CPU cycles, allowing the GPU to be the main controller of the application.

Performance Comparison

NVIDIA has conducted performance comparisons between GPUNetIO RDMA functions and IB Verbs RDMA functions using the perftest microbenchmark suite. The tests were executed on a Dell R750 machine with an NVIDIA H100 GPU and a ConnectX-7 network card. The results show that DOCA GPUNetIO RDMA performance is comparable to IB Verbs perftest, with both methods achieving similar peak bandwidth and elapsed times.

For the performance tests, parameters were set to 1 RDMA queue, 2,048 iterations, and 512 RDMA writes per iteration, with message sizes ranging from 64 to 4,096 bytes. When the test was scaled up to four queues, both implementations reached up to 16 GB/s in peak bandwidth, demonstrating the scalability and efficiency of the new GPUNetIO RDMA functions.
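To get a feel for the scale of these parameters, the arithmetic can be worked through in a few lines. This sketch makes several assumptions that the article does not state: message sizes step in powers of two, 16 GB/s means 16 × 10^9 bytes per second, and the time shown is simply the data volume of one queue divided by that peak figure. It is not a measured or reported result.

```python
ITERATIONS = 2_048
WRITES_PER_ITERATION = 512
PEAK_BANDWIDTH_BYTES_PER_S = 16e9  # assumption: decimal gigabytes


def message_sizes(smallest=64, largest=4_096):
    """Yield message sizes from smallest to largest, doubling each time (assumed)."""
    size = smallest
    while size <= largest:
        yield size
        size *= 2


def total_writes():
    """RDMA writes issued on one queue over the whole run."""
    return ITERATIONS * WRITES_PER_ITERATION


def bytes_moved(message_size):
    """Payload bytes one queue transfers for a given message size."""
    return total_writes() * message_size


def seconds_at_peak(message_size, bandwidth=PEAK_BANDWIDTH_BYTES_PER_S):
    """Idealized transfer time if that volume moved at the peak bandwidth."""
    return bytes_moved(message_size) / bandwidth


print(f"writes per queue: {total_writes():,}")
for size in message_sizes():
    print(f"{size:>5} B  {bytes_moved(size):>13,} bytes  "
          f"{seconds_at_peak(size) * 1e3:7.2f} ms at peak")
```

With these settings, one queue issues 1,048,576 writes, which at 4,096 bytes per message is 4,294,967,296 bytes of payload.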

Benefits and Applications

The architectural choice of offloading RDMA data path control to the GPU offers several benefits:

Scalability: Capable of managing multiple RDMA queues in parallel.
Parallelism: High degree of parallelism with several CUDA threads working simultaneously.
Lower CPU Utilization: Platform-independent performance with minimal CPU involvement.
Reduced Bus Transactions: Fewer internal bus transactions, as the CPU is no longer responsible for data synchronization.

This update is particularly beneficial for network applications where data processing occurs on the GPU, enabling more efficient and scalable solutions. For more details, visit the NVIDIA Technical Blog.

