NVIDIA Unveils Grouped GEMM APIs in cuBLAS 12.5 to Boost DL and HPC Performance



The latest release of the NVIDIA cuBLAS library, version 12.5, brings significant updates aimed at enhancing the functionality and performance of deep learning (DL) and high-performance computing (HPC) workloads, according to the NVIDIA Technical Blog. Key updates include the introduction of Grouped GEMM APIs, improved matrix multiplication (matmul) performance on NVIDIA Hopper (H100 and H200) and Ada (L40S) GPUs, and enhanced performance tuning options.

Grouped GEMM APIs

The newly introduced Grouped GEMM APIs generalize the batched APIs by allowing different matrix sizes, transpositions, and scaling factors to be grouped and executed in one kernel launch. This approach has shown a 1.2x speedup in certain scenarios, such as the generation phase of a mixture-of-experts (MoE) model with batch sizes of 8 and 64 and FP16 inputs and outputs.

Two new sets of APIs support Grouped GEMM:


  1. cublas<t>gemmGroupedBatched for FP32 (including TF32) and FP64 precisions.
  2. cublasGemmGroupedBatchedEx for FP16, BF16, FP32 (including TF32), and FP64 precisions.

These APIs support variable shapes, transpositions, and scaling factors. Examples can be found on the NVIDIA/CUDALibrarySamples GitHub repository.
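
As a rough illustration of how the new interface is laid out, the sketch below (not taken from the article or the samples) launches two groups of FP32 GEMMs with different shapes and transpositions in a single call. The handle, device pointers, and problem sizes are placeholder assumptions, and the parameter order follows the cublasSgemmGroupedBatched signature documented for cuBLAS 12.5; consult the documentation and the linked samples for authoritative usage.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Launch two groups of FP32 GEMMs in one call.
// Group 0: 4 problems of 128x128x64 (no transpose).
// Group 1: 2 problems of 256x64x256 (A transposed).
// dA, dB, dC hold one device pointer per problem (6 in total), allocated elsewhere.
void run_grouped_gemm(cublasHandle_t handle,
                      const std::vector<const float*>& dA,
                      const std::vector<const float*>& dB,
                      const std::vector<float*>&       dC)
{
    const int group_count  = 2;
    const int group_size[] = {4, 2};   // problems per group

    // One entry per group: shapes, transpositions, and scaling factors may differ.
    const cublasOperation_t transa[] = {CUBLAS_OP_N, CUBLAS_OP_T};
    const cublasOperation_t transb[] = {CUBLAS_OP_N, CUBLAS_OP_N};
    const int   m[]     = {128, 256};
    const int   n[]     = {128,  64};
    const int   k[]     = { 64, 256};
    const int   lda[]   = {128, 256};
    const int   ldb[]   = { 64, 256};
    const int   ldc[]   = {128, 256};
    const float alpha[] = {1.0f, 1.0f};
    const float beta[]  = {0.0f, 0.0f};

    // The pointer arrays are flattened across all problems in all groups.
    cublasStatus_t status = cublasSgemmGroupedBatched(
        handle,
        transa, transb,
        m, n, k,
        alpha,
        dA.data(), lda,
        dB.data(), ldb,
        beta,
        dC.data(), ldc,
        group_count, group_size);
    // Check status against CUBLAS_STATUS_SUCCESS in real code.
    (void)status;
}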

Latest LLM Matmul Performance on NVIDIA H100, H200, and L40S GPUs

Recent performance snapshots show significant speedups for Llama 2 70B and GPT3 training phases on NVIDIA H100, H200, and L40S GPUs. The H200 GPU, in particular, demonstrates nearly 3x and 5x speedups compared to the A100 for Llama 2 70B and GPT3 training phases, respectively. These improvements are measured without locking GPU clocks and account for the number of times each GEMM is repeated in the workload.

Figure 1. Speedup of the GEMM-only fraction of e2e workloads

Library Performance and Benchmarking

Several enhancements have been made to runtime performance heuristics and performance tuning APIs. The cuBLAS library uses a recommender system at runtime to dispatch the fastest available configuration for any user-requested matmul. This system is trained on actual timing data from a wide range of problems and configurations.

Figure 2. Sampling of various GEMMs using multiple configurations in different kernel families

For advanced users, the cublasLtMatmulAlgoGetHeuristic API enables performance tuning to achieve faster implementations. Examples of auto-tuning in cuBLAS can be found on the NVIDIA/CUDALibrarySamples repository.
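
The following sketch (illustrative only, not from the article) shows one common auto-tuning pattern built on this API: request several candidate algorithms for a single FP32 matmul, time each with CUDA events, and keep the fastest. The descriptor, buffer, and workspace names are assumptions; the linked samples show the complete setup.

#include <cublasLt.h>
#include <cuda_runtime.h>
#include <cfloat>

// Ask the heuristics for several candidate algorithms for one FP32 matmul,
// time each candidate with CUDA events, and return the fastest one.
cublasLtMatmulAlgo_t pick_fastest_algo(cublasLtHandle_t lt,
                                       cublasLtMatmulDesc_t opDesc,
                                       cublasLtMatrixLayout_t Adesc,
                                       cublasLtMatrixLayout_t Bdesc,
                                       cublasLtMatrixLayout_t Cdesc,
                                       const float* dA, const float* dB, float* dC,
                                       void* workspace, size_t workspaceBytes,
                                       cudaStream_t stream)
{
    // Restrict the search to algorithms that fit in the available workspace.
    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);
    cublasLtMatmulPreferenceSetAttribute(pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
                                         &workspaceBytes, sizeof(workspaceBytes));

    // Request up to 8 candidate configurations.
    const int kRequested = 8;
    cublasLtMatmulHeuristicResult_t results[kRequested] = {};
    int returned = 0;
    cublasLtMatmulAlgoGetHeuristic(lt, opDesc, Adesc, Bdesc, Cdesc, Cdesc,
                                   pref, kRequested, results, &returned);
    // A real harness should verify that returned > 0 and check all statuses.

    const float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float bestMs = FLT_MAX;
    cublasLtMatmulAlgo_t bestAlgo = results[0].algo;

    // Time each candidate; real auto-tuners warm up and average several runs.
    for (int i = 0; i < returned; ++i) {
        cudaEventRecord(start, stream);
        cublasLtMatmul(lt, opDesc, &alpha, dA, Adesc, dB, Bdesc, &beta,
                       dC, Cdesc, dC, Cdesc, &results[i].algo,
                       workspace, workspaceBytes, stream);
        cudaEventRecord(stop, stream);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < bestMs) { bestMs = ms; bestAlgo = results[i].algo; }
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cublasLtMatmulPreferenceDestroy(pref);
    return bestAlgo;
}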


Figure 4. An example of auto-tuning in cuBLAS

Better Functionality and Performance in cuBLASLt

Since cuBLAS 12.0, numerous enhancements have been introduced:

  1. Fused epilogue support parity between BF16 and FP16 precisions on NVIDIA Ampere and Ada.
  2. Additional fused epilogues on NVIDIA Hopper and Ampere (a minimal usage sketch follows this list).
  3. Support for FP8 on Ada GPUs and performance updates on Ada L4, L40, and L40S.
  4. Removal of M, N, and batch size limitations of the cuBLASLt matmul API.
  5. Improved performance of the heuristics cache for workloads with a high eviction rate.
  6. cuBLAS symbols are available in the CUDA Toolkit symbols for Linux repository.
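
As a rough illustration of the fused-epilogue feature mentioned above, the sketch below (not from the article) sets a GEMM + bias + ReLU epilogue on a cuBLASLt matmul descriptor. The descriptor creation, matrix layouts, and the cublasLtMatmul call itself are assumed to be set up as in the standard cuBLASLt samples.

#include <cublasLt.h>

// Fuse the bias add and ReLU into the matmul kernel instead of running
// separate elementwise kernels afterwards. dBias is a device vector with one
// value per output row, allocated elsewhere.
void set_bias_relu_epilogue(cublasLtMatmulDesc_t opDesc, const float* dBias)
{
    cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_RELU_BIAS;
    cublasLtMatmulDescSetAttribute(opDesc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                   &epilogue, sizeof(epilogue));

    cublasLtMatmulDescSetAttribute(opDesc, CUBLASLT_MATMUL_DESC_BIAS_POINTER,
                                   &dBias, sizeof(dBias));
}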

For more information on cuBLAS, see the documentation and samples.


