NVIDIA Unveils Grouped GEMM APIs in cuBLAS 12.5 to Boost DL and HPC Performance
The latest release of the NVIDIA cuBLAS library, version 12.5, brings significant updates aimed at enhancing the functionality and performance of deep learning (DL) and high-performance computing (HPC) workloads, according to the NVIDIA Technical Blog. Key updates include the introduction of Grouped GEMM APIs, improved matrix multiplication (matmul) performance on NVIDIA Hopper (H100 and H200) and Ada (L40S) GPUs, and enhanced performance tuning options.
Grouped GEMM APIs
The newly introduced Grouped GEMM APIs generalize batched APIs by allowing different matrix sizes, transpositions, and scaling factors to be grouped and executed in one kernel launch. This approach has shown a 1.2x speedup in certain scenarios, such as the generation phase of a mixture-of-experts (MoE) model with batch sizes of 8 and 64 and FP16 inputs and outputs.
Two new sets of APIs support Grouped GEMM:
- cublas<t>gemmGroupedBatched for FP32 (including TF32) and FP64 precisions.
- cublasGemmGroupedBatchedEx for FP16, BF16, FP32 (including TF32), and FP64 precisions.
These APIs support variable shapes, transpositions, and scaling factors. Examples can be found in the NVIDIA/CUDALibrarySamples GitHub repository.
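The semantics that Grouped GEMM generalizes can be illustrated with a minimal NumPy sketch. This is not the cuBLAS API itself: the real APIs execute all groups in a single kernel launch, while the Python loop below only spells out the per-group math (each group with its own shapes, transpose flag, and alpha/beta scaling, which plain batched GEMM does not allow):

```python
import numpy as np

rng = np.random.default_rng(0)

# Each group carries its own problem description -- the generalization
# over batched GEMM, which requires uniform sizes across the batch.
groups = [
    # (m, n, k, transa, alpha, beta)
    (4, 8, 16, False, 1.0, 0.0),
    (32, 2, 5, True, 0.5, 1.0),
]

def grouped_gemm(groups, As, Bs, Cs):
    """Compute C = alpha * op(A) @ B + beta * C for every group.

    cuBLAS performs all groups in one kernel launch; this loop only
    expresses the per-group arithmetic.
    """
    out = []
    for (m, n, k, transa, alpha, beta), A, B, C in zip(groups, As, Bs, Cs):
        opA = A.T if transa else A
        out.append(alpha * (opA @ B) + beta * C)
    return out

As = [rng.standard_normal((k, m) if ta else (m, k))
      for (m, n, k, ta, *_) in groups]
Bs = [rng.standard_normal((k, n)) for (m, n, k, *_) in groups]
Cs = [rng.standard_normal((m, n)) for (m, n, *_) in groups]

results = grouped_gemm(groups, As, Bs, Cs)
```

Note how the second group uses a transposed A and different alpha/beta than the first; expressing this with the classic batched APIs would require separate launches per configuration.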
Latest LLM Matmul Performance on NVIDIA H100, H200, and L40S GPUs
Recent performance snapshots show significant speedups for Llama 2 70B and GPT3 training phases on NVIDIA H100, H200, and L40S GPUs. The H200 GPU, in particular, demonstrates nearly 3x and 5x speedups compared to the A100 for Llama 2 70B and GPT3 training phases, respectively. These improvements are measured without locking GPU clocks and account for the number of times each GEMM is repeated in the workload.
Library Performance and Benchmarking
Several enhancements have been made to runtime performance heuristics and performance tuning APIs. The cuBLAS library uses a recommender system at runtime to dispatch the fastest available configuration for any user-requested matmul. This system is trained on actual timing data from a wide range of problems and configurations.
For advanced users, the cublasLtMatmulAlgoGetHeuristic API enables performance tuning to achieve faster implementations. Examples of auto-tuning in cuBLAS can be found in the NVIDIA/CUDALibrarySamples repository.
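The auto-tuning pattern behind this API is straightforward: obtain a list of candidate configurations, time each one on the actual problem, and keep the fastest. A minimal Python sketch of that loop, where the candidate "algorithms" are stand-ins (not cuBLAS internals; cublasLtMatmulAlgoGetHeuristic would return real algorithm descriptors ordered by the library's heuristics):

```python
import time
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((256, 256))
B = rng.standard_normal((256, 256))

# Stand-in candidates; a real tuner would iterate over the algorithm
# descriptors returned by the heuristics API.
candidates = {
    "plain": lambda a, b: a @ b,
    "einsum": lambda a, b: np.einsum("ik,kj->ij", a, b),
    "dot": lambda a, b: np.dot(a, b),
}

def autotune(cands, a, b, repeats=5):
    """Time every candidate on the real problem and return the fastest name."""
    timings = {}
    for name, fn in cands.items():
        start = time.perf_counter()
        for _ in range(repeats):
            fn(a, b)
        timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get)

best = autotune(candidates, A, B)
```

Timing on the user's actual shapes and data types is what lets auto-tuning beat a purely predictive heuristic for unusual problem sizes.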
Better Functionality and Performance in cuBLASLt
Since cuBLAS 12.0, numerous enhancements have been introduced:
- Fused epilogue support parity between BF16 and FP16 precisions on NVIDIA Ampere and Ada.
- Additional fused epilogues on NVIDIA Hopper and Ampere.
- Support for FP8 on Ada GPUs and performance updates on Ada L4, L40, and L40S.
- Removal of M, N, and batch size limitations of the cuBLASLt matmul API.
- Improved performance of the heuristics cache for workloads with a high eviction rate.
- cuBLAS symbols are available in the CUDA Toolkit symbols for Linux repository.
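A fused epilogue folds the operation that follows a matmul, such as a bias add plus an activation, into the same kernel, avoiding an extra pass over the output. A NumPy sketch of the semantics of a bias + GELU epilogue (the fusion itself happens inside cuBLASLt; the function names here are illustrative, not library APIs):

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU, a common choice in fused epilogues."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x**3)))

def matmul_bias_gelu(A, B, bias):
    """Semantics of a GEMM with a fused bias + GELU epilogue:
    one logical operation instead of three separate passes over C."""
    return gelu(A @ B + bias)

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 16))
B = rng.standard_normal((16, 4))
bias = rng.standard_normal(4)
out = matmul_bias_gelu(A, B, bias)
```

On the GPU the win comes from memory traffic: the epilogue is applied while the GEMM tile is still in registers, so the intermediate A @ B result never makes a round trip through global memory.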
For more information on cuBLAS, see the documentation and samples.