NVIDIA Unveils Grouped GEMM APIs in cuBLAS 12.5 to Boost DL and HPC Performance



The latest release of the NVIDIA cuBLAS library, version 12.5, brings significant updates aimed at enhancing the functionality and performance of deep learning (DL) and high-performance computing (HPC) workloads, according to the NVIDIA Technical Blog. Key updates include the introduction of Grouped GEMM APIs, improved matrix multiplication (matmul) performance on NVIDIA Hopper (H100 and H200) and Ada (L40S) GPUs, and enhanced performance tuning options.

Grouped GEMM APIs

The newly introduced Grouped GEMM APIs generalize the batched APIs by allowing different matrix sizes, transpositions, and scaling factors to be grouped and executed in one kernel launch. This approach has shown a 1.2x speedup in certain scenarios, such as the generation phase of a mixture-of-experts (MoE) model with batch sizes of 8 and 64 and FP16 inputs and outputs.

Two new sets of APIs support Grouped GEMM:


  1. cublas<t>gemmGroupedBatched for FP32 (including TF32) and FP64 precisions.
  2. cublasGemmGroupedBatchedEx for FP16, BF16, FP32 (including TF32), and FP64 precisions.

These APIs support variable shapes, transpositions, and scaling factors. Examples can be found on the NVIDIA/CUDALibrarySamples GitHub repository.
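
As a rough illustration of how the new interface is laid out, the sketch below (not taken from the article or the samples) launches two groups of FP32 GEMMs with different shapes and transpositions in a single call. The handle, device pointers, and problem sizes are placeholder assumptions, and the parameter order follows the cublasSgemmGroupedBatched signature documented for cuBLAS 12.5; consult the documentation and the linked samples for authoritative usage.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Launch two groups of FP32 GEMMs in one call.
// Group 0: 4 problems of 128x128x64 (no transpose).
// Group 1: 2 problems of 256x64x256 (A transposed).
// dA, dB, dC hold one device pointer per problem (6 in total), allocated elsewhere.
void run_grouped_gemm(cublasHandle_t handle,
                      const std::vector<const float*>& dA,
                      const std::vector<const float*>& dB,
                      const std::vector<float*>&       dC)
{
    const int group_count  = 2;
    const int group_size[] = {4, 2};   // problems per group

    // One entry per group: shapes, transpositions, and scaling factors may differ.
    const cublasOperation_t transa[] = {CUBLAS_OP_N, CUBLAS_OP_T};
    const cublasOperation_t transb[] = {CUBLAS_OP_N, CUBLAS_OP_N};
    const int   m[]     = {128, 256};
    const int   n[]     = {128,  64};
    const int   k[]     = { 64, 256};
    const int   lda[]   = {128, 256};
    const int   ldb[]   = { 64, 256};
    const int   ldc[]   = {128, 256};
    const float alpha[] = {1.0f, 1.0f};
    const float beta[]  = {0.0f, 0.0f};

    // The pointer arrays are flattened across all problems in all groups.
    cublasStatus_t status = cublasSgemmGroupedBatched(
        handle,
        transa, transb,
        m, n, k,
        alpha,
        dA.data(), lda,
        dB.data(), ldb,
        beta,
        dC.data(), ldc,
        group_count, group_size);
    // Check status against CUBLAS_STATUS_SUCCESS in real code.
    (void)status;
}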

Latest LLM Matmul Performance on NVIDIA H100, H200, and L40S GPUs

Recent performance snapshots show significant speedups for Llama 2 70B and GPT3 training phases on NVIDIA H100, H200, and L40S GPUs. The H200 GPU, in particular, demonstrates nearly 3x and 5x speedups compared to the A100 for Llama 2 70B and GPT3 training phases, respectively. These improvements are measured without locking GPU clocks and account for the number of times each GEMM is repeated in the workload.

Figure 1. Speedup of the GEMM-only fraction of e2e workloads

Library Performance and Benchmarking

Several enhancements have been made to runtime performance heuristics and performance tuning APIs. The cuBLAS library uses a recommender system at runtime to dispatch the fastest available configuration for any user-requested matmul. This system is trained on actual timing data from a wide range of problems and configurations.

Figure 2. Sampling of various GEMMs using multiple configurations in different kernel families

For advanced users, the cublasLtMatmulAlgoGetHeuristic API enables performance tuning to achieve faster implementations. Examples of auto-tuning in cuBLAS can be found on the NVIDIA/CUDALibrarySamples repository.
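
The following sketch (illustrative only, not from the article) shows one common auto-tuning pattern built on this API: request several candidate algorithms for a single FP32 matmul, time each with CUDA events, and keep the fastest. The descriptor, buffer, and workspace names are assumptions; the linked samples show the complete setup.

#include <cublasLt.h>
#include <cuda_runtime.h>
#include <cfloat>

// Ask the heuristics for several candidate algorithms for one FP32 matmul,
// time each candidate with CUDA events, and return the fastest one.
cublasLtMatmulAlgo_t pick_fastest_algo(cublasLtHandle_t lt,
                                       cublasLtMatmulDesc_t opDesc,
                                       cublasLtMatrixLayout_t Adesc,
                                       cublasLtMatrixLayout_t Bdesc,
                                       cublasLtMatrixLayout_t Cdesc,
                                       const float* dA, const float* dB, float* dC,
                                       void* workspace, size_t workspaceBytes,
                                       cudaStream_t stream)
{
    // Restrict the search to algorithms that fit in the available workspace.
    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);
    cublasLtMatmulPreferenceSetAttribute(pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
                                         &workspaceBytes, sizeof(workspaceBytes));

    // Request up to 8 candidate configurations.
    const int kRequested = 8;
    cublasLtMatmulHeuristicResult_t results[kRequested] = {};
    int returned = 0;
    cublasLtMatmulAlgoGetHeuristic(lt, opDesc, Adesc, Bdesc, Cdesc, Cdesc,
                                   pref, kRequested, results, &returned);
    // A real harness should verify that returned > 0 and check all statuses.

    const float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float bestMs = FLT_MAX;
    cublasLtMatmulAlgo_t bestAlgo = results[0].algo;

    // Time each candidate; real auto-tuners warm up and average several runs.
    for (int i = 0; i < returned; ++i) {
        cudaEventRecord(start, stream);
        cublasLtMatmul(lt, opDesc, &alpha, dA, Adesc, dB, Bdesc, &beta,
                       dC, Cdesc, dC, Cdesc, &results[i].algo,
                       workspace, workspaceBytes, stream);
        cudaEventRecord(stop, stream);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < bestMs) { bestMs = ms; bestAlgo = results[i].algo; }
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cublasLtMatmulPreferenceDestroy(pref);
    return bestAlgo;
}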


Figure 4. An example of auto-tuning in cuBLAS

Better Functionality and Performance in cuBLASLt

Since cuBLAS 12.0, numerous enhancements have been introduced:

  1. Fused epilogue support parity between BF16 and FP16 precisions on NVIDIA Ampere and Ada.
  2. Additional fused epilogues on NVIDIA Hopper and Ampere (a minimal usage sketch follows this list).
  3. Support for FP8 on Ada GPUs and performance updates on Ada L4, L40, and L40S.
  4. Removal of M, N, and batch size limitations of the cuBLASLt matmul API.
  5. Improved performance of the heuristics cache for workloads with a high eviction rate.
  6. cuBLAS symbols are available in the CUDA Toolkit symbols for Linux repository.
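
As a rough illustration of the fused-epilogue feature mentioned above, the sketch below (not from the article) sets a GEMM + bias + ReLU epilogue on a cuBLASLt matmul descriptor. The descriptor creation, matrix layouts, and the cublasLtMatmul call itself are assumed to be set up as in the standard cuBLASLt samples.

#include <cublasLt.h>

// Fuse the bias add and ReLU into the matmul kernel instead of running
// separate elementwise kernels afterwards. dBias is a device vector with one
// value per output row, allocated elsewhere.
void set_bias_relu_epilogue(cublasLtMatmulDesc_t opDesc, const float* dBias)
{
    cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_RELU_BIAS;
    cublasLtMatmulDescSetAttribute(opDesc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                   &epilogue, sizeof(epilogue));

    cublasLtMatmulDescSetAttribute(opDesc, CUBLASLT_MATMUL_DESC_BIAS_POINTER,
                                   &dBias, sizeof(dBias));
}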

For more information on cuBLAS, see the documentation and samples.


