NVIDIA Enhances Llama 3.1 405B Performance with TensorRT Model Optimizer


Lawrence Jengar
Aug 29, 2024 16:10

NVIDIA’s TensorRT Model Optimizer significantly boosts performance of Meta’s Llama 3.1 405B large language model on H200 GPUs.


Meta’s Llama 3.1 405B large language model (LLM) is achieving new levels of performance thanks to NVIDIA’s TensorRT Model Optimizer, according to the NVIDIA Technical Blog. The enhancements have resulted in up to a 1.44x increase in throughput when running on NVIDIA H200 GPUs.

Outstanding Llama 3.1 405B Inference Throughput with TensorRT-LLM

TensorRT-LLM has already delivered remarkable inference throughput for Llama 3.1 405B since the model’s release. This was achieved through various optimizations, including in-flight batching, KV caching, and optimized attention kernels. These techniques accelerate inference while taking advantage of lower-precision compute.
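
To illustrate why KV caching matters during autoregressive decoding, here is a minimal, single-head PyTorch sketch (not the TensorRT-LLM implementation; the shapes, weights, and loop are simplified assumptions). It shows that only the newest token is projected at each step, while keys and values for earlier tokens are reused from the cache.

```python
import torch

# Minimal sketch of KV caching for a single attention head (illustrative only,
# not the TensorRT-LLM implementation). d = head dimension.
d = 64
wq, wk, wv = (torch.randn(d, d) for _ in range(3))

def attend(q, k, v):
    # Scaled dot-product attention over the cached sequence.
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ v

k_cache, v_cache = [], []
x = torch.randn(1, d)  # hypothetical embedding of the newest token

for step in range(4):  # decode 4 tokens autoregressively
    q = x @ wq
    # Only the newest token is projected; keys/values for earlier tokens
    # are reused from the cache instead of being recomputed every step.
    k_cache.append(x @ wk)
    v_cache.append(x @ wv)
    out = attend(q, torch.cat(k_cache), torch.cat(v_cache))
    x = out  # stand-in for the next token's embedding
```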

TensorRT-LLM added support for the official Llama FP8 quantization recipe, which calculates static and dynamic scaling factors to preserve maximum accuracy. Additionally, user-defined kernels such as matrix multiplications from FBGEMM are optimized via plug-ins inserted into the network graph at compile time.
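
As a rough illustration of how a static scaling factor is typically derived in FP8 recipes (a sketch of the common per-tensor approach, not the official Llama code), the scale comes from the calibrated absolute maximum divided by the FP8 E4M3 representable maximum of 448; dynamic scales are computed the same way but from the live tensor at runtime.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3

def static_fp8_scale(calibration_tensors):
    # Static scale: computed once from calibration data and reused at inference.
    amax = max(t.abs().max().item() for t in calibration_tensors)
    return amax / FP8_E4M3_MAX

def quantize_fp8(x, scale):
    # Simulated FP8 quantization: scale into the E4M3 range and clamp
    # (rounding to the actual FP8 grid is omitted for brevity).
    return torch.clamp(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)

calib = [torch.randn(1024) * 3 for _ in range(8)]  # hypothetical calibration batches
scale = static_fp8_scale(calib)
x_q = quantize_fp8(torch.randn(1024) * 3, scale)
```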

Boosting Performance Up to 1.44x with TensorRT Model Optimizer

NVIDIA’s custom FP8 post-training quantization (PTQ) recipe, available through the TensorRT Model Optimizer library, enhances Llama 3.1 405B throughput and reduces latency without sacrificing accuracy. This recipe incorporates FP8 KV cache quantization and self-attention static quantization, reducing inference compute overhead.
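
A minimal sketch of the typical PTQ flow with the TensorRT Model Optimizer Python package (nvidia-modelopt) is shown below. The tiny stand-in model, calibration data, and the config name mtq.FP8_DEFAULT_CFG are assumptions for illustration; the exact configuration used for the recipe described here should be checked against the Model Optimizer documentation.

```python
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq  # nvidia-modelopt package

# Tiny stand-in model and calibration data; in practice these would be the
# Llama 3.1 405B checkpoint and a representative calibration set.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
calib_batches = [torch.randn(8, 64) for _ in range(16)]

def forward_loop(m):
    # Run calibration batches so static FP8 scaling factors can be collected.
    with torch.no_grad():
        for batch in calib_batches:
            m(batch)

# mtq.FP8_DEFAULT_CFG is assumed to be the FP8 PTQ configuration name.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
# The quantized model is then exported and built into a TensorRT-LLM engine
# using the usual TensorRT-LLM workflow.
```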

Table 1 demonstrates the maximum throughput performance, showing significant improvements across various input and output sequence lengths on an 8-GPU HGX H200 system. The system features eight NVIDIA H200 Tensor Core GPUs with 141 GB of HBM3e memory each and four NVLink Switches, providing 900 GB/s of GPU-to-GPU bandwidth.


Maximum Throughput Performance (Output Tokens/Second)
8 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths        2,048 | 128     32,768 | 2,048     120,000 | 2,048
TensorRT Model Optimizer FP8           463.1           320.1              71.5
Official Llama FP8 Recipe              399.9           230.8              49.6
Speedup                                1.16x           1.39x              1.44x

Table 1. Maximum throughput performance of Llama 3.1 405B with NVIDIA internal measurements
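
For reference, the speedup rows in Tables 1 and 2 are simply the ratio of the Model Optimizer FP8 throughput to the official recipe's throughput in each column; a quick check of Table 1 (illustrative arithmetic only):

```python
# Speedup = Model Optimizer FP8 throughput / official Llama FP8 recipe throughput.
optimizer_fp8 = [463.1, 320.1, 71.5]
official_fp8 = [399.9, 230.8, 49.6]
speedups = [round(a / b, 2) for a, b in zip(optimizer_fp8, official_fp8)]
print(speedups)  # [1.16, 1.39, 1.44], matching the Speedup row in Table 1
```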

Similarly, Table 2 presents the minimum latency performance using the same input and output sequence lengths.


Batch Size = 1 Performance (Output Tokens/Second)
8 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths        2,048 | 128     32,768 | 2,048     120,000 | 2,048
TensorRT Model Optimizer FP8           49.6            44.2               27.2
Official Llama FP8 Recipe              37.4            33.1               22.8
Speedup                                1.33x           1.33x              1.19x

Table 2. Minimum latency performance of Llama 3.1 405B with NVIDIA internal measurements

These results indicate that H200 GPUs with TensorRT-LLM and TensorRT Model Optimizer deliver superior performance in both latency-optimized and throughput-optimized scenarios. The TensorRT Model Optimizer FP8 recipe also achieved accuracy comparable to the official Llama 3.1 FP8 recipe on the Massive Multitask Language Understanding (MMLU) and MT-Bench benchmarks.

Fitting Llama 3.1 405B on Just Two H200 GPUs with INT4 AWQ

For developers with hardware resource constraints, the INT4 AWQ technique in TensorRT Model Optimizer compresses the model, allowing Llama 3.1 405B to fit on just two H200 GPUs. This method significantly reduces the required memory footprint by compressing the weights down to 4-bit integers while encoding activations in FP16.
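
A back-of-the-envelope check (my arithmetic, not a figure from the article) shows why INT4 weights make a two-GPU deployment plausible: 405B parameters at 4 bits each occupy roughly 203 GB, which fits within the combined 282 GB of HBM3e on two H200 GPUs and leaves headroom for activations and the KV cache, whereas FP8 weights alone would need about 405 GB.

```python
# Rough memory estimate for Llama 3.1 405B weights (illustrative arithmetic only).
params = 405e9
bytes_per_weight = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}
h200_memory_gb = 141  # HBM3e per H200 GPU

for fmt, b in bytes_per_weight.items():
    weight_gb = params * b / 1e9
    gpus_for_weights = weight_gb / h200_memory_gb
    print(f"{fmt}: ~{weight_gb:.0f} GB of weights (~{gpus_for_weights:.1f} H200s for weights alone)")
# INT4: ~203 GB, so the weights alone fit in two H200s' 282 GB, ignoring
# quantization scales and other runtime overhead.
```

In the Model Optimizer workflow, the same mtq.quantize flow sketched earlier would be used with the library's INT4 AWQ configuration (the exact configuration name is an assumption; consult the Model Optimizer documentation).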

Tables 4 and 5 show the maximum throughput and minimum latency performance measurements; the INT4 AWQ method also provides accuracy scores comparable to Meta's official Llama 3.1 FP8 recipe.


Maximum Throughput Performance (Output Tokens/Second)
2 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths        2,048 | 128     32,768 | 2,048     60,000 | 2,048
TensorRT Model Optimizer INT4 AWQ      75.6            28.7               16.2

Table 4. Maximum throughput performance of Llama 3.1 405B with NVIDIA internal measurements

Batch Size = 1 Performance (Output Tokens/Second)
2 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths        2,048 | 128     32,768 | 2,048     60,000 | 2,048
TensorRT Model Optimizer INT4 AWQ      21.6            18.7               12.8

Table 5. Minimum latency performance of Llama 3.1 405B with NVIDIA internal measurements

NVIDIA’s advancements in TensorRT Model Optimizer and TensorRT-LLM are paving the way for enhanced performance and efficiency in running large language models like Llama 3.1 405B. These improvements offer developers more flexibility and cost-efficiency, whether they have extensive hardware resources or more constrained environments.

Image source: Shutterstock
