NVIDIA Enhances TensorRT Model Optimizer v0.15 with Improved Inference Performance


Zach Anderson
Aug 16, 2024 03:03

NVIDIA releases TensorRT Model Optimizer v0.15, offering enhanced inference performance through new features like cache diffusion and expanded AI model support.


NVIDIA has introduced the v0.15 release of the NVIDIA TensorRT Model Optimizer, a cutting-edge toolkit that applies model optimization techniques such as quantization, sparsity, and pruning. The update aims to reduce model complexity and speed up inference for generative AI models, according to the NVIDIA Technical Blog.

Cache Diffusion

The new version adds support for cache diffusion, building on the previously established 8-bit post-training quantization (PTQ) technique. Cache diffusion accelerates diffusion models at inference time by reusing cached outputs from previous denoising steps. Methods such as DeepCache and block caching exploit the temporal consistency of high-level features across consecutive denoising steps, improving inference speed without any additional training, and the mechanism is compatible with both DiT- and UNet-based models.

Developers can enable cache diffusion by using a single ‘cachify’ instance in the Model Optimizer with the diffusion pipeline. For instance, enabling cache diffusion in a Stable Diffusion XL (SDXL) model on an NVIDIA H100 Tensor Core GPU delivers a 1.67x speedup in images per second. This speedup further increases when FP8 is also enabled.
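To make the workflow concrete, here is a minimal sketch of what enabling cache diffusion can look like for an SDXL pipeline; the `cachify` import and `prepare` call are assumptions based on the description above, so consult the Model Optimizer examples for the exact interface.

```python
# Sketch of enabling cache diffusion on an SDXL pipeline. The `cachify`
# import and `prepare` call are assumptions modeled on the article's
# description of the Model Optimizer cache diffusion example; check the
# GitHub repository for the exact module path and arguments.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

from cache_diffusion import cachify  # assumed name, from the example code

# Wrap the pipeline so cached outputs from earlier denoising steps are
# reused in later steps (DeepCache / block-caching style reuse).
cachify.prepare(pipe)  # assumed single-call interface per the article

images = pipe(
    prompt="a photo of an astronaut riding a horse on Mars",
    num_inference_steps=30,
).images
```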

Quantization-Aware Training with NVIDIA NeMo

Quantization-aware training (QAT) simulates the effects of quantization during neural network training so that model accuracy can be recovered after quantization. The process computes scaling factors and incorporates the simulated quantization loss into fine-tuning. The Model Optimizer uses custom CUDA kernels for simulated quantization, yielding lower-precision model weights and activations for efficient hardware deployment.
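The following rough sketch shows the shape of that flow with the Model Optimizer PyTorch API: quantize the model with a short calibration pass, then continue fine-tuning as usual. The toy model and dummy data are placeholders included only to keep the example self-contained, and the INT8 config name should be verified against the installed modelopt version.

```python
# Rough QAT sketch with the Model Optimizer PyTorch quantization API.
import torch
import modelopt.torch.quantization as mtq

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
)
calib_loader = [(torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(8)]
train_loader = calib_loader  # placeholder: reuse the dummy batches

def calibrate(m):
    # Run a few representative batches so scaling factors can be
    # computed for the inserted (simulated) quantizers.
    m.eval()
    with torch.no_grad():
        for x, _ in calib_loader:
            m(x)

# Insert fake-quantization ops and calibrate their scales.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop=calibrate)

# Standard fine-tuning loop: gradients now flow through the simulated
# quantization, so the weights adapt to the quantization error.
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for x, y in train_loader:
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice, the calibration batches and fine-tuning data would come from the model's original training pipeline rather than random tensors.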

Model Optimizer v0.15 expands QAT integration support to include NVIDIA NeMo, an enterprise-grade platform for developing custom generative AI models. This first-class support for NeMo models allows users to fine-tune models directly with the original training pipeline. For more details, see the QAT example in the NeMo GitHub repository.

QLoRA Workflow

Quantized Low-Rank Adaptation (QLoRA) is a fine-tuning technique that reduces memory usage and computational complexity during model training. It combines quantization with Low-Rank Adaptation (LoRA), making large language model (LLM) fine-tuning more accessible. Model Optimizer now supports the QLoRA workflow with NVIDIA NeMo using the NF4 data type. For a Llama 13B model on the Alpaca dataset, QLoRA can reduce peak memory usage by 29-51% while maintaining model accuracy.
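For illustration only, the sketch below reproduces the general QLoRA recipe (a frozen NF4-quantized base model plus small trainable LoRA adapters) using the Hugging Face Transformers, bitsandbytes, and PEFT stack rather than the NeMo integration covered here; the checkpoint name and LoRA hyperparameters are placeholders.

```python
# Generic QLoRA sketch (not the NeMo workflow described above): the base
# weights are loaded in 4-bit NF4 and frozen, and only small LoRA
# adapters are trained.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NF4 data type, as in the article
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",          # placeholder 13B checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # placeholder attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()        # only the LoRA adapters are trainable
```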

Expanded Support for AI Models

The latest release also expands support for a wider suite of AI models, including Stability.ai’s Stable Diffusion 3, Google’s RecurrentGemma, Microsoft’s Phi-3, Snowflake’s Arctic 2, and Databricks’ DBRX. For more details, refer to the example scripts and support matrix available in the Model Optimizer GitHub repository.

Get Started

NVIDIA TensorRT Model Optimizer provides seamless integration with NVIDIA TensorRT-LLM and TensorRT for deployment. It is available for installation on PyPI as nvidia-modelopt. Visit the NVIDIA TensorRT Model Optimizer GitHub page for example scripts and recipes for inference optimization. Comprehensive documentation is also available.
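A minimal getting-started sketch, assuming the PyTorch quantization API and the config names used in the published examples; the toy model and calibration data are placeholders, and the export recipes for TensorRT-LLM and TensorRT live in the GitHub repository.

```python
# Getting started: install from PyPI with `pip install nvidia-modelopt`.
# Minimal FP8 post-training quantization sketch on a toy model; the
# config name and summary helper follow the published examples and
# should be verified against the installed version.
import torch
import modelopt.torch.quantization as mtq

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))
calib_batches = [torch.randn(16, 64) for _ in range(8)]  # placeholder calibration data

def forward_loop(m):
    # Push representative inputs through the model to calibrate scales.
    with torch.no_grad():
        for x in calib_batches:
            m(x)

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=forward_loop)
mtq.print_quant_summary(model)  # assumed helper; prints inserted quantizers
```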

