NVIDIA Triton Inference Server Excels in MLPerf Inference 4.1 Benchmarks


Rongchai Wang
Aug 29, 2024 06:56

NVIDIA Triton Inference Server achieves exceptional performance in MLPerf Inference 4.1 benchmarks, demonstrating its capabilities in AI model deployment.

NVIDIA’s Triton Inference Server has achieved remarkable performance in the latest MLPerf Inference 4.1 benchmarks, according to the NVIDIA Technical Blog. Running on a system with eight H200 GPUs, the server demonstrated virtually identical performance to NVIDIA’s bare-metal submission on the Llama 2 70B benchmark, highlighting its ability to deliver feature-rich, production-grade AI inference without sacrificing peak throughput.

NVIDIA Triton Key Features

NVIDIA Triton is an open-source AI model-serving platform designed to streamline and accelerate the deployment of AI inference workloads in production. Key features include universal AI framework support, seamless cloud integration, business logic scripting, model ensembles, and a model analyzer.

Universal AI Framework Support

Initially launched in 2016 with support for the NVIDIA TensorRT backend, Triton now supports all major frameworks, including TensorFlow, PyTorch, ONNX, and more. This broad support allows developers to quickly deploy new models into existing production instances, significantly reducing time to market.
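
Because every backend is exposed through the same serving API, client code does not change when a model moves between frameworks. Below is a minimal sketch using Triton's Python HTTP client; the model name `resnet50_onnx` and its tensor names are illustrative assumptions, not details from NVIDIA's submission.

```python
import numpy as np
import tritonclient.http as httpclient  # pip install tritonclient[http]

# Connect to a locally running Triton instance (default HTTP port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical ONNX image model; swap in your own model and tensor names.
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = [httpclient.InferInput("INPUT__0", list(batch.shape), "FP32")]
inputs[0].set_data_from_numpy(batch)

# The same call works whether the backend is TensorRT, PyTorch,
# TensorFlow, or ONNX Runtime -- only the model's config differs.
result = client.infer(model_name="resnet50_onnx", inputs=inputs)
print(result.as_numpy("OUTPUT__0").shape)
```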

Seamless Cloud Integration

NVIDIA Triton integrates deeply with major cloud service providers, enabling easy deployment in the cloud with minimal or no code required. It supports platforms such as OCI Data Science, Azure ML CLI, GKE-managed clusters, and AWS Deep Learning Containers, among others.

Business Logic Scripting

Through business logic scripting, Triton allows custom Python or C++ scripts to be incorporated into production pipelines, enabling organizations to tailor AI workloads to their specific needs.
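
In the Python backend, business logic scripting means calling other models from inside a model's `execute` function. The sketch below routes each request to one of two downstream models based on an input flag; the model and tensor names are hypothetical, but `pb_utils.InferenceRequest` and its `exec()` call are the Python-backend BLS interface.

```python
import triton_python_backend_utils as pb_utils  # provided by the Python backend


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Hypothetical routing flag supplied by the client.
            flag = pb_utils.get_input_tensor_by_name(request, "USE_LARGE").as_numpy()
            target = "model_large" if flag.any() else "model_small"  # hypothetical names

            # BLS call: forward the payload to the chosen downstream model.
            payload = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            bls_request = pb_utils.InferenceRequest(
                model_name=target,
                requested_output_names=["OUTPUT0"],
                inputs=[payload],
            )
            bls_response = bls_request.exec()
            if bls_response.has_error():
                raise pb_utils.TritonModelException(bls_response.error().message())

            out = pb_utils.get_output_tensor_by_name(bls_response, "OUTPUT0")
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```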

Model Ensembles

Model ensembles enable enterprises to connect pre- and post-processing workflows into cohesive pipelines without programming, optimizing infrastructure costs and reducing latency.
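
An ensemble is declared entirely in the model's `config.pbtxt`; Triton schedules the steps itself and keeps intermediate tensors on the server, avoiding round-trips to the client between stages. A minimal sketch, assuming hypothetical `preprocess` and `classifier` models:

```
name: "image_pipeline"   # hypothetical ensemble name
platform: "ensemble"
max_batch_size: 8
input [
  { name: "RAW_IMAGE" data_type: TYPE_UINT8 dims: [ -1 ] }
]
output [
  { name: "SCORES" data_type: TYPE_FP32 dims: [ 1000 ] }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"   # hypothetical first stage
      model_version: -1
      input_map  { key: "RAW"    value: "RAW_IMAGE" }
      output_map { key: "PIXELS" value: "preprocessed" }
    },
    {
      model_name: "classifier"   # hypothetical second stage
      model_version: -1
      input_map  { key: "INPUT"  value: "preprocessed" }
      output_map { key: "OUTPUT" value: "SCORES" }
    }
  ]
}
```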

Model Analyzer

The Model Analyzer feature allows experimentation with various deployment configurations, visually mapping them to identify the most efficient setup for production use. It also includes GenAI-Perf, a tool designed for generative AI performance benchmarking.

Exceptional Throughput Results at MLPerf 4.1

At MLPerf Inference v4.1, hosted by MLCommons, NVIDIA Triton demonstrated its capabilities on a TensorRT-LLM-optimized Llama-v2-70B model. The server achieved performance nearly identical to bare-metal submissions, proving that enterprises can achieve feature-rich, production-grade AI inference and peak throughput performance simultaneously.

MLPerf Benchmark Submission Details

The submission included two scenarios: Offline, where inputs are batch-processed, and Server, which mimics real-world production deployments with discrete input requests. The NVIDIA Triton implementation used a gRPC client-server setup, with the server providing a gRPC endpoint for interacting with TensorRT-LLM.
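
The gRPC flavor of Triton's Python client mirrors the setup described above. Here is a minimal sketch of a client issuing a single discrete request, as in the Server scenario; the model name `ensemble` and the tensor names follow a common TensorRT-LLM backend convention but should be treated as assumptions and checked against the model's config.

```python
import numpy as np
import tritonclient.grpc as grpcclient  # pip install tritonclient[grpc]

# Connect to Triton's gRPC endpoint (default port 8001).
client = grpcclient.InferenceServerClient(url="localhost:8001")

# Assumed tensor names for a TensorRT-LLM ensemble.
prompt = np.array([["What is MLPerf?"]], dtype=object)
max_tokens = np.array([[128]], dtype=np.int32)

inputs = [
    grpcclient.InferInput("text_input", list(prompt.shape), "BYTES"),
    grpcclient.InferInput("max_tokens", list(max_tokens.shape), "INT32"),
]
inputs[0].set_data_from_numpy(prompt)
inputs[1].set_data_from_numpy(max_tokens)

# One discrete request, as the MLPerf Server scenario issues them.
result = client.infer(model_name="ensemble", inputs=inputs)
print(result.as_numpy("text_output"))
```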

Next In-Person User Meetup

NVIDIA announced the next Triton user meetup, to be held on September 9, 2024, at the Fort Mason Center For Arts & Culture in San Francisco. The event will focus on new LLM features and future innovations.

Image source: Shutterstock
