NVIDIA Launches GenAI-Perf for Optimizing Generative AI Model Performance
NVIDIA has unveiled a new tool, GenAI-Perf, aimed at enhancing the performance measurement and optimization of generative AI models. According to the NVIDIA Technical Blog, the tool is incorporated into the latest release of NVIDIA Triton and is designed to help machine learning engineers find the optimal balance between latency and throughput, a trade-off that is especially crucial for large language models (LLMs).
Key Metrics for LLM Performance
When dealing with LLMs, performance metrics extend beyond traditional latency and throughput. Key metrics include:
- Time to first token: The time between when a request is sent and the receipt of the first response.
- Output token throughput: The number of output tokens generated per second.
- Inter-token latency: The time between intermediate responses divided by the number of generated tokens.
These metrics are essential for applications where quick and consistent performance is paramount, with time to first token often being the highest priority.
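As a concrete illustration (not part of GenAI-Perf itself), these three metrics can be computed from a request's send time and the arrival timestamps of its streamed responses. The function below is a hypothetical sketch; GenAI-Perf performs these calculations internally.

```python
def llm_metrics(request_sent, response_times, num_output_tokens):
    """Compute the three key LLM metrics from raw timestamps (seconds).

    request_sent: time the request was sent
    response_times: arrival time of each streamed response, in order
    num_output_tokens: total number of output tokens generated
    """
    # Time to first token: gap between sending the request and
    # receiving the first response
    ttft = response_times[0] - request_sent
    # Output token throughput: tokens generated per second over the
    # whole request
    total_time = response_times[-1] - request_sent
    throughput = num_output_tokens / total_time
    # Inter-token latency: time spanned by the intermediate responses
    # divided by the number of generated tokens
    itl = (response_times[-1] - response_times[0]) / num_output_tokens
    return ttft, throughput, itl

# Example: request sent at t=0.0, responses arriving every 0.5 s,
# 100 tokens generated in total
ttft, tput, itl = llm_metrics(0.0, [0.5, 1.0, 1.5, 2.0, 2.5], 100)
```

With these example timestamps, time to first token is 0.5 s, throughput is 40 tokens/s, and inter-token latency is 0.02 s per token.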
Introducing GenAI-Perf
GenAI-Perf is designed to accurately measure these specific metrics, helping users determine optimal configurations for peak performance and cost-effectiveness. The tool supports industry-standard datasets like OpenOrca and CNN_dailymail and facilitates standardized performance evaluations across various inference engines through an OpenAI-compatible API.
GenAI-Perf is intended to be the default benchmarking tool for all NVIDIA generative AI offerings, including NVIDIA NIM, NVIDIA Triton Inference Server, and NVIDIA TensorRT-LLM. This facilitates easy comparisons among different serving solutions that support the OpenAI-compatible API.
Supported Endpoints and Usage
Currently, GenAI-Perf supports three OpenAI endpoint APIs: Chat, Chat Completions, and Embeddings. As new model types emerge, additional endpoints will be introduced. GenAI-Perf is also open source, accepting community contributions.
To get started with GenAI-Perf, users can install the latest Triton Inference Server SDK container from NVIDIA GPU Cloud. Running the container and server involves specific commands tailored to the type of model being used, such as GPT2 for the chat and chat-completion endpoints, and intfloat/e5-mistral-7b-instruct for embeddings.
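A minimal sketch of pulling and entering the SDK container might look like the following; the 24.05 tag is an example, and the latest release tag on NVIDIA GPU Cloud (NGC) should be used instead.

```shell
# Pull the Triton Inference Server SDK container from NGC
# (the 24.05 release tag is an example; substitute the latest)
docker pull nvcr.io/nvidia/tritonserver:24.05-py3-sdk

# Run the container interactively; the genai-perf CLI ships inside it
docker run -it --net=host --gpus=all \
  nvcr.io/nvidia/tritonserver:24.05-py3-sdk
```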
Profiling and Results
For profiling OpenAI chat-compatible models, users can run specific commands to measure performance metrics such as request latency, output sequence length, and input sequence length.
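A profiling run against a locally served chat model might be invoked roughly as follows. This is a hedged sketch: the exact flag set varies by GenAI-Perf version, and the URL and port are placeholders for wherever the server is listening, so `genai-perf --help` should be consulted for the authoritative options.

```shell
# Hypothetical sketch of profiling GPT2 on a local OpenAI-compatible
# chat endpoint; flag names and the URL are assumptions to verify
# against your installed GenAI-Perf version
genai-perf \
  -m gpt2 \
  --service-kind openai \
  --endpoint-type chat \
  --url localhost:9999 \
  --streaming
```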
Sample results for GPT2 show metrics like:
- Request latency (ms): Average of 1679.30, with a minimum of 567.31 and a maximum of 2929.26.
- Output sequence length: Average of 453.43, ranging from 162 to 784.
- Output token throughput (per sec): 269.99.
Similarly, for profiling OpenAI embeddings-compatible models, users can generate a JSONL file with sample texts and run GenAI-Perf to obtain metrics such as request latency and request throughput.
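Generating such an input file can be as simple as writing one JSON object per line. In this sketch, the `"text"` key and the file name are assumptions; the exact input schema GenAI-Perf expects should be checked against its documentation.

```python
import json

# Hedged sketch: write a JSONL file of sample texts to feed GenAI-Perf
# when profiling an embeddings endpoint. The "text" key is an
# assumption; consult the GenAI-Perf docs for the exact input schema.
samples = [
    "What is the capital of France?",
    "Summarize the plot of Hamlet in one sentence.",
    "Explain how a transformer attention layer works.",
]

with open("embeddings_input.jsonl", "w") as f:
    for s in samples:
        f.write(json.dumps({"text": s}) + "\n")
```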
Conclusion
GenAI-Perf provides a comprehensive solution for benchmarking generative AI models, offering insights into critical performance metrics and facilitating optimization. As an open-source tool, it allows for continuous improvement and adaptation to new model types and requirements.