NVIDIA Launches GenAI-Perf for Optimizing Generative AI Model Performance


Timothy Morano
Aug 02, 2024 02:46

NVIDIA introduces GenAI-Perf, a new tool for benchmarking generative AI models, enhancing performance measurement and optimization.


NVIDIA has unveiled a new tool, GenAI-Perf, aimed at enhancing the performance measurement and optimization of generative AI models. According to the NVIDIA Technical Blog, the tool is incorporated into the latest release of NVIDIA Triton and is designed to help machine learning engineers find the optimal balance between latency and throughput, which is especially crucial for large language models (LLMs).

Key Metrics for LLM Performance

When dealing with LLMs, performance metrics extend beyond traditional latency and throughput. Key metrics include:


  • Time to first token: The time between when a request is sent and the receipt of the first response.
  • Output token throughput: The number of output tokens generated per second.
  • Inter-token latency: The time between intermediate responses, divided by the number of generated tokens.

These metrics are essential for applications where quick and consistent performance is paramount, with time to first token often being the highest priority.
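As a rough illustration, all three metrics can be derived from the timestamps of a streamed response. The sketch below assumes one token per streamed response and uses made-up timestamps; it is not GenAI-Perf's own implementation:

```python
def streaming_metrics(request_sent, token_times):
    """Compute basic LLM streaming metrics from timestamps (in seconds).

    request_sent: time the request was sent
    token_times:  arrival time of each streamed token, in order
                  (assumes one token per intermediate response)
    """
    ttft = token_times[0] - request_sent            # time to first token
    total = token_times[-1] - request_sent          # full response window
    n = len(token_times)
    throughput = n / total                          # output tokens per second
    # average gap between intermediate responses per generated token
    inter_token = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {
        "ttft": ttft,
        "inter_token_latency": inter_token,
        "output_token_throughput": throughput,
    }

# Example: request sent at t=0.0, four tokens arriving 0.1 s apart
# after a 0.5 s wait -> TTFT 0.5 s, inter-token latency 0.1 s, 5 tokens/s
m = streaming_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
```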

Introducing GenAI-Perf

GenAI-Perf is designed to accurately measure these specific metrics, helping users determine optimal configurations for peak performance and cost-effectiveness. The tool supports industry-standard datasets such as OpenOrca and CNN_dailymail, and it facilitates standardized performance evaluations across various inference engines through an OpenAI-compatible API.
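Concretely, an OpenAI-compatible server is one that accepts the OpenAI request schema, so any such engine can be benchmarked with the same request body. A minimal sketch of a chat-completions body (the model name, message, and parameter values are illustrative, not from the blog):

```python
import json

# Body of an OpenAI-style chat completions request. Any server exposing this
# schema can be driven by the same benchmark client. Values are illustrative.
payload = {
    "model": "gpt2",
    "messages": [{"role": "user", "content": "Summarize this article."}],
    "stream": True,       # stream tokens so time to first token is observable
    "max_tokens": 256,
}
body = json.dumps(payload)
```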

GenAI-Perf is intended to be the default benchmarking tool for all NVIDIA generative AI offerings, including NVIDIA NIM, NVIDIA Triton Inference Server, and NVIDIA TensorRT-LLM. This facilitates easy comparisons among the different serving solutions that support the OpenAI-compatible API.

Supported Endpoints and Usage

Currently, GenAI-Perf supports three OpenAI endpoint APIs: Chat, Chat Completions, and Embeddings. Additional endpoints will be introduced as new model types emerge. GenAI-Perf is also open source and accepts community contributions.

To get started with GenAI-Perf, users can install the latest Triton Inference Server SDK container from NVIDIA GPU Cloud. Running the container and server involves specific commands tailored to the type of model being used, such as GPT2 for the chat and chat-completion endpoints and intfloat/e5-mistral-7b-instruct for embeddings.
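Pulling and entering the SDK container might look like the following (the container tag is an assumption; use the latest release listed in the NGC catalog):

```shell
# Pull the Triton Inference Server SDK container from NGC
# (the 24.06 tag is illustrative; pick the current yy.mm release).
docker pull nvcr.io/nvidia/tritonserver:24.06-py3-sdk

# Start an interactive shell with GPU access and host networking
# so genai-perf can reach a locally running inference server.
docker run -it --net=host --gpus=all \
    nvcr.io/nvidia/tritonserver:24.06-py3-sdk

# Inside the container, genai-perf is preinstalled:
genai-perf --help
```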

Profiling and Results

For profiling OpenAI chat-compatible models, users can run specific commands to measure performance metrics such as request latency, output sequence length, and input sequence length. Sample results for GPT2 show metrics like:


  • Request latency (ms): Average of 1679.30, with a minimum of 567.31 and a maximum of 2929.26.
  • Output sequence length: Average of 453.43, ranging from 162 to 784.
  • Output token throughput (per sec): 269.99.

Similarly, for profiling OpenAI embeddings-compatible models, users can generate a JSONL file with sample texts and run GenAI-Perf to obtain metrics such as request latency and request throughput.
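Such a JSONL input is simply one JSON object per line. A minimal sketch of generating it (the "text" field name, file name, and sample sentences are illustrative assumptions):

```python
import json

# Write a small JSONL file of sample texts for profiling an embeddings
# endpoint: one JSON object per line. Field and file names are illustrative.
samples = [
    "What was the first car ever driven?",
    "Who served as the 5th President of the United States of America?",
    "Is the Statue of Liberty in New York?",
]

with open("embeddings_input.jsonl", "w") as f:
    for s in samples:
        f.write(json.dumps({"text": s}) + "\n")
```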

Conclusion

GenAI-Perf provides a comprehensive solution for benchmarking generative AI models, offering insights into critical performance metrics and facilitating optimization. As an open-source tool, it allows for continuous improvement and adaptation to new model types and requirements.

Image source: Shutterstock
