IBM Research Unveils Cost-Effective AI Inferencing with Speculative Decoding

IBM Research has announced a significant breakthrough in AI inferencing, combining speculative decoding with paged attention to enhance the cost performance of large language models (LLMs). This development promises to make customer care chatbots more efficient and cost-effective, according to IBM Research.

In recent years, LLMs have improved the ability of chatbots to understand customer queries and provide accurate responses. However, the high cost and slow speed of serving these models have hindered broader AI adoption. Speculative decoding has emerged as an optimization technique that accelerates AI inferencing by generating tokens faster, which can reduce latency by two to three times and thereby improve customer experience.

Despite its advantages, reducing latency has traditionally come with a trade-off: decreased throughput, or the number of users who can use the model at the same time, which in turn increases operational costs. IBM Research has tackled this challenge by cutting the latency of its open-source Granite 20B code model in half while quadrupling its throughput.

Speculative Decoding: Efficiency in Token Generation

LLMs use a transformer architecture, which is inefficient at generating text. Typically, a forward pass is required to process each previously generated token before producing a new one. Speculative decoding modifies this process to evaluate several prospective tokens simultaneously. If these tokens are validated, one forward pass can generate multiple tokens, thus increasing inferencing speed.
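
To make the verify step concrete, below is a minimal sketch of two-model (draft-and-verify) speculative decoding using Hugging Face transformers. The model identifiers are placeholders, the two checkpoints are assumed to share a tokenizer and vocabulary, and the greedy acceptance rule is a simplification; this is not IBM's released implementation.

```python
# Minimal draft-and-verify speculative decoding sketch (greedy acceptance).
# Assumptions: "org/large-target-model" and "org/small-draft-model" are
# placeholder checkpoints that share the same tokenizer and vocabulary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("org/small-draft-model")
target = AutoModelForCausalLM.from_pretrained("org/large-target-model")
draft = AutoModelForCausalLM.from_pretrained("org/small-draft-model")

@torch.no_grad()
def speculative_step(input_ids, k=4):
    """Propose k tokens with the draft model, then verify them in one target pass."""
    # 1. The cheap draft model speculates k tokens autoregressively.
    proposal = draft.generate(input_ids, max_new_tokens=k, do_sample=False)
    drafted = proposal[:, input_ids.shape[1]:]                  # [1, k]

    # 2. The large model scores the prompt plus all drafted tokens at once.
    logits = target(proposal).logits                            # [1, L + k, vocab]
    # The target's prediction for drafted position i lives at position L + i - 1.
    preds = logits[:, input_ids.shape[1] - 1:-1, :].argmax(-1)  # [1, k]

    # 3. Accept drafted tokens until the first disagreement, then take the
    #    target's own token at that position (so every step yields >= 1 token).
    accepted = []
    for i in range(drafted.shape[1]):
        if drafted[0, i] == preds[0, i]:
            accepted.append(int(drafted[0, i]))
        else:
            accepted.append(int(preds[0, i]))
            break
    new_tokens = torch.tensor([accepted], dtype=input_ids.dtype)
    return torch.cat([input_ids, new_tokens], dim=1)
```

When most drafted tokens are accepted, each forward pass of the large model emits several tokens instead of one, which is where the two-to-three-times latency reduction comes from.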

This technique can be executed by a smaller, more efficient model or by part of the main model itself. By processing tokens in parallel, speculative decoding maximizes the efficiency of each GPU, potentially doubling or tripling inferencing speed. Initial introductions of speculative decoding by DeepMind and Google researchers utilized a draft model, while newer methods, such as the Medusa speculator, eliminate the need for a secondary model.

IBM researchers adapted the Medusa speculator by conditioning future tokens on each other rather than on the model’s next predicted token. This approach, combined with an efficient fine-tuning method using small and large batches of text, aligns the speculator’s responses closely with the LLM, significantly boosting inferencing speeds.
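
The shape of such a speculator can be illustrated with a toy module: a few lightweight heads sit on top of the base model's last hidden state, each predicting one future token, and, in the spirit of IBM's modification, each head is conditioned on the token speculated by the previous head rather than only on the base model's next predicted token. The class name, layer sizes, and wiring below are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class ChainedSpeculator(nn.Module):
    """Toy multi-head speculator: head i+1 is conditioned on head i's candidate."""

    def __init__(self, hidden_size=1024, vocab_size=50000, n_heads=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        # One small MLP per future position; each consumes the running state
        # plus the embedding of the previously speculated token.
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(2 * hidden_size, hidden_size), nn.GELU())
            for _ in range(n_heads)
        )
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, last_hidden, last_token_id):
        """last_hidden: [batch, hidden] state at the base LLM's final position.
        last_token_id: [batch] id of the most recent real token."""
        state = last_hidden
        prev_emb = self.embed(last_token_id)
        candidates = []
        for head in self.heads:
            state = head(torch.cat([state, prev_emb], dim=-1))
            logits = self.lm_head(state)
            next_id = logits.argmax(dim=-1)     # speculated token for this position
            candidates.append(next_id)
            prev_emb = self.embed(next_id)      # condition the next head on it
        return torch.stack(candidates, dim=1)   # [batch, n_heads] drafted tokens
```

Chaining the heads this way means each speculated token is consistent with the ones speculated before it, which is the intuition behind aligning the speculator's output more closely with the base LLM.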

Paged Attention: Optimizing Memory Usage

Reducing LLM latency often compromises throughput due to increased GPU memory strain. Dynamic batching can mitigate this, but not when speculative decoding is also competing for memory. IBM researchers addressed this by employing paged attention, an optimization technique inspired by virtual memory and paging concepts from operating systems.

Traditional attention algorithms store key-value (KV) sequences in contiguous memory, leading to fragmentation. Paged attention, however, divides these sequences into smaller blocks, or pages, that can be accessed as needed. This method minimizes redundant computation and allows the speculator to generate multiple candidates for each predicted word without duplicating the entire KV-cache, thus freeing up memory.
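
A paged KV-cache can be pictured as a pool of fixed-size physical blocks plus a per-sequence block table, so that speculative candidates sharing a prefix can point at the same blocks instead of copying them. The block size, reference-counting scheme, and method names below are illustrative assumptions in the spirit of paged attention, not an exact production implementation.

```python
# Toy bookkeeping for a paged KV-cache: fixed-size blocks, a free pool, and
# per-sequence block tables so sequences sharing a prefix can share blocks.
BLOCK_SIZE = 16  # tokens per block (assumed value)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.ref_counts = [0] * num_blocks          # sequences using each block
        self.block_tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                           # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Account for one new KV entry, allocating a block only when the last one is full.
        (A real implementation would also copy-on-write a shared, partially filled block.)"""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:                # previous block full, or first token
            block = self.free_blocks.pop()
            self.ref_counts[block] = 1
            table.append(block)
        self.lengths[seq_id] = length + 1

    def fork(self, parent_id, child_id):
        """Start a speculative candidate that reuses the parent's blocks without copying."""
        self.block_tables[child_id] = list(self.block_tables[parent_id])
        self.lengths[child_id] = self.lengths[parent_id]
        for block in self.block_tables[child_id]:
            self.ref_counts[block] += 1

    def free(self, seq_id):
        """Discard a rejected candidate; blocks return to the pool once unreferenced."""
        for block in self.block_tables.pop(seq_id):
            self.ref_counts[block] -= 1
            if self.ref_counts[block] == 0:
                self.free_blocks.append(block)
        del self.lengths[seq_id]
```

Forking a candidate here copies only the block table, not the KV tensors themselves, which is why the speculator can explore several candidate continuations without multiplying GPU memory use.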

Future Implications

IBM has integrated speculative decoding and paged attention into its Granite 20B code model. The IBM speculator has been open-sourced on Hugging Face, enabling other developers to adapt these techniques for their LLMs. IBM plans to implement these optimization techniques across all models on its watsonx platform, enhancing enterprise AI applications.

Image source: Shutterstock
