TEAL Introduces Training-Free Activation Sparsity to Boost LLM Efficiency


Zach Anderson
Sep 01, 2024 08:34

TEAL offers a training-free approach to activation sparsity, significantly enhancing the efficiency of large language models (LLMs) with minimal degradation.


TEAL (Training-Free Activation Sparsity in LLMs) has emerged as a groundbreaking approach to improving the efficiency of large language models (LLMs) without requiring additional training. According to together.ai, the method applies magnitude pruning to hidden states throughout the model, achieving 40-50% activation sparsity with minimal degradation. This allows fewer weights to be transferred to on-chip memory, addressing the memory-bound nature of LLM inference and translating into 1.53-1.8x wall-clock speedups in single-batch decoding.
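
To make the idea concrete, here is a minimal sketch of magnitude pruning applied to a hidden-state tensor, assuming a PyTorch-style workflow. The function name and per-tensor quantile threshold are illustrative, not TEAL's actual implementation.

```python
import torch

def magnitude_prune(x: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the lowest-magnitude entries of a hidden-state tensor.

    `sparsity` is the fraction of entries to drop (0.5 = 50%). The cutoff
    is taken from the tensor's own magnitude distribution, so this toy
    version needs no training or calibration pass.
    """
    if sparsity <= 0.0:
        return x
    threshold = torch.quantile(x.abs().float(), sparsity)
    return torch.where(x.abs() >= threshold, x, torch.zeros_like(x))

# Example: prune a single-token decode activation to roughly 50% sparsity.
hidden = torch.randn(1, 4096)
pruned = magnitude_prune(hidden, 0.5)
print((pruned == 0).float().mean().item())  # ~0.5
```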

Background

LLMs are known for their massive size, which poses challenges during inference, primarily due to the speed limitations of transferring parameters from device memory to registers. Various techniques such as quantization, weight sparsity, and speculative decoding have been developed to tackle this ‘memory wall’. Activation sparsity, which leverages zero values in hidden states, is a less explored method that avoids transferring unnecessary weight channels during decoding.
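
The saving comes from the fact that a zeroed activation means the matching weight column never has to be read. The sketch below, which assumes a PyTorch matrix-vector product, illustrates the arithmetic only; a real implementation needs a fused GPU kernel that skips loading the pruned columns from memory, since a plain gather like this still copies them.

```python
import torch

def sparse_decode_matvec(W: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """y = W @ x using only the weight columns whose activation is non-zero.

    Conceptual only: W[:, active] materialises a copy, so the bandwidth
    saving has to come from a custom kernel in practice.
    """
    active = x.nonzero(as_tuple=True)[0]   # indices of surviving activations
    return W[:, active] @ x[active]

W = torch.randn(4096, 4096)
x = torch.randn(4096)
x[torch.rand(4096) < 0.5] = 0.0            # ~50% activation sparsity
assert torch.allclose(sparse_decode_matvec(W, x), W @ x, atol=1e-2)
```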

Older models like OPT-175B show high activation sparsity, enabling methods like DejaVu to achieve significant speedups. However, newer models like LLaMA have moved to SwiGLU variants, making it harder to apply such methods. Recent research has attempted to ‘recover’ models that exhibit activation sparsity, but these approaches require extensive retraining on massive datasets.

Motivating Study: Distributional Properties of Activations in LLMs

Research has shown that hidden states in LLMs exhibit outliers and are zero-centered with similar distributional shapes across layers. Specifically, states before MLP and Attention Blocks are Gaussian-shaped, while intermediate states are Laplacian-shaped. This suggests that many low-magnitude activations can be pruned with negligible model degradation, a concept also observed in other studies like CATS.
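
As a rough illustration (not a result from the study), the snippet below samples a Gaussian and a Laplacian distribution and shows that a magnitude cutoff chosen from the empirical distribution prunes a fixed fraction of low-magnitude entries in either case; the helper name is hypothetical.

```python
import torch

def threshold_for_sparsity(samples: torch.Tensor, sparsity: float) -> float:
    """Magnitude cutoff below which `sparsity` of the entries fall."""
    return torch.quantile(samples.abs().float(), sparsity).item()

# Stand-ins for the two shapes described above: pre-block states
# (Gaussian-ish) and intermediate states (Laplacian-ish).
gaussian_like = torch.randn(100_000)
laplacian_like = torch.distributions.Laplace(0.0, 1.0).sample((100_000,))

for name, s in [("gaussian", gaussian_like), ("laplacian", laplacian_like)]:
    t = threshold_for_sparsity(s, 0.40)
    pruned_frac = (s.abs() < t).float().mean().item()
    print(f"{name}: cutoff {t:.3f} prunes {pruned_frac:.0%} of entries")
```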

TEAL

TEAL introduces an optimization by sparsifying every tensor in the model, achieving near-zero degradation at 25% sparsity and minimal degradation at 40% sparsity. At 50% sparsity, Llama-3 variants show slightly more degradation than the older Llama-2 and Mistral variants. TEAL outperforms CATS by sparsifying every tensor and by choosing to sparsify through the input, which yields lower error.
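
One way to picture "sparsifying every tensor through the input" is to prune the input of each linear projection just before its matrix multiply. The sketch below does this with a forward pre-hook on every nn.Linear; the hook and helper are assumptions for illustration, not TEAL's code, which uses calibrated per-tensor thresholds and custom kernels.

```python
import torch
import torch.nn as nn

def attach_input_sparsity(model: nn.Module, sparsity: float) -> None:
    """Prune the input of every nn.Linear to the target sparsity level."""
    def prune_input(module, inputs):
        x = inputs[0]
        thresh = torch.quantile(x.abs().float(), sparsity)
        return (torch.where(x.abs() >= thresh, x, torch.zeros_like(x)),)

    for module in model.modules():
        if isinstance(module, nn.Linear):
            module.register_forward_pre_hook(prune_input)

# Toy MLP just to show the mechanics (not a real LLaMA block).
mlp = nn.Sequential(nn.Linear(4096, 11008), nn.SiLU(), nn.Linear(11008, 4096))
attach_input_sparsity(mlp, sparsity=0.4)
out = mlp(torch.randn(1, 4096))
```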

Hardware-Aware Speed-up

To benchmark real-world speedups, TEAL was integrated with GPT-Fast, achieving speedups of up to 1.53x and 1.8x at 40% and 50% sparsity, respectively. While the kernel is faster than cuBLAS at 0% sparsity, there is still room for further optimization.
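
For context, wall-clock gains like these are typically measured by timing repeated single-token decode steps on the GPU. The snippet below is a generic CUDA-event timing harness, assuming a CUDA device and a `decode_step` callable standing in for one forward pass; it is not the GPT-Fast benchmark itself.

```python
import torch

def time_decode(decode_step, iters: int = 100, warmup: int = 10) -> float:
    """Average milliseconds per call, measured with CUDA events."""
    for _ in range(warmup):
        decode_step()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        decode_step()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Hypothetical usage: compare a dense decode step against a sparsified one.
# speedup = time_decode(dense_step) / time_decode(sparse_step)
```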

Compatibility with Quantization

TEAL also demonstrates compatibility with quantization, another technique for efficient LLM inference. Combining activation sparsity and quantization unlocks new regimes for transferring memory to GPU registers, allowing for higher inference speed-ups.
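
A rough sketch of how the two can compose, assuming simple symmetric int8 weight quantization and the same activation-gather idea as above; this is illustrative, not TEAL's quantized kernel.

```python
import torch

def quantize_int8(W: torch.Tensor):
    """Symmetric per-tensor int8 weight quantization (illustrative)."""
    scale = W.abs().max() / 127.0
    return torch.round(W / scale).to(torch.int8), scale

def sparse_int8_matvec(W_q, scale, x):
    """Combine both savings: 1-byte weights and skipped (zeroed) activations."""
    active = x.nonzero(as_tuple=True)[0]
    return (W_q[:, active].float() * scale) @ x[active]

W = torch.randn(4096, 4096)
W_q, scale = quantize_int8(W)
x = torch.randn(4096)
x[torch.rand(4096) < 0.5] = 0.0
print((sparse_int8_matvec(W_q, scale, x) - W @ x).abs().max())  # small quant error
```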

Applications

TEAL’s most immediate application is accelerating inference in resource-constrained edge settings, particularly in single-batch scenarios. It also aids inference providers like Together AI, which hosts over 100 open-source models across a large fleet of GPUs, by serving models more efficiently.

Image source: Shutterstock
