NVIDIA Delves into RAPIDS cuVS IVF-PQ for Accelerated Vector Search


Zach Anderson


Jul 18, 2024 20:12

NVIDIA explores the RAPIDS cuVS IVF-PQ algorithm, enhancing vector search performance through compression and GPU acceleration.

In a detailed blog post, NVIDIA has provided insights into their RAPIDS cuVS IVF-PQ algorithm, which aims to accelerate vector search by leveraging GPU technology and advanced compression techniques. This is part one of a two-part series that continues from their previous exploration of the IVF-Flat algorithm.

IVF-PQ Algorithm Introduction

The blog post introduces IVF-PQ (Inverted File Index with Product Quantization), an algorithm designed to enhance search performance and reduce memory usage by storing data in a compressed form. This method, however, comes at the cost of some accuracy, a trade-off that will be further explored in the second part of the series.

IVF-PQ builds upon the concepts of IVF-Flat, which uses an inverted file index to limit the search complexity to a smaller subset of data through clustering. Product quantization (PQ) adds another layer of compression by encoding database vectors, making the process more efficient for large datasets.
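
To make the PQ idea concrete, here is a minimal NumPy sketch (not the cuVS implementation): each database vector is split into sub-vectors, and every sub-vector is replaced by the index of its nearest codebook entry. The sub-space count and codebook size below are illustrative assumptions, not cuVS defaults.

    import numpy as np

    rng = np.random.default_rng(0)

    dim = 96          # vector dimensionality (as in the DEEP dataset)
    pq_dim = 8        # number of sub-vectors, an assumed illustrative value
    sub_dim = dim // pq_dim
    n_codes = 256     # codebook entries per sub-space -> one-byte codes

    # Toy codebooks; in practice these are trained (e.g., with k-means).
    codebooks = rng.standard_normal((pq_dim, n_codes, sub_dim)).astype(np.float32)

    def pq_encode(vec):
        # Replace each sub-vector with the index of its nearest codebook entry.
        parts = vec.reshape(pq_dim, sub_dim)
        codes = np.empty(pq_dim, dtype=np.uint8)
        for j in range(pq_dim):
            dists = np.linalg.norm(codebooks[j] - parts[j], axis=1)
            codes[j] = np.argmin(dists)
        return codes

    def pq_decode(codes):
        # Reconstruct an approximation of the original vector from its codes.
        return np.concatenate([codebooks[j, codes[j]] for j in range(pq_dim)])

    vec = rng.standard_normal(dim).astype(np.float32)
    codes = pq_encode(vec)     # 8 bytes of codes instead of 96 * 4 = 384 bytes
    approx = pq_decode(codes)  # lossy reconstruction, hence the accuracy trade-off

Under these assumed settings each vector shrinks from 384 bytes of float32 data to 8 bytes of codes, which is the kind of reduction that lets billion-scale indexes fit in GPU memory.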

Performance Benchmarks

NVIDIA shared benchmarks using the DEEP dataset, which contains a billion records of 96 dimensions each, amounting to 360 GiB in size. A typical IVF-PQ configuration compresses this into an index of 54 GiB without significantly impacting search performance, or as small as 24 GiB with a slight slowdown. This compression allows the index to fit into GPU memory.
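
A back-of-the-envelope calculation shows how those figures line up. The PQ settings below are assumptions chosen to land near the cited index sizes, not the exact configurations from NVIDIA's post, and centroid and padding overhead is ignored.

    GiB = 1024 ** 3
    n_vectors = 1_000_000_000   # one billion records
    dim = 96

    # Raw float32 dataset: ~358 GiB, matching the ~360 GiB figure.
    raw_bytes = n_vectors * dim * 4
    print(f"raw dataset: {raw_bytes / GiB:.0f} GiB")

    def index_bytes(pq_dim, pq_bits, id_bytes=8):
        # Approximate per-vector storage: PQ codes plus a vector ID.
        return n_vectors * (pq_dim * pq_bits / 8 + id_bytes)

    # 48 sub-spaces at 8 bits per code -> ~52 GiB, near the cited 54 GiB.
    print(f"pq_dim=48, pq_bits=8: {index_bytes(48, 8) / GiB:.0f} GiB")

    # 32 sub-spaces at 4 bits per code -> ~22 GiB, near the cited 24 GiB.
    print(f"pq_dim=32, pq_bits=4: {index_bytes(32, 4) / GiB:.0f} GiB")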

Comparisons with the popular CPU algorithm HNSW on a 100-million-vector subset of the DEEP dataset show that cuVS IVF-PQ can significantly accelerate both index building and vector search.

Algorithm Overview

IVF-PQ follows a two-step process: a coarse search and a fine search. The coarse search is identical to IVF-Flat, while the fine search involves calculating distances between query points and vectors in probed clusters, but with the vectors stored in a compressed format.
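
The coarse step can be sketched in a few lines of NumPy: compute the distance from the query to every cluster centroid and keep the closest n_probes clusters for the fine search. The cluster count and probe count below are assumed values for illustration.

    import numpy as np

    rng = np.random.default_rng(1)

    n_lists = 1024   # number of IVF clusters (assumed)
    n_probes = 20    # clusters scanned per query (assumed)
    dim = 96

    # Toy centroids; during index build they come from clustering the dataset.
    centroids = rng.standard_normal((n_lists, dim)).astype(np.float32)
    query = rng.standard_normal(dim).astype(np.float32)

    # Coarse search: rank clusters by centroid distance, keep the n_probes closest.
    coarse_dists = np.linalg.norm(centroids - query, axis=1)
    probed_clusters = np.argpartition(coarse_dists, n_probes)[:n_probes]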

This compression is achieved through PQ, which approximates a vector using two-level quantization. This allows IVF-PQ to fit more data into GPU memory, enhancing memory bandwidth utilization and speeding up the search process.
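
The fine search can then be illustrated with a per-query lookup table (LUT): the partial distances between the query's sub-vectors and every codebook entry are computed once, so the approximate distance to any compressed vector is just pq_dim table lookups and a sum, with no decompression. This is a self-contained NumPy sketch of the idea, not the cuVS kernel, and all sizes are assumed.

    import numpy as np

    rng = np.random.default_rng(2)

    dim, pq_dim, n_codes = 96, 8, 256        # assumed illustrative settings
    sub_dim = dim // pq_dim

    # Toy codebooks and one probed cluster of already-encoded database vectors.
    codebooks = rng.standard_normal((pq_dim, n_codes, sub_dim)).astype(np.float32)
    cluster_codes = rng.integers(0, n_codes, size=(5000, pq_dim), dtype=np.uint8)

    query = rng.standard_normal(dim).astype(np.float32)
    q_parts = query.reshape(pq_dim, sub_dim)

    # Per-query lookup table: squared distance from each query sub-vector to
    # every codebook entry (cuVS keeps this table in GPU shared memory when it fits).
    lut = np.empty((pq_dim, n_codes), dtype=np.float32)
    for j in range(pq_dim):
        diff = codebooks[j] - q_parts[j]
        lut[j] = np.einsum("ij,ij->i", diff, diff)

    # Approximate squared distances to every vector in the cluster:
    # pq_dim lookups plus a sum per vector, done directly on the compressed codes.
    approx_dists = lut[np.arange(pq_dim), cluster_codes].sum(axis=1)

    k = 10
    candidates = np.argsort(approx_dists)[:k]   # best candidates from this cluster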

Optimizations and Performance

NVIDIA has implemented various optimizations in cuVS to ensure the IVF-PQ algorithm performs efficiently on GPUs. These include:

  • Fusing operations to reduce output size and optimize memory bandwidth utilization.
  • Storing the lookup table (LUT) in GPU shared memory when possible for faster access.
  • Using a custom 8-bit floating point data type in the LUT for faster data conversion.
  • Aligning data in 16-byte chunks to optimize data transfers.
  • Implementing an “early stop” check to avoid unnecessary distance computations.

NVIDIA’s benchmarks on a 100-million-scale dataset show that IVF-PQ outperforms IVF-Flat, particularly with larger batch sizes, achieving up to 3-4 times the number of queries per second.
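
To tie the pieces together, below is a rough sketch of building and querying an IVF-PQ index through the cuVS Python API. The module path and parameter names (IndexParams, SearchParams, n_lists, pq_dim, pq_bits, n_probes, lut_dtype) are given from memory and should be checked against the cuVS documentation for the release in use; the numeric values are illustrative assumptions.

    # Hedged sketch of cuVS IVF-PQ usage; verify names against the cuVS docs.
    import cupy as cp
    import numpy as np
    from cuvs.neighbors import ivf_pq

    n_samples, dim, n_queries, k = 1_000_000, 96, 10_000, 10
    dataset = cp.random.random_sample((n_samples, dim), dtype=cp.float32)
    queries = cp.random.random_sample((n_queries, dim), dtype=cp.float32)

    # Build: cluster count and PQ compression settings (assumed values).
    index_params = ivf_pq.IndexParams(n_lists=1024, pq_dim=48, pq_bits=8)
    index = ivf_pq.build(index_params, dataset)

    # Search: n_probes trades recall for speed; a reduced-precision LUT dtype
    # mirrors the lookup-table optimization described above (assumed knob).
    search_params = ivf_pq.SearchParams(n_probes=20, lut_dtype=np.float16)
    distances, neighbors = ivf_pq.search(search_params, index, queries, k)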

Conclusion

IVF-PQ is a robust ANN search algorithm that leverages clustering and compression to enhance search performance and throughput. The first part of NVIDIA’s blog series provides a comprehensive overview of the algorithm’s workings and its advantages on GPU platforms. For more detailed performance tuning recommendations, NVIDIA encourages readers to explore the second part of their series.

For more information, visit the NVIDIA Technical Blog.

Image source: Shutterstock
