NVIDIA Delves into RAPIDS cuVS IVF-PQ for Accelerated Vector Search


Zach Anderson


Jul 18, 2024 20:12

NVIDIA explores the RAPIDS cuVS IVF-PQ algorithm, enhancing vector search performance through compression and GPU acceleration.

In a detailed blog post, NVIDIA has provided insights into their RAPIDS cuVS IVF-PQ algorithm, which aims to accelerate vector search by leveraging GPU technology and advanced compression techniques. This is part one of a two-part series that continues from their previous exploration of the IVF-Flat algorithm.

IVF-PQ Algorithm Introduction

The blog post introduces IVF-PQ (Inverted File Index with Product Quantization), an algorithm designed to enhance search performance and reduce memory usage by storing data in a compressed form. This method, however, comes at the cost of some accuracy, a trade-off that will be further explored in the second part of the series.

IVF-PQ builds upon the concepts of IVF-Flat, which uses an inverted file index to limit the search complexity to a smaller subset of data through clustering. Product quantization (PQ) adds another layer of compression by encoding database vectors, making the process more efficient for large datasets.
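
To make the PQ idea concrete, here is a minimal NumPy sketch (not the cuVS implementation): each database vector is split into sub-vectors, and every sub-vector is replaced by the index of its nearest codebook entry. The sub-space count and codebook size below are illustrative assumptions, not cuVS defaults.

    import numpy as np

    rng = np.random.default_rng(0)

    dim = 96          # vector dimensionality (as in the DEEP dataset)
    pq_dim = 8        # number of sub-vectors, an assumed illustrative value
    sub_dim = dim // pq_dim
    n_codes = 256     # codebook entries per sub-space -> one-byte codes

    # Toy codebooks; in practice these are trained (e.g., with k-means).
    codebooks = rng.standard_normal((pq_dim, n_codes, sub_dim)).astype(np.float32)

    def pq_encode(vec):
        # Replace each sub-vector with the index of its nearest codebook entry.
        parts = vec.reshape(pq_dim, sub_dim)
        codes = np.empty(pq_dim, dtype=np.uint8)
        for j in range(pq_dim):
            dists = np.linalg.norm(codebooks[j] - parts[j], axis=1)
            codes[j] = np.argmin(dists)
        return codes

    def pq_decode(codes):
        # Reconstruct an approximation of the original vector from its codes.
        return np.concatenate([codebooks[j, codes[j]] for j in range(pq_dim)])

    vec = rng.standard_normal(dim).astype(np.float32)
    codes = pq_encode(vec)     # 8 bytes of codes instead of 96 * 4 = 384 bytes
    approx = pq_decode(codes)  # lossy reconstruction, hence the accuracy trade-off

Under these assumed settings each vector shrinks from 384 bytes of float32 data to 8 bytes of codes, which is the kind of reduction that lets billion-scale indexes fit in GPU memory.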

Performance Benchmarks

NVIDIA shared benchmarks using the DEEP dataset, which contains a billion records of 96 dimensions each, amounting to 360 GiB in size. A typical IVF-PQ configuration compresses this into an index of 54 GiB without significantly impacting search performance, or as small as 24 GiB with a slight slowdown. This compression allows the index to fit into GPU memory.
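
A back-of-the-envelope calculation shows how those figures line up. The PQ settings below are assumptions chosen to land near the cited index sizes, not the exact configurations from NVIDIA's post, and centroid and padding overhead is ignored.

    GiB = 1024 ** 3
    n_vectors = 1_000_000_000   # one billion records
    dim = 96

    # Raw float32 dataset: ~358 GiB, matching the ~360 GiB figure.
    raw_bytes = n_vectors * dim * 4
    print(f"raw dataset: {raw_bytes / GiB:.0f} GiB")

    def index_bytes(pq_dim, pq_bits, id_bytes=8):
        # Approximate per-vector storage: PQ codes plus a vector ID.
        return n_vectors * (pq_dim * pq_bits / 8 + id_bytes)

    # 48 sub-spaces at 8 bits per code -> ~52 GiB, near the cited 54 GiB.
    print(f"pq_dim=48, pq_bits=8: {index_bytes(48, 8) / GiB:.0f} GiB")

    # 32 sub-spaces at 4 bits per code -> ~22 GiB, near the cited 24 GiB.
    print(f"pq_dim=32, pq_bits=4: {index_bytes(32, 4) / GiB:.0f} GiB")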

Comparisons with the popular CPU algorithm HNSW on a 100-million-vector subset of the DEEP dataset show that cuVS IVF-PQ can significantly accelerate both index building and vector search.

Algorithm Overview

IVF-PQ follows a two-step process: a coarse search and a fine search. The coarse search is identical to IVF-Flat, while the fine search involves calculating distances between query points and vectors in probed clusters, but with the vectors stored in a compressed format.
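
The coarse step can be sketched in a few lines of NumPy: compute the distance from the query to every cluster centroid and keep the closest n_probes clusters for the fine search. The cluster count and probe count below are assumed values for illustration.

    import numpy as np

    rng = np.random.default_rng(1)

    n_lists = 1024   # number of IVF clusters (assumed)
    n_probes = 20    # clusters scanned per query (assumed)
    dim = 96

    # Toy centroids; during index build they come from clustering the dataset.
    centroids = rng.standard_normal((n_lists, dim)).astype(np.float32)
    query = rng.standard_normal(dim).astype(np.float32)

    # Coarse search: rank clusters by centroid distance, keep the n_probes closest.
    coarse_dists = np.linalg.norm(centroids - query, axis=1)
    probed_clusters = np.argpartition(coarse_dists, n_probes)[:n_probes]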

This compression is achieved through PQ, which approximates a vector using two-level quantization. This allows IVF-PQ to fit more data into GPU memory, enhancing memory bandwidth utilization and speeding up the search process.
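
The fine search can then be illustrated with a per-query lookup table (LUT): the partial distances between the query's sub-vectors and every codebook entry are computed once, so the approximate distance to any compressed vector is just pq_dim table lookups and a sum, with no decompression. This is a self-contained NumPy sketch of the idea, not the cuVS kernel, and all sizes are assumed.

    import numpy as np

    rng = np.random.default_rng(2)

    dim, pq_dim, n_codes = 96, 8, 256        # assumed illustrative settings
    sub_dim = dim // pq_dim

    # Toy codebooks and one probed cluster of already-encoded database vectors.
    codebooks = rng.standard_normal((pq_dim, n_codes, sub_dim)).astype(np.float32)
    cluster_codes = rng.integers(0, n_codes, size=(5000, pq_dim), dtype=np.uint8)

    query = rng.standard_normal(dim).astype(np.float32)
    q_parts = query.reshape(pq_dim, sub_dim)

    # Per-query lookup table: squared distance from each query sub-vector to
    # every codebook entry (cuVS keeps this table in GPU shared memory when it fits).
    lut = np.empty((pq_dim, n_codes), dtype=np.float32)
    for j in range(pq_dim):
        diff = codebooks[j] - q_parts[j]
        lut[j] = np.einsum("ij,ij->i", diff, diff)

    # Approximate squared distances to every vector in the cluster:
    # pq_dim lookups plus a sum per vector, done directly on the compressed codes.
    approx_dists = lut[np.arange(pq_dim), cluster_codes].sum(axis=1)

    k = 10
    candidates = np.argsort(approx_dists)[:k]   # best candidates from this cluster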

Optimizations and Performance

NVIDIA has implemented various optimizations in cuVS to ensure the IVF-PQ algorithm performs efficiently on GPUs. These include:

  • Fusing operations to reduce output size and optimize memory bandwidth utilization.
  • Storing the lookup table (LUT) in GPU shared memory when possible for faster access.
  • Using a custom 8-bit floating point data type in the LUT for faster data conversion.
  • Aligning data in 16-byte chunks to optimize data transfers.
  • Implementing an “early stop” check to avoid unnecessary distance computations.

NVIDIA’s benchmarks on a 100-million-scale dataset show that IVF-PQ outperforms IVF-Flat, particularly with larger batch sizes, achieving up to 3-4 times the number of queries per second.
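
To tie the pieces together, below is a rough sketch of building and querying an IVF-PQ index through the cuVS Python API. The module path and parameter names (IndexParams, SearchParams, n_lists, pq_dim, pq_bits, n_probes, lut_dtype) are given from memory and should be checked against the cuVS documentation for the release in use; the numeric values are illustrative assumptions.

    # Hedged sketch of cuVS IVF-PQ usage; verify names against the cuVS docs.
    import cupy as cp
    import numpy as np
    from cuvs.neighbors import ivf_pq

    n_samples, dim, n_queries, k = 1_000_000, 96, 10_000, 10
    dataset = cp.random.random_sample((n_samples, dim), dtype=cp.float32)
    queries = cp.random.random_sample((n_queries, dim), dtype=cp.float32)

    # Build: cluster count and PQ compression settings (assumed values).
    index_params = ivf_pq.IndexParams(n_lists=1024, pq_dim=48, pq_bits=8)
    index = ivf_pq.build(index_params, dataset)

    # Search: n_probes trades recall for speed; a reduced-precision LUT dtype
    # mirrors the lookup-table optimization described above (assumed knob).
    search_params = ivf_pq.SearchParams(n_probes=20, lut_dtype=np.float16)
    distances, neighbors = ivf_pq.search(search_params, index, queries, k)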

Conclusion

IVF-PQ is a robust ANN search algorithm that leverages clustering and compression to enhance search performance and throughput. The first part of NVIDIA’s blog series provides a comprehensive overview of the algorithm’s workings and its advantages on GPU platforms. For more detailed performance tuning recommendations, NVIDIA encourages readers to explore the second part of their series.

For more information, visit the NVIDIA Technical Blog.

Image source: Shutterstock
