Enhancing AI Search Precision: NVIDIA Boosts RAG Pipelines with Re-Ranking


Alvin Lang


Jul 30, 2024 18:19

NVIDIA introduces re-ranking to improve the precision and relevance of AI-driven enterprise search results, enhancing RAG pipelines and semantic search.


In the rapidly evolving landscape of AI-driven applications, re-ranking has emerged as a pivotal technique to enhance the precision and relevance of enterprise search results, according to the NVIDIA Technical Blog. By leveraging advanced machine learning algorithms, re-ranking refines initial search outputs to better align with user intent and context, significantly improving the effectiveness of semantic search.

Role of Re-Ranking in AI

Re-ranking plays a crucial role in optimizing retrieval-augmented generation (RAG) pipelines, ensuring that large language models (LLMs) operate with the most pertinent and high-quality information. This dual benefit, enhancing both semantic search and RAG pipelines, makes re-ranking an indispensable tool for enterprises aiming to deliver superior search experiences and maintain a competitive edge in the digital marketplace.

What is Re-Ranking?

Re-ranking is a sophisticated technique used to enhance the relevance of search results by utilizing the advanced language understanding capabilities of LLMs. Initially, a set of candidate documents or passages is retrieved using traditional information retrieval methods such as BM25 or vector similarity search. These candidates are then fed into an LLM that analyzes the semantic relevance between the query and each document. The LLM assigns relevance scores, enabling the re-ordering of documents to prioritize the most pertinent ones.

This process significantly improves the quality of search results by going beyond mere keyword matching to understand the context and meaning of the query and documents. Re-ranking is typically used as a second stage after an initial fast retrieval step, ensuring that only the most relevant documents are presented to the user. It can also combine results from multiple data sources and integrate into a RAG pipeline to further ensure that context is ideally tuned for the specific query.
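
As a concrete illustration of this two-stage pattern, the following minimal sketch over-fetches candidates with a fast vector search and then re-orders them with the reranking model. It assumes an NVIDIA API key is configured, that query is the user query, and that texts holds the document corpus; the FAISS store and the retrieval depths (k=20, top_n=5) are illustrative choices, not prescriptions from the original post.

    from langchain_community.vectorstores import FAISS
    from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings, NVIDIARerank

    # Stage 1: fast initial retrieval over-fetches candidates with vector search.
    store = FAISS.from_texts(texts, NVIDIAEmbeddings())
    candidates = store.similarity_search(query, k=20)

    # Stage 2: the reranking model scores query-document semantic relevance
    # and returns the candidates re-ordered by those scores.
    reranker = NVIDIARerank(top_n=5)
    docs = reranker.compress_documents(query=query, documents=candidates)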

NVIDIA’s Implementation of Re-Ranking

The NVIDIA Technical Blog post illustrates re-ranking with the NVIDIA NeMo Retriever reranking NIM. This transformer encoder, a LoRA fine-tuned version of Mistral-7B, uses only the first 16 layers for higher throughput. The last embedding output by the decoder model serves as the pooling strategy, and a binary classification head is fine-tuned for the ranking task.

To access the NVIDIA NeMo Retriever collection of world-class information retrieval microservices, see the NVIDIA API Catalog.

Combining Results from Multiple Data Sources

In addition to enhancing accuracy for a single data source, re-ranking can be used to combine multiple data sources in a RAG pipeline. Consider a pipeline with data from a semantic store and a BM25 store. Each store is queried independently and returns results that the individual store considers to be highly relevant. Figuring out the overall relevance of the results is where re-ranking comes into play.
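
The combined example below assumes two candidate lists: docs from the semantic store and bm25_docs from the keyword store. A minimal sketch of producing them might look like the following; BM25Retriever lives in the langchain_community package (and needs rank_bm25 installed), and the retrieval depths are illustrative.

    from langchain_community.retrievers import BM25Retriever

    # Semantic (dense vector) candidates; `store` comes from the earlier sketch.
    docs = store.similarity_search(query, k=10)

    # Keyword (BM25) candidates over the same corpus.
    bm25_retriever = BM25Retriever.from_texts(texts)
    bm25_retriever.k = 10
    bm25_docs = bm25_retriever.invoke(query)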

The following code example combines the previous semantic search results with BM25 results. The results in combined_docs are ordered by their relevance to the query by the reranking NIM.

    # Merge the candidates from both stores and let the reranking NIM order them.
    all_docs = docs + bm25_docs
    reranker.top_n = 5
    combined_docs = reranker.compress_documents(query=query, documents=all_docs)

Connecting to a RAG Pipeline

In addition to using re-ranking independently, it can be added to a RAG pipeline to further enhance responses by ensuring that they use the most relevant chunks for augmenting the original query.

In this case, connect the compression_retriever object from the previous step to the RAG pipeline.
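
The compression_retriever object is constructed in a step not shown in this excerpt. A plausible construction, assuming LangChain's ContextualCompressionRetriever is used to wrap the base vector store retriever with the reranker, might look like this:

    from langchain.retrievers import ContextualCompressionRetriever

    # Hypothetical: the reranker acts as the document compressor over the
    # base vector store retriever defined earlier.
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=reranker,
        base_retriever=store.as_retriever(search_kwargs={"k": 20}),
    )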

    from langchain.chains import RetrievalQA
    from langchain_nvidia_ai_endpoints import ChatNVIDIA

    # RetrievalQA chain whose retriever re-ranks chunks before augmenting the query.
    chain = RetrievalQA.from_chain_type(
        llm=ChatNVIDIA(temperature=0),
        retriever=compression_retriever,
    )
    result = chain({"query": query})
    print(result.get("result"))

The RAG pipeline now uses the correct top-ranked chunk and summarizes the main insights:

    The A100 GPU is used for training the 7B model in the supervised fine-tuning/instruction tuning ablation study. The training is performed on 16 A100 GPU nodes, with each node having 8 GPUs. The training hours for each stage of the 7B model are: projector initialization: 4 hours; visual language pre-training: 30 hours; and visual instruction-tuning: 6 hours. The total training time corresponds to 5.1k GPU hours, with most of the computation being spent on the pre-training stage. The training time could potentially be reduced by at least 30% with proper optimization. The high image resolution of 336x336 used in the training corresponds to 576 tokens/image.

Conclusion

RAG has emerged as a powerful approach, combining the strengths of LLMs and dense vector representations. By using dense vector representations, RAG models can scale efficiently, making them well-suited for large-scale enterprise applications, such as multilingual customer service chatbots and code generation agents.

As LLMs continue to evolve, RAG will play an increasingly important role in driving innovation and delivering high-quality, intelligent systems that can understand and generate human-like language.

When building a RAG pipeline, it's crucial to correctly split the vector store documents into chunks by optimizing the chunk size for the specific content and selecting an LLM with a suitable context length. In some cases, complex chains of multiple LLMs may be required. To optimize RAG performance and measure success, use a collection of robust evaluators and metrics.
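
As a hedged illustration of the chunking step, LangChain's RecursiveCharacterTextSplitter can split documents before indexing; the chunk size and overlap below are illustrative starting points, not values recommended in the original post.

    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # Illustrative values; tune chunk_size and chunk_overlap to the content
    # and to the LLM's context length.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(documents)  # `documents` loaded earlier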

For more information about additional models and chains, see NVIDIA AI LangChain endpoints.

Image source: Shutterstock
