NVIDIA Introduces NeMo Retriever for Enhanced RAG Pipelines

Enterprises are increasingly looking to leverage their vast data reserves to improve operational efficiency, reduce costs, and boost productivity. NVIDIA’s latest offering, the NeMo Retriever, aims to facilitate this by enabling developers to build and deploy advanced retrieval-augmented generation (RAG) pipelines. According to the NVIDIA Technical Blog, the NeMo Retriever collection introduces four new community-based NeMo Retriever NIMs designed for text embedding and reranking.
New Models for Enhanced Text Retrieval

NVIDIA has announced the release of three NeMo Retriever Embedding NIMs and one NeMo Retriever Reranking NIM. These models are:

- NV-EmbedQA-E5-v5: Optimized for text question-answering retrieval.
- NV-EmbedQA-Mistral7B-v2: A multilingual model fine-tuned for text embedding and accurate question answering.
- Snowflake-Arctic-Embed-L: An optimized model for text embedding.
- NV-RerankQA-Mistral4B-v3: Fine-tuned for text reranking and accurate question answering.
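NIM microservices expose an industry-standard HTTP API, so a retrieval call is an ordinary JSON request. As a minimal sketch, the body of an embedding request to a locally running NIM might be built like this — the URL, port, and the `input_type` field ("query" at search time, "passage" at ingestion time) are illustrative assumptions, not verified defaults:

```python
import json

# Hypothetical local endpoint; a deployed NIM container serves an
# OpenAI-compatible API (URL and port are assumptions for this sketch).
NIM_EMBED_URL = "http://localhost:8000/v1/embeddings"

def build_embed_request(texts, model="nvidia/nv-embedqa-e5-v5", input_type="query"):
    """Build the JSON body for an embedding request.

    `input_type` (assumed field) distinguishes queries from passages so
    the model can encode the two sides of retrieval asymmetrically.
    """
    return {
        "model": model,
        "input": texts,
        "input_type": input_type,
    }

payload = build_embed_request(["What is a RAG pipeline?"])
body = json.dumps(payload)

# Sending the request would require a running NIM container, e.g.:
#   import requests
#   resp = requests.post(NIM_EMBED_URL, json=payload)
#   vectors = [d["embedding"] for d in resp.json()["data"]]
```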
Understanding the Retrieval Pipeline

The retrieval pipeline employs embedding models to generate vector representations of text for semantic encoding, stored in a vector database. When a user queries the database, the question is encoded into a vector, which is matched against stored vectors to retrieve relevant information. Reranking models then score the relevance of the retrieved text chunks, ensuring the most accurate information is presented.

Embedding models offer speed and cost-efficiency, while reranking models provide higher accuracy. By combining these models, enterprises can achieve a balance between performance and cost, using embedding models to identify relevant chunks and reranking models to refine the results.
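The two-stage pattern described above can be sketched in a few lines. This is a self-contained toy, not NVIDIA's implementation: the hand-made vectors stand in for an embedding NIM's output, and a simple term-overlap score stands in for a reranking NIM, so only the retrieve-then-rerank structure is real:

```python
import math

# Toy corpus of text chunks with precomputed "embeddings". In a real
# pipeline these vectors would come from an embedding NIM at ingestion
# time; tiny hand-made vectors keep the sketch self-contained.
CHUNKS = {
    "GPUs accelerate deep learning training.": [0.9, 0.1, 0.0],
    "Vector databases store embeddings.":      [0.1, 0.9, 0.1],
    "Reranking refines retrieved results.":    [0.2, 0.7, 0.6],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, top_k=2):
    """Stage 1: fast, cheap vector search over the whole corpus."""
    ranked = sorted(CHUNKS, key=lambda c: cosine(query_vec, CHUNKS[c]), reverse=True)
    return ranked[:top_k]

def rerank(query_terms, candidates):
    """Stage 2: slower, more accurate scoring of the few retrieved
    candidates (stand-in for a reranking NIM; here: term overlap)."""
    def overlap(text):
        return len(set(query_terms) & set(text.lower().split()))
    return sorted(candidates, key=overlap, reverse=True)

query_vec = [0.1, 0.8, 0.2]           # stand-in for an embedded user query
candidates = retrieve(query_vec)       # embedding search narrows the corpus
best = rerank(["reranking", "results."], candidates)[0]
```

The cost/accuracy split falls out of the structure: the cheap cosine score touches every chunk, while the more expensive reranker only scores the handful of candidates that survive stage 1.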
NeMo Retriever NIMs: Cost and Stability

Cost

NeMo Retriever NIMs are designed to reduce time-to-market and operational costs. These containerized solutions, equipped with industry-standard APIs and Helm charts, facilitate easy and scalable model deployment. Utilizing the NVIDIA AI Enterprise software suite, NIMs maximize model inference efficiency, thereby lowering deployment costs.
Stability

The NIMs are part of the NVIDIA AI Enterprise license, which ensures API stability, security patches, quality assurance, and support, providing a seamless transition from prototype to production for AI-driven enterprises.
Selecting NIMs for Your Pipeline

When designing a retrieval pipeline, developers need to balance accuracy, latency, data ingestion throughput, and production throughput. NVIDIA offers guidelines for selecting the appropriate NIMs based on these factors:

- Maximize throughput and minimize latency: Use NV-EmbedQA-E5-v5 for optimized lightweight embedding model inference.
- Optimize for low-volume, low-velocity databases: Use NV-EmbedQA-Mistral7B-v2 for both ingestion and production to balance throughput and accuracy with low latency.
- Optimize for high-volume, high-velocity data: Combine NV-EmbedQA-E5-v5 for document ingestion with NV-RerankQA-Mistral4B-v3 for reranking to enhance retrieval accuracy.
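The guidelines above amount to a small decision table, which can be made concrete as a lookup. The profile names and return shape are this sketch's own invention; only the model pairings come from the guidelines:

```python
# Workload profile -> NIM pairing, per the selection guidelines.
# Profile names ("max_throughput", etc.) are illustrative labels, not
# official NVIDIA terminology.
GUIDELINES = {
    # Throughput/latency-critical serving: lightweight embedding only.
    "max_throughput": {"embed": "NV-EmbedQA-E5-v5", "rerank": None},
    # Small, slowly changing corpora: one larger model for both stages.
    "low_volume":     {"embed": "NV-EmbedQA-Mistral7B-v2", "rerank": None},
    # Large, fast-changing corpora: cheap ingestion plus a reranker.
    "high_volume":    {"embed": "NV-EmbedQA-E5-v5",
                       "rerank": "NV-RerankQA-Mistral4B-v3"},
}

def select_nims(profile):
    """Return the embedding/reranking NIM pairing for a workload profile."""
    if profile not in GUIDELINES:
        raise ValueError(f"unknown profile: {profile!r}")
    return GUIDELINES[profile]

choice = select_nims("high_volume")
```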
Performance benchmarks on datasets such as NQ, HotpotQA, FiQA, and TechQA show that NeMo Retriever NIMs achieve significant improvements in embedding and reranking performance, making them suitable for a wide range of enterprise retrieval use cases.
Getting Started

Developers can explore the NVIDIA NeMo Retriever NIMs in the API catalog and access NVIDIA’s generative AI examples on GitHub. NVIDIA also offers labs to try the AI Chatbot with RAG workflow through NVIDIA LaunchPad, allowing for customization and deployment of NIMs across various data environments.