NVIDIA Unveils Blueprint for Enterprise-Scale Multimodal Document Retrieval Pipeline

In
an
exciting
development,
NVIDIA
has
unveiled
a
comprehensive
blueprint
for
building
an
enterprise-scale
multimodal
document
retrieval
pipeline.
This
initiative
leverages
the
company’s
NeMo
Retriever
and
NIM
microservices,
aiming
to
revolutionize
how
businesses
extract
and
utilize
vast
amounts
of
data
from
complex
documents,
according
to

NVIDIA
Technical
Blog.

Harnessing
Untapped
Data

Every
year,
trillions
of
PDF
files
are
generated,
containing
a
wealth
of
information
in
various
formats
such
as
text,
images,
charts,
and
tables.
Traditionally,
extracting
meaningful
data
from
these
documents
has
been
a
labor-intensive
process.
However,
with
the
advent
of
generative
AI
and
retrieval-augmented
generation
(RAG),
this
untapped
data
can
now
be
efficiently
utilized
to
uncover
valuable
business
insights,
thereby
enhancing
employee
productivity
and
reducing
operational
costs.

The
multimodal
PDF
data
extraction
blueprint
introduced
by
NVIDIA
combines
the
power
of
the
NeMo
Retriever
and
NIM
microservices
with
reference
code
and
documentation.
This
combination
allows
for
accurate
extraction
of
knowledge
from
massive
volumes
of
enterprise
data,
enabling
employees
to
make
informed
decisions
swiftly.

Building
the
Pipeline

The
process
of
building
a
multimodal
retrieval
pipeline
on
PDFs
involves
two
key
steps:
ingesting
documents
with
multimodal
data
and
retrieving
relevant
context
based
on
user
queries.

Ingesting
Documents

The
first
step
involves
parsing
PDFs
to
separate
different
modalities
such
as
text,
images,
charts,
and
tables.
Text
is
parsed
as
structured
JSON,
while
pages
are
rendered
as
images.
The
next
step
is
to
extract
textual
metadata
from
these
images
using
various
NIM
microservices:

nv-yolox-structured-image:
Detects
charts,
plots,
and
tables
in
PDFs.
DePlot:
Generates
descriptions
of
charts.
CACHED:
Identifies
various
elements
in
graphs.
PaddleOCR:
Transcribes
text
from
tables
and
charts.

After
extracting
the
information,
it
is
filtered,
chunked,
and
stored
in
a
VectorStore.
The
NeMo
Retriever
embedding
NIM
microservice
converts
the
chunks
into
embeddings
for
efficient
retrieval.

Retrieving
Relevant
Context

When
a
user
submits
a
query,
the
NeMo
Retriever
embedding
NIM
microservice
embeds
the
query
and
retrieves
the
most
relevant
chunks
using
vector
similarity
search.
The
NeMo
Retriever
reranking
NIM
microservice
then
refines
the
results
to
ensure
accuracy.
Finally,
the
LLM
NIM
microservice
generates
a
contextually
relevant
response.

Cost-Effective
and
Scalable

NVIDIA’s
blueprint
offers
significant
benefits
in
terms
of
cost
and
stability.
The
NIM
microservices
are
designed
for
ease
of
use
and

scalability,
allowing
enterprise
application
developers
to
focus
on
application
logic
rather
than
infrastructure.
These
microservices
are
containerized
solutions
that
come
with
industry-standard
APIs
and
Helm
charts
for
easy
deployment.

Moreover,
the
full
suite
of
NVIDIA
AI
Enterprise
software
accelerates
model
inference,
maximizing
the
value
enterprises
derive
from
their
models
and
reducing
deployment
costs.
Performance
tests
have
shown
significant
improvements
in
retrieval
accuracy
and
ingestion
throughput
when
using
NIM
microservices
compared
to
open-source
alternatives.

Collaborations
and
Partnerships

NVIDIA
is
partnering
with
several
data
and
storage
platform
providers,
including
Box,
Cloudera,
Cohesity,
DataStax,
Dropbox,
and
Nexla,
to
enhance
the
capabilities
of
the
multimodal
document
retrieval
pipeline.

Cloudera

Cloudera’s
integration
of
NVIDIA
NIM
microservices
in
its
AI
Inference
service
aims
to
combine
the
exabytes
of
private
data
managed
in
Cloudera
with
high-performance
models
for
RAG
use
cases,
offering
best-in-class
AI
platform
capabilities
for
enterprises.

Cohesity

Cohesity’s
collaboration
with
NVIDIA
aims
to
add
generative
AI
intelligence
to
customers’
data
backups
and
archives,
enabling
quick
and
accurate
extraction
of
valuable
insights
from
millions
of
documents.

Datastax

DataStax
aims
to
leverage
NVIDIA’s
NeMo
Retriever
data
extraction
workflow
for
PDFs
to
enable
customers
to
focus
on
innovation
rather
than
data
integration
challenges.

Dropbox

Dropbox
is
evaluating
the
NeMo
Retriever
multimodal
PDF
extraction
workflow
to
potentially
bring
new
generative
AI
capabilities
to
help
customers
unlock
insights
across
their
cloud
content.

Nexla

Nexla
aims
to
integrate
NVIDIA
NIM
in
its
no-code/low-code
platform
for
Document
ETL,
enabling
scalable
multimodal
ingestion
across
various
enterprise
systems.

Getting
Started

Developers
interested
in
building
a
RAG
application
can
experience
the
multimodal
PDF
extraction
workflow
through
NVIDIA’s
interactive
demo
available
in
the
NVIDIA
API
Catalog.
Early
access
to
the
workflow
blueprint,
along
with
open-source
code
and
deployment
instructions,
is
also
available.

Image
source:
Shutterstock

NVIDIA Unveils Blueprint for Enterprise-Scale Multimodal Document Retrieval Pipeline

Harnessing Untapped Data

Building the Pipeline

Ingesting Documents

Retrieving Relevant Context

Cost-Effective and Scalable

Collaborations and Partnerships