NVIDIA NIM Enhances RAG Applications for Veterinary AI

The
advent
of
large
language
models
(LLMs)
has
significantly
benefited
the
AI
industry,
offering
versatile
tools
capable
of
generating
human-like
text
and
handling
a
wide
range
of
tasks.
However,
while
LLMs
demonstrate
impressive
general
knowledge,
their
performance
in
specialized
fields,
such
as
veterinary
science,
is
limited
when
used
out
of
the
box.
To
enhance
their
utility
in
specific
areas,
two
primary
strategies
are
commonly
adopted
in
the
industry:

fine-tuning
and

retrieval-augmented
generation
(RAG).

Fine-Tuning
vs.
RAG

Fine-tuning
involves
training
the
model
on
a
carefully
curated
and
structured
dataset,
demanding
substantial
hardware
resources,
as
well
as
the
involvement
of
domain
experts,
a
process
that
is
often
time-consuming
and
costly.
Unfortunately,
in
many
fields,
it’s
incredibly
challenging
to
access
domain
experts
in
a
way
that
is
compatible
with
business
constraints.

Conversely,
RAG
involves
building
a
comprehensive
corpus
of
knowledge
literature,
alongside
an
effective
retrieval
system
that
extracts
relevant
text
chunks
to
address
user
queries.
By
adding
this
retrieved
information
to
the
user
query,
LLMs
can
produce
better
answers.
Although
this
approach
still
requires
subject
matter
experts
to
curate
the
best
sources
for
the
dataset,
it
is
more
tractable
and
business-compatible
than
fine-tuning.
Also,
since
extensive
training
of
the
model
isn’t
necessary,
this
approach
is
less
computationally
intensive
and
more
cost-effective.

NVIDIA
NIM
and
NLP
Pipelines

NVIDIA
NIM
streamlines
the
design
of
NLP
pipelines
using
LLMs.
These
microservices
simplify
the
deployment
of
generative
AI
models
across
platforms,
allowing
teams
to
self-host
LLMs
while
offering
standard
APIs
to
build
applications.

NIM
abstracts
model
inference
internals
like
execution
engines
and
runtime
operations,
ensuring
optimal
performance
with

TensorRT-LLM,
vLLM,
and
others.
Key
features
include:

Scalable
deployment
Support
for
diverse
LLM
architectures
with
optimized
engines
Flexible
integration
into
existing
workflows
Enterprise-grade
security
with
safetensors
and
constant
CVE
monitoring

Developers
can
run
NIM
microservices
with
Docker
and
perform
inference
using
APIs.
Specialized
trained
model
weights
can
also
be
used
for
specific
tasks,
such
as
document
parsing,
by
modifying
container
commands.

Reimagining
Veterinary
Care
with
AI

At
AITEM,
a
member
of
the

NVIDIA
Inception
Program
for
startups,
collaboration
with
NVIDIA
has
focused
on
AI-based
solutions
across
multiple
fields,
including
industrial
and
life
sciences.
In
the
veterinary
sector,
AITEM
is
working
on

LAIKA,
an
innovative
AI
copilot
designed
to
assist
veterinarians
by
processing
patient
data
and
offering
diagnostic
suggestions,
guidance,
and
clarifications.

LAIKA
integrates
multiple
LLMs
and
RAG
pipelines.
The
RAG
component
retrieves
relevant
information
from
a
curated
dataset
of
veterinary
resources.
During
preparation,
each
resource
is
divided
into
chunks,
with
embeddings
calculated
and
stored
in
the
RAG
database.
During
inference,
the
query
is
pre-processed
and
its
embeddings
are
computed
and
compared
with
those
in
the
RAG
database
using
geometric
distance
metrics.
The
closest
matches
are
selected
as
the
most
relevant
and
used
to
generate
responses.

Due
to
potential
redundancy
in
the
RAG
database,
multiple
retrieved
chunks
might
contain
the
same
information,
limiting
the
diversity
of
concepts
provided
to
the
answer
system.
To
address
this,
LAIKA
employs
the
Maximal
Marginal
Relevance
(MMR)
algorithm
to
minimize
chunk
redundancy
and
ensure
a
broader
range
of
relevant
information.

NVIDIA
NeMo
Retriever
Reranking
NIM
Microservice

The

NVIDIA
API
Catalog
includes
NeMo
Retriever
NIM
microservices
that
enable
organizations
to
seamlessly
connect
custom
models
to
diverse
business
data
and
deliver
highly
accurate
responses.
The

NVIDIA
Retrieval
QA
Mistral
4B
reranking
NIM
microservice
is
designed
to
assess
the
probability
that
a
given
text
passage
contains
relevant
information
for
answering
a
user
query.
Integrating
this
model
into
the
RAG
pipeline
enables
filtering
out
retrievals
that
do
not
pass
the
reranking
model’s
evaluation,
ensuring
that
only
the
most
relevant
and
accurate
information
is
used.

To
assess
the
impact
of
this
step
on
the
RAG
pipeline,
AITEM
designed
an
experiment:

Extract
a
dataset
of
~100
anonymized
questions
from
LAIKA
users.
Run
the
current
RAG
pipeline
to
retrieve
chunks
for
each
question.
Sort
the
retrieved
chunks
based
on
probabilities
provided
by
the
reranking
model.
Evaluate
each
chunk
for
relevance
to
the
query.
Analyze
the
reranking
model’s
probability
distribution
in
relation
to
the
relevance
determined
in
Step
4.
Compare
the
ranking
of
chunks
in
Step
3
against
their
relevance
from
Step
4.

User
questions
in
LAIKA
can
vary
significantly
in
form.
Some
queries
contain
detailed
explanations
of
a
situation
but
lack
a
specific
question.
Others
contain
precise
inquiries
regarding
research,
while
some
seek
guidance
or
differential
diagnoses
based
on
clinical
cases
or
analysis
documents.

Due
to
the
large
number
of
chunks
per
question,
AITEM
used
the

Llama
3.1
70B
Instruct
NIM
microservice
for
evaluation,
which
is
also
available
in
the
NVIDIA
API
Catalog.

To
better
understand
the
reranking
model’s
performance,
specific
queries
and
model
responses
were
examined
in
detail.
Table
1
highlights
the
top
and
bottom
reranked
chunks
for
a
sample
query
regarding
differential
diagnoses
for
a
cat
losing
weight.

Text	Reranking Logit
Causes of weight loss that can be particularly difficult to diagnose … include gastric disease not causing vomiting, intestinal disease not causing vomiting or diarrhea, hepatic disease …	3.3125
Differential diagnoses for nonspecific signs like anorexia, weight loss, vomiting, and diarrhea … acute pancreatitis is rare in cats, … signs are nonspecific and ill-defined (anorexia, lethargy, weight loss).	2.3222
Severe weight loss (with or without increased appetite) may be noted where there is cancer cachexia, maldigestion/malabsorption … Appetite may be increased in some conditions, such as hyperthyroidism in cats, … However, a normal appetite does not rule out the presence of a serious condition.	2.2265
Overall, weight loss was the most common presenting sign … with little difference between the groups …	-5.0078
Other client complaints include lethargy, anorexia, weight loss, vomiting …	-7.3672
There were 6 British Shorthair, 4 European Shorthair, and 1 Bengal cat … Reported clinical signs by owners included: reduced appetite or anorexia…	-10.3281

Table
1.
Three
highest-ranked
chunks
and
three
lowest-ranked
text
chunks

Figure
4
compares
the
reranking
model
probability
output
distribution
(in
logits)
between
relevant
(good)
and
irrelevant
(bad)
chunks.
The
probabilities
for
good
chunks
are
higher
compared
to
bad
chunks,
and
a
t-test
confirmed
that
this
difference
is
statistically
significant,
with
a
p-value
lower
than
3e-72.

Figure
4.
Distribution
of
reranking
model
output
in
terms
of
logits

Figure
5
shows
the
distribution
difference
in
the
reranking-induced
sorting
positions:
good
chunks
are
predominantly
in
top
positions,
while
bad
chunks
are
lower.
The
Mann-Whitney
test
confirmed
that
these
differences
are
statistically
significant,
resulting
in
a
p-value
lower
than
9e-31.

Figure
5.
Distribution
of
reranking
model-induced
sorting
among
the
retrieved
chunks

Figure
6
shows
the
ranking
distribution
and
helps
define
an
effective
cutoff
point.
In
the
top
five
positions,
most
chunks
are
good,
while
the
majority
of
chunks
in
positions
11-15
are
bad.
Thus,
retaining
only
the
top
five
retrievals
or
another
chosen
number
can
serve
as
one
way
to
effectively
exclude
most
of
the
bad
chunks.

Figure
6.
Balance
between
good
and
bad
chunks
by
position
in
the
sorting
induced
by
the
reranking
model

To
optimize
retrieval
pipelines,
and
minimize
ingestion
costs
while
maximizing
accuracy,
a
lightweight
embedding
model
can
be
paired
with
the
NVIDIA
reranking
NIM
microservice,
to
boost
retrieval
accuracy.
Execution
time
can
be
improved
by
1.75x
(Figure
7).

Figure
7.
NVIDIA
reranking
NIM
microservice
comparison

Better
Answers
with
the
NVIDIA
Reranking
NIM
Microservice

The
results
demonstrate
that
adding
the
NVIDIA
reranking
NIM
microservice
to
the
LAIKA
RAG
pipeline
positively
affects
the
relevance
of
retrieved
chunks.
By
forwarding
more
precise,
specialized
information
to
the
downstream
answering
LLM,
it
equips
the
model
with
the
knowledge
that’s
necessary
for
highly
specialized
fields
like
veterinary
science.

The
NVIDIA
reranking
NIM
microservice,
available
in
the

NVIDIA
API
Catalog,
simplifies
adoption
as
you
can
easily
pull
and
run
the
model
and
infer
its
evaluations
through
APIs.
This
eliminates
stress
related
to
environment
settings
and
manual
optimization,
as
it
comes
pre-quantized
and
optimized
with
NVIDIA
TensorRT
for
almost
any
platform.

For
more
information
and
the
latest
updates
about
LAIKA
and
other
AITEM
projects,
see

AITEM
Solutions
and
follow

LAIKA
and

AITEM
on
LinkedIn.

Image
source:
Shutterstock

NVIDIA NIM Enhances RAG Applications for Veterinary AI

Fine-Tuning vs. RAG

NVIDIA NIM and NLP Pipelines

Reimagining Veterinary Care with AI

NVIDIA NeMo Retriever Reranking NIM Microservice

Better Answers with the NVIDIA Reranking NIM Microservice

Fine-Tuning
vs.
RAG

NVIDIA
NIM
and
NLP
Pipelines

Reimagining
Veterinary
Care
with
AI

NVIDIA
NeMo
Retriever
Reranking
NIM
Microservice

Better
Answers
with
the
NVIDIA
Reranking
NIM
Microservice