Enhancing LLM Tool-Calling Performance with Few-Shot Prompting


Alvin Lang
Jul 24, 2024 19:18

LangChain’s experiments reveal how few-shot prompting significantly boosts LLM tool-calling accuracy, especially for complex tasks.


LangChain has recently unveiled the results of its experiments aimed at enhancing the performance of large language models (LLMs) in tool-calling tasks through few-shot prompting. According to the LangChain Blog, the experiments demonstrate that few-shot prompting significantly improves model accuracy, particularly for complex tasks.

Few-Shot Prompting: A Game Changer

Few-shot prompting involves including example model inputs and desired outputs in the model prompt. Research, including a study referenced by LangChain, has shown that this technique can drastically enhance model performance across a broad spectrum of tasks. However, there are numerous ways to construct few-shot prompts, and few established best practices exist.
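To make the technique concrete, here is a minimal sketch, assuming a LangChain-style chat prompt; the example questions, answers, and prompt wording are illustrative placeholders rather than anything from LangChain’s experiments:

```python
from langchain_core.prompts import ChatPromptTemplate

# Hypothetical example input/output pairs used as few-shot demonstrations.
few_shot_pairs = [
    ("human", "What is 5 plus 3?"),
    ("ai", "8"),
    ("human", "What is 12 minus 4?"),
    ("ai", "8"),
]

# The examples sit between the system prompt and the incoming question.
prompt = ChatPromptTemplate.from_messages(
    [("system", "Answer arithmetic questions concisely.")]
    + few_shot_pairs
    + [("human", "{question}")]
)

messages = prompt.invoke({"question": "What is 7 plus 6?"}).to_messages()
```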

LangChain’s experiments were conducted on two datasets: Query Analysis and Multiverse Math. The Query Analysis dataset involves invoking different search indexes based on user queries, while the Multiverse Math dataset tests function calling in a more complex, agentic workflow. The experiments benchmarked multiple OpenAI and Anthropic models, using various methods of providing few-shot examples to the models.
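The announcement does not spell out the exact tool definitions, so the sketch below only illustrates the Multiverse Math pattern with hypothetical functions whose results deviate from ordinary arithmetic, forcing the model to call the tools rather than compute answers itself. The model name is a placeholder, and any tool-calling chat model could be substituted:

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI  # assumed provider; any tool-calling chat model works

@tool
def add(a: float, b: float) -> float:
    """Add two numbers. In this universe, addition also adds an extra 1.1."""
    return a + b + 1.1  # deliberately non-standard (hypothetical)

@tool
def multiply(a: float, b: float) -> float:
    """Multiply two numbers. In this universe, the product is then scaled by 1.5."""
    return a * b * 1.5  # deliberately non-standard (hypothetical)

llm = ChatOpenAI(model="gpt-4o")  # placeholder model name
llm_with_tools = llm.bind_tools([add, multiply])

response = llm_with_tools.invoke("What is 3 plus 4 in this universe?")
print(response.tool_calls)  # expect a single call to `add` with a=3, b=4
```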

Constructing the Few-Shot Dataset

The few-shot dataset for the Multiverse Math task was created manually and contained 13 datapoints. Different few-shot techniques were employed to evaluate their effectiveness (a sketch contrasting the string and message formats follows the list):

  • Zero-shot: Only a basic system prompt and the question were provided to the model.
  • Few-shot-static-msgs, k=3: Three fixed examples were passed as messages between the system prompt and the human question.
  • Few-shot-dynamic-msgs, k=3: Three dynamically selected examples were passed as messages, chosen by semantic similarity between the current question and the example questions.
  • Few-shot-str, k=13: All thirteen examples were converted into one long string appended to the system prompt.
  • Few-shot-msgs, k=13: All thirteen examples were passed as messages between the system prompt and the human question.
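To see how the string and message variants differ in practice, here is a minimal sketch assuming LangChain’s message classes; the example contents are illustrative rather than taken from the benchmark dataset. In the message variant, each example carries the assistant’s tool calls and the tool’s responses, while the string variant flattens the same examples into the system prompt:

```python
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage, ToolMessage

system = "You must solve math questions in an alternate mathematical universe."

# One worked tool-calling example (contents are illustrative).
example_msgs = [
    HumanMessage("What is 3 plus 4?"),
    AIMessage(
        "",
        tool_calls=[{"name": "add", "args": {"a": 3, "b": 4}, "id": "call_1"}],
    ),
    ToolMessage("8.1", tool_call_id="call_1"),
    AIMessage("3 plus 4 in this universe is 8.1."),
]

question = HumanMessage("What is 5 plus 9?")

# few-shot-msgs: examples inserted as messages between the system prompt and the question.
msgs_variant = [SystemMessage(system)] + example_msgs + [question]

# few-shot-str: the same examples flattened into one string appended to the system prompt.
example_str = "\n".join(
    f"{m.type}: {m.content or getattr(m, 'tool_calls', '')}" for m in example_msgs
)
str_variant = [
    SystemMessage(system + "\n\nHere are some example interactions:\n" + example_str),
    question,
]
```

One plausible reason the message format fares better is that it presents examples in the same structured form the model is expected to produce, though the post itself only reports the empirical comparison.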

Results and Insights

The results revealed several key trends:

  • Few-shot prompting significantly improves performance across the board. For instance, Claude 3 Sonnet’s performance increased from 16% using zero-shot to 52% with three semantically similar examples passed as messages.
  • Using semantically similar examples as messages yields better results than using static examples or strings (a sketch of similarity-based selection follows this list).
  • The Claude models benefit more from few-shot prompting than the GPT models.
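Dynamic selection by semantic similarity can be implemented with an embedding-based example selector. The following sketch uses LangChain’s SemanticSimilarityExampleSelector with placeholder examples; the import paths, vector store, and embedding model are assumptions that may vary across versions and providers, and this is not the benchmark’s actual code:

```python
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

# Hypothetical stand-ins for the hand-written few-shot datapoints.
examples = [
    {"question": "What is 3 plus 4?", "answer": "8.1"},
    {"question": "What is 2 times 5?", "answer": "15.0"},
    {"question": "What is 10 minus 6?", "answer": "4.5"},
]

# Embed the example questions and retrieve the k most similar to the incoming one.
selector = SemanticSimilarityExampleSelector.from_examples(
    examples,
    OpenAIEmbeddings(),
    InMemoryVectorStore,
    k=3,
    input_keys=["question"],  # compare on the question text only
)

selected = selector.select_examples({"question": "What is 7 plus 9?"})
```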

An example question that received an incorrect answer without few-shot prompting was answered correctly once few-shot examples were added, demonstrating the technique’s effectiveness.

Future Directions

The study opens several avenues for future exploration:

  1. Comparing the impact of inserting negative few-shot examples (wrong answers) versus positive ones.
  2. Identifying the best methods for semantic-search retrieval of few-shot examples.
  3. Determining the optimal number of few-shot examples for the best performance-cost trade-off.
  4. Evaluating whether trajectories that include initial errors and subsequent corrections are more beneficial than those that are correct on the first pass.

LangChain invites further benchmarking and ideas for future evaluations to continue advancing the field.

Image source: Shutterstock
