Enhancing LLM Tool-Calling Performance with Few-Shot Prompting


Alvin Lang
Jul 24, 2024 19:18

LangChain’s experiments reveal how few-shot prompting significantly boosts LLM tool-calling accuracy, especially for complex tasks.


LangChain has recently unveiled the results of its experiments aimed at enhancing the performance of large language models (LLMs) in tool-calling tasks through few-shot prompting. According to the LangChain Blog, the experiments demonstrate that few-shot prompting significantly improves model accuracy, particularly for complex tasks.

Few-Shot Prompting: A Game Changer

Few-shot prompting involves including example model inputs and desired outputs in the model prompt. Research, including a study referenced by LangChain, has shown that this technique can drastically enhance model performance across a broad spectrum of tasks. However, there are numerous ways to construct few-shot prompts, and few established best practices exist.
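To make the technique concrete, here is a minimal sketch, assuming a LangChain-style chat prompt; the example questions, answers, and prompt wording are illustrative placeholders rather than anything from LangChain’s experiments:

```python
from langchain_core.prompts import ChatPromptTemplate

# Hypothetical example input/output pairs used as few-shot demonstrations.
few_shot_pairs = [
    ("human", "What is 5 plus 3?"),
    ("ai", "8"),
    ("human", "What is 12 minus 4?"),
    ("ai", "8"),
]

# The examples sit between the system prompt and the incoming question.
prompt = ChatPromptTemplate.from_messages(
    [("system", "Answer arithmetic questions concisely.")]
    + few_shot_pairs
    + [("human", "{question}")]
)

messages = prompt.invoke({"question": "What is 7 plus 6?"}).to_messages()
```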

LangChain’s experiments were conducted on two datasets: Query Analysis and Multiverse Math. The Query Analysis dataset involves invoking different search indexes based on user queries, while the Multiverse Math dataset tests function calling in a more complex, agentic workflow. The experiments benchmarked multiple OpenAI and Anthropic models, using various methods of providing few-shot examples to the models.
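The announcement does not spell out the exact tool definitions, so the sketch below only illustrates the Multiverse Math pattern with hypothetical functions whose results deviate from ordinary arithmetic, forcing the model to call the tools rather than compute answers itself. The model name is a placeholder, and any tool-calling chat model could be substituted:

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI  # assumed provider; any tool-calling chat model works

@tool
def add(a: float, b: float) -> float:
    """Add two numbers. In this universe, addition also adds an extra 1.1."""
    return a + b + 1.1  # deliberately non-standard (hypothetical)

@tool
def multiply(a: float, b: float) -> float:
    """Multiply two numbers. In this universe, the product is then scaled by 1.5."""
    return a * b * 1.5  # deliberately non-standard (hypothetical)

llm = ChatOpenAI(model="gpt-4o")  # placeholder model name
llm_with_tools = llm.bind_tools([add, multiply])

response = llm_with_tools.invoke("What is 3 plus 4 in this universe?")
print(response.tool_calls)  # expect a single call to `add` with a=3, b=4
```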

Constructing the Few-Shot Dataset

The few-shot dataset for the Multiverse Math task was created manually and contained 13 datapoints. Different few-shot techniques were employed to evaluate their effectiveness (a sketch contrasting the string and message formats follows the list):

  • Zero-shot: Only a basic system prompt and the question were provided to the model.
  • Few-shot-static-msgs, k=3: Three fixed examples were passed as messages between the system prompt and the human question.
  • Few-shot-dynamic-msgs, k=3: Three dynamically selected examples were passed as messages, chosen by semantic similarity between the current question and the example questions.
  • Few-shot-str, k=13: All thirteen examples were converted into one long string appended to the system prompt.
  • Few-shot-msgs, k=13: All thirteen examples were passed as messages between the system prompt and the human question.
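To see how the string and message variants differ in practice, here is a minimal sketch assuming LangChain’s message classes; the example contents are illustrative rather than taken from the benchmark dataset. In the message variant, each example carries the assistant’s tool calls and the tool’s responses, while the string variant flattens the same examples into the system prompt:

```python
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage, ToolMessage

system = "You must solve math questions in an alternate mathematical universe."

# One worked tool-calling example (contents are illustrative).
example_msgs = [
    HumanMessage("What is 3 plus 4?"),
    AIMessage(
        "",
        tool_calls=[{"name": "add", "args": {"a": 3, "b": 4}, "id": "call_1"}],
    ),
    ToolMessage("8.1", tool_call_id="call_1"),
    AIMessage("3 plus 4 in this universe is 8.1."),
]

question = HumanMessage("What is 5 plus 9?")

# few-shot-msgs: examples inserted as messages between the system prompt and the question.
msgs_variant = [SystemMessage(system)] + example_msgs + [question]

# few-shot-str: the same examples flattened into one string appended to the system prompt.
example_str = "\n".join(
    f"{m.type}: {m.content or getattr(m, 'tool_calls', '')}" for m in example_msgs
)
str_variant = [
    SystemMessage(system + "\n\nHere are some example interactions:\n" + example_str),
    question,
]
```

One plausible reason the message format fares better is that it presents examples in the same structured form the model is expected to produce, though the post itself only reports the empirical comparison.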

Results and Insights

The results revealed several key trends:

  • Few-shot prompting significantly improves performance across the board. For instance, Claude 3 Sonnet’s performance increased from 16% using zero-shot to 52% with three semantically similar examples passed as messages.
  • Using semantically similar examples as messages yields better results than using static examples or strings (a sketch of similarity-based selection follows this list).
  • The Claude models benefit more from few-shot prompting than the GPT models.
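Dynamic selection by semantic similarity can be implemented with an embedding-based example selector. The following sketch uses LangChain’s SemanticSimilarityExampleSelector with placeholder examples; the import paths, vector store, and embedding model are assumptions that may vary across versions and providers, and this is not the benchmark’s actual code:

```python
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

# Hypothetical stand-ins for the hand-written few-shot datapoints.
examples = [
    {"question": "What is 3 plus 4?", "answer": "8.1"},
    {"question": "What is 2 times 5?", "answer": "15.0"},
    {"question": "What is 10 minus 6?", "answer": "4.5"},
]

# Embed the example questions and retrieve the k most similar to the incoming one.
selector = SemanticSimilarityExampleSelector.from_examples(
    examples,
    OpenAIEmbeddings(),
    InMemoryVectorStore,
    k=3,
    input_keys=["question"],  # compare on the question text only
)

selected = selector.select_examples({"question": "What is 7 plus 9?"})
```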

An example question that received an incorrect answer without few-shot prompting was answered correctly once few-shot examples were added, demonstrating the technique’s effectiveness.

Future Directions

The study opens several avenues for future exploration:

  1. Comparing the impact of inserting negative few-shot examples (wrong answers) versus positive ones.
  2. Identifying the best methods for semantic-search retrieval of few-shot examples.
  3. Determining the optimal number of few-shot examples for the best performance-cost trade-off.
  4. Evaluating whether trajectories that include initial errors and subsequent corrections are more beneficial than those that are correct on the first pass.

LangChain invites further benchmarking and ideas for future evaluations to continue advancing the field.

Image source: Shutterstock
