Anyscale Explores Direct Preference Optimization Using Synthetic Data


Felix Pinkston
Aug 22, 2024 03:00

Anyscale’s latest blog post delves into Direct Preference Optimization (DPO) with synthetic data, highlighting its methodology and applications in tuning language models.


According to Anyscale, Direct Preference Optimization (DPO) has emerged as a significant methodology for tuning language models to align their outputs with human preferences. The company’s latest blog post provides an in-depth case study on the application of DPO using synthetic data, particularly in the context of summarization tasks.

Synthetic Data Generation

Synthetic data generation has become a powerful technique for creating high-quality datasets. Anyscale’s approach leverages AI models as data augmenters and judges to improve subsequent models. The blog outlines a detailed pipeline for synthetic data generation, emphasizing the utility of Ray Data and vLLM for scaling and rapid experimentation.
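
To make the pipeline concrete, here is a minimal sketch of batched generation with Ray Data and vLLM. It is not Anyscale’s actual code: the model name, prompt template, file paths, and resource settings are illustrative assumptions, and it presumes a recent Ray release that accepts the concurrency argument to map_batches.

```python
# Hypothetical sketch: fan summary generation out over GPU workers with Ray Data + vLLM.
import numpy as np
import ray
from vllm import LLM, SamplingParams


class SummaryGenerator:
    """Generates one candidate summary per article in a batch."""

    def __init__(self):
        # Load the policy model once per worker (assumed model name).
        self.llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
        self.sampling = SamplingParams(temperature=0.9, max_tokens=256)

    def __call__(self, batch: dict) -> dict:
        # Assumed instruction format for the Mistral instruct model.
        prompts = [
            f"[INST] Summarize the following article:\n\n{article} [/INST]"
            for article in batch["article"]
        ]
        outputs = self.llm.generate(prompts, self.sampling)
        batch["summary"] = np.array([out.outputs[0].text for out in outputs])
        return batch


# Read articles, run generation in parallel on GPU workers, and persist the results.
ds = ray.data.read_json("articles.jsonl")
ds = ds.map_batches(SummaryGenerator, concurrency=4, num_gpus=1, batch_size=32)
ds.write_json("synthetic_summaries")
```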

DPO Training and Insights

Direct Preference Optimization (DPO) offers a balanced trade-off between complexity and effectiveness, making it a widely adopted algorithm for preference tuning. Anyscale has integrated DPO into its LLM suite, enabling users to build preference-tuned models through an intuitive API. The blog covers modeling insights and experiments conducted on DPO for summarization.
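
For readers unfamiliar with the algorithm, the standard DPO objective from the original paper can be written in a few lines. The PyTorch sketch below is purely illustrative and is not taken from Anyscale’s implementation; the default β value is an assumption.

```python
# Standard DPO pairwise loss (Rafailov et al., 2023); illustrative sketch only.
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(chosen | prompt)
    policy_rejected_logps: torch.Tensor,  # log p_theta(rejected | prompt)
    ref_chosen_logps: torch.Tensor,       # log p_ref(chosen | prompt)
    ref_rejected_logps: torch.Tensor,     # log p_ref(rejected | prompt)
    beta: float = 0.1,                    # controls deviation from the reference model
) -> torch.Tensor:
    # Log-ratio of policy vs. reference for the preferred and dispreferred responses.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected via a logistic loss.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```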

Evaluation

Anyscale utilizes Ray Data and vLLM for batch inference to evaluate the generated summaries at scale. Evaluation is crucial for determining the quality of models, and Anyscale emphasizes the importance of task-specific evaluation aligned with training objectives. The blog provides key details on setting up preference functions for effective evaluation.

Comparison with Supervised Fine-Tuning

The blog contrasts DPO with traditional supervised fine-tuning (SFT). While SFT relies on high-quality data collection and exact imitation of desired behavior, preference tuning focuses on whether one response is preferred over another. This approach allows for scalable data generation and on-policy data collection, directly addressing model-specific issues.
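
The difference shows up most clearly in the shape of the training data: SFT imitates a single reference output, while preference tuning only needs a ranking between two responses. The field names below are illustrative assumptions, not Anyscale’s dataset schema.

```python
# Illustrative training-example formats (field names are assumptions).
sft_example = {
    "prompt": "Summarize the following article: ...",
    "response": "A single gold-standard summary written or vetted by a human.",
}

dpo_example = {
    "prompt": "Summarize the following article: ...",
    "chosen": "Concise summary that answers questions about the article correctly.",
    "rejected": "Verbose summary that misses or distorts key facts.",
}
```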

Case Study: Summarization

The case study applies DPO to the Mistral-7B-Instruct-v0.1 model for summarizing CNN articles. Anyscale designed a synthetic summarization preference dataset, using a synthetic judge to reduce costs and ensure alignment between training and evaluation. The preference function combines word-count minimization and Q&A accuracy to evaluate summaries.
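
One plausible way to combine those two signals is to rank by Q&A accuracy first and break ties toward the shorter summary; the exact weighting Anyscale uses is not reproduced here, and the qa_accuracy callable (e.g. the fraction of judge-answered multiple-choice questions that are correct) is an assumption.

```python
# Hypothetical preference function: Q&A accuracy first, brevity as the tie-breaker.
from typing import Callable


def preferred(
    summary_a: str,
    summary_b: str,
    qa_accuracy: Callable[[str], float],  # assumed judge-based accuracy score in [0, 1]
) -> str:
    """Return "a" or "b" depending on which summary is preferred."""
    acc_a, acc_b = qa_accuracy(summary_a), qa_accuracy(summary_b)
    if acc_a != acc_b:
        return "a" if acc_a > acc_b else "b"
    # Equal accuracy: prefer the shorter, more compressed summary.
    return "a" if len(summary_a.split()) <= len(summary_b.split()) else "b"
```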

Data Generation

Anyscale used the Mistral-7B-Instruct-v0.1 model to generate on-policy data for summarization. The process involved generating multiple summaries for each article and using the Llama-3-70B-Instruct model to create and answer multiple-choice questions about the original text. This method ensured diverse outputs and accurate evaluation.
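
The judge step might look roughly like the prompt templates below: one prompt derives multiple-choice questions from the full article, another asks the judge to answer them from the summary alone. The wording, question count, and JSON format are assumptions for illustration, not the prompts from the blog.

```python
# Hypothetical judge prompts for Llama-3-70B-Instruct (illustrative wording only).
QUESTION_PROMPT = """Read the article below and write 3 multiple-choice questions
(each with options A-D and the correct letter) that test its key facts.

Article:
{article}

Return JSON: [{{"question": ..., "options": [...], "answer": ...}}, ...]"""

ANSWER_PROMPT = """Using only the summary below, answer the multiple-choice question
with a single letter (A-D).

Summary:
{summary}

Question:
{question}
Options:
{options}"""

# A summary is scored by how many article-derived questions the judge answers
# correctly from the summary alone; candidate summaries for the same article are
# then ranked into chosen/rejected pairs.
```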

DPO Training

Anyscale implemented DPO within its LLM post-training offering, allowing users to configure hyperparameters and compute resources for training runs. The blog provides a detailed example of a DPO training configuration, emphasizing the importance of the β hyperparameter and efficient training using Ray.

Evaluation

Evaluation involved computing win rates for each model, comparing DPO-trained models with the original model and other baselines. The results demonstrated DPO’s advantage in balancing accuracy and compression, outperforming both SFT and GPT-4o baselines.
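
A pairwise win rate is simply the fraction of prompts on which one model’s summary is preferred over another’s. The sketch below assumes a preference callable such as the one above, returning "a" when the first summary wins.

```python
# Minimal win-rate computation over paired summaries (preference callable assumed).
from typing import Callable, Sequence


def win_rate(
    candidate_summaries: Sequence[str],
    baseline_summaries: Sequence[str],
    preferred: Callable[[str, str], str],  # returns "a" if the first summary wins
) -> float:
    wins = sum(
        preferred(c, b) == "a"
        for c, b in zip(candidate_summaries, baseline_summaries)
    )
    return wins / len(candidate_summaries)
```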

Insights and Challenges

Anyscale identified key insights for DPO training, including the critical role of the β and learning-rate hyperparameters. The blog also discusses failure modes, such as long off-topic endings and gibberish tokens, highlighting the need for careful hyperparameter tuning and monitoring.

Iterative On-Policy Training

The blog suggests iterative on-policy training as a method to enhance DPO performance. By regenerating training data with the fine-tuned model and applying additional DPO rounds, Anyscale achieved significant performance gains, making DPO competitive with traditional RLHF methods.
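
The control flow amounts to alternating data generation and training. The sketch below shows that loop under stated assumptions: the generate_preferences and run_dpo callables are placeholders for the pipeline and training steps described above, not Anyscale’s exact recipe.

```python
# High-level sketch of iterative on-policy DPO (assumed control flow).
from typing import Callable


def iterative_dpo(
    policy: str,                                    # model name or checkpoint path
    articles: list[str],
    generate_preferences: Callable[[str, list[str]], list[dict]],
    run_dpo: Callable[[str, list[dict]], str],      # returns a new checkpoint
    rounds: int = 2,
) -> str:
    for _ in range(rounds):
        # 1. Sample fresh summaries with the current policy and rank them with the
        #    synthetic judge to build chosen/rejected pairs (on-policy data).
        preference_data = generate_preferences(policy, articles)
        # 2. Fine-tune the current policy on its own preference data.
        policy = run_dpo(policy, preference_data)
    return policy
```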

For the full case study and methodology, readers can refer to the original post on Anyscale.

Image source: Shutterstock
