Dragonfly: Enhanced Vision-Language Model with Multi-Resolution Zoom Launched by Together.ai



Together.ai has announced the launch of Dragonfly, an innovative vision-language model designed to enhance fine-grained visual understanding and reasoning about image regions. The architecture leverages multi-resolution zoom-and-select capabilities to optimize multi-modal reasoning while maintaining context efficiency, according to Together AI.

Dragonfly Model Architecture

Dragonfly employs two primary strategies: multi-resolution visual encoding and zoom-in patch selection. These techniques enable the model to focus on fine-grained details of image regions, enhancing its commonsense reasoning capabilities. The architecture processes images at multiple resolutions (low, medium, and high), dividing each image into sub-images that are encoded into visual tokens. These tokens are then projected into the language space and concatenated into a single sequence that feeds into the language model. A minimal sketch of this pipeline follows below.
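To make the pipeline concrete, here is a minimal, self-contained PyTorch sketch of multi-resolution encoding as described above. The toy encoder, the tiling routine, the scale factors, and all dimensions are illustrative assumptions, not Dragonfly's actual implementation; a real system would use a pretrained encoder such as CLIP in place of `ToyVisionEncoder`.

```python
# Hedged sketch of multi-resolution visual encoding (assumed details,
# not Dragonfly's real code): tile the image at several resolutions,
# encode each crop into visual tokens, project them into the language
# space, and concatenate everything into one sequence for the LM.
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH = 336       # assumed encoder crop size (e.g. a CLIP ViT at 336 px)
EMBED_DIM = 1024  # assumed vision embedding width
LM_DIM = 4096     # assumed language-model hidden size (e.g. LLaMA-8B)

class ToyVisionEncoder(nn.Module):
    """Stand-in for a CLIP-style encoder: one crop -> a few visual tokens."""
    def __init__(self):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(4)   # 4x4 grid -> 16 tokens per crop
        self.proj = nn.Linear(3, EMBED_DIM)

    def forward(self, crop):                  # crop: (3, PATCH, PATCH)
        x = self.pool(crop).flatten(1).t()    # (16, 3)
        return self.proj(x)                   # (16, EMBED_DIM)

def split_into_crops(image, size=PATCH):
    """Tile an image into non-overlapping size x size sub-images."""
    _, h, w = image.shape
    return [image[:, top:top + size, left:left + size]
            for top in range(0, h - size + 1, size)
            for left in range(0, w - size + 1, size)]

encoder = ToyVisionEncoder()
projector = nn.Linear(EMBED_DIM, LM_DIM)      # maps visual tokens to LM space

image = torch.rand(3, 1344, 1344)             # high-resolution input
token_groups = []
for scale in (0.25, 0.5, 1.0):                # low, medium, high resolution
    resized = F.interpolate(image[None], scale_factor=scale,
                            mode="bilinear")[0]
    for crop in split_into_crops(resized):
        token_groups.append(projector(encoder(crop)))

# Concatenated visual sequence that would precede the text tokens
visual_sequence = torch.cat(token_groups, dim=0)
print(visual_sequence.shape)                  # torch.Size([336, 4096])
```

Note how token count grows quadratically with resolution (1 crop at low, 4 at medium, 16 at high in this sketch); this is exactly why the selection step described next matters for context efficiency.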


Zoom-in Patch Selection:

For high-resolution images, Dragonfly takes a selective approach, identifying and retaining only the sub-images that carry the most significant visual information. This targeted selection reduces redundancy and improves overall model efficiency; a sketch of such a step appears below.
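Here is a hedged sketch of how such a selection step could work, keeping only the top-k high-resolution crops. The scoring rule (cosine similarity between each crop's mean token and a global low-resolution embedding) is purely an illustrative assumption; Dragonfly's actual selection criterion is not detailed in this announcement, and the function name `select_zoom_crops` is hypothetical.

```python
# Hedged sketch of zoom-in patch selection: score each high-res crop and
# keep only the k most informative ones. The similarity-to-global-view
# score below is an assumption, not Dragonfly's exact criterion.
import torch
import torch.nn.functional as F

def select_zoom_crops(crop_tokens, global_tokens, k=8):
    """crop_tokens: list of (T, D) tensors, one per high-res crop.
    global_tokens: (T, D) tokens of the full low-resolution view.
    Returns indices of the k crops scored highest."""
    global_vec = global_tokens.mean(dim=0)                 # (D,)
    scores = torch.stack([
        F.cosine_similarity(t.mean(dim=0), global_vec, dim=0)
        for t in crop_tokens
    ])                                                     # (num_crops,)
    keep = torch.topk(scores, k=min(k, len(crop_tokens))).indices
    return keep.tolist()

# Example: 16 high-res crops of 16 tokens each, embedding width 1024
crops = [torch.randn(16, 1024) for _ in range(16)]
global_view = torch.randn(16, 1024)
selected = select_zoom_crops(crops, global_view, k=8)
print(f"kept crops: {selected}")  # the remaining crops are dropped
```

Dropping half the high-resolution crops in this example halves their token cost while, under the stated assumption, preserving the regions most relevant to the image as a whole.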

Performance and Evaluation

Dragonfly demonstrates promising performance on several vision-language benchmarks, including commonsense visual question answering and image captioning. The model achieved competitive results on benchmarks such as AI2D, ScienceQA, MMMU, MMVet, and POPE, showcasing its effectiveness in fine-grained understanding of image regions.


Benchmark Performance:

Model                   | AI2D | ScienceQA | MMMU | MMVet | POPE
VILA                    | –    | 68.2      | –    | 34.9  | 85.5
LLaVA-v1.5 (Vicuna-7B)  | 54.8 | 70.4      | 35.3 | 30.5  | 85.9
LLaVA-v1.6 (Mistral-7B) | 60.8 | 72.8      | 33.4 | 44.8  | 86.7
QWEN-VL-chat            | 52.3 | 68.2      | 35.9 | –     | –
Dragonfly (LLaMA-8B)    | 63.6 | 80.5      | 37.8 | 35.9  | 91.2

Dragonfly-Med

In collaboration with Stanford Medicine, Together.ai has also introduced Dragonfly-Med, a version fine-tuned on 1.4 million biomedical image-instruction pairs. This model excels at high-resolution medical imaging tasks, outperforming previous models such as Med-Gemini on multiple medical imaging benchmarks.

Evaluation on Medical Benchmarks

Dragonfly-Med was evaluated on visual question-answering and clinical report generation tasks, achieving state-of-the-art results on several benchmarks:

Dataset  | Metric       | Med-Gemini | Dragonfly-Med (LLaMA-8B)
VQA-RAD  | Acc (closed) | 69.7       | 77.4
SLAKE    | Acc (closed) | 84.8       | 90.4
Path-VQA | Acc (closed) | 83.3       | 92.3

Conclusion and Future Work

Dragonfly’s architecture points to a new research direction: zooming in on image regions to capture more fine-grained visual information. Together.ai plans to continue improving the model’s capabilities and to explore new architectures and visual encoding strategies that benefit broader scientific fields.

The collaboration with Stanford Medicine and the use of resources like Meta LLaMA3 and CLIP from OpenAI have been crucial in developing Dragonfly. The model’s codebase also builds upon the foundations of Otter and LLaVA-UHD.



Image source: Shutterstock
