Mistral AI Unveils Pixtral 12B: A Groundbreaking Multimodal Model


Iris Coleman

Sep 18, 2024 03:29

Mistral AI introduces Pixtral 12B, a state-of-the-art multimodal model excelling in text and image tasks, with notable performance in instruction following and reasoning.

Mistral AI has officially launched Pixtral 12B, the first-ever multimodal model from the company, designed to handle both text and image data seamlessly. The model is licensed under Apache 2.0, according to Mistral AI.

Key Features of Pixtral 12B

Pixtral 12B stands out due to its natively multimodal capabilities, trained with interleaved image and text data. The model incorporates a new 400M parameter vision encoder and a 12B parameter multimodal decoder based on Mistral Nemo. This architecture allows it to support variable image sizes and aspect ratios, and process multiple images within its long context window of 128K tokens.
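
To put the 128K-token window in perspective, the back-of-the-envelope sketch below estimates how many tokens a single image contributes, assuming one token per 16×16 patch plus one break or end marker per patch row, as described in the architecture section; the exact bookkeeping is an assumption based on Mistral's description, not reference code.

```python
# Rough token budget for images in Pixtral's 128K context window.
# Assumption: one token per 16x16 patch, plus one [IMG BREAK] per patch row
# (with the final row's break replaced by [IMG END]), per Mistral's description.

def image_token_count(width_px: int, height_px: int, patch: int = 16) -> int:
    cols = -(-width_px // patch)   # ceil: patches per row
    rows = -(-height_px // patch)  # ceil: patch rows
    return rows * cols + rows      # patch tokens + one marker per row

per_image = image_token_count(1024, 1024)   # 4160 tokens for a 1024x1024 image
print(per_image, 128_000 // per_image)      # -> 4160, roughly 30 such images fit
```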

Performance-wise, Pixtral 12B excels in multimodal tasks and maintains state-of-the-art performance on text-only benchmarks. It has achieved a 52.5% score on the MMMU reasoning benchmark, surpassing several larger models.

Performance and Evaluation

Pixtral 12B was designed as a drop-in replacement for Mistral Nemo 12B, delivering best-in-class multimodal reasoning without compromising on text capabilities like instruction following, coding, and math. The model was evaluated using a consistent evaluation harness across various datasets, and it outperforms both open and closed models such as Claude 3 Haiku. Notably, Pixtral even matches or exceeds the performance of larger models like LLaVA-OneVision 72B on multimodal benchmarks.

Pixtral particularly excels in instruction following, showing a 20% relative improvement over the nearest open-source model on text IF-Eval and MT-Bench. It also performs strongly on multimodal instruction following benchmarks, outperforming models like Qwen2-VL 7B and Phi-3.5 Vision.

Architecture and Capabilities

The architecture of Pixtral 12B is designed to optimize for both speed and performance. The vision encoder tokenizes images at their native resolution and aspect ratio, converting them into image tokens for each 16×16 patch in the image. These tokens are then flattened to create a sequence, with [IMG BREAK] and [IMG END] tokens added between rows and at the end of the image. This allows the model to accurately understand complex diagrams and documents while providing fast inference speeds for smaller images.
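
To make that layout concrete, here is a small illustrative sketch of how a patch grid could be flattened into a single token sequence, with [IMG BREAK] markers between rows and a closing [IMG END]; the placeholder patch-token names are invented for illustration and do not reflect Mistral's actual tokenizer.

```python
# Illustrative flattening of an image's patch grid into Pixtral-style tokens.
# Placeholder names like "<patch r,c>" are hypothetical; only the [IMG BREAK]
# and [IMG END] markers come from Mistral's description of the encoder.

def flatten_patch_grid(rows: int, cols: int) -> list[str]:
    sequence = []
    for r in range(rows):
        sequence.extend(f"<patch {r},{c}>" for c in range(cols))
        # Rows are separated by [IMG BREAK]; the image ends with [IMG END].
        sequence.append("[IMG BREAK]" if r < rows - 1 else "[IMG END]")
    return sequence

# A 48x64-pixel image -> 3x4 patch grid -> 12 patch tokens + 3 markers.
print(flatten_patch_grid(rows=3, cols=4))
```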

Pixtral’s final architecture comprises two components: the Vision Encoder and the Multimodal Transformer Decoder. The model is trained to predict the next text token on interleaved image and text data, allowing it to process any number of images with arbitrary sizes in its large context window of 128K tokens.

Practical Applications

Pixtral 12B has shown exceptional performance in various practical applications, including reasoning over complex figures, chart understanding, and multi-image instruction following. For example, it can combine information from multiple tables into a single markdown table or generate HTML code to create a website based on an image prompt.

How to Access Pixtral

Users can easily try Pixtral via Le Chat, Mistral AI’s conversational chat interface, or through La Plateforme, which allows integration via API calls. Detailed documentation is available for those interested in leveraging Pixtral’s capabilities in their applications.
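
As a minimal sketch of the API route, the snippet below sends a text prompt and an inline image to Pixtral through the mistralai Python client; the model identifier (pixtral-12b-2409), the file name, and the exact payload shape are assumptions drawn from Mistral's public documentation, so check the current docs before relying on them.

```python
import base64
from mistralai import Mistral

client = Mistral(api_key="YOUR_API_KEY")

# Encode a local image as a base64 data URL so it can be sent inline.
with open("chart.png", "rb") as f:  # hypothetical file name
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.complete(
    model="pixtral-12b-2409",  # assumed Pixtral identifier on La Plateforme
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the trend shown in this chart."},
            {"type": "image_url", "image_url": f"data:image/png;base64,{image_b64}"},
        ],
    }],
)
print(response.choices[0].message.content)
```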

For those who prefer running Pixtral locally, the model can be accessed through the mistral-inference library or the vLLM library, which offers higher serving throughput. Detailed instructions for setup and usage are provided in the documentation.
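
For local serving, a minimal vLLM sketch might look like the following; the Hugging Face model identifier, the tokenizer_mode flag, and the OpenAI-style message format are assumptions based on vLLM's multimodal chat interface, and the vLLM documentation has the authoritative invocation.

```python
from vllm import LLM, SamplingParams

# Load the Pixtral weights; tokenizer_mode="mistral" selects Mistral's tokenizer format.
llm = LLM(model="mistralai/Pixtral-12B-2409", tokenizer_mode="mistral")

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image in one sentence."},
        # Hypothetical image URL, passed in OpenAI-style chat format.
        {"type": "image_url", "image_url": {"url": "https://example.com/figure.png"}},
    ],
}]

outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```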

Image source: Shutterstock
