NVIDIA Unveils Mistral-NeMo-Minitron 8B Model with Superior Accuracy
NVIDIA, in collaboration with Mistral AI, has announced the release of the Mistral-NeMo-Minitron 8B model, a highly advanced open-access large language model (LLM). According to the NVIDIA Technical Blog, the model surpasses other models of similar size in accuracy across nine popular benchmarks.
Advanced Model Pruning and Distillation
The Mistral-NeMo-Minitron 8B model was developed by width-pruning the larger Mistral NeMo 12B model, followed by a light retraining process using knowledge distillation. This methodology, originally proposed by NVIDIA in its paper on Compact Language Models via Pruning and Knowledge Distillation, has been validated through multiple successful implementations, including the NVIDIA Minitron 8B and 4B models as well as the Llama-3.1-Minitron 4B model.
Model pruning reduces the size and complexity of a model by dropping either layers (depth pruning) or neurons and attention heads (width pruning). Pruning is often paired with retraining to recover any lost accuracy.
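To make the distinction concrete, the sketch below shows one common way width pruning can be applied to a transformer MLP block in PyTorch: score the intermediate neurons (for example, by average activation magnitude on a small calibration set) and keep only the highest-scoring ones. This is a minimal illustration of the idea, not NVIDIA's implementation; the importance metric and function name are assumptions.

```python
import torch
import torch.nn as nn

def width_prune_mlp(fc1: nn.Linear, fc2: nn.Linear,
                    importance: torch.Tensor, keep: int):
    """Keep only the `keep` most important hidden neurons of an MLP block.

    fc1: hidden_dim -> intermediate_dim projection
    fc2: intermediate_dim -> hidden_dim projection
    importance: one score per intermediate neuron, e.g. mean |activation|
                measured on a calibration set (illustrative choice).
    """
    idx = importance.topk(keep).indices.sort().values  # preserve original ordering
    pruned_fc1 = nn.Linear(fc1.in_features, keep, bias=fc1.bias is not None)
    pruned_fc2 = nn.Linear(keep, fc2.out_features, bias=fc2.bias is not None)
    with torch.no_grad():
        pruned_fc1.weight.copy_(fc1.weight[idx])     # drop rows of fc1
        pruned_fc2.weight.copy_(fc2.weight[:, idx])  # drop matching columns of fc2
        if fc1.bias is not None:
            pruned_fc1.bias.copy_(fc1.bias[idx])
        if fc2.bias is not None:
            pruned_fc2.bias.copy_(fc2.bias)
    return pruned_fc1, pruned_fc2
```

A depth-pruning variant would instead drop whole transformer layers and leave the remaining weight matrices untouched.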
Model distillation, on the other hand, transfers knowledge from a large, complex model (the teacher) to a smaller, simpler model (the student), aiming to retain much of the teacher's predictive power while being more efficient.
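In code, a common formulation of this teacher-student transfer is logit distillation: the student is trained to match the teacher's softened output distribution alongside the usual hard-label loss. The blog does not specify NVIDIA's exact loss, so the temperature and blending weight below are illustrative assumptions.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend a soft KL term (student mimics the teacher's softened distribution)
    with the ordinary hard-label cross-entropy on the training tokens.

    Logits are expected as (num_tokens, vocab_size); labels as (num_tokens,).
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2  # rescale so gradients stay comparable to the hard term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```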
Combining pruning and distillation makes it possible to derive progressively smaller models from a single large pretrained model. This approach significantly reduces computational cost: only 100-400 billion tokens are needed for retraining, compared with the much larger datasets required to train a model from scratch.
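The overall workflow can be read as a loop in which each compressed model becomes the teacher for the next, smaller one. The sketch below is only an outline under that reading; prune_to_size and distill are hypothetical placeholders, not functions from NeMo or any NVIDIA library.

```python
# Hypothetical outline only: prune_to_size() and distill() are placeholders,
# not functions from NeMo or any NVIDIA library.
def compress_progressively(pretrained_model, target_sizes=("8B", "4B"),
                           retrain_tokens=380e9):
    teacher = pretrained_model
    students = []
    for size in target_sizes:
        student = prune_to_size(teacher, size)   # width and/or depth pruning
        student = distill(student, teacher,      # light retraining recovers accuracy
                          num_tokens=retrain_tokens)
        students.append(student)
        teacher = student                        # the next stage distills from this model
    return students
```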
Mistral-NeMo-Minitron 8B Performance
The Mistral-NeMo-Minitron 8B model demonstrates leading accuracy on several benchmarks, outperforming other models in its class, including Llama 3.1 8B and Gemma 7B. The table below highlights the performance metrics:
| Model | Training tokens | Wino-Grande 5-shot | ARC Challenge 25-shot | MMLU 5-shot | HellaSwag 10-shot | GSM8K 5-shot | TruthfulQA 0-shot | XLSum en (20%) 3-shot | MBPP 0-shot | HumanEval 0-shot |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 3.1 8B | 15T | 77.27 | 57.94 | 65.28 | 81.80 | 48.60 | 45.06 | 30.05 | 42.27 | 24.76 |
| Gemma 7B | 6T | 78 | 61 | 64 | 82 | 50 | 45 | 17 | 39 | 32 |
| Mistral-NeMo-Minitron 8B | 380B | **80.35** | **64.42** | **69.51** | **83.03** | **58.45** | **47.56** | **31.94** | **43.77** | **36.22** |
| Mistral NeMo 12B | N/A | 82.24 | 65.10 | 68.99 | 85.16 | 56.41 | 49.79 | 33.43 | 42.63 | 23.78 |
Table 1. Accuracy of the Mistral-NeMo-Minitron 8B base model compared to the teacher Mistral NeMo 12B, Gemma 7B, and Llama 3.1 8B base models. Bold numbers represent the best scores within the 8B model class.
Implementation and Future Work
Following the best practices of structured weight pruning and knowledge distillation, the Mistral NeMo 12B model was width-pruned to yield the 8B target model. The process began by fine-tuning the unpruned Mistral NeMo 12B model on 127 billion tokens to correct for distribution shift, followed by width-only pruning and distillation using 380 billion tokens.
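Expressed as a sequence of steps, the recipe described above looks roughly like the outline below. The helper functions are hypothetical placeholders; only the token budgets (127 billion for teacher fine-tuning, 380 billion for distillation) and the width-only pruning choice come from the source.

```python
# Hypothetical outline of the recipe above. fine_tune(), width_prune(), and
# distill() are placeholders, not NeMo APIs; only the token budgets and the
# width-only pruning choice come from the source material.
def build_minitron_8b(mistral_nemo_12b, corpus):
    # Step 1: lightly fine-tune the unpruned 12B teacher on ~127B tokens so its
    # predictions match the distribution of the distillation corpus.
    teacher = fine_tune(mistral_nemo_12b, corpus, num_tokens=127e9)

    # Step 2: width-only pruning (shrink neurons and attention heads rather than
    # dropping layers) down to an 8B-parameter architecture.
    student = width_prune(teacher, target_params=8e9)

    # Step 3: recover accuracy by distilling the teacher into the pruned student
    # on ~380B tokens.
    student = distill(student, teacher, corpus, num_tokens=380e9)
    return student
```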
The Mistral-NeMo-Minitron 8B model showcases superior performance and efficiency, making it a significant advancement in the field of AI. NVIDIA plans to continue refining the distillation process to produce even smaller and more accurate models, and the technique will be gradually integrated into the NVIDIA NeMo framework for generative AI.
For further details, visit the NVIDIA Technical Blog.
Image source: Shutterstock