NVIDIA and Mistral Launch NeMo 12B: A High-Performance Language Model on a Single GPU

Iris Coleman
Jul 27, 2024 05:35
NVIDIA and Mistral have developed NeMo 12B, a high-performance language model optimized to run on a single GPU, enhancing text-generation applications.


NVIDIA, in collaboration with Mistral, has unveiled the Mistral NeMo 12B, a groundbreaking language model that promises leading performance across various benchmarks. This advanced model is optimized to run on a single GPU, making it a cost-effective and efficient solution for text-generation applications, according to the NVIDIA Technical Blog.

Mistral NeMo 12B

The Mistral NeMo 12B model is a dense transformer model with 12 billion parameters, trained with a large multilingual vocabulary of 131,000 tokens. It excels in a wide range of tasks, including common sense reasoning, coding, math, and multilingual chat. The model's performance on benchmarks such as HellaSwag, Winograd, and TriviaQA highlights its superior capabilities compared to other models like Gemma 2 9B and Llama 3 8B.


Model

Context
Window

HellaSwag
(0-shot)

Winograd
(0-shot)

NaturalQ
(5-shot)

TriviaQA
(5-shot)

MMLU
(5-shot)

OpenBookQA
(0-shot)

CommonSenseQA
(0-shot)

TruthfulQA
(0-shot)

MBPP
(pass@1
3-shots)

Mistral
NeMo
12B

128k

83.5%

76.8%

31.2%

73.8%

68.0%

60.6%

70.4%

50.3%

61.8%

Gemma
2
9B
8k 80.1% 74.0% 29.8% 71.3% 71.5% 50.8% 60.8% 46.6% 56.0%

Llama
3
8B
8k 80.6% 73.5% 28.2% 61.0% 62.3% 56.4% 66.7% 43.0% 57.2%

Table
1.
Mistral
NeMo
model
performance
across
popular
benchmarks

With a 128K context length, Mistral NeMo can process extensive and complex information, resulting in coherent and contextually relevant outputs. The model is trained on Mistral's proprietary dataset, which includes a significant amount of multilingual and code data, enhancing feature learning and reducing bias.

Optimized Training and Inference

The training of Mistral NeMo is powered by NVIDIA Megatron-LM, a PyTorch-based library that provides GPU-optimized techniques and system-level innovations. This library includes core components such as attention mechanisms, transformer blocks, and distributed checkpointing, facilitating large-scale model training.
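To make the components concrete, here is a minimal NumPy sketch of the kind of dense transformer block that libraries like Megatron-LM implement with fused, GPU-optimized kernels. The shapes, single attention head, and ReLU MLP are illustrative simplifications, not Mistral NeMo's actual architecture or Megatron-LM's API.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv, wo):
    # Single-head scaled dot-product attention over one sequence.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return (scores @ v) @ wo

def layer_norm(h):
    return (h - h.mean(-1, keepdims=True)) / (h.std(-1, keepdims=True) + 1e-5)

def transformer_block(x, params):
    # Pre-norm residual structure: attention sublayer, then MLP sublayer.
    x = x + self_attention(layer_norm(x), *params["attn"])
    h = layer_norm(x)
    x = x + np.maximum(h @ params["w1"], 0) @ params["w2"]  # ReLU MLP
    return x

rng = np.random.default_rng(0)
d, seq = 16, 4  # toy hidden size and sequence length
params = {
    "attn": [rng.normal(size=(d, d)) * 0.1 for _ in range(4)],
    "w1": rng.normal(size=(d, 4 * d)) * 0.1,
    "w2": rng.normal(size=(4 * d, d)) * 0.1,
}
out = transformer_block(rng.normal(size=(seq, d)), params)
print(out.shape)  # (4, 16): the block preserves sequence and hidden dims
```

A production library's value lies in running exactly this computation with fused CUDA kernels, tensor/pipeline parallelism, and distributed checkpointing rather than dense NumPy matmuls.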

For inference, Mistral NeMo leverages TensorRT-LLM engines, which compile the model layers into optimized CUDA kernels. These engines maximize inference performance through techniques like pattern matching and fusion. The model also supports inference in FP8 precision using NVIDIA TensorRT-Model-Optimizer, making it possible to create smaller models with lower memory footprints without sacrificing accuracy.
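The idea behind 8-bit inference can be sketched in a few lines: store weights in 8 bits with a per-tensor scale, halving memory versus 16-bit storage. This toy int8 scheme is only an illustration of the principle; real FP8 deployment via TensorRT-Model-Optimizer uses hardware FP8 formats and calibration.

```python
import numpy as np

def quantize_8bit(w):
    # Map float weights onto signed 8-bit integers with one shared scale.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(1024, 1024)).astype(np.float16)
q, scale = quantize_8bit(w.astype(np.float32))
w_hat = dequantize(q, scale)

print(w.nbytes // q.nbytes)                   # 2: half the bytes of FP16
print(float(np.abs(w - w_hat).max()) < 0.05)  # True: small round-off error
```

The accuracy story in practice depends on calibration and which layers are quantized; the memory arithmetic, however, is exactly this simple.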

The ability to run the Mistral NeMo model on a single GPU improves compute efficiency, reduces costs, and enhances security and privacy. This makes it suitable for various commercial applications, including document summarization, classification, multi-turn conversations, language translation, and code generation.
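A back-of-the-envelope calculation shows why precision matters for single-GPU deployment: weight memory alone for 12 billion parameters shrinks from roughly 45 GiB at FP32 to about 11 GiB at FP8 (activations and the KV cache add to the real footprint and are ignored here).

```python
# Weight-only memory estimate for a 12B-parameter model at various precisions.
params = 12e9

for name, bytes_per_param in [("FP32", 4), ("FP16/BF16", 2), ("FP8", 1)]:
    gib = params * bytes_per_param / 2**30
    print(f"{name}: {gib:.1f} GiB")
# FP32: 44.7 GiB
# FP16/BF16: 22.4 GiB
# FP8: 11.2 GiB
```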

Deployment with NVIDIA NIM

The Mistral NeMo model is available as an NVIDIA NIM inference microservice, designed to streamline the deployment of generative AI models across NVIDIA's accelerated infrastructure. NIM supports a wide range of generative AI models, offering high-throughput AI inference that scales with demand. Enterprises can benefit from increased token throughput, which directly translates to higher revenue.
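NIM microservices expose an OpenAI-compatible HTTP API, so a client interaction can be sketched as below. The base URL and model identifier are placeholder assumptions; substitute the values for your own deployment or the NVIDIA-hosted endpoint.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"             # assumed local NIM deployment
MODEL_ID = "mistralai/mistral-nemo-12b-instruct"  # assumed model identifier

def build_chat_request(prompt, max_tokens=256):
    # Construct an OpenAI-style chat-completions request for the service.
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Summarize this document in two sentences.")
# urllib.request.urlopen(req) would send this to a running service.
print(req.full_url)
```

Because the interface follows the OpenAI convention, existing client libraries and tooling built for that API can typically be pointed at a NIM endpoint with only a base-URL change.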

Use Cases and Customization

The Mistral NeMo model is particularly effective as a coding copilot, providing AI-powered code suggestions, documentation, unit tests, and error fixes. The model can be fine-tuned with domain-specific data for higher accuracy, and NVIDIA offers tools for aligning the model to specific use cases.

The instruction-tuned variant of Mistral NeMo demonstrates strong performance across several benchmarks and can be customized using NVIDIA NeMo, an end-to-end platform for developing custom generative AI. NeMo supports various fine-tuning techniques such as parameter-efficient fine-tuning (PEFT), supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF).
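The core idea of PEFT can be shown with a LoRA-style sketch: instead of updating a full weight matrix W, train a low-rank update A @ B, which cuts trainable parameters dramatically. The sizes below are illustrative, and this is a conceptual toy, not NeMo's PEFT implementation.

```python
import numpy as np

d, r = 512, 8                        # hidden size and low rank (illustrative)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(d, r)) * 0.01   # trainable low-rank factor
B = np.zeros((r, d))                 # zero-init so the update starts at zero

def forward(x):
    # Effective weight is W + A @ B; only A and B would receive gradients.
    return x @ W + (x @ A) @ B

full = W.size
lora = A.size + B.size
print(f"trainable fraction: {lora / full:.3%}")  # ~3% of full fine-tuning
```

With rank 8 on a 512x512 matrix, only about 3% of the parameters are trained, which is why PEFT makes domain adaptation feasible on modest hardware.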

Getting Started

To explore the capabilities of the Mistral NeMo model, visit the Artificial Intelligence solution page. NVIDIA also offers free cloud credits to test the model at scale and build a proof of concept by connecting to the NVIDIA-hosted API endpoint.

Image source: Shutterstock
