NVIDIA NeMo Curator Enhances Non-English Dataset Preparation for LLM Training
Data curation is critical for developing effective and fair large language models (LLMs). High-quality, diverse training data directly impacts LLM performance by addressing issues like bias, inconsistencies, and redundancy. NVIDIA has recently announced the open-source release of NVIDIA NeMo Curator, a data curation library designed to improve LLM training accuracy through scalable and efficient dataset preparation.
Importance of Data Curation
When training localized multilingual LLMs, particularly for low-resource languages, web-crawled data such as OSCAR is vital. However, this data often contains noise, irrelevant content, duplicates, and formatting issues. Effective data curation is essential to mitigate these problems and ensure high-quality LLM performance. NeMo Curator offers a customizable, modular interface that simplifies pipeline expansion and accelerates model convergence by preparing high-quality tokens.
NeMo Curator Overview
NeMo Curator leverages GPU-accelerated data curation built on Dask and RAPIDS, enabling users to mine high-quality text at scale from massive uncurated web corpora as well as custom datasets.

For instance, a data curation pipeline can be constructed around the Thai Wikipedia dataset, a smaller subset of the full Wikipedia dataset that can be processed on a single GPU. Wikipedia is considered high quality for LLM pretraining due to its accurate, well-structured content. NeMo Curator builds on this by detecting and filtering low-quality documents, ensuring only the best data is used for training.
Data Curation Pipeline Example
Using the Thai Wikipedia dataset as an example, the data curation pipeline involves several steps:
- Download and extract the dataset to a JSONL file.
- Perform preliminary data cleaning, including language separation and Unicode text fixes.
- Perform advanced cleaning, such as GPU-accelerated exact and fuzzy deduplication and heuristic filtering.
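Conceptually, the preliminary cleaning step can be sketched in plain Python. This is an illustrative, CPU-only sketch, not NeMo Curator's implementation: the `text` field name and the character-block language check are assumptions for demonstration (a real pipeline would use a trained language identifier).

```python
import json
import unicodedata

def is_mostly_thai(text, threshold=0.5):
    """Naive language check: fraction of Thai-block characters among
    alphabetic characters. A stand-in for a real language identifier."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    thai = sum(1 for c in letters if "\u0e00" <= c <= "\u0e7f")
    return thai / len(letters) >= threshold

def clean_records(lines):
    """Parse JSONL lines, normalize Unicode, and keep Thai documents.
    The 'text' field name is assumed for illustration."""
    for line in lines:
        doc = json.loads(line)
        doc["text"] = unicodedata.normalize("NFC", doc["text"])
        if is_mostly_thai(doc["text"]):
            yield doc

raw = [
    json.dumps({"text": "สวัสดีครับ นี่คือเอกสารภาษาไทย"}),
    json.dumps({"text": "This document is entirely in English."}),
]
kept = list(clean_records(raw))
print(len(kept))  # only the Thai document survives -> 1
```

Separating languages before filtering matters because most heuristics (word counts, symbol ratios) have language-dependent thresholds.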
For the complete code sample for this tutorial, see the NVIDIA NeMo Curator GitHub repo.
Prerequisites and Setup
To use GPU-accelerated deduplication, the following hardware and software setup is recommended:
- NVIDIA GPU: this tutorial uses an NVIDIA A10 24 GB GPU.
- CUDA and NVIDIA drivers: CUDA 12.2 with driver 535.154.05.
- Ubuntu 22.04.
- NVIDIA Container Toolkit version 1.14.6.
To install the NeMo Curator library, run the following commands:

git clone https://github.com/NVIDIA/NeMo-Curator.git
cd NeMo-Curator
pip install --extra-index-url https://pypi.nvidia.com ".[cuda12x]"
Advanced Data Cleaning
Advanced data curation techniques such as deduplication and heuristic filtering are applied to yield better data quality. For example, the ExactDuplicates class removes identical documents using GPU-accelerated implementations from the RAPIDS cuDF library. Similarly, the FuzzyDuplicates class removes near-identical documents using the computationally efficient MinHashLSH algorithm.
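To make the two techniques concrete, here is a minimal CPU-only sketch of the underlying ideas: exact deduplication keeps the first occurrence of each document hash, and MinHash signatures estimate Jaccard similarity between documents so near-duplicates can be flagged. This is an illustration of the concepts, not the library's GPU implementation; the shingle size and signature length are arbitrary choices.

```python
import hashlib
import random

def exact_dedup(docs):
    """Keep the first occurrence of each byte-identical document (MD5 hash)."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

def shingles(text, k=3):
    """Character k-grams used as the document's feature set."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(features, num_hashes=64, seed=0):
    """One minimum per hash function; the fraction of matching positions
    between two signatures estimates the Jaccard similarity of the sets."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash(f) ^ mask for f in features) for mask in masks]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

docs = [
    "the cat sat on the mat",
    "the cat sat on the mat",   # exact duplicate, removed by hashing
    "the cat sat on the mat!",  # near duplicate, caught by MinHash
]
unique = exact_dedup(docs)      # two documents remain

sim = estimated_jaccard(
    minhash_signature(shingles(unique[0])),
    minhash_signature(shingles(unique[1])),
)
# near-duplicates score close to 1.0; unrelated texts score near 0.0
```

In production, signatures are bucketed with locality-sensitive hashing (the LSH part of MinHashLSH) so that only likely matches are compared, avoiding the quadratic cost of comparing every document pair.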
Heuristic Filtering
Heuristic filtering removes low-quality content from the dataset using simple, efficient-to-compute rules. At the time of publication, NeMo Curator provides 24 heuristics for natural languages and eight for coding languages. The filters to apply can be defined in a YAML config file.
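Typical heuristics of this kind bound the document length and the fraction of non-alphanumeric symbols. The sketch below shows the idea in plain Python; the thresholds are illustrative assumptions, and NeMo Curator's own filter classes and defaults differ.

```python
def word_count_ok(text, min_words=50, max_words=100_000):
    """Reject documents that are too short or too long to be useful."""
    n = len(text.split())
    return min_words <= n <= max_words

def symbol_ratio_ok(text, max_ratio=0.1):
    """Reject documents dominated by non-alphanumeric symbols
    (boilerplate, markup residue, ASCII art)."""
    if not text:
        return False
    symbols = sum(1 for c in text if not (c.isalnum() or c.isspace()))
    return symbols / len(text) <= max_ratio

def passes_heuristics(text):
    """A document must pass every rule to be kept."""
    return word_count_ok(text) and symbol_ratio_ok(text)

good = "word " * 100        # long enough, no symbol noise -> kept
bad = "#$%" * 200           # one giant symbol run -> rejected
print(passes_heuristics(good), passes_heuristics(bad))
```

Because each rule is a cheap per-document function, these filters scale linearly with corpus size and are usually run before the more expensive deduplication passes.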
Next Steps
This tutorial demonstrated how to construct a sample data curation pipeline for Thai Wikipedia data. For more information and examples, see the collection of data curation examples on GitHub. Enterprises can also request access to the NVIDIA NeMo Curator microservice, which provides streamlined performance and scalability.