IBM Research Unveils Innovations to Accelerate Enterprise AI Training


Zach Anderson
Sep 23, 2024 03:32

IBM Research introduces new data processing techniques to expedite AI model training using CPU resources, significantly enhancing efficiency.


IBM Research has unveiled groundbreaking innovations aimed at scaling the data processing pipeline for enterprise AI training, according to the company. These advancements are designed to expedite the creation of powerful AI models, such as IBM’s Granite models, by leveraging the abundant capacity of CPUs.

Optimizing Data Preparation

Before training AI models, vast amounts of data must be prepared. This data often comes from diverse sources like websites, PDFs, and news articles, and must undergo several preprocessing steps. These steps include filtering out irrelevant HTML code, removing duplicates, and screening for abusive content. These tasks, though critical, do not depend on scarce GPU resources and can run entirely on CPUs.
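The article does not include IBM’s pipeline code, but the filters it describes are easy to picture. Below is a minimal, illustrative Python sketch of the three per-document steps named above: stripping HTML, dropping exact duplicates, and screening for abusive content. All names are hypothetical, and a production system would use trained classifiers and fuzzy deduplication rather than keyword lists and exact hashes.

```python
import hashlib
import re

# Hypothetical keyword list; real pipelines typically use trained
# classifiers for abuse screening (assumption, not IBM's method).
ABUSIVE_TERMS = {"badword1", "badword2"}

TAG_RE = re.compile(r"<[^>]+>")  # crude HTML tag stripper

def strip_html(text: str) -> str:
    """Remove HTML tags, keeping only the visible text."""
    return TAG_RE.sub(" ", text)

def is_abusive(text: str) -> bool:
    """Flag documents containing blocklisted terms."""
    return bool(set(text.lower().split()) & ABUSIVE_TERMS)

def preprocess(documents):
    """Filter HTML, drop exact duplicates, and screen abusive content."""
    seen = set()
    for doc in documents:
        text = strip_html(doc).strip()
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:        # exact-duplicate removal
            continue
        seen.add(digest)
        if is_abusive(text):      # abusive-content screen
            continue
        yield text
```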

Petros Zerfos, IBM Research’s principal research scientist for watsonx data engineering, emphasized the importance of efficient data processing. “A large part of the time and effort that goes into training these models is preparing the data for these models,” Zerfos said. His team has been developing methods to enhance the efficiency of data processing pipelines, drawing expertise from various domains including natural language processing, distributed computing, and storage systems.

Leveraging CPU Capacity

Many steps in the data processing pipeline involve “embarrassingly parallel” computations, in which each document can be processed independently of every other. This parallelism can significantly speed up data preparation by distributing tasks across numerous CPUs. However, some steps, such as removing duplicate documents, require access to the entire dataset and cannot be parallelized in the same way.
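As a minimal illustration of the “embarrassingly parallel” pattern (not IBM’s actual code), the sketch below fans a per-document cleaning function out across local CPU cores using Python’s standard library; at IBM’s scale the same idea is spread across many machines. The cleaning function is a stand-in.

```python
from concurrent.futures import ProcessPoolExecutor

def clean_document(doc: str) -> str:
    """Per-document transform: depends on no other document."""
    return doc.strip().lower()  # stand-in for real cleaning logic

if __name__ == "__main__":
    docs = ["  Doc One ", "DOC TWO", "doc three  "]
    # Each document is independent, so the pool can process them
    # concurrently on every available core. A global step such as
    # cross-corpus deduplication cannot be split up this way.
    with ProcessPoolExecutor() as pool:
        cleaned = list(pool.map(clean_document, docs))
    print(cleaned)
```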

To accelerate IBM’s Granite model development, the team has developed processes to rapidly provision and utilize tens of thousands of CPUs. This approach involves marshalling idle CPU capacity across IBM’s Cloud datacenter network and ensuring high communication bandwidth between CPUs and data storage. Because traditional object storage systems often cause CPUs to idle due to low performance, the team employed IBM’s high-performance Storage Scale file system to cache active data efficiently.

Scaling Up AI Training

Over the past year, IBM has scaled up to 100,000 vCPUs in the IBM Cloud, processing 14 petabytes of raw data to produce 40 trillion tokens for AI model training. The team has automated these data pipelines using Kubeflow on IBM Cloud. Their methods have proven to be 24 times faster at processing data from Common Crawl than the team’s previous techniques.
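IBM has not published the pipeline definitions themselves, but a Kubeflow Pipelines (kfp v2) workflow of this general shape can be sketched as follows; the component names and placeholder bodies are assumptions for illustration, not IBM’s pipeline.

```python
from kfp import compiler, dsl

@dsl.component(base_image="python:3.11")
def filter_html(input_uri: str) -> str:
    # Placeholder: a real component would read raw documents from
    # object storage, strip markup, and write the results back.
    return input_uri + ".filtered"

@dsl.component(base_image="python:3.11")
def dedup(input_uri: str) -> str:
    # Placeholder for a global deduplication pass over the corpus.
    return input_uri + ".deduped"

@dsl.pipeline(name="data-prep-sketch")
def data_prep(raw_uri: str):
    # Chain the steps; Kubeflow schedules each one on the cluster.
    filtered = filter_html(input_uri=raw_uri)
    dedup(input_uri=filtered.output)

if __name__ == "__main__":
    # Compile to a spec that a Kubeflow Pipelines cluster can run.
    compiler.Compiler().compile(data_prep, "data_prep_sketch.yaml")
```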

All of IBM’s open-sourced Granite code and language models have been trained using data prepared through these optimized pipelines. Additionally, IBM has made significant contributions to the AI community by developing the Data Prep Kit, a toolkit hosted on GitHub. This kit streamlines data preparation for large language model applications, supporting pre-training, fine-tuning, and retrieval-augmented generation (RAG) use cases. Built on distributed processing frameworks like Spark and Ray, the kit allows developers to build scalable custom modules.
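The Data Prep Kit defines its own module interfaces in its GitHub repository; the snippet below is not that API, just a generic Ray sketch of how a custom per-row transform scales out on one of the frameworks the kit builds on.

```python
import ray

def normalize(row: dict) -> dict:
    """Per-row transform; Ray distributes these calls across workers."""
    row["text"] = row["text"].strip().lower()
    return row

ray.init()  # connects to an existing cluster, or starts a local one
ds = ray.data.from_items([{"text": "  Hello "}, {"text": "WORLD"}])
print(ds.map(normalize).take_all())
```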

For more information, visit the official IBM Research blog.

Image source: Shutterstock
