Together AI Boosts NVIDIA H200 and H100 GPU Cluster Performance with Kernel Collection


Joerg Hiller
Sep 06, 2024 07:14

Together AI enhances NVIDIA H200 and H100 GPU clusters with its Together Kernel Collection, offering significant performance improvements in AI training and inference.

Together AI has announced a significant enhancement to its GPU clusters with the integration of the NVIDIA H200 Tensor Core GPU, according to together.ai. This upgrade will be accompanied by the Together Kernel Collection (TKC), a custom-built kernel stack designed to optimize AI operations, providing substantial performance boosts for both training and inference tasks.

Enhanced Performance with TKC

The Together Kernel Collection (TKC) is engineered to significantly accelerate common AI operations. Compared to standard PyTorch implementations, TKC offers up to a 24% speedup for frequently used training operators and up to a 75% speedup for FP8 inference operations. This improvement is poised to reduce GPU hours, leading to cost efficiencies and faster time to market.
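As a rough illustration of how a kernel-level speedup translates into fewer GPU hours, consider the 24% training-operator figure cited above; the baseline workload size below is an assumed example value, not a number from the article:

```python
# Back-of-the-envelope: a 24% speedup means the same work finishes in
# 1/1.24 of the time, so GPU-hour usage drops accordingly.
baseline_gpu_hours = 1000.0   # assumed example workload
speedup = 0.24                # up to 24% faster training operators (per the article)

accelerated_gpu_hours = baseline_gpu_hours / (1.0 + speedup)
savings_pct = 100.0 * (1.0 - accelerated_gpu_hours / baseline_gpu_hours)

print(f"{accelerated_gpu_hours:.1f} GPU hours ({savings_pct:.1f}% fewer)")
```

A 24% speedup thus trims roughly a fifth of the GPU hours for the affected operations.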

Training and Inference Optimization

TKC’s optimized kernels, such as the multi-layer perceptron (MLP) with SwiGLU activation, are crucial for training large language models (LLMs) like Llama-3. These kernels are reported to be 22-24% faster than standard implementations and up to 10% faster than the best existing baselines. Inference tasks benefit from a robust stack of FP8 kernels, which Together AI has optimized to deliver more than a 75% speedup over base PyTorch implementations.
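For reference, the MLP-with-SwiGLU block that TKC accelerates computes, in its standard form, `(silu(x·W_gate) ⊙ (x·W_up))·W_down`. A minimal NumPy sketch of that reference computation follows; the shapes and weight names are illustrative, not TKC's API:

```python
import numpy as np

def silu(x):
    """SiLU (a.k.a. swish) activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """Reference SwiGLU MLP as used in Llama-style transformer blocks:
    the gate projection passes through SiLU and is multiplied elementwise
    with the up projection before the down projection."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Illustrative shapes: hidden size 8, intermediate size 16.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # (tokens, hidden)
w_gate = rng.standard_normal((8, 16))
w_up = rng.standard_normal((8, 16))
w_down = rng.standard_normal((16, 8))

out = swiglu_mlp(x, w_gate, w_up, w_down)
print(out.shape)  # (4, 8)
```

Fused kernels speed this block up by computing the two projections, the activation, and the elementwise product without materializing intermediates in global memory.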

Native PyTorch Compatibility

TKC is fully integrated with PyTorch, enabling AI developers to utilize its optimizations seamlessly within their existing frameworks. This integration simplifies the adoption of TKC, making it as easy as changing import statements within PyTorch.
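The import-level switch described above can be sketched as follows; `together_kernel_collection` and `gelu` are hypothetical names used purely to illustrate the pattern, not TKC's documented module path:

```python
import math

def gelu_reference(x: float) -> float:
    """Stock implementation of an op an optimized kernel could replace."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Adoption is an import-level switch: use the optimized kernel when the
# package is present, otherwise keep the stock code path unchanged.
# NOTE: "together_kernel_collection" is a hypothetical module name for
# illustration only.
try:
    from together_kernel_collection import gelu  # hypothetical
except ImportError:
    gelu = gelu_reference

print(gelu(1.0))
```

The try/except fallback keeps the calling code identical whether or not the optimized kernels are installed, which is what makes a pure import swap sufficient.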

Production-Level Testing

Together AI ensures that TKC undergoes rigorous testing to meet production-level standards, guaranteeing high performance and reliability for real-world applications. All Together GPU Clusters, whether H200 or H100, will feature TKC out of the box.

NVIDIA H200: Faster Performance and Larger Memory

The NVIDIA H200 Tensor Core GPU, built on the Hopper architecture, is designed for high-performance AI and HPC workloads. According to NVIDIA, the H200 offers 40% faster inference performance on Llama 2 13B and 90% faster on Llama 2 70B compared to its predecessor, the H100. The H200 features 141GB of HBM3e memory and 4.8TB/s of memory bandwidth, nearly double the capacity and 1.4 times the bandwidth of the H100.
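Those ratios follow directly from the published spec-sheet numbers (the H100 SXM ships with 80GB of HBM3 and 3.35TB/s of bandwidth):

```python
# H200 vs. H100 SXM memory comparison, from published spec sheets.
h200_capacity_gb, h200_bandwidth_tbs = 141, 4.8
h100_capacity_gb, h100_bandwidth_tbs = 80, 3.35

print(h200_capacity_gb / h100_capacity_gb)      # ~1.76x capacity ("nearly double")
print(h200_bandwidth_tbs / h100_bandwidth_tbs)  # ~1.43x bandwidth ("1.4 times")
```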

High-Performance Interconnectivity

Together GPU Clusters leverage the SXM form factor for high bandwidth and fast data transfer, supported by NVIDIA’s NVLink and NVSwitch technologies for ultra-high-speed communication between GPUs. Combined with NVIDIA Quantum-2 3200Gb/s InfiniBand networking, this setup is ideal for large-scale AI training and HPC workloads.

Cost-Effective Infrastructure

Together AI offers significant cost savings, with infrastructure designed to be up to 75% more cost-effective than cloud providers like AWS. The company also provides flexible commitment options, from one month to five years, ensuring the right resources at every stage of the AI development lifecycle.

Reliability and Support

Together AI’s GPU clusters come with a 99.9% uptime SLA and are backed by rigorous acceptance testing. The company’s White Glove Service offers end-to-end support, from cluster setup to ongoing maintenance, ensuring peak performance for AI models.

Flexible Deployment Options

Together AI provides several deployment options: Slurm for high-performance workload management, Kubernetes for containerized AI workloads, and bare-metal clusters running Ubuntu for direct access and maximum flexibility. These options cater to different AI project needs, from large-scale training to production-level inference.
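As an illustration of the Slurm option, a multi-node training job on such a cluster is typically submitted with a batch script along these lines; the job name, node counts, time limit, and launch command are placeholders, not Together AI's actual configuration:

```shell
#!/bin/bash
#SBATCH --job-name=llm-train        # placeholder job name
#SBATCH --nodes=4                   # number of GPU nodes (placeholder)
#SBATCH --ntasks-per-node=8         # one task per GPU on an 8-GPU node
#SBATCH --gpus-per-node=8           # request all 8 H200/H100 GPUs per node
#SBATCH --time=24:00:00             # wall-clock limit (placeholder)

# Launch one training process per GPU across all allocated nodes.
# "train.py" is a placeholder for the actual training entry point.
srun python train.py
```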

Together AI continues to support the entire AI lifecycle with its high-performance NVIDIA H200 GPU Clusters and the Together Kernel Collection. The platform is designed to optimize performance, reduce costs, and ensure reliability, making it an ideal choice for accelerating AI development.

Image source: Shutterstock
