Together AI Unveils Inference Engine 2.0 with Turbo and Lite Endpoints


Terrill Dicki
Jul 18, 2024 18:41

Together AI launches Inference Engine 2.0, offering Turbo and Lite endpoints for enhanced performance, quality, and cost-efficiency.


Together AI has announced the release of its new Inference Engine 2.0, which includes the highly anticipated Turbo and Lite endpoints. This new inference stack is designed to provide significantly faster decoding throughput and superior performance compared to existing solutions.

Performance Enhancements

According to together.ai, the Together Inference Engine 2.0 offers decoding throughput that is four times faster than the open-source vLLM and outperforms commercial solutions such as Amazon Bedrock, Azure AI, Fireworks, and Octo AI by 1.3x to 2.5x. The engine achieves over 400 tokens per second on Meta Llama 3 8B, thanks to advancements in FlashAttention-3, faster GEMM & MHA kernels, quality-preserving quantization, and speculative decoding.
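To put those throughput figures in perspective, a rough back-of-the-envelope sketch of decode latency follows, assuming a sustained decode rate (real workloads vary with batch size, prompt length, and load); the 4x vLLM baseline is implied from the article's comparison.

```python
def decode_time(num_tokens: int, tokens_per_second: float) -> float:
    """Estimated wall-clock seconds to decode num_tokens at a sustained rate."""
    return num_tokens / tokens_per_second

# Figures from the article: >400 tokens/s on Llama 3 8B,
# roughly 4x the decoding throughput of open-source vLLM.
engine_tps = 400.0
vllm_tps = engine_tps / 4.0  # implied baseline, ~100 tokens/s

response_tokens = 512
print(f"Engine 2.0: {decode_time(response_tokens, engine_tps):.2f}s")
print(f"vLLM:       {decode_time(response_tokens, vllm_tps):.2f}s")
```

At these rates, a 512-token response drops from about 5.1 seconds to about 1.3 seconds, which is the kind of difference users notice in interactive applications.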

New Turbo and Lite Endpoints

Together AI has introduced new Turbo and Lite endpoints, starting with Meta Llama 3. These endpoints aim to balance performance, quality, and cost, allowing enterprises to avoid compromises. Together Turbo closely matches the quality of full-precision FP16 models, while Together Lite offers the most cost-efficient and scalable Llama 3 models available.
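For orientation, here is a minimal sketch of what a chat-completion request to one of these endpoints might look like, assuming Together's OpenAI-compatible HTTP API; the model identifier string is an illustrative assumption based on Together's naming scheme, not confirmed by the article, so check the current model list before use.

```python
import json

# Together's OpenAI-compatible chat completions endpoint (assumed URL).
BASE_URL = "https://api.together.xyz/v1/chat/completions"

payload = {
    # Assumed identifier for the Llama 3 8B Turbo endpoint; verify against
    # Together's published model list.
    "model": "meta-llama/Meta-Llama-3-8B-Instruct-Turbo",
    "messages": [
        {"role": "user", "content": "Summarize FP8 quantization in one line."}
    ],
    "max_tokens": 128,
}

# The body would be POSTed to BASE_URL with an Authorization header carrying
# a Together API key; here we only show the serialized request payload.
print(json.dumps(payload, indent=2))
```

Swapping between Turbo and Lite variants would then be a matter of changing the model string, with no other code changes.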

Together Turbo endpoints provide fast FP8 performance while maintaining quality, matching FP16 reference models and outperforming other FP8 solutions on AlpacaEval 2.0. These Turbo endpoints are priced at $0.88 per million tokens for 70B and $0.18 for 8B, making them significantly more affordable than GPT-4o.

Together Lite endpoints use INT4 quantization to offer high-quality AI models at a lower cost, priced at $0.10 per million tokens for Llama 3 8B Lite, which is six times lower than GPT-4o-mini.
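A quick back-of-the-envelope comparison using the per-million-token prices quoted above makes the pricing tiers concrete; real bills depend on the input/output token split and current pricing, so treat this as illustrative only.

```python
# Per-million-token prices quoted in the article.
PRICES_PER_M = {
    "Llama 3 70B Turbo": 0.88,
    "Llama 3 8B Turbo": 0.18,
    "Llama 3 8B Lite": 0.10,
}

def monthly_cost(tokens: int, price_per_million: float) -> float:
    """Cost in dollars for a given token volume at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million

tokens = 50_000_000  # e.g. a workload of 50M tokens per month
for name, price in PRICES_PER_M.items():
    print(f"{name}: ${monthly_cost(tokens, price):.2f}/month")
```

At that hypothetical volume, the spread runs from $44/month on 70B Turbo down to $5/month on 8B Lite, which is the performance-versus-cost trade-off the two tiers are meant to expose.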

Adoption and Endorsements

Over 100,000 developers and companies, including Zomato, DuckDuckGo, and The Washington Post, are already using the Together Inference Engine for their generative AI applications. Rinshul Chandra, COO of Food Delivery at Zomato, praised the engine for its high quality, speed, and accuracy.

Technical Innovations

The Together Inference Engine 2.0 incorporates several technical advancements, including FlashAttention-3, custom-built speculators, and quality-preserving quantization techniques. These innovations contribute to the engine’s superior performance and cost-efficiency.

Future Outlook

Together AI plans to continue pushing the boundaries of AI acceleration. The company aims to extend support for new models, techniques, and kernels, ensuring the Together Inference Engine remains at the forefront of AI technology.

The Turbo and Lite endpoints for Llama 3 models are available starting today, with plans to expand to other models soon. For more information, visit the Together AI pricing page.

Image source: Shutterstock
