NVIDIA Unveils BigVGAN v2: Pioneering Zero-Shot Waveform Audio Generation


Zach Anderson


Sep 06, 2024 11:03

NVIDIA’s BigVGAN v2 sets a new standard in zero-shot waveform audio generation, achieving state-of-the-art quality with up to 3x faster synthesis speed.


NVIDIA has announced the release of BigVGAN v2, a generative AI model for zero-shot waveform audio generation, according to the NVIDIA Technical Blog. The new model is both faster and higher in quality than its predecessor, which NVIDIA positions as state of the art in audio generative AI.

BigVGAN: A Universal Neural Vocoder

BigVGAN is a universal neural vocoder designed to synthesize audio waveforms from Mel spectrograms. The model employs a fully convolutional architecture with several upsampling blocks and residual dilated convolution layers. A key feature is the anti-aliased multiperiodicity composition (AMP) module, which is optimized for generating high-frequency and periodic sound waves, reducing artifacts in the process.
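
To make the role of the upsampling blocks concrete, here is a minimal sketch of the arithmetic behind a vocoder of this shape. Each upsampling block stretches the time axis by some factor, and the product of those factors is the number of waveform samples that one Mel frame expands into. The factors below are illustrative assumptions, not values taken from the article or from an actual BigVGAN configuration.

```python
from math import prod

# Hypothetical per-block upsampling factors (illustrative only,
# not taken from a real BigVGAN configuration).
UPSAMPLE_FACTORS = [4, 4, 2, 2, 2, 2]


def samples_per_frame(factors):
    """Total time-axis expansion after passing through every upsampling block."""
    return prod(factors)


def waveform_length(num_mel_frames, factors):
    """Number of waveform samples produced from a Mel spectrogram with this many frames."""
    return num_mel_frames * samples_per_frame(factors)


if __name__ == "__main__":
    hop = samples_per_frame(UPSAMPLE_FACTORS)
    print(f"each Mel frame expands to {hop} samples")
    print(f"100 frames -> {waveform_length(100, UPSAMPLE_FACTORS)} samples")
```

Under these assumed factors, every Mel frame becomes 256 waveform samples, so the network has to invent a great deal of fine temporal detail per frame. That is the part of the problem the AMP module is described as addressing.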

Improvements in BigVGAN v2

BigVGAN v2 introduces several enhancements over its predecessor:


  • State-of-the-art audio quality across various metrics and audio types.

  • Up to 3x faster synthesis speed through optimized CUDA kernels.

  • Pretrained checkpoints for diverse audio configurations.

  • Support for sampling rates up to 44 kHz, covering the highest frequencies audible to humans.

Generating Every Sound in the World

Waveform audio generation is crucial for virtual worlds and has been a significant focus of research. BigVGAN v2 addresses previous limitations by delivering high-quality audio with enhanced fine details. Trained on NVIDIA A100 Tensor Core GPUs with a dataset over 100 times larger than its predecessor’s, BigVGAN v2 can generate high-quality sound waves across domains including speech, environmental sounds, and music.

Reaching the Highest Frequency Sound the Human Ear Can Detect

Previous models were limited to sampling rates between 22 kHz and 24 kHz. BigVGAN v2 extends this range to 44 kHz. A signal sampled at 44 kHz can represent frequencies up to 22 kHz, which is above the roughly 20 kHz upper limit of human hearing, so the model covers the entire human auditory spectrum. This allows it to reproduce comprehensive soundscapes, from robust drums to crisp cymbals in music.
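
The link between sampling rate and audible range follows from the Nyquist limit: a sampled signal can only represent frequencies up to half its sampling rate. The short sketch below works through that arithmetic for the rates mentioned above. The 20 kHz hearing limit is a commonly cited approximation rather than a figure from the article, and the function names are hypothetical.

```python
# Commonly cited upper bound of human hearing; it varies by listener and age.
HUMAN_HEARING_LIMIT_HZ = 20_000


def nyquist_frequency(sampling_rate_hz):
    """Highest frequency that a signal sampled at this rate can represent."""
    return sampling_rate_hz / 2


def covers_human_hearing(sampling_rate_hz, limit_hz=HUMAN_HEARING_LIMIT_HZ):
    """True if the Nyquist frequency reaches the assumed hearing limit."""
    return nyquist_frequency(sampling_rate_hz) >= limit_hz


if __name__ == "__main__":
    for rate in (22_000, 24_000, 44_000):
        top = nyquist_frequency(rate)
        print(f"{rate} Hz sampling -> up to {top:.0f} Hz, "
              f"full audible range: {covers_human_hearing(rate)}")
```

At 22 kHz or 24 kHz sampling, everything above 11 kHz or 12 kHz is lost, which is where much of the "air" in cymbals and sibilant speech sits.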

Faster Synthesis with Custom CUDA Kernels

BigVGAN v2 also features accelerated synthesis speed, using custom CUDA kernels to achieve up to 3x faster inference than the original BigVGAN. These kernels enable the generation of audio waveforms up to 240 times faster than real-time on a single NVIDIA A100 GPU.
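
As a rough illustration of what "240 times faster than real-time" means in practice, the sketch below converts a real-time factor into wall-clock synthesis time and back. The function names are hypothetical, and the figures are the article's best-case numbers, so actual throughput will depend on hardware, batch size, and model configuration.

```python
def synthesis_time_seconds(audio_seconds, speedup):
    """Wall-clock time needed to synthesize audio at a given real-time factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def realtime_factor(audio_seconds, wall_clock_seconds):
    """How many seconds of audio are produced per second of compute."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return audio_seconds / wall_clock_seconds


if __name__ == "__main__":
    # One minute of audio at the quoted best-case factor of 240x.
    t = synthesis_time_seconds(60, 240)
    print(f"60 s of audio takes about {t:.2f} s to synthesize")
    print(f"measured factor: {realtime_factor(60, t):.0f}x real-time")
```

At that rate, a minute of audio takes about a quarter of a second to generate.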

Audio Quality Results

BigVGAN v2 delivers better audio quality than its predecessor on speech and general audio, and results comparable to the Descript Audio Codec at a 44 kHz sampling rate. This demonstrates the model’s capability to produce high-quality waveforms across various audio types.

Conclusion

NVIDIA’s BigVGAN v2 sets a new benchmark in audio synthesis, achieving state-of-the-art quality across audio types and covering the full range of human hearing. Synthesis is now up to 3x faster than with the original BigVGAN, and pretrained checkpoints are available for diverse audio configurations.

For more information, users are encouraged to review the BigVGAN v2 model card on GitHub.

