Anthropic Expands AI Model Safety Bug Bounty Program


Darius Baruo


Aug 08, 2024 14:47

Anthropic broadens its AI model safety bug bounty program to address universal jailbreak vulnerabilities, offering rewards of up to $15,000.


The rapid advancement of artificial intelligence (AI) model capabilities necessitates equally swift progress in safety protocols. According to Anthropic, the company is expanding its bug bounty program with a new initiative aimed at finding flaws in the mitigations designed to prevent misuse of its models.

Bug bounty programs are essential in fortifying the security and safety of technological systems. Anthropic’s new initiative focuses on identifying and mitigating universal jailbreak attacks: exploits that could consistently bypass AI safety guardrails across a wide range of topics. The initiative targets high-risk domains such as chemical, biological, radiological, and nuclear (CBRN) safety, as well as cybersecurity.

Anthropic’s Approach

To date, Anthropic has operated an invite-only bug bounty program in collaboration with HackerOne, rewarding researchers for identifying model safety issues in its publicly released AI models. The newly announced bug bounty initiative aims to test Anthropic’s next-generation AI safety mitigation system, which has not yet been publicly deployed. Key features of the program include:


  • Early Access: Participants will receive early access to test the latest safety mitigation system before its public deployment. They will be challenged to identify potential vulnerabilities or ways to circumvent safety measures in a controlled environment.

  • Program Scope: Anthropic offers bounty rewards of up to $15,000 for novel, universal jailbreak attacks that could expose vulnerabilities in critical, high-risk domains such as CBRN and cybersecurity. A universal jailbreak is a vulnerability that allows consistent bypassing of AI safety measures across a wide range of topics. Detailed instructions and feedback will be provided to program participants.

Get Involved

This model safety bug bounty initiative will initially be invite-only, conducted in partnership with HackerOne. While starting as invite-only, Anthropic plans to broaden the initiative in the future. The initial phase is intended to refine processes and provide timely, constructive feedback on submissions. Experienced AI security researchers, or those with expertise in identifying jailbreaks in language models, are encouraged to apply for an invitation through the application form by Friday, August 16. Selected applicants will be contacted in the fall.

In the meantime, Anthropic actively seeks reports on model safety concerns to improve current systems. Potential safety issues can be reported to [email protected] with sufficient details for replication. More information can be found in the company’s Responsible Disclosure Policy.

This initiative aligns with commitments Anthropic has signed, alongside other AI companies, for responsible AI development, such as the Voluntary AI Commitments announced by the White House and the Code of Conduct for Organizations Developing Advanced AI Systems developed through the G7 Hiroshima Process. The goal is to accelerate progress in mitigating universal jailbreaks and strengthen AI safety in high-risk areas. Experts in the field are encouraged to join this effort to ensure that as AI capabilities advance, safety measures keep pace.

Image source: Shutterstock
