Anthropic Expands AI Model Safety Bug Bounty Program


Darius Baruo


Aug 08, 2024 14:47

Anthropic broadens its AI model safety bug bounty program to address universal jailbreak vulnerabilities, offering rewards of up to $15,000.


The rapid advancement of artificial intelligence (AI) model capabilities necessitates equally swift progress in safety protocols. According to Anthropic, the company is expanding its bug bounty program with a new initiative aimed at finding flaws in the mitigations designed to prevent misuse of its models.

Bug bounty programs are essential in fortifying the security and safety of technological systems. Anthropic’s new initiative focuses on identifying and mitigating universal jailbreak attacks: exploits that could consistently bypass AI safety guardrails across a wide range of topics. The initiative targets high-risk domains such as chemical, biological, radiological, and nuclear (CBRN) safety, as well as cybersecurity.

Anthropic’s Approach

To date, Anthropic has operated an invite-only bug bounty program in collaboration with HackerOne, rewarding researchers for identifying model safety issues in its publicly released AI models. The newly announced bug bounty initiative aims to test Anthropic’s next-generation AI safety mitigation system, which has not yet been publicly deployed. Key features of the program include:


  • Early Access: Participants will receive early access to test the latest safety mitigation system before its public deployment. They will be challenged to identify potential vulnerabilities or ways to circumvent safety measures in a controlled environment.

  • Program Scope: Anthropic offers bounty rewards of up to $15,000 for novel, universal jailbreak attacks that could expose vulnerabilities in critical, high-risk domains such as CBRN and cybersecurity. A universal jailbreak is a vulnerability that allows consistent bypassing of AI safety measures across a wide range of topics. Detailed instructions and feedback will be provided to program participants.

Get Involved

This model safety bug bounty initiative will initially be invite-only, conducted in partnership with HackerOne. While starting as invite-only, Anthropic plans to broaden the initiative in the future. The initial phase is intended to refine processes and provide timely, constructive feedback on submissions. Experienced AI security researchers, or those with expertise in identifying jailbreaks in language models, are encouraged to apply for an invitation through the application form by Friday, August 16. Selected applicants will be contacted in the fall.

In the meantime, Anthropic actively seeks reports on model safety concerns to improve current systems. Potential safety issues can be reported to [email protected] with sufficient details for replication. More information can be found in the company’s Responsible Disclosure Policy.

This initiative aligns with commitments Anthropic has signed, alongside other AI companies, for responsible AI development, such as the Voluntary AI Commitments announced by the White House and the Code of Conduct for Organizations Developing Advanced AI Systems developed through the G7 Hiroshima Process. The goal is to accelerate progress in mitigating universal jailbreaks and strengthen AI safety in high-risk areas. Experts in the field are encouraged to join this effort to ensure that as AI capabilities advance, safety measures keep pace.

Image source: Shutterstock
