GitHub Reports Four Major Incidents Affecting Services in July 2024


James Ding


Aug 16, 2024 02:42

GitHub experienced four significant service disruptions in July 2024, affecting Webhooks, Copilot, and Actions. Learn more about the incidents and mitigation steps.


GitHub faced a challenging month in July 2024, with four major incidents leading to degraded performance across several of its services, according to The GitHub Blog.

Incident Breakdown


July 5 (lasting 97 minutes)

On July 5, from 16:31 to 18:08 UTC, GitHub’s Webhooks service experienced performance degradation due to a configuration change that removed authentication from background job requests, causing these requests to be rejected. The incident led to delays in Webhooks deliveries, with an average delay of 24 minutes and a maximum of 71 minutes. A secondary issue from 18:21 to 21:14 UTC further affected GitHub Actions runs on pull requests, adding delays to job delivery.

To prevent future occurrences, GitHub has updated dashboards, improved health checks, and introduced new alerts for similar issues. The company is also working on better workload isolation to minimize the impact of such incidents.


July 13 (lasting 19 hours and 26 minutes)

On July 13, from 00:01 to 19:27 UTC, GitHub Copilot services were significantly degraded. The error rate for Copilot code completions reached 1.16%, while the error rate for GitHub Copilot Chat peaked at 63%. The issue was traced back to a resource cleanup job executed by a partner service, which mistakenly targeted essential resources. GitHub managed to mitigate the impact while resources were being restored.

GitHub is now collaborating with partner services to implement safeguards against future incidents and enhance traffic rerouting processes for quicker mitigation.


July 16 (lasting 149 minutes)

On July 16, from 00:30 to 03:07 UTC, Copilot Chat was degraded and rejected all requests, with an error rate close to 100%. The issue was triggered during routine maintenance, when GitHub services were disconnected and then overwhelmed the dependent service as they reconnected.

To address this, GitHub is improving its reconnection and circuit-breaking logic to prevent similar disruptions in the future.
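
As an illustration of the general idea (not GitHub's actual implementation, which the report does not describe), reconnection storms are commonly dampened with two pieces: a circuit breaker that stops calls to a failing dependency for a cool-down period, and exponential backoff with jitter so that disconnected clients do not all reconnect at once. The sketch below is a minimal Python version; the class name, function name, and all thresholds are hypothetical.

```python
import random


class CircuitBreaker:
    """Hypothetical breaker: opens after `max_failures` consecutive failures
    and permits a trial call once `reset_after` seconds have passed."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self, now):
        # Closed circuit: always allow. Open circuit: allow only after cool-down.
        if self.opened_at is None:
            return True
        return now - self.opened_at >= self.reset_after

    def record_success(self):
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now):
        # Open (or re-open) the circuit once the failure threshold is reached.
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = now


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))
```

A client would check `allow(now)` before each reconnect attempt and sleep for `backoff_delay(attempt)` between attempts; the random jitter spreads reconnections over time instead of letting them arrive in a single burst.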


July 18 (lasting 231 minutes)

On July 18, starting at 22:38 UTC, network issues within an upstream provider led to degraded experiences across Actions, Copilot, and GitHub Pages services. Up to 50% of Actions workflow jobs were stuck in the queuing state, and users faced issues with enabling Actions or registering self-hosted runners. The problem was caused by an unreachable backend resource in the central US region.

GitHub mitigated the issue by updating the replication configuration, which allowed successful requests while one region was unavailable. The company is now enhancing its replication and failover workflows to better handle such situations and reduce recovery time.
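
The report does not say how the replication configuration was changed. As a rough sketch of the underlying idea only, a client can keep serving requests during a regional outage by trying replicas in order and using the first one that responds. The function name, exception type, and region names below are hypothetical.

```python
class RegionUnavailable(Exception):
    """Raised when a region cannot serve the request (hypothetical)."""


def read_with_failover(regions, fetch):
    """Try each region in order; return (region, result) from the first success.

    `fetch` is any callable that takes a region name and either returns a
    result or raises RegionUnavailable.
    """
    failed = []
    for region in regions:
        try:
            return region, fetch(region)
        except RegionUnavailable:
            failed.append(region)  # remember the failure, try the next replica
    raise RegionUnavailable(f"all regions failed: {failed}")
```

With a replica list such as `["central-us", "east-us"]`, a request that fails in the first region is retried against the second, so callers see a successful response as long as at least one replica is reachable.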

Future Mitigation Steps

In response to these incidents, GitHub is taking multiple steps to improve its service resilience. These include updating dashboards, enhancing health checks, implementing new alerts, collaborating with partner services, and improving reconnection and circuit-breaking logic. The company is also focused on better workload isolation and enhancing replication and failover workflows.

For real-time updates on status changes and post-incident recaps, users are encouraged to follow GitHub’s status page and the GitHub Engineering Blog.

