In February, we experienced three incidents that resulted in degraded performance across GitHub services. This report also sheds light on a January incident that resulted in degraded performance for GitHub Packages and GitHub Pages, and on another January incident that impacted Git users.
January 30 21:31 UTC (lasting 35 minutes)
On January 30 at 21:36 UTC, our alerting system detected a 500 error response increase in requests made to the Container registry. As a result, most builds on GitHub Pages and requests to GitHub Packages failed during the incident.
Upon investigation, we found that a change had been made to the Container registry Redis configuration at 21:30 UTC to enforce authentication on Redis connections. The Container registry production deployment file contained a hard-coded connection string, so client connections were unable to authenticate, resulting in errors and preventing successful connections.
At 22:12 UTC, we reverted the configuration change for Redis authentication. Container registry began recovering two minutes later, and GitHub Pages was considered healthy again by 22:21 UTC.
To help prevent future incidents, we improved management of secrets in the Container registry’s Redis deployment configurations and added extra test coverage for authenticated Redis connections.
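As an illustration of the failure mode described above, the following is a minimal sketch of building a Redis connection URL from the environment instead of hard-coding it, so that enabling authentication only requires supplying a secret. The function name and the `REDIS_*` variable names are assumptions made for this example; they are not taken from the Container registry's actual deployment configuration.

```python
import os
from urllib.parse import quote


def build_redis_url(env=None):
    """Assemble a redis:// URL from environment-style settings.

    All variable names here (REDIS_HOST, REDIS_PORT, REDIS_DB,
    REDIS_PASSWORD) are hypothetical, chosen for illustration only.
    """
    env = os.environ if env is None else env
    host = env.get("REDIS_HOST", "localhost")
    port = env.get("REDIS_PORT", "6379")
    db = env.get("REDIS_DB", "0")
    password = env.get("REDIS_PASSWORD")

    if password:
        # Percent-encode the secret so characters such as '@' or '/'
        # cannot break the URL structure.
        return f"redis://:{quote(password, safe='')}@{host}:{port}/{db}"
    # No secret configured: fall back to an unauthenticated URL.
    return f"redis://{host}:{port}/{db}"


if __name__ == "__main__":
    print(build_redis_url({"REDIS_HOST": "cache.internal", "REDIS_PASSWORD": "p@ss"}))
```

With this shape, a test can exercise both the authenticated and unauthenticated paths by passing a plain dictionary, without touching the real environment.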
January 30 18:35 UTC (lasting 7 hours)
On January 30 at 18:35 UTC, GitHub deployed a change which slightly altered the compression settings on source code downloads. This change altered the checksums of the resulting archive files, with unforeseen consequences for a number of communities. The contents of these files were unchanged, but many communities had come to rely on the precise layout of bytes also being unchanged. When we realized the impact, we reverted the change and communicated with affected communities.
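To see why a compression setting can change an archive's checksum without changing its contents, consider the following minimal sketch. It uses Python's standard `gzip` module at two compression levels as a stand-in; it does not reproduce the actual change we deployed, and the payload is a made-up placeholder for a source archive.

```python
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical stand-in for the bytes of a source archive.
payload = b"".join(b"line %d of a source file\n" % i for i in range(2000))

# Same input, two compression levels. mtime=0 pins the gzip header
# timestamp so that only the compression setting varies.
fast = gzip.compress(payload, compresslevel=1, mtime=0)
best = gzip.compress(payload, compresslevel=9, mtime=0)

# Decompressing either archive yields the original bytes...
contents_match = gzip.decompress(fast) == gzip.decompress(best) == payload
# ...but the compressed files themselves are different byte sequences.
checksums_match = sha256_hex(fast) == sha256_hex(best)

print("contents identical:", contents_match)
print("archive checksums identical:", checksums_match)
```

A consumer that pins the checksum of the compressed file therefore sees a mismatch, while one that checks the decompressed contents does not.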
We did not anticipate the broad impact this change would have on a number of communities and are implementing new procedures to prevent future incidents. This includes working through several improvements in our deployment of Git throughout GitHub and adding a checksum validation to our workflow.
See this related blog post for details about our plan going forward.
February 7 21:30 UTC (lasting 20 hours and 35 minutes)
On February 7 at 21:30 UTC, our monitors detected failures creating, starting, and connecting to GitHub Codespaces in the Southeast Asia region, caused by a datacenter outage of our cloud provider. To reduce the impact to our customers during this time, we redirected codespace creations to a secondary location, allowing new codespaces to be used. Codespaces in that region recovered automatically when the datacenter recovered, allowing existing codespaces in the region to be restarted. Codespaces in other regions were not impacted during this incident.
Based on learnings from this incident, we are evaluating expanding our regional redundancy and have started making architectural changes to better handle temporary regional and datacenter outages, including more regularly exercising our failover capabilities.
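The redirection of codespace creations to a secondary location can be pictured with a small sketch like the one below. It assumes a simple ordered list of fallback regions and a set of regions currently considered healthy; the function and region names are hypothetical and do not describe the actual Codespaces placement logic.

```python
from typing import Iterable, Optional, Set


def pick_region(preferred: str, fallbacks: Iterable[str], healthy: Set[str]) -> Optional[str]:
    """Return the first healthy region, trying the preferred one first.

    Returns None when neither the preferred region nor any fallback
    is healthy, so the caller can surface an error instead of guessing.
    """
    for region in [preferred, *fallbacks]:
        if region in healthy:
            return region
    return None


if __name__ == "__main__":
    # Hypothetical region names: the preferred region is down,
    # so creation is redirected to the first healthy fallback.
    print(pick_region("southeast-asia", ["east-asia", "west-us"], {"east-asia", "west-us"}))
```

Exercising this path regularly, rather than only during an outage, is one way to keep a failover mechanism trustworthy.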
February 18 02:36 UTC (lasting 2 hours and 26 minutes)
On February 18 at 02:36 UTC, we became aware of errors in our application code pointing to connectivity issues with our MySQL databases. Upon investigation, we believe these errors were due to a few unhealthy deployments of our sharding middleware. At 03:30 UTC, we performed a re-deployment of the database infrastructure in an effort to remediate. Unfortunately, this propagated the issue to all Kubernetes pods, leading to system-wide errors. As a result, multiple services returned 500 error responses and GitHub users experienced issues signing in to GitHub.com.
At 04:30 UTC, we found that the database topology in 30% of our deployments was corrupted, which prevented applications from connecting to the database. We applied a copy of the correct database topology to all deployments, which resolved the errors across services by 05:00 UTC. Users were then able to sign in to GitHub.com.
To help prevent future incidents, we added a monitor to detect database topology errors so we can identify them well before they impact production systems. We have also improved our observability around topology reloads, both successful and erroneous ones. In addition, we are doing a deeper review of the contributing factors to this incident to learn and improve both our architecture and operations to prevent a recurrence.
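One way to picture a topology check of this kind is to compare each deployment's topology against a known-good reference. The sketch below does that by fingerprinting a canonical JSON form; the data shapes, names, and the fingerprinting approach are assumptions for illustration, not a description of the monitor we deployed.

```python
import hashlib
import json


def fingerprint(topology: dict) -> str:
    """Hash a canonical JSON encoding so key order does not matter."""
    canonical = json.dumps(topology, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_mismatched(reference: dict, deployments: dict) -> list:
    """Return the names of deployments whose topology differs from the reference."""
    expected = fingerprint(reference)
    return sorted(
        name for name, topology in deployments.items()
        if fingerprint(topology) != expected
    )


# Hypothetical data: ten deployments, three of which hold a corrupted topology.
reference = {"shards": {"0": "db-a", "1": "db-b"}}
corrupted = {"shards": {"0": "db-a"}}
deployments = {
    f"pod-{i}": (corrupted if i < 3 else dict(reference)) for i in range(10)
}

mismatched = find_mismatched(reference, deployments)
print(f"{len(mismatched)} of {len(deployments)} deployments mismatched: {mismatched}")
```

A monitor built along these lines could alert when the mismatched fraction rises above zero, before a re-deployment spreads the bad topology further.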
February 28 16:05 UTC (lasting 1 hour and 26 minutes)
On February 28 at 16:05 UTC, we were notified of degraded performance for GitHub Codespaces. We resolved the incident at 17:31 UTC.
Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update in next month’s report.