In May, we experienced four incidents that resulted in degraded performance across GitHub services. This report also sheds light into three April incidents that resulted in degraded performance across GitHub services.
On April 25 at 23:11 UTC, a subset of users began to see a degraded experience with GitHub Copilot code completions. We publicly statused GitHub Copilot to yellow at 23:26 UTC, and to red at 23:41 UTC. As engineers identified the impact to be a subset of requests, we statused back to yellow at 23:48 UTC. The incident was fully resolved on April 26 at 00:02 UTC, and we publicly statused green at 00:30 UTC.
The degradation consisted of a rolling partial outage across all three GitHub Copilot regions: US Central, US East, and Switzerland North. Each of these regions experienced approximately 15-20 minutes of degraded service during the global incident. At the peak, 6% of GitHub Copilot code completion requests failed.
We identified the root cause to be a faulty configuration change by an automated maintenance process. The process was initiated across all regions sequentially, and placed a subset of faulty nodes in service before the rollout was halted by operators. Automated traffic rollover from the failed nodes and regions helped to mitigate the issue.
Our efforts to prevent a similar incident in the future include both reducing the batch size and iteration speed of the automated maintenance process, and lowering our time to detection by adjusting our alerting thresholds.
On April 26 at 08:59 UTC, our internal monitors notified us of degraded availability with GitHub Packages. Users would have noticed slow or failed GitHub Packages upload and download requests. Our investigation revealed a spike in connection errors to our primary database node. We quickly took action to resolve the issue by manually restarting the database. At 09:56 UTC, all errors were cleared and users experienced a complete recovery of the GitHub Packages service. A planned migration of GitHub Packages database to a more robust platform was completed on May 2, 2023 to prevent this issue from recurring.
On April 28 at 12:26 UTC, we were notified of degraded availability for GitHub Codespaces. Users in the East US region experienced failures when creating and resuming codespaces. At 12:45 UTC, we used regional failover to redirect East US codespace creates and resumes to the nearest healthy region, East US 2, and users experienced a complete and nearly immediate recovery of GitHub Codespaces.
Our investigation indicated our cloud provider had experienced an outage in the East US region, with virtual machines in that region experiencing internal operation errors. Virtual machines in the East US 2 region (and all other regions) were healthy, which enabled us to use regional failover to successfully recover GitHub Codespaces for our East US users. When our cloud provider’s outage was resolved, we were able to seamlessly direct all of our East US GitHub Codespaces uses back with no downtime.
Long-term mitigation is focused on reducing our time to detection for outages such as this by improving our monitors and alerts, as well as reducing our time to mitigate by making our regional failover tooling and documentation more accessible.
On May 4th at 15:23 UTC, our monitors detected degraded performance for Git Operations, GitHub APIs, GitHub Issues, GitHub Pull Requests, GitHub Webhooks, GitHub Actions, GitHub Pages, GitHub Codespaces, and GitHub Copilot. After troubleshooting we were able to mitigate the issue by performing a primary failover on our repositories database cluster. Further investigation indicated the root cause was connection pool exhaustion on our proxy layer. Prior updates to this configuration were inconsistently applied. We audited and fixed our proxy layer connection pool configurations during this incident, and updated our configuration automation to dynamically apply config changes without disruption to ensure consistent configuration of database proxies moving forward.
On May 9 at 11:27 UTC, users began to see failures to read or write Git data. These failures continued until 12:33 UTC, affecting Git Operations, GitHub Issues, GitHub Actions, GitHub Codespaces, GitHub Pull Requests, GitHub Web Hooks, and GitHub APIs. Repositories and GitHub Pull Requests required additional time to fully recover job results and search capabilities, with recovery completing at 21:20 UTC. On May 11 at 13:33 UTC, similar failures occurred affecting the same services until 14:40 UTC. Again, GitHub Pull Requests required additional time to fully recover search capabilities, with recovery completing at 18:54 UTC. We discussed both of these events in a previous blog post and can confirm they share the same root cause.
Based on our investigation we determined that the cause of this crash is due to a bug in the database version we are running, and the conditions causing this bug were more likely to happen in a custom configuration on this data cluster. We updated our configuration to match the rest of our database clusters, and this cluster is no longer vulnerable to this kind of failover.
The bug has since been reported to the database maintainers, accepted as a private bug, and fixed. The fix is slated for a release expected in July.
There have been several directions of work in response to these incidents to avoid reoccurrence. We have focused on removing special case configurations of our database clusters to avoid unpredictable behavior from custom configurations. Across feature areas, we have also expanded tooling around graceful degradation of web pages when dependencies are unavailable.
On May 10 at 12:38 UTC, issuance of auth tokens for GitHub Apps started failing, impacting GitHub Actions, GitHub API Requests, GitHub Codespaces, Git Operations, GitHub Pages, and GitHub Pull Requests. We identified the cause of these failures to be a significant increase in write latency on a shared permissions database cluster. First responders mitigated the incident by identifying the data shape in new API calls that was causing very expensive database write transactions and timeouts in a loop and blocking the source. We shared additional details on this incident in a previous blog post, but we wanted to share an update on our follow-up actions. Beyond the immediate work to address the expensive query pattern that caused this incident, we completed an audit of other endpoints to identify and correct any similar patterns. We completed improvements to the observability of API errors and have further work in progress to improve diagnosis of unhealthy MySQL write patterns. We also completed improvements to tools, documentation and playbooks, and training for both the technical diagnosis and our general incident response to address issues encountered while mitigating this issue and to reduce the time to mitigate similar incidents in the future.
On May 16 at 21:08 UTC, we were alerted to degradation of multiple services. GitHub Issues, GitHub Pull Requests, and Git Ops were unavailable while GitHub API, GitHub Actions, GitHub Pages, and GitHub Codespaces were all partially unavailable. Alerts indicated that the primary database of a cluster supporting key-value data had experienced a hardware crash. The cluster was left in such a state that our failover automation was unable to select a new primary to promote due to the risk of data loss. Our first responder evaluated the cluster, determined it was safe to proceed, and then manually triggered a failover to a new primary host 11 minutes after the server crash. We aspire to reduce our response time moving forward and are looking into improving our alerting for cases like this. Long-term mitigation is focused on reducing dependency on this cluster as a single point of failure for much of the site.