


In June, we experienced four incidents resulting in significant impact and degraded availability for multiple services. This report also sheds light on an incident that impacted multiple services in May.

June 1 09:40 UTC (lasting 48 minutes)

During this incident, customers experienced delays in the startup of their GitHub Actions workflows. The cause of these delays was excessive load on a proxy server that routes traffic to the database.

At 09:37 UTC, the Actions service noticed a marked increase in the time it takes customer jobs to start. Our on-call engineer was paged and Actions was statused red. Once we started to investigate, we noticed that the pods running the proxy server for the database were crash-looping due to out-of-memory errors. A change was created to increase the available memory to these pods, which fully rolled out by 10:08 UTC. We started to see recovery in Actions even before 10:08 UTC, and statused to yellow at 10:17 UTC. By 10:28 UTC, we were confident that the memory increase had mitigated the issue, and statused Actions green.
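For readers less familiar with this kind of mitigation, the sketch below shows one way a memory increase like this can be applied when the proxy runs as a Kubernetes Deployment, using the Kubernetes Python client. It is a minimal, hypothetical illustration: the deployment name ("db-proxy"), namespace, and memory sizes are assumptions, and the report does not describe the tooling actually used.

```python
# Minimal sketch, not GitHub's actual tooling: raise the memory request/limit
# on a hypothetical "db-proxy" Deployment with the Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "db-proxy",  # hypothetical container name
                        "resources": {
                            "requests": {"memory": "2Gi"},  # illustrative sizes
                            "limits": {"memory": "4Gi"},
                        },
                    }
                ]
            }
        }
    }
}

# Strategic-merge patch on the pod template: this rolls the proxy pods
# so they restart with the new memory settings.
apps.patch_namespaced_deployment(name="db-proxy", namespace="production", body=patch)
```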

Ultimately, this issue was traced back to a set of data analysis queries being pointed at an incorrect database. The large load they placed on the database caused the crash loops and the broader impact. These queries have been moved to a dedicated analytics setup that does not serve production traffic. We are adding alerts to identify increases in load to the proxy server to catch issues like this early. We are also investigating how we can put in guardrails to ensure production database access is limited to services that own the data.

June 21 17:02 UTC (lasting 1 hour and 10 minutes)

During this incident, shortly after the GA of Copilot, users with either a Marketplace or Sponsorship plan were unable to use Copilot. Users with those subscriptions received an error from our API responsible for creating authentication tokens. At approximately 16:45 UTC, we were alerted to elevated error rates in the API and began investigating causes. We were able to identify the issue and statused red. This impacted a little less than 20% of our active users at the time.
