On August 9th around 12:03pm PT we started to see degraded performance on the public production API. On closer inspection, we found that many internal services were healthy but were timing out when making requests to other internal services.
We traced the connectivity issues to our internal load balancers, which were failing to serve requests because they had run out of free disk space.
How did we fix it?
All internal load balancers were moved to fresh hosts. In addition, all other hosts that directly support the public production API were cycled.
This issue was considered resolved at 1:44pm PT.
How will we prevent this in the future?
The hosts that support our internal load balancers have been added to our host refresh schedule to ensure they are cycled out of service before they exhaust their disk space or other resources.
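As a complement to scheduled host refreshes, a lightweight disk-usage check can catch exhaustion before it takes a host down. The sketch below is illustrative only, not our actual tooling; the threshold value and the idea of flagging a host for refresh are assumptions for the example.

```python
import shutil

def disk_usage_percent(path="/"):
    """Return the percentage of disk space used on the filesystem at `path`."""
    total, used, _free = shutil.disk_usage(path)
    return used / total * 100

# Hypothetical threshold: flag the host well before the disk is full,
# so it can be cycled out of service gracefully.
THRESHOLD_PERCENT = 80.0

if disk_usage_percent("/") > THRESHOLD_PERCENT:
    print("disk usage above threshold; flag host for early refresh")
```

A check like this, run periodically on load-balancer hosts and wired into alerting, would surface the failure mode described above before requests start failing.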