Production Timeouts
Incident Report for Synapse
Postmortem

What happened?
On August 9th around 12:03pm PT we started to see degraded performance on the public production API. On closer inspection, we found that many internal services were themselves healthy but were timing out when making requests to other internal services.

We tracked the connectivity issues down to our internal load balancers, which were failing to serve requests because they had run out of free disk space.
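
For illustration only, the sketch below shows the kind of free-disk-space check that would have surfaced this condition on the load balancer hosts before it caused request failures. The monitored paths, the 10% threshold, and the plain print alert are assumptions for the example; they are not a description of our actual monitoring.

```python
import shutil

# Hypothetical paths and threshold; the real load balancer hosts and
# alerting pipeline are not described in this report.
MONITORED_PATHS = ["/", "/var/log"]
FREE_SPACE_THRESHOLD = 0.10  # alert when less than 10% of the disk is free


def free_fraction(path: str) -> float:
    """Return the fraction of disk space still free at `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def main() -> None:
    for path in MONITORED_PATHS:
        fraction = free_fraction(path)
        if fraction < FREE_SPACE_THRESHOLD:
            # In practice this would page on-call rather than print.
            print(f"LOW DISK: {path} has only {fraction:.1%} free")


if __name__ == "__main__":
    main()
```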

How did we fix it?
All internal load balancers were moved to fresh hosts. In addition, all other hosts that directly support the public production API were cycled.

This issue was considered resolved at 1:44pm PT.

How will we prevent this in the future?
The hosts that support our internal load balancers have been added to our host refresh schedule to ensure they are cycled out of service before they exhaust their resources.
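
As a rough sketch of what such a refresh schedule check might look like (the 30-day interval, host names, and launch times below are made up for the example and do not reflect our actual inventory or tooling):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical refresh interval; the actual schedule is not specified here.
REFRESH_INTERVAL = timedelta(days=30)

# Example inventory: host name -> time it entered service (assumed data).
HOST_LAUNCH_TIMES = {
    "internal-lb-1": datetime(2019, 7, 1, tzinfo=timezone.utc),
    "internal-lb-2": datetime(2019, 8, 5, tzinfo=timezone.utc),
}


def hosts_due_for_refresh(now: datetime) -> list:
    """Return hosts that have been in service longer than the refresh interval."""
    return [
        host
        for host, launched in HOST_LAUNCH_TIMES.items()
        if now - launched > REFRESH_INTERVAL
    ]


if __name__ == "__main__":
    for host in hosts_due_for_refresh(datetime.now(timezone.utc)):
        print(f"{host} is due to be cycled out of service")
```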

Posted Aug 23, 2019 - 14:45 PDT

Resolved
This incident has been resolved.
We will have a root cause analysis by the end of next week.
Posted Aug 09, 2019 - 15:46 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
We will have a root cause analysis by the end of next week.
Posted Aug 09, 2019 - 14:00 PDT
Investigating
We are experiencing timeout errors in production. This is also affecting the Synapse dashboard.
Posted Aug 09, 2019 - 12:13 PDT
This incident affected: CORE APIs.