At 12:25 PM UTC our APIs became unreachable. Alerting fired immediately and triggered a response from the SRE team.
The cause was quickly identified: a recent change to our infrastructure code removed the API services from our production cluster. The issue did not surface in the development environment where the change was tested, because of configuration differences between the two environments.
A fix was developed and tested on our main service. As a result, our /v4 and /v5 endpoints became available again at 12:32 PM UTC. By 12:47 PM UTC, /v3 and the remaining services were operational as well.
To make sure that mistakes like this one do not recur, we have taken the following actions:
1. We immediately blocked the ability to remove services during deployments (see the sketch after this list).
2. We prioritised work to configure our development environments the same way as production, since matching configurations would have exposed this problem before it reached production.
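To illustrate the first action item, here is a minimal sketch of a deployment gate, assuming a Terraform-based setup (our post does not name the actual tooling; the script, the plan.json path, and the CI wiring are hypothetical). It inspects the JSON output of `terraform show -json` and fails the pipeline if the plan would destroy any resource without recreating it:

```python
#!/usr/bin/env python3
"""CI gate: fail the deployment if the Terraform plan would remove resources.

A minimal sketch, assuming the plan JSON was produced with:
    terraform plan -out=plan.tfplan
    terraform show -json plan.tfplan > plan.json
"""
import json
import sys

PLAN_FILE = "plan.json"  # hypothetical path; adjust to your pipeline


def planned_removals(plan: dict) -> list[str]:
    """Return addresses of resources whose planned actions delete them outright."""
    removals = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # ["delete"] is a destroy; ["delete", "create"] is a replace, which we allow.
        if "delete" in actions and "create" not in actions:
            removals.append(rc["address"])
    return removals


def main() -> int:
    with open(PLAN_FILE) as f:
        plan = json.load(f)
    removals = planned_removals(plan)
    if removals:
        print("Deployment blocked: plan would remove resources:", file=sys.stderr)
        for address in removals:
            print(f"  - {address}", file=sys.stderr)
        return 1
    print("No removals detected; deployment may proceed.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

A gate like this turns a silent service removal into an explicit pipeline failure, so an intentional removal requires a deliberate override rather than slipping through with an ordinary deploy.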