At 12:25 PM UTC our APIs became unreachable. Alerting fired immediately and triggered a response from the SRE team.
The cause was quickly identified: a recent change to our infrastructure code removed the API services from our production cluster. The issue did not surface in the development environment where the change was tested, because of configuration differences between the two environments.
A fix was developed and tested on our main service. As a result, our /v4 and /v5 endpoints became available again at 12:32 PM UTC. By 12:47 PM UTC, /v3 and the remaining services were operational as well.
To make sure that mistakes like this one do not recur, we have taken the following actions:
1. We immediately blocked the ability to remove services during deployments (see the sketch after this list).
2. We prioritised work to configure our development environments the same way as production, since matching configurations would have exposed this problem before it reached production.
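To illustrate the first action item, here is a minimal sketch of a deployment gate, assuming a Terraform-based setup (our post does not name the actual tooling; the script, the plan.json path, and the CI wiring are hypothetical). It inspects the JSON output of `terraform show -json` and fails the pipeline if the plan would destroy any resource without recreating it:

```python
#!/usr/bin/env python3
"""CI gate: fail the deployment if the Terraform plan would remove resources.

A minimal sketch, assuming the plan JSON was produced with:
    terraform plan -out=plan.tfplan
    terraform show -json plan.tfplan > plan.json
"""
import json
import sys

PLAN_FILE = "plan.json"  # hypothetical path; adjust to your pipeline


def planned_removals(plan: dict) -> list[str]:
    """Return addresses of resources whose planned actions delete them outright."""
    removals = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # ["delete"] is a destroy; ["delete", "create"] is a replace, which we allow.
        if "delete" in actions and "create" not in actions:
            removals.append(rc["address"])
    return removals


def main() -> int:
    with open(PLAN_FILE) as f:
        plan = json.load(f)
    removals = planned_removals(plan)
    if removals:
        print("Deployment blocked: plan would remove resources:", file=sys.stderr)
        for address in removals:
            print(f"  - {address}", file=sys.stderr)
        return 1
    print("No removals detected; deployment may proceed.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

A gate like this turns a silent service removal into an explicit pipeline failure, so an intentional removal requires a deliberate override rather than slipping through with an ordinary deploy.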