Note: All times shown below are in Coordinated Universal Time (UTC), and the incident took place on Wednesday, May 24.
At 13:20 we upgraded the software on one of our message-queue servers, which primarily handles uptime monitoring measurements and alerts.
Around 13:25 we noticed that the alerting queues had started to build up, meaning that all types of alerts from Pingdom were slowing down. Email, SMS and app alerts started to get delayed. Naturally this was a problem, so we reverted the upgrade.
At 13:30 the downgrade was complete, but it took until 14:03 to clear all queues. Essentially this means that all alerts were delayed, some by as much as 10 to 20 minutes, and a small number of unfortunate customers missed alerts completely. This incident only affected customers who had outages between 13:20 and 14:03.
To prevent this from happening again, we are separating the services running on this particular server and adding more tests (the upgrade had of course passed all QA beforehand, but we obviously missed something). If you have any questions, please reach out to us here.