Issues with Beepmanager alerting
Incident Report for Pingdom

Alerting Issues Post Mortem

Note: All times shown below are in Coordinated Universal Time (UTC), and all events took place on Wednesday, May 24.

The Outage and how it affected Pingdom

At 13:20 we upgraded the software on one of our message-queue servers, which primarily handles uptime monitoring measurements and alerts.

Around 13:25 we noticed that the alerting queues had started to build up, meaning that all types of alerts from Pingdom were slowing down. Email, SMS and app alerts started to get delayed. Naturally this was a problem, and the upgrade was reverted.

At 13:30 the downgrade was complete, but it took until 14:03 to clear all queues. In practice, all alerts were delayed, some by 10 to 20 minutes, and a small number of customers missed alerts completely. Only customers who had outages between 13:20 and 14:03 were affected.
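
As an illustration of how a growing queue turns into delayed alerts, here is a minimal Python sketch. It is not our actual tooling: the function names, the sample depths and the drain rate are all hypothetical and chosen only to show the idea. The first function flags a queue whose depth keeps rising across consecutive readings, and the second gives a rough estimate of how long a newly queued alert would wait.

```python
from typing import Sequence


def backlog_is_growing(depths: Sequence[int], min_samples: int = 3) -> bool:
    """Return True if the last `min_samples` queue-depth readings rose strictly.

    `depths` is a hypothetical series of queue sizes sampled at a fixed interval.
    """
    if len(depths) < min_samples:
        return False  # not enough data to call it a trend
    recent = depths[-min_samples:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def estimated_delay_minutes(depth: int, drain_rate_per_minute: float) -> float:
    """Rough wait for a newly queued alert: backlog divided by drain rate."""
    if drain_rate_per_minute <= 0:
        return float("inf")  # nothing is being delivered, so the wait is unbounded
    return depth / drain_rate_per_minute


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the incident.
    samples = [10, 12, 40, 90]
    print(backlog_is_growing(samples))
    print(estimated_delay_minutes(3000, 200.0))
```

With these made-up figures, a backlog of 3000 alerts draining at 200 per minute would mean a wait of about 15 minutes, which is the kind of delay described above.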

What we learned and what we’ll do in the future

To prevent this from happening again, we are separating the services running on this particular server and adding more tests (the upgrade had passed all QA beforehand, but we obviously missed something). If you have any questions, please reach out to us here.

Posted 3 months ago. May 26, 2017 - 13:41 UTC

Resolved
Issue resolved, queues cleared and alerting is fully restored. Post mortem to follow on Friday.
Posted 3 months ago. May 24, 2017 - 14:34 UTC
Monitoring
Issues resolved. We continue to monitor the situation while some alerting queues catch up. There are no more missed alerts, but a few minutes of delay can be expected in some cases.
Posted 3 months ago. May 24, 2017 - 13:56 UTC
Identified
Beepmanager alerting issues. Our ops team has identified an issue with our Beepmanager service.
Some alerts are not being sent out; this affects all types of alerts. Uptime monitoring is working, so there are no 'false' alerts, but some alerts can be missed.

This affects roughly 18% of Pingdom users. If you are unsure whether you have Beepmanager enabled, log in to My Pingdom and check whether the "Alerting" option appears in the left-hand menu. If it doesn't, you're not affected. The issue also only affects accounts created before June 2016, so if you are sure your account was created after that, you are not affected.

We continue to investigate this.
Posted 3 months ago. May 24, 2017 - 13:48 UTC