My Pingdom and monitoring/alerting outage
Incident Report for Pingdom

We experienced a period of downtime across our services this morning and would like to apologize for any inconvenience caused by this. Thankfully, we were able to quickly identify the issue and resolve it, allowing for a full resumption of our services. We’ve compiled a post-mortem below to give you full transparency into what happened.

The Outage

Note: All times referenced below are Coordinated Universal Time (UTC).

  • Our user service was unavailable for 33 minutes on Wednesday September 21st 2016 from 06:50 until 07:23.
  • Most services across Pingdom were affected during the outage, including monitoring and alerting, which could lead to gaps in dashboard data.
  • Pingdom DevOps identified and resolved the error, which led to complete restoration of the service.

As part of our effort to constantly increase the security of our service, especially in areas that store customer data, we were introducing individual authorization levels for each product. This change would limit each product's access rights to only the customer information it requires from our databases and, in turn, limit the damage caused if a service were to be compromised.
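As an illustration only, the idea of per-product authorization levels can be sketched as an explicit grant table. This is not Pingdom's actual implementation; the product names, table names, and function names below are all hypothetical.

```python
# Hypothetical sketch of least-privilege grants per product.
# Product and table names are illustrative assumptions, not real Pingdom names.
GRANTS = {
    "alerting": {"contacts"},
    "dashboard": {"users", "checks"},
}


def can_read(product: str, table: str) -> bool:
    """Return True only if the product was explicitly granted the table."""
    return table in GRANTS.get(product, set())


def exposed_tables(product: str) -> set:
    """Tables an attacker could reach if this one product were compromised."""
    return set(GRANTS.get(product, set()))
```

With grants like these, a compromise of the hypothetical "alerting" product would expose only "contacts" rather than every table in the database.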

Due to human error, incorrect credentials were used when deploying these changes, rendering most services unable to access the necessary databases. This in turn caused an outage across our whole platform.

The issue has now been resolved: our service is fully restored and all user data is even more secure. However, we take these issues extremely seriously and want to apologize for any negative effect this might have had on your reliance on our services. Moving forward, we will endeavour, as always, to avoid errors of this nature by ensuring all work and credentials are double-checked before deployment.
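One way to automate part of a pre-deployment credential check is to attempt a test connection for every service and block the rollout if any attempt fails. The sketch below is a minimal illustration under stated assumptions, not a description of Pingdom's tooling: `failing_services`, `safe_to_deploy`, and the `try_connect` callback are hypothetical names, and the caller is assumed to supply a function that performs the real connection attempt.

```python
from typing import Callable, Dict, List


def failing_services(
    credentials: Dict[str, str],
    try_connect: Callable[[str, str], bool],
) -> List[str]:
    """Return, in sorted order, the services whose credentials fail a test connection."""
    return sorted(
        service
        for service, secret in credentials.items()
        if not try_connect(service, secret)
    )


def safe_to_deploy(
    credentials: Dict[str, str],
    try_connect: Callable[[str, str], bool],
) -> bool:
    """Allow the deployment only if every service can connect with its new credentials."""
    return not failing_services(credentials, try_connect)
```

Passing the connection attempt in as a callback keeps the check independent of any particular database driver and makes it easy to exercise with a stub before it is pointed at real systems.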

Posted Sep 21, 2016 - 09:08 UTC

Resolved
All systems functional, post-mortem to follow.
Posted Sep 21, 2016 - 07:47 UTC
Monitoring
Issue resolved and my Pingdom access restored.

Some monitoring is still catching up but should be fixed shortly. We're continuing to monitor the situation.
Posted Sep 21, 2016 - 07:31 UTC
Identified
We've identified a major issue where a change to my Pingdom has caused both access problems with the dashboard and issues with monitoring and alerting.

Now that the issue has been identified, we're hard at work fixing it. Updates to follow soon.
Posted Sep 21, 2016 - 07:19 UTC