Note: All times shown below are in Coordinated Universal Time (UTC).
TL;DR: AWS US-East went down, and Pingdom had temporary issues with Page Speed Checks and report history. Uptime monitoring and alerting were unaffected.
At 18:35, Pingdom operations noticed errors with our US-East region Page Speed Check service. Around 19:00, AWS acknowledged that a high rate of errors in S3 (Simple Storage Service) was causing issues for their customers. The problem was identified as having originated from a data center region in Virginia. With over 1 million users of the service, many services were predictably affected by the outage, including Pingdom.
By 23:00, AWS had reported a full recovery of service levels to S3.
At 18:35 on February 28th 2017, we began experiencing issues with Amazon S3, resulting in problems fetching and displaying data for our Page Speed and historical Uptime reports.
The issues with AWS meant that at 18:35, we noticed degraded performance for Page Speed monitoring for our US-East and Europe probe clusters. Service for Page Speed monitoring via these servers was restored at 23:30 when AWS reported that their service had resumed normal operation.
Some Transaction Check alerts were also delayed by as much as 10-15 minutes during the incident, when the S3 outage impacted EC2 services.
We continued to monitor the situation throughout the incident and were able to confirm that full service levels had been restored across the Pingdom monitoring service by 23:30.
Whilst we endeavour to maintain 100% uptime across the Pingdom service, and spread our services that rely on AWS across different availability zones, the plain truth is that we will, like the majority of other Internet companies, always be reliant on external hosting services such as AWS.