Issues with Page Speed reports
Incident Report for Pingdom

AWS S3 Postmortem

Note: All times shown below are in Coordinated Universal Time (UTC).
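
The AWS status page reported its updates in PST, and one of the updates below quotes a 12:52 PM PST timestamp. For readers cross-referencing the two timelines, here is a minimal Python sketch of the conversion. The helper name `pst_to_utc` is hypothetical and used only for illustration. It assumes a fixed UTC-8 offset, which holds for February 28th because daylight saving time was not in effect.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PST is treated as a fixed UTC-8 offset (no daylight saving on Feb 28, 2017).
PST = timezone(timedelta(hours=-8), name="PST")


def pst_to_utc(year, month, day, hour, minute):
    """Convert a PST wall-clock time to an aware UTC datetime."""
    local = datetime(year, month, day, hour, minute, tzinfo=PST)
    return local.astimezone(timezone.utc)


# The AWS update quoted in the log below was stamped 12:52 PM PST.
aws_update = pst_to_utc(2017, 2, 28, 12, 52)
print(aws_update.strftime("%H:%M UTC"))  # 20:52 UTC
```

A fixed offset keeps the sketch free of any time zone database dependency. For dates that may fall under daylight saving time, a full time zone definition would be needed instead.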

TL;DR: AWS US-East went down, and Pingdom had temporary issues with Page Speed Checks and report history. Uptime monitoring and alerting were unaffected.

The Outage

At 18:35, Pingdom operations noticed errors with our US-East region Page Speed Check service. Around 19:00, AWS acknowledged that a high rate of errors in S3 (Simple Storage Service) was causing issues for their customers. The problem was identified as having originated in a data center region in Virginia. With over 1 million users of the service, many services were predictably affected by the outage, including Pingdom.

By 23:00, AWS had reported a full recovery of service levels to S3.

Effect on Pingdom services

At 18:35 on February 28th, 2017, we began experiencing issues with Amazon S3, resulting in problems fetching and displaying data for our Page Speed and historical Uptime reports.

As a result of the issues with AWS, we noticed degraded performance for Page Speed monitoring in our US-East and Europe probe clusters at 18:35. Service for Page Speed monitoring via these servers was restored at 23:30, when AWS reported that their service had resumed normal operation.

Some Transaction Check alerts were also delayed by up to 10-15 minutes during this outage, when the S3 problems impacted EC2 services.

We continued to monitor the situation throughout the incident and were able to confirm that full service levels had been restored across the Pingdom monitoring service by 23:30.

Avoiding disruption in the future

Whilst we endeavour to maintain 100% uptime across the Pingdom service, and spread our services that rely on AWS across different availability zones, the plain truth is that we will, like the majority of other Internet companies, always be reliant on external hosting services such as AWS.

Posted 8 months ago. Mar 01, 2017 - 14:54 UTC

Resolved
All services are now fully operational and have returned to normal. Post mortem to follow.
Posted 8 months ago. Feb 28, 2017 - 22:28 UTC
Update
Page Speed monitoring back up and running for US-East and Europe. AWS has updated their status page saying service is operating normally: https://status.aws.amazon.com/ We continue to monitor the situation and will be back with an update in 30 minutes or as soon as the situation changes.
Posted 8 months ago. Feb 28, 2017 - 22:18 UTC
Update
No change in the situation for the time being. We will return with more updates in 30 minutes or if the situation changes.
Posted 8 months ago. Feb 28, 2017 - 21:53 UTC
Update
Due to the issues AWS is experiencing, Page Speed monitoring is now also affected, currently for US-East and Europe. We're investigating the issue and monitoring the situation. We will return with an update in 30 minutes or when the situation changes.
Posted 8 months ago. Feb 28, 2017 - 21:23 UTC
Update
We have yet to see any improvement after AWS's latest update on the situation (12:52 PM PST): https://status.aws.amazon.com/ We are, however, still monitoring the issue and will return with another update in 30 minutes or as soon as the situation changes.
Posted 8 months ago. Feb 28, 2017 - 21:15 UTC
Update
Still no change in the situation at the moment. We will return in 30 minutes with another update or when the situation changes.
Posted 8 months ago. Feb 28, 2017 - 20:44 UTC
Update
No update on the situation for the time being. We will return with more updates in 30 minutes or if the situation changes.
Posted 8 months ago. Feb 28, 2017 - 20:13 UTC
Monitoring
We're still monitoring the issue, but have seen no change in the situation on our side as yet. We will return with more updates in 30 minutes or if the situation changes.
Posted 8 months ago. Feb 28, 2017 - 19:41 UTC
Update
No change in the situation for the time being. We will return with more updates in 30 minutes or if the situation changes.
Posted 8 months ago. Feb 28, 2017 - 19:12 UTC
Investigating
Due to issues with Amazon S3 (https://status.aws.amazon.com/), there are currently problems with the Page Speed reports, and results may be delayed in your account. This also affects downloading PDF reports for the Page Speed checks. Older data in the Uptime reports may also be affected.

Monitoring and alerting are not affected.

We continue to monitor the situation and will update this incident at least once every 30 minutes or when we have an update on the situation.
Posted 8 months ago. Feb 28, 2017 - 18:36 UTC