Yesterday, November 5th, between 17:39 and 21:03 (GMT+1), we suffered a major outage that affected all services on our platform.
During the integration of a server monitoring service, an unfortunate and unlikely human error caused our entire cloud of Amazon EC2 application instances to be terminated. We want to point out that at no point was any of our users’ data in peril.
Left with only the data and no Amazon instances to run the platform, we faced a task we weren’t quite prepared for: rebuilding the entire stack from the ground up based on previous snapshots. What we thought would be a one-hour job turned out to be more difficult than expected.
After setting up the new machines, rerouting them and restoring the snapshots, we were able to redeploy the platform, though not without some tweaking.
Shortly after getting the servers up and running again, we realised our recently renewed SSL certificate was not working properly, and we had to have a new one issued and installed. During this time, for around 20 minutes, HTTPS access to the restored servers displayed an expired certificate warning.
With the platform and typeforms back online, we proceeded to work on restoring the email notification service, which again proved more difficult than expected: the snapshot was a little older than the rest, and many things had to be reconfigured from scratch. The notification service resumed at 22:44 (GMT+1).
We apologize for any inconvenience caused to our users, and we would also like to thank all those who showed tremendous patience (on social media) during the outage.
UPDATE: Unfortunately, today (Nov 6 2013) at 16:27 GMT+1 we had another partial disruption to the service, and some users were unable to log in. We’ve now fully re-established the service, and we expect plain sailing from here on.