Yesterday we had our second-longest unannounced downtime ever, and unlike last time (when our provider’s power network experienced a meltdown), this outage was entirely our own fault.
We dropped the ball here, and we’re really sorry.
Below you’ll find notes on what caused the issues, and what lessons we learned from the event.
Timeline of events
Toggl was down for a little over an hour, with smaller performance issues occurring over the course of several hours.
- At 14:45 UTC yesterday we spotted the first problems: queries were taking too long, and alerts started firing.
- By 15:00 UTC we were down and stayed down for over an hour. We then experienced intermittent connectivity for around 3 hours.
- By 19:30 UTC, the service was back in limited mode.
- The last issues were cleared by 20:45 UTC.
Cause of server issues
The cause of the initial crash was really simple: a recently deployed piece of code, part of ongoing improvements, was not optimized well enough and took too long to complete.
Because it was one of the most heavily used code paths, the situation gradually got worse and worse, with more and more requests waiting on data, until we ran out of resources at the database level. That’s when the initial crash happened.
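To illustrate the dynamic described above, here is a minimal back-of-the-envelope sketch of how one slow, hot-path query can exhaust a fixed-size database connection pool. All numbers are hypothetical, not our actual figures:

```python
# Hypothetical numbers: a slow query on a hot path caps throughput at
# POOL_SIZE / QUERY_SECONDS requests per second; anything arriving faster
# than that piles up as an ever-growing backlog until resources run out.

POOL_SIZE = 20          # database connections available
ARRIVALS_PER_SEC = 50   # requests hitting the hot code path
QUERY_SECONDS = 0.5     # duration of the unoptimized query

def backlog_after(seconds: int) -> int:
    """Requests still waiting once the pool is saturated."""
    completed_per_sec = POOL_SIZE / QUERY_SECONDS          # 40 req/s ceiling
    backlog_growth = ARRIVALS_PER_SEC - completed_per_sec  # +10 req/s
    return max(0, int(backlog_growth * seconds))

print(backlog_after(60))  # one minute in, 600 requests are already queued
```

The point of the sketch is that the failure is not sudden: the backlog grows linearly and quietly until a hard limit is hit, which matches the "gradually got worse and worse" pattern we saw.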
Technical details & our response
We spotted the issue and started mitigating, first by fixing the inefficiencies and then by adding resources. In the meantime, as the database end of the system was malfunctioning, the API machines were experiencing their own problems due to excessive traffic and hundreds of thousands of reconnects happening simultaneously. When it became too much to handle, the API servers dropped out of the pool. We countered that by adding more API resources.
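A common client-side mitigation for the kind of reconnect storm described above is exponential backoff with jitter, so that hundreds of thousands of clients do not retry in lockstep. This is a general-purpose sketch with illustrative parameters, not our actual client code:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)].

    Doubling the window per attempt spreads retries out over time; the
    random jitter prevents clients from synchronizing on the same instant.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Successive retry windows for one client: up to 0.5s, 1s, 2s, 4s, 8s, ...
delays = [backoff_delay(attempt) for attempt in range(5)]
```

Without something like this, every brief recovery invites the entire client population back at once, which is exactly how recovering API servers get knocked out of the pool again.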
The fixes and added resources had no lasting effect, so we decided to revert the code to its last known good state from last week.
This did not improve matters. On the contrary – we suddenly couldn’t get the site running at all.
We struggled for quite a while to find the cause, until we spotted a suspicious detail: all the new database resources we had added were sitting idle. After re-checking the entire setup, it turned out that most of the API servers had dropped out of the pool of available resources, and all the traffic was being served by the couple of API servers that remained.
Yes, you read that right – at one point we only had two operational API servers running and we failed to notice that for the better part of an hour.
That’s what I meant when I said we really dropped the ball here.
It turns out the actual code issue was fixed within 30 minutes; most of the time was spent chasing ghosts, applying fixes that had no effect and running circles around the real problem: API servers that were simply nonfunctional.
As soon as we figured that out, the service was back up – but we had lost several hours.
Aftermath & lessons learned
This has been a very painful, but also a very valuable lesson for us.
It was a case of not thinking clearly, trying to fix things that weren’t really broken and ignoring things that did deserve attention.
That is down to the processes of how we operate and the critical data we are monitoring. We plan to improve on both.
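As a concrete example of the kind of check we were missing, here is a sketch of a pool-health alert that would have caught both symptoms of this incident: too few healthy API servers, and newly added database capacity sitting idle. Function names and thresholds are hypothetical:

```python
# Hypothetical monitoring check: flag a shrinking API server pool and
# database nodes that receive no traffic, instead of relying on someone
# noticing either condition by eye during an incident.

def pool_alerts(healthy: int, total: int, min_healthy: int = 5,
                idle_db_nodes: int = 0) -> list[str]:
    alerts = []
    if healthy < min_healthy:
        alerts.append(f"only {healthy}/{total} API servers in pool")
    if idle_db_nodes > 0:
        alerts.append(f"{idle_db_nodes} database node(s) receiving no traffic")
    return alerts

# The state we were actually in: two servers serving everything,
# while freshly added database capacity sat unused.
incident = pool_alerts(healthy=2, total=12, idle_db_nodes=3)
```

Either alert on its own would have pointed us at the real problem in minutes rather than the better part of an hour.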
I cannot promise that downtime will never happen again – in a world of ever more complex IT systems and ever-growing data, some errors are bound to occur at some point. But we will learn from this particular case and do our best to anticipate similar scenarios.
I’d like to apologize once again for the inconvenience the downtime may have caused you.
Thank you for your patience and trust in us.