Many of you experienced a slow and unresponsive Toggl Track on Monday, April 12, especially on our web app. This was due to an unprecedented load on our API and database servers. We were unable to address the situation with regular scaling and had to take Toggl Track offline for a short period of maintenance to fully restore the service. The incident lasted for roughly 3 hours, of which we were totally offline for 20 minutes.
We want to apologize for any inconvenience caused by this downtime. Over the last two weeks, we have fully analyzed what happened, found the root causes of the issue, and have changed our priorities to address them.
Timeline of events
- 14:02 UTC: Some of you started experiencing slower response times in our web app
- 14:15 UTC: We received the first reports from you about the unresponsive web app
- 14:20 UTC: We started investigating
- 14:35 UTC – 16:20 UTC: We attempted to scale our server capacity in multiple ways, while at the same time trying to reduce incoming traffic, but the problem persisted
- 16:40 UTC: We moved the Toggl Track platform into unplanned maintenance mode
- 17:20 UTC: We restored the service
We continued to monitor all our services over the following days and implemented some quick changes to reduce traffic from our web app, with minimal impact on users. This prevented a similar problem from recurring while we investigated the root causes.
Symptoms and root causes
The first and most direct symptom we noticed during the incident was a significant increase in database load. We initially believed this was in line with the sharp increase in API traffic we observed at the same time, which is why we began by limiting traffic and scaling our infrastructure, adding more capacity to handle the extra load.
We now understand why this didn’t work as expected. While we were seeing higher traffic levels than we ever had before, the database load had increased more than linearly.
The reason for this was a multiplication of requests—on multiple levels—which got worse the more overloaded our systems were. There were several reasons for this multiplication under extreme load. The biggest factors were (1) internal retries of failing requests and (2) mismatching timeouts.
These retries can cause a single request (one that has failed or timed out) to query the underlying system multiple times. And because timeouts were configured inconsistently, the system might still have been trying to fulfill earlier, already-abandoned requests at the same time as newer ones.
This can cause a downward spiral of more and more requests queued within the system, leading to reduced performance, more timeouts and retries, longer queues, and so on.
Imagine a busy call center where everyone starts calling in with more and more phones at the same time because their original call didn’t get an answer fast enough.
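The multiplication effect can be sketched with a toy model. The numbers below are hypothetical and not our actual configuration; the point is only that retries compound across layers, so total load grows multiplicatively rather than linearly:

```python
# Toy model of retry amplification under load (hypothetical numbers,
# not our actual retry configuration).

def amplification(retries_per_layer):
    """Each layer that retries a failed call multiplies the number of
    requests reaching the layer below it: 1 original attempt plus the
    retries, compounded through every layer."""
    total = 1
    for retries in retries_per_layer:
        total *= 1 + retries
    return total

# Example: a web client retrying twice, an API gateway retrying once,
# and an internal service retrying once:
print(amplification([2, 1, 1]))  # 3 * 2 * 2 = 12 database queries per user action
```

With just a handful of retries at three layers, one user action turns into a dozen database queries, which is why load can outpace any amount of added capacity.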
This downward spiral is the reason that adding more capacity did not solve the problem. We added more servers, only to be quickly overloaded with requests. Our only option was to reset the platform by taking it down completely for a short time until all pending requests were cleared out. We later saw that it took between 5 and 10 minutes to clear some requests from the system entirely.
Background and priorities
We’re taking this incident seriously, and changing our priorities accordingly.
Over the past year, we focused heavily on improving our feature set. We invested a lot into talking to our users and trying to solve their problems by helping you see where your time was going. We also grew a lot as a company and as a product, welcoming a lot more of you into the Toggl Track crew, some as co-workers, and many more as users. It has been and continues to be an exciting journey.
But this incident shows that we did not invest enough into our core platform, at least not at the level required to support growth while maintaining the quality of service you’ve come to expect from us. We’d spotted a couple of smaller issues in recent months, and we now understand that they all shared some of the same root causes.
This is why we’ve changed our priorities for this quarter and will focus more on improving our platform, both short term and long term. We’ve even retasked one of our product development teams to help us with these efforts in the coming months.
I wish there were a simple solution to the problems described above. However, as we’ve grown, so has the complexity of the issues we encounter. That’s why it will take some time and a lot of effort to address the root causes of these problems. And at the same time we must also try to anticipate and get ahead of the next complex problem.
For now, we’ve taken several concrete steps to avoid similar problems in the near future, including reducing traffic from our own apps. In addition, we’ve performed a full analysis of the current situation to ensure we’re not missing anything important.
We also made several immediate improvements to our infrastructure to ensure that even if we experience similar difficulties again, they won’t affect the whole platform, but only isolated parts of it.
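One common isolation technique for this kind of containment (offered here as a generic sketch, not a description of our actual infrastructure changes) is a circuit breaker: when one dependency keeps failing, traffic to it is cut off temporarily so the failure cannot drag down the rest of the platform:

```python
import time

class CircuitBreaker:
    """Stops sending traffic to a struggling dependency after repeated
    failures, so one overloaded component can't drag down the rest.
    Thresholds below are illustrative defaults."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed (traffic flows)

    def allow(self):
        """Return True if a request may be sent to the dependency."""
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: close the breaker and let traffic retry.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # open: shed load for a while
```

A caller would check `allow()` before each request and record the outcome, so a flooded component sheds load instead of queueing ever more work.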
Next, we’ll address the root causes of the incident and look for ways to avoid the request multiplication we observed. As part of this effort, we’ll review internal timeout and retry mechanisms and abort failing requests earlier, at all levels of the system.
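One widely used pattern for keeping timeouts and retries consistent (a sketch of the general technique, not necessarily what we will ship) is to give each request a single deadline that every layer shares: retries stop, and inner calls abort, as soon as the original caller would have given up:

```python
import time

class Deadline:
    """One absolute deadline propagated through every layer, so inner
    calls never outlive the outer request that spawned them."""

    def __init__(self, budget_seconds):
        self.expires_at = time.monotonic() + budget_seconds

    def remaining(self):
        return self.expires_at - time.monotonic()

    def expired(self):
        return self.remaining() <= 0

def call_with_retries(operation, deadline, max_attempts=3):
    """Retry a failing operation, but never past the shared deadline.
    `operation` is a hypothetical callable taking a `timeout` argument."""
    last_error = None
    for _ in range(max_attempts):
        if deadline.expired():
            break  # the caller has given up; don't add more load
        try:
            return operation(timeout=deadline.remaining())
        except TimeoutError as err:
            last_error = err
    raise last_error or TimeoutError("deadline exceeded before first attempt")

# Hypothetical usage: the whole request gets one 2-second budget that
# both the retry loop and the underlying query respect.
# result = call_with_retries(query_database, Deadline(2.0))
```

Because every attempt draws from the same shrinking budget, a timed-out request cannot keep generating work deep in the stack, which directly counters the multiplication effect described earlier.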
Finally, we are reviewing traffic from both our own apps and from third-party integrations, which have become a large part of our overall load. We want to better anticipate your needs from our apps and APIs so we can improve them to serve you better, while at the same time making them more resilient.
In the meantime, we’re actively hiring technical specialists to further strengthen our team. We are also reviewing several internal processes to ensure they can handle future incidents faster and more smoothly.
We have our work cut out for us as we continue on this journey. For now, please accept our apologies for any difficulties caused by this downtime. Thank you.