Information About Toggl Downtime on April 1st

On Wednesday, April 1st, we deployed a technical update to our third-party integrations system. Not long after the release, we noticed some possible data anomalies. As concern grew about the integrity of third-party data being brought into our system, we had to make the difficult decision to take service offline in order to perform data maintenance. The total downtime was 1 hour and 40 minutes.

The downtime on April 1st also led to the elevated error rates we observed on April 2nd between 6:30 UTC and 9:45 UTC. During this time, some API requests to Toggl failed and needed to be retried.
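For API clients, retrying a failed request after a short, growing delay is a common way to ride out a period like this. The following is a minimal sketch of that pattern, not an official Toggl client: `TransientError`, `call_with_retries`, and all parameter values are hypothetical names and defaults chosen for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for a failure worth retrying (for example, an HTTP 5xx)."""


def call_with_retries(request, max_attempts=5, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep):
    """Call request() until it succeeds or max_attempts is reached.

    Between attempts, wait base_delay * 2**(attempt - 1) seconds, capped at
    max_delay, plus random jitter of up to half that delay.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

The jitter spreads retries from many clients over time, so that they do not all hit a recovering service at the same moment.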

The timeline

  • 09:42 UTC: Updates were deployed to our integrations system
  • 11:08 UTC: We were alerted to the first signs of a problem
  • 11:19 UTC: We took the precaution of stopping integrations 
  • 12:00 UTC: The issue was escalated to a major incident
  • 13:08 UTC: The Toggl platform was moved into unplanned maintenance
  • 15:00 UTC: Service was restored

The underlying issue & our response

During the critical period between 09:42 UTC and 11:19 UTC, any logged-in user with admin rights who visited the integration section of Toggl was affected. At the time of the visit, active integration synchronization caused some workspace data to be misallocated, such that workspaces were populated with data generated by the third-party integrations of other users.

On average, the problem created 42 unique data points belonging to other users’ third-party integrations in a given workspace, with “project name” and, in some cases, “client name” being the most commonly misallocated types of data.

The total number of users who may have been affected represents 0.07% of our active user base. Our estimate is that only a small fraction of those users viewed the actual project names or client names created by mistake – either in our apps or replicated via integrations.

During the unplanned maintenance and the hours that followed, we removed all misallocated data, performed the necessary integrity checks, and made sure the service itself was secure.

I would like to stress that the core Toggl platform and its API were not directly affected.

We will contact affected users directly, beginning the week of April 6, 2020, to let them know how they were impacted.

Technical background 

Toggl consists of several architectural components: the core platform, reports, exports, and, as mentioned, third-party integrations. While we would like to support all components at the same level, it’s the core platform that receives the most attention. Our integrations system has not been receiving the dedicated attention it deserves.

We’ve known for some time that we needed to dedicate additional resources to integrations, and earlier this year we decided to do so, increasing the level of our technical maintenance. 

Ironically, the technical update released on April 1st had a specific bug (a race condition) that was not detected during code review or testing. The bug only surfaced under a very high user load. 
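To illustrate the general class of bug, here is a minimal sketch of one way a race condition can misallocate data between workspaces. It is not the actual integrations code: `SharedContext`, `unsafe_sync`, `safe_sync`, and the workspace and project names are all hypothetical, and the `threading.Barrier` exists only to force the unlucky interleaving that would otherwise show up only under load.

```python
import threading


class SharedContext:
    """Hypothetical request context that is wrongly shared by all requests."""

    def __init__(self):
        self.workspace_id = None


def unsafe_sync(ctx, workspace_id, projects, store, barrier):
    ctx.workspace_id = workspace_id  # write to shared state
    barrier.wait()                   # both writes now happen before any read
    for name in projects:
        # Reads shared state: one of the two threads sees the other's workspace.
        store.setdefault(ctx.workspace_id, []).append(name)


def safe_sync(workspace_id, projects, store, barrier):
    barrier.wait()
    for name in projects:
        # Uses only its own argument, so interleaving cannot mix workspaces.
        store.setdefault(workspace_id, []).append(name)


JOBS = {"ws_a": ["Alpha"], "ws_b": ["Beta"]}


def run_unsafe():
    store, ctx, barrier = {}, SharedContext(), threading.Barrier(2)
    threads = [threading.Thread(target=unsafe_sync, args=(ctx, ws, p, store, barrier))
               for ws, p in JOBS.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store


def run_safe():
    store, barrier = {}, threading.Barrier(2)
    threads = [threading.Thread(target=safe_sync, args=(ws, p, store, barrier))
               for ws, p in JOBS.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store
```

In the unsafe version, both projects end up in a single workspace, whichever one wrote to the shared context last. In the safe version, each workspace keeps only its own project. Without the barrier, the unsafe code would usually appear to work, which is why this kind of bug can slip past code review and testing.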

We can’t deny that this is a difficult issue to face. A component we considered external to the core platform pushed data via our API and led to the exposure of some user data. 

The data maintenance we performed created additional pressure on our main database server, which is powered by PostgreSQL. The next morning, on April 2nd, the internal maintenance process unexpectedly started at 6:00 UTC and led to a slowdown of the service that contributed to elevated error rates. We mitigated the slowdown by adding capacity, but it was fully resolved only once the database maintenance process had finished.

Aftermath and going forward

I don’t like it when post mortems say “we do take security seriously”. But it’s true. We have exposed some user information that we should not have. There’s no running away from it. 

Our processes allowed a significant number of changes to go out in a single release, making that release more difficult to review and test. We need to revisit our process and reduce the likelihood of such an event happening in the future.

We also need to re-evaluate our approach to integrations in light of this incident. 

As a business, Toggl has grown significantly in the last two years. We are already working on reorganizing internal structures to better address procedural and communication shortcomings, which should help prevent this kind of oversight in the future.

Additionally, I will be starting as Interim IT Operations Team Lead to provide dedicated attention and resources to these issues. 

Please accept our apology for any difficulty this might have caused during an already turbulent period.



April 6, 2020
