On Wednesday, April 1st, we deployed a technical update to our third party integrations system. Not long after the release, we noticed some possible data anomalies. As concern grew about the integrity of third party data being brought into our system, we had to make the difficult decision to take the service offline in order to perform data maintenance. The total downtime was 1 hour and 40 minutes.
This downtime on April 1st resulted in the elevated error rates we observed on April 2nd between 6:30 UTC and 9:45 UTC. During this time, some of the API requests to Toggl failed and needed to be retried.
Timeline of events on April 1st
- 09:42 UTC: Updates were deployed to our integrations system
- 11:08 UTC: We were alerted to the first signs of a problem
- 11:19 UTC: We took the precaution of stopping integrations
- 12:00 UTC: The issue was escalated to a major incident
- 13:08 UTC: The Toggl platform was moved into unplanned maintenance
- 15:00 UTC: Service was restored
The underlying issue & our response
During the critical period between 09:42 UTC and 11:19 UTC, any logged-in user with admin rights who visited the integration section of Toggl was affected. At the time of such a visit, the active integration synchronization caused some workspace data to be misallocated: workspaces were populated with data generated by the third party integrations of other users.
On average, the problem created 42 unique data points belonging to other users’ third party integrations in an affected workspace, with “project name” and, in some cases, “client name” being the most commonly misallocated types of data.
The total number of users who may have been affected represents 0.07% of our active user base. Our estimate is that only a small fraction of those users viewed the actual project names or client names created by mistake, either in our apps or replicated via integrations.
During the unplanned maintenance and in the hours that followed, we removed all misallocated data, performed the necessary integrity checks, and made sure the service itself was secure.
I would like to stress that the core Toggl platform and its API were not directly affected.
We will directly contact the users who were affected, beginning the week of April 6, 2020, to let them know how they were impacted.
Toggl consists of several architectural components: the core platform, reports, exports, and, as mentioned, third party integrations. While we would like to support all components at the same level, it’s the core platform that receives the most attention. Our integrations system has not been receiving the dedicated attention it deserves.
We’ve known for some time that we needed to dedicate additional resources to integrations, and earlier this year we decided to do so, increasing the level of our technical maintenance.
Ironically, the technical update released on April 1st had a specific bug (a race condition) that was not detected during code review or testing. The bug only surfaced under a very high user load.
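For readers unfamiliar with this class of bug, here is a minimal Python sketch of how a race condition can send one workspace's data to another. It is purely illustrative and is not the actual integrations code: every name in it (`unsafe_sync`, `safe_sync`, `run_interleaved`, the workspace IDs) is hypothetical, and the interleaving is simulated deterministically with generators rather than real threads. The unsafe version keeps the target workspace in state shared between requests, so a second request can overwrite it before the first one writes its data.

```python
def unsafe_sync(shared, workspace_id, projects, store):
    """Hypothetical sync step. Each `yield` is a point where another request may run."""
    shared["workspace_id"] = workspace_id      # target kept in state shared by all requests
    yield
    for name in projects:                      # by now the shared target may have changed
        store.setdefault(shared["workspace_id"], []).append(name)
    yield


def safe_sync(workspace_id, projects, store):
    """Same step, but the target workspace is local to this request."""
    target = workspace_id
    yield
    for name in projects:
        store.setdefault(target, []).append(name)
    yield


def run_interleaved(tasks):
    """Round-robin scheduler: advance every pending task by one step per pass."""
    pending = list(tasks)
    while pending:
        for task in list(pending):
            try:
                next(task)
            except StopIteration:
                pending.remove(task)


def demo():
    # Two concurrent requests, one per (hypothetical) workspace.
    shared, unsafe_store = {}, {}
    run_interleaved([
        unsafe_sync(shared, "ws_a", ["Alpha"], unsafe_store),
        unsafe_sync(shared, "ws_b", ["Beta"], unsafe_store),
    ])

    safe_store = {}
    run_interleaved([
        safe_sync("ws_a", ["Alpha"], safe_store),
        safe_sync("ws_b", ["Beta"], safe_store),
    ])
    return unsafe_store, safe_store


if __name__ == "__main__":
    unsafe_store, safe_store = demo()
    print("unsafe:", unsafe_store)
    print("safe:  ", safe_store)
```

In the unsafe run, the project from `ws_a` ends up stored under `ws_b`, because the second request replaced the shared target between the first request's two steps. With only one request in flight the same code behaves correctly, which is why a bug of this kind can pass review and testing and only surface under load.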
We can’t deny that this is a difficult issue to face. A component we considered external to the core platform pushed data via our API and led to the exposure of some user data.
The data maintenance we performed created additional pressure on our main database server, which is powered by PostgreSQL. The next morning, on April 2nd, the internal maintenance process unexpectedly started at 6:00 UTC and led to a slowdown of the service that contributed to elevated error rates. We mitigated the slowdown by adding capacity, but it was only fully resolved once the database maintenance process had finished.
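For API clients, a window of elevated error rates like this one means that failed requests have to be retried. The sketch below shows one common way to do that, retrying with exponential backoff and jitter. It is a generic illustration under stated assumptions, not part of any Toggl client library: `TransientError`, `retry_with_backoff`, and all parameter names and defaults are hypothetical, and `call` stands for whatever function performs the request.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure, such as a server-side error."""


def retry_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep, jitter=random.random):
    """Call `call()` until it succeeds or `max_attempts` is reached.

    The wait doubles after each failure, is capped at `max_delay`, and is
    scaled by a random factor between 0.5 and 1.0 so that many clients do
    not all retry at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise                          # out of attempts: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * (0.5 + jitter() / 2))


def make_flaky_call(failures):
    """Build a demo call that fails `failures` times and then returns "ok"."""
    state = {"remaining": failures}

    def call():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientError("temporary failure")
        return "ok"

    return call


if __name__ == "__main__":
    print(retry_with_backoff(make_flaky_call(2), base_delay=0.1))
```

The `sleep` and `jitter` parameters are injectable only so the behavior can be checked without real waiting. Backing off rather than retrying immediately also avoids adding load to a service that is already slow.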
Aftermath and going forward
I don’t like it when post mortems say “we do take security seriously”. But it’s true. We have exposed some user information that we should not have. There’s no running away from it.
Our processes allowed a significant number of changes to go out in a single release, making the release more difficult to review and test. We need to revisit our process and reduce the likelihood of such an event happening in the future.
We also need to re-evaluate our approach to integrations in light of this incident.
As a business, Toggl has grown significantly in the last two years. We are already working on reorganizing internal structures to better address procedural and communication shortcomings, which should help prevent this kind of oversight in the future.
Additionally, I will be starting as Interim IT Operations Team Lead to provide dedicated attention and resources to these issues.
Please accept our apology for any difficulty this might have caused during an already turbulent period.