How can I fix this now?
The faster we remove the faulty versions from circulation, the sooner we can lift the limit on sync requests and everyone can continue working as normal.
On Linux – There are no current issues with this app; however, users on Linux may see some sync limiting, as we have had to limit syncing abilities across all desktop app platforms.
The whole story
On the 1st of October, our backend monitoring system started sending alerts that the load on our servers was increasing at a fast pace. The backend team started to investigate right away and raised an alarm with all the development teams. After narrowing down the possibilities, we found out that desktop apps had started to make a lot more connections to our servers with the latest stable releases on Mac and Windows. The desktop team called all hands on deck to find the exact point of error, fix it, and release a new version.
Timeline of events
- On Sep 24th and 26th, we rolled out updates for the Windows (version 7.4.1012) and Mac (version 7.4.1015) desktop apps.
- Oct 1st, 8:50 am UTC, the first issues appeared. It took some time for our users to update their apps, which is why we only detected that something was wrong several days after the releases.
- Oct 1st, 12 pm UTC, we had to make the choice to kill the desktop API, making the desktop apps usable only in offline mode. The apps were making too many requests to our servers at the time.
- Oct 1st, 2:45 pm UTC, we pinpointed the issue in our desktop apps and rolled out updates to our users (version 7.4.1022). We also re-enabled the desktop API, but with a limit on how many requests an app can make per minute.
- Oct 2nd, 1:30 pm UTC, we monitored the situation and observed that the newly released fixed versions were still making too many connections to our servers. We found that although the original issue was fixed, a new issue had appeared: the apps were creating duplicate entries, which made them sync far more often than they should.
- Oct 2nd, 5:30 pm UTC, the desktop team yet again reverted all changes and rolled out new fixed releases for Mac and Windows (version 7.4.1023).
- Oct 3rd, we’re monitoring the situation and verifying that the latest release works properly before removing the rate-limiting for desktop apps.
- Oct 4th, we’re working on fixing issues with duplicated time entries. The apps no longer create duplicate items, but there are still duplicate time entries in our database that were created by previous versions of our apps. We are building a solution to detect and delete all faulty time entries from our servers. Affected users will receive an e-mail letting them know that their duplicate entries were deleted. If they notice that something else was deleted by mistake, they can write back to us and get their entry restored.
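To illustrate the kind of duplicate detection described in the last step, here is a minimal sketch. It is not our actual cleanup job: the `TimeEntry` fields and the rule "entries are duplicates when everything except the id matches" are assumptions made for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class TimeEntry(NamedTuple):
    # Hypothetical fields; the real time entry schema is not described in this post.
    id: int
    user_id: int
    description: str
    start: str  # ISO 8601 timestamp
    stop: str   # ISO 8601 timestamp


def find_duplicates(entries):
    """Return (keep, remove): the first entry of each identical group, and the rest."""
    groups = defaultdict(list)
    for entry in entries:
        # Assumption: two entries are duplicates when everything but the id matches.
        key = (entry.user_id, entry.description, entry.start, entry.stop)
        groups[key].append(entry)

    keep, remove = [], []
    for group in groups.values():
        group.sort(key=lambda e: e.id)  # keep the lowest id, assumed to be the original
        keep.append(group[0])
        remove.extend(group[1:])
    return keep, remove
```

Returning the `remove` list instead of deleting in place means the removed entries can be archived first, so an entry can be restored if a user reports that it was not a duplicate.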
What we learned from this and what we are going to do to prevent it from happening in the future
Audit all network communications
We will audit all syncing connections in the desktop apps and create an additional layer of automated tests to catch any big changes in network connections.
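As a rough idea of what such an automated test could look like, the sketch below counts the requests made during one sync cycle and fails if the count exceeds a fixed budget. Everything here is hypothetical: `CountingTransport`, `sync_once`, the endpoint paths, and the budget value are placeholders, not our real client code or API.

```python
class CountingTransport:
    """Test double that records requests instead of sending them."""

    def __init__(self):
        self.requests = []

    def request(self, method, path):
        self.requests.append((method, path))
        return {"status": 200}


def sync_once(transport, pending_entries):
    # Hypothetical stand-in for one sync cycle of a desktop app:
    # one pull, then one batched push if there is anything to upload.
    transport.request("GET", "/sync")
    if pending_entries:
        transport.request("POST", "/time_entries/batch")


def requests_per_sync(pending_entries):
    """Run one sync cycle against the test double and count its requests."""
    transport = CountingTransport()
    sync_once(transport, pending_entries)
    return len(transport.requests)


MAX_REQUESTS_PER_SYNC = 2  # assumed budget; a real value would come from measurements


def test_sync_stays_within_budget():
    # The request count must not grow with the number of pending entries.
    assert requests_per_sync(["entry"] * 50) <= MAX_REQUESTS_PER_SYNC
```

A test like this would fail as soon as a change makes the app send one request per entry, or sync repeatedly, rather than only after the extra load reaches the servers.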
Add deploy tracking to our desktop apps
We will have a better overview of how each release affects the backend and we will be able to catch and fix potential issues faster.
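One possible shape for this, sketched under assumptions: if every request carries the app version and some client identifier, the backend can compare the average number of requests per client across versions and flag a release that is far noisier than the previous one. The log format, the function names, and the factor of 3 are illustrative choices, not a description of our monitoring.

```python
from collections import defaultdict


def requests_per_client(log):
    """log: iterable of (app_version, client_id) pairs, one pair per request."""
    counts = defaultdict(int)
    clients = defaultdict(set)
    for version, client_id in log:
        counts[version] += 1
        clients[version].add(client_id)
    # Average number of requests made by each distinct client of a version.
    return {v: counts[v] / len(clients[v]) for v in counts}


def noisy_versions(log, baseline_version, factor=3.0):
    """Versions whose per-client request rate exceeds factor times the baseline's."""
    rates = requests_per_client(log)
    baseline = rates[baseline_version]
    return sorted(v for v, rate in rates.items() if rate > factor * baseline)
```

Normalizing by the number of clients matters because a new release starts with few users: its total request count can look harmless while each individual app is already making far too many requests.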
Improve in-app messaging
Aside from the actual technical issues, we also want to address the lack of communication through our desktop apps to avoid confusion in the future. Our apps were suggesting that there was something wrong with the user’s connection instead of telling them the actual reason why the app was not able to connect to our servers. We will build an in-app information system that will enable us to send important updates to users quickly when needed.
These are the main points we are going to work on. If you want to get the whole picture, head over to our GitHub and check out the full list of issues under our October fix-up project.
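A small part of this is simply telling a connection problem apart from a server-side one. The sketch below assumes that the server answers with standard HTTP status codes such as 429 (Too Many Requests) or 503 (Service Unavailable) when it limits or refuses syncing. The function name and the message texts are made up for illustration.

```python
from typing import Optional


def describe_sync_failure(status: Optional[int]) -> str:
    """Pick a user-facing message; status is None when no response arrived at all."""
    if status is None:
        return "Could not reach the server. Please check your internet connection."
    if status == 429:  # Too Many Requests
        return "Syncing is temporarily limited on our side. We will retry shortly."
    if status == 503:  # Service Unavailable
        return "Our servers are temporarily unavailable. The app keeps working offline."
    if 500 <= status < 600:
        return "Something went wrong on our servers. We are looking into it."
    return f"Sync failed (HTTP {status})."
```

With a mapping like this, only a request that never received a response is reported as a connection problem; every answer that actually came from the server is described as a server-side condition.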
We’d like to apologize again for any inconvenience caused. Thank you for your patience.
The Toggl Desktop team