Since Oct 6 19:50 CET, most murena.io services have been unreachable.
Affected services
- All murena.io services, including drive, calendar, notes, and email (@murena.io / @e.email)
- Over-the-air (OTA) /e/OS updates
- App Lounge
All other websites, including e.foundation, gitlab.e.foundation, community.e.foundation, and murena.com, remain unaffected.
Context
This outage is related to our storage infrastructure, which is getting old while the number of active murena.io users has grown considerably over the past two years.
An infrastructure evolution and consolidation was planned, but last week our current infrastructure lost some storage nodes and switched to a degraded mode.
Unfortunately, yesterday we faced another issue that forced us to put the infrastructure into maintenance mode until it is fully restored.
Actions we have taken:
- We are prioritizing email service recovery, and work is in progress. We will update this page today with a first ETA.
- Meanwhile, we have already fixed the App Lounge service, which is now back, though still in a degraded mode: only commercial apps (from the Play Store) can currently be installed. An ETA for full restoration of the service will be communicated later.
- We are taking the current outage as an opportunity to accelerate the murena.io storage infrastructure evolution and consolidation that was already planned. New storage nodes have been added and synchronization is already in progress. However, due to the high volume of data and the test procedures we need to perform before bringing it back online, we want to be clear that the ETA for murena.io services (except email services, which will be back sooner) will likely be counted in days, not hours.
- OTA update repair has been given low priority, since its impact on users is the lowest.
We would also like to make it clear that no data leak was involved in this situation.
We sincerely apologize for the inconvenience and will keep you updated on progress and ETAs over the coming hours.
The Murena Team
Update Oct 7th
- We have an ETA for the mail service @murena.io/@e.email: Thursday 10/10, morning CET.
- We don't yet have an ETA for the other murena.io services, as we are still evaluating a few different options to bring them back up safely in a reasonable amount of time.
Update Oct 9th
- ETA for mail service @murena.io/@e.email: Thursday Oct 10th, afternoon CEST
- ETA for drive at murena.io: (hopefully) Monday Oct 14th
- ETA for calendar, contacts, mail, passwords at https://murena.io : Friday Oct 11th
- ETA for /e/OS images & OTA updates: start rolling out on Thursday Oct 10th (not all devices will be available at the beginning)
Update Oct 10th
- ETA for mail service @murena.io/@e.email: Thursday Oct 10th, evening CEST.
- ETA for drive at murena.io: end of next week.
- ETA for calendar, contacts, mail, passwords at https://murena.io : Friday Oct 11th.
- ETA for /e/OS images & OTA updates: OTA service reopened, /e/OS downloads have started to roll out for most devices - some IPv6 access issues are currently being fixed.
Update Oct 11th
- Update about email services @murena.io/@e.email: our email servers were reopened yesterday evening as expected, to handle the incoming and outgoing email queues that had been pending for several days. During the night, the service was also briefly reopened to users, but since the load on the servers was extremely high, we have suspended it until the incoming and outgoing queues are emptied. The ETA for reopening the service to users is this morning, if the load is acceptable. If the load remains too high, we will have to modify an email route, which might take a few more hours.
- Update about /e/OS images & OTA updates: IPv6 access is now working.
Second update, Oct 11th
- ETA for mail service @murena.io/@e.email: we have started to reopen email access, beginning with a bit more than 50% of Premium users. Note: only fetching emails is possible for now; sending will be enabled once the load is acceptable.
- ETA for calendar, contacts, mail, passwords at https://murena.io : Monday October 14th 2024 (testing/QA is in progress)
- ETA for /e/OS images & OTA updates: OTA service reopened, as well as /e/OS downloads for most devices. Download speeds should be much better than before.
- ETA for drive at murena.io (files/images/videos): still uncertain, as we have remaining issues to fix with the storage infrastructure. We will update early next week.
Update Oct 12th
- Email service is now fully operational (receiving and sending) for Premium accounts. Please note that some emails received between Sunday 13 5:00 CEST and Sunday 13 19:50 CEST will not show up for now. Email service for free users will start reopening progressively on Monday Oct 14th.
Update Oct 14th
- Mail service @murena.io/@e.email: fully operational for Premium members. ETA for free users: not before Tuesday 15, evening CEST.
- Calendar/Contacts/webmail/passwords at https://murena.io: ETA end of this week.
- /e/OS images & OTA updates: operational.
- drive at murena.io: ETA is still uncertain as we have remaining issues to fix with the storage infrastructure.
- FOSS apps in App Lounge (F-Droid): no ETA yet.
Update Oct 16th
- Mail service: email @murena.io/@e.email is now fully back for all members.
Update Oct 18th
- FOSS apps in App Lounge (F-Droid): should now work normally (installation and updates).
- Calendar/Contacts/webmail/passwords at https://murena.io: currently being tested internally. ETA for public opening is Monday Oct 21st.
Update Oct 21st
- A minimal murena.io is now publicly available with the Calendar/Contacts/Webmail/Passwords apps.
Update Oct 25th
Dear everyone,
Thank you for your enduring patience during this outage. We are working tirelessly to bring Murena Workspace fully back online. Since the beginning of the outage, we have been able to put back in place:
- /e/OS image download and OTA download
- email service @e.email/@murena.io
- murena.io partial setup: calendar, contacts, webmail and passwords
What is still missing is access to files/photos/videos that were stored at murena.io.
We'd like to give more explanation about why it's taking so long. It all started with several defective hard drives in our storage cluster. Our storage cluster has many disks with a lot of redundancy, but this time it reached an unstable state that made us decide to stop it until we completely fix it. Unfortunately, several additional issues arose on top of the pre-existing ones. The resulting complexity led us to seek expert advice from a specialist company to avoid further complications. Given the size of the cluster, each procedure, such as checking and reorganizing data, takes a long time (sometimes several days). After a comprehensive analysis of the situation, the expert company advised us to reinforce the cluster with additional new servers and disks before rerunning the stabilization, to avoid sooner or later falling into a cascade of new disruptions.
So this is where we are right now: new hardware has been ordered for the data center, and it should normally be available and set up next week.
At that point, we can resume stabilizing the storage cluster and run through all appropriate validation procedures. Unfortunately, this process can still take several weeks, maybe less if we are lucky, so one week ago we took the decision to restore our cold backups. In the best-case scenario, if we can get our storage cluster up and running again soon, this cold-storage recovery will have been unnecessary. Otherwise, the backup will allow us to restore users' files on a new infrastructure. We are in the middle of the restoration process, and it will still take several days to complete, maybe up to one week.
Once this backup has been restored, we will decide on the best route to take. We may first provide everyone with read-only access to the restored files until we fix the storage infrastructure.
We apologize again for the inconvenience and will keep you informed about the status every time we have significant news.
Gaël & the Murena Team.
Update Nov 7th
There is no concrete news this week, but we wanted to share a partial update:
- Regarding the consolidation of the storage cluster: the new hardware is unfortunately still not available, as we depend on suppliers, but we are hopeful it will be completely installed and enabled early next week. From that point, we can resume work on storage cluster stabilization and testing, and hopefully reopen the service.
- The backup restoration is not complete yet: it takes a lot of time when you are dealing with close to 100TB of data, not even counting data transfers.
So, depending on which set of data becomes available first, we will reopen dedicated access, starting with Premium user accounts.
Update Nov 15th
This week’s update:
- The new hardware has finally been received, installed, and tested. The storage cluster is currently rebuilding with this new hardware included. Once it is fully consolidated, the plan for next week is to resume work on testing and stabilizing the filesystem.
- The backup restoration is not complete yet, as it had to be paused due to additional hardware issues we had to fix. The restoration process partially resumed today.
Next week, we hope we will be able to give an ETA for service restoration, that will depend on the storage cluster stabilization and backup restoration status.
Update Nov 22nd
This week's update:
- The file storage cluster has been consolidated (all data copied from old servers to new servers). A very long full filesystem scan/analysis is currently running. Once it completes, work can resume on stabilizing the filesystem (hopefully next week).
- Backup restoration is still ongoing, as we have encountered some issues with part of the backup.
ETA for full service restoration cannot be given yet.
Update Nov 29th
This week’s update:
- The file storage cluster scan/analysis was completed in the middle of this week, but we had to wait two additional days for the external experts' availability to resume stabilization. The first attempt today was not successful; we will make a new attempt on Monday.
- Backup restoration is still ongoing. We are putting a plan in place to give users access to the backups that are already available.
ETA for full service restoration cannot be given yet.
Update Dec 6th
This week’s update:
- A second attempt was made on Monday but did not succeed. We then faced additional delays from the consulting company that is helping us with the recovery. They recommended a software upgrade, which we performed on Wednesday to solve the issue. Unfortunately, that forced us to schedule and prepare a new full scan, which started this morning and is expected to run for ~5 days. We will then make a new attempt to access the filesystem.
- Backup restoration is unfortunately not fully complete yet. On Dec 16th, we will start sending users individual download links to the available backup archives.
ETA for full service restoration cannot be given yet.
Update Dec 12th
The new file storage cluster scan/analysis completed on the 10th. The consulting company that is helping us with the recovery will be able to assist with the next steps starting tomorrow, the 13th. We will post an update tomorrow if there is significant news to share.
Update Dec 20th
This week’s update:
- Finally, some better news: after several new failed attempts to access file contents, we finally succeeded, and our first copy tests have been successful (so far) for users' data. The bad news is that there is still a filesystem inconsistency that prevents us from fully stabilizing the cluster and making it fully operational again. The external consultants are trying to understand why and have contacted the CephFS developers for advice. Their first feedback is that a software bug could have resulted in an unexpected filesystem state. This explanation still needs to be confirmed, but it would explain why we failed to re-enable the storage cluster for more than 2 months using the standard recovery procedures. What's next: we're conducting more extraction tests (copying and opening files) to validate that we can now fully access all data. Once those tests are completed, the current plan is to copy all the data to a new, clean filesystem and reconnect it to murena.io.
- Meanwhile, we have made significant progress with the backup. We are still struggling with 1/3 of the backups, but for the other 2/3, we have put in place a service that lets users download an archive of their own data. We intended to launch this service early this week, but we had to deal with a few issues, which postponed the opening of this process until today. Some users have started to receive an email with a link and a procedure to download their archive. The process is still slow because we want to ensure it fully works, as there can be edge cases when downloading and decrypting huge archives. We will then accelerate the rate of sending over the next few days. Note: individual archives are protected by a password that is available to each user in the murena.io Passwords application (i.e., NOT the user's account password).
Update Jan 4th
This week’s update:
- Storage cluster filesystem: copying contents from the unstable FS now works, so the main plan is still to copy all users' data to a new, clean filesystem (which is now ready). We have started to do so for the 1/3 of users with a problematic backup. Copy status today is 23% complete and progressing.
- Backup recovery: we have sent file archive links to the 2/3 of Premium users; this is now complete. For the remaining 1/3, we are waiting for the FS copy (see above) to finish, after which we will be able to send links to the remaining Premium users. In parallel, we have started to send links to other "free" users.
Once the copy for all users is complete on the new filesystem and fully verified (we have seen some errors during the copy), we should be able to restart the storage service at murena.io in a reasonable amount of time (count on several weeks, though).
Update Jan 17th 2025
This week's update is partial and based on figures collected this week; we will post more accurate details early next week.
- Storage cluster filesystem: the copy for the 1/3 of users with a problematic backup is still ongoing and has reached 75% completion.
- Backup recovery: we have started to send backup links to all users and are a bit more than 60% complete.
ETA for the file storage feature at murena.io: we expect to reintroduce file storage at murena.io in early February.