Gandi March 9, 2025 incident postmortem
On Sunday 2025-03-09, Gandi experienced a major incident on its platform caused by a filer storage system outage affecting multiple services, including mailboxes.

What was the root cause of the incident? The main cause was the failure of an SSD storage filer.
Time stamps (UTC)  Event
2025-03-09 00:31:10  Incident started, and on-call responders began investigating over 1500 alerts; the root cause was difficult to determine, and the monitoring bot was unavailable
2025-03-09 01:11:19  Incident was escalated and CTO responded
2025-03-09 01:21:51  Public status published on status.gandi.net with the first impacted services identified
2025-03-09 01:23:31  Attempt to declare incident via ChatOps tooling
2025-03-09 01:25:15  VPN outage identified for non-Ops team employees
2025-03-09 01:33:03  Problem identified: a filer has crashed
2025-03-09 01:34:46  Filer restart attempted
2025-03-09 01:47:09  Filer restart failed
2025-03-09 02:16:21  Responder dispatched to datacenter
2025-03-09 03:31:11  First report from datacenter – filer restarted manually after power disconnection
2025-03-09 04:03:05  Attempted restart failed to resolve the issue
2025-03-09 04:15:51  Start service storage failover to a different filer
2025-03-09 05:37:27  All impacted systems identified; we confirmed that all emails were correctly queued and that no data loss was possible
2025-03-09 06:40:04  Additional responders arrive on site
2025-03-09 07:01:41  Firmware update started
2025-03-09 07:15:07  First critical service to respawn identified
2025-03-09 07:20:55  Firmware update failed
2025-03-09 07:30:40  Firmware update successful, but the problem persisted
2025-03-09 07:41:11  We identified that the firmware issue may be related to a PCI device, so we had to unrack the filer and remove all PCI devices
2025-03-09 09:15:57  We managed to get our monitoring bot back online
2025-03-09 10:25:00  We managed to recover the VPN so the support team could work correctly
2025-03-09 16:49:15  We managed to recover all the services except mailboxes
2025-03-09 16:50:10  We started recovering the mailboxes
2025-03-10 10:29:06  The filer was back online after multiple hardware changes, and VMs were also back online
2025-03-10 11:30:15  We identified that, in some cases, mail servers had not mounted the mailbox NFS system and were storing email locally

To add to the complexity of the situation, the customer support team was also unable to operate, as all of their tools relied on either internal authentication or IP restrictions requiring a connection to the VPN.
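The 2025-03-10 11:30:15 entry describes mail servers writing to local disk because the mailbox NFS share was not mounted. A common way to guard against this failure mode is to refuse delivery unless the mailbox directory is actually a mount point. The sketch below is only an illustration of that idea, not a description of Gandi's tooling: the path `/srv/mailboxes`, the function `resolve_delivery_dir`, and the exception `MailboxNotMountedError` are all hypothetical names.

```python
import os
from typing import Callable


class MailboxNotMountedError(RuntimeError):
    """Raised when the mailbox directory is not backed by a mounted filesystem."""


def resolve_delivery_dir(
    mailbox_root: str,
    is_mounted: Callable[[str], bool] = os.path.ismount,
) -> str:
    """Return mailbox_root only if it is a mount point.

    Raising here lets the caller leave the message in the queue and retry
    later, rather than silently writing it into the empty local directory
    that sits underneath an unmounted NFS share.
    """
    if not is_mounted(mailbox_root):
        raise MailboxNotMountedError(
            f"{mailbox_root} is not a mount point; deferring delivery"
        )
    return mailbox_root


if __name__ == "__main__":
    # Hypothetical path, for illustration only.
    try:
        print("delivering to", resolve_delivery_dir("/srv/mailboxes"))
    except MailboxNotMountedError as exc:
        print("deferred:", exc)
```

The mount check is passed in as a parameter so the guard can be exercised without a real NFS share. In production the default `os.path.ismount` would be used, and a stricter check could additionally verify the filesystem type.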