LAX7 - Hardware failure
Incident Report for CubedHost, LLC
Postmortem

Between February 1st, 2021 and February 5th, 2021, Los Angeles, CA node 7 (also known as LAX7) experienced some downtime due to what appeared to be a hardware fault. The NVMe disks in the system had been disappearing from the OS, but our internal teams confirmed, based on SMART checks and NVMe status taken while both drives were available, that there were no actual signs of hardware failure.
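As an illustration of how a disk "disappearing from the OS" could be detected, here is a minimal Python sketch that compares an expected set of NVMe namespaces against the entries the kernel currently exposes under /sys/block. The device names and the helper names (nvme_namespaces, missing_disks) are hypothetical and are not taken from our actual monitoring.

```python
import os
import re

# Matches NVMe namespace block devices such as nvme0n1 (not partitions like nvme0n1p1).
NVME_NAMESPACE = re.compile(r"^nvme\d+n\d+$")


def nvme_namespaces(block_entries):
    """Return the sorted NVMe namespace names found in a list of block device names."""
    return sorted(name for name in block_entries if NVME_NAMESPACE.match(name))


def missing_disks(expected, block_entries):
    """Return the expected NVMe namespaces that are absent from block_entries."""
    present = set(nvme_namespaces(block_entries))
    return sorted(set(expected) - present)


if __name__ == "__main__":
    # Hypothetical expected devices for a two-disk node.
    expected = ["nvme0n1", "nvme1n1"]
    entries = os.listdir("/sys/block") if os.path.isdir("/sys/block") else []
    print("missing:", missing_disks(expected, entries))
```

A check like this only reports that a device is gone. It says nothing about why, which is why SMART data gathered while the drives were visible was still needed.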

At seemingly random intervals, typically during periods of minimal I/O activity, a cleanup service was triggered to run a “trim” process on the NVMe disks. This caused one or both NVMe disks to become fully unavailable until the system was hard powered off and back on. A BIOS update was applied in the early morning of February 5th UTC (late evening of February 4th in the US). Sadly, this was not the last we heard from these disks.

Upon further investigation on February 5th, our team found a related reported issue in which disks disappear due to a miscommunication between the Linux kernel and the NVMe disk controller regarding software-based power state management. A patch for this issue was applied to LAX7 on February 5th, 2021, and the system has remained online without a hitch since.
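This report does not spell out the exact patch. One commonly used mitigation for APST-related problems on Linux is booting with the kernel parameter nvme_core.default_ps_max_latency_us=0, which disables APST. Assuming that approach (an assumption, not a statement of what was applied to LAX7), the following Python sketch checks a kernel command line for that setting. The function names are hypothetical.

```python
APST_PARAM = "nvme_core.default_ps_max_latency_us"


def apst_latency_limit(cmdline):
    """Return the integer value of the APST latency parameter, or None if unset or invalid.

    If the parameter appears more than once, the last occurrence is used.
    """
    value = None
    for token in cmdline.split():
        key, sep, raw = token.partition("=")
        if key == APST_PARAM and sep:
            try:
                value = int(raw)
            except ValueError:
                value = None
    return value


def apst_disabled(cmdline):
    """Return True if the command line sets the latency limit to 0 (APST off)."""
    return apst_latency_limit(cmdline) == 0


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    try:
        with open("/proc/cmdline") as handle:
            print("APST disabled:", apst_disabled(handle.read()))
    except OSError:
        print("could not read /proc/cmdline")
```

On a running system, the effective value can also be read from /sys/module/nvme_core/parameters/default_ps_max_latency_us.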

The specific affected disk models do not appear to be deployed in any new systems from our datacenter, and we will preemptively apply this patch to any other systems that contain the same model during our next maintenance window.

We appreciate your patience and understanding regarding the downtime caused by this issue earlier this month.

TL;DR? Due to a power state management (aka APST) issue between the Linux kernel and the disks in LAX7, one or both disks would become unavailable at unpredictable times. A patch was applied that has resolved the matter permanently.

Posted Feb 24, 2021 - 18:39 UTC

Resolved
This incident has been resolved.
Posted Feb 05, 2021 - 20:23 UTC
Monitoring
One fix has been implemented. A kernel-level change may be required should the issue not be resolved with today's changes. We will continue to monitor the system throughout the day for any recurrence of the issue with disks.
Posted Feb 05, 2021 - 18:37 UTC
Identified
LAX7 has remained stable for approximately 2 hours, and upon further investigation, we may have determined a root cause for the troubles encountered on this system. We will be rebooting the system to apply some BIOS-level changes to confirm - additional information will be available soon.
Posted Feb 05, 2021 - 18:16 UTC
Investigating
We're currently investigating a recurrence of what appears to be an intermittent hardware failure.
Posted Feb 05, 2021 - 14:36 UTC
This incident affected: Prisma - Game Hosting Platform (Los Angeles, CA, USA).