Between February 1st, 2021 and February 5th, 2021, Los Angeles, CA node 7 (also known as LAX7) experienced some downtime due to what appeared to be a hardware fault. The NVMe disks in the system had been disappearing from the OS, but our internal teams confirmed that there were no actual signs of hardware failure, based on SMART checks and NVMe status taken while both drives were available.
At what appeared to be random intervals, normally during periods of minimal I/O activity, a cleanup service was triggered to run a “trim” process on the NVMe disk. This caused one or both NVMe disks to become fully unavailable until the system was hard powered off and back on. A BIOS update was applied in the early morning of February 5th UTC (late evening of February 4th in the US). Sadly, this was not the last we heard from these disks.
Upon further investigation on February 5th, our team found a related reported issue in which disks disappear due to a miscommunication between the Linux kernel and the NVMe disk controller regarding software-based power state management. A patch for this issue was applied to LAX7 on February 5th, 2021, and the node has remained online without a hitch since.
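For readers who want to check their own Linux systems, here is a minimal sketch. It assumes the widely documented workaround for this class of problem, the `nvme_core.default_ps_max_latency_us` kernel parameter, where a value of 0 disables APST. This post does not state that the LAX7 patch used this exact setting, so treat that link as an assumption. The helper names are hypothetical and only illustrate the check.

```python
from pathlib import Path
from typing import Optional

# Real nvme_core module parameter. Whether the LAX7 patch used it is an
# assumption made for this example, not something stated in the post.
APST_PARAM = "nvme_core.default_ps_max_latency_us"
SYSFS_PATH = Path("/sys/module/nvme_core/parameters/default_ps_max_latency_us")


def apst_latency_from_cmdline(cmdline: str) -> Optional[int]:
    """Return the APST latency limit found on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, raw = token.partition("=")
        if key == APST_PARAM and sep:
            try:
                value = int(raw)  # keep the last occurrence seen
            except ValueError:
                continue  # ignore malformed values
    return value


def apst_disabled(latency_us: Optional[int]) -> bool:
    """A limit of 0 microseconds means APST is disabled."""
    return latency_us == 0


def current_apst_latency(path: Path = SYSFS_PATH) -> Optional[int]:
    """Read the live value from sysfs; None if the file is missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    latency = current_apst_latency()
    if latency is None:
        print("nvme_core parameter not found (module not loaded or not Linux)")
    elif apst_disabled(latency):
        print("APST is disabled")
    else:
        print(f"APST is enabled, latency limit {latency} us")
```

Running the script on a host reports whether APST is currently disabled. `apst_latency_from_cmdline` can be given the contents of `/proc/cmdline` to confirm the setting was passed at boot.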
The specific affected disk models do not appear to be deployed in any new systems from our datacenter, and we will preemptively apply this patch to any other systems that contain the same model during our next maintenance window.
We appreciate your patience and understanding during the downtime caused by this issue earlier this month.
TL;DR? Due to a power state management issue (Autonomous Power State Transition, aka APST) between the Linux kernel and the disks in LAX7, one or both disks would become unavailable at unpredictable times. A patch was applied that has resolved the matter permanently.