LAX3 - System Unresponsive
Incident Report for CubedHost, LLC
Postmortem

LAX3 encountered what appeared to be multiple hardware failures; however, that was not the case. We replaced every component by moving to a new physical server that differed only in its IP addresses, and the same symptoms appeared, so it was clear that something else was afoot.

The cause was a software issue between the Linux OS and the firmware on the NVMe disks. The SSDs became unusable in a way that the OS would normally interpret as a disk failure, so the disks were removed from the RAID array while the system attempted to keep ongoing operations running, which is the expected behavior of RAID for redundancy. A small nudge in the form of a kernel flag goes a long way. Although it is not part of our standard configuration, it will be applied to every physical server that uses these SSDs to ensure service continuity and prevent similar incidents in the future.
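For readers who run similar hardware, the sketch below shows one way to notice when disks have been dropped from an array. It assumes Linux software RAID (md) and the usual `/proc/mdstat` layout, neither of which is confirmed for LAX3. The function name `degraded_arrays` and the sample text are hypothetical and are not taken from our tooling.

```python
import re


def degraded_arrays(mdstat_text):
    """Return names of md arrays whose status map shows a missing member.

    Assumes the usual /proc/mdstat layout: a header line such as
    "md0 : active raid1 ..." followed by a line containing a map like
    "[UU]" (all members up) or "[U_]" (one member missing).
    Only arrays named md<number> are recognised.
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if current and status:
            if "_" in status.group(1):
                degraded.append(current)
            current = None  # status map consumed for this array
    return degraded


# Hypothetical sample for illustration only, not output captured from LAX3.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 nvme0n1p1[0] nvme1n1p1[1]
      523264 blocks super 1.2 [2/2] [UU]

md1 : active raid1 nvme0n1p2[0]
      976128000 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

On a live system the same function could be given the contents of `/proc/mdstat`. A check like this only reports that a member is missing; it cannot tell a genuine hardware failure apart from a firmware or driver issue like the one described above.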

We’re also working to resolve the issue that triggered the incident. We are confident that we have identified it, and a fix will ship in a near-future Prisma update. Any physical systems with this specific brand of NVMe SSDs will be addressed during this week’s global maintenance for software updates. As it has been 48 hours since the last occurrence, we’ve marked the incident as resolved and believe that it will not recur.

Thank you for your patience in this matter and, as always, please contact us if you require any assistance. We’re always here to help!

Posted Jul 27, 2020 - 20:45 UTC

Resolved
This incident has been resolved.
Posted Jul 27, 2020 - 20:36 UTC
Monitoring
After a BIOS update and some further changes, the system appears to be stable once again. We will continue monitoring and, if the changes prove successful, we will roll out the BIOS update to all nodes with the associated upstream to ensure this issue is not repeated elsewhere.
Posted Jul 23, 2020 - 21:02 UTC
Investigating
We're working to restore services as soon as possible; some upgrades are taking place on the physical system at this time.
Posted Jul 23, 2020 - 20:22 UTC
This incident affected: Prisma - Game Hosting Platform (Los Angeles, CA, USA).