We've been covering a report from search provider Algolia pointing out a potential issue in Samsung SSDs' TRIM implementation. More recently, Samsung itself reported that the bug actually resides in the Linux kernel, and that the company had submitted a patch for the problem.
Now we have more details about the bug. Samsung has provided us with internal documents detailing the exact cause of the issue and the subsequent solution. We're getting a bit technical here, so we'll take some liberty to simplify. When Linux's RAID implementation receives a sequence of read or write operations, it creates a separate buffer in memory for each of them.
When it comes to TRIM operations, however, a single shared buffer is used. That works in theory, except there's a bug, and more specifically a form of race condition. When several TRIM commands are queued in a specific order, they all need to use the shared buffer, but after the first command is queued, subsequent ones may erroneously free the buffer before the previous operation completes. Boom. The wrong sector on the disk gets zeroed out, and chaos ensues.
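To make the failure mode concrete, here is a toy Python sketch of the idea. It is not the kernel code: the two-disk array, the function name `run_trims`, and the choice to model the premature release as a later command simply reusing the buffer are all assumptions made for illustration.

```python
def run_trims(shared_buffer: bool) -> dict[str, list[int]]:
    """Queue two TRIMs against a toy two-disk array, then complete them.

    Hypothetical model: 1 means live data, 0 means zeroed by TRIM.
    """
    disks = {"disk0": [1] * 8, "disk1": [1] * 8}
    # What the caller actually wants discarded.
    intended = [("disk0", 2), ("disk1", 5)]

    shared = {"sector": None}  # the single buffer every TRIM reuses
    queue = []                 # (target disk, buffer) pairs awaiting the device

    for disk, sector in intended:
        # Reads and writes get a fresh buffer each; TRIM (in the buggy
        # case) gets the shared one.
        buf = shared if shared_buffer else {"sector": None}
        # With a shared buffer, this clobbers the earlier command's target.
        buf["sector"] = sector
        queue.append((disk, buf))

    # The device completes the commands only after both were queued,
    # which is the unlucky timing the race needs.
    for disk, buf in queue:
        disks[disk][buf["sector"]] = 0
    return disks


if __name__ == "__main__":
    print("separate buffers:", run_trims(shared_buffer=False))
    print("shared buffer:   ", run_trims(shared_buffer=True))
```

With separate buffers, only sector 2 on the first disk and sector 5 on the second are zeroed. With the shared buffer, the first command reads the second command's target and zeroes sector 5 on the first disk, which still held live data.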
Samsung developed a fix and reportedly ran Algolia's test scripts for a week without issue. It then submitted a workaround patch to the Linux RAID mailing list on July 19. A healthy discussion ensued until a more permanent solution was tested and agreed upon, which was then committed to the kernel source tree.
In the meantime, users with linear, RAID 0, or RAID 10 configurations using SATA SSDs are advised to disable TRIM altogether until a kernel version that includes the patch is released. RAID 1 setups are not affected. The problem cropped up with Samsung's SSDs because of the precise sequence of events needed to trigger it. Martin Petersen from Oracle notes that the bug is dependent on "timing and a very heavy discard load."
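The advisory doesn't spell out how to disable TRIM. One common place it gets switched on is the `discard` mount option in `/etc/fstab`, so a rough first check might look like the Python sketch below. The helper name `mounts_with_discard` and the sample entries are hypothetical, and the check covers only mount-time discard, not TRIM issued by other means such as a scheduled `fstrim` run.

```python
def mounts_with_discard(fstab_text: str) -> list[str]:
    """Return mount points whose fstab options enable online discard."""
    hits = []
    for line in fstab_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # fstab fields: device, mount point, type, options, dump, pass.
        if len(fields) < 4:
            continue
        options = fields[3].split(",")
        # Match a bare "discard" or a "discard=<mode>" form.
        if any(o == "discard" or o.startswith("discard=") for o in options):
            hits.append(fields[1])
    return hits


# Hypothetical sample entries, for illustration only.
SAMPLE = """\
# <device> <mount> <type> <options> <dump> <pass>
/dev/md0  /      ext4  defaults,discard  0 1
/dev/md1  /data  ext4  defaults,noatime  0 2
"""

if __name__ == "__main__":
    print(mounts_with_discard(SAMPLE))
```

On a real system you would pass in the contents of `/etc/fstab` instead of `SAMPLE`, then remove the option from any listed entry and remount.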