Samsung says data-eating TRIM bug is a Linux kernel problem

Remember that potential TRIM bug in Samsung drives we reported on a few weeks ago? It may not be Samsung's fault after all. Samsung and developer Algolia have been working together to get to the root of things, and after weeks of work, Samsung is blaming the problem on a bug in the Linux kernel. An update to the blog post in which Algolia first broke news of the problem says "Samsung [has reached] a concrete conclusion that the issue is not related to Samsung SSD or Algolia software, but is related to the Linux kernel."

Algolia says Samsung has developed a patch for the Linux kernel that solves the problem. The patch was set to be released to the Linux community on July 18, along with Samsung's official statement on the matter and details of the issue, but as of today (July 21), the patch has apparently not been released. We'll try to get our hands on a copy of the statement to see what's up when it arrives.

Thanks again to our anonymous tipster for the heads-up.

(Updated at 8:20 PM on 7/21/2015 to account for the fact that it is not, in fact, July 17.) 

Comments closed
    • fsckyf
    • 4 years ago

    Not exactly the easiest thing to find… but here's the patch for the issue: http://www.spinics.net/lists/raid/msg49440.html Here's the explanation: "It turns out that there is misunderstanding between raid driver and scsi/ata driver. The raid driver lets split bios share bio vector of source bio. Usually, there is no problem, because the raid layer ensures that the source bio is not freed before the split bios. But, in case of trim, there are some problems. The scsi/ata needs some payloads that include start address and size of device to trim. So, the scsi/ata driver allocates a page and stores that pointer on bio->bi_io_vec->bv_page. (sd_setup_discard_cmnd) Because split bios share the source bio's bi_io_vec, the pointer to the allocated page in scsi/ata driver is overwritten. It leads to memory leakage and data corruption because the overwritten pointer has wrong address and size to trim on device."
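The aliasing described in that explanation can be illustrated with a toy model. This is only a sketch in Python, not kernel code: the `Bio` and `BioVec` classes, `split_shared`, `split_private`, `setup_discard` and `trimmed_ranges` are all hypothetical stand-ins, and `split_private` is simply one way to avoid the sharing, not a description of what the posted patch does. The "prepare every request, then read the payloads back" ordering is likewise an assumption chosen to make the overwrite visible.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class BioVec:
    """Stand-in for one bio_vec entry; bv_page holds a payload reference."""
    bv_page: Optional[dict] = None


@dataclass
class Bio:
    """Stand-in for a block I/O request pointing at a list of BioVec entries."""
    sector: int
    nr_sectors: int
    bi_io_vec: List[BioVec] = field(default_factory=lambda: [BioVec()])


def split_shared(src: Bio, first_len: int) -> Tuple[Bio, Bio]:
    """Split src in two; both halves reuse src.bi_io_vec (the aliasing case)."""
    a = Bio(src.sector, first_len, src.bi_io_vec)
    b = Bio(src.sector + first_len, src.nr_sectors - first_len, src.bi_io_vec)
    return a, b


def split_private(src: Bio, first_len: int) -> Tuple[Bio, Bio]:
    """Split src in two; each half gets its own vector (hypothetical remedy)."""
    a = Bio(src.sector, first_len, [BioVec()])
    b = Bio(src.sector + first_len, src.nr_sectors - first_len, [BioVec()])
    return a, b


def setup_discard(bio: Bio) -> None:
    """Loose analogue of the discard setup step: store the trim payload
    (start and length) in the first entry of the bio's vector."""
    bio.bi_io_vec[0].bv_page = {"start": bio.sector, "len": bio.nr_sectors}


def trimmed_ranges(bios) -> List[Tuple[int, int]]:
    """Prepare every bio first, then read back what each would send to the device."""
    for bio in bios:
        setup_discard(bio)
    return [(b.bi_io_vec[0].bv_page["start"], b.bi_io_vec[0].bv_page["len"])
            for b in bios]


src = Bio(sector=0, nr_sectors=16)
print(trimmed_ranges(split_shared(src, 8)))   # both halves report the same range
print(trimmed_ranges(split_private(src, 8)))  # each half reports its own range
```

In the shared case the second request's payload replaces the first's, so the first half would be trimmed with the wrong start address, which is the kind of mismatch the quoted explanation blames for the corruption.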

      • chuckula
      • 4 years ago

      Interesting. If that’s the actual fix and if that explanation is correct, then it really is a Linux bug, albeit one that won’t trigger most of the time.

      • just brew it!
      • 4 years ago

      Interesting. So if the analysis in that forum thread is accurate, the issue IS in fact a Linux bug, which only manifests when software RAID in JBOD or RAID-10 mode and TRIM are used together. Good catch. If that Seunguk Shin fellow is a Samsung employee, my opinion of their software developers just went back up a notch (though one superstar software engineer can’t completely make up for all the incompetence Samsung has exhibited lately…)

        • fsckyf
        • 4 years ago

        It’s true, but I still don’t understand how/why the Intel drives didn’t have any issues.

          • just brew it!
          • 4 years ago

          Without doing a deep dive into the code in question, I’ll speculate that it may depend on timing (which will vary between drive models), and/or how this bug interacts with other drive features. NCQ may very well be the wildcard here, since it affects the order in which things happen and can cause the drive to execute commands in an order different from how they were issued by the host.

            • fsckyf
            • 4 years ago

            The Samsung drives advertise support for NCQ and NCQ TRIM to the kernel. They have been placed on a blacklist for not correctly handling Queued TRIM recently though.

            https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/ata/libata-core.c?id=9a9324d3969678d44b330e1230ad2c8ae67acf81 Some Crucial and Micron drives are also listed on that blacklist. I'm not sure if they would be affected by a similar TRIM bug. Either way, if the set of NCQ commands isn't being executed properly, then maybe that could have somewhat of an effect.

        • dmjifn
        • 4 years ago

        Well, then it looks like there’s just been a little redemption! (https://www.linkedin.com/pub/seunguk-shin/20/69a/252)

          • just brew it!
          • 4 years ago

          Indeed. I have a great deal of respect for people who can “deep dive” and dig out root causes of bugs triggered by tricky corner cases like that. Samsung as a whole still has yet to re-earn my respect though. That will take more time, more evidence of competence (and a lack of further evidence of incompetence), and a mea culpa regarding the 840 EVO fiasco.

    • crystall
    • 4 years ago

    Since this involves the Linux kernel, if Samsung is wrong on this one brace yourselves for lots of (middle) finger pointing and creative profanity!

      • Kretschmer
      • 4 years ago

      $am$ung

    • UnfriendlyFire
    • 4 years ago

    I’m not sure how much I can trust Samsung.

    We’re talking about a company that released a family of SSD models with inherent, unfixable flaws that cause slowdowns or silent data corruption, and had the balls to disable Windows update on their laptops because they couldn’t be bothered with working with Microsoft on their USB problem.

      • chischis
      • 4 years ago

      The EVOs? Yeah I can attest to this. I have an 840 EVO that was installed in a Dell Studio laptop (yeah old I know but it’s just for clerical stuff) for about a year. Two days ago I was getting IO errors and eventually Windows 7 froze up. Transferred everything to an M500 that just happened to be going spare, restored an older OS image just in case and everything is fine.

      Anecdotal, I know. But I haven’t experienced IO errors or slowdown with any of the Crucial SSDs I have…

      • Welch
      • 4 years ago

      You forgot to mention that they deny the vanilla 840 drives have any issue, and that they released firmware for the 850 EVO that can completely wipe all data on the drive.

      Oh, and their TVs from about 2009-2012ish used crappy capacitors, and then they didn’t honor the warranty on them. I have a 50-incher that I got for free due to that exact issue. I simply recapped the board and got a free, perfect 50″ Samsung TV…

        • chischis
        • 4 years ago

        Talking about their TVs: many models have YouTube apps that no longer function because of a simple URL change that Google made. Fixed firmware for many of these Samsung TVs was never issued.

          • UnfriendlyFire
          • 4 years ago

          Did they also sell smart TVs that “accidentally” injected ads into video content being played from DVD or external hard drive?

            • jihadjoe
            • 4 years ago

            Haven’t seen ads actually being injected, but I did hear that data on every file you have and every channel you watch is being collected (http://betanews.com/2015/02/19/samsung-lied-its-smart-tv-is-indeed-spying-on-you-and-it-is-doing-nothing-to-stop-that/), quite possibly for some targeted advertising in the future. The same is true of LG and Vizio Smart TVs (http://www.wcpo.com/money/consumer/dont-waste-your-money/some-samsung-lg-and-vizio-tvs-now-spy-on-you). I'm sticking to dumb displays.

      • just brew it!
      • 4 years ago

      I don’t think it was balls… just simple incompetence/stupidity. Not that it makes the situation any better; either way, it implies that a Samsung nameplate should be interpreted as a warning label!

    • HERETIC
    • 4 years ago

    Has Samsung come up with anything except “It doesn’t have a problem”
    for the 840 yet?????????????

      • Chrispy_
      • 4 years ago

      Yep. Steadfast denial makes the problem go away.

      • ClickClick5
      • 4 years ago

      The 850.

      • dmjifn
      • 4 years ago

      My solution has been to sell my wife’s 840! So far I think that the community’s reaction might be a little bit of an over-reaction. And so far I don’t mind keeping my two 840 EVOs. But for a drive with no solution (not even a workaround like the EVO) in a machine I never use – not worth the risk.

        • just brew it!
        • 4 years ago

        If you never use the machine, there’s no risk since you’ll never notice when the drive starts to have problems. 😉

          • dmjifn
          • 4 years ago

          Heh. Well, when my wife’s iTunes disappears, I’m very sure it’ll be brought to my attention with much urgency. 🙂

    • geekl33tgamer
    • 4 years ago

    So this doesn’t affect Windows – Like, not even a little bit then?

    • Kougar
    • 4 years ago

    As I said in the forums, we would need to see more than just Samsung and the three most recent Intel Enterprise drives tested before one could say with any certainty either way.

    • NeelyCam
    • 4 years ago

    Should’ve used Windows

    • SuperSpy
    • 4 years ago

    Considering how open Samsung has been about the issue (via Algolia, granted), I’m inclined to believe them.

      • MarkG509
      • 4 years ago

      With a “/sarcasm” tag, I’d have up-voted you instead of down-voting you.

      • Topinio
      • 4 years ago

      Uh, the linked blog update is dated 17th and states that the patch will be released “tomorrow, July 18” which it hasn’t been (at least via normal channels).

      Considering how many times Samsung has screwed up its SSDs, I’m inclined to not believe a word. The only way that would change is an unconditional offer of a full refund for returning the Samsung SSDs and a contribution for the time I spend dealing with the mess.

        • Jeff Kampman
        • 4 years ago

        Thanks for pointing this out, my internal clock is way off today for some reason. We updated the post accordingly.

        • sustainednotburst
        • 4 years ago

        User fsckyf posted the kernel fix: http://www.spinics.net/lists/raid/msg49440.html

          • notfred
          • 4 years ago

          So a bug in the API usage between the RAID driver and the SATA driver.

          What I don’t understand is why they only saw it on the Samsung SSDs and not on the Intel SSDs. Maybe a subtle timing difference means that the window for corruption on the Samsung is open for longer than on the Intel?

            • BobbinThreadbare
            • 4 years ago

            Those darn race conditions.

            • just brew it!
            • 4 years ago

            Getting “had” by this (or not) may also depend on interactions with other features like NCQ.

            • sustainednotburst
            • 4 years ago

            http://www.spinics.net/lists/raid/msg49489.html From Giontan Dante: "Hi, any idea on why the bug affects/manifests only on specific SATA SSDs?" Martin Petersen's response: "Timing and a very heavy discard load."

    • Peter.Parker
    • 4 years ago

    GUIZE, come on!
    Linux doesn’t have bugs, only undocumented features.

      • Concupiscence
      • 4 years ago

      Ugh. I’m done.

        • BobbinThreadbare
        • 4 years ago

        Adjust that sarcasm of yours.

    • just brew it!
    • 4 years ago

    It’ll be interesting to see what the nature of this bug is. I’m wondering if it is actually an ambiguity in the TRIM/NCQ specification, which Samsung and the Linux kernel guys interpreted differently. If it is really a bug in the Linux kernel, I would expect more brands and models of drives to be affected. OTOH, if it is a Samsung bug, I would expect Windows to be affected.

      • chuckula
      • 4 years ago

      High likelihood that it boils down to a corner case in the spec where Samsung & Linux interpreted a weird situation differently. I’ve had 2 840 Pros for the last two years with trim turned on and nothing has been eaten. That’s probably because I’m not doing their intensive data write & erase patterns that trigger the bug though.

      • Convert
      • 4 years ago

      Exactly what I was thinking, interesting why the Intel drives weren’t impacted by this supposed kernel bug.

      If it only impacts Samsung I’m more inclined to think they are just putting a spin on the story and it’s really their bug.

        • Welch
        • 4 years ago

        Would be one of the worst moves if Samsung was just putting a spin on a story… a very costly spin indeed.

      • willmore
      • 4 years ago

      Could be that Microsoft–with more money and probably early access to prerelease hardware–coded around it. Or Samsung gave them a whitepaper with how they understood these edge cases to behave and Microsoft made sure their implementation took it into account.

        • VincentHanna
        • 4 years ago

        Or, it could just be that Linux and Windows structure their TRIM commands differently, or use different drivers, or MSFT has some fail-safe feature to prevent bugs from becoming carnivorous, or any of a hundred other things…

      • Nevermind
      • 4 years ago

      Samsung uses their own proprietary garbage collection in addition to TRIM, right?
      Part of the overprovisioning and whatnot?

      Imagine having two unrelated subroutines controlling garbage collection at the same instant?

        • Waco
        • 4 years ago

        TRIM and garbage collection are related, but they should never conflict. If they do, the drive is doing something wrong.

      • Ninjitsu
      • 4 years ago

      I remember reading that some Crucial/Micron drives were affected too.

        • MarkG509
        • 4 years ago

        Mine haven’t been. Three physical machines, all with Crucial SSDs, zero problems (though the M500’s would freeze once in a while for a noticeable number of microseconds).

        • just brew it!
        • 4 years ago

        Was it ever confirmed that this was the SAME bug, or just a bug with similar symptoms? IIRC the kernel devs were claiming that the Crucial/Micron issue was a bug in Crucial’s firmware, and worked around it by disabling NCQ when they detected one of the affected drives. (I might be mis-remembering this though.)

          • DrDominodog51
          • 4 years ago

          Your memory is correct. I remember reading this and the code that disables NCQ in a mailing list

      • Deanjo
      • 4 years ago

      It wouldn’t be the first time Samsung didn’t adhere to specs and workarounds had to be done in the kernel. It wasn’t all that long ago that their laptops did not follow the UEFI spec and people installing Linux wound up bricking them.

      http://www.h-online.com/open/news/item/Protection-against-Samsung-UEFI-bug-merged-into-Linux-kernel-1795332.html

      • adisor19
      • 4 years ago

      I really am having a hard time believing this is a linux bug from what I’ve read. I was under the impression that the Samsung firmware was claiming to support certain commands of the SATA spec when in fact it did not. So when the Linux driver was sending said commands to the SSD firmware, the commands were simply ignored and data was lost/corruption was had.

      Really having a hard time believing Samsung on this one..

      Adi

        • just brew it!
        • 4 years ago

        Well, we should know one way or the other soon enough. If they release the patch, then the kernel developers (and anyone else who is so inclined) will be able to look at it and decide whether it really fixes a bug in the kernel or not. If they don’t release the patch and try to dance around the issue instead (or issue a firmware update for their drives to address it), then that answers the question too.

        • sustainednotburst
        • 4 years ago

        You’re talking about the Queued TRIM issue. The issue with Algolia is different, and they’ve also stated they disabled Queued TRIM.

    • chuckula
    • 4 years ago

    Nom Nom Nom Nom Nom Nom Nom.

      • Neutronbeam
      • 4 years ago

      Your post highly nominal–chew on that for a while!

        • chuckula
        • 4 years ago

        N is for nominal that good enough for me!

      • odizzido
      • 4 years ago

        http://cdn.gagbay.com/2013/04/om_nom_nom-277067.jpg

        • Ninjitsu
        • 4 years ago

        What…is that? Looks almost like a tiny hippo…but hippo babies should be larger, iirc…

          • chuckula
          • 4 years ago

          Pygmy Hippo
          https://en.wikipedia.org/wiki/Pygmy_hippopotamus
