The SSD Endurance Experiment: Testing data retention at 300TB

Solid-state drives are everywhere, and we shouldn’t be surprised. SSDs have long been much faster than mechanical hard drives, and the difference is striking enough for even casual users to perceive. The major holdup was pricing, which has become much more reasonable in recent years. Most modern SSDs slip under the arbitrary dollar-per-gigabyte threshold, and many good ones can be had for 70 cents per gig or less.

Higher bit densities are largely responsible for driving down SSD prices. As flash manufacturers transition to finer fabrication techniques, they’re able to cram more gigabytes onto each silicon wafer. This lowers the per-gig cost for SSD makers, but it also has an undesirable side effect. The higher the bit density, the lower the endurance. The very process that’s making SSDs more affordable is also shortening their life spans.

All flash memory is living on borrowed time. Writing data breaks down the physical structure of individual NAND cells until they’re no longer viable and have to be retired. SSDs have overprovisioned “spare area” to stand in for failed flash, but that runs out eventually, and then what? More importantly, how many writes can current drives take before they fail?

Seeking answers to those questions, we started our SSD Endurance Experiment. This long-term test is in the midst of hammering six SSDs with an unrelenting stream of writes. We won’t stop until all the drives are dead, but we’re pausing at regular intervals to monitor health and performance. Our subjects have now reached the 300TB mark, so it’s time for another check-up—and a new wrinkle. We’ve added an unpowered retention test to see if the drives can hold data when left unplugged for a few days.

If you’re unfamiliar with our experiment, I suggest reading our introductory article on the subject. It outlines the specifics of our setup and subjects in far more detail than I’ll indulge here.

The basics are pretty simple. Our subjects include five different models: Corsair’s Neutron GTX 240GB, Intel’s 335 Series 240GB, Kingston’s HyperX 3K 240GB, and Samsung’s 840 Series 250GB and 840 Pro 256GB. Anvil’s Storage Utilities software provides the endurance test, which writes a series of incompressible files to each drive. We’re also testing a second HyperX SSD with the software’s 46% incompressible “applications” setting to gauge the impact of the write compression tech built into SandForce controllers.
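
Anvil’s Storage Utilities is a closed, Windows-only tool, so the sketch below is only a conceptual stand-in for the kind of workload it generates; the path, file size, and loop structure are illustrative assumptions, not Anvil’s actual implementation.

```python
import os

# Conceptual stand-in for one endurance loop: write files of pseudo-random
# (incompressible) data until roughly 190GB has been written, then delete them.
# TARGET_DIR and FILE_SIZE are made up for illustration.
TARGET_DIR = "E:/endurance"        # hypothetical mount point of the SSD under test
FILE_SIZE = 1 * 1024**3            # 1 GiB per file
FILES_PER_LOOP = 190               # roughly 190GB per loop, as described later on

def write_incompressible_file(path, size, chunk=4 * 1024**2):
    """Random bytes compress poorly, defeating SandForce-style write compression."""
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(os.urandom(n))
            remaining -= n

def run_loop():
    paths = [os.path.join(TARGET_DIR, f"fill_{i:03d}.bin") for i in range(FILES_PER_LOOP)]
    for path in paths:
        write_incompressible_file(path, FILE_SIZE)
    for path in paths:
        os.remove(path)
```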

With the exception of the Samsung 840 Series, all of the SSDs have MLC flash with two bits per cell. The 840 Series has TLC NAND, which delivers a 50% boost in storage density by packing an extra bit into each cell. The extra bit makes verifying the contents of the cell more difficult, especially as write cycling takes its toll. That’s why TLC flash typically has lower endurance than its MLC counterpart.

We expected the 840 Series to be the first to show weakness, and that’s exactly what happened. After 100TB of writes, we noticed the first evidence of flash failures in the drive’s SMART attributes. The attribute covering reallocated sectors tallies the number of flash blocks that have been retired and replaced by reserves in the overprovisioned area. There were only a few reallocated sectors at first, but the number grew dramatically on the way to 200TB, and the pace quickened on the path to 300TB.

At our most recent milestone, the 840 Series reports 833 reallocated sectors. Samsung remains tight-lipped about the size of each sector, but if AnandTech’s 1.5MB estimate is accurate, our drive has used 1.2GB of its spare area to replace retired sectors. The 840 Series still has lots of overprovisioned flash in reserve, and it still offers the same user-accessible capacity as it did fresh out of the box. That said, its flash is clearly degrading at a much higher rate than the MLC NAND in the other SSDs—no surprise there.
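
For the curious, here is the arithmetic behind that 1.2GB figure, assuming AnandTech’s unconfirmed 1.5MB-per-sector estimate:

```python
reallocated_sectors = 833
sector_size_mb = 1.5                     # AnandTech's estimate; Samsung hasn't confirmed it
spare_area_used_gb = reallocated_sectors * sector_size_mb / 1024
print(f"{spare_area_used_gb:.2f} GB")    # ~1.22 GB of spare area consumed
```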

Only two other SSDs have registered reallocated sectors thus far. The HyperX drive we’re testing with incompressible data reported four reallocated sectors after 200TB of writes. That number hasn’t changed since. However, the HyperX has been joined by the Intel 335 Series, which now has one reallocated sector.

The Corsair Neutron GTX, Samsung 840 Pro, and Kingston HyperX drive with compressible data are the only ones that remain free of bad blocks after 300TB. Of course, the HyperX has written only 215TB to the flash thanks to its compression mojo.

In addition to tracking reallocated sectors, we’re monitoring each drive’s health using the included utility software. Samsung’s SSD Magician app reports that the 840 Series and 840 Pro are both in “good” health despite the former’s high reallocated sector count. Intel’s SSD Toolbox says the 335 Series is in good health, as well. Corsair’s SSD utility doesn’t have a general health indicator, and Kingston’s software doesn’t cooperate with the Intel storage drivers on our test rigs. However, we can get health estimates for all the drives using Hard Disk Sentinel, which makes its own judgments based on SMART data.

Hard Disk Sentinel health rating     100TB   200TB   300TB
Corsair Neutron GTX 240GB             100%    100%    100%
Intel 335 Series 240GB                 88%     73%     58%
Kingston HyperX 3K 240GB              100%     98%     98%
Kingston HyperX 3K 240GB (Comp)       100%    100%    100%
Samsung 840 Pro 256GB                  78%     51%     26%
Samsung 840 Series 250GB               66%     19%      1%

Well, that’s not very helpful. HD Sentinel seems to assess health using different SMART attributes for each SSD. The ratings for the 335 Series correspond to the “estimated life remaining” values produced by Intel’s own software. There’s no correlation between the Samsung software and HD Sentinel’s assessment of the 840 Series and 840 Pro, though. It’s unclear why HD Sentinel has so little faith in the 840 Pro, which hasn’t suffered any flash failures. Even the low health ratings for the 840 Series seem a tad pessimistic given the amount of spare area in reserve.

The lack of standardization for wear- and health-related attributes seems to be part of the problem here. Each SSD maker exposes a different mix of variables, making comparisons difficult. We’d like to see SSD vendors agree to offer a common set of attributes covering reallocated sectors, accumulated errors, overall health, and both host and flash writes. Some SSDs don’t even have SMART attributes to track total writes. The Crucial M500 is one example, and we left that drive out of the endurance experiment as a result.
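
For readers who want to check their own drives, a minimal sketch using smartmontools’ smartctl might look like the following. The attribute names are only examples, since, as noted above, each vendor exposes a different mix, and some drives omit these counters entirely.

```python
import subprocess

# Wear-related SMART attributes worth watching. Names vary by vendor
# (Samsung exposes Wear_Leveling_Count, Intel a Media_Wearout_Indicator,
# and some drives have no total-write counter at all), which is exactly
# the standardization gap described above.
INTERESTING = ("Reallocated_Sector_Ct", "Wear_Leveling_Count",
               "Media_Wearout_Indicator", "Total_LBAs_Written")

def wear_attributes(device="/dev/sda"):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    return [line for line in out.splitlines()
            if any(name in line for name in INTERESTING)]

if __name__ == "__main__":
    for line in wear_attributes():
        print(line)
```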

Data retention

The primary goal of this experiment is to see how many writes each SSD can take before it dies. Problems may crop up before the drives stop responding completely, though. We need to know if the SSDs are still viable—not just if they’re still alive.

Anvil’s endurance benchmark has an integrated MD5 test that provides some help on this front. We have it configured to verify the integrity of a 700MB video file pre-loaded on each drive. The file is part of 10GB of static data that sits on the SSDs during the endurance test. Even though that data isn’t disturbed as the endurance test runs, wear-leveling algorithms should move it around in the flash as writes accumulate.
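
The checker is built into Anvil’s own tool, but the underlying idea is simple. A rough stand-in using Python’s hashlib might look like this; the file path and reference hash are placeholders, not values from our setup.

```python
import hashlib

def md5_of(path, chunk=1024 * 1024):
    """Hash a file in chunks so large files don't have to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

# Hash the static file once when it's first copied, then re-check at every
# milestone. A mismatch means the drive silently returned corrupted data.
REFERENCE_HASH = "0" * 32                    # placeholder; record the real hash up front
if md5_of("E:/static/video.mkv") != REFERENCE_HASH:
    print("Integrity check failed!")
```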

Thus far, the built-in hash check hasn’t reported any errors. As several of our readers have pointed out, though, the integrated test doesn’t tell us whether data is retained accurately when the system is powered off. We actually considered making unpowered retention testing a staple of our regular check-ups. However, that kind of testing involves days of inactive downtime that we’d rather spend writing to the drives.

With our 840 Series sample clearly wilting, we decided it was worth sacrificing some time on an unplugged retention test. Our 700MB movie file is relatively small, so we swapped in a 200GB TrueCrypt file nearly large enough to fill each drive. Then something odd happened. While running an initial MD5 check on the file we copied, the 840 Series produced an unexpectedly incorrect result. We hashed the file again, and the result was still incorrect. This time, the hash test produced an entirely different string. Third time’s the charm? Nope. Strike three, and another different result.

All the other SSDs passed the initial MD5 check, so we started over with the 840 Series. We re-copied our TrueCrypt file, and the results were correct the first, second, and third time we hashed it. So we repeated the process again. Once more, the 840 Series passed three times in a row. We couldn’t reproduce the initial mismatches.

Puzzled, we shut down our test systems and proceeded with the unpowered portion of the retention test. Five days later, we fired them up again and checked the files. All the drives passed, including the 840 Series.

For a moment, I thought I’d imagined those initial errors. But no, I took screenshots. The SMART attributes also provide corroborating evidence. Before the retention test, the 840 Series’ unrecoverable error count was zero. The drive now says it’s suffered 172 unrecoverable errors. Something went seriously wrong, and Samsung’s error correction mechanism was unable to compensate.

Even though our 840 Series drive appears to have rebounded, it suffered a serious failure. In a normal desktop system, unrecoverable errors could result in permanent file corruption and data loss. I certainly wouldn’t trust our test subject with my own data anymore. Since the drive appears to be operating normally again, we’ll keep it in the experiment, albeit with a black mark on its record.

Disappointingly, only the Samsung and Kingston SSDs have SMART attributes that track unrecoverable errors. So far, the 840 Pro and HyperX drives are free of unrecoverable errors. The Corsair Neutron GTX only tallies “soft ECC correction” events, and it doesn’t report any of those. We’re in the dark with the Intel 335 Series, whose SMART attributes are devoid of error-related variables.

Performance

We benchmarked all the SSDs before we began our endurance experiment, and we’ve gathered more performance data at every milestone since. It’s important to note that these tests are far from exhaustive. Our in-depth SSD reviews are a much better resource for comparative performance data. What we’re looking for here is how each SSD’s benchmark scores change as the writes add up.

Apart from a few anomalies tied to the HyperX drives in the 4KB random read test, all the SSDs are maintaining reasonably consistent performance as the endurance experiment progresses. Even the Samsung 840 Series shows no ill effects.

These tests were conducted with the SSDs connected to the same SATA port in the same system. The drives were secure-erased before testing, giving us a nice apples-to-apples comparison. We also have performance data from the endurance test itself. These numbers track the speed of each loop, which writes about 190GB to the drives. The results are somewhat less reliable, because the endurance test is running simultaneously on six drives split between two test machines. The Corsair, Intel, and Samsung SSDs are connected to 6Gbps SATA ports, while the Kingston drives are limited to 3Gbps connectivity. Keeping those caveats in mind, we can still get a sense of how each SSD’s write speed changes over the course of the experiment.

The Samsung 840 Pro’s write speed spiked dramatically in the first run after our 200TB check-up. Since we secure-erase the drives after each threshold, that result isn’t unexpected. Performance typically increases after a secure erase, and some of the other SSDs exhibit similar behavior. The 840 Pro spiked higher than it did previously because it only wrote 145GB during that first run. There’s no indication of why the Anvil test stopped short of the prescribed 190GB, and there were no issues with subsequent runs. The 840 Pro’s SMART attributes don’t report any errors or programming failures, either.

Apart from that outlier, there’s little change from our post-200TB results. All the SSDs are running the endurance test at about the same speed as they were at the last milestone.

Lessons learned so far

The most important thing to take away from our experiment is that modern SSDs can survive an awful lot of writes without issue. We’re up to 300TB, and all the drives remain functional. The MLC-based models are holding up nicely, with only a handful of bad blocks between them. The TLC NAND in the Samsung 840 Series is degrading much faster, which we expected given the flash’s higher bit density. However, the drive still has plenty of overprovisioned spare area in reserve. And, like the other SSDs, the 840 Series has maintained largely consistent performance overall.

Only a couple of the SSDs have published endurance specifications, and we’ve already blown past those figures. The Kingston HyperX 3K is rated for only 192TB of total writes, while the Intel 335 Series is good for 20GB of “typical client” writes per day for three years, or just 22TB overall. We’ve also far exceeded the volume of writes I’d expect my own SSD to endure over its useful lifetime. The solid-state system drive in my primary desktop has logged a mere 1.3TB of writes since I installed it 18 months ago.

To be fair, our endurance experiment has lower write amplification than typical client workloads. Anvil’s test consists almost entirely of sequential writes, while real-world desktop activity involves a lot of random I/O. There isn’t a whole lot of data on the typical write amplification for client workloads, but everything I’ve seen and heard from SSD makers suggests an amplification factor below 10X. If we take my personal usage patterns as an example and use 10X write amplification as a worst-case scenario, it would take nearly 35 years to write 300TB to the flash.
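
Here is the arithmetic behind that estimate, using the 1.3TB-in-18-months figure from my own system drive and a pessimistic 10X amplification factor:

```python
host_writes_tb = 1.3        # host writes logged on the system drive over 18 months
months_of_use = 18
write_amplification = 10    # pessimistic factor for client workloads

flash_writes_per_year_tb = host_writes_tb * (12 / months_of_use) * write_amplification
years_to_300tb = 300 / flash_writes_per_year_tb
print(f"{years_to_300tb:.1f} years")   # roughly 34.6 years
```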

So, yeah, that’s why we’re not using real-world I/O in our endurance experiment. We wouldn’t be able to get results within a reasonable timeframe.

The data we’ve collected suggests that modern SSDs can easily survive many years of typical desktop use. Even TLC-based offerings should have more than enough endurance to handle what the vast majority of consumers will throw at them. That said, mounting flash failures appear to be responsible for the data integrity errors we encountered on the 840 Series. I would have no qualms about using TLC-based SSDs in my own systems, but I would check the SMART attributes periodically to keep an eye out for reallocated sectors. If those start piling up, it’s a good idea to replace the drive. As we saw with the 840 Series, error correction can’t necessarily keep up as flash failures accelerate.

From the beginning, we knew the 840 Series would be at a disadvantage versus its MLC-based rivals. The results bear that out, and they indicate we probably have a long way to go before the other SSDs start to falter. That’s good news overall, but it means there’s much more writing to do. Stay tuned.

Comments closed
    • ronch
    • 6 years ago

    I haven’t been paying attention to these SSD endurance articles until I got my 840 EVO 3 days ago. Performance is great although I think I’ve been overly cautious with it, afraid not of killing the drive after years of use, but rather, of degrading performance after just months of use.

    So does this mean there’s no need to worry about degraded performance for many, many years? And the cells degrade significantly only when they’re being written to, right? And what about secure erase? With my SSD being the system drive, how could one secure erase it within Windows, which is where one could run the app that does it? Sorry, I’m an SSD noob here.

    • Delphis
    • 6 years ago

    Lots of talk about RAID0 (Striping) which doesn’t interest me. I’d be curious about RAID 6 though and how it would get on with dealing with the double parity writes. With drives as large as the 1TB drives also reviewed, having an SSD backed file-server is becoming a possibility.

    • UnfriendlyFire
    • 6 years ago

    How long would the best 7200 RPM consumer HDD have to run in order to write 300 TB of data?

    (Random writes would probably increase the time significantly)

      • just brew it!
      • 6 years ago

      Streaming sequentially, it’ll take a few weeks. But if the access pattern is random it’ll be years.

    • just brew it!
    • 6 years ago

    Something I’m not entirely clear on regarding the Samsung 840 is whether it immediately reported the errors to the OS as a failed read operation, or if the only indications that something was amiss were the bad checksum and SMART statistics. The former would be acceptable (though unfortunate) behavior; the latter would be an indication of a fatal flaw in the firmware which should disqualify this drive from serious consideration until a firmware fix is issued.

    Silently returning corrupted data is a cardinal sin for a data storage device. I realize that occasional failures are inevitable, but when they happen they need to be unambiguous. Most applications do not explicitly checksum their data files to verify integrity; letting you know when the bits have been mangled is supposed to be the hardware’s job.

      • glugglug
      • 6 years ago

      I would wager that if they checked the event log they would see each failed write accompanied by one or two event log warnings (and probably a notification in the system tray as well), most likely about “Delayed Write Failures”. Which unfortunately means that the data would be handed off to the OS write cache, but not yet have been sent to the drive when control returns to the application, so there is no way of informing the app that the write that didn’t flush yet when “completed” from the app’s perspective had failed.

    • meerkt
    • 6 years ago

    Thanks for adding retention checking to the mix!

    • DarkUltra
    • 6 years ago

    I think we’re missing something here. Shouldn’t the OS and filesystem produce a “file failed a crc check” error message? Shouldn’t the SSD retry reading the bad cells? Or is this only actual on a HDD?

      • Shining Arcanine
      • 6 years ago

      That is a philosophical question. The philosophical answer is yes. The answer in practice is no. In general, filesystems do no checks under the assumption that the hardware is perfect. The only significant exception is ZFS. Also, CRC32 is a very weak checksum.

      As for what the SSD does, it is implementation dependent, but there is no reason that a second attempt at reading should be any different than the first. If it does detect a problem, it will send a read error, which lets the OS/controller handle it. What happens here is also implementation dependent, but it is usually wrong.

    • anotherengineer
    • 6 years ago

    Interesting.

    I would like to see what 34nm SLC NAND could do before it dies.

      • jihadjoe
      • 6 years ago

      Gustav’s Coffee Experiment?

    • balanarahul
    • 6 years ago

    According to an endurance test done by a Korean website, the Samsung 840 should last around 331 TBW. So, beware, its end is extremely near.

    Hardware.info has tested the 250 GB 840 as well. But their drive lasted for much longer before the first Uncorrectable Error occurred.

    [url<]http://us.hardware.info/reviews/4178/10/hardwareinfo-tests-lifespan-of-samsung-ssd-840-250gb-tlc-ssd-updated-with-final-conclusion-final-update-20-6-2013[/url<]

      • Chrispy_
      • 6 years ago

      331TB = 200GB a day for almost five years.
      My workstation 830 is over two years old and I’ve written less than 20TB to it so far.

      The 840 in my HTPC is more typical of a consumer workload and it’s managed 3TB in a year, so it only has 109 years left

      I will make a mental note that it might fail in the early 22nd century, thanks for the heads-up 😉

    • Aliasundercover
    • 6 years ago

    [quote<]We actually considered making unpowered retention testing a staple of our regular check-ups. However, that kind of testing involves days of inactive downtime that we'd rather spend writing to the drives.[/quote<]

    Do you plan to write them till they melt? It is certainly an interesting result knowing all but the TLC drive remain functional after 300TB even though they are rated for so much less.

    As useful as this is it doesn't answer the question "can I still trust these drives with my data". Their retention times may be awful. Or perhaps they are still surprisingly good like their ability to absorb writes.

    I hope you do switch over to testing retention as I think it would be more interesting to know how long they retain data after 300TB than whether they finally crash and burn at 500TB or 900TB. You can always melt them later. 🙂

    Thanks for the good work.

      • meerkt
      • 6 years ago

      Yeah, that’d be interesting to know.

      More difficult to test, but even more interesting, would be how the retention time graph looks on a single drive model at different wear levels. Take 10 drives, write to each 35TB more than the previous, let them sit offline, and check hashes weekly or monthly, maybe even per-sector.

      Another interesting thing to check, assuming the SMART write counters show that, is how much automatic refreshing drives will do to static data when there are no writes at all (i.e., “retention refresh”). If they don’t, I wonder if a utility could be made to do a whole-drive block-wise read-and-write-back. The question is how to find out the block size, and how to write a whole block at a time rather than a fragment.

    • Pholostan
    • 6 years ago

    I kinda find it suspicious that several drives report zero or close to zero reallocated sectors after close to 300TB written. How honest is that really? Are these drives going to continue to say everything is fine right up to the moment they die? Maybe the Samsung drive is more, shall we say, forthcoming with the actual data, not trying to lull us into false safety? These are consumer drives after all, and I think they are just flat out lying to us when they report zero reallocated sectors after 300TB written. I guess we’ll see, interesting times ahead 🙂

      • gamoniac
      • 6 years ago

      I think that would fall under one of the basic assumptions (and trust) necessary to conduct this test, unless the drives just experience sudden death as you said. Even then, it could be due to failures other than NAND wear.

      • sschaem
      • 6 years ago

      [url<]http://www.theregister.co.uk/2012/06/12/nand_dying/[/url<]

      Quality MLC is rated at 5,000 cycles, but shrunken TLC at < 1,000, so the results are in line.

      Again, TR's test is not realistic. You will not know how fast your drive will die with this test. What you will get is just the NAND quality...

    • LoneWolf15
    • 6 years ago

    I have followed the testing with a lot of interest. About my only surprise is the lack of a Crucial SSD in here like the M500, or the vaunted (though now aging) M4. I’d consider them one of the high sellers as well, and supposedly, changes to the M500 have been made to increase the longevity further.

    After seeing the 840 results, I did decide on an M500 for my newest build (boot drive for an HP ProLiant Microserver Gen8; the data storage will still be hard drives, in hardware RAID-5). I think the TLC drives are still great choices for a lot of consumer systems, but I’d prefer MLC for this application. Thanks again for the ongoing testing.

      • egon
      • 6 years ago

      Explanation why Crucial is missing from the test is at the end of the first page of the article.

    • Jon
    • 6 years ago

    What tool can I use to determine how much data has been written to my SSD’s?
    I assume this metadata is stored in the firmware and can be queried with the right tool.

      • travbrad
      • 6 years ago

      SSDLife or CrystalDiskInfo will tell you how many GB have been written depending on the drive. It doesn’t really work with a lot of drives though, like my Crucial M4. You can kind of guess from the average block erase count (AD), basically by multiplying the size of your drive by the value in there.

        • Visigoth
        • 6 years ago

        Yeah, with the Crucial drives, some of them don’t expose the amount of total bytes written over the lifetime of the device. This is one of the few weaknesses that they’ll hopefully address with their next SSD’s.

    • Bensam123
    • 6 years ago

    Although it’s a bit of an anomaly and may not be representative of the drives themselves, the problem with the Samsung 840 seems reminiscent of the issues I was having with my Samsung 830s in a RAID 0 array. It slowly corrupted itself over time while reporting relatively little in the way of SMART statistics, and there was no way to know anything was happening till the BSoDs started.

    My thread about that is still on the forums: [url<]https://techreport.com/forums/viewtopic.php?f=5&t=86939[/url<]

    If you want to add another reliability test, I'd definitely say using them in a raid array would be a good start. You may get surprising results as something like r0 is pretty easy to corrupt.

      • indeego
      • 6 years ago

      Read Absurdity’s post in that forum post. You need specific controller and drive firmware that is compatible with the drive, controller, and OS to get an accurate setup with RAID. Period.

      Consumer-level RAID doesn’t provide this. Don’t use consumer-level RAID if you truly value your data integrity. How many times have I seen hard-power outages break RAID? A few, really, but enough to make me less trusting of RAID setups on consumer-level drives than without RAID at all. That’s bad news. Maybe it’s gotten better in the ~4 years since I stopped bothering with hardware raid on a built-in controller, though.

      Don’t use RAID0 for anything other than holding data you won’t mind losing that very instant. Scratch disks? YES! temp files/folders? Sure! Live critical data? No. No. No.

        • Bensam123
        • 6 years ago

        I had the proper driver, controller, and OS. Period.

        You know, just saying period after something doesn’t make it correct. Read the thread in question before you start handing out judgement, condescension, and a snide attitude.

        Consumer level raid products work in 99.9% of the scenarios, I happened to find one where it doesn’t work and therefore I’m wrong for finding it? That’s like saying just because you leave the house unlocked it’s your fault for being robbed. It’s a outright silly stance. ‘Raid shouldn’t work cause it’s consumer grade’

        Completely putting this aside, asking TR to look further into this because maybe there is something here is definitely a good place to start. I don’t know why you’d hop on the bandwagon of ‘oh there is no way this could ever happen in a normal scenario’ when Geoff was describing a normal scenario. There may be corruption issues with Samsung drive, it’s definitely something to consider.

          • Diplomacy42
          • 6 years ago

          Actually, the court’s stance for a long time has been that failing to lock your door does in fact make you responsible for being robbed, something they like to call negligence.

            • Bensam123
            • 6 years ago

            Sure, but a verdict isn’t determined entirely by having your door unlocked… Add something like rape and murder to it and it matters less and less.

            • Diplomacy42
            • 6 years ago

            on the other hand, if you are a business owner and your negligence leads to another crime being committed, you can be criminally culpable for your failure to take reasonable precautions, which would be the other extreme.

            the law doesn’t recognize the concept of being responsible for your own assault, which is why “it matters less and less,” making that argument a red herring.

          • indeego
          • 6 years ago

          Judgment, you mean?

          I never once claimed you were wrong. I was providing an opinion based on my experience. Take it or leave it. You can put lots of things in my mouth, those words aren’t one of them, however.

          [quote<]"That's like saying just because you leave the house unlocked it's your fault for being robbed. It's a outright silly stance."[/quote<] OK. But also don't ask for sympathy if your house does indeed get robbed. Or go on a forum and ask "My house was robbed with three doors unlocked and signs saying "open" when it was fine before, what can I dooooooo? It works for 99.9% of people!" [quote<]'Raid shouldn't work cause it's consumer grade'[/quote<] It doesn't work because RAID is [b<]difficult[/b<] sh_t to get [i<]right[/i<]. You have to match every model of drive, including ones made after your RAID controller, and ensure the data integrity is fine? Are the controllers, mother board makers, or drive makers at fault? Why would a consumer-level drive maker want to support a RAID controller they never tested on? [quote<]"Completely putting this aside, asking TR to look further into this because maybe there is something here is definitely a good place to start. I don't know why you'd hop on the bandwagon of 'oh there is no way this could ever happen in a normal scenario' when Geoff was describing a normal scenario. There may be corruption issues with Samsung drive, it's definitely something to consider."[/quote<] Yeah TR should look into this issue and compare to yours in what possible way? They are a totally different model of drives and your experience isn't comparable, in any sense or shape. They weren't using a RAID0 array for this test. SMART statistics do not work reliably or at all for many controllers in RAID. [url<]http://en.wikipedia.org/wiki/Comparison_of_S.M.A.R.T._tools[/url<] To get the best SMART readings you take a drive out of RAID and read them using the disk manufacturer's own tools, at best. The exception, of course, are drives with specific firmware paired with the controller hosting them. PERC, HP's controllers, etc.

            • Bensam123
            • 6 years ago

            You never claimed I was wrong? That’s what your entire statement reads as.

            Talking to people about getting robbed is not the same as asking for sympathy. Now who is putting words in someone elses mouth? I was discussing my experience and seeing who if anyone had similar things happen to them. But you went and stated what happened didn’t matter and it can’t happen to anyone if they ‘have the proper tools’. No matter how secure your house is, it can still get robbed. It’s a moot point. It happened. Geoff encountered corruption issues, I encountered corruption issues.

            If Samsung SSDs have issues with corruption then they should be discussed in great detail and found out, not just say ‘that’ll never happen cause I say so’. Geoff had it happen as well and he wasn’t using a raid array.

            [quote<]It doesn't work because RAID is difficult sh_t to get right.[/quote<]

            Tell that to everyone else raid works for. There is no massive disclaimer stating raid doesn't work on consumer chipsets and you should use it at your own risk. Just the same as hard drive manufacturers don't tell you that you can't put them in a raid array unless they say so (consumer grade or otherwise). YOU are the one saying that and you are neither a hard drive manufacturer or a chipset maker.

            Once again Geoff had issues with corruption. This discussion isn't about the pros and cons of consumer grade raid chipsets, but the possibility that Samsung SSDs are having issues with corruption. You're just trying to pin this on the controller (regardless of Geoff having corruption issues as well), which is not the point of my original statement. Are you going to say Geoff was using his controllers wrong too and didn't have the right drivers?

            [quote<]Yeah TR should look into this issue and compare to yours in what possible way?[/quote<]

            You seem to have no comprehension of scientific testing or correlations (there is a bit hint for you) so I don't believe that you'll be able to understand all of correlations if I explain it to you... which is to say there isn't anything to explain, there is a possible correlation here of Samsung SSDs and corruption, which is what needs to be explored. Just because the way I discovered it is different from Geoffs does not in fact mean there can't be a correlation.

            We aren't discussing SMART characteristics, we're discussing corruption. I happened to discover a correlation between a certain SMART attribute and my own testing. Geoff also did. This isn't the same as saying 'the SMART attribute causes corruption' rather it may be a identifier of when the corruption occurs.

            I really don't know why you're so adamantly against the possibility of Samsung drives having corruption issues unless there is somehow a bias in here. I don't hate Samsung at all, quite the contrary, but I will encourage TR look at all possible avenues.

          • jihadjoe
          • 6 years ago

          We have QVL lists for memory, which behave in a known way. I would think it prudent to at least wait for a QVL list for SSDs.

          Bear in mind that until recently, RAID controllers expected to be working with hard drives which write in a fairly predictable manner. SSDs are a completely new scenario due to their controllers moving data around on NAND. This sort of behaviour would have been totally unexpected and unpredictable to a controller made before SSDs were prevalent; the SB950 is a pretty old chipset, and the 990FX was released just a few months before your 830 SSDs.

            • just brew it!
            • 6 years ago

            If the drive and controller both adhere to the SATA spec it should work. It may not behave ideally when subjected to all possible corner case (abnormal) conditions, but random data corruption (with no other indication to the user that something is amiss) should not occur unless there’s defective hardware.

            • indeego
            • 6 years ago

            If this were the case there wouldn’t be a need for vendor manufactured firmware updates for controllers, drives, etc in Server-land. The reality is every server/controller/drive manufacturer releases updates firmware on these modern devices, and they have for decades. And that is higher-tier SCSI and SAS land, where data validation is paramount. Look at some of the release notes for controllers that are certified for high-transaction database use or high-bandwidth Fibre Channel controllers. They do have QVL’s and parts are not exactly interchangable while still maintaining support.

            “SAS uses the SCSI command set, which includes a wider range of features like error recovery, reservations and block reclamation.”
            [url<]http://old.steadfast.net/services/hdd.dedicated.hosting.php[/url<]

            A SATA controller doesn't provide those features, it might not understand the underlying drive, nor can it communicate with the drive to determine this and other faults. Individual SATA drives may handle this differently.

            Do I think SATA is fine for consumer use? Yes. Because you miss an hour or day of data and you are doing your backups, you live with the pain. Do I think RAID0 or JBOD is fine for corner cases/scratch disks/performance cases in non-critical limited use? Yes. Go ahead and be all geeky and think your few hundred MB/s makes an actual difference in real-world cases.

            Is consumer RAID at all appropriate for anything mission critical where uptime is paramount and you don't have network redundancy? No. I don't even think anything SATA is appropriate for that. Again, the exception is data centers that have clustering on some level available. They put their money into that clustering, not the data-holding.

            • jihadjoe
            • 6 years ago

            I believe the keyword here is “should”.

            In the real world, sometimes it does, sometimes it doesn’t. Which is why we have QVLs.

            • just brew it!
            • 6 years ago

            Engineers (both design and QA) [i<]should[/i<] do their jobs competently. If they do, then the hardware [i<]will[/i<] work as intended, except in the case of a manufacturing defect.

            Data storage devices (magnetic, DRAM, flash, it doesn't matter what the underlying tech is) will occasionally drop bits. Which is why we have ECC. 😉

        • WaltC
        • 6 years ago

        You guys are talking only about SSD’s, apparently, but speaking about mechanical drives…

        Single consumer drives vs. single-consumer drives in a RAID 0 array share the identical degree of fault tolerance: 0. If you use a single drive (let's assume a single partition) and the drive goes down, you lose 100% of your data. If you use two drives in a RAID 0 configuration and a drive (or the same drive) goes down, you lose all your data. Difference: 0. The drives themselves don't know whether they are being used singly or in a RAID 0 configuration: they work the same either way. RAID 0 drives do not work harder than when used in a single configuration, they work the same. The controller is the difference, of course, and the far different outcome between RAID 0 and the single drive is read/write [i<]performance.[/i<] (I get ~160MB/s *write* peaks between my 2 sata II-drive RAID 0 config and my single WD Blue Sata III drive at home, for instance.) Performance of the array is noticeably superior to a single drive. But that's what RAID 0 is designed for, so it should be.

        There's a lot of scare mongering about so-called "consumer" RAID 0, for some reason...;) I've been using it regularly for over a decade on maybe a dozen personal rigs and gosh-knows how many drives--haven't had a RAID failure yet! Indeed, of the very few mechanical hard drive failures I've had *personally* during the last quarter century (three, maybe) every one happened to be used singly at the time and was not in a RAID configuration. Insofar as I've seen, the odds of one of two drives failing in a RAID 0 configuration (or three or more--but I just do dual-drive RAID these days) is identical to the odds of having two single drives and having one of them fail. Putting two drives into a RAID 0 config in no way increases the risk of a drive failure over using two drives singly. The risk is identical.

        I also used to use nothing except stand-alone more expensive RAID controller cards for PCI, but I've noticed that the quality level of controllers in recent AMD chipsets being already integrated with the PCIe bus is high enough so that I can get the performance I want and the reliability I require without having to fork over a few hundred $$$ more for the controller. Nice.

        Long years ago, even, when people were still doing tape backups and that sort of thing, I was doing backups to consumer hard drives...! People were saying "EEEEeeeeeee00000wwww! You poor thing! Your backups won't last! They can't possibly be reliable! If you don't like tape you should at least go to [glacial] CD...!" Heh...not only would my backups consume a fraction of the time their methods required, they would actually last longer by an appreciable margin, too, and be simpler to recall, restore, refresh and transfer! These folks had paid 10x-20x what I paid for backup drives so naturally they wanted to think they'd made the wiser choice--I didn't want them to feel bad....;)

        I only the other day read an interesting article by a decently large "cloud" company which discussed why it had abandoned so-called "server-grade" mechanical hard drives for quality consumer-grade hard drives for its hundreds of racks of drives. So far, the experiment has shown *no difference* in their data center between the far more costly "server drives" and much more reasonably-priced "consumer" drives. After a few years of looking at it, they found the consumer drives are costing them nothing in either performance or reliability, but that they are saving a substantial chunk of money, too. That's what it's all about, right? The article said they are now 100% "consumer" as their data told them spending appreciably more for "server-grade" drives wasn't justifiable.

        Basically, I just want to say that people's fears of RAID 0 are way overblown. Drive failures can and do happen to all of us--it's just that RAID 0 does nothing to increase any given drive's likelihood of failure.

          • Chrispy_
          • 6 years ago

          Single drives and RAID0 do indeed share the same lack of fault tolerance, but four drives in a RAID0 make the volume four times more likely to fail than a single drive.

          Given the typical 3-5% failure rates, that’s suddenly a 12-20% chance of failure.

            • WaltC
            • 6 years ago

            This is a common mistake…;)

            First, your point is really irrelevant, you know–you treat the RAID configuration as though it in and of itself has some material bearing on the longevity of any given drive. I know plenty of people (including moi) with four or more drives in a single box. The drives have no preference for IDE versus AHCI versus RAID operation–it makes not the slightest bit of difference to the drive. If you have four hard drives in a system running separately and you have four running in a RAID array, the odds of *a drive failure* are identical between the boxes! The odds of a RAID drive failure are identical to the odds of single-drive failure. Your computation is based on *the total number of drives* in a system–RAID has nothing to do with it. Again, 4 drives in a RAID 0 system and 4 drives in a single-drive system have exactly and precisely the same odds of suffering a drive failure. RAID has nothing to do with it…;)

            Think about that–what you calculate above has nothing to do with RAID, but simply involves the total number of drives in any given system. (I enjoy using the two-drive RAID 0 comparison with the single IDE/AHCI drive to illustrate a point. If you lose one drive in the RAID system, or one drive in the IDE/AHCI system–you lose all of your data, either way.)

            Here’s another way of illustrating the fallacy of the “compound drive failure” theory that predicts that drive failure rates grow proportionally with the number of drives added.

            Think about companies that manufacture millions of hard drives in a single production run. If it was true that a given drive’s failure rate doubles with the addition of each subsequent drive, then how many drives could a manufacturer make before all of his drives would begin to fail and he was unable to make anymore? If, as you say, drive failure doubles with each additional drive, this would be a real number that we could easily calculate. Of course, in real life, we all know that this is not what happens. All of the drives he makes, whether we’re talking about the first drive or the one-millionth drive after that–have the same odds of failing. That is, drive failure rates are based singly and failure rates are the same for each and every *individual* drive–they do not *compound* (or double) with the addition of each drive after the first.

            Here’s another example to help clarify this issue:

            Q: When you add an additional hard drive to you system, bringing the total to two drives from one, has your probability of drive failure doubled for both drives simply because you’ve added another?

            A: Of course not. The probability that *a drive will fail* is the same in your system, whether you have one drive or twelve, one drive or a million. The total number of installed drives has nothing to do with the probability that *one* of them will fail–that probability is always the same. Hence, there is no theoretical limit on the number of identical drives a manufacturer can make. It is entirely possible that a 1-drive box will suffer a drive failure before a drive failure hits a 4-drive box, because the failure rates for individual drives do not accelerate with the addition of more drives in a given system. The failure rates remain the same for *each drive* regardless of how many drives are installed in a system.

            Putting it another way: The longevity of most drives is calculated by an estimate engineers use called MTBFH (mean time between failure hours.) These are estimates made about drives (because of course no one can say with utter precision how long any given drive will continue to operate optimally) that are based on manufacturing processes, materials used, designs, experience with previous drives, etc. No matter how many drives you install in a given system, the MTBFH for each drive will [i<]never go down[/i<]--it remains constant. If you believe that drive failure rates are 400% higher with four drives than they are with one drive, then those MTBFH ratings would dramatically drop with the addition of each drive in a system. But they do not--and again, that is because drive failure rate is figured individually per drive--always individually--how many drives you add to the system being completely irrelevant.

            The clarification is easy to see once you reason it out--the common wisdom that says otherwise is simply wrong. It reminds me of another common perceptual error--the boy on the bike example. It seems like his pedaling speed would have to be added to the speed of the light beaming from his bike, but such is not the case in reality, is it?...;)

            Nope, if you have 8 hard drives in your box your drive failure probability hasn't increased 800% over having one drive installed...;) It's exactly the same as it is with one drive installed, and that is because drive failure probabilities (MTBFH) can *only* be calculated for each drive independently and individually.

            • slowriot
            • 6 years ago

            This post is hilarious, classic WaltC! Make a hilariously wrong mistake and then write at length trying prove your invalid point.

            It’s not a individual drive’s chance to fail that increases as you add more drives, but rather that your chances of experiencing a failure of *any* drive increases. You have more parts that can fail. This is important specifically to RAID 0 because if *any* of the drives fail in your array, you’re hosed.

            • Waco
            • 6 years ago

            This. WaltC – you’re wrong this time. The chances of [i<]array failure[/i<] are multiplied by the number of drives in the array. No, a single drive is not more likely to fail, but there are 4 times as many drives...hence 4 times the risk of losing the entire array.
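
            A quick back-of-the-envelope check of that compounding, assuming the 3-5% per-drive failure rates cited above:

            ```python
            def raid0_failure_probability(per_drive_p, drives):
                """A striped array is lost if any member fails: 1 - P(every drive survives)."""
                return 1 - (1 - per_drive_p) ** drives

            for p in (0.03, 0.05):
                print(f"per-drive {p:.0%}: 4-drive RAID 0 fails with "
                      f"{raid0_failure_probability(p, 4):.1%} probability")
            # 3% per drive -> ~11.5%; 5% per drive -> ~18.5% (close to the naive 12-20% figure)
            ```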

          • indeego
          • 6 years ago

          The difference is the first letter in the acronym RAID. Redundant. And the number after it: zero. RAID provides performance benefits with no redundancy. It also incurs additional risk in striping. Can your controller handle a stripe write on power loss? How is it flushing cache if the OS, MB, or controller fails?

          You can get *most* of the benefits of RAID *with* redundancy with a hotspare (eh) or RAID 1+0/10 (best). Why sacrifice your redundancy for the increased risk?

          If we’re just talking performance, and you have only two disks, you are right, RAID0 wins handily. But without your data, all the performance benefits drop dramatically due to the pain of restoring from backups or using file recovery.

          Cloud companies don’t power off their machines, they run on the same hardware that is tested numerous times, before allowing it to be certified for their environment. They also “network” RAID in different data centers, so really a single failure anywhere is easily tolerated. RAID is meant for single-box systems, for those of us without 10 GBe setups.

          Enthusiasts change MB’s, Operating systems, and drives quite a bit, and mix and match hundreds of combinations.

          Awesome luck/(experience) with RAID, though. Wish I had the same anecdotal experience.

            • Bensam123
            • 6 years ago

            You make it sound like people don’t know what raid 0 is and are on this quest to educate people. It’s entirely possible for drives to fail in more than one way and it doesn’t relate to a raid array. For instance Geoff had issues with corruption and he wasn’t using a raid array.

            Maybe you’re the one with the bad luck and anecdotal experience? Perhaps your raid0 hatred aura is infecting the arrays.

          • Bensam123
          • 6 years ago

          Yes, I’ve used integrated controllers as well for quite a long time (close to a decade as well) and I haven’t encountered anything like this where ‘drives are incompatible’. The original point of the post here was the further discussion of Samsung SSDs corrupting themselves.

          In my specific case I believe corruption occurred much faster for me due to raid arrays (especially 0) depending heavily on data integrity. As soon as enough of it starts going bad it really fucks up the data as the file system and OS have no knowledge whatsoever of the raid array, so it simply treats it like normal data. This leads to the data not being corrected by the usual set of tools the OS or file system uses to maintain integrity if something goes wrong.

          This gets worse the more drives that are attached to the array. My guess is also AMD arrays have less of a tolerance for corruption than Intel ones do, which is part of why I experienced it on AMD and not on an Intel chipset.

      • lilbuddhaman
      • 6 years ago

      I read over the thread, sounds like drivers. You haven’t mentioned, specifically, which drivers you are running your setup under, just that you made sure they were all installed correctly. And I minus’d you because you have a bitchy attitude.

    • Forge
    • 6 years ago

    “The solid-state system drive in my primary desktop has logged a mere 1.3TB of writes since I installed it 18 months ago.”

    Wow. I knew you weren’t a heavy user, Scott, but WOW. You put 1.3TB on your desktop SSD in 18 months. I’ve put 10.9TB on mine in about one year. That’s my laptop. My home desktop is even worse.

    Got to leverage that speed more!

      • ikjadoon
      • 6 years ago

      Wow, his number is low. I just checked mine. 1.69TB in 2.5 months, though that included migrating from Windows 7 to Windows 8.1 and reinstalling everything.

        • travbrad
        • 6 years ago

        Yeah that does seem pretty low. Even just with the temporary files Windows writes and pagefile writes I would expect more writes than that, although maybe his pagefile is on a different drive (or disabled entirely).

        I guess even if he wrote 10x that much data to his SSD the flash endurance probably won’t be an issue though.

      • jihadjoe
      • 6 years ago

      4.6TB in a year for me.

      I use it for OS, programs and temporary storage (where all downloads go before being off-loaded to the mechanicals).

    • setbit
    • 6 years ago

    This test is a fantastic idea, but I am also disappointed that the Samsung 840 EVO didn’t make it into the test lineup.

    I bought an EVO based on its price/performance ratio, and it’s great. I’m dying to know, however, whether its longevity will be more like the 840 or the 840 Pro.

    I know, I know, “Probably somewhere in the middle between the two,” but where in the middle? Those two drives will probably be at opposite extremes in terms of durability.

    Question for Geoff:

    Any chance TR will revisit or re-run this experiment with other drives in the future? (And while we’re at it, how come the EVO isn’t included this time around?)

    (Edited to address my question to the article author directly.)

      • jihadjoe
      • 6 years ago

      Almost certainly more like the 840. It does use TLC NAND.

        • setbit
        • 6 years ago

        No, it uses a combination of TLC and [b<]S[/b<]LC. SLC has better performance and longer life than either MLC or TLC: [url<]https://techreport.com/review/25122/samsung-840-evo-solid-state-drive-reviewed[/url<]

        The EVO will almost certainly last longer than the 840, because the SLC will serve to lower the write multiplication to the second tier TLC. The question is, by how much?

          • jihadjoe
          • 6 years ago

          Yes that’s true, but with TR’s testing method the SLC portion will become full almost immediately, and then the drive starts writing directly to the TLC.

          No doubt the 840 Evo will fare better in a real-world setting, but in this sort of testbench of non-stop 24/7 writes it will very likely perform nearly identical to the 840.

            • setbit
            • 6 years ago

            Good point, but it’s more complicated than that.

            Even when the SLC is “full”, it almost certainly participates in the wear-leveling algorithm along with the bulk TLC banks. Letting the SLC sit unused would be a waste of speed and lifetime, and Samsung didn’t get where it is in the SSD market by leaving spare performance lying around.

            The question is, will that little bit of SLC [i<]meaningfully[/i<] extend the drive's life? Beats the heck outta me, and I used to design flash storage algorithms for a living. (That was some time ago. I'm sure it was crude stuff by today's standards.)

            • jihadjoe
            • 6 years ago

            I’m not sure it does. Guru3D says it doesn’t and acts completely as a FIFO buffer. Sort of makes sense because in normal operation 100% of your writes already go to the SLC. Having it participate in wear-levelling will only put additional stress on what is already the most greatly stressed portion of the drive.

            Unless of course (as TR is also [url=https://techreport.com/review/25122/samsung-840-evo-solid-state-drive-reviewed<]waiting to hear[/url<]), the SLC moves around on the NAND with certain portions of the TLC going into SLC operation at certain times, in which case the wear level of the TLC also affects the SLC Turbowrite cache.

            Guru3D [url=http://www.guru3d.com/articles_pages/samsung_840_evo_ssd_benchmark_review_test,17.html<]quote[/url<]:

            [quote<]In the SLC buffer partition. Data is continuously overwritten, there's no wear leveling method applied in that segment of the NAND cells as these caches work as FIFO (First in First out) and as such these cells could die faster, and if your buffer dies ... what then ?[/quote<]

            • balanarahul
            • 6 years ago

            According to Hardware.Info:

            [quote<]The TurboWrite buffer has a permanent spot on the SSD, so specific TLC chips are reserved for and used as SLC. The fact that the buffer is stationary makes you wonder what the impact is on the lifespan. Samsung indicated that TLC memory used as SLC has a lifespan of about 50,000 P/E cycles, a lot more than the 3,000 we achieved in our endurance test of the 840s.[/quote<]

            50,000 * 3 GB = 147 TB. So, for the average consumer, the buffer should last for decades. But. How Turbowrite improves endurance is beyond my comprehension.

      • Klyith
      • 6 years ago

      I’d guess they didn’t include an 840 Evo because they had just been released when this test started, and they didn’t want any of the test to be influenced by firmware bugs. All the drives included have been around for at least a year, time to ensure mature firmware.

      If an Evo *was* in this test, it would very likely be performing much like the 840 basic. The large bulk writes the test uses would go straight to TLC for the most part, so the contribution of its SLC would be minimized.

      In a hypothetical real-world scenario, the somewhere in the middle is more likely, since the SLC will “save up” small writes to preserve the TLC. The real question about the Evo is how much endurance its SLC has: being used as a cache it will get a lot of write cycles. But the overall lesson from TR so far is that normal users shouldn’t worry about flash endurance on SSDs.

        • sschaem
        • 6 years ago

        Also, that 300TB would go through the small SLC buffer…
        300TB / 3GB = 100,000 cycles…

        So the EVO will fail much quicker than the 840,
        or somehow the drive bypasses its quick write, or disables it once it's been worn out?

      • Dissonance
      • 6 years ago

      We selected the drives for this experiment before the EVO was released. And we’d already maxed out the number of internal SATA ports in our twin test systems. The mobos are limited to four internal ports each, and one of those is taken up by the system drive.

      Our current experiment is too far along to add other drives to the mix. However, we’ll consider the EVO–and any other popular drives–for future endurance testing.

        • setbit
        • 6 years ago

        Thanks, Geoff.

        This is one of the most interesting and potentially useful tech product tests in recent memory. You should definitely do more of these as SSD technology evolves.

        It will be interesting to see if manufacturers are forced to implement increasingly elaborate error correction and recovery methods as NAND feature sizes shrink, or if there will be some breakthrough, such as [url=http://spectrum.ieee.org/semiconductors/memory/flash-memory-survives-100-million-cycles<]heat reconditioning[/url<] or something similar, that removes the problem outright.

        • Visigoth
        • 6 years ago

        I would like to add that I would be EXTREMELY interested in a comparison with the M500. I’m sure many will agree with me on this one.

    • indeego
    • 6 years ago

    To put this into perspective, I have several business-class systems here at work with SSDs in use for 4+ years. Total host writes for these SSDs are between 1.5 and 2.5TB. They are rarely powered off and are probably restarted between one and four times a month. That means potentially 60-100 or more years of use at current rates. You could probably tweak them further to put logs and other less important but frequently changing data on a mechanical drive, if you really cared. If this test is at all applicable to the real world…

    • UnfriendlyFire
    • 6 years ago

    I feel fully comfortable with buying a SSD, assuming the driver/controller doesn’t cause it to have some interesting failures.

    • Hattig
    • 6 years ago

    I don’t think I’d ever get to 100TB of writes to a 250GB drive in its lifetime, never mind 300TB. Hell, 10TB would be pushing it. Maybe I need to do more downloading of everything I can find online… :S

    This is very reassuring.

      • Orwell
      • 6 years ago

      Writing more than 3TiB per year isn’t hard. My poor Samsung 830 128GB has had to write 3.5TiB since November last year, mostly due to a lot of code compiling. Aside from the compiling, it’s living its life as a pretty normal system drive. The pagefile is disabled though, so 4TiB might’ve been reachable.

      That makes 10TiB in a lifetime quite normal actually. 😉

      The fun part is that the 3.5TiB has only burned up 24 cycles of the average cell which can take at least 3000. If my (grand)children continue the programming heritage and keep using this drive, it’ll lose its data retention capability somewhere in the year of our lord 2139.

        • indeego
        • 6 years ago

        Why would you disable the pagefile?

          • Orwell
          • 6 years ago

          Well, RAM itself is faster than any disk you put the pagefile on to make it act as RAM, and I could really use the extra couple of gigabytes for other purposes. 😉

            • indeego
            • 6 years ago

            I’m well aware of the speed of RAM, and I’m also aware that you are depending on an operating system that is designed around paging intelligently to RAM [b<]and[/b<] disk. Exhausting your pagefile can and will eventually lead to data loss (or Windows will just recreate it when it needs it). Do you have actual benchmarks showing a difference, or is this just reading a random tweaking guide and calling it a day?

            • stdRaichu
            • 6 years ago

            “Intelligent” paging should really mean “only swap to disc if you have to”. NT5, unlike NT6 and most UNIXes, didn’t handle this very well at all and would frequently write things to swap when they could easily be kept in RAM; running a file copy (which would fill up the RAM cache despite the fact the files weren’t going to be reused) and having XP swap out all your running programs to disc was one of the most obvious examples. It doesn’t make any difference as far as performance goes, TBH (since modern OSes try to avoid paging as much as possible), unless you’re always running out of RAM, in which case performance will go out of the window in any scenario.

            Pull up perfmon and see how much Windows is writing to your pagefile some time – even my most anemic Windows computer with 4GB of RAM has never used more than about 6MB of data in the page file. I still keep mine set to 512MB since there are some antiquated apps that check for a certain size of page file, but IME 98% of workloads with NT6 on a reasonably well-specced machine will never need to write more than a few MB to the page file; keeping it at 2x, 1.5x or even 1x of RAM in this day and age is almost always incredibly wasteful of disc space – especially if you use an SSD.

            Build standard for our company is disabled page file in dev and test, 1GB page file in UAT and prod – IOPS are bloody expensive and it’s cheaper for us to allocate the right amount of memory for a server workload than it is for us to waste time and IOPS writing that memory out to disc (which of course means less IOPS for everything else on the SAN). We’ve got maybe only fifty or sixty servers where the page file is actually routinely required, and a good 70% of those are applications which say “argh, I’ve got 64GB of RAM and a 20GB commit but only 16GB of page file! I’m going to refuse to run!”.

            The crux of the matter is that disc access speeds haven’t kept up with average memory capacity (except for some SSDs), and it’s perfectly feasible to run everything out of RAM these days (and has been for years). I once worked at a company that still went by the old “swap must be 2x RAM!” adage and had actually shelled out an utter fortune for 256GB SSDs just to be used as swap… over their entire six-month lifetime those SSDs saw less than 500MB of writes.

            I’ve been recommending severely reducing or eliminating the page file (if the application is fine with it) on a professional level for years now, and I’ve not seen anything since to convince me the page file is worth spending disc space on. Try it on your home machine and let me know how you get on.

            Some further reading if you feel like it (although the latter links to the former):
            [url<]http://blogs.technet.com/b/markrussinovich/archive/2008/11/17/3155406.aspx[/url<] [url<]http://blogs.citrix.com/2011/12/23/the-pagefile-done-right/[/url<]

            • Ryu Connor
            • 6 years ago

            That Russinovich article does not say to turn off the pagefile; instead it teaches you the proper way to size it. He even suggests that if your calculation comes out negative, you should still put some sort of pagefile on the system.

            Windows 8 and 8.1 will dynamically make the pagefile bigger or smaller based on accumulated commit metrics over time as part of system maintenance.

            The pagefile generates very little in the way of writes.

            There’s no good reason to tell the average layman or enthusiast to muck with the pagefile. We don’t know enough about what software they use today or will use tomorrow to engage in that discussion. This isn’t the controlled environment of a workplace.

            [quote=”Microsoft”<]In looking at telemetry data from thousands of traces and focusing on pagefile reads and writes, we find that
            • Pagefile.sys reads outnumber pagefile.sys writes by about 40 to 1,
            • Pagefile.sys read sizes are typically quite small, with 67% less than or equal to 4 KB, and 88% less than 16 KB.
            • Pagefile.sys writes are relatively large, with 62% greater than or equal to 128 KB and 45% being exactly 1 MB in size.[/quote<]

            [url<]http://blogs.msdn.com/b/e7/archive/2009/05/05/support-and-q-a-for-solid-state-drives-and.aspx[/url<]

            • dragosmp
            • 6 years ago

            Even in a non-controlled environment, statistically folks use a certain set of apps with an estimable memory footprint. If you can estimate the required quantity of RAM with a high degree of confidence, then disabling the pagefile can reduce writes, save space, and reduce power consumption.

            • Waco
            • 6 years ago

            I challenge you to find any real documentation on that assertion.

            I used to disable the pagefile in the past – but there’s really no point. It’s about as bad as disabling swap in Linux. Which would you rather happen: random crashes because memory is exhausted or a nearly insignificant slowdown followed by normal operation?

            • ronch
            • 6 years ago

            Instead of disabling my page file, I just had Win7 use my mechanical hard drive instead. I have 8 gigs of RAM and I probably won’t run out of RAM in 99% of cases and I just wanna minimize writes to my SSD.

      • sschaem
      • 6 years ago

      300TB in this case is only equivalent to performing ~1,000 P/E cycles…
      Most people don’t write 240GB, erase it, and write 240GB again, giving the drive an unrealistically even 100% usage. That never happens.

      We see the TLC crapping out at less than 1,000 cycles; that’s not much for a real-world use case.
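
      Rough numbers behind that estimate, with a knob for write amplification since real-world usage is never perfectly even; the capacity and amplification values here are illustrative:

      // Effective P/E cycles consumed = host writes x write amplification / capacity.
      const hostWritesTB = 300;
      const capacityGB = 250;           // Samsung 840 Series in this test
      for (const writeAmplification of [1.0, 2.0, 5.0]) {
        const cycles = (hostWritesTB * 1000 * writeAmplification) / capacityGB;
        console.log(`WA ${writeAmplification}: ~${Math.round(cycles)} P/E cycles per cell`);
      }
      // WA 1.0 is the idealized, perfectly even case (~1,200 cycles); small random
      // writes and uneven wear push the real figure higher.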

    • albundy
    • 6 years ago

    “I’ve seen and heard from SSD makers suggests a multiplication factor below 10X”

    My guess is that the R&D departments of these manufacturers set that exponential decay factor based on theory and not years’ worth of testing, but I could be wrong. The testing that you guys are doing is far beyond what any other review site will ever do, and it’s a shame really, since there’s generally no spotlight on MTBF in SSD reviews.

      • duke_sandman
      • 6 years ago

      DoublePlusGood for this series of reviews, indeed. Nice work, guys.

      • indeego
      • 6 years ago

      This is true for all product categories. Nothing you buy is tested for years; it’s tested in simulated environments. Manufacturers build ovens to replicate high-temperature and high-humidity environments for pretty much every piece of electronic equipment out there. They don’t actually put a CPU in a desert under those conditions.

      There are also 100 more ways TR, or another site, could test this. How about data sets of various files and compression levels? How about daily system restarts? How about different motherboards or power supplies? How about hard power resets? Etc…

    • crystall
    • 6 years ago

    The results of this test are very encouraging. I, for one, run a desktop workload that happens to generate tens of gigabytes of writes a day (*), and my Intel 320 has accumulated ~10TB of writes in a year. I bought it mostly because it was one of the most reliable drives on the market at the time, but even less pricey desktop drives seem to be doing just fine.

    The test is not exhaustive, as age is also likely to have an effect on data integrity, but it’s comforting to know that you can hammer drives with 300TB+ of writes with little risk of having them die on you.

    * Building Mozilla codebases multiple times a day; a full Firefox OS build, for example, usually produces ~5GiB of writes.

      • Wirko
      • 6 years ago

      Thanks for the real-world data. Each build therefore costs you some 3 cents, assuming an endurance of 300TB and an amplification factor of 10 (rough sketch below).

      My X25-M accumulates around 4 GB per day in what is mostly light office use. About half of that is due to hibernation, and some is due to (gasp) occasional defragging of the most fragmented files. The SSD is rated at 7.5 TB and it's already past that but I'm confident that it will last forever.
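
      Spelling that estimate out; the drive price here is my own assumption, roughly what a 256GB-class SSD cost at the time:

      // Flash-wear cost per build = fraction of rated endurance consumed x drive price.
      const buildWritesGiB = 5;        // crystall's figure for a full Firefox OS build
      const writeAmplification = 10;   // pessimistic assumption, as above
      const enduranceTB = 300;         // what these drives have already survived
      const drivePriceUSD = 150;       // assumed price of a 256GB-class SSD
      const flashWritesGB = buildWritesGiB * 1.074 * writeAmplification; // GiB -> GB
      const costUSD = (flashWritesGB / (enduranceTB * 1000)) * drivePriceUSD;
      console.log(`~${(costUSD * 100).toFixed(1)} cents of flash wear per build`); // ~2.7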

        • crystall
        • 6 years ago

        [quote<]The SSD is rated at 7.5 TB and it's already past that but I'm confident that it will last forever.[/quote<]
        One of the important things this article series tells us is that the lifetime of these drives probably depends more on the general quality of the electronics than on flash endurance. I have the feeling that the controller or some other on-board component will fail before the flash runs out of write/erase cycles. That’s also barring catastrophic failures of an entire flash chip, but some recent SSDs should survive that too.

          • indeego
          • 6 years ago

          Without moving parts, this is far less likely. UPS your workstation (with a quality tested UPS, not Belkin’s crap), watch for shorts. Best you can hope for.

    • Shobai
    • 6 years ago

    “Of course, the HyperX has written only 215GB to the flash thanks to its compression mojo.”

    Is the mojo /that/ good?

      • nanoflower
      • 6 years ago

      It depends upon the sort of data they are writing to the drives. If they were just copying text files over (or writing a fixed pattern) then I could see the data being compressed that well. If they were copying data that was already compressed (movies/RARs/ZIPs) then there would be some remarkable voodoo going on to compress the data that much.

        • peartart
        • 6 years ago

        I’m 95% sure it’s a typo, otherwise the test would be pointless.

          • Stickmansam
          • 6 years ago

          Seems like the typo has been fixed to 215TB

        • dmjifn
        • 6 years ago

        I can’t tell from your comment whether you know this, but they have two HyperX drives in the experiment, one receiving each type of data you just described, to test the effects of compressibility. The drive with the low total writes is the one receiving the compressible data.

      • albundy
      • 6 years ago

      Is it that good? It’s the life force. The essence. The right stuff. What the French call a certain… I don’t know what.

      We’re all sensitive people.

    • ShadowTiger
    • 6 years ago

    It is very interesting how robust they are. Do you think vendors will start offering longer warranties?

    • badnews
    • 6 years ago

    The Corsair Neutron GTX / LAMD controller still looks like a pretty sweet drive. I’m surprised no other vendors seemed to go with that chip (excepting Seagate, of course… maybe I can understand why they chose it now!).

      • vargis14
      • 6 years ago

      I have been wanting a Corsair Neutron GTX drive for a long time now to replace my original series 1 Corsair Force 60GB drive. Besides being way too small, it is not very fast at all. But three years ago it was substantially faster than a HDD and, “knock on wood”, it has worked flawlessly.

    • odizzido
    • 6 years ago

    I wonder if the error you got when hashing the truecrypt file was one of the blocks failing just as you copied it on, and then getting flagged and removed from use for the second try. I don’t suppose you know the number of bad blocks before and after the test?

    That could be another thing to consider… does Samsung push each block right to the edge of failure to extend the life of the drive, at a small risk to the data on it?

    • Sargent Duck
    • 6 years ago

    Excellent, keep up the good work! Although my SSDs will NEVER get anywhere close to their respective rated write totals, this is still super interesting to read.

    Geoff, after destroying these SSDs, might I suggest you re-awaken “the Beast” for some power supply reviews?

    • hans
    • 6 years ago

    Is the Samsung making the case for ZFS? ZFS may not prevent this type of corruption (but maybe, depending on scrubbing) but it would at least detect it and potentially alert you. I thought there might be a nice opportunity here for pools of cheap TLC drives, except that using reallocated sector count as an indicator of drive life means you’d have to start replacing these at 1/3 the life of MLC. That eats up any price advantage of TLC quickly.

      • ikjadoon
      • 6 years ago

      How… how long are you using your drives?

      Looking at just the reallocated sector count, the drive has used 1.2GB of its spare area at 300TB of writes. The 250GB EVO has 9.06% set aside as spare area, so that’s about 23GB. You would need to write 5.75PB (5,750TB) to exhaust it.

      Using Geoff’s 1.3TB in 1.5 years… you’d need several millennia to wipe out this drive in normal usage. Wait, is that right? Someone check my math!
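
      For anyone who wants to check it, here is the naive linear extrapolation spelled out; as the reply below points out, wear doesn’t actually scale this way, so treat it as optimistic hand-waving:

      // Naive linear extrapolation: spare area consumed so far vs. total spare area.
      const spareUsedGB = 1.2;    // reallocated so far, per the article's estimate
      const spareTotalGB = 23;    // ~9% of a 250GB drive
      const writesSoFarTB = 300;
      const writesToExhaustTB = (spareTotalGB / spareUsedGB) * writesSoFarTB; // ~5,750 TB
      const userTBPerYear = 1.3 / 1.5;   // Geoff's 1.3TB over 1.5 years
      console.log(`~${Math.round(writesToExhaustTB)} TB to exhaust the spare area`);
      console.log(`~${Math.round(writesToExhaustTB / userTBPerYear)} years at that rate`);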

        • Buzzard44
        • 6 years ago

        No, that’s not right, but not because of your math. The sectors are now highly worn and will thus die at a much higher rate as they approach the end of their life cycle together. Your math would only work if, once you hit 1.2GB of reallocated sectors, you replaced all the NAND in the drive and repeated that each time you reallocated another 1.2GB. Still, your point remains valid: these drives are “good” for quite a lot of cycles. I have to put “good” in quotes, because I can’t call data corruption good, even if it’s not reproducible. Data corruption only has to happen once to ruin your day (week? month? year?).

          • ikjadoon
          • 6 years ago

          Ohhhhhhhhhhhhhh, I see. I didn’t know about the accelerating degradation.

          But, yeah, the data corruption is pretty worrying.

        • Diplomacy42
        • 6 years ago

        The failure rate of sectors might not follow a strictly exponential curve, but it’s close enough that I doubt your prediction will hold.

      • nanoflower
      • 6 years ago

      From what I’ve read, it seems like ZFS (or something like it) should be used for all storage, since bit errors can occur anywhere. Only ZFS seems to do the right thing about bit errors on the drive. It may not be the best possible solution, but it seems to be the best we have now.

    • Ryu Connor
    • 6 years ago

    You can accelerate your retention testing by baking the drive in an oven at 300°F (~150°C).

    [url=https://techreport.com/forums/viewtopic.php?p=1094618#p1094618<]The JPL has used this for testing the reliability of SSDs for retaining data as they age.[/url<] This would also shorten the downtime of your write tests.

      • MadManOriginal
      • 6 years ago

      A very realistic test! heh

        • Ryu Connor
        • 6 years ago

        The high temperature is used to simulate the ageing of the data sitting at room temp over an extended period of time.

        This is why the JPL used this particular test.

          • Wirko
          • 6 years ago

          Yes, retention is reduced at higher temperatures. What we don’t know is if the same holds true for write endurance. It could be that the opposite is true, and we would need something like heat-assisted flash recording then.

      • puppetworx
      • 6 years ago

      Wouldn’t that be cooking the results?

      • willmore
      • 6 years ago

      I don’t know why the thumbs down; this is a very standard way to test things in the industry. How do you think manufacturers spec their brand-new parts for 25 years of data retention? Time machines? No, they use ALT (Accelerated Life Testing). They raise the temperature and/or humidity to increase the rate at which many normal processes operate.

      They have to build a model of the temperature-specific ageing of the device (run ALT at different temperatures and compare error rates). This allows them to say useful things like “25 years at 25°C is the same as Y hours at 105°C”.

      Since FLASH is typically spec’ed as “After X writes per page, the FLASH will retain data for Y years at 25C”, this kind of testing is the only way they can reliably generate the data needed to say that with any confidence.

      That said, TR would need to know a lot about each and every individual FLASH chip in every drive to know at what temperature, and for how long, to bake the drives. That’s data I’m pretty sure they don’t have access to.
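
      For the curious, the usual back-of-the-envelope for this kind of ALT is an Arrhenius acceleration factor. The sketch below assumes an activation energy of 1.1 eV, a commonly used figure for charge-loss mechanisms in flash; the real value is part- and mechanism-specific, which is exactly the data TR wouldn’t have:

      // Arrhenius acceleration factor: how much faster a thermally activated
      // failure mechanism proceeds at the stress temperature vs. normal use.
      const BOLTZMANN_EV = 8.617e-5;   // Boltzmann constant, eV/K
      const EA_EV = 1.1;               // assumed activation energy (part-specific!)
      const accel = (tUseC: number, tStressC: number): number =>
        Math.exp((EA_EV / BOLTZMANN_EV) * (1 / (tUseC + 273.15) - 1 / (tStressC + 273.15)));
      const af = accel(25, 105);               // bake at 105C vs. use at 25C
      const hours = (25 * 365 * 24) / af;      // 25 years of retention, compressed
      console.log(`AF ~ ${Math.round(af)}; 25 years at 25C ~ ${hours.toFixed(0)} h at 105C`);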

    • derFunkenstein
    • 6 years ago

    Non-uniform SMART attributes, well, aren’t smart. Why isn’t this stuff standardized?

      • stdRaichu
      • 6 years ago

      [url<]http://xkcd.com/927/[/url<]

      I don't think there's any over-arching standards body for SMART like there is for many other protocols. Wikipedia says "Although an industry standard exists among most major hard drive manufacturers, there are some remaining issues and much proprietary 'secret knowledge' held by individual manufacturers as to their specific approach". Sadly, it's not just SSDs that have a hojillion non-standard SMART attrs (which one might find excusable, since they're comparatively new tech that requires different metrics from platter-based drives); most hard drives use at least a handful of different SMART counters for similar things. If you spin up smartctl or CrystalDiskInfo on a couple of your own drives, you'll see how many of the metrics are 'orribly different.
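
      If you want to see the mess for yourself, here is a minimal sketch (assuming Node with smartmontools installed) that shells out to smartctl and prints the raw attribute table; the IDs and names it returns are vendor-specific, which is the whole problem:

      // Dump the SMART attribute table via smartctl and print ID, name, and raw value.
      // Run with enough privileges to query the device, e.g. via sudo.
      import { execFileSync } from "node:child_process";

      const device = "/dev/sda"; // adjust to the drive you want to inspect
      const out = execFileSync("smartctl", ["-A", device], { encoding: "utf8" });
      for (const line of out.split("\n")) {
        const fields = line.trim().split(/\s+/);
        // Attribute rows start with a numeric ID, e.g. "  5 Reallocated_Sector_Ct ..."
        if (fields.length >= 10 && /^\d+$/.test(fields[0])) {
          const [id, name] = fields;
          console.log(`${id.padStart(3)}  ${name.padEnd(28)} raw=${fields.slice(9).join(" ")}`);
        }
      }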

        • Wirko
        • 6 years ago

        The nice thing about standards is that there are so many of them to choose from.

        – Andrew S. Tanenbaum

      • peartart
      • 6 years ago

      Because it is more profitable to split the market than enable measurable competition.

    • Peldor
    • 6 years ago

    Kudos to you for carrying out this test.
