Silent data corruption
The worst type of errors are those that go unnoticed and are not detected even by the disk firmware or the host operating system. This is known as "silent corruption". A real-life study of 1.5 million HDDs in the NetApp database found that, on average, 1 in 90 SATA drives will have silent corruption that is not caught by the hardware RAID verification process; for a RAID 5 system that works out to one undetected error for every 67 TB of data read.[27][28] There are also many error sources external to the disk itself: the disk cable might be slightly loose, the power supply might be flaky,[29] external vibrations such as a loud sound can interfere,[30] the Fibre Channel switch might be faulty,[31] and cosmic radiation can cause soft errors, among other things. In 39,000 storage systems that were analyzed, firmware bugs accounted for 5–10% of storage failures.[32] All in all, the error rates observed by a CERN study on silent corruption are far higher than one in every 10^16 bits.[33] The online retailer Amazon.com has confirmed these high data corruption rates.[34]
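To put those figures in perspective, here is a rough back-of-the-envelope sketch (not from the article above, just ordinary arithmetic) that turns a quoted bit-error rate into an expected number of errors for a given amount of data read. The 67 TB figure echoes the RAID 5 number above; the three rates are the commonly quoted 10^14/10^15/10^16 specs, assumed to hold exactly.

```python
# Back-of-the-envelope: expected silent errors when reading a given amount of
# data, at a few quoted bit-error rates. Assumes independent errors at exactly
# the quoted rate -- real drives and real studies (see above) can be far worse.

def expected_errors(data_tb, errors_per_bits):
    bits_read = data_tb * 1e12 * 8        # decimal TB -> bits
    return bits_read / errors_per_bits

for rate in (1e14, 1e15, 1e16):
    n = expected_errors(67, rate)         # 67 TB, per the RAID 5 figure above
    print(f"1 error per {rate:.0e} bits -> {n:.2f} expected errors in 67 TB")
```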
Captain Ned wrote:The problem with this thread is that some read the MTBF as a hard-coded point in time after which the drive will, with 100% certainty, irretrievably fail. That's not how statistics work, and MTBF is inherently statistical.
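To put a number on that: under the usual constant-failure-rate (exponential) model, an MTBF converts into a probability of failure over some window, not a drop-dead date. A minimal sketch, using a purely illustrative 1,000,000-hour MTBF:

```python
import math

def failure_probability(hours, mtbf_hours):
    """P(a given drive fails within `hours`), constant-failure-rate model."""
    return 1 - math.exp(-hours / mtbf_hours)

# One year of power-on hours against an illustrative 1,000,000-hour MTBF:
# roughly a 0.9% chance of failure per drive-year -- a rate, not an expiry date.
print(f"{failure_probability(8760, 1_000_000):.2%} per drive-year")
```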
SecretSquirrel wrote:Since most people (TR gerbils are not most people) are not really ever going to build more than a four-drive array, there is fundamentally no difference in disk overhead between RAID 10 and RAID 6.
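The overhead point is easy to check with simplified capacity math (assuming RAID 10 keeps half the drives and RAID 6 gives up two drives to parity): at four drives the two are identical, and they only diverge as the array grows.

```python
def usable_drives(n, level):
    # Simplified: RAID 10 keeps half the drives; RAID 6 loses two to parity.
    return n // 2 if level == "raid10" else n - 2

for n in (4, 6, 8, 12):
    print(f"{n} drives: RAID 10 -> {usable_drives(n, 'raid10')} usable, "
          f"RAID 6 -> {usable_drives(n, 'raid6')} usable")
```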
Ryu Connor wrote:What can be said about the probabilities that exist that hasn't been said already?
What more can the math of odds really tell us? They give a % value that varies depending on a number of factors, but how do you quantify that into something more than just a dice roll? The percentages are fascinating theory, but averages manifest in life in rather unpredictable ways. You could argue that, due to the way humans perceive life and remember events, all odds, regardless of their actual %, distill into a simple dichotomy: did happen, didn't happen.
Scrotos wrote:Ryu, you're assuming 10^16.
If you're really concerned about this kind of stuff, I would think you would take a worst-case scenario and plan for that. You seem to be leaning in the non-paranoid everything-is-fine direction about storage arrays. Is your plan to just go to tape whenever an array fails? Can you handle that kind of downtime while a slow restore happens? If you do disk-to-disk backups, how is the other "disk" protected? Do you just run a bunch of redundant arrays in a SAN? If you're not concerned about this, why are you posting in the thread?
I don't know why you locked the thread.
I took a break from it for the last week because I was tired of the pedantic nitpicking about the performance of mirrors.
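Since the 10^15-versus-10^16 question keeps coming up, here is a rough worst-case/best-case comparison, just the standard URE arithmetic applied to an illustrative 12-bay array of 2 TB drives: the odds of a RAID 5 rebuild reading every surviving drive without hitting an unrecoverable read error swing dramatically with the spec you assume.

```python
import math

def rebuild_survival(surviving_drives, drive_tb, ure_bits):
    # P(reading every surviving drive in full with no URE) during a rebuild,
    # assuming independent errors at exactly the quoted spec.
    bits_read = surviving_drives * drive_tb * 1e12 * 8
    return math.exp(-bits_read / ure_bits)

# Illustrative case: 12-bay array of 2 TB drives, one drive lost, RAID 5 rebuild.
for spec in (1e14, 1e15, 1e16):
    p = rebuild_survival(11, 2, spec)
    print(f"URE 1 in {spec:.0e} bits -> {p:.1%} chance of a clean rebuild")
```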
Scrotos wrote:Ok, I have no idea who uses Areca RAID controllers in business. Maybe for custom-built stuff along with the LSI MegaRAID controllers? Most of the big vendors I've dealt with, Dell and HP, have their own solutions, the PERC and the SmartArray lines. That being said, here's what I have found. I will post the results/raw numbers/graphs when I get a chance to send the spreadsheet I've been working on to my home address, so I can upload it to my geoblocked-from-work web space for y'all to download.
morphine wrote:I don't get this. We're still discussing 10^15 over 10^16 instead of best practices?
SecretSquirrel wrote:IIRC, the PERCs are rebadged controllers from an OEM. Dunno about the SmartArray stuff. Most of it seems to work its way back to LSI in the end.
Convert wrote:Quite right! LSI pretty much owns the market; Intel, HP, Dell, IBM etc. all seem to tap them for controllers. Here is a list I often refer to when I'm curious what I'm getting in a server: http://forums.servethehome.com/raid-con ... odels.html
Convert wrote:Well, if I may say, both numbers are valid estimates for different types of drives. It doesn't look like Ryu was cherry-picking anything on purpose; if anything, he was bringing attention to the fact that 10^15 drives also exist, which is equally important to the discussion. It's another data point for those still interested in it. Those not interested can ignore it, or not...
Scrotos wrote:I'm looking to get a storage solution with 12+ bays that I can grow over time. Obviously, if the array will run into problems rebuilding, I'd spec for more bays with smaller drives instead of fewer bays with larger drives. That's the primary reason I started investigating the whole "RAID 5/6 is dead due to large drives" claim.
Waco wrote:It's a 3 TB array?
SecretSquirrel wrote:Scrotos wrote:I'm looking to get a storage solution with 12+ bays that I can grow over time. Obviously, if the array will run into problems rebuilding, I'd spec for more bays with smaller drives instead of fewer bays with larger drives. That's the primary reason I started investigating the whole "RAID 5/6 is dead due to large drives" claim.
My one comment here would be not to "grow" into it. Even with a good controller, expanding an existing RAID group is somewhat iffy. It is far more dangerous for RAID 5/6 than a rebuild, as it can potentially move every block on every disk to rebalance the load across all disks -- it depends on how the array calculates and stores the parity information. It is also extremely painful and slow. My recommendation would be to fill all 12 slots and let your budget set the size of the disks. Either that, or go RAID 10 and grow in four-disk increments, with each RAID group being independent and a separate volume/network share. If you are really concerned about having some data on RAID 6, you could also set up a four-disk RAID 10 group and an eight-disk RAID 6 group. Put your really important data on the RAID 10 group and bulk data on the RAID 6.
--SS
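For what it's worth, here is a quick usable-capacity comparison of the layouts SS describes, using the same simplified capacity math as before and an illustrative 2 TB drive size:

```python
def usable_tb(drives, drive_tb, level):
    # Simplified: RAID 10 keeps half the drives; RAID 6 loses two to parity.
    return (drives // 2 if level == "raid10" else drives - 2) * drive_tb

drive_tb = 2  # illustrative drive size
print("12-disk RAID 6:                ", usable_tb(12, drive_tb, "raid6"), "TB usable")
print("3 x 4-disk RAID 10:            ", 3 * usable_tb(4, drive_tb, "raid10"), "TB usable")
print("4-disk RAID 10 + 8-disk RAID 6:",
      usable_tb(4, drive_tb, "raid10") + usable_tb(8, drive_tb, "raid6"), "TB usable")
```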
Scrotos wrote:
What about expanding a RAID 10? It's supposed to be more "safe" because there are no parity calculations, ya?
Either way it seems like "don't worry about rebuilding 2 TB drives in a RAID 6" turned into "oh geez don't expand a RAID 6 with 2 TB drives, that's askin' fer trouble!" I assume it's more worser if you're at capacity when you grow rather than if you're at 300 GB and then add more drives.