Topinio wrote:
just brew it! wrote: RAID 5 is deprecated for newer high-capacity drives, due to the non-trivial risk of a second drive failure during an array rebuild (since the rebuild takes so long and involves reading so much data).
That's been true for a decade. By my maths and the specs, rebuilding this array has a theoretical 72.28% chance of failure, assuming there's a hot spare in the chassis.
(16 TB needs reading: 128 Tb, i.e. 1.28E14 b. The probability of a given bit being readable is (1-(1/UBER)), and since each bit read is an independent event, the total is exponential in the number of bits. With a consumer UBER of 1E14 and 1.28E14 bits to read, 1 - (1-(1/1E14))**1.28E14 = 72.17% chance of an uncorrectable read error on that degraded array. With RAID-5, a URE during the rebuild process is catastrophic.
The chance of one of the other drives failing during the rebuild is much smaller but non-negligible. At an assumed 130 MB/s, rebuild time is 34 h. An MTBF of 1E6 h is an AFR of 8.7E-3, which translates into an hourly failure risk of about 1 in 1E6 and a daily failure risk of 0.0024%. 34 h carries a failure risk of 0.0034% per drive, which is a failure risk of 0.12% for either of the pair.
72.17 + 0.12 ≈ 72.28%; if it takes 5 d to get a spare, that's up to 74.50%, &c.)
I don't think this is wrong, though I'm sure someone will disagree.
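(Topinio's arithmetic is easy to reproduce. Here's a minimal Python sketch using the figures assumed above: a 1E14-bit consumer UBER, 16 TB to read, 130 MB/s, 1E6 h MTBF, two surviving drives. Note the pair-failure term it prints is the exact exponential version, which comes out somewhat smaller than the 0.12% quoted.)

    import math

    UBER_BITS = 1e14          # assumed consumer spec: one URE per 1e14 bits read
    BITS_READ = 1.28e14       # 16 TB to read for the rebuild = 1.28e14 bits

    # P(at least one URE) = 1 - (1 - 1/UBER)^bits; expm1/log1p keep the precision
    p_ure = -math.expm1(BITS_READ * math.log1p(-1.0 / UBER_BITS))
    print(f"URE during rebuild: {p_ure:.2%}")                  # ~72.2%

    # Mechanical loss of a surviving drive, assuming a constant hazard of 1/MTBF
    MTBF_H = 1e6
    REBUILD_H = 16e12 / 130e6 / 3600                           # ~34 h at 130 MB/s
    p_pair = -math.expm1(-2 * REBUILD_H / MTBF_H)              # either of the pair
    print(f"drive failure during {REBUILD_H:.0f} h rebuild: {p_pair:.4%}")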
just brew it! wrote:I believe your analysis is more or less correct from a theoretical standpoint.
Where it diverges from reality is that the actual uncorrected error rates we see from real drives tend to be substantially lower than the UBER would imply. So the probability of a rebuild failure is lower in practice, but still non-trivial.
My recommendation in this case would've been to go with a RAID-6 array of smaller devices.
Topinio wrote:
just brew it! wrote: I believe your analysis is more or less correct from a theoretical standpoint.
Where it diverges from reality is that the actual uncorrected error rates we see from real drives tend to be substantially lower than the UBER would imply. So the probability of a rebuild failure is lower in practice, but still non-trivial.
My recommendation in this case would've been to go with a RAID-6 array of smaller devices.
Yes on the real-world BER, but if it were 1E-15 the vendors would spec it as such, because they already do on the drives that can manage it.
1E-15 drives in this set-up imply a 12% rather than 72% chance of a URE on rebuild, so that's a window of 12-72%...
On RAID-6, it has the same UBER, so it's the same chance of a read failure. If a drive goes in a 4-drive RAID-6 of these, you still need to read 1.28E14 bits to write the new disk's 6.4E13 bits. Still a 12% to 72% chance you can't do so -- you've just got another disk to try from when the URE happens.
Edit: If we go for a first approximation that the parity gives you double the chances of reading a bit (assuming independence of the parity and data), and work on the basis that there's a RAID controller which will mark the unreadable bit bad and remap unused space, then with any 16 TB usable RAID-6 on 1E-14 disks I make it only a 53% chance of being able to read the array to rebuild, i.e. a 47% chance of failure. With a UBER of 1E-15, the failure chance drops to 6%.
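(For anyone wanting to reproduce that window and the RAID-6 first approximation, a minimal Python sketch. Halving the effective error rate is the crude stand-in for "double your chances" described above, not a real model of parity retries.)

    import math

    BITS_READ = 1.28e14                        # 16 TB of surviving data to read

    def p_read_fail(uber_bits, bits=BITS_READ):
        # P(at least one URE while reading `bits` bits at 1 error per uber_bits)
        return -math.expm1(bits * math.log1p(-1.0 / uber_bits))

    for uber in (1e14, 1e15):
        plain = p_read_fail(uber)              # RAID-5: nowhere else to turn on a URE
        approx = p_read_fail(2 * uber)         # RAID-6 proxy: halved effective error rate
        print(f"UBER 1-in-{uber:.0e}: RAID-5 {plain:.0%} fail, RAID-6 approx {approx:.0%} fail")

(Running it gives 72%/47% at 1E-14 and 12%/6% at 1E-15, matching the 53%-readable and 6% figures above.)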
Waco wrote:
Topinio wrote: Data (or information based on internal vendor datasets) should show better-than-spec performance when a spec is a limit, but surely if it were massively better across the entire ensemble they'd change the published spec so as not to make themselves look worse than they are. Like they do with e.g. some drives having an MTBF of 1.2E6 h while others have 1E6 h.
That's the theory, but like I said, in practice the UBER numbers are *far* lower than rated (I wish I could quantify this further, but much of that data is NDA).
Topinio wrote:
If the former, i.e. if the 1E-14 ones are actually 1E-15 or 1E-16 or whatever, then it seems nonsensical to me: why on earth would every drive manufacturer sell some drives spec'd at 1E-14 and others at 1E-15?
Duct Tape Dude wrote:
Topinio wrote: If the former, i.e. if the 1E-14 ones are actually 1E-15 or 1E-16 or whatever, then it seems nonsensical to me: why on earth would every drive manufacturer sell some drives spec'd at 1E-14 and others at 1E-15?
Market segmentation! Configure the firmware to have a drive retry failed reads twice as much and increase reliability by "10x" and BAM! 30% premiums for the same drives with a different color label. Pulling an Intel is free money.
just brew it! wrote:The negative effects of UBER should be mitigated by RAID systems that do a periodic scrub. Some percentage of sectors which had undetected write failures or which have gone bad after being written get detected and rewritten with good data before a drive failure forces a replacement and rebuild. I'm sure the raw UBER specs don't take this into account.
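(A toy Python illustration of the scrubbing effect; the latent-error rate and the 3-year no-scrub window are made-up numbers, chosen only to show how the scrub interval bounds the accumulation window.)

    # Toy model: latent bad sectors accrue at a constant rate, and a scrub
    # detects and rewrites them. What bites during a rebuild is how many have
    # piled up since the last scrub, so the scrub interval caps the exposure.
    LATENT_PER_YEAR = 10.0                     # hypothetical accrual rate per drive
    NO_SCRUB_YEARS = 3.0                       # hypothetical drive age if never scrubbed
    # average windows in days; a failure lands, on average, mid-scrub-cycle
    schedules = {"never": NO_SCRUB_YEARS * 365, "90 d": 45, "30 d": 15, "7 d": 3.5}
    for name, avg_window_days in schedules.items():
        expected = LATENT_PER_YEAR * avg_window_days / 365
        print(f"scrub {name:>5}: ~{expected:.2f} latent errors at rebuild time")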
Krogoth wrote:Modern HDDs are pretty quiet for the most part, and have been ever since they moved to fluid bearings. You'll have better luck with noise-damping kits/mounts if you feel modern HDDs are too loud for your tastes.
Vhalidictes wrote:As far as reliability goes... nothing is reliable. There, I've said it. Anything over 1TB and you're going to have a horrible failure rate, it's all RNG.
HERETIC wrote:
Vhalidictes wrote: As far as reliability goes... nothing is reliable. There, I've said it. Anything over 1TB and you're going to have a horrible failure rate, it's all RNG.
Agree that spinning rust can be a cr@pshoot / luck of the draw...
Don't agree with your 1TB limit.
My most reliable/longest-serving are some 2 TB Samsung HD204 drives.
Bauxite wrote:I also think doing weird crap like a 24-bay RAID 6/Z2 that performs like a single drive with terrible latency is a clear sign of a penny-wise, pound-foolish person.
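(The rule of thumb behind that complaint: a RAID-Z2 vdev delivers roughly one disk's worth of random IOPS regardless of width, so how you split the 24 bays into vdevs sets pool IOPS. A back-of-envelope Python sketch, assuming ~75 IOPS per 7200 RPM disk:)

    # Rule of thumb: one RAID-Z2 vdev ~ one disk of random IOPS, whatever its width
    DISK_IOPS = 75                   # assumed figure for a 7200 RPM drive
    BAYS = 24
    for width in (24, 12, 8, 6):
        vdevs = BAYS // width
        print(f"{vdevs} x {width}-wide Z2: ~{vdevs * DISK_IOPS} random IOPS")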
blahsaysblah wrote:Recording method is perpendicular btw.
Seagate Barracuda ST4000LM024 full data sheet
edit: forgot the reason for the post: as mentioned by some, a 2.5" drive for storage is purrrtty good, besides being quiet (though likely most new 3.5" drives aren't bad).
Waco wrote:Modern SMR tech is a lot more tolerable than the 5 TB Archive drive that debuted first. Those things are pigs and only useful if you can *really* tailor your workflow.
Duct Tape Dude wrote:
Waco wrote: Modern SMR tech is a lot more tolerable than the 5 TB Archive drive that debuted first. Those things are pigs and only useful if you can *really* tailor your workflow.
I dunno, these guys found over 200s latencies: http://www.storagereview.com/seagate_ar ... review_8tb
That's 200 seconds, not milliseconds.
I'm sure SMR is fine for light bursty writes and any sort of reads, but as soon as heavy writes hit (ex: RAID rebuild) it stresses the drive a lot to the point of unusable latency, and I'd start worrying about an early death.
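(A plausible mechanism for latencies like those, sketched as a toy Python model: drive-managed SMR absorbs writes into a persistent cache region and destages them to shingled zones in the background, so sustained writes that overrun the cache collapse to the destage rate. All three figures below are hypothetical.)

    # Toy model of a drive-managed SMR media cache. Incoming writes land in a
    # fast cache region; a background process destages them to shingled zones.
    CACHE_GB = 20.0        # assumed persistent-cache capacity
    HOST_MBPS = 150.0      # assumed host write rate while the cache has room
    DESTAGE_MBPS = 15.0    # assumed shingled rewrite (destage) rate

    fill_seconds = CACHE_GB * 1024 / (HOST_MBPS - DESTAGE_MBPS)
    print(f"cache overruns after ~{fill_seconds:.0f} s of sustained writes;")
    print(f"throughput then falls from {HOST_MBPS:.0f} to {DESTAGE_MBPS:.0f} MB/s")
    print("and queued requests can wait whole seconds, not milliseconds")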
morphine wrote:Well, it's not like anyone'd use an SMR drive in a RAID array. Those things are meant for archival purposes, and I dare say, they suit that role just fine.
just brew it! wrote:That review is over 2 years old. Early drive-managed SMR firmware did indeed suck.
Waco wrote:
just brew it! wrote: That review is over 2 years old. Early drive-managed SMR firmware did indeed suck.
I take some pride in having helped them optimize that firmware, if only from a "dammit make these things not suck or I won't buy them, here's a few simple tricks" perspective.
just brew it! wrote:
(As odd as it may seem, we actually had to do some additional work to make our system capable of economically scaling down to sub-petabyte use cases!)