DragonDaddyBear wrote:That's very interesting. Thanks for sharing. If I understand this, then it sounds like there is no lock put on the data set before it is written to disk, allowing for a race condition. That seems like a bug that could cause some issues. I apologize if I missed it, but what file system are you using?
chuckula wrote:I'd be curious to see if the race condition is in the filesystem itself or at the MD RAID layer. Based on his analysis it sounds like it might be agnostic to the underlying filesystem.
morphine wrote:Niiiice, that's some impressive sleuthing and an easy-to-understand explanation. I second the idea of JBI doing an educational show.
Forge wrote:These are why the fs journal exists. In case of a power cut mid-write, the journal can be replayed to get the fs to a sane state. It would also get the md into a sane state as well.
Forge wrote:Use ZFS. It's more better.
just brew it! wrote:Forge wrote:Use ZFS. It's more better.
Yes, ZFS is on my long (and ever growing) "I need to learn about that" list.
just brew it! wrote:By default ext4 only journals meta-data, not file contents. You can configure it to journal everything, but this results in a substantial performance hit.
Forge wrote:Spare machine of any config, at least 3-4 disks of a size (or real close) and an afternoon. FreeNAS makes a great start, and once you know the terms and concepts, ZOL (ZFS On Linux) will let you apply it anywhere the disks are available. It's very good stuff, I trust a lot of my most important files to it (family photos and such).
Topinio wrote:just brew it! wrote:By default ext4 only journals meta-data, not file contents. You can configure it to journal everything, but this results in a substantial performance hit.
I thought that (by default, data=ordered) this wasn't a problem because the writes are in transactions with the data blocks written to storage first, just before the metadata?
Topinio wrote:Forge wrote:Spare machine of any config, at least 3-4 disks of a size (or real close) and an afternoon. FreeNAS makes a great start, and once you know the terms and concepts, ZOL (ZFS On Linux) will let you apply it anywhere the disks are available. It's very good stuff, I trust a lot of my most important files to it (family photos and such).
This. I've run ZFS on several Solaris fileservers at work since 2006 (on a StorageTek 3510 FC array!) and it's the business. FreeNAS at home, in a ProLiant MicroServer
Waco wrote:Nice sleuthing! It doesn't surprise me that mdraid isn't perfect, it also doesn't surprise me that filesystems designed for speed over integrity might leave the underlying disks in an inconsistent state. If this was any other situation you'd be guessing at which disk is "correct", this is why 3-way mirrors are the only mirrors I tend to run.
DragonDaddyBear wrote:So, I read your write up one more time and have what may be a stupid question. Why doesn't the job just read the file system and scrub those sectors? Would it not be faster and avoid these kinds of harmless alarms?
just brew it! wrote:DragonDaddyBear wrote:So, I read your write up one more time and have what may be a stupid question. Why doesn't the job just read the file system and scrub those sectors? Would it not be faster and avoid these kinds of harmless alarms?
I suppose it could do that (not sure why you think it would be faster though). But in the general case, when there's a discrepancy on a 2-device RAID-1 the RAID subsystem really has no way of determining which block is correct. For this particular case, the RAID subsystem could assume that the block with more data in it is the "good" one; but it doesn't have any way of knowing whether we're in this situation or not, without figuring out whether the block is currently part of a valid file or on the free list.
An interesting alternative approach would be to have an option for ext4 to do a background wipe of free blocks. This would eliminate the RAID-1 mismatch issue, and have the side benefit of enhancing security by making it impossible to recover contents of deleted files using standard forensic techniques.
As I also noted above, this issue probably doesn't exist for situations where TRIM is enabled. As more of the world shifts to solid state storage, this race condition -- which is already more of a curiosity and/or a caveat for paranoid sysadmins than a genuine problem -- will become even less important.
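A rough sketch of the "is the block on the free list?" check described above, for anyone who wants to see the idea in code. This is not how md or ext4 actually expose this information: the function name `classify_mismatches` and the plain lists of block numbers are hypothetical, and getting a real free-block list out of a mounted filesystem is left out entirely.

```python
def classify_mismatches(mismatched_blocks, free_blocks):
    """Split mismatched block numbers into (harmless, suspect).

    A mismatch on a block the filesystem considers free only affects
    deleted data, so it is treated as harmless. A mismatch on an
    allocated block may belong to a live file and needs a closer look.
    Both inputs are hypothetical lists of block numbers.
    """
    free = set(free_blocks)
    harmless = sorted(b for b in mismatched_blocks if b in free)
    suspect = sorted(b for b in mismatched_blocks if b not in free)
    return harmless, suspect


# Made-up block numbers, purely for illustration.
harmless, suspect = classify_mismatches([10, 42, 97], free_blocks=[10, 97, 300])
print("harmless:", harmless)
print("suspect:", suspect)
```

The point of the sketch is only that the decision needs filesystem knowledge, which the RAID layer does not have on its own.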
DragonDaddyBear wrote:I don't suppose you feel like writing a cron job for testing that background wipe with something like zerofree on Saturday. I think it would be rather interesting to see the results.
Waco wrote:I also realize due to this that now I get to check mirrors at work on systems I didn't build.
Waco wrote:I appreciate the effort you put into this!
SuperSpy wrote:just brew it! wrote:But in the general case, when there's a discrepancy on a 2-device RAID-1 the RAID subsystem really has no way of determining which block is correct.
This is precisely why I started putting everything on ZFS (checksum all the things!). If the drive falls off the face of the earth, sure md can fix it. But if the drives can't agree, you/md basically have to flip a coin to decide who to trust.
I need to do a bit of research on Windows software RAID, because I tend to use that on important machines of the Microsoft variety, and it is probably even more susceptible to such disagreements as it seems to do a full resync every time the box boots with an array marked dirty.
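To make the "checksum all the things" point concrete, here is a minimal sketch of how a stored checksum breaks the tie when two mirror copies disagree. It is an illustration of the general idea only, not ZFS's implementation: SHA-256 is used as a convenient stand-in, and `checksum` and `pick_good_copy` are hypothetical helper names.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Checksum of a block; SHA-256 is an assumption made for this sketch."""
    return hashlib.sha256(block).hexdigest()


def pick_good_copy(copies, expected):
    """Return the first copy whose checksum matches `expected`, else None."""
    for copy in copies:
        if checksum(copy) == expected:
            return copy
    return None


original = b"family photos"
stored = checksum(original)       # recorded when the block was written

disk_a = b"family ph\x00tos"      # this copy was silently corrupted
disk_b = original                 # this copy is intact

good = pick_good_copy([disk_a, disk_b], stored)
print(good == original)
```

Without the stored checksum, the two copies simply differ and there is nothing to say which one to trust, which is the coin flip described above.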
just brew it! wrote:Waco wrote:I also realize due to this that now I get to check mirrors at work on systems I didn't build.
I wouldn't stress over it if I was you. As noted above, it seems to be harmless from a data integrity standpoint, since it only affects deleted files. The main effects of it are that a scrub pass may report mismatches, and it complicates the investigation if you're looking into a suspected RAID corruption issue.
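For anyone who wants to see what a scrub pass reported, md exposes a mismatch counter in sysfs at `/sys/block/<array>/md/mismatch_cnt`. Below is a small sketch that reads it. The array name `md0` is an assumption, the helper name `read_mismatch_cnt` is made up for this example, and the `sysfs_root` parameter exists only so the function can be pointed at a test directory.

```python
from pathlib import Path


def read_mismatch_cnt(md_name="md0", sysfs_root="/sys/block"):
    """Return the mismatch count md recorded for the given array.

    `md_name` defaults to "md0", which is an assumption; substitute the
    name of your own array. The value read here reflects the most recent
    check or repair pass on that array.
    """
    path = Path(sysfs_root) / md_name / "md" / "mismatch_cnt"
    return int(path.read_text().strip())
```

Calling `read_mismatch_cnt("md0")` on a box with an md array returns an integer. As discussed above, a nonzero value on a RAID-1 is not by itself proof of a failing drive.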
Waco wrote:just brew it! wrote:I wouldn't stress over it if I was you. As noted above, it seems to be harmless from a data integrity standpoint, since it only affects deleted files. The main effects of it are that a scrub pass may report mismatches, and it complicates the investigation if you're looking into a suspected RAID corruption issue.
That's actually the part I'm worried about - there are some admin teams that will just replace a drive on any sign of a problem, so I'd like to skip the unneeded hardware replacements and subsequent rebuilds.