TR Forums

Waco · Wed Aug 09, 2017 9:52 pm

just brew it! wrote:
Ahh, gotcha. Always gotta take the human element into account.

But in a situation like this, which drive would they replace? Just pick one at random? (I can see it now... "We replaced one of the drives, and the errors went away after the rebuild, so we must have guessed right!" )

In your shoes, I'd probably just let 'em do the drive swaps. As long as the other drive doesn't fail during the rebuild, it should be harmless...

Exactly.

I don't want them to do the drive swaps simply because it's a chance for something that my infrastructure depends on to crash...and then I have to rely on them getting it going again quickly (unlikely).

SuperSpy · Thu Aug 10, 2017 11:04 am

What are the odds that like 24 hours after talking here about the uselessness of Windows soft-RAID, I have a box at work drop a disk that's part of a mirror and have Windows BSOD over a missing boot volume, completely ignoring the mirror?

Don't answer that. :evil:

Forge · Thu Aug 10, 2017 11:56 am

blahsaysblah wrote:
Just FYI, to get FreeNAS 11.0-U2 working under Windows 10 Hyper-V, i had to create Gen 1 host with Guest services off. Remember to offline disks to make available for Hyper-V disk pass through if you want to give ZFS real disks(SMART,...) instead of VHDs.

I'm sorry, I threw up in my mouth a little. Why exactly would we do this? Lab/practice? You can virtualize FreeNAS, but it loses a lot of the advantages. Also keep in mind that FreeNAS machines perform massively better with enough ram, and "enough" is usually more than any desktop or laptop is likely carrying.

Thu Aug 10, 2017 12:23 pm

Forge wrote:
blahsaysblah wrote:
Just FYI, to get FreeNAS 11.0-U2 working under Windows 10 Hyper-V, i had to create Gen 1 host with Guest services off. Remember to offline disks to make available for Hyper-V disk pass through if you want to give ZFS real disks(SMART,...) instead of VHDs.

I'm sorry, I threw up in my mouth a little. Why exactly would we do this? Lab/practice? You can virtualize FreeNAS, but it loses a lot of the advantages. Also keep in mind that FreeNAS machines perform massively better with enough ram, and "enough" is usually more than any desktop or laptop is likely carrying.

While I wasn't nearly as offended as you apparently were, my thought was that it sounded potentially useful as a learning exercise, but not something you'd want to turn loose in a production environment.

Forge · Thu Aug 10, 2017 12:30 pm

Yeah, same idea, more politely stated.

For anyone playing along at home and confused by my disgust, a virtualized FreeNAS can't leverage ECC to keep ram clean, ram is heavily used for caching in FreeNAS, and corrupted cache means you can't trust the disk contents, which invalidates one of the big selling points of ZFS.

For a test or a home lab or just experimentation it's fine, but I wouldn't put anything I cared about onto a FreeNAS instance set up as a virtual guest.

Thu Aug 10, 2017 12:48 pm

Forge wrote:
Yeah, same idea, more politely stated.

For anyone playing along at home and confused by my disgust, a virtualized FreeNAS can't leverage ECC to keep ram clean, ram is heavily used for caching in FreeNAS, and corrupted cache means you can't trust the disk contents, which invalidates one of the big selling points of ZFS.

That's a function of the underlying hardware, not whether or not the system is virtualized. If the host has ECC RAM (and support for it in the memory controller), then the RAM will be ECC protected whether you're running natively or as a guest. (And yes, FWIW my home server and my primary desktop both use ECC RAM.)

Forge wrote:
For a test or a home lab or just experimentation it's fine, but I wouldn't put anything I cared about onto a FreeNAS instance set up as a virtual guest.

It's just more stuff to go wrong. If we were talking about an enterprise-grade virtualization solution like ESXi instead of Windows 10 there'd be more of a case for it.

Waco · Thu Aug 10, 2017 3:32 pm

Forge wrote:
Also keep in mind that FreeNAS machines perform massively better with enough ram, and "enough" is usually more than any desktop or laptop is likely carrying.

I disagree with this, fundamentally. They can perform massively better *if your workload fits into the ARC*. 99% of home users are unlikely to benefit from massive amounts of DRAM just to feed a large ARC. So, sure, you can get great benchmark results if you throw RAM at a ZFS box, but I don't agree that it accurately represents the performance you'll get in the real world for typical NAS workloads (or even atypical ones).

Case in point: I artificially restrict the ARC to 4 GB on many many many installs that run at > 8 GB/s sustained read/write speeds. The ARC isn't magic, it just covers up some classes of I/O problems through great caching.

shodanshok · Thu Aug 10, 2017 4:00 pm

just brew it! wrote:
I learned something fascinating about software RAID-1 on Linux yesterday. This all grew out of some messages I saw logged by the monthly scrub pass that Linux does on all active RAID arrays (it runs a scrub on the first Sunday of every month in the middle of the night).

The mystery began when I noticed that the scrub had logged a dozen or so mirror mismatches. Yet the array status still showed as healthy. My first thought was basically, "WTF? Is my array getting corrupted? Is a drive failing? And why does the scrub show mismatches, but the array still shows as healthy?"

The messages logged by the scrub did not give actual block addresses, so the first task was to figure out if the mismatches were real, and identify the block addresses associated with them. I wrote a short program that calculates MD5 hashes for each 1MB chunk of a raw partition, ran this program against both devices, and diffed the output to identify the 1MB spans where the mismatches were located. Then I ran the same program with a 4KB chunk size over just those 1MB ranges, to get a list of disk block offsets to each discrepancy.
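For anyone who wants to try the same approach, here is a minimal sketch of that kind of chunk-hash comparison. It is not the original program: the function names, defaults, and the `start`/`length` parameters are assumptions made for illustration. Chunks covering md metadata may legitimately differ between members.

```python
import hashlib

def chunk_hashes(path, chunk_size=1024 * 1024, start=0, length=None):
    """Return [(offset, md5_hex), ...] for each chunk of a file or raw device."""
    hashes = []
    with open(path, "rb") as f:
        f.seek(start)
        offset = start
        remaining = length
        while remaining is None or remaining > 0:
            want = chunk_size if remaining is None else min(chunk_size, remaining)
            data = f.read(want)
            if not data:
                break  # end of file/device
            hashes.append((offset, hashlib.md5(data).hexdigest()))
            offset += len(data)
            if remaining is not None:
                remaining -= len(data)
    return hashes

def mismatched_offsets(path_a, path_b, chunk_size=1024 * 1024, start=0, length=None):
    """Offsets of chunks whose hashes differ between the two devices."""
    a = chunk_hashes(path_a, chunk_size, start, length)
    b = chunk_hashes(path_b, chunk_size, start, length)
    return [off for (off, h_a), (_, h_b) in zip(a, b) if h_a != h_b]
```

A first pass with the default 1MB chunk size narrows things down; a second pass with `chunk_size=4096`, `start` set to a mismatched offset, and `length=1024 * 1024` gives the 4KB block offsets within that span. The device paths you pass in (the two member partitions) depend on your own setup.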

Upon examination of the contents of the suspect blocks, I discovered that the two drives of the mirror always contained similar, but not identical data. In every case I examined, the block from drive A would have some non-zero data, followed by zeros. The corresponding block from drive B would also have some non-zero data followed by zeros, but the point at which the zeros started would be different. The non-zero data always matched, up to the point where the zeros started in the "shorter" block.
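A rough sketch of how that pattern could be checked programmatically for a pair of blocks already read from the two drives. The helper names are hypothetical, and this only tests for the "matching data, then zeros starting at different points" shape described above.

```python
def zero_tail_start(block):
    """Index where the trailing run of zero bytes begins (len(block) if none)."""
    return len(block.rstrip(b"\x00"))

def describe_pair(block_a, block_b):
    """Summarize where each block's data ends and whether the shared prefix matches."""
    end_a = zero_tail_start(block_a)
    end_b = zero_tail_start(block_b)
    shorter = min(end_a, end_b)
    return {
        "data_end_a": end_a,
        "data_end_b": end_b,
        # True when the non-zero data agrees up to the end of the "shorter" block
        "prefix_matches": block_a[:shorter] == block_b[:shorter],
    }
```

If `prefix_matches` is true and the two `data_end` values differ, the pair fits the pattern of one drive holding a slightly newer version of a partially filled block.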

I then used debugfs to examine the mounted file system, and discovered that all of those mismatched blocks corresponded to free space in the file system. None of the mismatched blocks contained data belonging to a valid file.

After doing some Googling and reading about Linux's RAID-1 implementation, I believe I've figured out what happened. If you have an application which is appending to a file piecemeal, you can have a race condition where the file system decides to commit a block from the OS's cache to physical media just as the application is about to append additional data. Since the writes to the two drives of the RAID-1 mirror don't occur at exactly the same instant, one drive can get a slightly newer version of that block than the other one. Normally this discrepancy would not persist for long, since the second application write marks the cache block as dirty (again), and this will cause another physical write to get queued up, committing the updated (and consistent) data for that block to both devices in the array.

But what happens if the file gets deleted before this second physical write gets queued? Well, the corresponding blocks in the OS's disk cache get dropped, the second physical write never happens, and the last block of the (now deleted) file is left in an inconsistent state on the underlying RAID media!

Any application that creates temporary files which are then deleted a few seconds later could potentially hit this hole. But since the mismatch only ever happens with data belonging to deleted files, it is "mostly harmless". It may even result in a small performance gain in certain situations, since data belonging to temporary files which are created and quickly deleted never needs to be flushed to physical media.

It certainly has the potential to cause confusion and panic for sysadmins who don't understand that the RAID mismatches are "normal", though. In effect, it results in "false positives" from the scrub pass, since the scrub pass does not know anything about the file system sitting on top of the RAID array.

I also confirmed my theory by writing zeros to all of the free space on the mounted file system. After doing this, all of the RAID mismatches disappeared.

Bottom line: Linux RAID-1 interacts with the file system in non-obvious ways. The upshot of this is that under certain conditions, free space on the file system may have inconsistent data on the underlying RAID devices.

Edit: Corrected a typo and clarified a couple of things.

Yes, this is a known behavior of Linux MD RAID1/10. Basically, you can have mismatches due to:
a) a RAID1/10 swap partition;
b) temporary files which are "half written" to the two different RAID1/10 legs.

This is ultimately due to the zero-copy behavior of these RAID levels. In short, if a memory page is changed between the first writeout (to the first disk) and the second one (to the second disk), a mismatch occurs. If the page with mismatched data is then invalidated (i.e. the file is deleted), the mismatch remains lurking on the disks' platters.

From the man page:

However on RAID1 and RAID10 it is possible for software issues to cause a mismatch to be reported. This does not necessarily mean that the data on the array is corrupted. It could simply be that the system does not care what is stored on that part of the array - it is unused space.

The most likely cause for an unexpected mismatch on RAID1 or RAID10 occurs if a swap partition or swap file is stored on the array.

When the swap subsystem wants to write a page of memory out, it flags the page as 'clean' in the memory manager and requests the swap device to write it out. It is quite possible that the memory will be changed while the write-out is happening. In that case the 'clean' flag will be found to be clear when the write completes and so the swap subsystem will simply forget that the swapout had been attempted, and will possibly choose a different page to write out.

If the swap device was on RAID1 (or RAID10), then the data is sent from memory to a device twice (or more depending on the number of devices in the array). Thus it is possible that the memory gets changed between the times it is sent, so different data can be written to the different devices in the array. This will be detected by check as a mismatch. However it does not reflect any corruption as the block where this mismatch occurs is being treated by the swap system as being empty, and the data will never be read from that block.

The bad thing is that current MDRAID code does *not* directly provide the list of affected blocks, making manual verification a very slow process. Also, it should be noted that the above harmless scenarios do not rule out the possibility of a mismatch due to hardware failures: I had a faulty SATA cable in a two-disk RAID1 NAS, and zeroing all free space did *not* clear the mismatch count. In other words, even RAID1 arrays should be scrubbed regularly.
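For what it's worth, the counter itself is easy to read from sysfs, even though it is only a count and not a block list. A minimal sketch, assuming the usual `/sys/block/<array>/md/` layout with its `sync_action` and `mismatch_cnt` files; the function names and the `md0` array name are just examples.

```python
from pathlib import Path

def read_md_status(array, sysfs_root="/sys/block"):
    """Read the current sync action and mismatch counter for one md array."""
    md_dir = Path(sysfs_root) / array / "md"
    return {
        "sync_action": (md_dir / "sync_action").read_text().strip(),
        "mismatch_cnt": int((md_dir / "mismatch_cnt").read_text().strip()),
    }

def has_mismatches(status):
    """True when no check is running and the last one left a non-zero count."""
    return status["sync_action"] == "idle" and status["mismatch_cnt"] > 0

if __name__ == "__main__":
    status = read_md_status("md0")  # example array name
    print(status, "-> investigate" if has_mismatches(status) else "-> ok")
```

A non-zero count after a check only says that something differs; telling the harmless cases apart from a hardware problem still takes manual digging.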

Please note that parity-based RAID levels (i.e. 5 and 6) do *not* suffer from this problem: the memory is never flushed to disk "as is"; rather, it is copied into a temporary buffer (the stripe cache) and flushed to the devices from that unchanging buffer. With these RAID levels, any mismatch strongly hints at a hardware failure.

Vhalidictes · Thu Aug 10, 2017 5:02 pm

Topinio wrote:
I noped out of Windows' software RAID on first encounter, on discovering the implemented idea of needing a manually-created fault tolerant boot floppy containing the files necessary to boot from the "other" disk, i.e. the mirror is a copy not an equivalent.

Sadly, this first encounter was someone's DC. That I needed to get back up...

Specifically, the boot files aren't duplicated, because Microsoft.

You can usually get around that requirement by simply performing a boot repair option from the OS install CD/USB key (after disconnecting the bad mirror drive).

just brew it! · Thu Aug 10, 2017 5:16 pm

@shodanshok - Yes, I was aware it could be an issue for swap partitions. I was not aware that it potentially affected ext4 partitions until a few days ago.

@Vhalidictes - In my experience, Linux doesn't always set up the mirrored boot partitions correctly either, if you're on an EFI motherboard. In the past I've solved this by dd-ing the EFI partition of the good drive over to the other one after OS installation.

Topinio · Thu Aug 10, 2017 6:00 pm

just brew it! wrote:
@Vhalidictes - In my experience, Linux doesn't always set up the mirrored boot partitions correctly either, if you're on an EFI motherboard. In the past I've solved this by dd-ing the EFI partition of the good drive over to the other one after OS installation.

I've never seen that, but then all my Linux md RAIDs are on older non-UEFI servers.

I guess that makes it a Linux regression (one of many), whereas Windows' brain-dead behaviour is by design. The server I mentioned was 2k3; when I raised it later with Windows sysadmins, I was told this was just the way it worked, going back to NT 4 at least...

blahsaysblah · Thu Aug 10, 2017 10:20 pm

First off, if you meant to say that running it on the same machine doesn't count as a Disaster Recovery level backup, that is true.

However, i think you just have it all wrong. What a computer is. What FreeNAS is. Tools.

Running FreeNAS as just another start-at-boot app is no different than running any other app (whether it's Chrome or an entire VM). It gives me access to a ZFS volume, which gives me peace of mind for my internal long-term data. By the way, FreeNAS does not need ECC. FreeNAS has nothing to do with ECC. The value of your data is related to ECC. Most everyone does not need ECC for ZFS. (edit: due to many different costs, the value is not there)

I do all my work inside a Windows 10 Pro guest VM, inside of which runs Hyper-V again, hosting Docker (be it the standard Moby Linux VM or a Windows Nano Server VM for Windows containers). Running apps inside a VM inside another VM inside the host? Oh noes. Nope, it's very silly to see them as anything other than just complicated apps. That's what all those virtualization opcodes and extensions are for. I also run other Ubuntu server VMs in parallel.

It's true that initially it was just to learn/test before I put it on a real box, but after seeing that the physical disk pass-through works without issue and the speeds are fantastic with minimal overhead...

Another FYI: for a while, due to being at a foreign location, I reconfigured my Win 10 Pro laptop so that the host and all other VMs routed through a pfSense VM to reach the outside network (this assumes MS Hyper-V virtual networks didn't have easy exploits). Sure was nice, considering all the scans and weirdness it logged.

If you idolize stuff, you won't ever do stuff like that. FreeNAS is not a god. Maybe I'm being too harsh in response. AMD has brought us the multi-threaded future we all wanted.... (no, I'm still on Intel; plan was always for gen 2 new AMD)

A Linux RAID-1 mystery (and some answers)
