just brew it! wrote:I learned something fascinating about software RAID-1 on Linux yesterday. This all grew out of some messages I saw logged by the monthly scrub pass that Linux does on all active RAID arrays (it runs a scrub on the first Sunday of every month in the middle of the night).
The mystery began when I noticed that the scrub had logged a dozen or so mirror mismatches. Yet the array status still showed as healthy. My first thought was basically, "WTF? Is my array getting corrupted? Is a drive failing? And why does the scrub show mismatches, but the array still shows as healthy?"
The messages logged by the scrub did not give actual block addresses, so the first task was to figure out if the mismatches were real, and identify the block addresses associated with them. I wrote a short program that calculates MD5 hashes for each 1MB chunk of a raw partition, ran this program against both devices, and diffed the output to identify the 1MB spans where the mismatches were located. Then I ran the same program with a 4KB chunk size over just those 1MB ranges, to get a list of disk block offsets to each discrepancy.
Upon examination of the contents of the suspect blocks, I discovered that the two drives of the mirror always contained similar, but not identical data. In every case I examined, the block from drive A would have some non-zero data, followed by zeros. The corresponding block from drive B would also have some non-zero data followed by zeros, but the point at which the zeros started would be different. The non-zero data always matched, up to the point where the zeros started in the "shorter" block.
I then used debugfs to examine the mounted file system, and discovered that all of those mismatched blocks corresponded to free space in the file system. None of the mismatched blocks contained data belonging to a valid file.
After doing some Googling and reading about Linux's RAID-1 implementation, I believe I've figured out what happened. If you have an application which is appending to a file piecemeal, you can have a race condition where the file system decides to commit a block from the OS's cache to physical media just as the application is about to append additional data. Since the writes to the two drives of the RAID-1 mirror don't occur at exactly the same instant, one drive can get a slightly newer version of that block than the other one. Normally this discrepancy would not persist for long, since the second application write marks the cache block as dirty (again), and this will cause another physical write to get queued up, committing the updated (and consistent) data for that block to both devices in the array.
But what happens if the file gets deleted before this second physical write gets queued? Well, the corresponding blocks in the OS's disk cache get dropped, the second physical write never happens, and the last block of the (now deleted) file is left in an inconsistent state on the underlying RAID media!
Any application that creates temporary files which are then deleted a few seconds later could potentially hit this hole. But since the mismatch only ever happens with data belonging to deleted files, it is "mostly harmless". It may even result in a small performance gain in certain situations, since data belonging to temporary files which are created and quickly deleted never needs to be flushed to physical media.
It certainly has the potential to cause confusion and panic for sysadmins who don't understand that the RAID mismatches are "normal", though. In effect, it results in "false positives" from the scrub pass, since the scrub pass does not know anything about the file system sitting on top of the RAID array.
I also confirmed my theory by writing zeros to all of the free space on the mounted file system. After doing this, all of the RAID mismatches disappeared.
Bottom line: Linux RAID-1 interacts with the file system in non-obvious ways. The upshot of this is that under certain conditions, free space on the file system may have inconsistent data on the underlying RAID devices.
Edit: Corrected a typo and clarified a couple of things.
Yes, this is a know behavior of Linux MD RAID1/10. Basically, you can have mismatch due to:
a) a RAID1/10 swap partition;
b) temporary files which are "halft written" to the two different RAID1/10 legs.
This is ultimately due to the zero-copy behavior of these RAID levels. In short, if a memory page is changed between the first writeout (to the first disk) and the second one (to the second disk), a mismatch occours. If the page with mismatched data is then invalidated (ie: the file is deleted), the mismatch remains lurking on the disk's platters.
From the man page:
However on RAID1 and RAID10 it is possible for software issues to cause a mismatch to be reported. This does not necessarily mean that the data on the array is corrupted. It could simply be that the system does not care what is stored on that part of the array - it is unused space.
The most likely cause for an unexpected mismatch on RAID1 or RAID10 occurs if a swap partition or swap file is stored on the array.
When the swap subsystem wants to write a page of memory out, it flags the page as 'clean' in the memory manager and requests the swap device to write it out. It is quite possible that the memory will be changed while the write-out is happening. In that case the 'clean' flag will be found to be clear when the write completes and so the swap subsystem will simply forget that the swapout had been attempted, and will possibly choose a different page to write out.
If the swap device was on RAID1 (or RAID10), then the data is sent from memory to a device twice (or more depending on the number of devices in the array). Thus it is possible that the memory gets changed between the times it is sent, so different data can be written to the different devices in the array. This will be detected by check as a mismatch. However it does not reflect any corruption as the block where this mismatch occurs is being treated by the swap system as being empty, and the data will never be read from that block.
The bad thing is that current MDRAID code does *not* directly provide the affected block lists, making manual verification a very slow process. Also, it should be noted that the above harmless scenarios do not rule out the possibility of a mismatch due to hardware failures: I had a faulty SATA cables in one two-disks, RAID1 NAS, and zeroing all free spaced did *not* clear all mismatched count. In other words, even RAID1 arrays should be scrubbed regularly.
Please note that parity-based RAID levels (ie: 5 and 6) do *not* suffer from this problem: the memory is never flushed to disks "as is", rather it is copied in a temporary buffer (the stripe cache) and, from that unchanging buffer, it is flushed to the devices. With these raids levels, any mismatch strongly hints to an hardware failure.