I learned something fascinating about software RAID-1 on Linux yesterday. This all grew out of some messages I saw logged by the monthly scrub pass that Linux does on all active RAID arrays (it runs a scrub on the first Sunday of every month in the middle of the night).
The mystery began when I noticed that the scrub had logged a dozen or so mirror mismatches. Yet the array status still showed as healthy. My first thought was basically, "WTF? Is my array getting corrupted? Is a drive failing? And why does the scrub show mismatches, but the array still shows as healthy?"
The messages logged by the scrub did not give actual block addresses, so the first task was to figure out if the mismatches were real, and identify the block addresses associated with them. I wrote a short program that calculates MD5 hashes for each 1MB chunk of a raw partition, ran this program against both devices, and diffed the output to identify the 1MB spans where the mismatches were located. Then I ran the same program with a 4KB chunk size over just those 1MB ranges, to get a list of disk block offsets to each discrepancy.
Upon examination of the contents of the suspect blocks, I discovered that the two drives of the mirror always contained similar, but not identical data. In every case I examined, the block from drive A would have some non-zero data, followed by zeros. The corresponding block from drive B would also have some non-zero data followed by zeros, but the point at which the zeros started would be different. The non-zero data always matched, up to the point where the zeros started in the "shorter" block.
I then used debugfs to examine the mounted file system, and discovered that all of those mismatched blocks corresponded to free space in the file system. None of the mismatched blocks contained data belonging to a valid file.
After doing some Googling and reading about Linux's RAID-1 implementation, I believe I've figured out what happened. If you have an application which is appending to a file piecemeal, you can have a race condition where the file system decides to commit a block from the OS's cache to physical media just as the application is about to append additional data. Since the writes to the two drives of the RAID-1 mirror don't occur at exactly the same instant, one drive can get a slightly newer version of that block than the other one. Normally this discrepancy would not persist for long, since the second application write marks the cache block as dirty (again), and this will cause another physical write to get queued up, committing the updated (and consistent) data for that block to both devices in the array.
But what happens if the file gets deleted before this second physical write gets queued? Well, the corresponding blocks in the OS's disk cache get dropped, the second physical write never happens, and the last block of the (now deleted) file is left in an inconsistent state on the underlying RAID media!
Any application that creates temporary files which are then deleted a few seconds later could potentially hit this hole. But since the mismatch only ever happens with data belonging to deleted files, it is "mostly harmless". It may even result in a small performance gain in certain situations, since data belonging to temporary files which are created and quickly deleted never needs to be flushed to physical media.
It certainly has the potential to cause confusion and panic for sysadmins who don't understand that the RAID mismatches are "normal", though. In effect, it results in "false positives" from the scrub pass, since the scrub pass does not know anything about the file system sitting on top of the RAID array.
I also confirmed my theory by writing zeros to all of the free space on the mounted file system. After doing this, all of the RAID mismatches disappeared.
Bottom line: Linux RAID-1 interacts with the file system in non-obvious ways. The upshot of this is that under certain conditions, free space on the file system may have inconsistent data on the underlying RAID devices.
Edit: Corrected a typo and clarified a couple of things.
If the world isn't making sense to you, you're either drinking too much or not drinking enough.