In July of 2006, Sun added an interesting new machine to their lineup of Opteron-based x86 servers. It was codenamed 'Thumper' during development and designated the Sun Fire X4500 when it was launched. It is essentially a 4U rack-mount, dual-AMD Socket 940 server with 48 SATA disks. In my real life as a sysadmin for a large company, I'm intrigued by the new direction in storage systems that Sun is experimenting with. As a PC enthusiast, I'm impressed by the simplicity and scale of it.
The recipe for the Thumper is simple. It's a list of commodity bits we're all familiar with:
- Big, cheap SATA drives
- PCI-X SATA controller chips
- PCI-X to HyperTransport tunnel chips
- AMD Opteron CPUs
Sun has published an architecture whitepaper for the X4500, which includes this block diagram:
Sun, as we can see, has built the Thumper around the copious system bandwidth available on the Opteron platform, an area where AMD is still competitive with the fastest Intel Xeon CPUs. The 48 SATA drives are connected via a passive backplane to power and to the six Marvell SATA controller chips on the mainboard. From there, the design of the system shows an attention to balancing I/O throughput along the entire path back to the CPUs. Each drive is connected to a dedicated SATA port, unlike even most high-end storage systems, which group multiple drives on a common bus. Each of the six eight-port controller chips has its own dedicated PCI-X connection, which works out to 133MB/s per drive, plenty to keep the drive reading and writing data as fast as it can move bits on and off the platters. Those PCI-X connections are, in turn, bridged onto the 8GB/s CPU HyperTransport links by AMD 8132 chips, again leaving enough headroom for all the controller chips to feed data into the CPUs at once. The two PCI-X slots also have dedicated connections, and the system peripherals, including four Gigabit Ethernet ports, are connected via downstream HT links on the tunnel chips.
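The per-drive number is easy to check on the back of an envelope. Here is a minimal sketch of that arithmetic, assuming each controller sits on a 64-bit, 133MHz PCI-X bus with a peak of roughly 1066MB/s (my assumption; the figure is implied by the 133MB/s per drive above rather than stated outright). The constant names are made up for illustration.

```python
# Rough bandwidth budget for the Thumper's disk path.
# Assumption: 64-bit/133MHz PCI-X peaks at about 1066 MB/s per bus.
PCIX_MB_PER_S = 1066
CONTROLLERS = 6            # Marvell SATA controller chips
PORTS_PER_CONTROLLER = 8   # one drive per port

drives = CONTROLLERS * PORTS_PER_CONTROLLER
per_drive_mb_s = PCIX_MB_PER_S / PORTS_PER_CONTROLLER
aggregate_mb_s = CONTROLLERS * PCIX_MB_PER_S

print(f"drives:            {drives}")
print(f"per-drive share:   {per_drive_mb_s:.0f} MB/s")
print(f"aggregate PCI-X:   {aggregate_mb_s / 1000:.1f} GB/s")
```

These are peak bus numbers, not anything a real workload will sustain, but they show where the per-drive and aggregate figures come from.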
So what does this add up to? A file server with 48 individual disks and a theoretical 6GB/s of disk bandwidth. Because the disk controllers are simple SATA host adapters with no RAID intelligence, the installed OS is going to see all 48 as individual devices. If you were to install Windows, I suppose you would have drive letters from C: to AX:, but would you really want the poor machine to suffer like that? The solution is to use the software RAID functionality of your operating system of choice. Software RAID has fallen out of favor these days, displaced by dedicated hardware that offloads the task. That made a lot of sense when 200MHz Pentium Pro processors cost $1000, but most servers these days have plenty of CPU cycles available. Additionally, the RAID controller has become the bottleneck between disk and CPU in many current server configurations.
Another downside of software RAID has always been increased complexity in the OS configuration. Sun has given us another neat piece of technology to assist here: ZFS. ZFS is a new filesystem available in Solaris 10. With ZFS, all of the various layers of storage management have been rolled up into the filesystem. Configuring RAID sets, aggregating them into one logical volume, and then formatting and mounting it as a filesystem is accomplished in a single step. There are some examples here, and while those are some of the longest commands you might ever have to type, most of the length is taken up by listing all the disk device names (nothing like this).
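To give a feel for why those commands get so long, here is a small sketch that only builds the text of a `zpool create` command and prints it. It doesn't run anything. The pool name `tank`, the `c0`..`c5` controller numbering, and the layout (eight raidz groups, each taking one disk from every controller) are all my own assumptions for illustration. The device names on a real X4500 will differ, and a real layout has to account for where the OS lives.

```python
def zpool_create_command(pool, controllers, targets):
    """Build a `zpool create` command line as a string (not executed).

    Hypothetical layout: one raidz group per target number, with one
    disk from each controller in every group.
    """
    parts = ["zpool", "create", pool]
    for t in targets:
        parts.append("raidz")
        # Solaris-style device names: c<controller>t<target>d<disk>
        parts.extend(f"c{c}t{t}d0" for c in controllers)
    return " ".join(parts)


# Six controllers with eight targets each: 48 device names in one command.
cmd = zpool_create_command("tank", range(6), range(8))
print(cmd)
print(len(cmd), "characters")
```

Nearly all of the output is device names. The part that actually describes the storage layout is just the pool name and a handful of `raidz` keywords.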
I know this all reads like an advertisement, and maybe I've drunk the purple Kool-Aid, but it's hard for a server geek not to get excited about this. The combination of the X4500 and ZFS results in a level of performance and capacity that matches some high-end enterprise storage arrays. There are simple benchmarks published that put the real-world read throughput of this configuration over 1GB/s. That's a level of performance that would take an exotic configuration of four 4Gb/s host bus adapters to equal in the Fibre Channel world, and that's if your array controllers were capable of feeding data at that rate. All this comes at a cost that is very low by enterprise storage standards. The model with 48 1TB drives lists for about $60,000, a delivered cost per gigabyte of about $1.25. This presents new vistas of capability to system engineers, and new challenges as well. We can offer a petabyte of online storage for a little over $1M while taking up only two racks in the computer room. Problems that would have broken the entire IS budget are approachable now, but while we can afford the primary disk, the tape infrastructure to back it all up remains unattainable, not to mention that it would take weeks to move 1PB from disk to tape.
Good problems to have, I guess. At least I don't have to worry about where to save all my Linux install ISOs anymore.
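Those numbers hang together if you work them through. Below is a rough sketch using the list price and drive size quoted above. It counts raw capacity only and ignores RAID parity overhead. The 42U rack height and the 500MB/s aggregate tape throughput are my own assumptions, picked only to show the scale of the backup problem.

```python
import math

# Figures from the text
DRIVES_PER_UNIT = 48
DRIVE_GB = 1000            # 1TB drives
UNIT_PRICE_USD = 60_000
UNIT_HEIGHT_U = 4

# Assumptions for illustration
RACK_HEIGHT_U = 42         # a standard full-height rack
TAPE_MB_PER_S = 500        # hypothetical aggregate rate across several tape drives

unit_gb = DRIVES_PER_UNIT * DRIVE_GB
cost_per_gb = UNIT_PRICE_USD / unit_gb

PETABYTE_GB = 1_000_000
units = math.ceil(PETABYTE_GB / unit_gb)
total_cost = units * UNIT_PRICE_USD
racks = math.ceil(units * UNIT_HEIGHT_U / RACK_HEIGHT_U)

# Time to stream 1PB to tape at the assumed rate
backup_days = (PETABYTE_GB * 1000) / TAPE_MB_PER_S / 86_400

print(f"cost per GB:      ${cost_per_gb:.2f}")
print(f"units for 1PB:    {units}")
print(f"total list price: ${total_cost:,}")
print(f"racks needed:     {racks}")
print(f"days to tape:     {backup_days:.0f}")
```

Even with a generous tape setup, a full backup lands in the "weeks" range, which is the real catch in all this.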
I've been putting in some time working with phpBB 3 lately. I expect we'll be taking some downtime in the next month or so to upgrade.