What amazes me about the new Xeons, though, is how much more there is to them than one might have expected. Intel’s architects and designers have crammed formidable new technologies into these chips in order to allow them to scale up to large core counts and multiple sockets. The result may be the most impressive set of CPUs Intel has produced to date, with numbers for core count and throughput that pretty much boggle the mind. Read on to see what makes Haswell-EP different—and better.
A look at a Haswell-EP die.
The Haswell-EP family
The first thing one needs to know about Haswell-EP is that it’s not just a single chip, but a trio of chips. Intel has moved in recent years toward right-sizing its Xeon silicon for different products, and Haswell-EP takes that trend into new territory. Here are the three members of the family.
| Chip | Cores | Threads | L3 cache | Process | Transistors (millions) | Die area (mm²) |
| --- | --- | --- | --- | --- | --- | --- |
| Haswell-EP | 8 | 16 | 20 MB | 22 nm | 2,601 | 354 |
| Haswell-EP | 12 | 24 | 30 MB | 22 nm | 3,839 | 484 |
| Haswell-EP | 18 | 36 | 45 MB | 22 nm | 5,569 | 662 |
All three chips are fabbed on Intel’s 22-nm process tech with tri-gate transistors, and they all share the same basic technological DNA. Intel has simply scaled them differently, with quite a bit of separation in terms of size and transistor count between the three options. The biggest of the bunch has a staggering 18 cores, 36 threads, and 45MB of L3 cache. To give you some perspective on this CPU’s size: at 662 mm², it’s substantially larger than even the biggest GPUs in the world. Nvidia’s GK110 is 555 mm², and AMD’s Hawaii GPU is 438 mm².
The prior generation of Xeons, code-named Ivy Bridge-EP, topped out at 12 cores, so Haswell-EP offers a 50% increase on that front. Haswell-EP is a “tock” in Intel’s so-called “tick-tock” development model, which means it brings a new CPU architecture to a familiar chip fabrication process. There’s quite a bit more to this new family than just a revised CPU microarchitecture, though. The entire platform has been reworked, as the diagram below summarizes.
An overview of what’s new in Haswell-EP. Source: Intel.
The changes really do begin with the transition to Haswell-class CPU cores. These are indeed the same basic cores used across Intel’s product portfolio, and by now, their virtues are well known. Through a combination of larger on-chip structures, more execution units, and smarter logic, the Haswell core increases its instruction throughput per clock by about 10% compared to Ivy Bridge before it. That number can go much higher with the use of the new AVX2 instruction set extensions, which have the potential to double vector throughput for both integer and floating-point data types.
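The arithmetic behind that doubling is worth spelling out. Here’s a back-of-the-envelope sketch; the per-cycle figures come from the well-known port layouts of these cores, while the 2.3 GHz AVX clock is purely a placeholder, since Intel hasn’t yet published AVX clock speeds for these parts:

```python
# Peak double-precision FLOPS per core, per cycle, back of the envelope.
# Haswell: two 256-bit FMA units; one FMA counts as 2 FLOPs per lane.
lanes = 256 // 64             # four 64-bit doubles per AVX register
haswell_dp = 2 * 2 * lanes    # 2 FMA ports x 2 FLOPs x 4 lanes = 16
# Ivy Bridge: separate 256-bit add and multiply units, no FMA.
ivb_dp = 2 * lanes            # (1 add + 1 mul) per cycle x 4 lanes = 8

print(haswell_dp, ivb_dp)     # 16 vs. 8 DP FLOPs per clock, per core

# Scaled to the 18-core part at a hypothetical 2.3 GHz AVX clock:
peak_gflops = 18 * 2.3e9 * haswell_dp / 1e9
print(f"{peak_gflops:.0f} GFLOPS")  # ~662 GFLOPS for one socket
```

Real workloads won’t sustain anything close to that peak, of course, but it shows why a vectorized, FMA-friendly kernel can see far more than the 10% generational gain.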
For servers in particular, the Haswell core has the potential to boost performance even further via the TSX instruction set extensions, which enable hardware lock elision and restricted transactional memory. The TSX instructions allow the hardware to shoulder much of the burden of making sure concurrent threads don’t cause problems for one another. Unfortunately, Intel discovered an erratum in its TSX implementation just prior to the release of Haswell-EP. As a result, the first systems based on this silicon have shipped with TSX disabled via microcode. Users may have the option to enable TSX in a system’s BIOS for development purposes, but doing so risks system instability. I’d expect Intel to produce a new stepping of Haswell-EP with the TSX erratum corrected, but we don’t yet have a clear timetable for such a move. The firm has hinted that TSX should be production-ready once the larger, multi-socket Haswell-EX parts arrive.
The new generation of Xeons has much to recommend it even without TSX. One of the most notable innovations in Haswell-era chips is the incorporation of voltage regulation circuitry directly onto the CPU die. The integrated VR, which Intel calls FIVR for “fully integrated voltage regulator,” allows for more efficient operation along several lines. Voltage transitions with FIVR can be much quicker than with an external VR, and FIVR has many more supply lines, allowing for fine-grained control of power delivery across the chip. The integrated VRs can also reduce the physical footprint of the CPU and its support circuitry.
The advent of FIVR grants Haswell-EP increased dynamic operating range versus its predecessors. For instance, each individual core on the processor can maintain its own power state, or P-state, with its own clock speed and supply voltage. In Ivy-E and earlier parts, all of the cores share a common frequency and voltage. This per-core P-state feature operates in the margins between idle (power is gated off individually to idle cores) and peak core utilization. Dropping a partially used core to an intermediate P-state via this mechanism can free up some thermal headroom for another, busier core to move to a higher frequency via Turbo—so the payoff ought to be more efficiency and performance.
We’ve seen this sort of independent core clocking run into problems in the past, notably in AMD’s Barcelona-based processors, but Intel’s architects are confident that Haswell-EP’s P-state transitions happen quickly enough and have few enough penalties to make this feature worthwhile. At present, per-core P-states are only being used in server- and workstation-class CPUs, not in client-focused products where immediate responsiveness is a top priority.
FIVR also offers a separate supply rail to the “uncore” complex that handles internal and external communication. As a result, the uncore is now clocked independently of the cores. It can run at higher frequencies when bandwidth is at a premium, even if the CPU cores are lightly utilized, and the situation can be reversed when I/O demands decrease and the CPU cores are fully engaged.
The Turbo Boost algorithm that controls the chip’s clocking behavior has grown a little more sophisticated, as well. One addition is what Intel calls “Energy Efficient Turbo.” The power control routine now monitors the activity of each core for throughput and stalls. If it decides that raising the clock speed of a core wouldn’t be energy efficient—presumably because the core’s present activity is gated by external factors or is somehow inefficient—the Turbo mechanism will choose not to raise the speed.
The final tweak to Haswell-EP’s dynamic operating strategy came as a surprise to me. As you can see illustrated on the right, Haswell-EP processors will operate at lower frequencies when processing AVX instructions. The fundamental reality here is that those 256-bit-wide AVX vector units are big, beefy hardware. They chew up a lot of power, and so they require some concessions. As with regular Turbo operation, the chip will seek as high a clock speed within its defined limits during AVX processing—those limits are just lower. Intel says the CPU will return to its regular, non-AVX operating mode one millisecond after the completion of the last AVX instruction in a stream.
Intel has defined the base and Turbo peak AVX frequencies for each of the new Xeons, and it says it will publish those speeds for all to see. As of now, though, I have yet to see AVX clock speeds listed in any of Intel’s pre-launch press information. I expect we’ll hear more on this front soon.
The move to Haswell cores has also brought with it some benefits for virtualization performance. The amount of time needed to enter and to exit a virtual machine has shrunk, as it has fairly consistently over time with successive CPU generations. The result should be a general increase in VM performance. Haswell-EP also allows the shadowing of VM control structures, which should improve the efficiency of VM management and the like.
Perhaps the niftiest bit of new tech for virtualization can apply to other uses, as well. Haswell-EP has hooks built in for the monitoring of cache allocation by thread. In a VM context, this capability should allow hypervisors to expose information that would let sysadmins identify “noisy neighbor” VMs that thrash the cache and may cause problems for other VMs on the same system. Once identified, these troublesome VMs could be moved or isolated in order to prevent cache contention problems from affecting other virtual machines.
Beyond the core
With chips of this scale, the CPU cores are only a small part of the overall picture. The glue that binds everything together is also incredibly complex—and is crucial for performance to scale up with core count. Have a look at this diagram of the 18-core Haswell-EP part in order to get a sense of things.
Like I said: complex. Intel has used a ring interconnect through multiple generations of Xeons now, but the bigger versions of Haswell-EP actually double the ring count to two fully-buffered rings per chip. Intel’s architects say this arrangement provides substantially more bandwidth, and they expect it to remain useful in the future when core counts rise above the current peak of 18.
The rings operate bidirectionally, and individual transactions always flow in the direction of the shortest path from point A to point B. The two rings are linked via a pair of buffered switches. These switches add a couple of cycles of latency to any transaction that must traverse one of them.
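The shortest-path rule is easy to model. This sketch is illustrative only; the stop counts and the switch penalty are stand-ins rather than Intel’s actual figures, but the structure mirrors what the architects describe:

```python
def ring_hops(src: int, dst: int, stops: int) -> int:
    """Hops between two stops on a bidirectional ring, taking the shorter direction."""
    d = abs(src - dst) % stops
    return min(d, stops - d)

def path_cost(src, dst, stops_per_ring=12, switch_penalty=2):
    """Toy cost model: endpoints are (ring_id, stop) pairs. Crossing rings
    adds a couple of cycles for the buffered switch, as Intel describes."""
    (r1, s1), (r2, s2) = src, dst
    if r1 == r2:
        return ring_hops(s1, s2, stops_per_ring)
    # Assume the switch sits at stop 0 of each ring (an invented location).
    return (ring_hops(s1, 0, stops_per_ring) + switch_penalty
            + ring_hops(0, s2, stops_per_ring))

print(path_cost((0, 3), (0, 9)))   # same ring: min(6, 6) = 6 hops
print(path_cost((0, 3), (1, 1)))   # cross-ring: 3 + 2 + 1 = 6
```

The point of the model is simply that same-ring traffic never pays the switch penalty, which is why the cluster-on-die mode discussed later exists at all.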
One thing that you’ll notice is that the layout, even in the big chip, is somewhat lopsided. There are eight cores on one ring and ten on the other. Each ring has its own memory controller, but only the left-side ring has access to PCIe connectivity and the QuickPath Interconnect to the other socket.
The 12-core chip seems even weirder, with half of one ring simply clipped off along with the six cores that used to reside there.
Such asymmetry just doesn’t seem natural at first glance. Could it present a problem where one thread executes more quickly than another by virtue of its assigned core’s location?
I think that would matter more if it weren’t for the fact that the chip is operating at billions of cycles per second, and anything happening via one of those off-chip interfaces is likely to be enormously slower. When I raised the issue of asymmetry with Intel’s architects, they pointed out that the latency for software-level thread switching is much, much higher than what happens in hardware. They further noted that Intel has had some degree of asymmetry in its CPUs since the advent of multi-core processors.
Also, notice that each core has 2.5MB of last-level cache associated with it. This cache is distributed across all cores, and its contents are shared, so that any core could potentially access data in any other cache partition. Thus, it’s unlikely that any single core would be the most advantageous one to use by virtue of its location on the die.
For those folks who prefer to have precise control over how threads execute, the Haswell-EP Xeons with more than 10 cores offer a strange and intriguing alternative known as cluster-on-die mode. The idea here is that each ring on the chip operates almost like its own NUMA node, as each CPU socket does in this class of system. Each ring becomes its own affinity domain. The cores on each ring only “see” the last-level cache associated with cores on that ring, and they’ll prefer to write data to memory via the local controller.
This mode will be selectable via system firmware, I believe, and is intended for use with applications that have already been tuned for NUMA operation. Intel says it’s possible to achieve single-digit-percentage performance gains with cluster-on-die mode. I expect the vast majority of folks to ignore this mode and take the “it just works” option instead.
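For NUMA-tuned software, taking advantage of an affinity domain mostly means pinning threads to the cores of one cluster. Here’s a minimal sketch; the contiguous cluster-to-CPU mapping is invented for illustration, and a real tool would read the topology from the OS (via libnuma or sysfs on Linux):

```python
import os

def split_into_clusters(cpu_ids, n_clusters=2):
    """Partition a flat CPU list into contiguous clusters, mimicking the
    two affinity domains cluster-on-die mode exposes per socket."""
    per = len(cpu_ids) // n_clusters
    return [cpu_ids[i * per:(i + 1) * per] for i in range(n_clusters)]

cpus = list(range(12))              # pretend: one 12-core Haswell-EP
clusters = split_into_clusters(cpus)
print(clusters)                     # [[0..5], [6..11]]

# Pinning the current process to one cluster keeps its memory traffic on
# that ring's home agent. sched_setaffinity is Linux-only, hence the guards.
if hasattr(os, "sched_setaffinity"):
    target = set(clusters[0]) & os.sched_getaffinity(0)
    if target:
        os.sched_setaffinity(0, target)
```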
The small die with “only” eight cores has just one ring, with all four memory channels connected to a single home agent. This chip is no doubt the basis for Haswell-E products like the Core i7-5960X.
With this amount of integration, Xeons are increasingly becoming almost entire systems on a chip. Thus, a new generation means little upgrades here and there across that system. Haswell-EP raises the bandwidth on the QPI socket-to-socket interconnect to 9.6GT/s, up from 8GT/s before. The PCIe 3.0 controllers have been enhanced with more buffers and credits, so they can achieve higher effective transfer rates and better tolerate latency.
The biggest change on this front, though, is the move to DDR4 memory. Each Haswell-EP socket has four memory channels, and those channels can talk to DDR4 modules at speeds of up to 2133 MT/s. That’s slightly faster than the 1866 MT/s peak of DDR3 with Ivy Bridge-EP, but the real benefits of DDR4 go beyond that. This memory type operates at lower voltage (1.2V standard), has smaller pages that require less activation power, and employs a collection of other measures to improve power efficiency. The cumulative savings, Intel estimates, are about two watts per DIMM at the wall socket.
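That per-DIMM figure adds up in a fully populated box. A quick sanity check using Intel’s rough ~2W estimate and our own test rig’s DIMM count (the electricity rate is a hypothetical number for illustration):

```python
dimms = 16                 # our dual-socket test rig: 8 DIMMs per socket
watts_per_dimm = 2         # Intel's rough estimate, measured at the wall
savings_w = dimms * watts_per_dimm
print(savings_w)           # ~32 W shaved off total system draw

# Over a year of 24/7 operation, at a hypothetical $0.12/kWh:
kwh = savings_w / 1000 * 24 * 365
print(f"{kwh:.0f} kWh, ${kwh * 0.12:.2f}")
```

Thirty-odd watts isn’t much for one workstation, but multiplied across a rack of servers, it’s the sort of saving data-center operators notice.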
DDR4 also operates at higher frequencies with more DIMMs present—up to 1600 MT/s on Haswell-EP with three DIMMs per channel. Going forward, DDR4 should enable even higher transfer rates and bit densities. Memory makers already have 3200 MT/s parts in the works, and Samsung is exploiting DDR4’s native support for die stacking to create high-performance 64GB DIMMs.
Naturally, with the integration of the voltage regulators and the change in memory types, Haswell-EP also brings with it a new socket type. Dubbed Socket R3, this new socket isn’t backward-compatible with prior Xeons at all, although it does have the same dimensions and attach points for coolers.
Accompanying Haswell-EP to market is an updated chipset—really just a single chip—with a richer complement of I/O ports. The chipset’s code name is Wellsburg, but today, it officially gets the more pedestrian name of C612. I suspect it’s the same chip known as the X99 in Haswell-E desktop systems. Wellsburg is much better endowed with high-speed connectivity than its predecessor; it sprouts 10 SATA 6Gbps ports and 14 USB ports, six of them USB 3.0-capable. The chipset’s nine PCIe lanes are still stuck at Gen2 transfer rates, but lane grouping into x2 and x4 configs is now supported.
Intel is spinning the three Haswell-EP chips into a grand total of 29 different Xeon models. The new Xeons will be part of the E5 v3 family, whereas Ivy Bridge-EP chips are labeled E5 v2, and older Sandy Bridge-EP parts lack a trailing version number. There’s a wide array of new products, and here is a confusing—but potentially helpful—slide that Intel is using to map out the lineup.
The Xeon E5 v3 lineup. Source: Intel.
Prices range from $2,702 for the E5-2697 v3 to $213 for the E5-2603 v3. Well, that’s not the entire range. Tellingly, Intel isn’t divulging list prices for the top models, including the 18-core E5-2699 v3. I’m pretty sure that doesn’t mean it’s on discount.
Our attention today is focused primarily on workstation-class Xeons, specifically the 10-core Xeon E5-2687W v3, which we’ve tested against its two direct predecessors based on the Sandy Bridge-EP and Ivy Bridge-EP microarchitectures. Their specs look like so:
| Model | Cores/threads | Base clock (GHz) | Max Turbo (GHz) | L3 cache (MB) | QPI speed | Memory channels | Memory support | TDP (W) | Price |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Xeon E5-2687W | 8/16 | 3.1 | 3.8 | 20 | 8.0 GT/s | 4 | DDR3-1600 | 150 | $1,890 |
| Xeon E5-2687W v2 | 8/16 | 3.4 | 4.0 | 25 | 8.0 GT/s | 4 | DDR3-1866 | 150 | $2,112 |
| Xeon E5-2687W v3 | 10/20 | 2.7/3.1 | 3.5 | 25 | 9.6 GT/s | 4 | DDR4-2133 | 160 | $2,141 |
Note that there are two base frequencies listed for the E5-2687W v3. The base speed is 2.7GHz with AVX workloads and 3.1GHz without. The peak Turbo speed is 3.5GHz for both types of workloads, though.
At any rate, these Xeons are all gut-bustingly formidable processors, and they’re intended to drop into dual-socket systems where the core counts and memory channels will double. That’s a recipe for some almost ridiculously potent end-user systems. In fact, we have an example of just such a box on hand.
A big Boxx o’ badness
Above is the Boxx workstation that Intel supplied to us, wrapped around a pair of Xeon E5-2687W v3 processors, for testing and review. With 20 cores, 40 threads, 50MB of L3 cache, and eight channels of DDR4 memory with a total capacity of 128GB, this puppy is the most potent single-user system ever to find its way into Damage Labs. Regular high-end desktops are just a time slice on this thing.
And, minor miracle, its operation is whisper-quiet, unlike some workstations in this class. How did Boxx manage that feat?
Yep, snaking away from each socket are the hoses for a water cooler. Twin radiators evacuate heat from the Xeons with minimal noise.
Here’s the obligatory screenshot from Windows Task Manager, showing all 40 hardware threads and indicating 128GB of available RAM. And then there’s this…
Installed in one of the PCI Express slots is an Intel 400GB NVMe SSD, one of the fastest storage devices currently available. If you appreciate fast computers, well, this is among the fastest systems possible with today’s technology.
Here at TR, it’s apparently our mission to educate and—since we serve an audience of geeks—to disappoint. (Read the comments some time if you don’t believe me.) Our daily dose of disappointment comes in the various ways we didn’t test Intel’s new Xeons in the limited time available to us prior to this product launch. Although we’ve tested servers quite thoroughly in the past, this time we’ve had to confine ourselves to workstation-class processors, because we couldn’t carve the time out of our schedule to get the latest SPEC benchmarks up and running on multiple boxes. Even our workstation-class testing is devoid of $40K applications like AutoCAD and the difficult-to-obtain data sets we’d need to test such things properly. Really, it’s a travesty of some sort.
Damage Labs does have a few tricks up its sleeve, and one of those is our ability to provide broad comparisons of x86 processors against one another. In a move that will surely risk angering the gods of product segmentation, we have provided, alongside our Xeon numbers, some benchmark results from CPUs stretching down to single-socket offerings that cost less than 80 bucks. The results for the lower-end CPUs are grayed out in the graphs on the following pages, since they’re not the primary focus of our attention. We’ve also included, later in the article, results from much older Xeons and Opterons from years past. All of it is probably a bit much, but perhaps you’ll find it entertaining.
Our testing methods
As usual, we ran each test at least three times and have reported the median result. Our test systems were configured like so:
| Component | System 1 | System 2 |
| --- | --- | --- |
| Processor | Dual Xeon 2687W / Dual Xeon 2687W v2 | Dual Xeon 2687W v3 |
| Motherboard | Asus Z9PE-D8WS | Supermicro X10DAi |
| Chipset | Intel C602 | Intel C610 |
| Memory size | 128 GB (16 DIMMs) | 128 GB (16 DIMMs) |
| Memory type | Micron ECC DDR3 SDRAM | Samsung ECC DDR4 SDRAM |
| Memory speed | 1600 MT/s | 2133 MT/s |
| Memory timings | 11-11-11-28 1T | 15-15-15-36 1T |
| Storage | Kingston HyperX SH103S3 240GB SSD | Intel DC S3500 Series 240GB SSD |
| OS | Windows 8.1 Pro | Windows 8.1 Pro |
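Incidentally, the “at least three runs, report the median” policy is simple to express, and it’s worth noting why we prefer the median: a single outlier run (say, one disturbed by a background task) is rejected outright, where a mean would be skewed by it. A quick sketch with made-up timings:

```python
from statistics import median

runs = [41.8, 40.2, 55.6]     # seconds; the third run hit a background task
print(median(runs))           # 41.8 -- the outlier is ignored
print(sum(runs) / len(runs))  # ~45.9 -- a mean would be dragged upward
```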
Thanks to Asus, Boxx, Samsung, Micron, and Kingston for helping to outfit our test rigs with some of the finest hardware available. Thanks to Intel and AMD for providing the processors, as well, of course.
Some further notes on our testing methods:
- The test systems’ Windows desktops were set at 1920×1080 in 32-bit color. Vertical refresh sync (vsync) was disabled in the graphics driver control panel.
- We used a Yokogawa WT210 digital power meter to capture power use over a span of time. The meter reads power use at the wall socket, so it incorporates power use from the entire system—the CPU, motherboard, memory, graphics solution, hard drives, and anything else plugged into the power supply unit. (The monitor was plugged into a separate outlet.) We measured how each of our test systems used power across a set time period, during which time we encoded a video with x264.
- After consulting with our readers, we’ve decided to enable Windows’ “Balanced” power profile for the bulk of our desktop processor tests, which means power-saving features like SpeedStep and Cool’n’Quiet are operating. (In the past, we only enabled these features for power consumption testing.) Our spot checks demonstrated to us that, typically, there’s no performance penalty for enabling these features on today’s CPUs. If there is a real-world penalty to enabling these features, well, we think that’s worthy of inclusion in our measurements, since the vast majority of desktop processors these days will spend their lives with these features enabled.
The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
Memory subsystem performance
Since we have a new chip architecture and a new memory type on the bench, let’s take a look at some directed memory tests before moving on to real-world applications.
The fancy plot above mainly looks at cache bandwidth. This test is multithreaded, so the numbers you see show the combined bandwidth from all of the L1 and L2 caches on each system. Since our Haswell-EPs have 20 L1 caches of 32KB each, we’re still in the L1 cache at the 512KB block size above. The next few points, up to 4MB, are hitting the L2 caches, and beyond that, up to 16MB, we’re into the L3.
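A quick way to sanity-check which cache a given block size lands in is to compare it against the aggregate capacity at each level. The capacities below assume the standard Haswell per-core figures (32KB L1 data, 256KB L2) across our dual 10-core rig:

```python
def cache_level(block_bytes, caches):
    """Return the first cache level whose aggregate capacity holds the block."""
    for name, total_bytes in caches:
        if block_bytes <= total_bytes:
            return name
    return "main memory"

KB, MB = 1024, 1024 * 1024
dual_2687w_v3 = [
    ("L1", 20 * 32 * KB),    # 20 cores x 32KB data cache = 640KB
    ("L2", 20 * 256 * KB),   # 20 cores x 256KB = 5MB
    ("L3", 2 * 25 * MB),     # two sockets x 25MB shared LLC = 50MB
]

print(cache_level(512 * KB, dual_2687w_v3))  # L1
print(cache_level(4 * MB, dual_2687w_v3))    # L2
print(cache_level(16 * MB, dual_2687w_v3))   # L3
print(cache_level(64 * MB, dual_2687w_v3))   # main memory
```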
Haswell-EP’s promised doubling of L1 and L2 cache bandwidth per core is on display in the plot above. The E5-2687W v3’s higher core count also plays a part in these results, but however you slice it, this is a massive increase in cache bandwidth.
Now, let’s look at what happens when we get into main memory.
We found that our usual version of the Stream bandwidth test fails to scale properly to 20 cores and 40 threads, so we’ve substituted AIDA’s memory tests instead. Fortunately, they have no such issue. The E5 v3’s higher-speed DDR4 memory clearly outperforms the two prior generations of Xeons with DDR3 memory, with delivered bandwidth of up to 123 GB/s in the memory read test.
SiSoft has a nice write-up of this latency testing tool, for those who are interested. We used the “in-page random” access pattern to reduce the impact of prefetchers on our measurements. This test isn’t multithreaded, so it’s a little easier to track which cache is being measured. If the block size is 32KB, you’re in the L1 cache. If it’s 64KB, you’re into the L2, and so on.
Haswell-EP delivers nearly twice the L1 and L2 cache bandwidth without any increase in access latencies for those caches. There is a slight increase in L3 cache access times, but then the Xeon E5 v3 has more LLC partitions to traverse than its eight-core predecessors do.
At 2133 MT/s, Haswell-EP’s DDR4 memory doesn’t provide quite as quick a turnaround as DDR3 does. I’d expect that to change as DDR4 operating speeds ramp up. Notice the nice result above for the Haswell-E-based 5960X with DDR4-2800.
Some quick synthetic math tests
The folks at FinalWire have built some interesting micro-benchmarks into their AIDA64 system analysis software. They’ve tweaked several of these tests to make use of new instructions on the latest processors, including Haswell-EP. Of the results shown below, PhotoWorxx uses AVX2 (and falls back to AVX on Ivy Bridge, et al.), CPU Hash uses AVX (and XOP on Bulldozer/Piledriver), and FPU Julia and Mandel use AVX2 with FMA.
Here’s a nice look at the true potential throughput of the Haswell-EP hardware, provided a nicely vectorizable workload and the AVX2 instruction set extensions. Many of the applications we’re testing on the following pages don’t take full advantage of AVX2 yet, but once they do… yikes.
Power consumption and efficiency
The workload for this test is Cinebench, the scene-rendering benchmark whose raw performance results we’ll get into shortly. As you can see below, most of the actual work takes place very quickly, at the beginning of our test period.
Note that we’re testing two similar but not exactly identical workstations here by measuring power draw at the wall socket. (The same system got a CPU upgrade from the E5-2687W to the v2 version of the same during testing.)
Perhaps thanks to the new Xeons’ integrated voltage regulators and the switch to DDR4 memory, our E5 v3 workstation draws quite a bit less power at idle than the other systems. Only 79W of idle power for a system populated with dual processors and 128GB of memory spread across eight DIMMs is mighty frugal. The E5-2687W v3 box doesn’t use much more power at peak than its Ivy Bridge-EP forerunner, either.
One measure of power efficiency is to consider the power used over our entire test period, both while the systems were rendering the scene and after they had finished.
Perhaps our best measure of CPU power efficiency is task energy: the amount of energy used while rendering the scene. This measure rewards CPUs for finishing the job sooner, but it doesn’t account for power draw at idle.
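Task energy falls out of the power trace directly: integrate the watt samples over the time the job took. A minimal sketch, assuming evenly spaced samples and using a made-up trace in which the system works hard and then drops back to idle:

```python
def task_energy_joules(samples_w, interval_s=1.0):
    """Trapezoidal integration of a power trace (watts) into energy (joules)."""
    total = 0.0
    for a, b in zip(samples_w, samples_w[1:]):
        total += (a + b) / 2 * interval_s
    return total

trace = [350, 360, 355, 340, 120, 100, 100]  # hypothetical one-second samples
print(task_energy_joules(trace))             # 1500.0 J for this made-up trace
```

Because the integration stops when the job does, a chip that draws more watts but finishes much sooner can still post the lower task energy, which is exactly the trade-off this metric is meant to capture.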
Our Haswell-EP workstation requires substantially less energy to render this scene than the other two systems do. That’s worthwhile progress.
Because LuxMark uses OpenCL, we can use it to test both GPU and CPU performance—and even to compare performance across different processor types. OpenCL code is by nature parallelized and relies on a real-time compiler, so it should adapt well to new instructions. For instance, Intel and AMD offer integrated client drivers for OpenCL on x86 processors, and they both support AVX. The AMD APP driver even supports Bulldozer’s and Piledriver’s distinctive instructions, FMA4 and XOP. We’ve used the AMD APP ICD on all of the CPUs, since it’s currently the fastest ICD in every case.
I’d hoped one of the OpenCL ICDs would make use of the FMA instruction on Haswell-EP to achieve some really eye-popping speed increases in this test. Unfortunately, that’s not the case for one reason or another. I’ll keep an eye out for OpenCL ICD updates. Perhaps this workload could be further optimized for AVX2 and FMA in time.
The Cinebench benchmark is based on Maxon’s Cinema 4D rendering engine. This test runs with just a single thread and then with as many threads as CPU cores (or threads, in CPUs with multiple hardware threads per core) are available.
Neither Cinebench nor POV-Ray show us the sort of performance gains we’d expect from FMA, either. That said, the E5-2687W v3’s additional cores help it to outperform the older Xeons without any extra help from new instructions.
I’ve included MyriMatch here more as a cautionary statement than anything else. This application-based benchmark began having problems with performance scaling after we moved from Windows 8 to 8.1, likely due to some changes made to the Windows thread scheduler. Those problems manifest themselves at higher core counts and appear to be worst on the dual-socket systems with non-uniform memory access. We’ve reported the best scores for each Xeon system out of three runs, but the completion times for the benchmark varied widely. High-core-count, multi-socket systems like this have tremendous potential, but without careful tuning, even multithreaded applications may not be able to exploit it.
STARS Euler3d computational fluid dynamics
Euler3D tackles the difficult problem of simulating fluid dynamics. Like MyriMatch, it tends to be very memory-bandwidth intensive. You can read more about it right here.
Euler3D’s performance has long been sensitive to memory bandwidth, and the new Xeons have more of that precious commodity on tap. The result is a ~10% increase in throughput over Ivy Bridge-EP.
Compiling code in GCC
Our resident developer, Bruno Ferreira, helped put together this code compiling test. Qtbench tests the time required to compile the QT SDK using the GCC compiler. The number of jobs dispatched by the Qtbench script is configurable, and we set the number of threads to match the hardware thread count for each system.
Yep, compile times are nice and low on Haswell-EP. Developers, it’s time to fill out a requisition form.
x264 HD video encoding
Our x264 test involves one of the latest builds of the encoder with AVX2 and FMA support. To test, we encoded a one-minute, 1080p .m2ts video using the following options:
--profile high --preset medium --crf 18 --video-filter resize:1280,720 --force-cfr
The source video was obtained from a repository of stock videos on this website. We used the Samsung Earth from Above clip.
Handbrake HD video encoding
Our Handbrake test transcodes a two-and-a-half-minute 1080p H.264 source video into a smaller format defined by the program’s “iPhone & iPod Touch” preset.
Neither of our video encoding tests shows any big gains from Haswell-EP. In some cases, the E5-2687W v3’s combination of a higher core count and somewhat lower clock frequencies will limit performance, especially if an application leans heavily on a few main threads. That appears to be what’s happening in our x264 encoding test.
We’ll have to devise some new encoding workloads with high-quality 4K video soon. Perhaps higher-res source material will let us better harness all of the Xeons’ cores and threads.
TrueCrypt disk encryption
TrueCrypt supports acceleration via Intel’s AES-NI instructions, so encryption with the AES algorithm, in particular, should be very fast on the CPUs that support those instructions. We’ve also included results for another algorithm, Twofish, that isn’t accelerated via dedicated instructions.
7-Zip file compression and decompression
I wouldn’t fight you if you put one of these on my desk, mind you.
Let’s get a bit indulgent and see how today’s fastest workstation processors compare to older x86 CPUs of various classes. We can’t always re-test every CPU from one iteration of our test suite to the next, but there are some commonalities that carry over from generation to generation. We might as well try some inter-generational mash-ups.
Now, these comparisons won’t be as exact and pristine as our other scores. Our new test systems run Windows 8.1 instead of Windows 8 or 7, for instance, and have higher-density RAM and larger SSDs. We’re using some slightly different versions of POV-Ray, too. Still, scores in the benchmarks we selected shouldn’t vary too much based on those factors.
Our mash-up results come from several generations of CPU test suites, dating back to our Xeon X5680 review, our FX-8350 review from the fall of 2012, and our original desktop Haswell review from last year. Our recent desktop CPU reviews have contributed here.
Today’s brand-new Xeons achieve nearly twice the throughput of the Xeon X5680 from just a handful of years ago. Hard to fathom, almost—and my sense is that the power-efficiency gains from then to now are even larger than the performance improvements.
Another remarkable fact is the incredible dynamic range of the x86 processor ecosystem. I really wish we could get some numbers from Intel’s Avoton CPUs to include in here. Hmmmm…
This next set of results includes just one benchmark, but it takes us as far back as the Core 2 Duo and, yes, a chip derived from the Pentium 4: the Pentium Extreme Edition 840. Also present: dual-core versions of low-power CPUs from both Intel and AMD, the Atom D525 and the E-350 APU. We retired this original test suite after the 3960X review in the fall of 2011. We’ve now mashed it up with results from our Xeon X5680 review, our first desktop Haswell review, and from our latest crop of CPU reviews.
Also, ahem, never forget: in April of 2001, the Pentium III 800 rendered this same “chess2” POV-Ray scene in just under 24 minutes.
Intel continues to execute on its development roadmap like, well, like clockwork. I guess the whole tick-tock thing has worked out for them.
Haswell-EP-based Xeons offer measurable performance improvements across a range of workloads compared to the CPUs they succeed—and that’s true even without the broad availability of AVX2-ready applications. Our Xeon E5-2687W v3-based workstation proved to be quite a bit more energy-efficient in our Cinebench rendering test, too. Meanwhile, DDR4 memory looks to be living up to its billing by increasing delivered memory bandwidth and also contributing to our E5 v3 system’s frugal power draw at idle.
Looks like progress on all fronts to me.
As ever, you’ll need to be sure your application can take proper advantage of the power these CPUs have on tap, but if it can, the Xeon E5-2687W v3 will chew through it like nothing else.
As a reviewer, I’m having a hard time finding any flaws here. The one potential chink in these Xeons’ armor, at least for workstation use, may be that they’ve outstripped the demands of quite a few users. I suspect the lower-priced E5-1680 v3, which is a single-socket Xeon, will suffice for a whole lot of folks. That chip’s basic specs and performance will be very similar to the Core i7-5960X results you saw on the preceding pages.
But some of us will soak up as much computing power as Intel or anyone else can provide in a reasonable package. For those folks, the Xeon E5-2687W v3 offers an incredibly compelling solution that’s yet again a solid incremental improvement over last year’s model. I suspect the Haswell-EP Xeons may achieve more dramatic gains over Ivy Bridge-EP in server-class workloads, but that’s a question for another day.
Enjoy our work? Pay what you want to subscribe and support us.