Intel’s Xeon E5-2687W v3 processor reviewed

One of the funny things about Intel’s workstation- and server-class Xeon processors is that we kind of think we know what’s coming before each new generation arrives. For instance, the new generation of chips known as Haswell-EP is making its debut today, yet the Haswell microarchitecture has been shipping in client systems for over a year. The desktop derivative of this very silicon, Haswell-E, was introduced late last month, too.

What amazes me about the new Xeons, though, is how much more there is to them than one might have expected. Intel’s architects and designers have crammed formidable new technologies into these chips in order to allow them to scale up to large core counts and multiple sockets. The result may be the most impressive set of CPUs Intel has produced to date, with numbers for core count and throughput that pretty much boggle the mind. Read on to see what makes Haswell-EP different—and better.

A look at a Haswell-EP die.

The Haswell-EP family

The first thing one needs to know about Haswell-EP is that it’s not just a single chip, but a trio of chips. Intel has moved in recent years toward right-sizing its Xeon silicon for different products, and Haswell-EP takes that trend into new territory. Here are the three members of the family.

Code name   Cores  Threads  Last-level   Process  Estimated transistors  Die area
                            cache size   node     (millions)             (mm²)
Haswell-EP  8      16       20 MB        22 nm    2601                   354
Haswell-EP  12     24       30 MB        22 nm    3839                   484
Haswell-EP  18     36       45 MB        22 nm    5569                   662

All three chips are fabbed on Intel’s 22-nm process tech with tri-gate transistors, and they all share the same basic technological DNA. Intel has simply scaled them differently, with quite a bit of separation in terms of size and transistor count between the three options. The biggest of the bunch has a staggering 18 cores, 36 threads, and 45MB of L3 cache. To give you some perspective on this CPU’s size: at 662 mm², it’s substantially larger than even the biggest GPUs in the world. Nvidia’s GK110 is 555 mm², and AMD’s Hawaii GPU is 438 mm².

The prior generation of Xeons, code-named Ivy Bridge-EP, topped out at 12 cores, so Haswell-EP offers a 50% increase on that front. Haswell-EP is a “tock” in Intel’s so-called “tick-tock” development model, which means it brings a new CPU architecture to a familiar chip fabrication process. There’s quite a bit more to this new family than just a revised CPU microarchitecture, though. The entire platform has been reworked, as the diagram below summarizes.

An overview of what’s new in Haswell-EP. Source: Intel.

The changes really do begin with the transition to Haswell-class CPU cores. These are indeed the same basic cores used across Intel’s product portfolio, and by now, their virtues are well known. Through a combination of larger on-chip structures, more execution units, and smarter logic, the Haswell core increases its instruction throughput per clock by about 10% compared to Ivy Bridge before it. That number can go much higher with the use of the new AVX2 instruction set extensions, which have the potential to double vector throughput for both integer and floating-point data types.
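To put the AVX2 angle in concrete terms, here’s a minimal sketch of the kind of loop that benefits: a fused multiply-add over packed single-precision floats, eight elements per instruction. It’s purely illustrative (none of the benchmarks later in this review use this code), and it assumes a compiler invocation along the lines of gcc -O2 -mavx2 -mfma.

    /* Minimal sketch: an AVX2/FMA multiply-accumulate over float arrays.
       Illustrative only; a real kernel would also worry about alignment. */
    #include <immintrin.h>
    #include <stddef.h>

    /* y[i] += a * x[i], eight floats per iteration via 256-bit FMA */
    void saxpy_fma(float a, const float *x, float *y, size_t n)
    {
        __m256 va = _mm256_set1_ps(a);
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);
            __m256 vy = _mm256_loadu_ps(y + i);
            vy = _mm256_fmadd_ps(va, vx, vy);   /* one fused multiply-add per element */
            _mm256_storeu_ps(y + i, vy);
        }
        for (; i < n; i++)                       /* scalar tail for the leftovers */
            y[i] += a * x[i];
    }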

For servers in particular, the Haswell core has the potential to boost performance even further via the TSX instruction set extensions, which enable hardware lock elision and restricted transactional memory. The TSX instructions allow the hardware to shoulder much of the burden of making sure concurrent threads don’t cause problems for one another. Unfortunately, Intel discovered an erratum in its TSX implementation just prior to the release of Haswell-EP. As a result, the first systems based on this silicon have shipped with TSX disabled via microcode. Users may have the option to enable TSX in a system’s BIOS for development purposes, but doing so risks system instability. I’d expect Intel to produce a new stepping of Haswell-EP with the TSX erratum corrected, but we don’t yet have a clear timetable for such a move. The firm has hinted that TSX should be production-ready once the larger, multi-socket Haswell-EX parts arrive.
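For the curious, here’s roughly what the restricted transactional memory half of TSX looks like to a programmer. This is a minimal, hypothetical sketch using the RTM intrinsics with a simple spinlock as the fallback path; it isn’t production code, and it’s obviously moot on parts where the microcode has TSX switched off.

    /* Hypothetical sketch of lock elision with the RTM intrinsics
       (_xbegin/_xend/_xabort). Compile with -mrtm; requires hardware with
       TSX enabled. Not code from this review. */
    #include <immintrin.h>
    #include <stdatomic.h>

    static atomic_int fallback_lock;   /* 0 = free, 1 = held */
    static long shared_counter;

    static void lock_fallback(void)
    {
        while (atomic_exchange(&fallback_lock, 1))
            ;                                   /* spin until acquired */
    }

    static void unlock_fallback(void)
    {
        atomic_store(&fallback_lock, 0);
    }

    void increment(void)
    {
        if (_xbegin() == _XBEGIN_STARTED) {
            /* Standard elision pattern: read the fallback lock so a
               concurrent non-transactional holder forces an abort. */
            if (atomic_load(&fallback_lock))
                _xabort(0xff);
            shared_counter++;
            _xend();                            /* commit atomically */
        } else {
            lock_fallback();                    /* abort path, or TSX unavailable */
            shared_counter++;
            unlock_fallback();
        }
    }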

The new generation of Xeons has much to recommend it even without TSX. One of the most notable innovations in Haswell-era chips is the incorporation of voltage regulation circuitry directly onto the CPU die. The integrated VR, which Intel calls FIVR for “fully integrated voltage regulator,” allows for more efficient operation along several lines. Voltage transitions with FIVR can be much quicker than with an external VR, and FIVR has many more supply lines, allowing for fine-grained control of power delivery across the chip. The integrated VRs can also reduce the physical footprint of the CPU and its support circuitry.

The advent of FIVR grants Haswell-EP increased dynamic operating range versus its predecessors. For instance, each individual core on the processor can maintain its own power state, or P-state, with its own clock speed and supply voltage. In Ivy-E and earlier parts, all of the cores share a common frequency and voltage. This per-core P-state feature operates in the margins between idle (power is gated off individually to idle cores) and peak core utilization. Dropping a partially used core to an intermediate P-state via this mechanism can free up some thermal headroom for another, busier core to move to a higher frequency via Turbo—so the payoff ought to be more efficiency and performance.

We’ve seen this sort of independent core clocking run into problems in the past, notably in AMD’s Barcelona-based processors, but Intel’s architects are confident that Haswell-EP’s P-state transitions happen quickly enough and have few enough penalties to make this feature worthwhile. At present, per-core P-states are only being used in server- and workstation-class CPUs, not in client-focused products where immediate responsiveness is a top priority.

FIVR also offers a separate supply rail to the “uncore” complex that handles internal and external communication. As a result, the uncore is now clocked independently of the cores. It can run at higher frequencies when bandwidth is at a premium, even if the CPU cores are lightly utilized, and the situation can be reversed when I/O demands decrease and the CPU cores are fully engaged.

The Turbo Boost algorithm that controls the chip’s clocking behavior has grown a little more sophisticated, as well. One addition is what Intel calls “Energy Efficient Turbo.” The power control routine now monitors the activity of each core for throughput and stalls. If it decides that raising the clock speed of a core wouldn’t be energy efficient—presumably because the core’s present activity is gated by external factors or is somehow inefficient—the Turbo mechanism will choose not to raise the speed.

The final tweak to Haswell-EP’s dynamic operating strategy came as a surprise to me. As you can see illustrated on the right, Haswell-EP processors will operate at lower frequencies when processing AVX instructions. The fundamental reality here is that those 256-bit-wide AVX vector units are big, beefy hardware. They chew up a lot of power, and so they require some concessions. As with regular Turbo operation, the chip will seek the highest clock speed it can within its defined limits during AVX processing—those limits are just lower. Intel says the CPU will return to its regular, non-AVX operating mode one millisecond after the completion of the last AVX instruction in a stream.

Intel has defined the base and Turbo peak AVX frequencies for each of the new Xeons, and it says it will publish those speeds for all to see. As of now, though, I have yet to see AVX clock speeds listed in any of Intel’s pre-launch press information. I expect we’ll hear more on this front soon.

The move to Haswell cores has also brought with it some benefits for virtualization performance. The amount of time needed to enter and to exit a virtual machine has shrunk, as it has fairly consistently over time with successive CPU generations. The result should be a general increase in VM performance. Haswell-EP also allows the shadowing of VM control structures, which should improve the efficiency of VM management and the like.

Perhaps the niftiest bit of new tech for virtualization can apply to other uses, as well. Haswell-EP has hooks built in for the monitoring of cache allocation by thread. In a VM context, this capability should allow hypervisors to expose information that would let sysadmins identify “noisy neighbor” VMs that thrash the cache and may cause problems for other VMs on the same system. Once identified, these troublesome VMs could be moved or isolated in order to prevent cache contention problems from affecting other virtual machines.

 

Beyond the core

With chips of this scale, the CPU cores are only a small part of the overall picture. The glue that binds everything together is also incredibly complex—and is crucial for performance to scale up with core count. Have a look at this diagram of the 18-core Haswell-EP part in order to get a sense of things.

Source: Intel.

Like I said: complex. Intel has used a ring interconnect through multiple generations of Xeons now, but the bigger versions of Haswell-EP actually double the ring count to two fully-buffered rings per chip. Intel’s architects say this arrangement provides substantially more bandwidth, and they expect it to remain useful in the future when core counts rise above the current peak of 18.

The rings operate bidirectionally, and individual transactions always flow in the direction of the shortest path from point A to point B. The two rings are linked via a pair of buffered switches. These switches add a couple of cycles of latency to any transaction that must traverse one of them.
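A little arithmetic makes the shortest-path point clearer. On a bidirectional ring, the hop count between two stops is just the smaller of the clockwise and counter-clockwise distances. The toy function below is my own illustration of the idea, not Intel’s routing logic, and crossing to the other ring would add the buffered switch’s couple of cycles on top.

    /* Toy model: shortest-path hop count between two stops on a bidirectional
       ring. Stop numbering is an assumption for illustration, not Intel's. */
    int ring_hops(int from, int to, int stops_on_ring)
    {
        int cw  = (to - from + stops_on_ring) % stops_on_ring;  /* clockwise distance */
        int ccw = stops_on_ring - cw;                           /* counter-clockwise  */
        return cw < ccw ? cw : ccw;                             /* traffic takes the shorter way */
    }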

One thing that you’ll notice is that the ring, even in the big chip, is somewhat lopsided. There are eight cores on one ring and ten on the other. Each ring has its own memory controller, but only the left-side ring has access to PCIe connectivity and the QuickPath Interconnect to the other socket.

Source: Intel.

The 12-core chip seems even weirder, with half of one ring simply clipped off along with the six cores that used to reside there.

Such asymmetry just doesn’t seem natural at first glance. Could it present a problem where one thread executes more quickly than another by virtue of its assigned core’s location?

I think that would matter more if it weren’t for the fact that the chip is operating at billions of cycles per second, and anything happening via one of those off-chip interfaces is likely to be enormously slower. When I raised the issue of asymmetry with Intel’s architects, they pointed out that the latency for software-level thread switching is much, much higher than what happens in hardware. They further noted that Intel has had some degree of asymmetry in its CPUs since the advent of multi-core processors.

Also, notice that each core has 2.5MB of last-level cache associated with it. This cache is distributed across all cores, and its contents are shared, so that any core could potentially access data in any other cache partition. Thus, it’s unlikely that any single core would be the most advantageous one to use by virtue of its location on the die.

For those folks who prefer to have precise control over how threads execute, the Haswell-EP Xeons with more than 10 cores offer a strange and intriguing alternative known as cluster-on-die mode. The idea here is that each ring on the chip operates almost like its own NUMA node, as each CPU socket does in this class of system. Each ring becomes its own affinity domain. The cores on each ring only “see” the last-level cache associated with cores on that ring, and they’ll prefer to write data to memory via the local controller.

This mode will be selectable via system firmware, I believe, and is intended for use with applications that have already been tuned for NUMA operation. Intel says it’s possible to achieve single-digit-percentage performance gains with cluster-on-die mode. I expect the vast majority of folks to ignore this mode and take the “it just works” option instead.
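Tuning for cluster-on-die looks just like ordinary NUMA tuning: keep a thread and the memory it touches on the same node. Here’s a minimal sketch of that approach using libnuma on Linux. It’s an illustration of the general idea rather than anything we used for testing, and the node number is an arbitrary assumption.

    /* Minimal NUMA-placement sketch with libnuma (link with -lnuma). With
       cluster-on-die enabled, each ring appears as its own NUMA node, so the
       same calls that pin work to a socket also pin it to a ring. */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support\n");
            return 1;
        }

        int node = 0;                                  /* target node (assumption) */
        size_t bytes = 256UL * 1024 * 1024;

        numa_run_on_node(node);                        /* keep this thread on the node */
        double *buf = numa_alloc_onnode(bytes, node);  /* back it with local memory */
        if (!buf)
            return 1;

        for (size_t i = 0; i < bytes / sizeof(*buf); i++)
            buf[i] = 1.0;                              /* touches stay node-local */

        numa_free(buf, bytes);
        return 0;
    }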

Source: Intel.

The small die with “only” eight cores has just one ring, with all four memory channels connected to a single home agent. This chip is no doubt the basis for Haswell-E products like the Core i7-5960X.

With this amount of integration, Xeons are increasingly becoming almost entire systems on a chip. Thus, a new generation means little upgrades here and there across that system. Haswell-EP raises the bandwidth on the QPI socket-to-socket interconnect to 9.6GT/s, up from 8GT/s before. The PCIe 3.0 controllers have been enhanced with more buffers and credits, so they can achieve higher effective transfer rates and better tolerate latency.

The biggest change on this front, though, is the move to DDR4 memory. Each Haswell-EP socket has four memory channels, and those channels can talk to DDR4 modules at speeds of up to 2133 MT/s. That’s slightly faster than the 1866 MT/s peak of DDR3 with Ivy Bridge-EP, but the real benefits of DDR4 go beyond that. This memory type operates at lower voltage (1.2V standard), has smaller pages that require less activation power, and employs a collection of other measures to improve power efficiency. The cumulative savings, Intel estimates, are about two watts per DIMM at the wall socket.

DDR4 also operates at higher frequencies with more DIMMs present—up to 1600 MT/s on Haswell-EP with three DIMMs per channel. Going forward, DDR4 should enable even higher transfer rates and bit densities. Memory makers already have 3200 MT/s parts in the works, and Samsung is exploiting DDR4’s native support for die stacking to create high-performance 64GB DIMMs.

Naturally, with the integration of the voltage regulators and the change in memory types, Haswell-EP also brings with it a new socket type. Dubbed Socket R3, this new socket isn’t backward-compatible with prior Xeons at all, although it does have the same dimensions and attach points for coolers.

Accompanying Haswell-EP to market is an updated chipset—really just a single chip—with a richer complement of I/O ports. The chipset’s code name is Wellsburg, but today, it officially gets the more pedestrian name of C612. I suspect it’s the same chip known as the X99 in Haswell-E desktop systems. Wellsburg is much better endowed with high-speed connectivity than its predecessor; it sprouts 10 SATA 6Gbps ports and 14 USB ports, six of them USB 3.0-capable. The chipset’s nine PCIe lanes are still stuck at Gen2 transfer rates, but lane grouping into x2 and x4 configs is now supported.

The models

Intel is spinning the three Haswell-EP chips into a grand total of 29 different Xeon models. The new Xeons will be part of the E5 v3 family, whereas Ivy Bridge-EP chips are labeled E5 v2, and older Sandy Bridge-EP parts lack a trailing version number. There’s a wide array of new products, and here is a confusing—but potentially helpful—slide that Intel is using to map out the lineup.

The Xeon E5 v3 lineup. Source: Intel.

Prices range from $2,702 for the E5-2697 v3 to $213 for the E5-2603 v3. Well, that’s not the entire range. Tellingly, Intel isn’t divulging list prices for the top models, including the 18-core E5-2699 v3. I’m pretty sure that doesn’t mean it’s on discount.

Our attention today is focused primarily on workstation-class Xeons, specifically the 10-core Xeon E5-2687W v3, which we’ve tested against its two direct predecessors based on the Sandy Bridge-EP and Ivy Bridge-EP microarchitectures. Their specs look like so:

Model             Cores/   Base clock  Max Turbo    L3 cache  QPI       Memory    Memory type  TDP  Price
                  threads  (GHz)       clock (GHz)  (MB)      speed     channels  & max speed  (W)
Xeon E5-2687W     8/16     3.1         3.8          20        8.0 GT/s  4         DDR3-1600    150  $1,890
Xeon E5-2687W v2  8/16     3.4         4.0          25        8.0 GT/s  4         DDR3-1866    150  $2,112
Xeon E5-2687W v3  10/20    2.7/3.1     3.5          25        9.6 GT/s  4         DDR4-2133    160  $2,141

Note that there are two base frequencies listed for the E5-2687W v3. The base speed is 2.7GHz with AVX workloads and 3.1GHz without. The peak Turbo speed is 3.5GHz for both types of workloads, though.

At any rate, these Xeons are all gut-bustingly formidable processors, and they’re intended to drop into dual-socket systems where the core counts and memory channels will double. That’s a recipe for some almost ridiculously potent end-user systems. In fact, we have an example of just such a box on hand.

 

A big Boxx o’ badness

Above is the Boxx workstation that Intel supplied to us, wrapped around a pair of Xeon E5-2687W v3 processors, for testing and review. With 20 cores, 40 threads, 50MB of L3 cache, and eight channels of DDR4 memory with a total capacity of 128GB, this puppy is the most potent single-user system ever to find its way into Damage Labs. Regular high-end desktops are just a time slice on this thing.

And, minor miracle, its operation is whisper-quiet, unlike some workstations in this class. How did Boxx manage that feat?

Yep, snaking away from each socket are the hoses for a water cooler. Twin radiators evacuate heat from the Xeons with minimal noise.

Here’s the obligatory screenshot from Windows Task Manager, showing all 40 available cores and indicating 128GB of available RAM. And then there’s this…

Installed in one of the PCI Express slots is an Intel 400GB NVMe SSD, one of the fastest storage devices currently available. If you appreciate fast computers, well, this is among the fastest systems possible with today’s technology.

Here at TR, it’s apparently our mission to educate and—since we serve an audience of geeks—to disappoint. (Read the comments some time if you don’t believe me.) Our daily dose of disappointment comes in the various ways we didn’t test Intel’s new Xeons in the limited time available to us prior to this product launch. We’ve had to confine ourselves to workstation-class processors, although we’ve tested servers quite thoroughly in the past, because we couldn’t carve the time out of our schedule to get the latest SPEC benchmarks up and running on multiple boxes. Even our workstation-class testing is devoid of $40K applications like AutoCAD and the difficult-to-obtain data sets we’d need to test such things properly. Really, it’s a travesty of some sort.

Damage Labs does have a few tricks up its sleeve, and one of those is our ability to provide broad comparisons of x86 processors against one another. In a move that will surely risk angering the gods of product segmentation, we have provided, alongside our Xeon numbers, some benchmark results from CPUs stretching down to single-socket offerings that cost less than 80 bucks. The results for the lower-end CPUs are grayed out in the graphs on the following pages, since they’re not the primary focus of our attention. We’ve also included, later in the article, results from much older Xeons and Opterons from years past. All of it is probably a bit much, but perhaps you’ll find it entertaining.

Our testing methods

As usual, we ran each test at least three times and have reported the median result. Our test systems were configured like so:

Processor       Dual Xeon E5-2687W             Dual Xeon E5-2687W v2          Dual Xeon E5-2687W v3
Motherboard     Asus Z9PE-D8WS                 Asus Z9PE-D8WS                 Supermicro X10DAi
Chipset         Intel C602                     Intel C602                     Intel C610
Memory size     128 GB (16 DIMMs)              128 GB (16 DIMMs)              128 GB (16 DIMMs)
Memory type     Micron ECC DDR3 SDRAM          Micron ECC DDR3 SDRAM          Samsung ECC DDR4 SDRAM
Memory speed    1600 MT/s                      1866 MT/s                      2133 MT/s
Memory timings  11-11-11-28 1T                 13-13-13-32 1T                 15-15-15-36 1T
Storage         Kingston HyperX SH103S3        Kingston HyperX SH103S3        Intel DC S3500 Series
                240GB SSD                      240GB SSD                      240GB SSD
OS              Windows 8.1 Pro                Windows 8.1 Pro                Windows 8.1 Pro

Thanks to Asus, Boxx, Samsung, Micron, and Kingston for helping to outfit our test rigs with some of the finest hardware available. Thanks to Intel and AMD for providing the processors, as well, of course.

Some further notes on our testing methods:

  • The test systems’ Windows desktops were set at 1920×1080 in 32-bit color. Vertical refresh sync (vsync) was disabled in the graphics driver control panel.
  • We used a Yokogawa WT210 digital power meter to capture power use over a span of time. The meter reads power use at the wall socket, so it incorporates power use from the entire system—the CPU, motherboard, memory, graphics solution, hard drives, and anything else plugged into the power supply unit. (The monitor was plugged into a separate outlet.) We measured how each of our test systems used power across a set time period, during which time we rendered a scene with Cinebench.
  • After consulting with our readers, we’ve decided to enable Windows’ “Balanced” power profile for the bulk of our desktop processor tests, which means power-saving features like SpeedStep and Cool’n’Quiet are operating. (In the past, we only enabled these features for power consumption testing.) Our spot checks demonstrated to us that, typically, there’s no performance penalty for enabling these features on today’s CPUs. If there is a real-world penalty to enabling these features, well, we think that’s worthy of inclusion in our measurements, since the vast majority of desktop processors these days will spend their lives with these features enabled.

The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

 

Memory subsystem performance

Since we have a new chip architecture and a new memory type on the bench, let’s take a look at some directed memory tests before moving on to real-world applications.

The fancy plot above mainly looks at cache bandwidth. This test is multithreaded, so the numbers you see show the combined bandwidth from all of the L1 and L2 caches on each system. Since our Haswell-EPs have 20 L1 caches of 32KB each, we’re still in the L1 cache at the 512KB block size above. The next few points, up to 4MB, are hitting the L2 caches, and beyond that, up to 16MB, we’re into the L3.

Haswell-EP’s promised doubling of L1 and L2 cache bandwidth per core is on display in the plot above. The E5-2687W v3’s higher core count also plays a part in these results, but however you slice it, this is a massive increase in cache bandwidth.

Now, let’s look at what happens when we get into main memory.

We found that our usual version of the Stream bandwidth test fails to scale to 20 cores and 40 threads properly, so we’ve substituted AIDA’s memory tests, instead. Obviously, they have no such issue. The E5 v3’s higher-speed DDR4 memory clearly outperforms the two prior generations of Xeons with DDR3 memory, with delivered bandwidth of up to 123 GB/s in the memory read test.
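For reference, the general shape of a multithreaded read-bandwidth test is simple enough to sketch in a few lines of C with OpenMP. This isn’t AIDA’s code or our Stream build, just an illustration of what “delivered bandwidth” means here, and the working-set size is an arbitrary choice.

    /* Rough sketch of a multithreaded memory-read bandwidth test.
       Illustrative only; compile with something like: gcc -O2 -fopenmp bw.c */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t n = 512UL * 1024 * 1024 / sizeof(double);  /* 512MB working set (assumption) */
        double *a = malloc(n * sizeof(double));

        #pragma omp parallel for                           /* parallel first touch spreads */
        for (size_t i = 0; i < n; i++)                     /* pages across NUMA nodes      */
            a[i] = (double)i;

        double sum = 0.0;
        double t0 = omp_get_wtime();
        #pragma omp parallel for reduction(+:sum)          /* every hardware thread streams reads */
        for (size_t i = 0; i < n; i++)
            sum += a[i];
        double t1 = omp_get_wtime();

        printf("read bandwidth: %.1f GB/s (checksum %g)\n",
               (double)(n * sizeof(double)) / (t1 - t0) / 1e9, sum);
        free(a);
        return 0;
    }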

SiSoft has a nice write-up of this latency testing tool, for those who are interested. We used the “in-page random” access pattern to reduce the impact of prefetchers on our measurements. This test isn’t multithreaded, so it’s a little easier to track which cache is being measured. If the block size is 32KB, you’re in the L1 cache. If it’s 64KB, you’re into the L2, and so on.
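The latency test works on a different principle that’s worth sketching, too: build a chain of dependent pointers so that each load can’t start until the previous one finishes, then time the walk. The code below is my own illustration of that idea, not SiSoft’s; the 64KB buffer is sized to land in the L2, per the description above.

    /* Sketch of a pointer-chasing latency test: serially dependent loads defeat
       the prefetchers, so the time per step approximates the latency of whatever
       cache level the buffer fits in. Illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        size_t entries = (64 * 1024) / sizeof(size_t);   /* 64KB buffer: past L1, into L2 */
        size_t *chain = malloc(entries * sizeof(size_t));

        for (size_t i = 0; i < entries; i++)
            chain[i] = i;
        /* Sattolo's shuffle produces a single cycle, so the walk visits every
           entry in a hard-to-predict order. */
        for (size_t i = entries - 1; i > 0; i--) {
            size_t j = rand() % i;
            size_t tmp = chain[i]; chain[i] = chain[j]; chain[j] = tmp;
        }

        size_t idx = 0, steps = 100000000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t s = 0; s < steps; s++)
            idx = chain[idx];                            /* each load depends on the last */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.2f ns per load (sink: %zu)\n", ns / steps, idx);
        free(chain);
        return 0;
    }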

Haswell-EP delivers nearly twice the L1 and L2 cache bandwidth without any increase in access latencies for those caches. There is a slight increase in L3 cache access times, but the Xeon E5 v3 also has more LLC partitions to access than its eight-core predecessors do.

At 2133 MT/s, Haswell-EP’s DDR4 memory doesn’t provide quite as quick a turnaround as DDR3 does. I’d expect that to change as DDR4 operating speeds ramp up. Notice the nice result above for the Haswell-E-based 5960X with DDR4-2800.

Some quick synthetic math tests

The folks at FinalWire have built some interesting micro-benchmarks into their AIDA64 system analysis software. They’ve tweaked several of these tests to make use of new instructions on the latest processors, including Haswell-EP. Of the results shown below, PhotoWorxx uses AVX2 (and falls back to AVX on Ivy Bridge, et al.), CPU Hash uses AVX (and XOP on Bulldozer/Piledriver), and FPU Julia and Mandel use AVX2 with FMA.

Here’s a nice look at the true potential throughput of the Haswell-EP hardware, provided a nicely vectorizable workload and the AVX2 instruction set extensions. Many of the applications we’re testing on the following pages don’t take full advantage of AVX2 yet, but once they do… yikes.

 

Power consumption and efficiency

The workload for this test is Cinebench, the scene-rendering benchmark whose raw performance results we’ll get into shortly. As you can see below, most of the actual work takes place very quickly, at the beginning of our test period.

Note that we’re testing two similar but not exactly identical workstations here by measuring power draw at the wall socket. (The same system got a CPU upgrade from the E5-2687W to the v2 version of the same during testing.)

Perhaps thanks to the new Xeons’ integrated voltage regulators and the switch to DDR4 memory, our E5 v3 workstation draws quite a bit less power at idle than the other systems. Only 79W of idle power for a system populated with dual processors and 128GB of memory spread across eight memory channels is mighty frugal. The E5-2687W v3 box doesn’t use much more power at peak than its Ivy Bridge-EP forerunner, either.

One measure of power efficiency is to consider the energy used over our entire test period, both while the systems were rendering the scene and after they had finished.

Perhaps our best measure of CPU power efficiency is task energy: the amount of energy used while rendering the scene. This measure rewards CPUs for finishing the job sooner, but it doesn’t account for power draw at idle.
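To make the arithmetic behind these two measures concrete, here’s a trivial sketch that integrates per-second power readings into joules, once over the whole test period and once over just the portion spent rendering. The sample values in it are invented for illustration; they aren’t our measured numbers.

    /* Toy illustration of "whole-period energy" vs. "task energy" from
       per-second wall-socket power samples. All values below are made up. */
    #include <stdio.h>

    int main(void)
    {
        double watts[] = { 410, 405, 398, 402, 120, 118, 119, 117 };  /* 1 sample/sec */
        int samples = sizeof(watts) / sizeof(watts[0]);
        int render_done = 4;                       /* hypothetical: work finishes at t = 4s */

        double whole_period_j = 0.0, task_j = 0.0;
        for (int t = 0; t < samples; t++) {
            whole_period_j += watts[t] * 1.0;      /* watts x seconds = joules */
            if (t < render_done)
                task_j += watts[t] * 1.0;          /* only the time spent doing the work */
        }
        printf("whole period: %.0f J   task energy: %.0f J\n", whole_period_j, task_j);
        return 0;
    }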

Our Haswell-EP workstation requires substantially less energy to render this scene than the other two systems do. That’s worthwhile progress.

 

3D rendering

LuxMark

Because LuxMark uses OpenCL, we can use it to test both GPU and CPU performance—and even to compare performance across different processor types. OpenCL code is by nature parallelized and is compiled at runtime, so it should adapt well to new instructions. For instance, Intel and AMD offer installable client drivers (ICDs) for OpenCL on x86 processors, and both support AVX. The AMD APP driver even supports Bulldozer’s and Piledriver’s distinctive instructions, FMA4 and XOP. We’ve used the AMD APP ICD on all of the CPUs, since it’s currently the fastest ICD in every case.

I’d hoped one of the OpenCL ICDs would make use of the FMA instruction on Haswell-EP to achieve some really eye-popping speed increases in this test. Unfortunately, that’s not the case for one reason or another. I’ll keep an eye out for OpenCL ICD updates. Perhaps this workload could be further optimized for AVX2 and FMA in time.

Cinebench rendering

The Cinebench benchmark is based on Maxon’s Cinema 4D rendering engine. This test runs with just a single thread and then with as many threads as CPU cores (or threads, in CPUs with multiple hardware threads per core) are available.

POV-Ray rendering

Neither Cinebench nor POV-Ray show us the sort of performance gains we’d expect from FMA, either. That said, the E5-2687W v3’s additional cores help it to outperform the older Xeons without any extra help from new instructions.

Scientific computing

MyriMatch proteomics

MyriMatch is intended for use in proteomics, or the large-scale study of proteins. You can read more about it here.

I’ve included MyriMatch here more as a cautionary statement than anything else. This application-based benchmark began having problems with performance scaling after we moved from Windows 8 to 8.1, likely due to some changes made to the Windows thread scheduler. Those problems manifest themselves at higher core counts and appear to be worst on the dual-socket systems with non-uniform memory access. We’ve reported the best scores for each Xeon system out of three runs, but the completion times for the benchmark varied widely. High-core-count, multi-socket systems like this have tremendous potential, but without careful tuning, even multithreaded applications may not be able to exploit it.

STARS Euler3d computational fluid dynamics

Euler3D tackles the difficult problem of simulating fluid dynamics. Like MyriMatch, it tends to be very memory-bandwidth intensive. You can read more about it right here.

Euler3D’s performance has long been sensitive to memory bandwidth, and the new Xeons have more of that precious commodity on tap. The result is a ~10% increase in throughput over Ivy Bridge-EP.

 

Productivity

Compiling code in GCC

Our resident developer, Bruno Ferreira, helped put together this code compiling test. Qtbench tests the time required to compile the Qt SDK using the GCC compiler. The number of jobs dispatched by the Qtbench script is configurable, and we set the number of threads to match the hardware thread count for each system.

Yep, compile times are nice and low on Haswell-EP. Developers, it’s time to fill out a requisition form.

x264 HD video encoding

Our x264 test involves one of the latest builds of the encoder with AVX2 and FMA support. To test, we encoded a one-minute, 1080p .m2ts video using the following options:

--profile high --preset medium --crf 18 --video-filter resize:1280,720 --force-cfr

The source video was obtained from a repository of stock videos on this website. We used the Samsung Earth from Above clip.

Handbrake HD video encoding

Our Handbrake test transcodes a two-and-a-half-minute 1080p H.264 source video into a smaller format defined by the program’s “iPhone & iPod Touch” preset.

Neither of our video encoding tests shows any big gains from Haswell-EP. In some cases, the E5-2687W v3’s combination of a higher core count and somewhat lower clock frequencies will limit performance, especially if an application leans heavily on a few main threads. That appears to be what’s happening in our x264 encoding test.

We’ll have to devise some new encoding workloads with high-quality 4K video soon. Perhaps higher-res source material will let us better harness all of the Xeons’ cores and threads.

TrueCrypt disk encryption

TrueCrypt supports acceleration via Intel’s AES-NI instructions, so AES encryption, in particular, should be very fast on the CPUs that support those instructions. We’ve also included results for another algorithm, Twofish, that isn’t accelerated via dedicated instructions.
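For a sense of what that hardware acceleration looks like at the instruction level, here’s a bare-bones sketch of AES-128 block encryption built on the AES-NI intrinsics. It’s illustrative only: the round keys are placeholders, and it isn’t TrueCrypt’s actual implementation.

    /* Bare-bones sketch of AES-128 block encryption with the AES-NI intrinsics:
       one hardware instruction per round. Round keys are placeholders here; a
       real implementation derives them with the key schedule. Compile with -maes. */
    #include <wmmintrin.h>

    __m128i aes128_encrypt_block(__m128i block, const __m128i round_key[11])
    {
        block = _mm_xor_si128(block, round_key[0]);         /* initial AddRoundKey */
        for (int i = 1; i < 10; i++)
            block = _mm_aesenc_si128(block, round_key[i]);  /* rounds 1-9 */
        return _mm_aesenclast_si128(block, round_key[10]);  /* final round, no MixColumns */
    }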

7-Zip file compression and decompression

JavaScript performance

I’ve included these two client-class JavaScript tests as a reminder that not every workload will run best on many-core Xeon systems. Quite a few of our everyday interactions with computers rely on the performance of one or several key threads. In those cases, the higher-frequency Haswell quad cores Intel markets toward desktop systems can actually outperform these beefy Xeons—though, you know, not by much. Just realize that plopping a Xeon workstation onto the average guy’s desk won’t always improve his user experience.

I wouldn’t fight you if you put one of these on my desk, mind you.

 

Legacy comparisons

Let’s get a bit indulgent and see how today’s fastest workstation processors compare to older x86 CPUs of various classes. We can’t always re-test every CPU from one iteration of our test suite to the next, but there are some commonalities that carry over from generation to generation. We might as well try some inter-generational mash-ups.

Now, these comparisons won’t be as exact and pristine as our other scores. Our new test systems run Windows 8.1 instead of Windows 8 or 7, for instance, and have higher-density RAM and larger SSDs. We’re using some slightly different versions of POV-Ray, too. Still, scores in the benchmarks we selected shouldn’t vary too much based on those factors.

Our mash-up results come from several generations of CPU test suites, dating back to our Xeon X5680 review, our FX-8350 review from the fall of 2012, and our original desktop Haswell review from last year. Our recent desktop CPU reviews have contributed here.

3D rendering

Scientific computing

Productivity

Today’s brand-new Xeons achieve nearly twice the throughput of the Xeon X5680 from just a handful of years ago. Hard to fathom, almost—and my sense is that the power-efficiency gains from then to now are even larger than the performance improvements.

Another remarkable fact is the incredible dynamic range of the x86 processor ecosystem. I really wish we could get some numbers from Intel’s Avoton CPUs to include in here. Hmmmm…

This next set of results includes just one benchmark, but it takes us as far back as the Core 2 Duo and, yes, a chip derived from the Pentium 4: the Pentium Extreme Edition 840. Also present: dual-core versions of low-power CPUs from both Intel and AMD, the Atom D525 and the E-350 APU. We retired this original test suite after the 3960X review in the fall of 2011. We’ve now mashed it up with results from our Xeon X5680 review, our first desktop Haswell review, and from our latest crop of CPU reviews.

Also, ahem, never forget: in April of 2001, the Pentium III 800 rendered this same “chess2” POV-Ray scene in just under 24 minutes.

 

Conclusions

Intel continues to execute on its development roadmap like, well, like clockwork. I guess the whole tick-tock thing has worked out for them.

Haswell-EP-based Xeons offer measurable performance improvements across a range of workloads compared to the CPUs they succeed—and that’s true even without the broad availability of AVX2-ready applications. Our Xeon E5-2687W v3-based workstation proved to be quite a bit more energy-efficient in our Cinebench rendering test, too. Meanwhile, DDR4 memory looks to be living up to its billing by increasing delivered memory bandwidth and also contributing to our E5 v3 system’s frugal power draw at idle.

Looks like progress on all fronts to me.

As ever, you’ll need to be sure your application can take proper advantage of the power these CPUs have on tap, but if it can, the Xeon E5-2687W v3 will chew through it like nothing else.

As a reviewer, I’m having a hard time finding any flaws here. The one potential chink in these Xeons’ armor, at least for workstation use, may be that they’ve outstripped the demands of quite a few users. I suspect the lower-priced E5-1680 v3, which is a single-socket Xeon, will suffice for a whole lot of folks. That chip’s basic specs and performance will be very similar to the Core i7-5960X results you saw on the preceding pages.

But some of us will soak up as much computing power as Intel or anyone else can provide in a reasonable package. For those folks, the Xeon E5-2687W v3 offers an incredibly compelling solution that’s yet again a solid incremental improvement over last year’s model. I suspect the Haswell-EP Xeons may achieve more dramatic gains over Ivy Bridge-EP in server-class workloads, but that’s a question for another day.


Comments closed
    • kamikaziechameleon
    • 5 years ago

    WHAT, NO GAMING BENCHMARKS???

      • Krogoth
      • 5 years ago

      Because these chips will yield the same performance as their desktop counterparts, because games don’t take advantage of the extra threads. 😉

      • ronch
      • 5 years ago

      SSK, is that you?

    • ClickClick5
    • 5 years ago

    So that PIII is lookin’ sexy. 24 min vs 11 seconds? I can’t throw a meal into the microwave, go to the bathroom, pick up the mail and respond to a few emails in 11 seconds. That is just awful.

      • willmore
      • 5 years ago

      https://xkcd.com/303/

    • sschaem
    • 5 years ago

    Very nice review; especially appreciated the V2 vs V3 benchmarks.

    This all look very good for the skylake-E release 🙂

    • ronch
    • 5 years ago

    It looks to me like using a ring bus, even a very fast one, may not be the most elegant solution as we increase core count. Would a crossbar (i.e. star topology) be more efficient, given the same clocks and proper implementation?

      • intanjir
      • 5 years ago

      It depends on how you define efficient.

      In terms of wires required, a crossbar connecting N cores is extremely inefficient, as it requires roughly N-squared (n*(n-1)/2) connections, while a ring only requires N connections. For N=18, that’s 153 versus 18. Of course, the benefit of the crossbar is getting from any core to another only requires one hop, as opposed to the ring requiring something like N/4 hops on average, N/2 worst-case.

      But the ring is routable in silicon when you’re talking 10+ cores, and the crossbar is not, so it’s a bit of a theoretical discussion. 😉

    • slaimus
    • 5 years ago

    Unless you need the extra memory slots, wouldn’t a single 18 core chip fare better than 2 10 core chips, since you would not have to deal with the NUMA latency?

      • Milo Burke
      • 5 years ago

      But the 10-core chips will have higher clock frequencies, 3.1 GHz vs 2.3 GHz.

    • Klimax
    • 5 years ago

    For the full 18-core massacre, AnandTech has it partially covered. (Not fully tested yet.)
    http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-

    AVX can take power draw up to 450W... And the top SKU is, according to them, $4,115...

      • Krogoth
      • 5 years ago

      Wow, 18-core Haswell-EP is a monster and it is very efficient when fully loaded. Remember a single chip or two yield the same level of performance as several server boxes that use previous generation of 4c/6c Xeon chips.

    • Flatland_Spider
    • 5 years ago

    I would ask for hi-res shots of the hardware inside the Boxx, but some people might be offended by that sort of thing. 😉

    Anyway… Any chance some Xeon E3s could find their way to Damage Labs? It would be interesting to compare them with the E5s.

    • anotherengineer
    • 5 years ago

    “With 20 cores, 40 threads,”
    and below
    “Here’s the obligatory screenshot from Windows Task Manager, showing all 40 available cores ”

    Shouldn’t that be 40 threads?

    Pretty crazy workstation, I’m surprised you didn’t keep it and skip town 😉

    • Jigar
    • 5 years ago

    After watching these big boys run – the i7-5960X just peed in its pants …

    • blastdoor
    • 5 years ago

    Very impressive.

    It’s a darn shame AMD can’t compete with these guys so that we could get these things at a lower price. Even if AMD would just put 6 billion 28nm transistors worth of steamroller cores in an Opteron, it would be something.

    Such irony… if Bulldozer makes any sense at all, it makes sense for servers. Yet AMD lets Opteron languish on an out-of-date implementation of the Bulldozer concept, while pushing the latest and greatest into client products where this concept has no chance of competing against Core, no matter how well it’s implemented.

      • ronch
      • 5 years ago

      I’m sure AMD would looove to compete with Intel as fiercely as they once did and rake in the money, but we have to keep in mind that AMD doesn’t have nearly as much resources as Intel has, and the theoretical 6-billion transistor Opteron you’re suggesting would be humongous. Even using Piledriver cores built on 28nm (which is all they have access to at this point)…

      6 / 1.2 = 5 //Vishera has 1.2B transistors, so at 6B transistors, this chip will have 40 ‘cores’ or 20 modules and 80MB cache to feed them all.

      315 * 5 = 1,575mm^2

      (28 x 28) / (32 x 32) = 0.76. //assuming the 28nm node scales linearly compared with GF 32nm.

      0.76 x 1,575 = 1,197mm^2

      Even if we take out some cache and some bits and pieces we’d still have one gigantic piece of silicon sucking 1.6 gigawatts. And of course there’s the cost of building it…

      There were rumors that a single-die 16-core Opteron is in the works, though: http://www.kitguru.net/components/cpu/anton-shilov/amd-readies-native-16-core-chips-based-on-steamroller/

        • Klimax
        • 5 years ago

        From what I have read, that chip wouldn’t be manufacturable at all. (Limit of optics IIRC)

    • USAFTW
    • 5 years ago

    Now, this is what I would call shock and awe.

      • Ninjitsu
      • 5 years ago

      Now, this is what I would call hard-core.

        • USAFTW
        • 5 years ago

        The 20 core/40 thread part is impressive to say the least. More impressive, getting that big a die to yield, at all. But we’re talking Intel.
        I mean, how the hell do they get that type of a design to yield?

          • chuckula
          • 5 years ago

          The 600+ mm² size is possible for two major reasons:
          1. This part is being made on an extremely mature 22nm process that also happens to be Intel’s best yielding process in the entire history of the company (one reason that 14nm looks troubled in comparison).

          2. Intel can charge a metric crapton for each fully operational chip, and they can cut down many of the partially-operational chips, so they don’t need incredibly high yields to make money.

          • the
          • 5 years ago

          This is actually the second-largest chip Intel has manufactured. The largest is the Tukwila Itanium at 700 mm^2. That was a 65 nm chip which shipped after Intel started releasing 32 nm parts (needless to say, this Itanium design had some delays). Optical limits for manufacturing aren’t much larger than this.

          Similarly, the 18 core Haswell chip is being manufactured on a mature 22 nm process. It’ll be a 12 to 18 month wait before we see designs of similar magnitude reach the market on Intel’s 14 nm process.

            • Ninjitsu
            • 5 years ago

            Optical or optimal?

            • the
            • 5 years ago

            Optical due to the mask size and the lenses used in the lithography process. My recollection is that limit is ~750 mm^2 back in the 65 nm days. It has likely changed due to finer wavelengths of light being used in this process and the necessary optics to work at those wavelengths.

            • Ninjitsu
            • 5 years ago

            Ah, thanks!

    • ronch
    • 5 years ago

    Just a thought about the ring bus. It’s been around since 2010 with Intel’s stuff but I admit I haven’t given it much thought given how fast Intel’s stuff is compared to AMD’s. Now that I think about it, wouldn’t cores located further from things like the QPI or PCIe or DRAM controllers be at a disadvantage when accessing said resources? I mean, yeah, sure, some cores will be better off with PCIe access but worse off with memory, balancing things out between the cores, but still, that would mean each core’s performance will vary depending on which kind of workload it’s working on, and how such workload stresses the abovementioned resources. The ring bus may be frickin’ fast but in the world of microelectronics and microprocessors such latencies still merit attention, don’t they?

      • Ninjitsu
      • 5 years ago

      It’s possible the uncore (and possibly the OS kernel too) is aware of this, and schedules accordingly.

        • ronch
        • 5 years ago

        But how would the OS know to which core to issue the thread best?

          • Ninjitsu
          • 5 years ago

          I don’t actually know, man. Just speculating!

          The OS may be aware of which Core ID maps to which structure, etc. This info could be passed to the kernel via UEFI/BIOS (using CPUID like things).

          EDIT: My main suggestion was the uncore, though.

      • Maff
      • 5 years ago

      This is mostly answered on the 2nd page of this article, afaik:
      The 12-core chip seems even weirder, with half of one ring simply clipped off along with the six cores that used to reside there.

      Such asymmetry just doesn’t seem natural at first glance. Could it present a problem where one thread executes more quickly than another by virtue of its assigned core’s location?

      I think that would matter more if it weren’t for the fact that the chip is operating at billions of cycles per second, and anything happening via one of those off-chip interfaces is likely to be enormously slower. When I raised the issue of asymmetry with Intel’s architects, they pointed out that the latency for software-level thread switching is much, much higher than what happens in hardware. They further noted that Intel has had some degree of asymmetry in its CPUs since the advent of multi-core processors.

      So, while you are correct, apparently in the real world it doesn’t really matter as the latency differences are on a whole other scale compared to the scale the software is usually operating at latency wise.

        • the
        • 5 years ago

        This is actually a bit of a non-answer from Intel’s engineers. Case in point: a running thread that needs access to memory from both on-die controllers. There is no optimal solution here, as some of the memory traffic will have to come across the bridging switch at distinctly higher latencies. The difference in latencies will be relatively small, since the external DRAM access takes up most of the absolute time, but it is there and should be measurable.

        • ronch
        • 5 years ago

        I considered the explanation on Page 2, but if you read my post, I was more curious about what goes on at the level of small transistors and nanoseconds. We wouldn’t notice it, but their simulations would show some performance differences between the cores. And yes, there’s been asymmetry in Intel’s designs since multi-core came out. Core 2 had some parts that looked asymmetric, but here we see it more than ever.

    • Deanjo
    • 5 years ago

    “TrueCrypt disk encryption” … Why bench a product that even the developers tell you to avoid and move away from? At the very least, a note should accompany the benchmark noting that the developers do not recommend using TrueCrypt anymore.

      • UberGerbil
      • 5 years ago

      The test has been in TR’s suite for a long time now, and provides another basis for comparison. The actual applicability of this particular product doesn’t really matter. How many people are buying high-dollar workstations specifically to compute Julia or Mandelbrot sets? Yet the benches based on those calculations are interesting, both because of the long tail of past CPUs with benchmark results in those tests, and because those results are suggestive of performance in similar kinds of code. Truecrypt results are suggestive of results from code that uses AES-NI specifically, nothing more. It’s hardly an endorsement of the product; and in any case, who really would be served by the warning you’re demanding? People who read reviews of server/workstation components, specify those components in high-dollar builds, but are completely oblivious to the widely-publicized problems with TrueCrypt?

        • Deanjo
        • 5 years ago

        There are several other options to benchmark aes-ni. Again at the very least, a warning should accompany the results. I don’t think you would like to see TR start using compression/decompression benchmarks of your favourite archive program if on decompression the archive was found to be corrupt.

          • w76
          • 5 years ago

          7zip has, in fact, fixed bugs that can lead to corrupt files since the version that most review sites use (9.20, the last stable release from 2010), plus at least a couple edge cases where various bits of metadata or NTFS data would get stripped. Thanks for proving the point that the point of it all is widely comparable results AND a tradeoff given the fact that these reviewers are humans with limited time on their hands.

          The average user would probably be wise to use the alpha; fewer bugs at this point, more features, slightly enhanced performance. But that’s not the point for benchmark comparisons.

            • Deanjo
            • 5 years ago

            So because many sites use shoddy benchmarking practices, TR should follow suit?

            I also suspect there are issues with the way they utilize GCC to benchmark the Qt compile, but without seeing what they are using for a command line it is hard to verify. This alone should send alarm bells ringing for anyone who has benchmarked GCC extensively over the years.

            “The number of jobs dispatched by the Qtbench script is configurable, and we set the number of threads to match the hardware thread count for each system.”

            Typically, using -j n+1 will yield more consistent results and keep the CPU running at its fullest capability. Matching the job count to the thread count doesn’t always keep the CPU at 100% utilization. Of course, there can also be I/O differences between the systems that can skew results.

            I only hope that their compile script is set to build for a specified architecture as well. Otherwise GCC will use runtime architecture and feature detection, leading to very different code compilation paths, again rendering comparison of the values useless.

    • ptsant
    • 5 years ago

    Nice review. Progress on all fronts. Not spectacular, but progress nonetheless.

    The E5-1680v3 is the product I need for my bioinformatics research and I suspect it’s the best buy for most users. In fact I think it’s probably a more intelligent buy than the i7-5XXX if you don’t care to overclock. I wish the price was a little lower.

      • culotso
      • 5 years ago

      I want an E5-1640v3 — between the 1630v3 and the 1650v3 there seems a nice spot for a 5820k-based chip to sit. Maybe at a ~<$400 price range. ECC and 16gb dimms. I crave!

    • ronch
    • 5 years ago

    1. I think Zambezi and Vishera also support variable clock speeds between the modules. I’m not sure anybody has bothered to look into whether or not AMD has fixed the issues they encountered with Barcelona. One thing I find interesting though, is that I know AMD disabled this feature with Deneb but my Phenom II X3’s box says each core can dynamically and independently clock depending on its workload.

    2. Given how AMD doesn’t have TSX in their own products and how their proposed Advanced Synchronization Facility extensions haven’t found their way into finished products, I see no reason why Intel would rush to fix TSX, or be concerned that people would change their minds about buying their chips because of it. Take it or leave it, folks.

    3. Clocking lower when running AVX? This didn’t happen with prior AVX-equipped chips. So Intel had to do this to meet the TDP rating, perhaps? Why don’t they just admit that they need to raise TDPs? Eyebrow-raising, if you ask me.

    4. Two rings joined by buffer switches. Just like the subway. I like it.

    5. Why not 9 cores per ring? That would solve the asymmetry.

    6. Oh I get it. The ring with fewer cores contains the PCIe and QPI.

    7. But then, the 12-core … sure, no problem with it, they say, but it sure ain’t elegant..

    8. No prices for the fully-enabled 18-core models. Perhaps there are still some yield problems with such huge chips despite Intel’s proven 22nm node?

    9. Good luck, AMD.

      • UberGerbil
      • 5 years ago

      2. TSX has to be considered purely an experimental or “preview” feature at this point, even before the errata showed up. Intel didn’t even make a point of turning it on in every model, which is what you’d do if you were trying to build an installed base for the future. But despite that, the Intel hype machine couldn’t stop itself from beating the drum for it just a bit, and that makes it a black eye for them when it turns out to be so broken that they have to default to turning it off. They need to fix it quickly for PR reasons alone; the fact that almost nobody was going to use it for production code is beside the point. And from a long-term perspective, they do need to get a working version into the hands of the people who were going to play with it, because that’s the only way they’re going to be able to get the feedback Intel needs to develop it further. (I also have a theory that Intel itself plans to use it as the basis for doing some limited speculative multithreading in some future design, in which case they’d obviously need to have it working reliably.)

      3. I suspect this is a temporary oddity that will vanish in later designs. An alternative way to look at it is that the clock and power ramping in Sandy and Ivy was so crude that they were unable to specify different frequencies based on the instruction mix. Perhaps if they’d had the level of control they have now, we would’ve seen this done there too. But AVX2 — specifically high-throughput FMA on 256-bit operands — is also significantly more demanding of CPU resources, so maybe not.

      5./6. Exactly. There really isn’t any asymmetry from the point of view of the rings. Each has 13 stations: the ring with 8 CPU stations has the other two consumed by the QPI and PCI connections.

        • ronch
        • 5 years ago

        2. Well, if TSX was experimental Intel should have been crystal clear about it when Haswell came out. As for having TSX disabled in certain SKUs, it doesn’t mean a thing with regard to TSX as a working/experimental feature or not, it simply echoes Intel’s fondness for product segmentation.

          5./6. Still, even with an equal number of stations, I would hazard a guess that the ring containing the PCIe and QPI nodes must be busier, given how these two nodes service all the cores on the die.

          • UberGerbil
          • 5 years ago

          Everything involving transactional memory is experimental: that’s the state of the art at the moment. Intel didn’t have to be crystal clear about it because everybody that cares about it already knows that (except maybe Intel’s marketing people, and the folks who listen to them). Virtually all the work done to this point is in academia, and mostly on simulated hardware; most of the commercial hardware implementations have either been hidden (Azul), failures (Rock), theoretical (AMD) or not widely available (Blue Gene/Q). Intel’s implementation is the first to actually ship in an off-the-shelf processor that somebody without deep pockets might actually be able to buy. But that means that everybody is starting from near zero and learning as they go — and that includes Intel. Which is why this initial TSX implementation is so limited: rather than making significant architectural decisions without understanding the long-term consequences, and building an elaborate design with hidden limitations they might regret later (a problem that Intel is probably more familiar with than anyone, thanks to the x86 ISA) they decided to do a very basic skeleton and then see what that turns up.

      • dragosmp
      • 5 years ago

      1. Yes – it works on Phenom II, but not as with the v3 Xeon. On the Phenom it’s only clock scaling; the voltage for all cores corresponds to the voltage of the core at the maximum P-state. Independent clock scaling without independent voltage scaling brings less benefit than going from clock-scaling-only to per-core DVFS.

        • ronch
        • 5 years ago

        Er, I’m talking about independent core clocks in Phenom II, not power gating or separate power planes for each core (although IIRC the Phenom II has separate power domains for the uncore, as Intel calls it).

        According to AnandTech’s Phenom II X4 920 and 940 review (http://www.anandtech.com/show/2702/6): “Phenom II fixes this by not allowing individual cores to run at clock speeds independently of one another”

      • the
      • 5 years ago

      1) My Opteron 6376 box scales dynamically per core. The real power-saving feature is dynamic voltage scaling, which I haven’t looked into. At the very least, each die in the package can run at a different voltage; I’m just not sure about modules on the same package.

      2) TSX is important for scalability at the high end. Consumer Haswell was supposed to be for development, Haswell-EP for staging/testing of the new code, and Haswell-EX to be ready when everything has been validated and put into production. It is a bigger deal than you’re giving it credit for in development circles. It is only this circle, though, that will actively delay Haswell-EP purchasing until a stepping with fixed errata reaches the market. Everyone else will still buy these chips because the rest of the design functions (and that’s what all current code can use anyway). The TSX bug is likely why we’re getting the 18-core chip in the EP socket. These chips are supposed to be for Haswell-EX sockets, but with the bug Intel isn’t going to ship them in that market segment. So this is likely loss recovery to a degree.

      3) Not only did Intel have to lower AVX clock speeds, but they also raised the TDP compared to Ivy Bridge-EP. Turbo speeds are really telling, as every core is rated for that max speed; there just isn’t enough power to have them all running at ~3.6 GHz. I genuinely wonder how much power the 18-core chip would consume if the base clock was raised to 3.6 GHz without Turbo and running an AVX-heavy workload.

      4) 5) & 6) The buffer switches actually hinder performance. Intel’s ring topology isn’t scaling that well; with too many hops, latency around the full rings becomes an issue. These buffer switches are going to have to move a lot of traffic, especially since the QPI controller sits on only one side. Having two memory controllers, one on each ring, means that loads that cross over to the other cluster on the same die will have distinctly higher latency. Ultimately they need to take a lesson from the GPU designers and use a crossbar. ATI quickly learned that a ring bus has its downsides (see the Radeon HD 2000 & 3000 series).

      7) The 12-core chip is genuinely the oddity. It would have made sense to move the PCIe controllers to the ring with fewer cores in an attempt to equalize the number of hops on each ring. There is a case for keeping the QPI controller on the ring with more cores, as that is where most of the cache is for coherency purposes. Like the 18-core die, it would have been better to include a QPI controller on each side.

      8) Prices are available elsewhere: $4,115 for the E5-2699 v3 and $3,226 for the E5-2698 v3. And there are yield problems at this level. It isn’t that there are problems at 22 nm; rather, there are just problems manufacturing a 662 mm² die regardless of what process is used. That thing is huge. Only the quad-core Itanium 9100 series is larger, at 699 mm².

      9) Luck will have nothing to do with anything here. AMD has effectively given up in this segment without even a Steamroller revision. There doesn’t look to be a new platform next year based around Excavator, either. It’ll be 2016 before we see AMD return to this part of the server market. In the meantime they’ll be around in the low-end segment flogging ARM-based Opterons for dense and/or ultra-low-power designs. HSA-enabled Opterons may be interesting to the HPC crowd, but that’s a rather niche slice of overall volume.

        • ronch
        • 5 years ago

        1. Erm, I was talking strictly about individual per-core clock scaling. Phenom II didn’t have voltage scaling, IIRC, and it wasn’t until Bulldozer that AMD was able to implement power gating and dynamic voltage scaling, AFAIK.

        2. I’m not discrediting TSX. Yes, it’s important for future scalability as more and more cores are added. I was just pointing out there’s less pressure for Intel to fix it given how AMD isn’t being very competitive. Then again, Xeon’s real competitors these days are from the likes of IBM and SPARC. AMD’s fallen behind quite a bit.

        5./6. Yes, I’d imagine those buffer switches take their toll on inter-core latencies, and while that ring bus looked cool when SB came out, it does have its drawbacks. I suppose that’s why AMD has stuck with crossbars. I hope they weigh their options well in their next-gen cores.

        9. Of course luck doesn’t have anything to do with AMD being able to pull off a strong x86 core by 2016 short of Intel engineers leaving the company all at once for some reason. It takes Herculean effort to pull off these things.

      • Maff
      • 5 years ago

      Regarding point 3:
      When one looks at it from a “how much silicon is powered” point of view, the lower AVX2 clocks make just as much sense as lower clocks with more cores active. If you have to switch on vast amounts of silicon per core in order to run those AVX2 instructions, the logical result is more power being used. To offset that, you need to lower the clocks a bit. This still results in a net gain of processing power, however, as in the best case twice the amount of work is being done.

      I think it actually makes more sense when one looks at it the other way around. Previously, the clocks and turbo clocks were tuned to keep power use in check in the worst-case scenario, which limited the clocks in every other scenario: even workloads that drew less power had to be clocked conservatively in case an instruction came along that pushed the CPU over the line. This time, they took their most power-hungry instructions and gave them their own power profile so the rest of the instructions could run at higher clocks. That gives you the best of both worlds (in the ideal case), and from a certain standpoint it isn’t much different from dropping your clocks under multi-core loads versus running higher clocks under single-core loads.
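      To make that “vast amounts of silicon” idea concrete, here is a small, hypothetical C loop using AVX2 intrinsics (the function name and the -mavx2 build flag are the usual gcc/clang conventions, nothing from the article): each _mm256_add_epi32 retires eight 32-bit additions in a single instruction, which is exactly the kind of wide-datapath activity that gets its own power profile.

          #include <immintrin.h>   /* AVX2 intrinsics; build with -mavx2 */
          #include <stddef.h>

          /* Add two int arrays, eight 32-bit lanes per instruction. */
          void add_arrays_avx2(int *dst, const int *a, const int *b, size_t n)
          {
              size_t i = 0;
              for (; i + 8 <= n; i += 8) {
                  __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
                  __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
                  _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi32(va, vb));
              }
              for (; i < n; i++)               /* scalar tail for the leftovers */
                  dst[i] = a[i] + b[i];
          }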

        • ronch
        • 5 years ago

        I know what you mean, but it’s the first time a CPU has had to scale back its clocks because of particular instructions. That’s just weird. Next thing you know, CPUs will be scaling back to different degrees for every x86 instruction out there, or boosting for the simpler ones.

        On the other hand, we can look at this as Intel labeling these chips with base clock specs that are [u<]actually or essentially[/u<] 'turbo' clocks, and when things get too hot the chips will need to drop out of the turbo zone to some lower clocks.

          • Ninjitsu
          • 5 years ago

          [quote<] On the other hand, we can look at this as Intel labeling these chips with base clock specs that are actually or essentially 'turbo' clocks, and when things get too hot the chips will need to drop out of the turbo zone to some lower clocks. [/quote<] I don't think we can look at it that way. It's simply that particular very wide instructions have a different power profile to keep total power consumption in check. Just like the IGP has a different power profile.

            • ronch
            • 5 years ago

            Nope. Not excused. What about GPUs, then? Remember when people bashed the R9 290X because it couldn’t maintain its advertised base clocks? I would imagine GPUs churn through more activity and crunch far more numbers than AVX ever will, and even then it was just because of the inadequate cooling on the reference card. In the case of these Xeons, it’s not even due to poor cooling or poor power delivery; it’s just an excuse for being unable to maintain the advertised base clock speeds.

            Sorry, gentlemen, I don’t think that’s acceptable. Or perhaps you’re OK with dialing clocks down as well when the scheduler or caches or decoders get to 100% activity? Those are big parts of the chips too, aren’t they?

            • Maff
            • 5 years ago

            Well, I’m for as much transparency as possible, which in general means you want the most consistent clocks possible. So I agree with you on that.

            On the other hand, I also want my stuff to run as fast as physically possible, which apparently at this scale means getting even more granular with clocks and power. I guess it’s inevitable that this sort of thing becomes more common. It doesn’t get any less confusing for outsiders, though.

            • Klimax
            • 5 years ago

            GPUs are quite different case. (Different set of trade-offs for different targets)

            And AVX even under new regime can consume about 150W.

            Price of massive general CPU.

      • Wirko
      • 5 years ago

      4. thru 7.: But it’s funny, Intel’s approach looks a lot like “please break off the chip along one of the two perforated lines”.

      8. Ark has all the prices, and the 18-core model is as expensive as you’d expect, $229 per core.

      3. I don’t quite understand this AVX magic either. If a core is running at its max turbo speed and it encounters an AVX instruction, will it stop, lower the frequency and continue? It takes some time for the frequency to change and stabilize, no matter if it’s upwards or downwards.

        • Klimax
        • 5 years ago

        Re 3: Looks that way. Considering AVX can consume up to 350-450W… (AnandTech’s test; they didn’t know how much of that was the cooling.)

          • chuckula
          • 5 years ago

          That’s 350 – 450W on a dual-socket system BTW, so divide accordingly for an individual chip.

            • Klimax
            • 5 years ago

            Still a lot.

            • Wirko
            • 5 years ago

            10-12W per core, fully loaded – is this a lot?

            • Klimax
            • 5 years ago

            In comparison to the rest of the (non-AVX) workloads.

        • Ninjitsu
        • 5 years ago

        [quote<] It takes some time for the frequency to change and stabilize, no matter if it's upwards or downwards. [/quote<] I read that 1 ms was the time required; if I can find that source, I'll update this.

          • jihadjoe
          • 5 years ago

          My guess is the core will stay at or near max turbo if it’s just one or a couple of AVX instructions, then slowly clocks down to the lower speed given a steady stream of AVX.

          Considering power management is controlled by a completely separate 486-class CPU in there, I take it the design is somewhat reactive and dynamically adjusts clocks while monitoring power and temp changes in the main cores.

            • Ninjitsu
            • 5 years ago

            [quote<] Considering power management is controlled by a completely separate 486-class CPU in there [/quote<] That boggles my mind, for some reason. CPU-ception!

            • jihadjoe
            • 5 years ago

            It actually started with Nehalem, and is called the PCU (Power Control Unit).

            [url<]http://www.anandtech.com/show/2594/12[/url<] [quote<]Nehalem’s architects spent over 1 million transistors on including a microcontroller on-die called the Power Control Unit (PCU). That’s around the transistor budget of Intel’s 486 microprocessor, just spent on managing power. The PCU has its own embedded firmware and takes inputs on temperature, current, power and OS requests. [/quote<] IIRC there was at one point a video that was sort of flying through an Intel chip architecture (can't remember if this was for Nehalem or Sandy), but it shows the ring interconnect, and clearly referred to the PCU as an embedded 486 class processor.

    • Buzzard44
    • 5 years ago

    Yawn. Wake me at $20/core.

      • ronch
      • 5 years ago

      If that’s the case we may never wake you up.

      • ptsant
      • 5 years ago

      Well, you can buy a quad core for $80, but it won’t be a Xeon. More like an Athlon X4 (Kaveri) or an AM1 processor ($60 or lower!).

        • Ninjitsu
        • 5 years ago

        Yeah, Silvermont Atoms will be cheaper than that, too. 😀

    • Chrispy_
    • 5 years ago

    Can it run Crysis?

      • mnecaise
      • 5 years ago

      I think the question is becoming, “How many simultaneous instances of Crysis can it run?”

      • MadManOriginal
      • 5 years ago

      I tried looking to see what graphics adaptor that Supermicro board has. I couldn’t find it on the spec page, and it doesn’t seem to have a local video output port, so I think the answer is “not at all without a graphics card.”

        • moose17145
        • 5 years ago

        So…. we are still waiting on a game that can play Crysis then?

      • ronch
      • 5 years ago

      Sure, but you’d have no money left to pay for a copy of Crysis after you buy one of these.

      • Freon
      • 5 years ago

      It can probably run Crysis via Minecraft logic.

    • Krogoth
    • 5 years ago

    Pretty exciting stuff for the enterprise world.

    I more curious to see how 18-core behemoth scales.

      • LoneWolf15
      • 5 years ago

      B/c someone had to say it – Krogoth was impressed.

        • NeelyCam
        • 5 years ago

        [quote<]"I more curious to see..."[/quote<] This makes me feel like he wasn't [i<]that[/i<] impressed...

          • BIF
          • 5 years ago

          Oh sure he was.

    • guardianl
    • 5 years ago

    Wonderful to see the long term comparison again.

    The Pentium EE 840 came out just about 10 years ago at a price of $999 ($1,218.70 inflation-adjusted). Conveniently, the i7-5960X costs $999. POV-Ray performance increased 15.6x!

    Unfortunately, if we look back four years to the Sandy Bridge 3960X for $999 ($1,058.12 inflation-adjusted), the performance gain is just 1.4x.

    That’s not even limited by single-threaded performance; it’s rendering, which is embarrassingly parallel. Xeons are doing a little better, but the prices are all over the map, so it’s harder to make a direct comparison.
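    Putting those two figures on the same footing with a quick back-of-the-envelope annualization (using the spans as quoted above):

    \[ 15.6^{1/10} \approx 1.32 \qquad\text{vs.}\qquad 1.4^{1/4} \approx 1.09 \]

    i.e., roughly 32% per year over the decade versus about 9% per year since the 3960X.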

    • the
    • 5 years ago

    Possible error on page 3: the table of system configurations lists an “Intel DC S3500 Series 240GB SSD” as the system storage, but the picture of the SSD in the system clearly reads DC P3700, which is a different beast.

      • Damage
      • 5 years ago

      I didn’t use the P3700 for testing. Maybe later. 🙂

    • MadManOriginal
    • 5 years ago

    I was really looking forward to a value scatter plot, but I clicked to the last page and found no such thing awaiting me. 🙁

      • Milo Burke
      • 5 years ago

      No scatter plot? Who are you, Damage, and what have you done with Scott Wasson?

    • esc_in_ks
    • 5 years ago

    We’ve been playing with pre-release E5 v3 chips for the past month or so (2667v3, 2687Wv3, and 2697v3). Now that the embargo has lifted, I have to say that 56 threads on the 2697v3 x 2 really tears through the compiles.

    One thing we noticed was that the 2687W v3 was actually slightly slower on our workload than the 2687W v2, because we can’t use the extra two cores per socket. The “W”orkstation part has been our go-to part previously, but not this time around; we’ll be on the 2667 v3.

    As Ferris would say, “If you have the means, I highly recommend picking one up.”

    [ Great to see Tech Report talking about workstation/server hardware. (You do have readers that make purchasing decisions for this kind of stuff!) ]

      • Wirko
      • 5 years ago

      [quote<]Great to see Tech Report talking about workstation/server hardware.[/quote<] Yeah, I sometimes feel like a lone non-gamer here at TR, and I'm glad to see stuff reviewed here that's mostly useless to gamers.

        • willmore
        • 5 years ago

        You’re not alone.

          • Klimax
          • 5 years ago

          Definitely. (Video and C++, but also games :D)

        • Ninjitsu
        • 5 years ago

        Wish they’d also test prosumer software, though.

        (I mean, compare the performance of video editing software, for example).

      • BIF
      • 5 years ago

      me too!

    • geekl33tgamer
    • 5 years ago

    Just had a nerdgasm over that Task Manager screenshot.

      • Ryu Connor
      • 5 years ago

      [url=http://blogs.msdn.com/b/b8/archive/2011/10/27/using-task-manager-with-64-logical-processors.aspx<]Wait till they get a load of me.[/url<]

    • ColeLT1
    • 5 years ago

    I’ll be building 3x whitebox dual-proc servers with these sometime this month. Shooting for 1-2TB of SSD storage, 64GB of RAM, and 2x 8-core; will post pics in the forums.

    • divide_by_zero
    • 5 years ago

    I sure wish I had any sort of workload that could justify running on this beast.

    Even my 55xx Xeons still hold up fine with a few VMs, and quicksync on my i5 is probably a better option for video transcoding than these Xeons would be.

    Still, awesome tech, and I’m happy to see TR reviewing more workstation/server hardware!

      • Kreshna Aryaguna Nurzaman
      • 5 years ago

      Virtual machines?

      • stdRaichu
      • 5 years ago

      [quote<]quicksync on my i5 is probably a better option for video transcoding than these Xeons would be[/quote<]

      Well, it depends on what sort of transcoding you're doing, I suppose. I'm still running a 2600K, which I use for my x264 encodes of DVDs and BDs, but I've tried IQS via my laptop, and whilst IQS is still hands-down the best hardware encoder out there, x264 is still better for quality IME, especially if you want to target lower bitrates and are prepared to do multi-pass encodes (although the quality difference for those is marginal). As such, I've been keen on getting an affordable 6- or 8-core.

      Case in point for me was the BD release of The French Connection in its fantastically grainy and almost monochrome palette; IQS just seems to throw most of the grain away, leaving a homogenised blurry mess, whilst x264 manages to preserve it extremely well even at 4000kb/s.

      On the other hand, if you're doing stuff like Twitch streaming or just V(ideo)oIP, then IQS is ideal, as you don't necessarily need the extra/archival quality that software encoders are capable of providing.

      As an aside, as someone who watches Twitch but doesn't actually upload anything themselves: I note that OBS has support for both x264 and IQS encoding; is there any apparent difference in quality between the two encoders there?

        • divide_by_zero
        • 5 years ago

        My transcoding is generally done with MCE Buddy, which uses Handbrake to scan recorded TV shows, rip out the commercials (with okay-ish accuracy), and then re-encode the massive MPEG-2 files to a format that takes up a less obscene amount of storage space. For this type of usage, the results with IQS fit into the “Good Enough” category.

        If/when I ever start ripping my physical media collection, I’ll probably use the similar methods to what you mention – x264 with multi-pass. That project would probably be a good justification for getting a more burly proc.

        I should do some hunting and see if there’s any TR threads on the forums regarding encoders, preferred settings, etc – so many endless configuration options.

          • stdRaichu
          • 5 years ago

          Been doing rips of my DVDs since 2002 (good ol’ XviD on my P3) and x264 since about 2007 I think, and the quality and speed has only come on in leaps and bounds since then. I think it got 30-40% faster at encoding merely over the lifetime of my 2600K.

          For my DVD/BD rips I use MeGUI, which is designed with this sort of work in mind. After years of fiddling with them myself, I wouldn’t worry too much about the x264 settings: the defaults are reasonably good, and you can get good mileage simply by tweaking the --tune parameter, which will do most of the fiddly bits for you (TBH you should be fine just using the presets plus --tune animation or --tune grain, depending on the source). One non-standard thing I do, however, is restrict it to 2 threads (or 4 if I’m in a hurry), since I’m almost always doing at least five encodes at once; x264 handles multiple threads by splitting up the picture, which can lead to some marginal quality loss when the parts are re-assembled.

          Of course, x265 is the new hotness but my 2600K will only encode that at single-digit fps at the moment for 480p content.

            • divide_by_zero
            • 5 years ago

            Well MeGUI sounds like *exactly* the type of program I’d like to take for a spin. Thanks for the tip!

    • chuckula
    • 5 years ago

    If there was a cheesy 90’s show called 18 Wheels of Justice… what TV themes can we think of for 18 cores?

      • bthylafh
      • 5 years ago

      A show like that would have to be a really bad early ’80s cartoon.

        • Pez
        • 5 years ago

        M.A.S.K?

          • ronch
          • 5 years ago

          M.A.S.K. toys were pretty cool. The cartoon series, I’d rather watch Care Bears.

      • BillyBuerger
      • 5 years ago

      Not a TV show but this just reminded me of 18 wheels on a big rig song.

    • chuckula
    • 5 years ago

    Meh. My Kaveri has a better GPU.

      • UnfriendlyFire
      • 5 years ago

      And when HSA does become common (which is a big IF), Kaveri’s HSA features would’ve been obsolete.

      • ronch
      • 5 years ago

      Sarcasm can be a wonderful thing.

    • UberGerbil
    • 5 years ago

    POV-Ray: 24 minutes to 11 seconds in 13 years. That really is amazing progress.

      • SnowboardingTobi
      • 5 years ago

      I remember it taking hours on my 386DX/33 to render a 640×480 scene… only to find out I messed up in my scene definition. :p

      • ptsant
      • 5 years ago

      Progress was much more amazing in the past. Going from a 386DX40 (AMD!) to a Pentium (ie 2 generations ahead) completely blew me away! A Linux kernel compile took 4 hours on the 386 and only minutes on the Pentium.

      I could argue the difference between the E5 v1 and E5 v3 is not that spectacular. Then again, the TDPs have reached the ceiling, so any performance at the [b<]high[/b<] end must come from an improvement in efficiency. The 386 did not even need a heatsink but the Pentium had a small fan. Improving performance 5x is much easier when you can also increase TDP 5-10x.

        • chuckula
        • 5 years ago

        [quote<]Going from a 386DX40 (AMD!) to a Pentium (ie 2 generations ahead) completely blew me away! [/quote<] Be careful with your nostalgia. You are forgetting that the time gap between the 386 (launched in 1985-1986) and that Pentium (launched in 1993) is about seven years. Now, what was Intel putting out in the server world seven years ago, in 2007? First-generation Conroe-derived quad-core Xeons (two dual-core dies per package). Go ahead and compare one of those to an 18-core Haswell-EP and tell me that progress isn't still amazing.

      • Milo Burke
      • 5 years ago

      It is amazing, but is that also comparing a very expensive part to a very cheap part?

      • ronch
      • 5 years ago

      So, from 1,440 seconds to 11 seconds in 13 years: an improvement of 131x. And we thought Moore’s Law was dead. 😀

      • BIF
      • 5 years ago

      Yes it is amazing progress. This allows either faster work or better quality render effects. Or both.

      • sschaem
      • 5 years ago

      Comes down to a ~45% speedup a year.
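      That rate checks out against the 1,440-second and 11-second POV-Ray times quoted above:

      \[ (1440/11)^{1/13} \approx 131^{1/13} \approx 1.45 \]

      which is roughly 45% per year, compounded.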

      Impressive. It seems most of that relative speedup happened way back when, though.

      But things might come in growth spurts… Skylake might equalize it all.
