Intel’s Xeon W5580 processors

Ever since the introduction of the first Opteron, Intel has faced a formidable foe in the x86 server and workstation markets. AMD’s decision to integrate a memory controller into its processors and use a narrow, high-speed interconnect between CPUs and I/O chips has made it a perennial contender in this space. Even recently, while Intel’s potent Core microarchitecture has given it a lead in the majority of performance tests, Xeons have been somewhat hamstrung on two fronts: on the power-efficiency front by their prevailing use of FB-DIMM memory, and on the scalability front by the use of a front-side bus and a centralized memory controller.

Those barriers for the Xeon are about to be swept away by today’s introduction of new processors based on the chip code-named Nehalem, a new CPU design that brings with it a revised system architecture that will look very familiar to folks who know the Opteron. Try this on for size: a single-chip quad-core processor with a relatively small L2 cache dedicated to each core, backed up by a larger L3 cache shared by all cores. Add in an integrated memory controller and a high-speed, low-latency socket interconnect. Sounds positively… Opteronian, to coin a word, but that’s also an apt description of Nehalem.

Of course, none of this is news. Intel has been very forthcoming about its plans for Nehalem for some time now, and the high-end, single-socket desktop part based on this same silicon has been selling for months as the Core i7. Just as with the Opteron, though, Nehalem’s true mission and raison d’être is multi-socket systems, where its architectural advantages can really shine. Those advantages look to be formidable because, to be fair, the Nehalem team set out to do quite a bit more than merely copy the Opteron’s basic formula. They attempted to create a solution that’s newer, better, and faster in nearly every way, melding the new system architecture with Intel’s best technologies, including a heavily tweaked version of the familiar Core microarchitecture.

Since this is Intel, that effort has benefited from world-class semiconductor fabrication capabilities in the form of Intel’s 45nm high-k/metal gate process, the same one used to produce “Harpertown” Xeons. At roughly 731 million transistors and a die area of 263 mm², though, the Nehalem EP is a much larger chip. (Harpertown comprises a pair of dual-core chips, each of which packs 410 million transistors into an area of 107 mm².) The similarity with AMD’s “Shanghai” Opteron is, again, striking in this department: Shanghai is estimated at 758 million transistors and measures 258 mm².

The Xeon W5580

We have already covered Nehalem at some length, since it has been on the market in single-socket form for months. Let me direct you to my review of the Core i7 if you’d like more detail about the microarchitecture. If you want even more depth, I suggest reading David Kanter’s Nehalem write-up, as well. Rather than cover all of the same ground again here, I’ll try to offer an overview of the changes to Nehalem most relevant to the server and workstation markets.

A brief tour of Nehalem

As we’ve noted, Nehalem’s quad execution cores are based on the four-issue-wide Core microarchitecture, but they have been modified rather extensively to improve performance per clock and to take better advantage of the new system architecture. One of the most prominent additions is the return of simultaneous multithreading (SMT), known in Intel parlance as Hyper-Threading. Each Nehalem core can track and execute two hardware threads, to keep its execution units more fully occupied. This capability has dubious value on the desktop in the Core i7, but it makes perfect sense for Xeon-based servers, where most workloads are widely multithreaded. With 16 hardware threads in a dual-socket config, the new Xeons take threading in this class of system to a new level.

Additionally, the memory subsystem, including the cache hierarchy, has been broadly overhauled. Each core now has 32KB L1 instruction and data caches, along with a dedicated 256KB L2 cache. A new 8MB L3 cache serves all four cores; it’s part of what Intel calls the “uncore” and is clocked independently, typically at a lower speed than the cores.

The chip’s integrated memory controller, also an “uncore” component, interfaces with three 64-bit channels of DDR3 memory, with support for both registered and unbuffered DIMM types, along with ECC. Intel has decided to jettison FB-DIMMs for dual-socket systems, with their added power draw and access latencies. The use of DDR3, which offers higher operating frequencies and lower voltage requirements than DDR2, should contribute to markedly lower platform power consumption. The bandwidth is considerable, as well: a dual-socket system with six channels of DDR3-1333 memory has theoretical peak throughput of 64 GB/s.
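The arithmetic behind that 64 GB/s figure is simple enough to check. Here’s a minimal sketch in Python; the eight bytes per transfer follow from the 64-bit channel width:

```python
# Peak theoretical memory bandwidth: channels x transfer rate x bytes per transfer.
# Each 64-bit DDR3 channel moves 8 bytes per transfer.
def peak_memory_bw_gbs(channels, megatransfers_per_s):
    return channels * megatransfers_per_s * 8 / 1000.0  # GB/s

# A dual-socket Nehalem EP system: two sockets x three channels of DDR3-1333
print(peak_memory_bw_gbs(6, 1333))  # ~64 GB/s
```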

That’s a little more than one should typically expect, though, because memory frequencies are limited by the number of DIMMs per channel. A Nehalem-based Xeon can host only one DIMM per channel at 1333MHz, two per channel at 1066MHz, and three per channel at 800MHz. The selection of available memory speeds is also limited by the Xeon model involved. Intel expects 1066MHz memory, which allows for 12-DIMM configurations, to be the most commonly used option. The highest capacity possible at present, with all channels populated, is 144GB.
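The capacity math works the same way, assuming the 8GB DIMMs that were the largest practical option at the time:

```python
# Max capacity: sockets x channels x DIMMs per channel x DIMM size (8GB assumed)
print(2 * 3 * 3 * 8, "GB")  # 144GB, with all 18 slots filled at 800MHz
```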

Nehalem’s revised memory hierarchy also supports an important new feature: Extended Page Tables (EPT), Intel’s counterpart to a familiar Opteron capability, Nested Page Tables (NPT). Like NPT, EPT accelerates virtualization by relieving the hypervisor of the burden of managing guest page tables in software. NPT and EPT have the potential to reduce the overhead of virtualization substantially.

The third and final major uncore element in Nehalem is the QuickPath Interconnect, or QPI. Much like HyperTransport, QPI is a narrow, high-speed, low-latency, point-to-point interconnect used in both socket-to-socket connections and links to I/O chips. QPI operates at up to 6.4 GT/s in the fastest Xeons, where it yields a peak two-way aggregate transfer rate of 25.6 GB/s—again, a tremendous amount of bandwidth. The CPUs coordinate cache coherency over the QPI link by means of a MESIF protocol, which extends the traditional Xeon MESI protocol with the addition of a new Forwarding state that should reduce traffic in certain cases. (For more on the MESIF protocol, see here.)
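The same sort of back-of-the-envelope math applies to the interconnect figures quoted above and later in this article. QPI and a 16-bit HyperTransport link each carry two bytes of payload per transfer in each direction, so a rough sketch looks like this:

```python
# Aggregate peak bandwidth of a point-to-point link: transfer rate x payload
# bytes per transfer per direction x two directions.
def link_bw_gbs(gigatransfers_per_s, bytes_per_direction=2):
    return gigatransfers_per_s * bytes_per_direction * 2

print(link_bw_gbs(6.4))  # QPI at 6.4 GT/s -> 25.6 GB/s
print(link_bw_gbs(4.4))  # 16-bit HT3 link at 4.4 GT/s -> 17.6 GB/s (see the Opteron page)
```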

One of the implications of the move to QPI and an integrated memory controller is that the new Xeons’ memory subsystems are non-uniform. That is, getting to local memory will be notably quicker than retrieving data owned by another processor. Non-uniform memory architectures (NUMA) have some tricky performance ramifications, not all of which have been sufficiently addressed by modern OS schedulers, even now. The Opteron has occasionally run into problems on this front, and now Xeons will, too. One can hope that Intel’s move to a NUMA design will prompt broader and deeper OS- and application-level awareness of memory locality issues.

Power efficiency has become a key consideration in server CPUs, and the new Xeons include a range of provisions intended to address this issue. In fact, the chip employs a dedicated microcontroller to manage power and thermals. Nehalem EP includes more power states (15) than Harpertown (4) and makes faster transitions between them, with a typical switch time of under two microseconds, compared to four microseconds for Harpertown. Nehalem’s lowest power states make use of a power gate associated with each execution core; this gate can cut voltage to an idle core entirely, eliminating even leakage power and taking its power consumption to nearly zero.

The power management microcontroller also enables an intriguing new feature, the so-called “Turbo mode.” This feature takes advantage of the additional power and thermal headroom available when the CPU is at partial utilization, say with a single- or dual-threaded application, by dynamically raising the clock speed of the busy cores beyond their rated frequency. The clock speed changes involved are relatively conservative: one full increment of the CPU multiplier results in an increase of 133MHz, and most of the new Xeons can only go two “ticks” beyond their usual multiplier ceilings. Still, the highest-end W- and X-series Xeons can reach up to three ticks, or 400MHz, beyond their normal limits. Unlike the generally advertised clock frequency of the CPU, this additional Turbo mode headroom is not guaranteed and may vary from chip to chip, depending upon its voltage needs and resulting thermal profile. Whatever headroom is available brings a “free,” if modest, performance boost to lightly threaded applications.
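Put another way, Turbo mode just adds whole multiplier steps on top of the 133MHz base clock. A quick sketch, with the W5580’s multiplier of 24 inferred from its 3.2GHz rating:

```python
BCLK_MHZ = 133  # Nehalem's base clock, nominally 133MHz

def turbo_mhz(base_multiplier, ticks):
    # Each Turbo "tick" is one multiplier increment, i.e. one 133MHz step.
    return (base_multiplier + ticks) * BCLK_MHZ

# Xeon W5580: 24 x 133MHz is roughly its 3.2GHz rating; up to three extra ticks
print(turbo_mhz(24, 0), turbo_mhz(24, 3))  # 3192 3591
```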

A new platform, too

Of course, this sweeping set of changes brings with it a host of platform-level alterations, not least of which is the modification of the role and naming of what has been traditionally called the north bridge chip, or the memory controller hub (MCH) in Intel’s world. Say hello, instead, to the I/O Hub, or IOH.

A block diagram of the Tylersburg chipset. Source: Intel.

The new Xeons’ first IOH has been known by its code name, Tylersburg-36D, and will now be officially called the Intel 5520 chipset. True to its name, this IOH is focused almost entirely on PCI Express connectivity, with one QPI link to each of the two processors and a total of 42 PCIe lanes onboard—36 of them PCIe Gen2 and six Gen1. Those lanes can be apportioned in groups of various sizes for specific needs. Tylersburg also has an ESI port for connecting with an Intel south bridge chip, one of the members of the ICH9/10/R family; these chips provide SATA and USB ports, along with various forms of legacy connectivity.

Tylersburg’s dual QPI links open up the possibility of dual IOH chips, which Intel has decided to enable for certain configurations. In this scenario, each Tylersburg chip is linked via QPI to a different CPU, and the two IOH chips are linked via QPI, as well. The primary IOH chip handles various system management and legacy I/O duties, while the secondary one simply provides 36 additional lanes of PCIe Gen2 connectivity, for a total of 72 lanes in the system (plus six Gen1 lanes). That’s a tremendous amount of connectivity, but it’s in keeping with the platform’s high-bandwidth theme.

Two large coolers and one DDR3 DIMM per channel in our test rig

The new Xeons’ LGA1366-style socket

Nehalem-based Xeons come in a much larger package (left) than the prior Xeon generation (right)

The new Xeons drop into a new, LGA1366-style socket that looks, unsurprisingly, just like the Core i7’s. The CPU itself is housed in a larger package, as well, that dwarfs the Harpertown Xeons and their predecessors.

Pricing and availability

Here’s a quick overview of the new dual-socket Xeon models, along with key features and pricing.

Model | Clock speed | Cores | L3 cache | QPI link speed | Max DDR3 speed | TDP | Turbo? | Hyper-Threading? | Price
Xeon W5580 | 3.2GHz | 4 | 8MB | 6.4 GT/s | 1333MHz | 130W | Y | Y | $1600
Xeon X5570 | 2.93GHz | 4 | 8MB | 6.4 GT/s | 1333MHz | 95W | Y | Y | $1386
Xeon X5560 | 2.8GHz | 4 | 8MB | 6.4 GT/s | 1333MHz | 95W | Y | Y | $1172
Xeon X5550 | 2.66GHz | 4 | 8MB | 6.4 GT/s | 1333MHz | 95W | Y | Y | $958
Xeon E5540 | 2.53GHz | 4 | 8MB | 5.86 GT/s | 1066MHz | 80W | Y | Y | $744
Xeon E5530 | 2.4GHz | 4 | 8MB | 5.86 GT/s | 1066MHz | 80W | Y | Y | $530
Xeon E5520 | 2.26GHz | 4 | 8MB | 5.86 GT/s | 1066MHz | 80W | Y | Y | $373
Xeon L5520 | 2.26GHz | 4 | 8MB | 5.86 GT/s | 1066MHz | 60W | Y | Y | $530
Xeon E5506 | 2.13GHz | 4 | 4MB | 4.8 GT/s | 800MHz | 80W | N | N | $266
Xeon L5506 | 2.13GHz | 4 | 4MB | 4.8 GT/s | 800MHz | 60W | N | N | $422
Xeon E5504 | 2.00GHz | 4 | 4MB | 4.8 GT/s | 800MHz | 80W | N | N | $224
Xeon E5502 | 1.86GHz | 2 | 4MB | 4.8 GT/s | 800MHz | 80W | N | N | $188

Nehalem has a plethora of knobs and dials available for product differentiation, and Intel has apparently decided to twiddle with them all. Each of them impacts performance in its own way, so choosing the right processor for your needs may prove to be something less than straightforward.

On top of all of the possibilities you see in the table above, there’s the issue of L3 cache speed, a notable attribute that impacts performance, but one Intel hasn’t opted to document too clearly (as we learned with the Core i7). As I understand it, the uncore elements in Nehalem chips can be clocked independently of one another, so the speed of the memory controller or the QPI link doesn’t necessarily correspond to the frequency of the L3 cache. The pair of processors we have for this first review, of the decidedly ultra-high-end, workstation-oriented Xeon W5580 variety, have a 2.66GHz L3 cache. So does the Xeon X5570, the top server model.

The Opteron also edges forward

Unfortunately, we don’t have a direct Opteron competitor to test against the Xeon W5580, primarily because AMD doesn’t make a dual-socket CPU that expensive. We do, however, have a pair of new “Shanghai” Opterons, model 2389, with a 2.9GHz core clock frequency (and a 2.2GHz L3/north bridge clock). These are not “SE” parts, so they offer higher performance within the same power/thermal envelope as most mainstream Opterons, with a 75W ACP rating.

The bigger news here may be the addition, at last, of HyperTransport 3.0 support to these Opterons. HT3 essentially doubles the bandwidth of a HyperTransport link, and at 2.2GHz, the link between our Opteron 2389s should operate at 4.4 GT/s and provide a total of 17.6 GB/s of bandwidth—quite close to the 19.2 GB/s supplied by the 4.8 GT/s QPI link on mainstream Nehalem variants. Upgrading to HT3 was as simple as dropping these new Opterons into our existing test system. This system’s Nvidia core logic chipset doesn’t support HT3, but the socket-to-socket interconnect automatically came up with HT3 enabled.

The Opteron 2389 currently lists for $989, which makes it a direct competitor for the Xeon X5550. I’d certainly like to show you a performance comparison between these two chips, but unfortunately, time constraints and a minor flood in my office have prevented me from pursuing the matter. Perhaps soon.

Our testing methods

As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.

Our test systems were configured like so:

Processors | Dual Xeon E5450 3.0GHz | Dual Xeon X5492 3.4GHz | Dual Xeon L5430 2.66GHz | Dual Xeon W5580 3.2GHz | Dual Opteron 2347 HE 1.9GHz / Dual Opteron 2356 2.3GHz | Dual Opteron 2384 2.7GHz / Dual Opteron 2389 2.9GHz
System bus | 1333 MT/s (333MHz) | 1600 MT/s (400MHz) | 1333 MT/s (333MHz) | QPI 6.4 GT/s (3.2GHz) | HT 2.0 GT/s (1.0GHz) | HT 2.0 GT/s (1.0GHz) / HT 4.4 GT/s (2.2GHz)
Motherboard | SuperMicro X7DB8+ | SuperMicro X7DWA | Asus RS160-E5 | SuperMicro X8DA3 | SuperMicro H8DMU+ | SuperMicro H8DMU+
BIOS revision | 6/23/2008 | 8/04/2008 | 8/08/2008 | 2/20/2009 | 3/25/08 | 10/15/08
North bridge | Intel 5000P MCH | Intel 5400 MCH | Intel 5100 MCH | Intel 5520 IOH | Nvidia nForce Pro 3600 | Nvidia nForce Pro 3600
South bridge | Intel 6321 ESB ICH | Intel 6321 ESB ICH | Intel ICH9R | Intel ICH10R | Nvidia nForce Pro 3600 | Nvidia nForce Pro 3600
Chipset drivers | INF Update 9.0.0.1008 | INF Update 9.0.0.1008 | INF Update 9.0.0.1008 | INF Update 8.9.0.1006 | N/A | N/A
Memory size | 16GB (8 DIMMs) | 16GB (8 DIMMs) | 6GB (6 DIMMs) | 24GB (6 DIMMs) | 16GB (8 DIMMs) | 16GB (8 DIMMs)
Memory type | 2048MB DDR2-800 FB-DIMMs | 2048MB DDR2-800 FB-DIMMs | 1024MB registered ECC DDR2-667 DIMMs | 4096MB registered ECC DDR3-1333 DIMMs | 2048MB registered ECC DDR2-800 DIMMs | 2048MB registered ECC DDR2-800 DIMMs
Memory speed (effective) | 667MHz | 800MHz | 667MHz | 1333MHz | 667MHz | 800MHz
CAS latency (CL) | 5 | 5 | 5 | 10 | 5 | 6
RAS to CAS delay (tRCD) | 5 | 5 | 5 | 9 | 5 | 5
RAS precharge (tRP) | 5 | 5 | 5 | 9 | 5 | 5
Storage controller | Intel 6321 ESB ICH with Matrix Storage Manager 8.6 | Intel 6321 ESB ICH with Matrix Storage Manager 8.6 | Intel ICH9R with Matrix Storage Manager 8.6 | Intel ICH10R with Matrix Storage Manager 8.6 | Nvidia nForce Pro 3600 | LSI Logic Embedded MegaRAID with 8.9.518.2007 drivers
Power supply | Ablecom PWS-702A-1R 700W | Ablecom PWS-702A-1R 700W | FSP Group FSP460-701UG 460W | Ablecom PWS-702A-1R 700W | Ablecom PWS-702A-1R 700W | Ablecom PWS-702A-1R 700W
Graphics | Integrated ATI ES1000 with 8.240.50.3000 drivers | Integrated ATI ES1000 with 8.240.50.3000 drivers | Integrated XGI Volari Z9s with 1.09.10_ASUS drivers | Nvidia GeForce 8400 GS with ForceWare 182.08 drivers | Integrated ATI ES1000 with 8.240.50.3000 drivers | Integrated ATI ES1000 with 8.240.50.3000 drivers
Hard drive | WD Caviar WD1600YD 160GB (all systems) | | | | |
OS | Windows Server 2008 Enterprise x64 Edition with Service Pack 1 (all systems) | | | | |

We used the following versions of our test applications:

The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

Memory subsystem performance

This test gives us a visual look at the throughput of the different levels of the memory hierarchy. Generally speaking, Intel’s caches seem to achieve higher bandwidth than AMD’s. Looking at the block sizes between 512KB and 16MB shows us that the Xeon W5580’s caches appear to be quite a bit faster than the older Harpertown Xeons’, but the W5580’s throughput drops at 4MB, where its smaller total cache begins to run out of space. The most striking result may be the new Xeons’ throughput once we spill into main memory. Let’s take a closer look at that data point.

The W5580’s main memory throughput nearly doubles that of the fastest Opterons and is just short of four times that of the fastest Harpertown Xeon, the X5492. That’s a staggering increase in measured bandwidth.

Memory access latencies are down dramatically, as well—the W5580’s round trip to main memory takes nearly half the time that the Xeon E5450’s does, and the W5580 is even quicker than the quickest Opteron. Incidentally, other than the fact that the 2389 has HT3 enabled, I’m unsure why the 2389’s memory performance isn’t quite as good as the 2384’s. They were tested in the same server and otherwise configured identically.

To rather gratuitously drive the point home, we can take a more complete look at memory access latencies in the charts below. Note that I’ve color-coded the block sizes that roughly correspond to the different caches on each of the processors. L1 data cache is yellow, L2 is light orange, L3’s darker orange, and main memory is brown.

Each stage of the new Xeon’s cache and memory hierarchy delivers shorter access times than the corresponding stage in the Shanghai Opteron’s, although the two certainly look similar, don’t they?

Bottom line: the Nehalem Xeon’s re-architected memory subsystem delivers the goods as advertised, with higher throughput and quicker access times than anything else we’ve tested.

SPECjbb 2005

SPECjbb 2005 simulates the role a server would play executing the “business logic” in the middle of a three-tier system with clients at the front-end and a database server at the back-end. The logic executed by the test is written in Java and runs in a JVM. This benchmark tests scaling with one to many threads, although its main score is largely a measure of peak throughput.

As you may know, system vendors spend tremendous effort attempting to achieve peak scores in benchmarks like this one, which they then publish via SPEC. We did not intend to challenge the best published scores with our results, but we did hope to achieve reasonably optimal tuning for our test systems. To that end, we used a fast JVM—the 64-bit version of Oracle’s JRockit JRE R27.6—and picked up some tweaks for tuning from recently published results. We used two JVM instances with the following command line options:

start /AFFINITY [0F, F0] java -Xms3700m -Xmx3700m -XXaggressive -XXlazyunlocking -Xlargepages:exitOnFailure=true -Xgc:genpar -XXgcthreads=4 -Xns3200m -XXcallprofiling -XXtlasize:min=4k,preferred=512k -XXthroughputcompaction

Notice that we used the Windows “start” command to affinitize our threads on a per-socket basis. We also tried affinitizing on a per-chip basis for the Harpertown Xeon systems, but didn’t see any performance benefit from doing so. One exception to the command line options above was our Xeon L5430/San Clemente system. Since it had only 6GB of memory, we had to back the heap size down to 2200MB for it.

Also, in order to affinitize for the 16 hardware threads of the Xeon W5580 system, we used masks of FF00 and 00FF. Although our Xeon W5580 system has more memory than the rest of the systems—practically unavoidable in an optimal configuration because of its six DIMM channels—we did not raise the heap size to take advantage of the additional space. (Although we did experiment with doing so and found it not to bring a substantial advantage.) In order to follow the rules of SPECjbb to the letter, we tested the Xeon W5580 with one to 16 warehouses with two JVMs—topping out at twice the number of concurrent warehouses at which we expected performance to peak, thanks to Nehalem’s two hardware threads per core. (We also experimented with running four JVMs on this system, but as with the older Xeons, doing so didn’t improve throughput significantly.)
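For the curious, those hexadecimal masks are just contiguous runs of one bit per logical CPU, shifted to cover a given socket. A minimal sketch, assuming logical CPUs are numbered contiguously per socket as they were on our systems:

```python
def socket_affinity_mask(socket_index, threads_per_socket):
    # One bit per logical CPU, shifted into the socket's range of CPU numbers.
    mask = (1 << threads_per_socket) - 1
    return format(mask << (socket_index * threads_per_socket), 'X')

# Harpertown (8 threads total) and Nehalem with SMT (16 threads total)
print(socket_affinity_mask(0, 4), socket_affinity_mask(1, 4))  # F F0
print(socket_affinity_mask(0, 8), socket_affinity_mask(1, 8))  # FF FF00
```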

The new Xeons’ prowess here is absolutely staggering. We’ve rarely, if ever, seen this sort of performance increase from one CPU generation to the next. One can’t help but think how ominous this looks for AMD upon seeing these results.

I should note that you may see published scores even higher than these. We’re testing with an older version of the JRockit JVM that’s not as well optimized for Nehalem—or Shanghai—as a newer version might be. Unfortunately, we haven’t yet been able to get our hands on a newer revision of this JVM, though I believe our present comparison should put the newer CPUs on relatively equal footing.

Before we move on, let’s take a quick look at power consumption during this test. SPECjbb 2005 is the basis for SPEC’s own power benchmark, which we had initially hoped to use in this review, but time constraints made that impractical. Nevertheless, we did capture power consumption for each system during a test run using our Extech 380803 power meter. All of the systems used the same model of Ablecom 700W power supply unit, with the exception of the Xeon L5430 server, which used an FSP Group 460W unit. Power management features (such as SpeedStep and Cool’n’Quiet) were enabled via Windows Server’s “Balanced” power policy.

Although it delivers much higher throughput, the Xeon W5580 system’s peak power draw isn’t appreciably higher than that of its direct predecessor, the Xeon X5492. This is a top-end workstation Xeon model with a generous 130W TDP; I’d expect more mainstream Nehalem Xeons to draw quite a bit less power within their 80 and 95W TDPs. Hopefully we can test one soon.

Still, have a look at what happens when we consider performance per watt.

On the strength of its amazing throughput, even this 130W version of the new Xeon edges out the most efficient Opteron in terms of power-efficient performance. Only a low-voltage version of the Harpertown Xeon, on the low-power San Clemente platform, proves more efficient.

Cinebench rendering

We can take a closer look at power consumption and energy-efficient performance by using a test whose time to completion varies with performance. In this case, we’re using Cinebench, a 3D rendering benchmark based on Maxon’s Cinema 4D rendering engine.

The performance leap with the new Xeons isn’t quite as stunning here as it is in SPECjbb, but it’s formidable nonetheless. The W5580 is nearly 50% faster than the Opteron 2389.

Once again, we measured power draw at the wall socket for each of our test systems across a set time period, during which we ran Cinebench’s multithreaded rendering test.

A quick look at the data tells us much of what we need to know. Still, we can quantify these things with more precision. We’ll start with a look at idle power, taken from the trailing edge of our test period, after all CPUs have completed the render.

Despite all of the bandwidth and its tremendous performance—and despite the 130W peak TDP of our Xeon W5580 processors—our Nehalem test system draws even less power at idle than our Opteron 2389-based system. Finally free from the wattage penalty of FB-DIMMs, Intel is again competitive on the platform power front.

Next, we can look at peak power draw by taking an average from the ten-second span from 15 to 25 seconds into our test period, during which the processors were rendering.

The combination of a 130W TDP and smart power management gives the Xeon W5580 more dynamic range in terms of power draw than any other solution tested. As with SPECjbb, the Xeon W5580 system’s peak pull is slightly higher than the Xeon X5492 system’s.

Another way to gauge power efficiency is to look at total energy use over our time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of watt-seconds, also known as joules.
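Deriving that figure from a power log is straightforward; here’s a minimal sketch, assuming one reading per second from the meter:

```python
def energy_joules(samples_watts, interval_s=1.0):
    # Trapezoidal integration of power over time; watt-seconds are joules.
    return sum((a + b) / 2 * interval_s
               for a, b in zip(samples_watts, samples_watts[1:]))

# Hypothetical one-second readings: idle, a render, then idle again
print(energy_joules([150, 380, 390, 385, 160]))
```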

We can quantify efficiency even better by considering specifically the amount of energy used to render the scene. Since the different systems completed the render at different speeds, we’ve isolated the render period for each system. We’ve then computed the amount of energy used by each system to render the scene. This method should account for both power use and, to some degree, performance, because shorter render times may lead to less energy consumption.

Here, too, the new Xeon just outdoes the Opteron 2384 in perhaps our best measure of power-efficient performance by using less energy to complete the task at hand.

MyriMatch proteomics

Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He has provided us with an intriguing benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of proteins. I’ll stop right here and let him explain what MyriMatch does:

In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest, created by John Yates and Jimmy Eng at the University of Washington. Recently, David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching. Source code and binaries of MyriMatch are publicly available.
In this test, 5555 tandem mass spectra from a Thermo LTQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast). The data set was provided by Andy Link at Vanderbilt University. The FASTA protein sequence database was provided by the Saccharomyces Genome Database.

MyriMatch uses threading to accelerate the handling of protein sequences. The database (read into memory) is separated into a number of jobs, typically the number of threads multiplied by 10. If four threads are used in the above database, for example, each job consists of 168 protein sequences (1/40th of the database). When a thread finishes handling all proteins in the current job, it accepts another job from the queue. This technique is intended to minimize synchronization overhead between threads and minimize CPU idle time.
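What David describes is a classic self-scheduling work queue. Here’s a rough sketch of the idea; the protein and job counts come from his description, while everything else is purely illustrative:

```python
import queue
import threading

def make_jobs(num_proteins, num_threads, jobs_per_thread=10):
    # Split the database into roughly threads x 10 equal jobs, as described above.
    num_jobs = num_threads * jobs_per_thread
    size = -(-num_proteins // num_jobs)  # ceiling division
    return [(s, min(s + size, num_proteins)) for s in range(0, num_proteins, size)]

def worker(jobs):
    while True:
        try:
            start, end = jobs.get_nowait()
        except queue.Empty:
            return  # no jobs left; this thread is done
        # Stand-in for scoring proteins [start, end) against the spectra in memory.

jobs = queue.Queue()
for job in make_jobs(6714, 4):  # 40 jobs of ~168 proteins, per the text
    jobs.put(job)
workers = [threading.Thread(target=worker, args=(jobs,)) for _ in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()
```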

The most important news for us is that MyriMatch is a widely multithreaded real-world application that we can use with a relevant data set. MyriMatch also offers control over the number of threads used, so we’ve tested with one to eight threads.

I should mention that performance scaling in MyriMatch tends to be limited by several factors, including memory bandwidth, as David explains:

Inefficiencies in scaling occur from a variety of sources. First, each thread is comparing to a common collection of tandem mass spectra in memory. Although most peptides will be compared to different spectra within the collection, sometimes multiple threads attempt to compare to the same spectra simultaneously, necessitating a mutex mechanism for each spectrum. Second, the number of spectra in memory far exceeds the capacity of processor caches, and so the memory controller gets a fair workout during execution.

Here’s how the processors performed.

The Xeon W5580 completes this task in almost precisely half the time it takes the Xeon X5492—another eye-popping performance.

STARS Euler3d computational fluid dynamics

Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results here.

In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:

The benchmark testcase is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45º. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.
The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.

So the higher the score, the faster the computer. Charles tells me these CFD solvers are very floating-point intensive, but oftentimes limited primarily by memory bandwidth. He has modified the benchmark for us in order to enable control over the number of threads used. Here’s how our contenders handled the test with different thread counts.

Here we are again with another stunning leap in performance. The Xeon W5580 performs this simulation at over twice the rate of the fastest Opteron. Nehalem seems to be especially well suited for bandwidth-intensive scientific computing and HPC applications like the two on this page.

For what it’s worth, I should note that at lower thread counts, we saw a striking amount of variability from run to run with the Xeon W5580 system. At two threads, for instance, the scores came in at 1.5Hz, 2.4Hz, and 1.8Hz. So I wouldn’t put too much stock into those non-peak results. Things seemed to even out once we got to higher thread counts. This variance at low thread counts could be the result of one or several facets of the Nehalem architecture, including Turbo mode, NUMA, and SMT, all of which can contribute some performance variability, especially when interacting with a non-NUMA/SMT-aware application and perhaps a less-than-optimal thread scheduler. Notably, we didn’t see any such variability on our Opteron test systems.

Folding@Home

Next, we have a slick little Folding@Home benchmark CD created by notfred, one of the members of Team TR, our excellent Folding team. For the unfamiliar, Folding@Home is a distributed computing project created by folks at Stanford University that investigates how proteins work in the human body, in an attempt to better understand diseases like Parkinson’s, Alzheimer’s, and cystic fibrosis. It’s a great way to use your PC’s spare CPU cycles to help advance medical research. I’d encourage you to visit our distributed computing forum and consider joining our team if you haven’t already joined one.

The Folding@Home project uses a number of highly optimized routines to process different types of work units from Stanford’s research projects. The Gromacs core, for instance, uses SSE on Intel processors, 3DNow! on AMD processors, and Altivec on PowerPCs. Overall, Folding@Home should be a great example of real-world scientific computing.

notfred’s Folding Benchmark CD tests the most common work unit types and estimates performance in terms of the points per day that a CPU could earn for a Folding team member. The CD itself is a bootable ISO. The CD boots into Linux, detects the system’s processors and Ethernet adapters, picks up an IP address, and downloads the latest versions of the Folding execution cores from Stanford. It then processes a sample work unit of each type.

On a system with two CPU cores, for instance, the CD spins off a Tinker WU on core 1 and an Amber WU on core 2. When either of those WUs is finished, the benchmark moves on to additional WU types, always keeping both cores occupied with some sort of calculation. Should the benchmark run out of new WUs to test, it simply processes another WU in order to prevent any of the cores from going idle as the others finish. Once all four of the WU types have been tested, the benchmark averages the points per day among them. That points-per-day average is then multiplied by the total number of cores (or threads, in the case of SMT) in the system in order to estimate the total number of points per day that CPU might achieve.
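The bottom-line computation, then, is just an average scaled by thread count. A trivial sketch, with made-up per-WU figures purely for illustration:

```python
def projected_ppd(ppd_by_wu_type, hardware_threads):
    # Average points per day across WU types, scaled to every core or SMT thread.
    return sum(ppd_by_wu_type) / len(ppd_by_wu_type) * hardware_threads

print(projected_ppd([220.0, 180.0, 260.0, 240.0], 16))  # hypothetical inputs
```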

This may be a somewhat quirky method of estimating overall performance, but my sense is that it generally ought to work. We’ve discussed some potential reservations about how it works here, for those who are interested. I have included results for each of the individual WU types below, so you can see how the different CPUs perform on each.

Because each of its cores is executing two threads at once, the Xeon W5580’s performance in the individual work unit tests is relatively lackluster. Once we reach the bottom line and look at total projected points per day, though, it achieves nearly a 50% gain over the Xeon X5492—just another day at the office for Nehalem, I suppose.

3D modeling and rendering

POV-Ray rendering

We’re using the latest beta version of POV-Ray 3.7 that includes native multithreading and 64-bit support. Some of the beta 64-bit executables have been quite a bit slower than the 3.6 release, but this should give us a decent look at comparative performance, regardless.

To put this performance in our chess2 scene into perspective, the Xeon W5580 box finishes in 30 seconds what used to take over 10 minutes to complete.

Valve VRAD map compilation

This next test processes a map from Half-Life 2 using Valve’s VRAD lighting tool. Valve uses VRAD to pre-compute lighting that goes into its games.

The new Xeon’s dominance continues here.

x264 HD video encoding

This benchmark tests performance with one of the most popular H.264 video encoders, the open-source x264. The results come in two parts, for the two passes the encoder makes through the video file. I’ve chosen to report them separately, since that’s typically how the results are reported in the public database of results for this benchmark. These scores come from the newer, faster version 0.59.819 of the x264 executable.

I’m at a bit of a loss to express the reality of what we’re seeing. Across a broad mix of applications, the Xeon W5580 is—by far—the fastest processor we’ve ever tested. Yes, this is a very high end part, but Intel’s new architecture is unquestionably effective.

Sandra Mandelbrot

We’ve included this final test largely just to satisfy our own curiosity about how the different CPU architectures handle SSE extensions and the like. SiSoft Sandra’s “multimedia” benchmark is intended to show off the benefits of “multimedia” extensions like MMX, SSE, and SSE2. According to SiSoft’s FAQ, the benchmark actually does a fractal computation:

This benchmark generates a picture (640×480) of the well-known Mandelbrot fractal, using 255 iterations for each data pixel, in 32 colours. It is a real-life benchmark rather than a synthetic benchmark, designed to show the improvements MMX/Enhanced, 3DNow!/Enhanced, SSE(2) bring to such an algorithm.

The benchmark is multi-threaded for up to 64 CPUs maximum on SMP systems. This works by interlacing, i.e. each thread computes the next column not being worked on by other threads. Sandra creates as many threads as there are CPUs in the system and assignes [sic] each thread to a different CPU.

The benchmark contains many versions (ALU, MMX, (Wireless) MMX, SSE, SSE2, SSSE3) that use integers to simulate floating point numbers, as well as many versions that use floating point numbers (FPU, SSE, SSE2, SSSE3). This illustrates the difference between ALU and FPU power.

The SIMD versions compute 2/4/8 Mandelbrot point iterations at once – rather than one at a time – thus taking advantage of the SIMD instructions. Even so, 2/4/8x improvement cannot be expected (due to other overheads), generally a 2.5-3x improvement has been achieved. The ALU & FPU of 6/7 generation of processors are very advanced (e.g. 2+ execution units) thus bridging the gap as well.

We’re using the 64-bit version of the Sandra executable, as well.

Well, OK, then.

Conclusions

The Nehalem Xeons’ truly astounding leap in performance over prior generations, in a range of applications, speaks for itself. The largest gains came in our scientific computing/HPC tests, where the Xeon W5580 proved to be between 50% and 100% faster than the Harpertown Xeon X5492. We saw a massive performance increase in SPECjbb 2005, as well, along with more modest but still substantial improvements everywhere else.

This performance revolution comes alongside a vast reduction in platform power consumption, especially at idle. Our modestly appointed Xeon W5580 test system drew only 154W while idling, fully 79W less than our comparably equipped Xeon E5450 box. Naturally, the Xeon W5580’s 130W TDP makes for fairly considerable platform power draw under load—just shy of 400W in our testing—but the CPUs made up for it with outstanding performance. In our two best measures of power-efficient performance, the Xeon W5580 finished just ahead of the Opteron 2384, AMD’s strongest contender in this department.

Although we’ve seen Nehalem on the desktop, it’s even more impressive in its dual-socket server/workstation form. That’s true for several reasons, including the fact that this architecture was obviously designed with the server market in mind. This system layout translates particularly well into multi-socket systems, where its scalability is quite evident. Another reason Nehalem looks so impressive here is the simple reality that the past few generations of Xeons were handcuffed by FB-DIMMs, not only due to added power consumption, but also in terms of memory latencies and, as a result, overall performance. Seeing that limitation go away punctuates Nehalem’s other virtues.

Interestingly enough, the Xeon W5580 restores a certain continuity between high-end desktops and workstations that hasn’t been present in recent years. Not only is it an excellent workstation processor, but it has the makings of the most desirable personal computer we have ever seen, should you be so bold as to hog one for just yourself. Of course, the W5580 is almost outrageously expensive, but with this sort of performance on tap, it could easily pay for itself in time saved when put to the right application.

With luck, we’ll have an opportunity to test a more mainstream variant of the new Xeons before long. Like I said, a more direct price comparison against the Opteron 2389 would be nice to do. I had also hoped to try out some additional benchmarks this time around, but simply ran out of time. Perhaps we can add some of the more promising candidates next time.

For now, this quick first look at the new Xeons has left us gasping for breath and wondering what, exactly, AMD can do to counter. The six-core Istanbul looks like a start, but it will have to be exceptional in order to close the performance gap the Nehalem Xeons have opened up.

Comments closed
    • Sunburn74
    • 10 years ago

    Can we get some dual core results?

    • dpaus
    • 10 years ago


      • Convert
      • 10 years ago

      They aren’t bad for retail units, HP servers are only a couple grand more for the new xeons, which isn’t bad considering the improvement they bring.

    • Fighterpilot
    • 10 years ago

    Be good to hear Jack and Ubergerbil on the TR podcast…how about it JD?

    • DrDillyBar
    • 10 years ago

    Yay, the workstation has class again.

    • Rza79
    • 10 years ago

    You mention HT3 for the Opteron 2389 in your spec sheet but the Supermicro H8DMU+ doesn’t support it. So it’s actually running in HT1 mode.

    • mattthemuppet
    • 10 years ago

    how can you have a “minor” flood – a flood’s a flood!

    now why am I suddenly reading that as floode (as in dude)?

    • octop
    • 10 years ago

    In my view, despite the Xeon 5580 being able to easily outperform the Opteron, it’s still not a fair comparison. Categorizing processors by market price is not actually a technical comparison; price is based on demand and supply. At the same price, the Opteron is lower in raw frequency due to manufacturing capability. And I bet after the Xeon EP release, the Opteron’s price is going to change again. So I think the Xeon, with its scalable QPI and Hyper-Threading, is able to exceed the Opteron in the same raw frequency range by 20-35%.

    • UberGerbil
    • 10 years ago

    How did we end up with a lolcatz picture illustrating the article? Oh, wait, I forgot what day it was…. (It’s actually not that day yet in the PDT, which further confused me).

      • NIKOLAS
      • 10 years ago

      Never complain about LOLCats. NEVER!!!!!!

        • nerdrage
        • 10 years ago

        I CAN HAS XEON!!!

    • Krogoth
    • 10 years ago

    Intel has finally retaken the workstation market. Intel did it the same way AMD did with its original Opterons, which at the time leapfrogged the Netburst-based Xeons.

    Intel’s two biggest problems with adoption are obviously the uncertain economic conditions and pointy-haired bosses. 😉

    • bdwilcox
    • 10 years ago

    So Intel has created a fast Lamborghini…what’s new? With the market as it is, I feel this chip will have a hard time finding a home. Sure, those that must have it will have it. But most sales, for the foreseeable future, will go to servers that are “fast enough” without the price premium. If AMD can focus on a much better price/performance ratio and deliver less expensive chips that the market deems “fast enough”, all of Intel’s expensive, raw horsepower may very well loiter on the shelf.

      • MadManOriginal
      • 10 years ago

      Actually these are pretty compelling in TCO based upon performance/watt for virtualized servers that are heavily loaded most of the time. Anandtech’s article has some power numbers.

      • Freon
      • 10 years ago

      There is ALWAYS a business somewhere that needs faster servers. You can’t beowulf cluster and load balance across servers every time.

      There is no such thing as a CPU that is too powerful and too expensive to have a market.

        • bdwilcox
        • 10 years ago

        That’s why I said, “Sure, those that must have it will have it.” But what percentage of companies NEED a large number of servers like these? I have a feeling SDMs will be looking pretty closely at price to benefit ratios for a while, and most will choose the cheaper, “good enough” solution in order to hit their mark. The days of “buy the best you can, as many as you can, then find a role for ’em” are over.

          • Freon
          • 10 years ago

          Well enough companies need them that Intel is marketing the chip. I’m not the least bit surprised. I don’t quite understand where you are coming from I guess. It seems blatantly obvious to me, but I’m sure we have different experiences. I’ve worked for a lot of database-driven-software companies, and especially now we can always use more horsepower for certain tasks, like our primary database servers even if disk is a bigger issue.

          Server based processing is back in style. The workstation is almost becoming moot again. Especially with web based apps these days. If you aren’t running your servers out, your business probably isn’t doing that great. While I don’t expect Sun’s original vision of the dummy terminal, the pendulum is definitely swinging back towards the server.

      • kuraegomon
      • 10 years ago

      Umm, no. If your revenue depends on your CPU performance, and performance-per-watt is even in the same ballpark, you’re going to go with the faster part every single time.

      In this case, the most expensive, power-hungry EP part is already very competitive in efficiency. Note especially the task energy graph for Cinebench on page 7. That’s a good approximation to the real-world TCO for this system – and there will be many stepping improvements even before we get to 32 nm.

      The product my team develops is CPU and memory-bandwidth intensive, and we already know we’ll be moving our hardware platform from Opteron to Nehalem in the mid-term. Be assured that _many_ other companies will be doing the same, this year. As long as Intel can make enough of these, AMD will take a tremendous beating in the 2P/4P space. No one in the enterprise is just going to walk away from 50 to 80% clock-for-clock performance. No one.

      Now if your particular application has a much smaller (or no) performance delta when run on Nehalem, then I can see AMD having an excellent chance at retaining your business. Unfortunately, no one’s seeing many of those kind of applications right now …

        • bdwilcox
        • 10 years ago

        “If your revenue depends on your CPU performance”
        -That’s a bit myopic. How many companies’ revenue depends on their CPU performance? Small niches, that’s who. And they will opt for Intel’s offering. Everyone else will look at the cheaper server and say “it will do for now”. Let me reiterate for the third time, now: “Sure, those that must have it will have it.”

        “Now if your particular application has a much smaller (or no) performance delta when run on Nehalem, then I can see AMD having an excellent chance at retaining your business. Unfortunately, no one’s seeing many of those kind of applications right now …”
        -Those applications are niches compared to the preponderance of servers and server roles in corporations around the world. Does anyone really say they need the fastest CPUs, and the highest prices, on Earth for their file/print/AD servers?

        Krogoth is right, “Intel’s two biggest problems with adoption are obviously the uncertain economic conditions and pointy-haired bosses. ;)”

          • crazybus
          • 10 years ago

          A server with Nehalem’s capability allows many server roles to be consolidated onto one machine via virtualization. Given the performance/watt, performance/$ and the fact that we’re seeing 4S performance in a 2S footprint, I’m thinking these new Xeons won’t have trouble finding buyers.

            • Krogoth
            • 10 years ago

            Xeon-based Nehalems are indeed impressive and worth the cost if time is money.

            The problem with adoption is purely based on other factors like the current uncertain times and IT guys who have to convince their PHB to justify the expense of an upgrade. 😉

            • MadManOriginal
            • 10 years ago

            ^This. Sure, companies who are between upgrades won’t be looking right away but they’ll certainly do cost analysis for the future. Of those looking to replace their current servers even companies who just want ‘enough’ will consider these for their TCO with virtualization. Aside from a company that only runs a few CPUs worth of servers these have potential to be great everywhere.

          • Freon
          • 10 years ago

          “How many companies’ revenue depends on their CPU performance? Small niches, that’s who. ”
          I think you’re missing a whole world of data center and enterprise level operations. While the customer number may not be as high as all the medium size companies running three servers total for the DS, email, and file and print sharing they are still important customers. Customers with money, because they’re using a better business model, and can stay funded. It’s not going away, and there is always thirst for more power. It’s not going away.

      • swaaye
      • 10 years ago

      By what measure have we reached the “diminishing gains in usefulness” point? There are absolutely viable uses for this processor in many industries.

    • no51
    • 10 years ago

    I’m curious, that SuperMicro X8DA3 has 2 x16 slots. Does it support Crossfire and maybe SLI?

    • AMDisDEC
    • 10 years ago

    Intel has out Hypertransported the Hypertransport. Amazing!

    Not only that, but the triple channel memory controller is also a level above alternatives.
    The design decision to support both non-buffered and registered ECC DRAM is a huge bonus. Let the end user make the choice and trade-offs. This is a very intelligent design.

      • Krogoth
      • 10 years ago

      It is premature to say that QPI is outright superior to HTP.

      The benches mostly reflect the differences in CPU architectures, not the interconnects. The i7 core has a similar performance delta compared to its desktop competition.

      We will see more of a difference in the interconnects, if there is any, with the upcoming Nehalem-EXs versus the current 8xxx series (a.k.a. platforms with four sockets or more).

    • UberGerbil
    • 10 years ago

    Also, that comparison of POV-Ray scores to those of 2001 is beautiful. We may have long passed the amount of CPU horsepower ordinary users need, but the astonishing progress continues.

      • echo_seven
      • 10 years ago

      It’s hilarious to look at some of the comments in the 2001 review:


        • Forge
        • 10 years ago

        I see what you link thar.

        FWIW, those last three comments were made from a few months to nearly a year after the bulk of the comments. There’s always been some slow folks who feel that an eon’s delay is the soul of wit or something.

          • UberGerbil
          • 10 years ago

          Yes, but that just means those comments were made in the “dawn of the Opteron era” in 2003-2004. I recall at that time arguing with people who claimed that Intel was doomed, had no chance of catching up to AMD’s insurmountable lead, and would be bankrupt in a couple of years. Seriously.

      • Krogoth
      • 10 years ago

      Newer chips are roughly an order of magnitude faster. 😉

    • UberGerbil
    • 10 years ago

    That SPECjbb result is just ridiculous — best absolute performance by a mile, and best perf/watt? That makes for a pretty easy sell when looking to upgrade your server room (assuming you have the budget to do any upgrades, of course).

    The really interesting competitive comparison for Gainestown may not be Opteron at all, but Itanium. Clearly Intel didn’t make any effort to reduce x86’s expansion into Itanium’s HPC turf. The much-delayed Tukwila and Poulson now have an even bigger hurdle to overcome in terms of absolute performance, and even if QPI shaves a little off their total platform cost the price/performance numbers are unlikely to look favorable either (especially in these budget-conscious days). Once the Beckton is out, Itanium really is left only with the very large 16+S installations, and there just aren’t many of those (especially when you’re looking at non-clusters).

    Intel wins either way, of course, and most of the Itanium system vendors like HP and IBM have x86 offerings as well, but I wonder if Intel will quietly wind down the line after Tukwila/Poulson stagger out the door into an evaporating niche. Then again, people have been predicting Itanium’s demise since before it was born.

      • blastdoor
      • 10 years ago

      Good points…

      I recall reading long ago that Itanium would have to stick around for quite a while because of contractual obligations with very important customers (like DOD perhaps?)

      If that’s true, then Itanium could stick around for quite a while in a niche, but could be for all practical purposes dead.

      Perhaps someone who (unlike me) isn’t talking out of their a$$ could confirm or deny my baseless assertion…

        • jabro
        • 10 years ago

        I guess it depends on what your definition of “dead” is. No doubt, Itanium is confined to a low volume, high margin niche, and any far-fetched fantasies about selling Itanium into the low-end server and high-end workstation markets went up in smoke long ago.

        But Intel and its partners (notably HP) stand to earn billions of dollars from Itanium processor and systems sales over the next decade. HP’s enterprise server customers who require high-end HP-UX, VMS, and NonStop systems (as well as Linux & Windows) will continue to spend big $$$ on Itanium systems, and Intel will be happy to cater to them as long as it can charge ~$2000 per Itanium CPU.

        Of course, I have no idea how long this will go on for, but enterprise customers may never stop wanting high-end servers with cutting-edge RAS features and scalability (heck, 15 years ago who would have guessed that the IBM mainframe would still be going strong today?). As long as Intel continues to make enough money from its high volume x86 business to pay for the fabrication R&D and foundry costs, it can afford to cater to high end customers with special requirements and deep pockets.

          • tfp
          • 10 years ago

          I’m not sure I agree with the “Itanium is dead” argument; if that’s the case, so are IBM’s POWER chips. I don’t see either of them leaving any time soon.

            • UberGerbil
            • 10 years ago

            Yeah, I was just pointing out that it’s getting harder and harder for them as x86 climbs up-market and Itanium gets restricted into ever-more esoteric niches. And I’m looking forward to comparisons of Beckton vs Tukwila in fp-heavy tasks, because they’re going to be using the same interconnects.

            I’m usually the one pointing out Itanium is still doing fine: http://www.esj.com/articles/2009/03/31/Big-Iron-Bucks-Trend.aspx

            • bdwilcox
            • 10 years ago

            Itanium is most certainly NOT dead! It is very alive and very kicking. It may not be the comprehensive solution Intel wanted, but it fulfills its role very well and the people who need it tend to be very happy with it. There’s an entire building across the street from me humming with Itaniums and no one seems to be complaining. :o)

    • UberGerbil
    • 10 years ago

    So, just out of interest, have you tried swapping LGA1366 chips? Does a Xeon run in an X58 board? Does a Bloomfield (especially your engineering samples) boot in a Tylersburg chipset?

      • Damage
      • 10 years ago

      No, I had to disassemble all of my CPU test rigs for the water/mold cleanup. Still not back together. Maybe later this week.

    • AMDisDEC
    • 10 years ago

    Intel rises from the ashes and raises the bar for high performance computing. As expected, this CPU is a massive leap forward.
    Expect Cray, IBM, and HP to soon design and release some super HPC systems around this monster.
    Meanwhile, AMD is in massive layoff mode and losing tons of serious talent weekly. It is highly unlikely they will ever successfully challenge Intel again.
    I think AMD under Sanders has done a tremendous service to consumers by consolidating DEC and API tech to develop a damn good CPU contender which forced Intel to refresh itself and innovate. The consumers are the beneficiaries.
    Unfortunately AMD’s new management has destroyed themselves in the effort.
    What we need now is a new rising technology star to replace AMD to push Intel to even higher heights.

      • blastdoor
      • 10 years ago

      I suspect that the pressure on Intel will come from outside of the x86 world. In the server/workstation space the only pressure will be the POWER of IBM. In the mobile space, it will be ARM. But in the desktop/laptop PC space, I’m afraid that competition will be dead for quite some time, perhaps until the desktop as we have known it just no longer exists — when a cell phone or a SOC embedded inside a monitor is powerful enough for 90% of what 90% of people do, and the only people who have a big box on/under their desk are those simulating climates or decoding genomes or producing HD feature length 3D animated films.

      It occurs to me that as I write this, I’m essentially refuting the argument (which I myself have advanced on many occasions) that Apple should sell an xMac.

    • UberGerbil
    • 10 years ago


      • Damage
      • 10 years ago

      Shanghaian? Shanghawaiian? 🙂

        • UberGerbil
        • 10 years ago

        Shanghainey?

      • SecretMaster
      • 10 years ago

      What a Turkish Delight.

      No wait… that is in the future.

    • blastdoor
    • 10 years ago

    These things do rock. I got a new Mac Pro as soon as they were announced. It is extremely satisfying to watch all of those cores come to life to attack a problem. And I was really impressed by the gains from hyperthreading — much more than what I had realized it would be (but maybe that’s because my expectations were influenced by the P4).

    • TaBoVilla
    • 10 years ago

    Nice reference there to the old P4/Athlon scores in the POV benchmark. wow, that one used to take between 10 and 20 minutes to complete and now 30 seconds.

    performance wise, it almost follows moore’s law 100%.

    • Usacomp2k3
    • 10 years ago

    I wonder how much only having 8MB of cache hurts things. I’d be curious if a bump to 12 or 16MB would improve performance at all, especially with Hyper-Threading.

      • UberGerbil
      • 10 years ago

      It’ll depend on the task, of course. A lot of those impressive HPC scores come on tests that are working on shared data sets that likely fit into the cache so pollution from SMT isn’t an issue. The specjbb and some webserving tests might show gains from more L3. Intel certainly will know one way or the other (particularly when it comes to the tradeoff with higher costs of larger dies) and if there are gains to be had we’ll see variants down the road with more (especially in the Westmere era).

    • glacius555
    • 10 years ago

    So, what kind of minor flood was it?;)

      • UberGerbil
      • 10 years ago

      Caused by Intel’s MiB, no doubt.

      Or entirely imaginary, according to the “TR is in Intel’s pocket” bunch.

    • thermistor
    • 10 years ago

    #5…Really? I heard up and down about Intel having a ‘fake’ dual/quad core because they were multi-chip modules and not monolithic. About how bad the Netburst arch was (and it really was) for multi-core computing, etc., etc. Apparently you didn’t get the AMD press releases.

    Let’s not shed any tears for AMD; they’re still a huge multi-billion dollar company – and at most price points for consumers and businesses they’ve got good offerings.

    AMD ain’t going anywhere, even with second best parts.

      • indeego
      • 10 years ago

      They are just barely multibillion now <.<

    • Usacomp2k3
    • 10 years ago

    Will this come in a 4S variety? That could be scary indeed.

      • mczak
      • 10 years ago

      Yes it will – Nehalem-EX. Will have twice the cores, 3 times the L3 cache. But current rumours put this at Q1/10. Not good for AMD – by that time AMD should have the 6-core Istanbul out, though I fail to see how this could be more than a 4-core Nehalem competitor…

        • UberGerbil
        • 10 years ago

        Yeah, Beckton looks like a monster chip, but it will take some time to work out all the wrinkles, so 1Q10 is probably best case—though they’re right on time with Gainestown, and that says a lot; adding more QPI links and getting the protocols glitch-free for four (and more) chips amps the complexity quite a bit.

        But 2S servers are a huge chunk of the market, so Intel has to be pretty happy with where things sit ATM.

    • Fighterpilot
    • 10 years ago

    That is a seriously badass CPU……uber expensive I’ll bet but it’d be nice to own one.
    Lets hope the dual cores coming out later this year have some of the same magic dust.

    • Crayon Shin Chan
    • 10 years ago

    Wait, so you’re looking for a Nehalem Xeon? Or are you looking for a downside?

      • moshpit
      • 10 years ago

      Looking for the downside. They already found the Nehalem Xeon.

    • lycium
    • 10 years ago


    • ssidbroadcast
    • 10 years ago

    Wow, 11:52pm. Cutting it close! Burnin’ the midnight oil on that one!
