I received two related emails while I was out recently for a couple of days of fishing and hiking. I thought they’d make for an interesting blog entry. The first email read:
…our tests show very little performance improvement on nehalem cpus compared to older Xeon…
And, the other email was the polar opposite:
…in most of our tests the Xeon 5500 was over 2 times as fast as the harpertown Xeon…
And the email continued:
…so we think you should stop saying that Xeon 5500 is double the perf of older xeon
Well, I can’t make everyone happy. I tend to say that Intel Xeon 5500 (Nehalem) processors are twice as fast as Harpertown Xeon (5400) as a conservative, well-rounded way to set expectations.
Introducing Fat and Skinny
OK, bear with me now, this is a wee bit tongue-in-cheek. The reader who emailed me with the report of near parity between Nehalem and Harpertown Xeon is not lying, he’s just skinny. And the reader who admonished me for my usual low-ball citation of 2x performance for Nehalem versus Harpertown? No, he’s not lying either…he’s fat. Allow me to explain.
It’s really quite simple. If you run code that spends a significant portion of its processor cycles operating on memory lines already in the processor cache, you are running code with a very low CPI (cycles per instruction) cost. In my terminology such code is “skinny.” On the other hand, code that jumps around in memory, causing processor stalls for memory loads, has a high CPI and is, in my terminology, “fat.”
Skinny code more or less relegates the comparison between Harpertown and Nehalem to one of clock frequency, whereas fat code is really where the rubber hits the road. The more load- and store-hungry (fat) the code is, the bigger the Nehalem payoff will be.
Let’s take a look at two different, simple programs to help make the point. Using fat.c and skinny.c I’ll take timings on Harpertown-based and Nehalem-based boxes. skinny.c simply hammers away on the same variable and does not leave L2 cache. On the other hand, fat.c treats its memory allocation as an array of 8-byte longs and skips to every 8th one in a loop in order to force memory loads, since the cache line size on these boxes is 64 bytes. NOTE: do not compile these with -O (or else change the longs in the array to volatile long). A simple gcc without args will suffice.
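For readers who want something concrete to picture, here is a minimal sketch of the two access patterns just described. It is a reconstruction from the description only, not the original fat.c and skinny.c. The function names, the STRIDE constant, and the iteration and pass counts are all hypothetical, and the sketch uses volatile (the alternative mentioned in the note above) so the loops survive optimization. It assumes an 8-byte long and a 64-byte cache line.

```c
#include <stddef.h>

/* Hypothetical constant: 8 longs x 8 bytes = one 64-byte cache line.
 * Assumes sizeof(long) == 8, as on the Linux x86-64 boxes in the post. */
#define STRIDE 8

/* "Skinny": hammer on a single variable. The working set is one
 * cache line, so the loop should essentially never wait on memory. */
unsigned long skinny(unsigned long iterations)
{
    volatile unsigned long counter = 0; /* volatile: keep every update */
    unsigned long i;

    for (i = 0; i < iterations; i++)
        counter++;
    return counter;
}

/* "Fat": touch one long in every 64-byte line of a buffer. When the
 * buffer is much larger than the last-level cache, each touch needs
 * a fresh line from memory. Returns the number of touches made. */
unsigned long fat(volatile long *buf, size_t nlongs, unsigned passes)
{
    unsigned long touched = 0;
    unsigned p;
    size_t i;

    for (p = 0; p < passes; p++) {
        for (i = 0; i < nlongs; i += STRIDE) {
            buf[i]++;
            touched++;
        }
    }
    return touched;
}
```

To try it, call skinny() with a large iteration count, or fat() on a malloc'ed buffer far bigger than the cache, from a small main() and run the result under time. The sizes needed to get run times like the ones in this post are not given in the original, so expect to experiment.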
So, skinny.c has a very low CPI and fat.c has a very high CPI.
In the following examples, the model name field from /proc/cpuinfo tells us what each system is. The E5430 is Harpertown Xeon and the X5570 is of course Nehalem. In terms of clock frequency, the Nehalem processors (2.93GHz) are about 10% faster than the Harpertown Xeons (2.66GHz).
In the following box you’ll see screen-scrapes I took from two different systems, one based on Harpertown and the other on Nehalem. Notice that the run time of skinny improves by only about 17% with the same executable on Nehalem compared to Harpertown.
# cat /proc/cpuinfo | grep 'model name'
model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
# md5sum skinny
df86d9a278ea33b7da853d7a17afdd46 skinny
# time ./skinny
real 6m3.658s
user 6m3.567s
sys 0m0.001s
#
# cat /proc/cpuinfo | grep 'model name'
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
# md5sum skinny
df86d9a278ea33b7da853d7a17afdd46 skinny
# time ./skinny
real 5m1.941s
user 5m2.043s
sys 0m0.001s
In the next box you’ll see screen-scrapes from the same two systems where I ran the “fat” executable. Notice how the Harpertown Xeon took 2.75x longer to process the fat.
# cat /proc/cpuinfo | grep 'model name' | head -1
model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
# md5sum fat
b717640846839413c87aedd708e8ac0d fat
# time ./fat
real 1m57.731s
user 1m57.659s
sys 0m0.045s
# cat /proc/cpuinfo | grep 'model name' | head -1
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
# md5sum fat
b717640846839413c87aedd708e8ac0d fat
# time ./fat
real 0m42.834s
user 0m42.803s
sys 0m0.023s
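If you want to check the ratios yourself, the arithmetic on the "real" times is easy to script. The helper below is only an illustrative sketch; to_seconds and slowdown are hypothetical names, not part of fat.c or skinny.c. Fed the times from the screen-scrapes, it gives a Harpertown-to-Nehalem ratio of roughly 1.20 for skinny (a run-time reduction of about 17%) and roughly 2.75 for fat.

```c
#include <stdio.h>

/* Convert a time(1)-style string such as "1m57.731s" to seconds.
 * Returns -1.0 if the string does not match the NmS.SSSs shape. */
double to_seconds(const char *t)
{
    int minutes;
    double seconds;

    if (sscanf(t, "%dm%lfs", &minutes, &seconds) != 2)
        return -1.0;
    return minutes * 60.0 + seconds;
}

/* How many times longer the slow run took than the fast run.
 * Both arguments are assumed to be valid time strings. */
double slowdown(const char *slow, const char *fast)
{
    return to_seconds(slow) / to_seconds(fast);
}
```

For example, slowdown("1m57.731s", "0m42.834s") is the fat comparison and slowdown("6m3.658s", "5m1.941s") is the skinny one.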
So, as it turns out, we can believe both of the folks that sent me email on the matter.
Looks like Nehalem-EX will also make a big splash: http://news.cnet.com/8301-13512_3-10321740-23.html
Hello Kevin,
Is there a way in the BIOS to turn off two of the four cores (Xeon 5500)? Oracle does licensing by cores, so that is something we are considering.
I am having difficulty finding this information.
Thanks!
I don’t get involved in licensing topics, as doing so is a great way to find oneself floating face down in a murky swamp somewhere.
Kevin – thanks for the blog.
Do you know of any TPC-H benchmarks available on these processors? I’m getting some new servers to put together a 5 node 11g (hopefully 11gR2) RAC environment and was looking to compare a config with the Nehalem processors on RHEL5 vs. a Sun 5240 Ultra Sparc config running Solaris.
My main concern is with respect to parallel processing within our warehouse. We’re starting to hit some CPU bottlenecks in our existing environment (single server Sun 890 8 CPU config) even though we do a pretty good job controlling DOP through Resource management. Any opinions on which processors would handle parallel processing better?
Well now, it would be really odd for me to take a position against SPARC at this juncture.
It’s not about the processors anyway…it’s about the bandwidth between memory and the processors. Really fast CPUs on a junk bus/interconnect stall a lot; they remain “busy” but not effectively so. Until the latest Harpertown Xeon-based systems, I’d have to say that Intel routinely mated CPUs to an under-performing bus. Those were the days when Opteron with HyperTransport (HT) ruled in the commodity space. Well, actually, the Woodcrest 5100 family and their chipset brought Intel and AMD closer together. All that aside, we are talking QuickPath and Nehalem these days, and the whole thing is moot: I don’t know enough about SPARC-based systems to say one way or the other, and it would be utterly technocratically/politically incorrect for me to say anything about it now anyway given the Oracle/Sun merger.