Oracle Database 11g with Intel Xeon 5570 TPC-C Result: Beware of the Absurdly Difficult NUMA Software Configuration Requirements!

Published March 30, 2009 oracle 16 Comments

According to this Business Wire article, the Intel Xeon 5500 (a.k.a., Nehalem) is making a huge splash with an Oracle Database 11g TPC-C result of 631,766 TpmC. At 78,970 TpmC/core, that is an outrageous result! I remember when it was difficult to push a 64 CPU system to the level these CPUs get with only one processor core! I had to quickly scurry over to the TPC website to dig in to the disclosures, but, as of 8:12 PM GMT it had not been posted yet:

TPC posting

Jumping the Gun
Ah, but by the time I got around to checking again, it had indeed been posted. The first thing I did was check the full disclosure report to see what sort of Oracle NUMA-specific tweaking was done in the init.ora. None. That is very good news to me. The last thing I want to see is a bunch of confusing NUMA-specific tuning. Allow me to quote myself with a saying I’ve been rattling off for years:

The best NUMA system is the best SMP system.

By that I mean it shouldn’t take application software tuning to get your money’s worth out of the platform. Sure, we had to do it back in the mid to late 1990s with the pioneer NUMA systems, but that was largely due to the incredible ratio between local memory latency and highly-contended remote memory latency (and due to the concept of remote I/O, which does not apply here). Of course the operating system has to be NUMA aware. Period.

Speeds
I know what the ratios are on the Xeon 5500 series, but I can’t recall whether or not the specific number I have in mind is one I obtained under non-disclosure, so I’m not going to go blurting it out. However, it turns out that as long as memory is fairly placed (e.g., not a Cyclops) and the ratio is comfortably below 2:1 (R:L), you’re going to get a real SMP “feel” from the box. Of course, the closer the ratio leans towards 1:1 the better.
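To make the R:L (remote:local) rule of thumb concrete, here is a minimal sketch. The function names and the latency figures are hypothetical placeholders chosen only for illustration. They are not measurements of the Xeon 5500 or of any other system. The only number taken from the text is the 2:1 threshold.

```python
def remote_to_local_ratio(remote_ns: float, local_ns: float) -> float:
    """Return the R:L ratio: remote memory latency divided by local latency."""
    if local_ns <= 0:
        raise ValueError("local latency must be positive")
    return remote_ns / local_ns


def feels_like_smp(remote_ns: float, local_ns: float, threshold: float = 2.0) -> bool:
    """Apply the rule of thumb from the text: an R:L ratio below 2:1."""
    return remote_to_local_ratio(remote_ns, local_ns) < threshold


# Placeholder latencies in nanoseconds (NOT real measurements of any box).
local_ns = 100.0
remote_ns = 150.0

ratio = remote_to_local_ratio(remote_ns, local_ns)
print(f"R:L = {ratio:.2f}:1, SMP feel: {feels_like_smp(remote_ns, local_ns)}")
```

The closer the computed ratio gets to 1.0, the less the placement of memory matters to the software.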

Summary
NUMA is a hardware architecture that breaks bottlenecks. It shouldn’t have to break SMP programming principles in the process. The Intel Xeon 5570, it turns out, is the sort of NUMA system you should all be clamoring for. What kind of NUMA system is that? The answer is a NUMA system that is indistinguishable from a flat-memory SMP.

Very cool!

PS. I actually already knew what level of NUMA tuning was used in this TPC-C testing. I just couldn’t blog about it. I also know the precise R:L memory latency ratio for the box. The way I look at it, though, is that since this modern NUMA system gets 78,970 TpmC/core, the R:L ratio is unnecessary minutiae, as are thoughts of NUMA software tuning. I never imagined NUMA would come far enough for me to write that.
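For readers who want to check the per-core figure, the arithmetic can be sketched as follows. The total is the 631,766 TpmC quoted in the post, and the core count of 8 is the one mentioned in the comments below (two quad-core processors). The helper name is just for illustration.

```python
TPMC_TOTAL = 631_766  # published result quoted in the post
CORES = 8             # core count as stated in the comment thread


def tpmc_per_core(total_tpmc: int, cores: int) -> float:
    """Divide total throughput evenly across processor cores."""
    return total_tpmc / cores


per_core = tpmc_per_core(TPMC_TOTAL, CORES)
print(f"{per_core:,.2f} TpmC/core")  # prints "78,970.75 TpmC/core"
```

The post truncates the result to 78,970 TpmC/core.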

16 Responses to “Oracle Database 11g with Intel Xeon 5570 TPC-C Result: Beware of the Absurdly Difficult NUMA Software Configuration Requirements!”


1 chris_c March 31, 2009 at 2:32 pm

Is it just me or is that a lot of storage? :) Nearly 1200 disks for data and indexes. I’m not entirely sure I’d want to attach 20 or so TB of disk storage to a Standard Edition One database, and I’d really love to know how much of that space actually got used on the round spinny things.
Still, it is an impressive result from 8 cores, no less realistic than any of the other TPC-C results, and just as useful.

- 2 kevinclosson March 31, 2009 at 3:33 pm
  
  TPC-C is an IOPS benchmark. It is whacking a very small slice of each of those drives. I tend to ignore the physical storage aspect of C and focus instead on what the processors are able to do. It is a very contentious workload…difficult to scale. The physical storage aspect of the test is an arms-race. If you can find someone with enough floorspace, power and cooling they have overcome the largest TPC-C hurdle.
  
3 Matt April 1, 2009 at 7:35 pm

Interesting, Nehalem does work faster.
However, it appears that there is no replacement for displacement.
HP’s DL580G5 benchmark ekes out a few more transactions than the newer kit, with a few extra cores and a skosh more memory.
All at the same Oracle license cost.

http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=109012001

It will be interesting to see how the new DL580G6 benchmarks.

Matt
– I could not find a comparable DL360G5 benchmark to compare to the newer DL370

- 4 kevinclosson April 1, 2009 at 9:04 pm
  
  Matt,
  
  That DL580G5 is a “Dunnington” 6-core result so there are 24 cores compared to the 8 Nehalem cores in the DL370-G6 TPC-C. Please reiterate how these results are comparable?
  
5 David Sansom April 1, 2009 at 10:48 pm

Kevin,

How long till Nehalem processors become available in Oracle Database Machines? Do you plan to change the CPU to Disk ratio in Exadata or the Database tier?

David

- 6 kevinclosson April 1, 2009 at 10:51 pm
  
  David,
  
  You routinely ask great questions on my blog. Thanks. Unfortunately I cannot answer a question about “futures.” I do like to see readers thinking about ratios and tiers with regard to Exadata architecture. It tells me people are becoming aware of the technology at a quick pace.
  
7 Matt April 2, 2009 at 1:43 am

Correct, I was not explicitly comparing the two architectures. Just noting that the “Dunnington” server, for the same Oracle Standard Edition license (4 licenses), ekes out more TPC-C transactions.

8 George May 10, 2009 at 7:07 pm

Hmm, am I missing something here? 4 licenses of Standard Edition implies 4 sockets being filled. How can 4 Dunnington CPUs give the same performance as 4 Nehalems?

Hmm, the Dunnington is a 6-core, so that’s 24 cores as Kevin stated. OK, so the Nehalem is 4 cores, so that’s 16 cores. OK, license-wise the older Dunnington could maybe outperform the newer Nehalem system, it being 16 cores. For now…

But surely it is unfair to compare a 16-core system to a 24-core system. Of course it will be very fair when a 6-core Nehalem is released.

G

9 Andrew Gregovic May 22, 2009 at 1:07 am

Matt

When comparing the G5 24-core vs. the G6 8-core configurations for TPC-C, it’s worthwhile to dig in a bit deeper: compare the response times and you’ll see that the G5 ones are generally a lot higher than the G6!

Why is that the case? It’s hard to say without knowing the CPU and IO utilization… It’s a real shame that TPC doesn’t publish even the most basic figures for them.

I suspect (without any real proof though) that the very similar throughput but different response timings between the G5 and the G6 are due to

1) the G5 being IO-bound (perhaps the extra 20% of disk spindles make quite some difference or perhaps something wasn’t configured quite right). Hence the larger response times.
2) the G6 being CPU-bound, and by chance the CPU throughput matching the IO-throughput of the G5 configuration

We can only hope that Kevin can ring up some of his friends in high places and get the missing CPU/IO charts…

Cheers

Andrew

- 10 kevinclosson May 23, 2009 at 2:47 pm
  
  First rule in running TPC-C: Don’t stop until the processors are totally saturated. Once there, nothing else really matters. Given the scale rules you basically can’t get processor saturation without adequate I/O. It really is the simplest workload on the planet. Now, these comments are about executing for audit. There is an entire world of benchmark engineering that goes into ensuring those saturated processors are getting as much work done per cycle as possible and that was where I spent a lot of time in the 90s. Oracle engineering for processor efficiency (code ordering for locality, memory optimizations, code path optimization, etc, etc) is quite enjoyable work.
  
  - 11 Andrew Gregovic May 28, 2009 at 2:30 am
    
    Kevin
    
    Are you saying that the G6 Nehalem core is 3 times more powerful than the old G5 core? I very much suspect that this is not the case, most benchmarks indicate 20-40% improvements, not 200%!
    
    I have no doubt that at least one of the above configurations, G5 or G6, is not very well CPU-IO balanced. If you think that in both tests (G5 and G6) the processors were stretched out to the max, what could be the reason that the TPM was the same but the response times were notably different?
    
    Regards
    
    Andrew
    
    - 12 kevinclosson May 28, 2009 at 5:10 am
      
      Hi Andrew,
      
      I’m not saying exactly how much “faster” the G6 is than the G5, but be aware that “fast” in this context is an interesting term. There is more to “fast” than can be solely attributed to the processor. The fastest processor in the world gets little done when accessing slow memory, for instance. But first, I need to point out that it is only in the comment thread that this comparison between the “G5” and the “G6” surfaced. I should further point out that the processor in that G5 ProLiant is merely a “cousin” to the processor in the “G6.” One is a Dunnington and the other is a Nehalem. There is a reason that the processor nomenclature of Dunnington is 7XXX and that of Clovertown, Harpertown and Nehalem is 5XXX. But none of that matters as much as the concept of processor efficiency. Not processor utilization.
      
      I do not question your observation that the G5 result had higher response times. I noticed that too in my reading of the FDR. I do assert, however, that it isn’t a processor utilization issue. It is, instead, a processor efficiency issue. These benchmarks are not run with idle processor cycles. The processors are totally nailed to the wall. Whether the processor gets much work done per cycle is a reflection of processor efficiency. If the processor is 100% utilized (busy) but stalling on loading heavily contended memory from a slow FSB, that is still 100% processor utilization…just not very efficient. Just consider how much off-die traffic there is in that G5 result to interface with the memory controller! Nehalem is a world apart from that in both memory controller locale and memory speed/bandwidth terms. Totally different ball parks.
      
      So, no, you really can’t presume that there must have been some idle cycles or I/O bottlenecks in that G5 to G6 comparison. You can be quite certain, however, that processor efficiency was pathetic in the G5 case compared to the Nehalem case. And, folks, Dunnington wasn’t exactly a slouch…neither was its older brother Tulsa for that matter. They are just not on par with Nehalem. Nowhere near!
      
      I should just come out and say it, Nehalem is a processor for “grown-ups”. It is a processor for huge, balanced, systems. It really is a thing of wonder and beauty. I jest not!
      
13 Andrew Gregovic May 29, 2009 at 12:30 pm

Kevin

I guess you’re right after all, here’s another benchmark that pretty much confirms your assertion:

http://it.anandtech.com/IT/showdoc.aspx?i=3536&p=7

Thanks for the info. Can’t wait to get one of those cheapo Dell servers, do a benchmark of our application and publish double-ish performance figures. An easy way of pleasing ignorant management and customers 😉

Cheers

Andrew

14 Steve Shaw May 31, 2009 at 2:56 pm

The difference between the two platforms and results is greatly determined by the memory architecture. A Dunnington CPU on the Caneland platform with the Clarksboro chipset has an FSB of 1066MHz with dedicated FSB connections to each of the four processors, which gives it an 8.5GB/s dedicated link per six-core processor for a total of 34GB/s bandwidth for all four processors combined. The fastest memory for this platform is DDR2-667 and it has 4 channels, so a total memory bandwidth of up to 32GB/s with up to 256GB in 32 FB-DIMMs. (Note that although DDR2-667 would do 5.34GB/s, the AMB on FB-DIMM increases this up to 8GB/s.) You can see in the Dunnington spec that 256GB, 32x8GB (HP 16GB Reg PC2-5300 2x8GB Kit), was used.

Nehalem with the Tylersburg chipset, on the other hand, is a NUMA configuration (and you are in a good place to learn about that) with three memory channels per processor. Populated with DDR3-1333 memory, each processor can support a maximum memory bandwidth of 32GB/s, equivalent to the 10.6GB/s supported by each of the three channels, and therefore 64GB/s in total for the two-processor configuration. The max in this DDR3-1333 config is 48GB, but reading the spec the test system used 144GB (HP 8GB 2Rx4 PC3-8500R-7 Kit), where the DIMMs could support 8.5GB/s but will operate at 6.4GB/s per channel (800MT/s when three DIMMs are populated per channel). This means the bandwidth for this benchmark is 19.2GB/s per processor.

Therein lies the main difference (there are other significant differences in CPU architecture as well). On the Nehalem side, the actual test system with 144GB has a specified maximum memory bandwidth of 19.2GB/s per processor compared to the 8.5GB/s on Dunnington. The combined total bandwidth across both systems is roughly equivalent, but the combination of the memory architecture and DDR3 memory on the Nehalem is the major factor that gives it the opportunity for higher per-processor performance on this type of workload.
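The bandwidth figures above can be reproduced with a rough sketch. It assumes each memory channel and each FSB link is 64 bits (8 bytes) wide and that GB/s means 1000 MB/s, which matches the rounded numbers in this comment. The helper name is hypothetical and the results are theoretical peaks, not measured throughput.

```python
BYTES_PER_TRANSFER = 8  # assumption: 64-bit wide data path per channel or link


def peak_gb_per_s(mega_transfers_per_s: int, links: int = 1) -> float:
    """Theoretical peak bandwidth in GB/s for `links` channels or FSB links."""
    return mega_transfers_per_s * BYTES_PER_TRANSFER * links / 1000


# Nehalem: three DDR3 channels per processor.
nehalem_rated = peak_gb_per_s(1333, links=3)   # DDR3-1333, about 32 GB/s
nehalem_tested = peak_gb_per_s(800, links=3)   # 800MT/s with three DIMMs per channel

# Dunnington: one dedicated 1066MHz FSB link per processor, four processors.
dunnington_per_cpu = peak_gb_per_s(1066)
dunnington_total = peak_gb_per_s(1066, links=4)

print(f"Nehalem rated per processor:  {nehalem_rated:.2f} GB/s")
print(f"Nehalem tested per processor: {nehalem_tested:.2f} GB/s")
print(f"Dunnington FSB per processor: {dunnington_per_cpu:.2f} GB/s")
print(f"Dunnington FSB, all four:     {dunnington_total:.2f} GB/s")
print(f"Per-processor ratio (tested): {nehalem_tested / dunnington_per_cpu:.2f}x")
```

Under these assumptions the tested Nehalem configuration has a little over twice the per-processor memory bandwidth of the Dunnington FSB link, while the system totals (two processors versus four) land in the same neighborhood.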

- 15 kevinclosson May 31, 2009 at 5:34 pm
  
  Thanks for stopping by and commenting, Steve.
  

1 Kevin Closson’s Silly Little Benchmark Is Silly Fast On Nehalem | Structured Data Trackback on April 10, 2009 at 10:51 pm

Kevin Closson's Blog: Platforms, Databases and Storage