I haven’t made a blog entry in about two weeks, but that is not due to a lack of topics. I get a constant flow of email from readers requesting topics they’d like to see covered on this blog.
I’m speaking at Hotsos Symposium 2010 next month, and my presentation is a deep dive into many of the topics readers ask me to blog about. I’ve posted the abstract below. The abstract is also posted on the Hotsos Symposium 2010 speaker page.
According to the speaker schedule I’m presenting in a time slot adjacent to Tom Kyte’s session about PL/SQL. So, if you are one of the 3 or so people who, for one bizarre reason or another, decide not to see Tom, perhaps you can attend my session. We’ll all be able to stretch out as there will be plenty of room 🙂
Here’s the abstract:
Ten Years After Y2K And We Still “Party Like It’s 1999”
Whether you call it “the two-thousands”, the “Ohs”, “The Naughties” or “The Aughts”, the first decade of this millennium is over and it ushered in a significant amount of new technology related to Oracle. Don’t be alarmed; this is not one of those worthless technical chronology presentations. After all, is there anyone who isn’t aware that the decade started with the introduction of Real Application Clusters—and thus the demise of the large, central server—and finished with Oracle acquiring Sun Microsystems? This presentation has nothing to do with any of that! In spite of how much technology has changed, we really do still seem to be stuck in the 1990s. The following is a sample of some of the topics I’ll be diving into—deeply.
- We still think a CPU is a CPU.
- We still think memory is memory.
- We still think bulk data loading is a disk-intensive operation.
- We still think direct-attached storage is for “small” systems.
- We still think database==structured and file system==unstructured.
- We still think NUMA is niche technology.
- We still think NFS is a file serving protocol.
And, of course, I’ll be talking a bit about Oracle Exadata Storage Server and the Oracle Database Machine.
Hi Kevin
I’ll be one of the 3 people…
CU
Chris
Thanks, Christian… I’ll be in good company that’s for sure!
Chris, That means we just need to find a friend to bring along.
Kevin, I was reading this abstract on my phone in an airport and it made me chuckle. Looking forward to it.
One of the email questions I get most often lately is what processor threading (specifically Intel Nehalem SMT) means to Oracle Database throughput and scalability. I have held back from blogging about it but will be going into that topic deeply during the presentation. Shucks, lots of folks don’t even understand processor threading at all (e.g., when switches occur, etc.). If there are any such folks at Hotsos who can’t bear standing room only for Tom Kyte’s adjacent session, I hope they read my abstract and choose the spacious, lounge-like environment of a nearly empty room. Who knows, there might even be some learnin’ going on as well 🙂
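For anyone who wants to poke at SMT before the session, here is a minimal C sketch of why two sibling hardware threads are not two independent CPUs. It assumes Linux and gcc -O2 -pthread, and it assumes logical CPUs 0 and 1 are SMT siblings of the same physical core; the real pairing on a given box is in /sys/devices/system/cpu/cpu0/topology/thread_siblings_list, so treat the CPU numbers below as placeholders. It times a fixed amount of ALU-bound work on one logical CPU and then on both siblings at once; if the two-thread run takes noticeably longer than the one-thread run, the “two CPUs” are really sharing one core’s execution pipeline.

/* smt_sibling_sketch.c -- build with: gcc -O2 -pthread smt_sibling_sketch.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define WORK_ITERS 400000000ULL   /* fixed amount of ALU-bound work per thread */

static void *spin(void *arg)
{
    /* Pin this thread to the requested logical CPU. */
    int cpu = *(int *)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* Simple integer loop; volatile keeps the compiler from removing it. */
    volatile uint64_t x = 0;
    for (uint64_t i = 0; i < WORK_ITERS; i++)
        x += i ^ (x >> 3);
    return NULL;
}

static double run(int nthreads, int *cpus)
{
    pthread_t t[2];
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, spin, &cpus[i]);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    int solo[] = {0};
    int pair[] = {0, 1};   /* placeholder: assumed SMT siblings of one core */

    /* Independent cores: both runs take about the same wall time.
       SMT siblings: the two-thread run is noticeably slower. */
    printf("1 thread  on cpu 0    : %.2f s\n", run(1, solo));
    printf("2 threads on cpus 0,1 : %.2f s\n", run(2, pair));
    return 0;
}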
With me, the 3-person quota is filled.
Damn! Does it mean I’m on the waiting list???
@Alex
Bring beer. We’ll think about it.
Alex,
I guess I will be with you on the wait list! Yes, Kevin, we will be seeing you then!
I didn’t even know you were going to be at Hotsos, Carol! Cool!
>> We still think NUMA is niche technology.
and
>> lots of folks don’t even understand processor threading
>> at all (e.g., when switches occur, etc).
Our old Sun v1280 vs. our new(er) Sun T2000 helps demonstrate my positive opinions about NUMA – the v1280 is wayyy faster.
I have a good reason for missing your speech – new baby.
Can you give me some references to material that would cover those same topics?
thanks in advance
-paul
Hi Paul,
And I know from history at your site that you folks have a looong lineage of NUMA. I’m not a T2000 or V1280 expert by any means, but it seems there must be at least some things that run faster on the T2000? No?
Kevin,
Our T2000 is a first-generation Niagara: single CPU, 6 cores. Our v1280 is 2 quads, for a total of 8 SPARC III processors. All Oracle RDBMS activity runs faster on the v1280, regardless of parallelism techniques and/or the number of processes/threads/LWPs. One-for-one, 6-for-6, 8-for-8 or 10-for-10, if it’s Oracle, the v1280 is faster. I’m sure the clock speed and internal management of LWPs on the T2000 could be harnessed for something (related to an Oracle instance), but I have yet to see it. I’ve done my best to verify and ensure identical configurations; generally speaking, I’m really “anal” about making sure Test==Production. That goes all the way down to isolating EMC interface cards and mapping specific hypers/metavolumes.
—–
I’m curious about how to rationalize this…
Paul,
You answered the question yourself. You have 6 cores @ 1.0GHz running on the 1st-gen T1 processor with 3MB of cache shared across all cores. Each of these T1 cores was basically cloned off the SPARC II processor… and you are comparing these to 8 SPARC III cores @ 1.2GHz with 8MB of cache per core. I would expect the V1280 to run a bit faster.
A modern CMT server like the “T5220” with 8 cores has about 2-3x greater throughput than the V1280. I would still expect a single thread of execution to be better on the V1280. So, you want to be measuring 64- or 128-thread performance.
Recently, I did a test of a two-socket CMT box, the T5240. This was combined with some testing Kevin was doing on a two-socket Nehalem-EP. With this workload, the T5240’s throughput was just 15% shy of the Nehalem-EP.
Now that the performance is better understood, look at the savings in space, power, and cooling. The V1280 is a 12 RU server that uses far more of those resources than a T5220, which takes up only 2 RU.
The most recent TPC-C result by Oracle/Sun shows 12xT5440 combined with flash and disk storage. This result used only 9 racks of gear… while the IBM result of similar magnitude used around 60 racks of gear.
Glenn, thank you for the information.
Our v1280 has 900MHz processors, not 1.2GHz. For our Oracle workloads (my only experience), the v1280 is 40-50% faster than the T2000 – I still think I’ve been really careful to vary only one thing: server hardware.
Kevin wrote, in September 2009:
——-
If you run code that spends a significant portion of processor cycles operating on memory lines in the processor cache, you are operating code that has a very low CPI (cycles per instruction) cost. In my terminology such code is “skinny.” On the other hand code that jumps around in memory causing processor stalls for memory loads has a high CPI and is, in my terminology, fat.
——–
Is that a plausible explanation? Understanding this stuff will help me make a good choice for the inevitable task of replacing the v1280s. I still wish I could attend the Hotsos talk, and I’d still LOVE to get any references to more information about the topics of the talk.
Hi Paul,
My Hotsos session covers topics I will blog about. I’ve actually postponed series installments so I can present the material at Hotsos first. I don’t know anything about memory latencies on that older CMT stuff (T1), but I’d never bet on 6 cores sharing 3MB of cache versus 8 cores each with 8MB of cache… especially for Oracle workloads. Your quotation of my explanation of CPI for “fat” versus “skinny” code does relate. However, the code is very similar in both cases, so the difference in CPI comes from cache misses (more of them on that T1), further exacerbated by the memory latencies. Glenn might be able to offer a comparison of T1 memory latencies to the modern T5XXX platform. Glenn is right; we are both quite surprised at how well the T5240 holds up to a 2-socket Nehalem EP running the same Oracle workload in both cases. Be aware, however, that a single invocation of the workload Glenn and I are referring to generates 5-fold more throughput on Nehalem EP than on the T5240. As the workload scales up, they demonstrate remarkably similar throughput.
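To put the “fat” versus “skinny” distinction in concrete terms, here is a minimal C sketch (assumptions: Linux, clock_gettime, and a 128 MB buffer chosen only because it dwarfs the caches being discussed). It does the same number of loads and additions twice: first walking an array sequentially, then chasing a randomized pointer chain through the same memory. The second pass spends most of its cycles stalled on cache misses, which is exactly the high-CPI behavior described in the quoted passage.

/* cpi_fat_vs_skinny.c -- build with: gcc -O2 cpi_fat_vs_skinny.c (add -lrt on older glibc) */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16u * 1024 * 1024)   /* 16M 64-bit slots = 128 MB, far larger than cache */

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    uint64_t *next = malloc((size_t)N * sizeof(uint64_t));
    if (!next) return 1;

    /* Sattolo shuffle: turn the identity into one big cycle so the chase
       below visits all N slots in random order. */
    for (uint64_t i = 0; i < N; i++) next[i] = i;
    for (uint64_t i = N - 1; i > 0; i--) {
        uint64_t j = (uint64_t)rand() % i;   /* j in [0, i) */
        uint64_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    uint64_t sum = 0, idx = 0;
    double t;

    /* "Skinny": sequential walk; the hardware prefetcher keeps the core fed,
       so cycles per instruction stay low. */
    t = now();
    for (uint64_t i = 0; i < N; i++) sum += next[i];
    printf("sequential walk: %.3f s (checksum %llu)\n",
           now() - t, (unsigned long long)sum);

    /* "Fat": dependent pointer chase; nearly every load misses cache, the
       core stalls on memory, and CPI balloons. */
    t = now();
    for (uint64_t i = 0; i < N; i++) idx = next[idx];
    printf("random chase   : %.3f s (end %llu)\n",
           now() - t, (unsigned long long)idx);

    free(next);
    return 0;
}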
If by faster you mean that a single-threaded process runs faster on the V1280, then yes, I would expect this to be the case.
The single-threaded differences come from the fact that the four threads share one core… and one thread can’t get all cycles on the core even if the other threads are idle. So, you have to load it up. It is all about throughput.
Oracle is “fat,” as you say, so memory latency does come into play. I am not sure of the latency differences between the two boxes offhand.
take care,
Glenn
Kevin, thanks. I did have the idea of cache misses and memory latency. Don’t know how to prove (or disprove) it, though. That’s ok. I like the extra detail on Sun’s improvements to CMT. Chip designers must be really smart people. Thanks again for taking the time to comment.
-paul