AMD Quad-core | Kevin Closson's Blog: Platforms, Databases and Storage

Archive for the 'AMD Quad-core' Category

Oracle11g Automatic Memory Management – Part III. A NUMA Issue.

Now I’m glad I did that series about Oracle on Linux, The NUMA Angle. In my post about the difference between NUMA and SUMA and “Cyclops”, I shared a lot of information about the dynamics of Oracle running with all the SGA allocated from one memory bank on a NUMA system. Déjà vu.

Well, we’re at it again. As I point out in Part I and Part II of this series, Oracle implements Automatic Memory Management in Oracle Database 11g with memory mapped files in /dev/shm. That got me curious.

Since I exclusively install my Oracle bits on NFS mounts, I thought I’d sling my 11g ORACLE_HOME over to a DL385 I have available in my lab setup. Oh boy am I going to miss that lab when I take on my new job September 4th. Sob, sob. See, when you install Oracle on NFS mounts, the installation is portable. I install 32bit Linux ports via a 32bit server into an NFS mount and I can take it anywhere. In fact, since the database is on an NFS mount (HP EFS Clustered Gateway NAS) I can take ORACLE_HOME and the database mounts to any system running RHEL4, and that includes RHEL4 x86_64 servers even though the ORACLE_HOME is 32bit. That works fine, except 32bit Oracle cannot use libaio on 64bit RHEL4 (unless you invoke everything under the linux32 command environment, that is). I don’t care about that since I use either Oracle Disk Manager or, better yet, Oracle11g Direct NFS. Note, running 32bit Oracle on a 64bit Linux OS is not supported for production, but for my case it helps me check certain things out. That brings us back to /dev/shm on AMD Opteron (NUMA) systems. It turns out the only Opteron system I could test 11g AMM on happens to have x86_64 RHEL4 installed, but, again, no matter.

Quick Test

[root@tmr6s5 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3585 MB
node 1 size: 4095 MB
node 1 free: 3955 MB
[root@tmr6s5 ~]# dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024
1024+0 records in
1024+0 records out
[root@tmr6s5 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3585 MB
node 1 size: 4095 MB
node 1 free: 2927 MB

Uh, that’s not good. I dumped some zeros into a file on /dev/shm and all the memory was allocated from socket 1. Lest anyone forget from my NUMA series (you did read that didn’t you?), writing memory not connected to your processor is, uh, slower:

[root@tmr6s5 ~]# taskset -pc 0-1 $$
pid 9453's current affinity list: 0,1
pid 9453's new affinity list: 0,1
[root@tmr6s5 ~]# time dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
1024+0 records in
1024+0 records out

real    0m1.116s
user    0m0.005s
sys     0m1.111s
[root@tmr6s5 ~]# taskset -pc 1-2 $$
pid 9453's current affinity list: 0,1
pid 9453's new affinity list: 1
[root@tmr6s5 ~]# time dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
1024+0 records in
1024+0 records out

real    0m0.931s
user    0m0.006s
sys     0m0.923s

Yes, 20% slower.
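For anyone who wants to check the 20% figure, here is a minimal sketch of the arithmetic. The two values are the "real" times from the dd runs above; the variable names are made up for the illustration.

```python
# Elapsed ("real") seconds copied from the two timed dd runs above.
slow_run_s = 1.116   # first timed run
fast_run_s = 0.931   # second timed run

# How much longer the slow run took, relative to the fast one.
slowdown = slow_run_s / fast_run_s - 1.0
print(f"slow run took {slowdown:.1%} longer")
```

The ratio lands just under 20 percent, which is where the rounded number comes from.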

What About Oracle?
So, like I said, I mounted that ORACLE_HOME on this Opteron server. What does an AMM instance look like? Here goes:

SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3587 MB
node 1 size: 4095 MB
node 1 free: 3956 MB

SQL> startup pfile=./amm.ora
ORACLE instance started.

Total System Global Area 2276634624 bytes
Fixed Size                  1300068 bytes
Variable Size             570427804 bytes
Database Buffers         1694498816 bytes
Redo Buffers               10407936 bytes
Database mounted.
Database opened.
SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 1331 MB
node 1 size: 4095 MB
node 1 free: 3951 MB

Ick. This means that Oracle11g AMM on Opteron servers is a Cyclops. Odd how this allocation came from memory attached to socket 0 when the file creation with dd(1) landed in socket 1’s memory. Hmm…
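The before/after readings can also be diffed mechanically. The following is only a sketch: it assumes the `numactl --hardware` text looks exactly like the output pasted above, and the helper name `free_mb_by_node` is made up for this example.

```python
import re

def free_mb_by_node(numactl_output):
    """Return {node: free MB} parsed from `numactl --hardware` text."""
    pattern = re.compile(r"^node (\d+) free: (\d+) MB$", re.MULTILINE)
    return {int(node): int(mb) for node, mb in pattern.findall(numactl_output)}

# Readings copied from the SQL*Plus session above.
before_startup = """available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3587 MB
node 1 size: 4095 MB
node 1 free: 3956 MB"""

after_startup = """available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 1331 MB
node 1 size: 4095 MB
node 1 free: 3951 MB"""

free_before = free_mb_by_node(before_startup)
free_after = free_mb_by_node(after_startup)

# Memory consumed per node by the instance startup.
consumed = {node: free_before[node] - free_after[node] for node in free_before}
print(consumed)  # {0: 2256, 1: 5}

# SGA size as reported by Oracle at startup, converted to MB.
sga_mb = 2276634624 / 2**20
print(f"SGA is about {sga_mb:.0f} MB")
```

Node 0 gives up a little more than the whole SGA while node 1 barely moves, which is the Cyclops pattern in numbers.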

What to do? SUMA? Well, it seems as though I should be able to interleave tmpfs memory and use that for /dev/shm, at least according to the tmpfs documentation. And should is the operative word. I have been tweaking for a half hour to get the mpol=interleave mount option (with and without the -o remount technique) to work, to no avail. Bummer!

Impact
If AMD can’t get the Barcelona and/or Budapest Quad-core off the ground (and into high-quality servers from HP/IBM/DELL/Verari), none of this will matter. Actually, come to think of it, unless Barcelona is really, really fast, you won’t be sticking it into your existing Socket F motherboards because that doubles your Oracle license fee (unless you are on standard edition which is priced on socket count). That leaves AMD Quad-core adopters waiting for HyperTransport 3.0 as a remedy. I blogged all this AMD Barcelona stuff already.

Given the NUMA characteristics of /dev/shm, I think I’ll test AMM versus MMM on NUMA, and then test again on SUMA, if I can find the time.

If anyone can get /dev/shm mounted with the mpol option, please let me know because, at times, I can be quite a dolt and I’d love this to be one of them.

Oracle on Opteron with Linux-The NUMA Angle (Part VII).

Published June 5, 2007 AMD Memory Latency , AMD Memory Throughput , AMD Quad-core , NUMA Oracle , Opteron NUMA , Opteron Oracle , Oracle Barcelona 5 Comments

This installment in my series about Oracle on Linux with NUMA hardware is very, very late. I started this series at the end of last year and it just kept getting put off—mostly because the hardware I needed to use was being used for other projects (my own projects). This is the seventh in the series and it’s time to show some Oracle numbers. Previously, I laid groundwork about such topics as SUMA/NUMA, NUMA API and so forth. To make those points I relied on microbenchmarks such as the Silly Little Benchmark. The previous installments can be found here.

To bring home the point that Oracle should be run on AMD boxes in NUMA mode (as opposed to SUMA), I decided to pick an Oracle workload that is very easy to understand as well as processor intensive. After all, the difference between SUMA and NUMA is higher memory latency so testing at any level below processor saturation actually provides the same throughput, albeit the SUMA result would come at a higher processor cost. To that end, measuring SUMA and NUMA at processor saturation is the best way to see the difference.

The workload I’ll use for this testing is what my friend Anjo Kolk refers to as the Jonathan Lewis Oracle Computing Index workload. The workload comes in script form and is very straightforward. The important thing about the workload is that it hammers memory which, of course, is the best way to see the NUMA effect. Jonathan Lewis needs no introduction of course.

The test was set up to execute 4, 8, 16 and 32 concurrent invocations of the JL Comp script. The only difference in the test setup was that in one case I booted the server in SUMA mode and in another I booted in NUMA mode and allocated hugepages. As I point out in this post about SUMA, hugepages are allocated in a NUMA fashion and booting an SGA into this memory offers at least crude fairness in the placement of the SGA pages, certainly much better than a Cyclops. In short, what is being tested here is one case where memory is allocated at boot time in a completely round-robin fashion versus another where the SGA is quasi-round robin yet page tables, kernel-side process-related structures and heap are all NUMA-optimized. Remember, this is no more difficult than a system boot option. Let’s get to the numbers.
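A test driver of this shape is easy to sketch. What follows is not the harness used for these runs, just a minimal illustration of launching N concurrent sessions and timing job completion. The command is a placeholder (a Python no-op); in a real run it would be whatever invokes the workload script, which is not shown here.

```python
import subprocess
import sys
import time

def run_concurrent(cmd, sessions):
    """Start `sessions` copies of cmd at once; return seconds until all have exited."""
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for _ in range(sessions)]
    exit_codes = [proc.wait() for proc in procs]
    elapsed = time.perf_counter() - start
    if any(code != 0 for code in exit_codes):
        raise RuntimeError(f"at least one session failed: {exit_codes}")
    return elapsed

# Placeholder workload: a no-op stands in for the real workload script.
placeholder_cmd = [sys.executable, "-c", "pass"]

timings = {}
for sessions in (4, 8, 16, 32):
    timings[sessions] = run_concurrent(placeholder_cmd, sessions)
    print(f"{sessions:2d} sessions: {timings[sessions]:.2f}s")
```

Run the same loop once after a SUMA boot and once after a NUMA boot and compare the completion times at each session count.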

I have also rolled up all the statspack reports into a Word document (as required by WordPress). The document is numa-statspack.doc and it consists of 8 statspacks, each prefaced by the name of the specific test. If you pattern search for REPORT NAME you will see each entry. Since this is a simple memory latency improvement, you might not be surprised how uninteresting the stats are, except of course for the vast improvement in the number of logical reads per second the NUMA tests were able to push through the system.

SUMA or NUMA
A picture speaks a thousand words. This simple test combined with this simple graph covers it all pretty well. The job completion time ranged from about 12 to 15 percent better with NUMA at each of the concurrent session counts. While 12 to 15% isn’t astounding, remember this workload is completely processor bound. How do you usually recoup 12-15% from a totally processor-bound workload without changing even a single line of code? Besides, this is only one workload and the fact remains that the more your particular workload does outside the SGA (e.g., sorting, etc) the more likely you are to see improvement. But by all means, do not run Oracle with Cyclops memory.

The Moral of the Story
Processors are going to get more cores and slower clock rates and memory topologies will look a lot more NUMA than SUMA as time progresses. I think it is important to understand NUMA.

What is Oracle Doing About It?
Well, I’ve blogged about the fact that the Linux ports of 10g do not integrate with libnuma. That means it is not NUMA-aware. What I’ve tried to show in this series is that the world of NUMA is not binary. There is more to it than SUMA or NUMA-aware. In the middle is booting the server and database in a fashion that at least allows benefit from the OS-side NUMA-awareness. The next step is Oracle NUMA-awareness.

Just recently I was sitting in a developer’s office in bldg 400 of Oracle HQ talking about NUMA. It was a good conversation. He stated that Oracle actually has NUMA awareness in it and I said, “I know.” I don’t think Sequent was on his mind and I can’t blame him—that was a long time ago. The vestiges of NUMA awareness in Oracle 10g trace back to the high-end proprietary NUMA implementations of the 1990s. So if “it’s in there” what’s missing? We both said vgetcpu() at the same time. You see, you can’t have Oracle making runtime decisions about local versus remote memory if a process doesn’t know what CPU it is currently executing on (detection with less than a handful of instructions). Things like vgetcpu() seem to be coming along. That means once these APIs are fully baked, I think we’ll see Oracle resurrect intrinsic NUMA awareness in the Linux port of Oracle Database akin to those wildcat ports of the late 90s…and that should be a good thing.
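To make the "what CPU am I on" question concrete, here is a small user-space sketch. It has nothing to do with Oracle's internals. It assumes a Linux C library that exports sched_getcpu(), reached here through ctypes, and it returns None anywhere that call is missing. The function name `current_cpu` is made up for the example.

```python
import ctypes

def current_cpu():
    """Return the CPU number this thread is running on, or None if unavailable."""
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        sched_getcpu = libc.sched_getcpu
    except (AttributeError, OSError, TypeError):
        return None  # platform has no sched_getcpu() to call
    sched_getcpu.restype = ctypes.c_int
    cpu = sched_getcpu()
    return cpu if cpu >= 0 else None

print("running on CPU:", current_cpu())
```

The answer is only a hint, since the scheduler can migrate the process right after the call returns. That is also why the lookup has to be cheap: it must be repeated often to stay useful.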

AMD Quad-Core “Barcelona” Processor For Oracle. How Badly Do You Need Enterprise Edition Oracle?

Published March 2, 2007 AMD Barcelona , AMD K8L , AMD Quad-core , AMD Quad-Core Performance , Oracle licensing 9 Comments

This blog entry is 6th in a series about Oracle on AMD’s upcoming quad-core processor code named “Barcelona.” The following is a link to the other installments on this thread:

Oracle on Opteron, K8L, NUMA, etc

Got Quad-Core? Need Enterprise Edition Oracle?
There is quite a buzz today about Oracle’s changes to software licensing for the database products. According to this ZDNet article, the changes are specific to the Standard Edition family of database products. The article refers to Oracle’s multi-core pricing guide which was updated on February 16, 2007. Get out your slide rule and gulp a heaping helping of patience.

Quad-Core x86_64
The ZDNet Article states:

Servers with four quad-core chips are relatively rare right now, but Intel and AMD plan to release processors for that segment later this year.

Um, the Xeon “Clovertown” processors are quad-core and shipping already. AMD “Barcelona” is coming out this year. So what does this change really mean? If you use one of the Standard Edition products, you are no longer limited based on cores, but on sockets instead.

Misinformation: Lots of It
It’s Christmas for the bean counters. According to this News.com article, you can just simply switch out Enterprise Edition with Standard Edition:

Customers no longer must buy licenses for each of the 16 cores to run the top-end Enterprise Edition, but instead may buy licenses for the four sockets and run Standard Edition. That cuts list licensing prices from between $320,000 and $480,000–depending on Oracle adjustments that factor in multi-core processor performance–to $60,000.

I am still scratching my head about that one. Customers don’t swap out EE for SE at the drop of a hat—or do you? Since the choice would have never been there before to run SE on that many cores, could it be that SE will start to be the preferred multi-core edition? Can you live without the differences between EE and SE?

Barcelona
Folks who have EE on a 4-Socket F (2200/8200) Opteron system today might be wise to think very hard about whether they can drop to SE because if they plug in Barcelona processors (they are socket-compatible), EE is going to be very, very expensive. That is, if you stay with EE and plug in Barcelona processors you will double your license cost.

I find this to be a very interesting policy change.

AMD Quad-Core “Barcelona” Processor For Oracle (Part V). 40% Expected Over Clovertown.

Published January 26, 2007 AMD Barcelona , AMD K8L , AMD Quad-core , AMD Quad-Core Performance , Oracle Barcelona , Oracle licensing 4 Comments

A reader posted an interesting comment on the latest installment on my thread about Oracle licensing on the upcoming AMD Barcelona processor. The comment as posted on my blog article entitled AMD Quad-Core “Barcelona” Processor For Oracle (Part IV) and the Web 2.0 Trolls states:

The problem with your numbers is that they are based on old AMD marketing materials. AMD has had a chance to run their engineering samples at their second stepping (they are now gearing up full production for late Q2 delivery – 12 weeks from wafer starts) and they are currently claiming a 40% advantage on Clovertown versus the 70% over the Opteron 2200 from their pre-A0 stepping marketing material.

The AMD claim was covered in this ZDNet article which quotes AMD Vice President Randy Allen as follows:

“We expect across a wide variety of workloads for Barcelona to outperform Clovertown by 40 percent,” Allen said. The quad-core chip also will outperform AMD’s current dual-core Opterons on “floating point” mathematical calculations by a factor of 3.6 at the same clock rate, he said.

That is a significantly different set of projections than I covered in my article entitled AMD Quad-core “Barcelona” Processor For Oracle (Part II). That article covers AMD’s initial projections of a 70% OLTP improvement on a per-processor (socket) basis over Opteron 2200. These new projections are astounding, and I would love to see it be the case for the sake of competition. Let’s take a closer look.

Hypertransport Bandwidth
I’m glad AMD has set expectations by stating the 40% uplift over Clovertown would be realized for “a wide variety of workloads.” However, since this is an Oracle blog I would have much preferred to see OLTP mentioned specifically. The numbers are hard to imagine, and it is all about feeding the processor, not the processor itself. The Barcelona processor is socket-compatible with Socket F. Any improvement over Opteron 2200/8200 would require existing headroom on the Hypertransport for workloads like OLTP. A lot of headroom. Let’s look at the numbers.

The Socket F baseline that the original AMD projections were based on was 139,693 TpmC. If OLTP is included in the “wide variety of workloads”, then the projected OLTP throughput would be Clovertown 222,117 TpmC x 1.4, or 310,963 TpmC, all things being equal. This represents 2.2 times the throughput from the same Socket F/Hypertransport setup. Time for a show of hands, how many folks out there think that the Opteron 2200 OLTP result of 139,693 TpmC was achieved with more than 50% headroom to spare on the Hypertransports? I would love to see Barcelona come in with this sort of OLTP throughput, but folks, systems are not made with more than 200% of the bus bandwidth the processors need. I’m not very hopeful.
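The headroom argument can be put in numbers. This sketch rests on one simplifying assumption that is mine, not AMD's: Hypertransport traffic grows in direct proportion to TpmC. The TpmC figures are the ones quoted above.

```python
baseline_tpmc = 139_693     # Opteron 2200 (Socket F) TPC-C result
clovertown_tpmc = 222_117   # Clovertown TPC-C result
claimed_uplift = 1.40       # the "40 percent" over Clovertown claim

projected_tpmc = clovertown_tpmc * claimed_uplift
ratio = projected_tpmc / baseline_tpmc

# Assumption: bus traffic scales linearly with throughput. Under it, the
# baseline run can have used at most 1/ratio of the bus for the same bus
# to carry the projected load.
max_utilization_today = 1 / ratio
required_headroom = 1 - max_utilization_today

print(f"projected: {projected_tpmc:,.0f} TpmC ({ratio:.1f}x the Socket F baseline)")
print(f"headroom the baseline run would have needed: {required_headroom:.0%}")
```

Under that assumption the 139,693 TpmC run would have had to leave more than half of the bus idle.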

Bear in mind that today’s Tulsa processor as packaged in the IBM System x 3950 is capable of 331,087 TpmC with 8 cores. So, let’s factor our Oracle licensing in and see what the numbers look like if AMD’s projections apply to OLTP:

Opteron 2200 4 core: 139,693 TpmC, 2 licenses = 69,846 per license

Clovertown 8 core: 222,117 TpmC, 4 licenses = 55,529 per license

AMD Old Projection 8 core: 237,478 TpmC, 4 licenses = 59,369 per license

AMD New Projection 8 core: 310,963 TpmC, 4 licenses = 77,740 per license

Tulsa 8 core: 331,087 TpmC, 4 licenses = 82,771 per license
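The per-license figures are nothing more than TpmC divided by license count. A quick sketch that reproduces the list above (integer division is used because the published figures are truncated rather than rounded):

```python
# (configuration, TpmC, Oracle licenses) exactly as listed above.
results = [
    ("Opteron 2200 4 core", 139_693, 2),
    ("Clovertown 8 core", 222_117, 4),
    ("AMD Old Projection 8 core", 237_478, 4),
    ("AMD New Projection 8 core", 310_963, 4),
    ("Tulsa 8 core", 331_087, 4),
]

per_license = {name: tpmc // licenses for name, tpmc, licenses in results}

for name, value in per_license.items():
    print(f"{name:26s} {value:>7,} TpmC per license")
```

Sorting `per_license` by value gives the same ranking as the list: Tulsa on top, Clovertown at the bottom.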

Barcelona Floating Point
FPU performance doesn’t matter to Oracle as I point out in this blog entry.

Clock Speed
The news about the expected 40% jump over Clovertown was accompanied by the news that Barcelona will clock in at a lower speed than Opteron 2200/8200 processors. I haven’t mentioned that aspect because with Oracle it really doesn’t matter much. The amount of work Oracle gets done in cache is essentially nil. I’ll blog about clock speed with Opterons very soon.

AMD Quad-Core “Barcelona” Processor For Oracle (Part IV) and the Web 2.0 Trolls.

Published January 25, 2007 AMD Barcelona , AMD K8L , AMD Quad-core , AMD Quad-Core Performance , Opteron Oracle , Oracle Barcelona , Oracle performance 14 Comments

This blog entry is the fourth in a series:

Oracle on Opteron with Linux–The NUMA Angle (Part I)

Oracle on Opteron with Linux-The NUMA Angle (Part II)

Oracle on Opteron with Linux-The NUMA Angle (Part III)

It Really is All About The Core, Not the Processor (Socket)
In my post entitled AMD Quad-core “Barcelona” Processor For Oracle (Part III). NUMA Too!, I had to set a reader straight over his lack of understanding where the terms processor, core and socket are concerned. He followed up with:

kevin – you are correct. your math is fine. though, i may still disagree about core being a better term than “physical processor”, but that is neither here, nor there.

He continues:

my gut told me based upon working with servers and knowing both architectures your calculations were incorrect, instead i errored in my math as you pointed out. *but*, i did uncover an error in your logic that makes your case worthless.

So, I am replying here and now. His gut may just be telling him that he ate something bad, or it could be his conscience getting to him for mouthing off over at the investor village AMD board where he called me a moron. His self-proclaimed server expertise is not relevant here, nor is it likely at the level he insinuates.

This is a blog about Oracle; I wish he’d get that through his head. Oracle licenses their flagship software (Real Application Clusters) at a list price of USD $60,000 per CPU. As I’ve pointed out, x86 cores are factored at .5 so a quad-core Barcelona will be 2 licenses—or $120,000 per socket. Today’s Tulsa processor licenses at $60,000 per socket and outperforms AMD’s projected Barcelona performance. AMD’s own promotional material suggests it will achieve a 70% OLTP (TPC-C) gain over today’s Opteron 2200. Sadly that is just not good enough where Oracle is concerned. I am a huge AMD fan, so this causes me grief.

Also, since he is such a server expert, he must certainly be aware that plugging a Barcelona processor into a Socket F board will need 70% headroom on the Hypertransport in order to attain that projected 70% OLTP increase. We aren’t talking about some CPU-only workload here, we are talking OLTP—as was AMD in that promotional video. OLTP hammers Hypertransport with tons of I/O, tons of contentious shared memory protected with spinlocks (a MESI snooping nightmare) and very large program text. I have seen no data anywhere suggesting this Socket F (Opteron 2200) TPC-C result of 139,693 TpmC was somehow achieved with 70% headroom to spare on the Hypertransport.

Specialized Hardware
Regarding the comparisons being made between the projected Barcelona numbers and today’s Xeon Tulsa, he states:

you are comparing a commodity chip with a specialized chip. those xeon processors in the ibm TPC have 16MB of L3 cache and cost about 6k a piece. amd most likely gave us the performance increase of the commodity version of barcelona, not a specialized version of barcelona. they specifically used it as a comparison, or upgrade of current socket TDP (65W,89W) parts.

What can I say about that? Specialized version of Barcelona? I’ve seen no indication of huge stepping plans, but that doesn’t matter. People run Oracle on specialized hardware. Period. If AMD had a “specialized” Barcelona in the plans, they wouldn’t have predicted a 70% increase over Opteron 2200, particularly not in a slide about OLTP using published TPC-C numbers from Opteron 2200 as the baseline. By the way, the only thing 16MB cache helps with in an Oracle workload is Oracle’s code footprint. Everything else is load/store operations and cache invalidations. The AMD caches are generally too small for that footprint, but because the on-die memory controller is coupled with awesome memory latencies (due to Hypertransport), small cache size hasn’t mattered that much with Opteron 800 and Socket F, though only in comparison to older Xeon offerings. This whole blog thread has been about today’s Xeons and future Barcelona though.

Large L2/L3 Cache Systems with OLTP

Regarding Tulsa Xeon processors used in the IBM System x TPC-C result of 331,087 TpmC, he writes:

the benchmark likely runs in cache on the special case hardware.

Cache-bound TPC-C? Yes, now I am convinced that his gut wasn’t telling him anything useful. I’ve been talking about TPC-C. He, being a server expert, must surely know that TPC-C cannot execute in cache. That Tulsa Xeon number at 331,087 TpmC was attached to 1,008 36.4GB hard drives in a TotalStorage SAN. Does that sound like cache to anyone?

Tomorrow’s Technology Compared to Today’s Technology
He did call for a new comparison that is worth consideration:

we all know the p4 architecture is on the way out and intel has even put an end of line date on the architecture. compare the barcelon to woodcrest

So I’ll reciprocate, gladly. Today’s Clovertown (2 Woodcrest processors essentially glued together) has a TPC-C performance of 222,117 TpmC as seen in this audited Clovertown TPC-C result. Being a quad-core processor, the Oracle licensing is 2 licenses per socket. That means today’s Clovertown performance is 55,529 TpmC per Oracle license compared to the projected Barcelona performance of 59,369 TpmC per Oracle license. That means if you wait for Barcelona you could get 7% more bang for your Oracle buck than you can with today’s shipping Xeon quad-core technology. And, like I said, since Barcelona is going to get plugged into a Socket F board, I’m not very hopeful that the processor will get the required complement of bandwidth to achieve that projected 70% increase over Opteron 2200.

Now, isn’t this blogging stuff just a blast? And yes, unless AMD over-achieves on their current marketing projections for Barcelona performance, I’m going to be really bummed out.

Gettimeofday() and Oracle on AMD Processors

Published January 24, 2007 AMD Quad-core , AMD Quad-Core Performance , Oracle performance 5 Comments

It is pretty well known that the Oracle database relies quite heavily on gettimeofday(2) for timing everything from I/O calls to latch sleeps. The wait interface is coated with gettimeofday() calls. I’ve blogged about Oracle’s heavy reliance upon gettimeofday(2) such as in this entry about DBWR efficiency. In fact, gettimeofday() usage is so high by Oracle that the boutique platforms of yesteryear even went so far as to work out a mapping of the system clock into user space so that a simple CPP macro could be used to get the data, eliminating the function overhead and kernel dive associated with the library routine. Well, it looks like there is relief on the horizon for folks running Linux on AMD. According to this AMD webpage about RDTSCP, there is about a 30% reduction in processor overhead for every call when using a gettimeofday() implementation based upon the new RDTSCP instruction in AMD’s Socket F-compatible processors. The webpage states:

Testing shows that on RDTSCP capable CPUs, vast improvements in the time it takes to make gettimeofday() (GTOD) calls. It takes 324 cycles per call to complete 1 million GTOD calls without RDTSCP and 221 cycles per call with the capability.

Of course that would be a kernel-mode reduction in CPU consumption which is even better for an Oracle database system.
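To put the quoted cycle counts in perspective, here is a small sketch. The 324 and 221 figures come from the AMD page quoted above. The call rate and clock speed at the bottom are hypothetical inputs picked only to show how the calculation goes; they are not measurements from any Oracle system.

```python
CYCLES_WITHOUT_RDTSCP = 324   # cycles per gettimeofday() call, per the AMD page
CYCLES_WITH_RDTSCP = 221      # cycles per call with RDTSCP, per the AMD page

cycles_saved = CYCLES_WITHOUT_RDTSCP - CYCLES_WITH_RDTSCP
reduction = cycles_saved / CYCLES_WITHOUT_RDTSCP
print(f"{cycles_saved} cycles saved per call ({reduction:.1%} reduction)")

def cpu_fraction_saved(calls_per_second, clock_hz):
    """Fraction of one core freed up at the given call rate and clock speed."""
    return cycles_saved * calls_per_second / clock_hz

# Hypothetical inputs: 200,000 calls per second on a 2.6 GHz core.
example = cpu_fraction_saved(calls_per_second=200_000, clock_hz=2.6e9)
print(f"hypothetical example: {example:.2%} of one core")
```

The per-call saving works out a bit better than the "about 30%" quoted, and the payoff scales linearly with how often the call is made.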

I need to get my hands on a Socket F system to see whether the kernel support in RHEL4 U4 and the glibc side of things are set to use this RDTSCP-enabled gettimeofday() right out of the box. If not it might require the vgettimeofday() routine that is under development. If the latter is true it will require Oracle to release a patch to make the correct call—but only on AMD. Hmm, porting trickery. Either way, an optimized gettimeofday() can be a nice little boost. I’ll be sure to blog on that when I get the information. In the meantime, it is nice to see folks like AMD are trying to address these pain points.

Since Oracle calls gettimeofday() so frequently, and they are so very serious about Linux, I wonder why you are reading this here first?

Oracle on Opteron with Linux-The NUMA Angle (Part II)

Published January 19, 2007 AMD Barcelona , AMD K8L , AMD Quad-core , AMD Quad-Core Performance , NUMA Oracle , Opteron Oracle , Oracle Barcelona 1 Comment

A little more groundwork. Trust me, the Linux NUMA API discussion that is about to begin and the microbenchmark and Oracle benchmark tests will make a lot more sense with all this old boring stuff behind you.

Another Terminology Reminder
When discussing NUMA, the term node is not the same as in clusters. Remember that all the memory from all the nodes (or Quads, QBBs, RADs, etc) appears to all the processors as cache-coherent main memory.

More About NUMA Aware Software
As I mentioned in Oracle on Opteron with Linux–The NUMA Angle (Part I), NUMA awareness is a software term that refers to kernel and user mode software that makes intelligent decisions about how to best utilize resources in a NUMA system. I use the generic term resources because as I’ve pointed out, there is more to NUMA than just the non-uniform memory aspect. Yes, the acronym is Non Uniform Memory Access, but the architecture actually supports the notion of having building blocks with only processors and cache, only memory, or only I/O adaptors. It may sound really weird, but it is conceivable that a very specialized storage subsystem could be built and incorporated into a NUMA system by presenting itself as memory. Or, on the other hand, one could envision a very specialized memory component—no processors, just memory—that could be built into a NUMA system. For instance, think of a really large NVRAM device that presents itself as main memory in a NUMA system. That’s much different than an NVRAM card stuffed into something like a PCI bus and accessed with a device driver. Wouldn’t that be a great place to put an in-memory database for instance? Even a system crash would leave the contents in memory. Dealing with such topology requires the kernel to be aware of the differing memory topology that lies beneath it, and a robust user mode API so applications can allocate memory properly (you can’t just blindly malloc(3) yourself into that sort of thing). But alas, I digress since there is no such system commercially available. My intent was merely to expound on the architecture a bit in order to make the discussion of NUMA awareness more interesting.

In retrospect, these advanced NUMA topics are the reason I think Digital’s moniker for the building blocks used in the AlphaServer GS product line was the most appropriate. They used the acronym RAD (Resource Affinity Domain) which opens up the possible list of ingredients greatly. An API call would return RAD characteristics such as how many processors, how much memory (if any) and so on a RAD consisted of. Great stuff. I wonder how that compares to the Linux NUMA API? Hmm, I guess I better get to blogging…

When it comes to the current state of “commodity NUMA” (e.g., Opteron and Itanium) there are no such exotic concepts. Basically, these systems have processors and memory “nodes” with varying latency due to locality—but I/O is equally costly for all processors. I’ll speak mostly of Opteron NUMA with Linux since that is what I deal with the most and that is where I have Oracle running.

For the really bored, here is a link to an AlphaServer GS320 diagram.

The following is a diagram of the Sequent NUMA-Q components that interfaced with the SHV Xeon chipset to make systems with up to 64 processors:

OK, I promise, the next NUMA blog entry will get into the Linux NUMA API and what it means to Oracle.

AMD Quad-Core “Barcelona” Processor For Oracle (Part III). NUMA Too!

Published December 28, 2006 AMD Barcelona , AMD K8L , AMD Quad-core , AMD Quad-Core Performance , oracle , Oracle Barcelona 21 Comments

To continue my thread about AMD’s future Quad-core processors code named “Barcelona” (a.k.a. K8L), I need to elaborate a bit on my last installment on this thread where I pointed out that AMD’s marketing material suggests we should expect 70% better OLTP performance from Barcelona than Socket F (Opteron 2220). To be precise, the marketing materials are predicting a 70% increase on a per-processor basis. That is a huge factor that I need to blog about, so here it is.

“Friendemies”
While doing the technical review for the Julian Dyke/Steve Shaw RAC on Linux Book I got to know Steve Shaw a bit. Since then we have become more familiar with each other especially after manning the HP booth in the exhibitor hall at UKOUG 2006. Here is a photo of Steve in front of the HP Enterprise File Services Clustered Gateway demo. The EFS is an OEMed version of the PolyServe scalable file serving utility (scalable clustered storage that works).

People who know me know I’m a huge AMD fan, but they also know I am not a techno-religious zealot. I pick the best, but there is no room for loyalty in high technology (well, on second thought, I was loyal to Sequent to the bitter end…oh well). So over the last couple of years, Steve and I have occasionally agreed to disagree about the state of affairs between Intel and AMD processor fitness for Oracle. Steve and I are starting to see eye to eye a lot more these days because I’m starting to smell the coffee as they say.

It’s All About The Core
When it comes to Oracle performance on industry standard servers, the only thing I can say is, “It’s the core, stupid”—in that familiar Clintonian style of course. Oracle licenses the database at the rate of .5 per core, rounded up. So a quad-core processor is licensed as 2 CPUs. Let’s look at some numbers.

Since AMD’s Quad-core promo video is based on TPC results, I think it is fair to go with them. TPC-C is not representative of what real applications do to a processor, but the workload does one thing really well—it exploits latency issues. For OLTP, memory latency is the most important performance characteristic. Since AMD’s material sets our expectations for some 70% improvement in OLTP over the Opteron 2200, we’ll look at TPC-C.

This published TPC-C result shows that the Opteron 2200 can perform 69,846 TpmC per processor. If the AMD quad-core promotional video proves right, the Barcelona processor will come in at approximately 118,739 TpmC per processor (a 70% improvement).

TpmC/Oracle-license
Since a quad-core AMD is licensed by Oracle as 2 CPUs, it looks like Barcelona will be capable of 59,370 TpmC per Oracle license. Therein lies the rub, as they say. There are a couple of audited TPC-C results with the Intel “Tulsa” processor (a.k.a. Xeon 7140, 7150), such as this IBM System x result, that show this current high-end Xeon processor is capable of some 82,771 TpmC per processor. Since the Xeon 71[45]0 is a dual-core processor, the Oracle-license price factor is 82,771 TpmC per Oracle license. If these numbers hold any water, some 9 months from now when Barcelona ships, we’ll see a processor that is 28% less price-performant from a strict Oracle licensing standpoint. My fear is that it will be worse than that because Barcelona is socket-compatible with Socket F systems—such as the Opteron 2200. I’ve been at this stuff for a while and I cannot imagine the same chipset having enough headroom to feed a processor capable of 70% more throughput. Also, Intel will not stand still. I am comparing current Xeon to future Barcelona.
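Here is the licensing math as a short sketch. It encodes only what is stated in this post: a factor of .5 per core, rounded up, applied to the per-processor TpmC figures above. The function name is made up for the example.

```python
import math

CORE_FACTOR = 0.5  # per-core licensing factor stated in this post

def licenses_for_cores(cores, core_factor=CORE_FACTOR):
    """Oracle licenses needed for a given core count: cores x factor, rounded up."""
    return math.ceil(cores * core_factor)

# Per-processor TpmC figures from the text.
barcelona_tpmc = 118_739   # projected, quad-core
tulsa_tpmc = 82_771        # audited, dual-core

barcelona_per_license = barcelona_tpmc / licenses_for_cores(4)
tulsa_per_license = tulsa_tpmc / licenses_for_cores(2)

# How far the projected Barcelona trails Tulsa per Oracle license.
gap = 1 - barcelona_per_license / tulsa_per_license
print(f"Barcelona: {barcelona_per_license:,.1f} TpmC per license")
print(f"Tulsa:     {tulsa_per_license:,.1f} TpmC per license")
print(f"gap: {gap:.0%}")
```

The quad-core part needs twice the licenses of the dual-core part, so its per-processor lead turns into a deficit once you divide by license count.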

A Word About TPC-C Analysis
I admit it! I routinely compare TPC-C results on the same processor using results achieved by different databases. For instance, in this post, I use a DB2/SLES result on IBM System x to make a point about the Xeon 7150 (“Tulsa”) processor. E-gad, how can I do that with a clear conscience? Well, think about it this way. If DB2 on IBM System x running SuSE can achieve 82,771 TpmC per Xeon 7150 and this HP result shows us that SQL Server 2005 on ProLiant ML570G4 (Xeon 7140) can do 79,601 TpmC per CPU, you have to at least believe Oracle would do as well. There are no numbers anywhere that suggest Oracle is head and shoulders above either of these two software configurations on identical hardware. We can only guess because Oracle seems to be doing TPC-C with Itanium exclusively these days. I think that is a bummer, but Steve Shaw likes it (he works for Intel)!

What Does NUMA Have To Do With It?
Uh, Opteron/HyperTransport systems are NUMA systems. I haven’t blogged much about that yet, but I will. I know a bit about Oracle on NUMA—a huge bit.

I hope you’ll stay tuned because we’ll be looking at real numbers.

AMD Quad-core “Barcelona” Processor For Oracle (Part II)

Published December 18, 2006. Filed under: AMD Barcelona, AMD K8L, AMD Quad-core, AMD Quad-Core Performance, NUMA Oracle, oracle.

I am a huge AMD fan, but I am now giving up my hopes of finding any substantial information that could be used to predict what Oracle performance might be like on next year’s Barcelona (a.k.a. K8L) quad-core processor. I did, however, find another “interesting blog” while trolling for information on this topic. Note the quotes! Folks, NOTE THE QUOTES!!! I’m insinuating something there…

Lowered Expectations?
Anyway, what I am finding is that by AMD’s own predictions, we should expect Barcelona to outperform Intel’s Clovertown (Xeon 5355) processor by about 15% or so. The problem is that there are no real numbers. You can view this AMD video about Barcelona. In it you’ll find a slide that shows their estimated 70% OLTP improvement over the Opteron 2200SE product. The 2200SE is a Socket F processor and, luckily for us, there is an audited TPC-C result of 34,923 TpmC/core. Note, I’m boiling down TPC results by core to make some sense of this. The Barcelona processor is 100% compatible with the Socket F family. I find it hard to imagine that Barcelona will be able to squeeze out a 70% performance increase from the same chipset. Oh well. But if it did, that would be a TPC-C result of 59,369 TpmC/core. So why then is that AMD video so focused on leap-frogging the Xeon 5355, which “only” gets 30,092 TpmC/core? And why the fixation on the Xeon 5355 when the Xeon 7140 “Tulsa” achieves 39,800 TpmC/core? It was nice and convenient to be able to compare the 2200SE, 5355 and 7140 with TPC results based on the same database: SQL Server.
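The per-core comparison above is easy to reproduce. This is just a sketch using the per-core figures quoted in this post; the dictionary and variable names are mine, and it follows the same reading used above, where the 70% estimate is applied to the 2200SE's per-core number. Whether AMD meant it that way is an open question.

```python
# Audited TpmC per core, as quoted in this post (all SQL Server results).
per_core = {
    "Opteron 2200SE": 34_923,
    "Xeon 5355 (Clovertown)": 30_092,
    "Xeon 7140 (Tulsa)": 39_800,
}

# Assumption: AMD's estimated 70% OLTP gain is applied per core,
# which is how the projection in this post is calculated.
projected_barcelona = per_core["Opteron 2200SE"] * 1.70
print(f"Projected Barcelona: {projected_barcelona:,.0f} TpmC/core")

# How the projection would stack up against each audited result.
for name, tpmc in per_core.items():
    ratio = projected_barcelona / tpmc
    print(f"  vs {name}: {ratio:.2f}x")
```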

I also see no evidence of IBM, HP or Dell planning to base a server on Barcelona. That’s scary. I’m expecting some quasi-inside information from Sun. Let’s see if that will help any of this make sense.

The following is a shot of the AMD slide predicting a 70% performance improvement over the Xeon 5160 and Opteron 2200SE (which, as I point out, is a bit moot):

[Image: AMD slide with the predicted 70% improvement]

OLTP is Old News
Finally, I’m discovering that you don’t get much information about processors when searching for that old, boring OLTP stuff. If I search for “megatasking +AMD” on the other hand—now that produces a richness of information! I’ve also learned that “enthusiast” is a buzzword AMD and Intel are both beating on heavily. I was completely unaware that there is actually what is known as an “enthusiast market”. It seems customers in this particular market buy processors that also wind up in servers for OLTP. I just hope the processors they are making for “enthusiasts” are also reasonably fit for Oracle databases. I’m afraid we aren’t going to know until we find out.

In the meantime, I think I’ll push some megatasking tests through my cluster of DL585s.

AMD Quad-Core “Barcelona” Processor For Oracle (Part I)

Published December 8, 2006. Filed under: AMD Barcelona, AMD K8L, AMD Quad-core, AMD Quad-Core Performance, oracle.

I haven’t seen much in the Oracle blogosphere on this topic. Let me see if I can get it going…

AMD’s move into quad-core processors has me thinking. First, I like how this arstechnica.com article explains that AMD’s quad-core “Barcelona” processor is a “true” quad-core, as opposed to the Xeon 5300 family, which is actually 2 dual-core processors mated in a multi-chip module (MCM). The article reads:

AMD touts Barcelona as a “true” quad-core processor, because it features a highly integrated design with all four cores on a single die with some shared parts. This is in contrast to Intel’s “quad-core” Kentsfield parts, which use package-level integration to get two separate dual-core dies in the same socket. For my part, I’m inclined to agree with AMD that Barcelona is real quad-core and Kentsfield isn’t, but I gave up fighting that semantic fight a long time ago. Nowadays, if it has four cores in a single package, I (grudgingly) call it “quad-core.”

I agree with the author on that point.

Just recently I worked the HP demo booth at UKOUG with Steve Shaw of Intel. I actually found myself playing a little po-tay-toe/po-tah-toe regarding just how true each of these quad-core packages was. Honestly, I think I held that stance for just a moment, because the point is moot. Let me explain. It is all about Oracle licensing.

Oracle licenses Intel cores at .5 of a CPU, rounded up to the next whole number. So a single-socket, quad-core system is .5 x 4, or 2 full CPU licenses. On the other hand, single-socket/dual-core is .5 x 2, or 1 CPU license. With these processors, the challenge is no longer how much power you can get but how little you can get away with. If the workload can be satisfied with a single-socket/dual-core system, the price savings in Oracle licensing alone might motivate folks to buy such a system. Oracle is the most expensive thing you buy, after all. What systems are there that offer significant performance in a single socket/dual-core? Itanium. It seems you can order the HP Integrity rx3600 with a single socket. There, I said it. Now I need to go kneel on peach pits or something to make me feel properly chastised.
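Here is the licensing rule as a small sketch, in case the rounding is not obvious. The function name and the list of configurations are mine, for illustration only, and the .5 factor is the one quoted in this post, not a lookup of Oracle's current price list.

```python
import math


def oracle_cpu_licenses(sockets, cores_per_socket, core_factor=0.5):
    """Total cores times the core factor, rounded up to a whole license."""
    return math.ceil(sockets * cores_per_socket * core_factor)


# Hypothetical server configurations: (sockets, cores per socket).
for sockets, cores in [(1, 1), (1, 2), (1, 4), (2, 2), (2, 4)]:
    count = oracle_cpu_licenses(sockets, cores)
    print(f"{sockets} socket(s) x {cores} core(s) -> {count} CPU license(s)")
```

Note that the single-core case still costs one full license because of the round-up, which is why single-socket/dual-core is the cheapest useful configuration under this rule.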

There is more to it than hardware. Oracle ports have always lagged for Itanium Linux. In fact, Oracle10g was released on PowerPC Linux before Itanium.

I just think Intel missed the boat in the late 1990s on getting Merced to market in a package worth having. And who really needed another instruction set? Now I digress.
