Archive for the 'Oracle Barcelona' Category

Oracle11g Automatic Memory Management – Part III. A NUMA Issue.

Now I’m glad I did that series about Oracle on Linux, The NUMA Angle. In my post about the difference between NUMA and SUMA and “Cyclops”, I shared a lot of information about the dynamics of Oracle running with all the SGA allocated from one memory bank on a NUMA system. Déjà vu.

Well, we’re at it again. As I point out in Part I and Part II of this series, Oracle implements Automatic Memory Management in Oracle Database 11g with memory mapped files in /dev/shm. That got me curious.

Since I exclusively install my Oracle bits on NFS mounts, I thought I’d sling my 11g ORACLE_HOME over to a DL385 I have available in my lab setup. Oh boy am I going to miss that lab when I take on my new job September 4th. Sob, sob. See, when you install Oracle on NFS mounts, the installation is portable. I install 32bit Linux ports via 32bit server into an NFS mount and I can take it anywhere. In fact, since the database is on an NFS mount (HP EFS Clustered Gateway NAS) I can take ORACLE_HOME and the database mounts to any system with a RHEL4 OS running, and that includes RHEL4 x86_64 servers even though the ORACLE_HOME is 32bit. That works fine, except 32bit Oracle cannot use libaio on 64bit RHEL4 (unless you invoke everything under the linux32 command environment, that is). I don’t care about that since I use either Oracle Disk Manager or, better yet, Oracle11g Direct NFS. Note, running 32bit Oracle on a 64bit Linux OS is not supported for production, but for my case it helps me check certain things out. That brings us back to /dev/shm on AMD Opteron (NUMA) systems. It turns out the only Opteron system I could test 11g AMM on happens to have x86_64 RHEL4 installed, but, again, no matter.

Quick Test

[root@tmr6s5 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3585 MB
node 1 size: 4095 MB
node 1 free: 3955 MB
[root@tmr6s5 ~]# dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024
1024+0 records in
1024+0 records out
[root@tmr6s5 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3585 MB
node 1 size: 4095 MB
node 1 free: 2927 MB

Uh, that’s not good. I dumped some zeros into a file on /dev/shm and all the memory was allocated from socket 1. Lest anyone forget from my NUMA series (you did read that didn’t you?), writing memory not connected to your processor is, uh, slower:

[root@tmr6s5 ~]# taskset -pc 0-1 $$
pid 9453's current affinity list: 0,1
pid 9453's new affinity list: 0,1
[root@tmr6s5 ~]# time dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
1024+0 records in
1024+0 records out

real    0m1.116s
user    0m0.005s
sys     0m1.111s
[root@tmr6s5 ~]# taskset -pc 1-2 $$
pid 9453's current affinity list: 0,1
pid 9453's new affinity list: 1
[root@tmr6s5 ~]# time dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
1024+0 records in
1024+0 records out

real    0m0.931s
user    0m0.006s
sys     0m0.923s

Yes, 20% slower.
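
For anyone who wants to check the arithmetic, here is a minimal sketch using the two "real" times from the dd runs above. The function name is just something made up for illustration.

```python
def percent_slower(slow_seconds, fast_seconds):
    """Return how much longer the slow run took, as a percentage of the fast run."""
    return (slow_seconds - fast_seconds) / fast_seconds * 100.0


# "real" times reported by time(1) in the two dd runs above
both_cpus = 1.116  # shell allowed to run on CPUs 0 and 1
cpu1_only = 0.931  # shell restricted to CPU 1

print(f"{percent_slower(both_cpus, cpu1_only):.1f}% slower")
```

This is a single pair of runs, so treat the result as a rough indication and not a measurement with error bars.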

What About Oracle?
So, like I said, I mounted that ORACLE_HOME on this Opteron server. What does an AMM instance look like? Here goes:

SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3587 MB
node 1 size: 4095 MB
node 1 free: 3956 MB
SQL> startup pfile=./amm.ora
ORACLE instance started.

Total System Global Area 2276634624 bytes
Fixed Size                  1300068 bytes
Variable Size             570427804 bytes
Database Buffers         1694498816 bytes
Redo Buffers               10407936 bytes
Database mounted.
Database opened.
SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 1331 MB
node 1 size: 4095 MB
node 1 free: 3951 MB

Ick. This means that Oracle11g AMM on Opteron servers is a Cyclops. Odd how this allocation came from memory attached to socket 0 when the file creation with dd(1) landed in socket 1’s memory. Hmm…
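
Eyeballing two numactl listings gets old quickly, so here is a minimal sketch that diffs the per-node free memory between two snapshots. The function names are hypothetical, and the parser assumes the "node N free: X MB" line format shown in the output above. The two snapshots are copied from the session above.

```python
import re


def free_mb_by_node(numactl_output):
    """Extract {node number: free MB} from `numactl --hardware` text."""
    pattern = re.compile(r"^node (\d+) free: (\d+) MB", re.MULTILINE)
    return {int(node): int(mb) for node, mb in pattern.findall(numactl_output)}


def consumed_mb(before, after):
    """Return {node number: MB consumed} between two snapshots."""
    free_before = free_mb_by_node(before)
    free_after = free_mb_by_node(after)
    return {node: free_before[node] - free_after[node] for node in sorted(free_before)}


BEFORE = """\
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3587 MB
node 1 size: 4095 MB
node 1 free: 3956 MB
"""

AFTER = """\
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 1331 MB
node 1 size: 4095 MB
node 1 free: 3951 MB
"""

for node, mb in consumed_mb(BEFORE, AFTER).items():
    print(f"node {node}: {mb} MB consumed")
```

Nearly everything the instance allocated shows up against node 0, which is the Cyclops signature.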

What to do? SUMA? Well, it seems as though I should be able to interleave tmpfs memory and use that for /dev/shm, at least according to the tmpfs documentation. And should is the operative word. I have been tweaking for a half hour to get the mpol=interleave mount option to work (with and without the -o remount technique), to no avail. Bummer!

If AMD can’t get the Barcelona and/or Budapest Quad-core off the ground (and into high-quality servers from HP/IBM/DELL/Verari), none of this will matter. Actually, come to think of it, unless Barcelona is really, really fast, you won’t be sticking it into your existing Socket F motherboards because that doubles your Oracle license fee (unless you are on standard edition which is priced on socket count). That leaves AMD Quad-core adopters waiting for HyperTransport 3.0 as a remedy. I blogged all this AMD Barcelona stuff already.

Given the NUMA characteristics of /dev/shm, I think I’ll test AMM versus MMM on NUMA, and then test again on SUMA, if I can find the time.

If anyone can get /dev/shm mounted with the mpol option, please let me know because, at times, I can be quite a dolt and I’d love this to be one of them.

Oracle on Opteron with Linux-The NUMA Angle (Part VII).

This installment in my series about Oracle on Linux with NUMA hardware is very, very late. I started this series at the end of last year and it just kept getting put off—mostly because the hardware I needed to use was being used for other projects (my own projects). This is the seventh in the series and it’s time to show some Oracle numbers. Previously, I laid groundwork about such topics as SUMA/NUMA, NUMA API and so forth. To make those points I relied on microbenchmarks such as the Silly Little Benchmark. The previous installments can be found here.

To bring home the point that Oracle should be run on AMD boxes in NUMA mode (as opposed to SUMA), I decided to pick an Oracle workload that is very easy to understand as well as processor intensive. After all, the difference between SUMA and NUMA is higher memory latency, so testing at any level below processor saturation actually provides the same throughput, albeit the SUMA result would come at a higher processor cost. To that end, measuring SUMA and NUMA at processor saturation is the best way to see the difference.

The workload I’ll use for this testing is what my friend Anjo Kolk refers to as the Jonathan Lewis Oracle Computing Index workload. The workload comes in script form and is very straightforward. The important thing about the workload is that it hammers memory which, of course, is the best way to see the NUMA effect. Jonathan Lewis needs no introduction of course.

The test was set up to execute 4, 8, 16 and 32 concurrent invocations of the JL Comp script. The only difference in the test setup was that in one case I booted the server in SUMA mode and in another I booted in NUMA mode and allocated hugepages. As I point out in this post about SUMA, hugepages are allocated in a NUMA fashion and booting an SGA into this memory offers at least crude fairness in the placement of the SGA pages, certainly much better than a Cyclops. In short, what is being tested here is one case where memory is allocated at boot time in a completely round-robin fashion versus another where the SGA is quasi-round robin yet page tables, kernel-side process-related structures and heap are all NUMA-optimized. Remember, this is no more difficult than a system boot option. Let’s get to the numbers.


I have also rolled up all the statspack reports into a word document (as required by WordPress). The document is numa-statspack.doc and it consists of 8 statspacks, each prefaced by the name of the specific test. If you pattern search for REPORT NAME you will see each entry. Since this is a simple memory latency improvement, you might not be surprised how uninteresting the stats are, except of course the vast improvement in the number of logical reads per second the NUMA tests were able to push through the system.

A picture speaks a thousand words. This simple test combined with this simple graph covers it all pretty well. The job completion time ranged from about 12 to 15 percent better with NUMA at each of the concurrent session counts. While 12 to 15% isn’t astounding, remember this workload is completely processor bound. How do you usually recoup 12-15% from a totally processor-bound workload without changing even a single line of code? Besides, this is only one workload and the fact remains that the more your particular workload does outside the SGA (e.g., sorting) the more likely you are to see improvement. But by all means, do not run Oracle with Cyclops memory.

The Moral of the Story

Processors are going to get more cores and slower clock rates and memory topologies will look a lot more NUMA than SUMA as time progresses. I think it is important to understand NUMA.

What is Oracle Doing About It?
Well, I’ve blogged about the fact that the Linux ports of 10g do not integrate with libnuma. That means they are not NUMA-aware. What I’ve tried to show in this series is that the world of NUMA is not binary. There is more to it than SUMA or NUMA-aware. In the middle is booting the server and database in a fashion that at least allows benefit from the OS-side NUMA-awareness. The next step is Oracle NUMA-awareness.

Just recently I was sitting in a developer’s office in bldg 400 of Oracle HQ talking about NUMA. It was a good conversation. He stated that Oracle actually has NUMA awareness in it and I said, “I know.” I don’t think Sequent was on his mind and I can’t blame him—that was a long time ago. The vestiges of NUMA awareness in Oracle 10g trace back to the high-end proprietary NUMA implementations of the 1990s.  So if “it’s in there” what’s missing? We both said vgetcpu() at the same time. You see, you can’t have Oracle making runtime decisions about local versus remote memory if a process doesn’t know what CPU it is currently executing on (detection with less than a handful of instructions).  Things like vgetcpu() seem to be coming along. That means once these APIs are fully baked, I think we’ll see Oracle resurrect intrinsic NUMA awareness in the Linux port of Oracle Database akin to those wildcat ports of the late 90s…and that should be a good thing.
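
To make the "what CPU am I on" question concrete, here is a minimal sketch that asks the C library for the current CPU. It uses glibc's sched_getcpu(), which I am assuming here as a stand-in for the vgetcpu() interface discussed above; the wrapper name current_cpu is hypothetical. Going through Python and ctypes costs far more than a handful of instructions, so this only illustrates the question being asked, not the fast path Oracle would need.

```python
import ctypes


def current_cpu():
    """Return the CPU number this thread is running on, or None if unavailable."""
    try:
        # Load the symbols already linked into the running process (libc included).
        libc = ctypes.CDLL(None)
        cpu = libc.sched_getcpu()
    except (OSError, AttributeError, TypeError):
        # Not Linux/glibc, or the symbol is missing on this platform.
        return None
    return cpu if cpu >= 0 else None


print("currently executing on CPU:", current_cpu())
```

The answer can be stale the moment it is returned, since the scheduler is free to migrate the process, which is one reason the lookup has to be so cheap to be useful.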

Oracle on Opteron with Linux-The NUMA Angle (Part VI). Introducing Cyclops.

This is part 6 in a series about Oracle on Opteron-based NUMA servers running Linux. The list of prior installments can be found through my index of NUMA-related posts.

In part 5 of the series I discussed using Opteron-based servers with NUMA features disabled in the BIOS. Running an Opteron server (e.g., HP Proliant DL585) in this fashion is sometimes called SUMA (Sufficiently Uniform Memory Access) or SUMO (Sufficiently Uniform Memory Organization). At the risk of being controversial, I pointed out that in the Oracle Validated Configuration listing for Proliant, the recommendation is given to configure Opteron-based servers as SUMO/SUMA. In my experience, most folks do not change the BIOS and are therefore running a NUMA system since that is the default. However, if steps are taken to disable NUMA on an Opteron system, there are subtleties that warrant deeper understanding. How subtle are the subtleties? That question is the main theme of this blog series.

Memory Latencies with SUMA/SUMO vs NUMA
In part 5 of the series, I used the SLB memory latency workload to show how memory writes differ in NUMA versus SUMA/SUMO. I wrote:

Writing memory on the SUMA configuration in the 8 concurrent memhammer case demonstrated latencies on order of 156ns but dropped 38% to 97ns by switching to NUMA and using the Linux 2.6 NUMA API.

But What About Oracle?
What is the cost of running Oracle on SUMA? The simple answer is, it depends. More architectural background is needed before I go into that.

OK, so SUMA is what you get when you tweak a Proliant Opteron-based server so that memory is interleaved at the low level. Accompanying this with the setting of numa=off in the grub.conf file gets you a completely non-NUMA setup.

NUMA enabled in the BIOS, however, is the default. If the Oracle ports to Linux were NUMA-aware, that would be just fine. However, if the server isn’t configured as a SUMA and you boot Oracle without any consideration for the fact that you are on a NUMA system, you get what I call Cyclops. Let’s take a look at what I mean.

In the following screen shot I have booted an Oracle10g SGA of 7584MB on my Proliant DL585. The system is configured with 32GB physical memory which is, of course, 4 banks of 8GB, each attached to one of the 4 dual-core Opterons (nodes). Before booting this SGA, I had between roughly 7.6GB and 7.7GB free memory on each of the memory banks. In the following figure it’s clear that after booting this 7584MB SGA I am left with only 116MB free on node 0 (socket 0), while the other nodes are untouched. Cyclops!



Right, so really bad things can happen if processes that are memory-resident on node 0 try to allocate more memory. In the 2.4 Kernel timeframe Red Hat points out such ill effects as OOM process termination in this web page. I haven’t spent much time researching how 2.6 responds to it because the point of this blog entry is to not get into such a situation.

Let’s consider what things are like on a Cyclops even if there are no process or memory allocation failures. Let’s say, for instance, there is a listener with soft node affinity to node 2. All the sessions it forks off will have node affinity to node 2 where they will be granted pages for their kernel structures, page tables, stack, heap and so on. However, the entire SGA is remote memory since as you can see all the memory for the SGA was allocated from node 0. That is, um, not good.

Hugepages Are More Attractive Than Cyclops
Cyclops pops up its ugly single-eyed head only when you are running NUMA (not SUMA/SUMO) and fail to allocate/use hugepages. Whether you allocate hugepages off the grub boot line or out of sysctl.conf, memory for hugepages is allocated in a distributed fashion from the various memory banks. Did I say round-robin? No. Because I don’t yet know whether it is round-robin or segmented. I have to leave something to blog about in the future.

The following is a screen shot of a session where I allocated 3800 2MB hugepages after the system was booted by echoing that value into /proc/sys/vm/nr_hugepages. Notice that unlike Cyclops, the pages are allocated for Oracle’s future use in a more distributed fashion from the various memory banks. I then booted Oracle. No Cyclops here.
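
As a sanity check on that page count, here is a minimal sketch of the sizing arithmetic. It assumes the same 7584MB SGA used earlier in this post; the function name is hypothetical.

```python
import math


def hugepages_needed(sga_mb, hugepage_mb=2):
    """Return the number of hugepages required to back an SGA of sga_mb megabytes."""
    return math.ceil(sga_mb / hugepage_mb)


sga_mb = 7584      # SGA size from the Cyclops example (assumed to be the same here)
allocated = 3800   # value echoed into /proc/sys/vm/nr_hugepages

needed = hugepages_needed(sga_mb)
print(f"needed: {needed} pages, allocated: {allocated} pages")
print(f"headroom: {allocated * 2 - sga_mb} MB")
```

If the pool is smaller than the SGA, the SGA cannot be fully backed by hugepages, so a little headroom is the safer side to err on.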


Interleaving NUMA Memory Allocation
The numactl(8) command supports the notion of pushing memory allocation preferences down to its children. Until such time as the Linux port of Oracle is NUMA-aware internally (as was done in the Sequent DYNIX/ptx, SGI, DG, and to a lesser degree the Solaris Oracle10g port with MPO), the best hope for efficient memory usage on a commodity NUMA system is to interleave the placement of shared memory via numactl(8). With the SGA allocated in this fashion on a 4-socket NUMA system, Oracle’s memory accesses for the variable and buffer pool components will have locality of up to 25%, generally speaking. Yes, I’m sure some session could go crazy with logical reads of 2 buffers 20,000 times per second or some pathological situation, but I am trying to cover the topic in more general terms. You might wonder how this differs from SUMA/SUMO though.
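
Here is a toy model of where that 25% comes from. It is only a sketch: it assumes 2MB pages, strict round-robin placement across 4 nodes, and a session that touches every SGA page equally often. None of that is measured behavior, and the names are made up for illustration.

```python
def local_fraction(page_nodes, my_node):
    """Fraction of pages that live on the same node as the accessing process."""
    hits = sum(1 for node in page_nodes if node == my_node)
    return hits / len(page_nodes)


NODES = 4
PAGES = 3792  # 2MB pages in a 7584MB SGA

cyclops = [0] * PAGES                                   # every page on node 0
interleaved = [page % NODES for page in range(PAGES)]   # round-robin placement

for node in range(NODES):
    print(
        f"process on node {node}: "
        f"cyclops {local_fraction(cyclops, node):.0%} local, "
        f"interleaved {local_fraction(interleaved, node):.0%} local"
    )
```

With Cyclops, sessions on node 0 get all-local SGA access and everybody else gets none. Interleaving trades that lottery for the same modest locality on every node.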

With SUMA, all memory is interleaved. That means even the NUMA-aware Linux 2.6 kernel cannot exploit the hardware architecture by allocating structures with respect to the memory hierarchies. That is a pure waste. Moreover, with SUMA, 100% of your Oracle memory accesses will hit interleaved memory. That includes PGA. In contrast, properly allocated NUMA-interleaved hugepages result in fairness in the SGA placement, but allocations in the PGA (heap) and stack for the sessions are 100% local memory! That is a good thing. In the following screen shot I coupled numactl(8) memory interleaving with hugepages.


Validated Oracle Configuration
As I pointed out, this Oracle Validated Configuration listing for Proliant recommends turning off NUMA. Now that I’m an HP employee, I’ll have to pursue that a bit because I don’t agree with it at all. You’ll see why when I post my performance measurements contrasting NUMA (with interleaved hugepages) to SUMA/SUMO. Look at that Validated Configuration web page closely and you’ll see a recommendation to allow Oracle to use hugepages by tuning /etc/security/limits.conf, but no mention of actually allocating hugepages, either from the grub boot line or via the sysctl.conf file!

Could it be that the recommendations in this Validated Configuration were a knee-jerk reaction to Cyclops? I’m not much of a betting man, but I’d wager $5.00 that was the case. Like I said, I’m in HP now…I’ll have to see what all that was about.

Up Next
In my next installment, I will provide Oracle measurements contrasting SUMA and NUMA. I know I’ve said this would be the installment with Oracle performance numbers, but I had to lay too much ground work in this post. The mind can only absorb what the seat can endure.

Patent Infringement
For all you folks that hate the concept of software patents, here’s a good one. When my Sequent colleagues and I were working out the OS-requirements to support our NUMA-optimizations of the Oracle 8 port to Sequent’s NUMA-Q system, we knew early on we’d need a very rich set of enhancements to shmget() for memory region placement. So we specified the requirements to our OS developers. Lo and behold U.S. Patent 6,505,286 plopped out. So, for extra credit, can someone explain to me how the Linux 2.6 libnuma call numa_alloc_onnode() (described here) is not in complete violation of that patent? Hmmm…

Now for a real taste of NUMA-Oracle history, read the following: Sequent_NUMA_Oracle8i

AMD Quad-Core “Barcelona” Processor For Oracle (Part V). 40% Expected Over Clovertown.

A reader posted an interesting comment on the latest installment on my thread about Oracle licensing on the upcoming AMD Barcelona processor. The comment as posted on my blog article entitled AMD Quad-Core “Barcelona” Processor For Oracle (Part IV) and the Web 2.0 Trolls states:

The problem with your numbers is that they are based on old AMD marketing materials. AMD has had a chance to run their engineering samples at their second stepping (they are now gearing up full production for late Q2 delivery – 12 weeks from wafer starts) and they are currently claiming a 40% advantage on Clovertown versus the 70% over the Opteron 2200 from their pre-A0 stepping marketing material.

The AMD claim was covered in this ZDNet article which quotes AMD Vice President Randy Allen as follows:

“We expect across a wide variety of workloads for Barcelona to outperform Clovertown by 40 percent,” Allen said. The quad-core chip also will outperform AMD’s current dual-core Opterons on “floating point” mathematical calculations by a factor of 3.6 at the same clock rate, he said.

That is a significantly different set of projections than I covered in my article entitled AMD Quad-core “Barcelona” Processor For Oracle (Part II). That article covers AMD’s initial projections of a 70% OLTP improvement on a per-processor (socket) basis over Opteron 2200. These new projections are astounding, and I would love to see it be the case for the sake of competition. Let’s take a closer look.

Hypertransport Bandwidth
I’m glad AMD has set expectations by stating the 40% uplift over Clovertown would be realized for “a wide variety of workloads.” However, since this is an Oracle blog I would much have preferred to see OLTP mentioned specifically. The numbers are hard to imagine, and it is all about feeding the processor, not the processor itself. The Barcelona processor is socket-compatible with Socket F. Any improvement of Opteron 2200/8200 would require existing headroom on the Hypertransport for workloads like OLTP. A lot of headroom—let’s look at the numbers.

The Socket F baseline that the original AMD projections were based on was 139,693 TpmC. If OLTP is included in the “wide variety of workloads”, then the projected OLTP throughput would be Clovertown 222,117 TpmC x 1.4, or 310,963 TpmC, all things being equal. This represents 2.2 times the throughput from the same Socket F/Hypertransport setup. Time for a show of hands, how many folks out there think that the Opteron 2200 OLTP result of 139,693 TpmC was achieved with more than 50% headroom to spare on the Hypertransports? I would love to see Barcelona come in with this sort of OLTP throughput, but folks, systems are not built with more than twice the bus bandwidth the processors need. I’m not very hopeful.
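
A minimal sketch of that projection follows. The headroom figure at the end rests on an assumption of mine, namely that Hypertransport traffic scales linearly with OLTP throughput; the variable names are for illustration only.

```python
clovertown_tpmc = 222_117     # published Clovertown result cited above
opteron_2200_tpmc = 139_693   # Socket F baseline cited above

# 40% over Clovertown, done in integer math to avoid float rounding surprises
projected_tpmc = clovertown_tpmc * 14 // 10
ratio = projected_tpmc / opteron_2200_tpmc

# Assumption: bus traffic grows in step with throughput, so the baseline
# could have been using at most 1/ratio of the available bandwidth.
max_baseline_utilization = 1 / ratio

print(f"projected: {projected_tpmc:,} TpmC")
print(f"ratio over Opteron 2200: {ratio:.1f}x")
print(f"baseline bus utilization would need to be under {max_baseline_utilization:.0%}")
```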


Bear in mind that today’s Tulsa processor as packaged in the IBM System x 3950 is capable of 331,087 TpmC with 8 cores. So, let’s factor our Oracle licensing in and see what the numbers look like if AMD’s projections apply to OLTP:

Opteron 2200 4 core: 139,693 TpmC, 2 licenses = 69,846 per license

Clovertown 8 core: 222,117 TpmC, 4 licenses = 55,529 per license

AMD Old Projection 8 core: 237,478 TpmC, 4 licenses = 59,369 per license

AMD New Projection 8 core: 310,963 TpmC, 4 licenses = 77,740 per license

Tulsa 8 core: 331,087 TpmC, 4 licenses = 82,771 per license
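
The list above can be reproduced with a few lines. This is a sketch that applies the .5 per core factor (rounded up) and truncates the per-license figure the same way the list does; the TpmC values are the ones quoted in this post.

```python
import math

CORE_FACTOR = 0.5  # Oracle's x86 per-core license factor, as described in this series

# (system, total cores, TpmC) as quoted above
systems = [
    ("Opteron 2200", 4, 139_693),
    ("Clovertown", 8, 222_117),
    ("AMD old projection", 8, 237_478),
    ("AMD new projection", 8, 310_963),
    ("Tulsa", 8, 331_087),
]


def tpmc_per_license(cores, tpmc):
    """Return (licenses, TpmC per license), truncated to a whole number."""
    licenses = math.ceil(cores * CORE_FACTOR)
    return licenses, tpmc // licenses


for name, cores, tpmc in systems:
    licenses, per_license = tpmc_per_license(cores, tpmc)
    print(f"{name:20s} {tpmc:>8,} TpmC  {licenses} licenses  {per_license:>7,} per license")
```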

Barcelona Floating Point
FPU performance doesn’t matter to Oracle as I point out in this blog entry.

Clock Speed
The news about the expected 40% jump over Clovertown was accompanied by the news that Barcelona will clock in at a lower speed than Opteron 2200/8200 processors. I haven’t mentioned that aspect because with Oracle it really doesn’t matter much. The amount of work Oracle gets done in cache is essentially nil. I’ll blog about clock speed with Opterons very soon.

AMD Quad-Core “Barcelona” Processor For Oracle (Part IV) and the Web 2.0 Trolls.

This blog entry is the fourth in a series:

Oracle on Opteron with Linux–The NUMA Angle (Part I)

Oracle on Opteron with Linux-The NUMA Angle (Part II)

Oracle on Opteron with Linux-The NUMA Angle (Part III)

It Really is All About The Core, Not the Processor (Socket)
In my post entitled AMD Quad-core “Barcelona” Processor For Oracle (Part III). NUMA Too!, I had to set a reader straight over his lack of understanding where the terms processor, core and socket are concerned. He followed up with:

kevin – you are correct. your math is fine. though, i may still disagree about core being a better term than “physical processor”, but that is neither here, nor there.

He continues:

my gut told me based upon working with servers and knowing both architectures your calculations were incorrect, instead i errored in my math as you pointed out. *but*, i did uncover an error in your logic that makes your case worthless.

So, I am replying here and now. His gut may just be telling him that he ate something bad, or it could be his conscience getting to him for mouthing off over at the investor village AMD board where he called me a moron. His self-proclaimed server expertise is not relevant here, nor is it likely at the level he insinuates.

This is a blog about Oracle; I wish he’d get that through his head. Oracle licenses their flagship software (Real Application Clusters) at a list price of USD $60,000 per CPU. As I’ve pointed out, x86 cores are factored at .5 so a quad-core Barcelona will be 2 licenses—or $120,000 per socket. Today’s Tulsa processor licenses at $60,000 per socket and outperforms AMD’s projected Barcelona performance. AMD’s own promotional material suggests it will achieve a 70% OLTP (TPC-C) gain over today’s Opteron 2200. Sadly that is just not good enough where Oracle is concerned. I am a huge AMD fan, so this causes me grief.

Also, since he is such a server expert, he must certainly be aware that plugging a Barcelona processor into a Socket F board will need 70% headroom on the Hypertransport in order to attain that projected 70% OLTP increase. We aren’t talking about some CPU-only workload here, we are talking OLTP—as was AMD in that promotional video. OLTP hammers Hypertransport with tons of I/O, tons of contentious shared memory protected with spinlocks (a MESI snooping nightmare) and very large program text. I have seen no data anywhere suggesting this Socket F (Opteron 2200) TPC-C result of 139,693 TpmC was somehow achieved with 70% headroom to spare on the Hypertransport.

Specialized Hardware
Regarding the comparisons being made between the projected Barcelona numbers and today’s Xeon Tulsa, he states:

you are comparing a commodity chip with a specialized chip. those xeon processors in the ibm TPC have 16MB of L3 cache and cost about 6k a piece. amd most likely gave us the performance increase of the commodity version of barcelona, not a specialized version of barcelona. they specifically used it as a comparison, or upgrade of current socket TDP (65W,89W) parts.

What can I say about that? Specialized version of Barcelona? I’ve seen no indication of huge stepping plans, but that doesn’t matter. People run Oracle on specialized hardware. Period. If AMD had a “specialized” Barcelona in the plans, they wouldn’t have predicted a 70% increase over Opteron 2200, particularly not in a slide about OLTP using published TPC-C numbers from Opteron 2200 as the baseline. By the way, the only thing 16MB cache helps with in an Oracle workload is Oracle’s code footprint. Everything else is load/store operations and cache invalidations. The AMD caches are generally too small for that footprint. However, since the on-die memory controller is coupled with awesome memory latencies (due to Hypertransport), small cache size hasn’t mattered that much with Opteron 800 and Socket F, at least in comparison to older Xeon offerings. This whole blog thread has been about today’s Xeons and future Barcelona though.

Large L2/L3 Cache Systems with OLTP

Regarding Tulsa Xeon processors used in the IBM System x TPC-C result of 331,087 TpmC, he writes:

the benchmark likely runs in cache on the special case hardware.

Cache-bound TPC-C? Yes, now I am convinced that his gut wasn’t telling him anything useful. I’ve been talking about TPC-C. He, being a server expert, must surely know that TPC-C cannot execute in cache. That Tulsa Xeon number at 331,087 TpmC was attached to 1,008 36.4GB hard drives in a TotalStorage SAN. Does that sound like cache to anyone?

Tomorrow’s Technology Compared to Today’s Technology
He did call for a new comparison that is worth consideration:

we all know the p4 architecture is on the way out and intel has even put an end of line date on the architecture. compare the barcelon to woodcrest

So I’ll reciprocate, gladly. Today’s Clovertown (essentially 2 Woodcrest processors glued together) has a TPC-C performance of 222,117 TpmC as seen in this audited Woodcrest TPC-C result. Being a quad-core processor, the Oracle licensing is 2 licenses per socket. That means today’s Clovertown performance is 55,529 TpmC per Oracle license compared to the projected Barcelona performance of 59,369 TpmC per Oracle license. That means if you wait for Barcelona you could get 7% more bang for your Oracle buck than you can with today’s shipping Xeon quad-core technology. And, like I said, since Barcelona is going to get plugged into a Socket F board, I’m not very hopeful that the processor will get the required complement of bandwidth to achieve that projected 70% increase over Opteron 2200.

Now, isn’t this blogging stuff just a blast? And yes, unless AMD over-achieves on their current marketing projections for Barcelona performance, I’m going to be really bummed out.

Oracle on Opteron with Linux-The NUMA Angle (Part II)

A little more groundwork. Trust me, the Linux NUMA API discussion that is about to begin and the microbenchmark and Oracle benchmark tests will make a lot more sense with all this old boring stuff behind you.

Another Terminology Reminder
When discussing NUMA, the term node is not the same as in clusters. Remember that all the memory from all the nodes (or Quads, QBBs, RADs, etc.) appears to all the processors as cache-coherent main memory.

More About NUMA Aware Software
As I mentioned in Oracle on Opteron with Linux–The NUMA Angle (Part I), NUMA awareness is a software term that refers to kernel and user mode software that makes intelligent decisions about how to best utilize resources in a NUMA system. I use the generic term resources because as I’ve pointed out, there is more to NUMA than just the non-uniform memory aspect. Yes, the acronym is Non Uniform Memory Access, but the architecture actually supports the notion of having building blocks with only processors and cache, only memory, or only I/O adaptors. It may sound really weird, but it is conceivable that a very specialized storage subsystem could be built and incorporated into a NUMA system by presenting itself as memory. Or, on the other hand, one could envision a very specialized memory component—no processors, just memory—that could be built into a NUMA system. For instance, think of a really large NVRAM device that presents itself as main memory in a NUMA system. That’s much different than an NVRAM card stuffed into something like a PCI bus and accessed with a device driver. Wouldn’t that be a great place to put an in-memory database for instance? Even a system crash would leave the contents in memory. Dealing with such topology requires the kernel to be aware of the differing memory topology that lies beneath it, and a robust user mode API so applications can allocate memory properly (you can’t just blindly malloc(3) yourself into that sort of thing). But alas, I digress since there is no such system commercially available. My intent was merely to expound on the architecture a bit in order to make the discussion of NUMA awareness more interesting.

In retrospect, these advanced NUMA topics are the reason I think Digital’s moniker for the building blocks used in the AlphaServer GS product line was the most appropriate. They used the acronym RAD (Resource Affinity Domain) which opens up the possible list of ingredients greatly. An API call would return the characteristics of a RAD, such as how many processors and how much memory (if any) it consisted of. Great stuff. I wonder how that compares to the Linux NUMA API? Hmm, I guess I better get to blogging…

When it comes to the current state of “commodity NUMA” (e.g., Opteron and Itanium) there are no such exotic concepts. Basically, these systems have processors and memory “nodes” with varying latency due to locality—but I/O is equally costly for all processors. I’ll speak mostly of Opteron NUMA with Linux since that is what I deal with the most and that is where I have Oracle running.

For the really bored, here is a link to an AlphaServer GS320 diagram.

The following is a diagram of the Sequent NUMA-Q components that interfaced with the SHV Xeon chipset to make systems with up to 64 processors:


OK, I promise, the next NUMA blog entry will get into the Linux NUMA API and what it means to Oracle.

AMD Quad-Core “Barcelona” Processor For Oracle (Part III). NUMA Too!

To continue my thread about AMD’s future Quad-core processors code named “Barcelona” (a.k.a. K8L), I need to elaborate a bit on my last installment on this thread where I pointed out that AMD’s marketing material suggests we should expect 70% better OLTP performance from Barcelona than Socket F (Opteron 2220). To be precise, the marketing materials are predicting a 70% increase on a per-processor basis. That is a huge factor that I need to blog about, so here it is.

While doing the technical review for the Julian Dyke/Steve Shaw RAC on Linux Book I got to know Steve Shaw a bit. Since then we have become more familiar with each other especially after manning the HP booth in the exhibitor hall at UKOUG 2006. Here is a photo of Steve in front of the HP Enterprise File Services Clustered Gateway demo. The EFS is an OEMed version of the PolyServe scalable file serving utility (scalable clustered storage that works).


People who know me know I’m a huge AMD fan, but they also know I am not a techno-religious zealot. I pick the best, but there is no room for loyalty in high technology (well, on second thought, I was loyal to Sequent to the bitter end…oh well). So over the last couple of years, Steve and I have occasionally agreed to disagree about the state of affairs between Intel and AMD processor fitness for Oracle. Steve and I are starting to see eye to eye a lot more these days because I’m starting to smell the coffee as they say.

It’s All About The Core
When it comes to Oracle performance on industry standard servers, the only thing I can say is, “It’s the core, stupid”—in that familiar Clintonian style of course. Oracle licenses the database at the rate of .5 per core, rounded up. So a quad-core processor is licensed as 2 CPUs. Let’s look at some numbers.

Since AMD’s Quad-core promo video is based on TPC results, I think it is fair to go with them. TPC-C is not representative of what real applications do to a processor, but the workload does one thing really well—it exploits latency issues. For OLTP, memory latency is the most important performance characteristic. Since AMD’s material sets our expectations for some 70% improvement in OLTP over the Opteron 2200, we’ll look at TPC-C.

This published TPC-C result shows that the Opteron 2200 can perform 69,846 TpmC per processor. If the AMD quad-core promotional video proves right, the Barcelona processor will come in at approximately 118,739 TpmC per processor (a 70% improvement).

Since a quad-core AMD is licensed by Oracle as 2 CPUs, it looks like Barcelona will be capable of 59,370 TpmC per Oracle license. Therein lies the rub, as they say. There are a couple of audited TPC-C results with the Intel “Tulsa” processor (a.k.a. Xeon 7140, 7150), such as this IBM System x result, that show this current high-end Xeon processor is capable of some 82,771 TpmC per processor. Since the Xeon 71[45]0 is a dual-core processor, the Oracle-license price factor is 82,771 TpmC per Oracle license. If these numbers hold any water, some 9 months from now when Barcelona ships, we’ll see a processor that is 28% less price-performant from a strict Oracle licensing standpoint. My fear is that it will be worse than that because Barcelona is socket-compatible with Socket F systems—such as the Opteron 2200. I’ve been at this stuff for a while and I cannot imagine the same chipset having enough headroom to feed a processor capable of 70% more throughput. Also, Intel will not stand still. I am comparing current Xeon to future Barcelona.
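For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes nothing beyond the figures quoted above: a 0.5-per-core license factor rounded up, 69,846 TpmC per Opteron 2200, a 70% uplift for Barcelona, and 82,771 TpmC per Xeon 7150. The function and variable names are my own, purely for illustration, and because the inputs are already rounded the output can differ by a unit or so from the numbers in the text.

```python
import math

def oracle_licenses(cores, core_factor=0.5):
    """Licenses needed for one processor: cores x factor, rounded up."""
    return math.ceil(cores * core_factor)

def tpmc_per_license(tpmc_per_processor, cores):
    """Throughput per Oracle license for a single processor."""
    return tpmc_per_processor / oracle_licenses(cores)

# Figures quoted in the post (per processor).
OPTERON_2200_TPMC = 69_846
BARCELONA_UPLIFT = 0.70      # the marketing claim, not a measured result
XEON_7150_TPMC = 82_771

# Projected Barcelona throughput if the 70% claim holds.
barcelona_tpmc = OPTERON_2200_TPMC * (1 + BARCELONA_UPLIFT)

barcelona_per_license = tpmc_per_license(barcelona_tpmc, cores=4)  # quad-core
xeon_per_license = tpmc_per_license(XEON_7150_TPMC, cores=2)       # dual-core

# How far Barcelona trails Tulsa per license, as a fraction.
deficit = 1 - barcelona_per_license / xeon_per_license

print(f"Barcelona per processor: {barcelona_tpmc:,.0f} TpmC")
print(f"Barcelona per license:   {barcelona_per_license:,.0f} TpmC")
print(f"Xeon 7150 per license:   {xeon_per_license:,.0f} TpmC")
print(f"Barcelona deficit:       {deficit:.0%}")
```

The point of splitting out `oracle_licenses` is that the whole comparison hinges on the rounding: four cores cost two licenses while two cores cost one, so Barcelona's per-processor gain gets halved before it is compared with Tulsa.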

A Word About TPC-C Analysis
I admit it! I routinely compare TPC-C results on the same processor using results achieved by different databases. For instance, in this post, I use a DB2/SLES on IBM System x result to make a point about the Xeon 7150 (“Tulsa”) processor. E-gad, how can I do that with a clear conscience? Well, think about it this way. If DB2 on IBM System x running SuSE can achieve 82,771 TpmC per Xeon 7150 and this HP result shows us that SQL Server 2005 on ProLiant ML570G4 (Xeon 7140) can do 79,601 TpmC per CPU, you have to at least believe Oracle would do as well. There are no numbers anywhere that suggest Oracle is head and shoulders above either of these two software configurations on identical hardware. We can only guess because Oracle seems to be doing TPC-C with Itanium exclusively these days. I think that is a bummer, but Steve Shaw likes it (he works for Intel)!

What Does NUMA Have To Do With It?
Uh, Opteron/HyperTransport systems are NUMA systems. I haven’t blogged much about that yet, but I will. I know a bit about Oracle on NUMA—a huge bit.

I hope you’ll stay tuned because we’ll be looking at real numbers.


I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.


All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.
