Oracle on Opteron with Linux–The NUMA Angle (Part I)

There are Horrible Definitions of NUMA Out There on the Web
I want to start blogging about NUMA with regard to Oracle because NUMA has reached the commodity hardware scene with Opteron and Hypertransport technology Yes, I know Opteron has been available for a long time, but it wasn’t until the Linux 2.6 Kernel that there were legitimate claims of the OS being NUMA-aware. Before I can start blogging about NUMA/Oracle on Opteron related topics, I need to lay down some groundwork.

First, I’ll just come out and say it, I know NUMA—really, really well. I spent the latter half of the 1990’s inside the Sequent Port of Oracle working out NUMA-optimizations to exploit Sequent NUMA-Q 2000—the first commercially available NUMA system. Yes, Data General, SGI and Digital were soon to follow with AViiON, Origin 2000 and the AlphaServer GS320 respectively. The first port of Oracle to have code within the kernel specifically exploiting NUMA architecture was the Sequent port of Oracle8i.

Glossary
I’d like to offer a couple of quick definitions. The only NUMA that matters where Oracle is concerned is Cache Coherent NUMA (a.k.a CC-NUMA):

NUMA – A microprocessor-based computer system architecture comprised of compute nodes that possess processors and memory and usually disk/network I/O cards. A CC-NUMA system has specialized hardware that presents all the varying memory components as a single memory image to the processors. This has historically been accomplished with crossbar, switch or SCI ring technologies. In the case of Opteron, NUMA is built into the processor since each processor has an on-die memory controller. Understanding how a memory reference is satisfied in a NUMA system is the most important aspect of understanding NUMA. Each memory address referenced by the processors in a NUMA system is essentially “snooped” by the “NUMA memory controller” which in turn determines if the memory is local to the processor or remote. If remote, the NUMA “engine” must perform a fetch of the memory and install it into the requesting processor cache (which cache depends on the implementation although most have historically implemented an L3 cache for this remote-memory “staging”). The NUMA “engine” has to be keenly tuned to the processor’s capabilities since all memory related operations have to be supported including cache line invalidations and so forth. Implementations have varied wildly since the early 1990s. There have been NUMA systems that were comprised of complete systems linked by a NUMA engine. One such example was the Sequent NUMA-Q 2000 which was built on commodity Intel-based Pentium systems “chained” together by a very specialized piece of hardware that attached directly to each system bus. That specialized hardware was the called the Lynx Card which had an OBIC (Orion Bus Interface Controller) and a SCLIC (SCI Line Interface Controller) as well as 128MB L3 remote cache. On the Lynx card was a 510-pin GaAs ASIC that served as the “data pump” of the NUMA “engine”. These commodity NUMA “building blocks” were called “Quads” because they had 4 processors, local memory, local network and disk I/O adaptors—a lot of them. Digital referred to their physical building blocks as QBB (Quad Building Blocks) and logically (in their API for instance) as“RAD”s for Resource Affinity Domains. In the case of Opteron, each processor is considered a “node” with only CPU and memory locality. With Opteron, network and disk I/O are uniform.

NUMA Aware – This term applies to software. NUMA-aware software is optimized for NUMA such that the topology is understood and runtime decisions can be made such as what segment of memory to allocate from or what adaptor to perform I/O through. The latter, of course, not applying to Opteron. NUMA awareness starts in the kernel and with a NUMA API, applications too can be made NUMA aware. The Linux 2.6 Kernel had NUMA awareness built into the kernel—to a certain extent and there has been a NUMA API available for just as long. Is the Kernel fully NUMA-optimized? Not by any stretch of the imagination. Is the API complete? No. Does that mean the Linux NUMA-related technology is worthless? That is what I intend to blog about.

Some of the good engineers that build NUMA-awareness into the Sequent NUMA-Q operating system—DYNIX/ptx—have contributed NUMA awareness to Linux through their work in the IBM Linux Technology Center. That is a good thing.

This thread on Opteron and Linux NUMA is going to be very Oracle-centric and will come out as a series of installments. But first, a trip down memory lane.

The NUMA Stink
In the year 2000, Sun was finishing a very anti-NUMA campaign. I remember vividly the job interview I had with Sun’s Performance, Availability and Architecture Engineering (PAE) Group lead by Ganesh Ramamurthy. Those were really good guys, I enjoyed the interview and I think I even regretted turning down their offer so I could instead work in the Veritas Database Editions Group on the Oracle Disk Manager Library. One of the prevailing themes during that interview was how hush, hush, wink, wink they were about using the term NUMA to describe forthcoming systems such as StarCat. That attitude even showed in the following Business Review Online article where the VP Enterprise Systems at Sun in that time frame stated:

“We don’t think of the StarCat as a NUMA or COMA server,” he said. “This server has SMP latencies, and it is just a bigger, badder Starfire.”

No, it most certainly isn’t a COMA (although it did implement a few of the aspects of COMA) and it most certainly has always been a NUMA. Oops, I forgot to define COMA…next entry…and, oh, Opteron has made saying NUMA cool again!

6 Responses to “Oracle on Opteron with Linux–The NUMA Angle (Part I)”

Feed for this Entry Trackback Address

1 Alex Gorbachev January 18, 2007 at 2:53 am

Nice start Kevin. Thanks. Keep it going!

2 Mario January 18, 2007 at 3:07 am

Hi Kevin,

I can hardly wait to read your insights on Oracle NUMA..
I guess you have read Oracle’s ‘insights’ on NUMA too:
Metalink Doc ID 169539.1
good luck!

3 kevinclosson January 18, 2007 at 4:02 am

Mario,

I’ve never seen MetaLink 169539.1. What a steaming pile of dumg that things is! Good Lord. What happened to peer-review? It is incorrect and unreadable. I wish I could cut and paste it here to show how pathetic is is…but I can’t .

Alex, thanks.

4 Mark J. Bobak January 18, 2007 at 2:23 pm

Wow! That is one of the worst written (perhaps THE worst written) Oracle documents I’ve ever seen….

Makes me wonder if the author was drunk at the time…;-)

-Mark

5 kevinclosson January 18, 2007 at 3:07 pm

Well, the big problem with MetaLink 169539.1 is it is completely wrong. It rambles on about shared memory as if that is the only thing that needs NUMA awareness. Process page tables, Text, Data, Stack also impact performance (especially page tables and text) if they are remote form the executing process. Oh well, enough pounding on yet another bad NUMA-related article…

1 NUMA, es gran desconocido Trackback on May 20, 2008 at 11:06 pm

	David Zheng on Announcing pgio (The SLOB Meth…
	Oracle redo log perf… on File Systems For A Database? C…
	Oracle redo log perf… on Yes, File Systems Still Need T…
	kevinclosson on Announcing SLOB 2.5.4
	pgio nutzen? - I/O W… on So pgio Does Not Accurately Re…

Kevin Closson's Blog: Platforms, Databases and Storage