Archive Page 16

Recorded Webcast Available: Exadata Storage Server Technical Deep Dive – Part IV.

This is just a quick blog entry to point out that I updated my “Papers, etc” section with a link to the recorded Exadata Storage Server Technical Deep Dive – Part IV webcast.

Oracle-Enhancing Solaris Features. Memory Lane.

Reinventing Inventions. Deja Vu.

My old friend Glenn Fawcett sent me a link to a list of historical key Solaris platform enhancements for Oracle. After thanking him for that I (no surprise) felt compelled to point out which items on that list had been implemented in Sequent DYNIX/ptx on average 3 years prior to being implemented in Solaris. 🙂  Although neither of us said it, we were both likely thinking how interesting it is that many of the items on the list emerged in Linux on average several years after the Solaris rendition hit the streets.

Glenn successfully completed his ex-Sequent 12-step program many years ago. I was not as successful it seems.

Staging Data For ETL/ELT? Flat Files Appear Magically! No, Load Time Starts With Transfer Time.

In my recent post entitled Something to Ponder? What Sort of Powerful Offering Could a Filesystem in Userspace Be?, I threw out what may have seemed to be a totally hypothetical Teaser Post™. However, as our HP Oracle Exadata Storage Server and HP Oracle Database Machine customers know from their latest software upgrade, a FUSE-based, Oracle-backed file system is a reality. It is called Oracle Database File System (DBFS). DBFS is one of the cornerstones of the data loading infrastructure in the HP Oracle Database Machine environment. For the time being it is a staging area for flat files to be accessed as external tables. The “back-end”, as it were, is not totally new. See, DBFS is built upon Oracle SecureFiles. FUSE is the presentation layer that makes for mountable file systems. Mixing FUSE with DBFS results in a distributed, coherent file system that scales thanks to Real Application Clusters. This is a file system that is completely managed by Database Administrators.
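
For anyone who wants to get a feel for it, here is a minimal sketch of how such a mount is typically established with the dbfs_client utility. The database service, DBFS user, password file and mount point below are hypothetical placeholders (not my actual configuration), and mount options vary by release, so consult the DBFS documentation before trying this:

$ # Hypothetical names throughout. dbfs_user owns the DBFS repository in
$ # database service DBM; dbfs_client reads the password on stdin and
$ # presents the SecureFiles store as a FUSE file system on /data.
$ nohup dbfs_client dbfs_user@DBM -o allow_other,direct_io /data < passwd.txt &

Once mounted, ordinary tools such as ls, cp and md5sum just work, as the boxes below show.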

So, I’m sure some folks’ eyes are rolling back in their heads wondering why we need YAFS (Yet Another File System). Well, as time progresses I think Oracle enthusiasts will come to see just how feature rich something like DBFS really is.

If it performs, is feature rich and incremental to existing technology, it sounds awfully good to me!

I’ll be discussing DBFS in my Oracle Exadata Technical Deep Dive – Part IV session tomorrow.

Here is a quick snippet of DBFS in action. In the following box you’ll see a DBFS mount of type FUSE on /data and a listing of a file called all_card_trans.ul:


$ mount | grep fuse
dbfs on /data type fuse (rw,nosuid,nodev,max_read=1048576,default_permissions,allow_other,user=oracle)
$ pwd
/data/FS1/stage1
$ ls -l all_card_trans.ul
-rw-r--r-- 1 oracle dba 34034910300 Jun 15 15:30 all_card_trans.ul

In the next box you’ll see ssh fanning out to 4 of the servers in an HP Oracle Database Machine, each producing md5sum output against the DBFS file system to validate that they all see the same file.


$ for n in r1 r2 r3 r4
> do
> ssh $n md5sum `pwd`/all_card_trans.ul &
> done
[5] 3943
[6] 3945
[7] 3946
[8] 3947
$ 1adbff1a36a42253c453c22dd031b48b  /data/FS1/stage1/all_card_trans.ul
1adbff1a36a42253c453c22dd031b48b  /data/FS1/stage1/all_card_trans.ul
1adbff1a36a42253c453c22dd031b48b  /data/FS1/stage1/all_card_trans.ul
1adbff1a36a42253c453c22dd031b48b  /data/FS1/stage1/all_card_trans.ul
[5]   Done                    ssh $n md5sum `pwd`/all_card_trans.ul
[6]   Done                    ssh $n md5sum `pwd`/all_card_trans.ul
[7]   Done                    ssh $n md5sum `pwd`/all_card_trans.ul
[8]   Done                    ssh $n md5sum `pwd`/all_card_trans.ul

In the next box you’ll see concurrent multi-node throughput. I’ll use one dd process on each of 4 servers in the HP Oracle Database Machine, each sequentially reading the contents of the same DBFS-based file, achieving 876 MB/s aggregate throughput. And, no, there is no cache involved.


$ for n in r1 r2 r3 r4; do ssh $n time dd if=`pwd`/all_card_trans.ul of=/dev/null bs=1M & done
[5] 13325
[6] 13326
[7] 13327
[8] 13328
$ 32458+1 records in
32458+1 records out
34034910300 bytes (34 GB) copied, 154.117 seconds, 221 MB/s

real    2m34.127s
user    0m0.014s
sys     0m3.073s
32458+1 records in
32458+1 records out
34034910300 bytes (34 GB) copied, 155.113 seconds, 219 MB/s

real    2m35.123s
user    0m0.020s
sys     0m3.127s
32458+1 records in
32458+1 records out
34034910300 bytes (34 GB) copied, 155.813 seconds, 218 MB/s

real    2m35.821s
user    0m0.026s
sys     0m3.210s
32458+1 records in
32458+1 records out
34034910300 bytes (34 GB) copied, 155.89 seconds, 218 MB/s

real    2m35.901s
user    0m0.017s
sys     0m3.039s

With Exadata in mind, the idea is to offer a comprehensive solution for data warehousing. All too often I see data loading claims that start with the flat files sort of magically appearing, ready to be loaded. Oh no, we don’t think that way. The data is out on a provider system somewhere and has to be staged in advance of ETL/ELT. Since DBFS exploits the insane bandwidth of Exadata, it is an extremely good data staging solution. The data has to be rapidly ingested into the staging area and then rapidly loaded. A bottleneck in either half of that equation becomes your weakest link.
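
Measuring the transfer half of the equation in isolation is trivial and worth doing before any load-rate testing. Here is a hedged sketch; the provider hostname and source path are made-up placeholders:

$ # Hypothetical example: time just the ingest of the 34 GB flat file from
$ # a provider host into the DBFS staging area. No loading, just transfer.
$ time scp provider1:/export/feeds/all_card_trans.ul /data/FS1/stage1/

Only once that ingest cost is known does an end-to-end data loading claim mean anything.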

Just think, no external systems required for data staging. No additional storage connectivity, administration, tuning, etc.

And, yes, it can do more than a single dd process on each node! Much more.

Exciting stuff.

Webcast Announcement Clarification. Exadata Technical Deep Dive Part IV.

I just noticed that my announcement for the upcoming Part IV in my Exadata Technical Deep Dive series was missing the date and time. So, here it is:

Thursday, June 18, 2009 12:00 PM – 1:00 PM CDT

Oracle Data Warehouse Performance Issues? Solve It The Old-Fashioned Way With A Third-Party Accelerator!

I read Curt Monash’s report on the current state of affairs at Dataupia and it got me thinking. I agree with Curt on his position toward add-on or external accelerator-type technology. See, one of Dataupia’s value propositions was to accelerate I/O for Oracle Database External Tables. To the best of my knowledge it basically offered high-bandwidth, cached flat-file access.

About this time last year I produced a bit of a tongue-in-cheek post about Dataupia. A blog reader posted the following comment on that thread:

… This product is very, very real.

It works as an external table in Oracle so it’s transparent to all your BI tools. They do a lot of work with SQL that Oracle passes to make it usable.

You have to re-point your ETL loads at Dataupia directly but they should run with very little alteration.

Speed is 10x Oracle at these volumes (2Tb+).

As most folks know I was deep into HP Oracle Exadata Storage Server performance work at that time and couldn’t really go toe-to-toe with any of the DW/BI appliance or accelerator folks. Oracle had not yet released Exadata. The idea of accelerating Oracle ten-fold is certainly no longer all that avant-garde given the proven acceleration Oracle Exadata Storage Server provides.

What I wanted to point out at the time is that accelerating the loading of an Oracle Data Warehouse is indeed important, but surely not sufficiently critical to warrant bringing in another vendor and working out all the plumbing. I had a suspicion then that the blog reader who posted that comment was not fully aware that the value proposition supposedly went beyond accelerating ETL to offering run-time access to the flat files they housed in their Satori server. Yes, running queries against External Tables just because they offer a lot of cache and a lot of I/O bandwidth. At least that is what I got from reading their datasheet.

Erroneously Accelerating Accelerates What?
The problem with that story is that query throughput from External Tables is very seldom an I/O issue. See, scanning External Tables requires conversion from ASCII flat-file text to Oracle data types on the fly. To that end, scanning External Tables is a CPU-intensive task. For instance, if you load data from an External Table into a data warehouse (internal, true) table and then compare scan throughput of the two, you’ll see that processor saturation impedes the External Table scan. Same-query comparisons commonly show 80% lower throughput when accessing External Tables compared to internal tables, and I’m not talking about an I/O-hobbled External Table comparison. The exact figure, of course, depends on the processor bandwidth available to such a test: the less processor bandwidth available, the more the results skew toward the internal table. What I’m trying to say is that if you accelerate External Table I/O, say, 10x, you need as much as 10x more processor bandwidth to handle it. So, sure, if you take a totally I/O-bound query and apply this sort of External Table acceleration technique, you will see a significant performance increase. Conversely, a host processor-bound situation will not benefit from this sort of accelerator. Architecture…it’s important.
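
If you want to see the processor cost for yourself, an experiment along the following lines will do. This is a hedged sketch rather than output from my rig; the table names are hypothetical, with ext_card_trans being an External Table over a flat file and card_trans the same data loaded into an internal table:

$ # Sample processor utilization while scanning each table. Expect the
$ # External Table scan to saturate CPU long before it saturates I/O.
$ mpstat 5 > cpu_samples.out 2>&1 &
$ sqlplus -s scott/tiger <<EOF
set timing on
select count(*) from ext_card_trans;
select count(*) from card_trans;
exit
EOF
$ kill %1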

Tacking on accelerators is just not a reasonable approach. I recall a lot of hoopla back in about 2005 or so about another one of these sorts of external accelerator offerings—Xprime. I don’t hear much about them any more, other than perhaps bits and pieces about intellectual property infringement claims against DATAllegro (Microsoft).

I’m no stranger to the external acceleration game, but I have generally steered clear of such approaches. I have always leaned toward a more native approach: offer a better platform, not an external platform. About the same time Xprime was garnering quite a bit of interest, we at my former company, PolyServe, were putting the final touches on product infrastructure that offered scale-out reporting using clustering technology. Of course the story had all the common tag words such as transparent, seamless, scalable, etc. Unusually, however, the claims were true. But, no matter. Nobody cared. I sure thought people would have clamored for up to a 16-fold throughput increase for processor-intensive reporting jobs. Oh well…memories. It was an interesting project to work on though, as this old paper I wrote suggests.

So What Does This Have To Do With Exadata?
It’s probably high time people stopped getting venture capital to “solve” a “problem” that Oracle Database supposedly has with data warehouse workloads.

Oracle Exadata Storage Server Technical Deep Dive – Part IV

BLOG UPDATE (18-JUN-2009): Links to the recorded webcasts can be found in my Papers, etc section. The original blog post follows:

We’ve set the date for Part IV. As an aside, I’m sorry to report that the recorded webcast of Part III is still not available from IOUG.

I recommend that folks view Part I as a minimum prerequisite for Part IV. I won’t spend any time in review. You can access Part I and Part II here.

Announcing Part IV:

Kevin Closson will continue his “Technical Deep Dive” series in Part IV by covering:

– Loading the Data Warehouse in an Exadata Environment
* The Data Staging Model
* A Data Loading Performance Study

Space is limited.
Reserve your Webinar seat now at:

https://www1.gotomeeting.com/register/473487440

World-Record TPC-H Results Require World-Record Floor Space?

This may be one of the quickest follow-ups to one of my own posts. I just saw an IBM blogger’s tongue-in-cheek post about the World-Record Oracle Database 11g TPC-H result. The post reads:

Yesterday, for the very first time, I went to Costco.

Now for those of you who live on Mars, Costco is one of those big warehouse type member only stores. You can get almost anything there but can never be sure exactly what will be there. I ended up with two of the biggest cans of tuna I have ever seen, a jar of Kalamata olives as big as a fishbowl, and enough toilet paper for a year.

Which reminded me of yesterday’s new TPC-H BI result from HP. HP now leads in the 1000GB space here – by using 64 servers with 512 cores. And 6 humongously specialized storage devices.

It’s fun to think about the floor space, energy, and resources to manage that infrastructure. At least the toilet paper can go in the basement.

I got a chuckle out of that post and would have just commented on the blog, but there was some sort of login credential required to comment.

So What is My Comment?
Well, according to the blog header, the blogger who posted this humorous bit is Chief Technical Strategist, Performance Marketing for the IBM Systems and Technology Group. Since a professional holding such a position seems to have missed a couple of critical points, I thought I’d point them out.

The blogger referred to the 6 HP Oracle Exadata Storage Server cells as “humongously specialized storage devices.” Yes, Exadata is humongously, enormously, gigantically, immensely, vastly, colossally specialized. However, the blogger moved on to insinuate there would be floor space issues with such a beast.

In case anyone else missed the point, this was four 10U HP BladeSystem C7000 enclosures. That’s 40U. The humongously specialized, but minimally sized, storage devices were six 2U HP Oracle Exadata Storage Servers. Sure, there were a couple of switches and some other such supporting gear, but, honestly, is it that “fun to think about” the floor space required for 52U worth of kit?  🙂

Fun with Intel Xeon 5500 Nehalem and Linux cpuspeed(8). Part III.

I recently received email from a reader who wondered why Parts I and II of my series on Intel Xeon 5500 “Nehalem” cpuspeed(8) were based on NUMA-disabled mode (SUMA/SUMO system) testing. The series the reader referred to can be found at the following links:

Fun With Intel Xeon 5500 Nehalem and Linux cpuspeed(8) Part I.

Fun With Intel Xeon 5500 Nehalem and Linux cpuspeed(8). Part II.

The reader is correct. Thus far in the series I’ve been sharing some findings (trivia?) from a test system with NUMA disabled at the BIOS level. For reference, you can see more about the concept of disabling NUMA with commodity NUMA systems in this post. As an aside, running a Commodity NUMA Implementation (CNI) system (e.g., Xeon 5500 Nehalem) with NUMA disabled in the BIOS is also referred to as a SUMA or SUMO configuration.

A Look at cpuspeed(8) and NUMA
In this blog entry I’ll show some findings based on the busy.sh script (to stress varying processor threads) and analysis of how cpuspeed(8) reacts using the howfast.sh script. But first, recall from Part II of this series where I said:

Hammering all the primary threads heats up only OS cpus 0,2,4,6,8,10,12 and 14 but hammering on all the secondary threads causes all processor threads to clock up.

That was indeed an odd thing to observe and I have not yet started to investigate why it is that way since I’m still in somewhat of a discovery phase. Let’s see how the processors respond under the same conditions with NUMA enabled in the BIOS. But first, I’ll do a quick check to make sure it is a NUMA system, not a SUMA/SUMO system. I’ll use numactl(8) to make sure I have two NUMA nodes in this HP Proliant server with Intel Xeon 5500 “Nehalem” processors:


# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 8052 MB
node 0 free: 3683 MB
node 1 size: 8080 MB
node 1 free: 3664 MB
node distances:

node   0   1
0:  10  20
1:  20  10

Good, it is a NUMA system. In the following box I’ll show how the processors respond to two different experiments. Before I show any test results, though, I need to point out that I’ve changed the howfast.sh script so that it takes an argument and compares the current processor speeds against the value supplied in the argument. If no argument is provided, the script just lists a single line of output with all the processors’ current clock rates. This change was necessary to avoid having to peruse the output of the script to validate the speeds prior to an experiment.
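
Since I haven’t posted the scripts themselves, here is a rough reconstruction of what howfast.sh and busy.sh plausibly look like, based solely on the behavior described in this series. These are hedged sketches, not the actual scripts:

#!/bin/bash
# howfast.sh (sketch). With no argument, print one line of
# "cpu MHz" pairs gleaned from /proc/cpuinfo. With an argument,
# fail on the first cpu whose clock rate does not match it.
if [ -z "$1" ]
then
        n=0
        for mhz in $(grep "cpu MHz" /proc/cpuinfo | awk '{ print $4 }')
        do
                echo -n "$n $mhz "
                n=$(( n + 1 ))
        done
        echo
        exit 0
fi
n=0
for mhz in $(grep "cpu MHz" /proc/cpuinfo | awk '{ print $4 }')
do
        if [ "${mhz%%.*}" -ne "$1" ]
        then
                echo "Check Failed: CPU $n is $mhz"
                exit 1
        fi
        n=$(( n + 1 ))
done
exit 0

#!/bin/bash
# busy.sh (sketch). Pin one spin loop to each processor thread named
# in the single quoted argument (e.g., '8 9 10 11') via taskset(1).
for c in $1
do
        taskset -c $c bash -c 'i=0; while [ $i -lt 20000000 ]; do : ; i=$(( i + 1 )); done' &
done
wait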

The following box shows the new script behavior. I first use the script with an argument of 1600 and so long as all the cpus are currently clocked at 1600 MHz, the script returns success and the shell moves on to execute busy.sh. As expected, after busy.sh executed, howfast.sh stumbles on a cpu that is not clocked at 1600 and fails.


# ./howfast.sh 1600 && ./busy.sh 1;./howfast.sh 1600
Check Failed: CPU 0 is 2934.00

NUMA Experiments
First, I’ll stress the primary thread of core 0. Next, I’ll stress the primary thread of core 1. Both cores are in socket 0:

# ./howfast.sh 1600 && ./busy.sh 0; ./howfast.sh
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
#
# ./howfast.sh 1600 && ./busy.sh 1; ./howfast.sh
0 1600.000 1 2934.000 2 1600.000 3 2934.000 4 1600.000 5 2934.000 6 1600.000 7 2934.000 8 1600.000 9 2934.000 10 1600.000 11 2934.000 12 1600.000 13 2934.000 14 1600.000 15 2934.000

That output should look familiar to the six or so folks following this series because it is exactly how the processors behave when the system is booted as a SUMA/SUMO system. In Part II of this series I made the following observation:

Running dumb.c on core 0 speeds up OS CPU 0 and every even-numbered processor thread in the box. Conversely, stressing core 1 causes the clock rate on all odd-numbered processor threads to increase.

Let’s see what happens when I hammer multiple processor threads as I did in Part II.


# ./howfast.sh 1600 && ./busy.sh '0 1 2 3 4 5 6 7';./howfast.sh

0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000

# ./howfast.sh 1600 && ./busy.sh '8 9 10 11 12 13 14 15';./howfast.sh
0 2934.000 1 2934.000 2 2934.000 3 2934.000 4 2934.000 5 2934.000 6 2934.000 7 2934.000 8 2934.000 9 2934.000 10 2934.000 11 2934.000 12 2934.000 13 2934.000 14 2934.000 15 2934.000

Déjà Vu

Here, as in the SUMA case, stressing the primary processor threads in both sockets causes only certain processor threads to clock up. On the other hand, as was also the case with SUMA, stressing the secondary processor threads of both sockets speeds up all processor threads. So, at least this much is consistent between the NUMA and SUMA tests. But what about a series of these tests with a cool down period in the loop?

In the following box I’ll show the effect of looping the busy.sh script in the same fashion as I did in Part II (SUMA). In each iteration, I’ll stress the secondary processor threads of both sockets. As you’ll see, the results are similar to the SUMA behavior except for the frequency of tests that resulted in all processors speeding up. In the SUMA case it was 50% but in the NUMA case it is only 40%:

#
# for t in 1 2 3 4 5 6 7 8 9 10; do ./howfast.sh 1600 && ./busy.sh '8 9 10 11 12 13 14 15' ;./howfast.sh;sleep 30; done
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
0 2934.000 1 2934.000 2 2934.000 3 2934.000 4 2934.000 5 2934.000 6 2934.000 7 2934.000 8 2934.000 9 2934.000 10 2934.000 11 2934.000 12 2934.000 13 2934.000 14 2934.000 15 2934.000
0 2934.000 1 2934.000 2 2934.000 3 2934.000 4 2934.000 5 2934.000 6 2934.000 7 2934.000 8 2934.000 9 2934.000 10 2934.000 11 2934.000 12 2934.000 13 2934.000 14 2934.000 15 2934.000
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
0 2934.000 1 2934.000 2 2934.000 3 2934.000 4 2934.000 5 2934.000 6 2934.000 7 2934.000 8 2934.000 9 2934.000 10 2934.000 11 2934.000 12 2934.000 13 2934.000 14 2934.000 15 2934.000
0 2934.000 1 2934.000 2 2934.000 3 2934.000 4 2934.000 5 2934.000 6 2934.000 7 2934.000 8 2934.000 9 2934.000 10 2934.000 11 2934.000 12 2934.000 13 2934.000 14 2934.000 15 2934.000
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000

So here we are at Part III and thus far the sum value of all this information is:

  • cpuspeed(8) acts unpredictably on Xeon 5500 “Nehalem” processors
  • cpuspeed(8) acts differently on Xeon 5500 “Nehalem” processors in NUMA mode compared to SUMA mode.
  • processors cool down quickly after being clocked up

Someone, someday, will likely be scratching their head and googling to see if anyone else is seeing odd processor frequency issues with the Xeon 5500 “Nehalem” processors. If nothing else, this series of blog posts will at least let said googler know that they are not alone in what they are seeing.

Xeon 5500 “Nehalem” Most Certainly Is Not Risky!

BLOG UPDATE: I just found out that the link to the Computerworld article was broken. I fixed it.

I’ve learned quite a bit today about risk mitigation and processor architectures from this Computerworld article about the next turn of Xeon 5500 “Nehalem” processors—the EX. The EX is an 8-core processor with 16 threads. That’s all true, but that isn’t what I’m blogging about.

Risky RISC
The article quotes Intel’s Boyd Davis as saying:

[…] with the launch of Nehalem-EX, the intent is to move away from costly, risk-based proprietary RISC (reduced instruction set computing)-processor based systems

And then again:

“This is going after a market that was limited to being served by risk architecture,” Davis said. “We think Nehalem-EX will represent a pretty significant opportunity on the overall server and hardware market.”

That’s funny. But then the article also points out that 16/6 == 2.7 … that’s funny too. Or is that Nehalem_EX / Dunnington == 2.7 ?  🙂

All that aside, Nehalem is an outrageous processor!

Now that was a cheezy blog entry.

Little Things Doth Crabby Make – Part VII. NUMA Terminology, or Bust! Sad Manpages Too.

A recent email thread among The Oaktable Network members put me in the mood for another post in the Little Things Doth Crabby Make series. The topic at hand was this post on the MySQL Performance Blog about timing MySQL queries using clock_gettime(). In that post, the blogger showed the following pseudo code to help describe the problem he was seeing:


start_time = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tp);
...
query_execution
...
end_time = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tp);

total_time = end_time - start_time;

According to the post, the time rendered was 18446744073709550.000 which, of course, on a 2.6 GHz processor would be 82 days. What the blogger likely didn’t know is that when called with this argument the clock_gettime() routine uses the CPU time stamp counter (rdtsc). As soon as I saw 18.4 quadrillion (or should I say billiard) I knew this was a clock wrap issue. But, to be honest, I had to look at the manpage to see what CLOCK_THREAD_CPUTIME_ID actually does. It turns out that for threaded (pthread) programs this call will use the processor time stamp counter. The idea of wrapping rdtsc in a function call seems bizarre to me but to each their own.

Comparing an x86 processor time stamp counter on one CPU against another CPU will result in bizarre arithmetic results. Well, of course, since the time stamp counters are local to the CPU (not synchronized across CPUs). I know a bit about this topic since I started using rdtsc() to time tight code back in the Pentium Pro days (circa 1996). And, yes, you have to lock down (hard processor affinity) the process using rdtsc() to one CPU. But that isn’t all. Actually, the most accurate high-resolution timing goes more like this:

  1. Hard Affinity me to CPU N
  2. Disable process preemption (only good operating systems support this)
  3. Serialize CPU with CPUID
  4. rdtsc()
  5. do whatever it is you are timing (better not be any blocking code or device drivers involved)
  6. rdtsc()
  7. Re-Enable process preemption
  8. Release from Hard Affinity (if desired)
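
Steps 1 and 2 can be approximated from the shell on Linux, though Linux offers no true user-level preemption disable (hence the jab in step 2); running SCHED_FIFO merely comes close. A hedged sketch, with ./tsc_timer standing in for whatever rdtsc()-based timing program you have written:

# Hard-affinity the (hypothetical) timing program to CPU 3 and run it
# SCHED_FIFO priority 50 so very little else can preempt it. Needs root.
# taskset -c 3 chrt -f 50 ./tsc_timer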

But all that is just trivial pursuit because I don’t think anyone should time a MySQL query (or any SQL query for that matter) with nanosecond resolution anyway. And, after all, that is not what I’m blogging about. This is supposed to be a Little Things Doth Crabby Make post. So what am I blogging about?

Some Linux Manpages Make Me Crabby
The latest Linux manpage to make me crabby is indeed the manpage for clock_gettime(2). I like how it insinuates a requirement for hard processor affinity, but take a look at the following paragraph from the manpage:

The CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks are realized on many platforms using timers from the CPUs (TSC on i386, AR.ITC on Itanium).  These registers may differ between CPUs and as a consequence these clocks may return bogus results if a process is migrated to another CPU.

Words Matter
Using the term migrated in this context is totally wrong, especially for NUMA-minded people. And, if you can’t tell by my blogging of late, I assert that we all are, or should be, and definitely will be, NUMA-minded folks now that Intel has entered the Commodity NUMA Implementation market with the insanely cool Xeon 5500 “Nehalem” processor and QPI.

The only time the term migrate can be used in the context of process scheduling is when a NUMA system is involved. The clock_gettime(2) manpage was merely referring to a process being scheduled on different CPUs during its life. Generically speaking, there is no migration involved in that. It is a simple context switch. Come to think of it, context switch is a term that routinely falls prey to this sort of misuse. Too often I see the term context switch used to refer to a process entering the kernel. That is not a context switch. A context switch is a scheduling term specifically meaning the stopping of a process, the saving of its state and the switching to the next selected runnable process. Now, having said that, the next time a stopped process (either voluntarily blocked or time-sliced off) is scheduled it could very well be on a different CPU. But that is not a migration.

Enter NUMA
A process migration is a NUMA-specific term related to the “re-homing” of a process’ memory from one NUMA “node” to another. Consider a process that is exec()ed on “node 0” of a NUMA system. The pages of its text, stack, heap, page tables and all other associated kernel structures will reside in node 0 memory. What if a system imbalance occurs such that the CPUs of node 1 are generally idle whereas CPUs of node 0 are generally saturated? Well, the scheduler can simply run the processes homed on node 0 on node 1 processors. That is called remote execution and one very important side effect of remote execution is that any memory resources required while doing so will have to be yanked from the remote memory and installed in the local cache. Historical NUMA systems (e.g. Pioneer, Proprietary NUMA Implementations) had specialized NUMA caches on each node to house memory lines being used during remote execution. The Sequent NUMAQ-2000, for instance, offered a 512 MB “remote cache.” In aggregate, that was 8 GB of remote cache on a system that supported a maximum of 64 GB RAM! CNI systems do not have specialized NUMA caches but instead a simple L3 cache that is generally quite small (e.g., 8 MB). I admit I have not done many tests to analyze remote execution versus migration on Xeon 5500 based systems. In general (as I point out in this post) extreme low latency and huge interconnect bandwidth (ala Xeon 5500) can mitigate a potentially undersized cache for remote lines, but the proof is only in the pudding of actual measurements. More on that soon I hope.
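
Incidentally, Linux ships a utility in the numactl package that performs exactly this sort of re-homing on demand: migratepages(8). A hedged example with a made-up PID:

# Re-home the memory of PID 12345 (hypothetical) from node 0 to node 1,
# then eyeball the per-node page counts (the N0=/N1= fields) afterwards.
# migratepages 12345 0 1
# head -3 /proc/12345/numa_maps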

What Was It That Made Him Crabby?
The use of the NUMA-sanctified term migrate in the clock_gettime(2) manpage! Seems too picky, doesn’t it? OK, since I’m discussing NUMA and trying to justify an installment in the Little Things Doth Crabby Make series, how about this from the numactl(8) manpage:


EXAMPLES
 numactl --interleave=all bigdatabase arguments Run big database with its memory interleaved on  all
 CPUs.

 numactl  --cpubind=0--membind=0,1 process Run process on node 0 with memory allocated on node 0 and
 1.

 numactl --preferred=1 numactl --show Set preferred node 1 and show the resulting state.

 numactl --interleave=all --shmkeyfile /tmp/shmkey Interleave all of the sysv shared memory  regiion
 specified by /tmp/shmkey over all nodes.

 numactl  --offset=1G  --length=1G --membind=1 --file /dev/shm/A --touch Bind the second gigabyte in
 the tmpfs file /dev/shm/A to node 1.

 numactl --localalloc /dev/shm/file Reset the policy for the shared memory file file to the  default
 localalloc policy.

Do you think “Run big database with its memory interleaved on all CPUs” or “Run process on node 0 with memory allocated on node 0 and 1” are arguments to the numactl command? No? Me neither. Sure looks like it in the examples section though. Not very tidy.

It doesn’t really make me crabby…this is just blogging.

Webcast Reminder. Oracle Exadata Storage Server Technical Deep Dive – Part III

Just a last-minute reminder:

Webcast Announcement: Oracle Exadata Storage Server Technical Deep Dive – Part III.

Blogroll Update: Jason Arneil’s Blog

I’ve been reading Jason Arneil’s blog for some time now and just took a moment to add it to my blogroll. I recommend you pay Jason’s blog a visit!

Whoa, I Know an Ace Director! Congrats to Tanel

This is just a quick blog entry to extend a well-deserved congratulations to Tanel Poder who has freshly donned his Ace Director vest!

You Buy a NUMA System, Oracle Says Disable NUMA! What Gives? Part II.

BLOG UPDATE (19-JUN-2009): I need to point out that ML 759565.1 has been significantly revised. The message regarding testing before enabling NUMA persists. Not that it matters much, but I concur with that advice. The original post follows:

Cart Before Horse?
Yes, in this mini-series of posts Part II will precede Part I. I’ll explain…eventually.

In the comment thread of my recent blog post entitled Oracle Database 11g Automatic Memory Management – Part IV. Don’t Use PRE_PAGE_SGA, OK?, a reader asked if I’d comment on Metalink note 759565.1. The reader’s specific interest was in the late-breaking stance Oracle Support has taken regarding software NUMA optimizations. Just what is that support position? Run with all NUMA software optimizations disabled. Full stop.

I have been in contact with the Support Engineer who “owns” ML 759565.1 and have given some advice on how to change that note. I’ve been informed that a re-write is underway and that I will be on the review list for that revision. That’s good, but I still think this topic is worthy of your time (feeling especially presumptuous today, I guess).

The current rendition of the support note reads:

Oracle Support recommends turning off NUMA at the OS level or at the database level. If NUMA is disabled at the OS level then it will also be disabled at Oracle level. It does not matter how NUMA is disabled as long as it is disabled either at the OS layer or the Database layer.

What’s that? Run a NUMA system with all software optimizations disabled? Gasp! But wait, given how much I’ve rambled on about NUMA and NUMA software optimization principles over the years you will surely be flabbergasted to find out that, in principle, I agree with this stance.

What Is It Exactly That I Agree With?
Herein lies the rub. I have to offer a lengthy diatribe about NUMA in order to explain what I agree with. Not all NUMA systems are created equal. NUMA systems fall into one of three camps:

  • Pioneer, Proprietary NUMA Implementations (PPNI).
  • Modern, Proprietary NUMA Implementations (MPNI).
  • Commodity NUMA Implementations (CNI).

 

Different NUMA Implementations == Differences!
Let’s see if I can make some sense of these differences. And, trust me, this does relate to ML 759565.1.

1)      Pioneer, Proprietary NUMA Implementations (PPNI). The first commercial cache-coherent NUMA system was the Sequent NUMAQ-2000. Within a couple of years of that hardware release there were several other pioneer implementations brought to market by DG, DEC, SGI, Unisys and others. The implementation details of these pioneer NUMA systems varied hugely (e.g., interconnect technology, levels of OS NUMA awareness, etc). One thing these pioneer implementations all shared in common was the fact that they suffered huge ratios between local and remote memory latency. When I say huge, I’m talking as much as 50 to 1 for highly contended multiple-hop remote memory. The only reason these pioneer systems were brought to market was because they offered tremendous advancements in system bandwidth. The cost, however,  was lumpy memory and thus software NUMA-awareness was of utmost importance.  I would consider systems like the Sun E25K to be “second-generation” pioneer systems. Sure, the E25K suffered memory locality costs, but not as badly as the true pioneer systems. Few would argue that even the “second-generation” pioneer systems relied heavily on software NUMA-awareness.

2)      Modern, Proprietary NUMA Implementations (MPNI). I’m not going to cite many systems here as cases in point. I don’t aim to wound the tender sensibilities of any hardware supplier. I can define what I mean by MPNI by simply stating that MPNI systems differ from PPNI in terms of remote to local memory latency ratios. In short, MPNI systems have very favorable L:R latency ratios. By very favorable, I mean significantly less than 2 to 1. An example of an MPNI system would be the Sun SPARC Enterprise M9000 Server which, according to my good friend Glenn Fawcett, sports an approximate local to remote latency ratio of 1.3:1. In my opinion, it is not worth the complexities necessary to do proper, effective software  NUMA awareness when there is only 30% disparity between the local and remote memory (at least not Oracle NUMA-awareness). Now, having said that, I know the M9000 supports scaling to multiple cabinets. I don’t know enough about the crossbar (Juniper Interconnect) to say whether it requires any “hop” overhead in a multiple-cabinet configuration. Sun literature states point-to-point without caveats so the L:R ratio might remain constant as one adds cabinets. Nonetheless, the point being made here is that there exist today modern, proprietary NUMA implementations and concerns over Oracle NUMA awareness should be weighed according to MPNI capabilities—not arcane, PPNI capabilities.

3)      Commodity NUMA Implementations (CNI). I don’t feel compelled to hide my exuberance for modern NUMA implementations such as Intel QuickPath Interconnect and the HyperTransport (HT) used by AMD. The points I want to make about CNI are as follows:

  1. Memory Latency Ratios. While I’ve not stayed as up to speed on local-remote ratios with HT 3.0, I know that the Intel QPI-based systems offer very pleasant L:R ratios (e.g., 1.4:1 or better). More importantly, I should point out that even remote memory references in Nehalem-based Servers (Xeon 5500) are faster than all memory references in the previous generation Xeon-based systems (e.g., “Harpertown” Xeon 5400)!
  2. BIOS-Enabled NUMA. Commodity NUMA systems support the concept of boot-time NUMA enablement. When booted with NUMA disabled at the BIOS, the resultant memory architecture is commonly referred to as Sufficiently Uniform Memory Access (SUMA) or Sufficiently Uniform Memory Organization (SUMO).
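
As a quick aside, telling which way a CNI box was booted is easy from the OS. With NUMA enabled in the BIOS, numactl --hardware reports two nodes on a two-socket Xeon 5500 server (exactly as shown earlier on this page); booted SUMA/SUMO, it reports just one:

$ # One node reported means the BIOS interleaved memory (SUMA/SUMO);
$ # two nodes means the OS sees the real NUMA topology.
$ numactl --hardware | grep available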

Does All This Really Relate to ML 759565.1?
Yes. While I haven’t seen the re-write of that note, I’ll say what I think it needs to say:

  1. Regarding Commodity NUMA Implementations.
    1. If you are running Oracle on a CNI system you should test before you even bother enabling NUMA in the BIOS—not the other way around. If you deploy a CNI system, such as a two-socket Intel Xeon 5500 (Nehalem) server, to run Oracle, I assert that you would have to do significant testing to find much of a performance difference between disabling NUMA in the BIOS and a fully-NUMA-aware configuration (i.e., NUMA BIOS=on + OS NUMA on + Oracle enable_NUMA_optimization=TRUE). That is, you will likely have to go to significant effort to find a performance delta of anything greater than about 10%. It will be extremely workload dependent (in reality I’m aware of test results that show 10% improvement with NUMA disabled in the BIOS). Having said that, I’m not sure a 10% improvement would be worth the Linux-specific issues associated with setting enable_NUMA_optimization to TRUE. I’ll gladly take my medicine from anyone who can show me more than, say, 10% with all the bells and whistles (and associated thorns and barbs) of a fully NUMA-aware Oracle Database 11g deployment. Remember, I’m speaking specifically about CNI systems, and let’s keep track of the publish date of this blog entry because the number of sockets can bear significance on this topic, as I’ll point out later in this series. When I talk about a SUMA/SUMO approach to Oracle on CNI, I mean single-hop memory, which should be the case up to at least 4 sockets, but I don’t know for certain since QPI systems are limited to two sockets today.
      1. What are the “thorns and barbs” I allude to? Well, it’s all about OS NUMA sophistication with specific regard to when, or if, to remotely execute a process, when an imbalance warrants a process migration, and how often such things should occur. The best NUMA-aware Operating System ever developed (DYNIX/ptx) had these sorts of issues ironed out flat. We (Sequent) had these sorts of issues directly in our fully focused rifle scopes and we weren’t messing around. Who is scrutinizing such issues in today’s world with CNI systems? After all, the “thorns and barbs” I allude to are dealt with in the Operating System. Do you know many Linux guys who are scrutinizing low-level systems performance issues specific to Oracle on CNI with bus traces and all? That’s not exactly a saturated research field 🙂 There is a reason the term SUMA/SUMO exists!
  2. Regarding Modern, Proprietary NUMA Implementations.
    1. If you are running Oracle on an MPNI system, heed Oracle Support Services’ recommendation to run with Oracle NUMA optimizations disabled (enable_NUMA_optimization = FALSE), as is the default with patch 8199533 applied. Based on testing and analysis, it may make sense to enable NUMA awareness in Oracle. That will depend on the workload and the degree of non-uniformity of your MPNI system. Talk of disabling NUMA in the OS will likely not relate to your platform. More on this later in this post. Talk of disabling NUMA at the hardware level (akin to the BIOS CNI approach) is most likely irrelevant.
  3. Regarding Pioneer, Proprietary NUMA Implementations.
    1. If you are running Oracle on a PPNI system, I’d like to buy you a beer! That would be a pretty old system with very old software. But, vintage notwithstanding, it would likely be a pretty stable system.
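
For completeness, here is a hedged sketch of how the Oracle-level switch is typically flipped. In 11g-era releases the parameter discussed above is the hidden parameter _enable_NUMA_optimization (note the underscore and the required quoting); verify the exact spelling and default for your release against ML 759565.1 before touching anything:

$ # Hedged sketch: disable Oracle NUMA optimizations via the hidden
$ # parameter; an instance restart is required for it to take effect.
$ sqlplus / as sysdba <<EOF
alter system set "_enable_NUMA_optimization"=FALSE scope=spfile;
exit
EOF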

That Middle Ground: MPNI with enable_NUMA_optimization = FALSE
It’s true. Running Oracle Database with its NUMA-awareness disabled on a MPNI system is a sort of middle ground. It so happens that this particular middle ground makes a lot of sense—and you likely have no choice either way. After all, I know of no MPNI system that allows you to run in a hardware-SUMA mode. It would be a bit of a chore to interleave memory across motherboards.

The Operating Systems that support MPNI systems generally interleave the memory allocations that back IPC shared memory and mmap()s. That is a good thing since the result is a fairness of access to the shared resource. More importantly, however, on really large NUMA systems (as most MPNI are) is the fact that private memory allocations (e.g., stack and heap) are allocated from the memory local to the NUMA node the process is executing on. Likewise, kernel structures affiliated with the process are allocated from local memory. So, the blend is a good thing on these types of systems. Having said that, however, I cannot vouch for what happens when there is CPU saturation on a NUMA node and the process scheduler is faced with a decision to remotely execute a process or just let idle cycles blow by. Moreover, I cannot vouch for these Operating Systems’ intelligence regarding when to just migrate a process over to a less-saturated NUMA node re-homing all the physical memory backing the process. Getting that stuff right is pretty hard work. I’ve got the T-Shirt to prove it.

I’m going to call this Part II and follow up with Part I soon. When you read Part I that move will make sense.

Oracle Ace, Ace Director, And… Uh, What Was That Again?

Hmmmm… according to the wiki.oracle.com Oracle Related Blog list I’m an “Oracle Employee Ace.”

Can someone shoot me a quick comment and tell me what that is please? I have not heard about this.

