Archive for the 'oracle' Category



Hard Drives Are Arcane Technology. So Why Can’t I Realize Their Full Bandwidth Potential?

We are all at different skill and technical sophistication levels so this post will look like fluff to some readers. This is just a quick post to show a clear depiction of a SAN configuration that wouldn’t do anyone any favors in an Oracle environment.

I showed some snapshots of the lab gear at my disposal in this blog entry. There are a lot of SANs in the lab here, but I can’t say which brand of SAN array I’m blogging about today, and honestly, this problem is not necessarily about the architecture of the particular SAN array controller in question. Yes, this is a low-end array, but the concepts I’m blogging about are relevant through the high-end. The following vmstat(1) output shows one of the problems I’m blogging about:

# vmstat 10
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 3  3      0 31577944  81948 653352    0    0    96   597   13    34  0  7 89  4
 5  4      0 31550968  82204 653096    0    0 73732 89913 2919 57992  0 10 60 30
 0  4      0 31576760  82332 652968    0    0 74555 98056 2942 55904  0 10 60 30
 1  6      0 31553824  82460 652840    0    0 80289 85870 2831 61435  0 10 63 28
 0  4      0 31576968  82460 652324    0    0 79466 85397 2860 60537  0 10 62 28
 8  0      0 31573368  82588 654260    0    0 41279 46733 1917 36035  0 32 53 15
 8  0      0 31572896  82588 653744    0    0 11613 13344 1239 13387  0 49 46  5
 5  0      0 31572528  82588 654260    0    0  8875  9039 1193 10727  0 47 49  4
 7  0      0 31572280  82588 653744    0    0  6559  7057 1146  9823  0 47 49  4
 6  0      0 31572176  82588 653744    0    0  6556  7048 1156  9121  0 50 47  3
 4  0      0 31572072  82588 654260    0    0  5736  5956 1136  7955  0 47 50  3
 6  0      0 31571576  82588 654260    0    0  4205  5529 1126  7851  0 47 51  2
 8  0      0 31571584  82588 654260    0    0  5960  5264 1132  7982  0 48 49  3
 7  0      0 31570896  82652 654712    0    0  4239  5167 1119  8086  0 47 51  2
11  0      0 31569968  82652 654712    0    0  5116  4995 1124  7437  0 55 43  2
 7  0      0 31570184  82684 653648    0    0  4782  5082 1125  7880  0 54 44  2

This vmstat(1) output shows that Oracle was getting upwards of 165MB/s (via a single 2Gb FCP path) of combined read/write throughput until the array cache reached its saturation point. At that point, the throughput of the array was relegated to the bandwidth of the spindles being accessed.

It’s All About Balance
I have a pretty simple view of storage and it is worth quoting myself:

Physical I/O is a necessary evil. Bottlenecking downwind of the SAN array cache is silly. Bottlenecking above the disks is foolishness.

Bottlenecking The Array (Silliness)
Configuring LUNs with insufficient spindle count to handle the cache miss (read-through) and write-back overhead is what I call bottlenecking the SAN array. This sort of bottleneck is strictly a configuration issue and therefore falls into the silliness category. The vmstat(1) output above is an example of bottlenecking the array. That is, the array could certainly deliver more throughput, but the spindle count was holding it back. Again, I’m not going to talk about which vendor’s SAN array controller it was because, as it turns out, any of them will act this way under these circumstances. I had a LUN of some 500GB that consisted of very few spindles, so it was no surprise to me that the throughput was lousy. This LUN had to be configured this way due to other constraints on the array’s usage (read: Kevin had to share some hardware). I got to live the pain I blogged about recently in this post about capacity versus spindle count. I could configure a more reasonable number of drives and get around this performance problem, but only up to the point where the tables turn and the array starts to bottleneck the disks.

Bottlenecking The Disks (Foolishness)
What do I mean by this? Well, most modern SAN arrays will bottleneck long before your application realizes the full bandwidth potential of all the drives the array can support. I call this bottlenecking the disks and it falls into the foolishness category. This is basic foolishness on the SAN array vendor’s part. Why build storage arrays with a horrible imbalance between capacity and performance?

Let me go over a typical case of a SAN array that bottlenecks the disks. Here is a link to some technical detail of a high end array. This particular array supports roughly 140TB of capacity when configured with the maximum spindle count (1024) of 146GB drives. On page 3 of the document, the vendor offers the first bandwidth-related figure by citing an internal bandwidth of 15GB/s. What does that mean to the servers connected to this array? Reading further (pg. 15), the document states that there is some 10.6GB/s of aggregate data bandwidth and 5.3GB/s of bandwidth for control. This means that if you get all the king’s horses and all the king’s men to work it out, you could feed 10.6GB/s of data to your servers. Indeed, 10.6GB/s is a great deal of bandwidth. So what am I talking about? Taking another look at page 15, we see the vendor claims that a single 146GB drive can deliver 99.9MB/s (max theoretical). If you wanted to drive all these spindles at full throttle, they would theoretically deliver 99.9GB/s (1024 disks x 99.9MB/s), which is far more than the maximum theoretical data bandwidth of the array. In fact, if you drove “only” about 108 of these spindles at full throttle you’d saturate the array. I quoted the word only because 108 is a lot of disks, but that is only about 10% of what the array supports.

If Oracle needs to do a physical I/O, let’s not bottleneck somewhere in the pipe! Think about it this way: in the technology stack I’m discussing (Oracle, high-end SAN array, etc.), hard drive technology represents the simplest and least advanced component. That is, while hard drives are faster than they were 10 years ago, they have not fundamentally changed. They are still round and brown and they spin. Wouldn’t you think homely old disk drives would be the bane of performance? They aren’t. If we could drive all our disks at their maximum throughput, we’d be in a much better place performance-wise.

Summary
Hard drives are miserably low-tech necessary evils. Will we ever get a storage architecture where the more sophisticated components don’t make matters worse? Yes, we will. I’ll tell you all more as time passes. In the meantime, I bet dollars to donuts that the paltry 128 drives used in the TPC-H benchmark I blogged about were being driven at or near full bandwidth. That, my dear readers, is cool.

It’s Not a Record, But It Is The Most Interesting TPC-H Result! No Manly Man SAN Here.

Not a Record, But It Is In My Book
Not a throughput record that is. This new result at 39,614 QphH comes within 2% of Oracle’s previous top ranking in the 300GB scale. Why would Oracle publish a result that doesn’t even beat their own previous result? Price performance is the answer. I also find the deployment architecture to be very interesting.

Oracle still holds the top ranking at 300GB with their December 2006 result of 40,411 QphH at a price/performance of $18.67/QphH. This new result, however, comes in at only $12.57/QphH, a 33% cost savings for essentially the same performance! What did Oracle do to reduce the cost? Did they use fewer RAC nodes to reduce their own licensing cost? Did they use a better, cheaper SAN? Did they use a cheaper, faster RAC interconnect? No, none of that. In fact:

  • Both results used an 8 node cluster of blade servers
  • Both results used Infiniband for the RAC cluster interconnect
  • Both results used “networked storage”

So what gives?

No SAN.
The December 2006 result of 40,411 QphH used a $393,765 configuration consisting of some 448 disks in a SAN with 32 MSA 1000 controllers (not including spares). This new result, on the other hand, did not use a SAN at all. Instead, there were 8 blade servers (the RAC cluster) attached to 16 storage blades. The storage blades were running RHEL5 and presenting block storage via SRP over Infiniband to the 8 node RAC cluster. What’s that? Yes, Infiniband: a single, unified connectivity model for both the RAC interconnect and storage. With this type of storage, Oracle was able to drive the same performance (yeah, I know, within 2%) with only 128 hard drives. All told, the storage configuration was priced at $141,158, a healthy 64% reduction in storage cost!

This is a very, very good day, but not for Manly Man.

I couldn’t be happier!

Nearly Free or Not, GridSQL for EnterpriseDB is Simply Better Than Real Application Clusters. It is Shared-Nothing Architecture After All!

According to this businesswire.com piece, EnterpriseDB made quite the splash at LinuxWorld San Francisco 2007 by introducing a clustered version of EnterpriseDB called GridSQL for EnterpriseDB. It’s no surprise that a product based on open source would get honorary mention at LinuxWorld, but I don’t care about that.

What I’m blogging about is the fact that I’m already seeing blogs and press pieces that clump GridSQL in with Oracle Real Application Clusters (RAC). What’s the problem? Both products use a cluster, so they must be birds of a feather.

Po-tay-toe / Po-tah-toe
Oracle Real Application Clusters is a shared-disk architecture. This GridSQL product is yet another shared-nothing implementation as is IBM’s DB2 UDB EEE (cousin of SP2). Mainframe DB2 is shared-everything of course. Also in the shared-nothing camp are: Teradata, Informix XPS, Microsoft SQL Server with Distributed Partitioned Views (Egad!), Sybase Navigation Server (there’s a blast from the past), Greenplum and the hardware niche guys like DATAllegro and Netezza. Ingres r3 takes a shared everything approach like Oracle RAC and mainframe DB2, but I’m not sure what those folks are up to these days.

Everyone Else Is Doing It
Simply put, the fastest way to get a clustered database out the door is to implement shared-nothing. Take your regular database engine, throw some replication and data shipping in there and poof, you have a shared-nothing database. I think everyone got the memo back in 1986 when Michael Stonebraker put his foot down and said that shared-nothing was the way to go, in spite of the fact that shared-nothing architectures include the necessary evil of data shipping. That Stonebraker paper was written a long time ago and most of the data points in it are OLTP-minded. For instance, I’ll quote Stonebraker:

Consider the number of messages which an SN [shared nothing] system must incur in a typical high transaction processing environment. The example consists of a data base with N objects subject to a load consisting entirely of transactions containing exactly k commands, each affecting only one record. (For TP1 the value of k is 4). For any partitioning of the data base, these k commands remain single-site commands. Suppose that there exists a partitioning of the data base into non-overlapping collections of objects such that all transactions are locally sufficient [WONG83]. Such a data base problem will be termed delightful. Most data base applications are nearly delightful. For example, the TP1 in [ANON84] has 85% delightful transactions.

Using TP1 (a batch mode ATM transaction workload) as a case in point for the claim that most applications will get along nicely (thus the term delightful) in a shared-nothing database! Wow, that was then and this is now. Folks, today’s applications are built on large numbers of tables and complex joins. The reason shared-nothing is nothing like RAC is that instead of only shipping functions (or tasks) and lock messages to the clustered nodes, as is the case with RAC, shared-nothing requires the shipping of data. And as soon as you have a problem with too much data shipping, you are required to reload and repartition your database to mitigate the problem. Try getting your partitioning right with an application that has 1,000 tables and 1,300 indexes. Ever chew on crushed glass?

OK, let me try it this way. Let’s say shared-nothing is in fact the best approach for data warehousing. Greenplum puts it this way:

Most of today’s general-purpose relational database management systems are designed for Online Transaction Processing (OLTP) applications. By default, business intelligence applications have inherited this less than optimal architecture. The reality is that BI workloads are fundamentally different from OLTP transaction workloads and therefore require a profoundly different architecture.

So in this view it is one or the other. DSS or OLTP, you choose.

So humor me for a moment. Let’s say, for instance, that a shared-nothing architecture product on the same hardware outperforms RAC by a flat 50% across all measurements. Nifty! Now let me ask, where is your OLTP? If some shared-nothing product does in fact outperform RAC for data warehousing, that benefit gets canceled out by the fact that the same shared-nothing architecture cannot do OLTP at all. At least not ERP. I won’t even touch the fact that most ERP apps are a bit of an amalgam of both OLTP and DSS-style accesses.

If anyone was going to have phenomenal success with shared nothing it would be IBM with DB2 UDB EEE given its long heritage (SP2) of shared nothing. What do the top 10 audited clustered TPC-H results tell us?

  1. 100GB Scale. Oracle chooses not to put much effort in this scale of the benchmark. Neither does IBM really since their last entry there was a 2003 result.
  2. 300GB Scale. Of the top 10, Oracle holds spots 1-6. In fact, Oracle’s RAC result in the 6th position has stood since September 2005, so IBM’s shared-nothing product has had ample time to trump at least that result. Instead, the most recent DB2 result at this scale was in 2004, which oddly didn’t even beat their own prior 2003 result in position 7 of the top ten! How bizarre, especially for a fundamentally better architecture for DSS-style workloads! Oracle has held the top spot in this scale since December 2006, so there is clearly no leap-frogging going on. The 300GB scale is a clear lock-down on the part of RAC.
  3. 1000GB Scale. Here again RAC has held the top position for many months, 10 months to be exact. That is plenty of time for any product of superior architecture to advance in the list. Oracle holds the number 3 slot at this scale as well.
  4. 3000GB Scale. Oracle RAC took the top slot at this scale just 2 months ago with a result that more than doubles the IBM DB2 UDB number at only 15% higher cost!
  5. 10000GB Scale. Here IBM holds the top spot, out of only two entries. Two years elapsed between Oracle entering the benchmark at this scale and IBM taking the lead.

Based upon points 2-4 in the list above, it seems shared-nothing is not somehow inherently superior to shared-disk architecture, even for warehousing, its supposedly preferred workload!

 

Using Linux Processor Affinity To Beat The Oracle Licensing Police.

In my recent blog entry about LGWR processing I set up a proof case that included establishing hard processor affinity for LGWR and other processes in order to prove a point. I’ve received a few emails about what tool I used to bind running processes to specific CPUs. The answer is not what you might presume. I did not use the taskset(1) command that ships with 2.6 kernel distributions because I’ve seen some irregularities with that tool in the past. I prefer to use a bit of my own code that uses sched_setaffinity(2). I’d provide that code as a reference, but it is a bit of a mess and does a lot of other stuff that doesn’t pertain to affinity.

The recent Oracle single socket Xeon “Cloverdale” TPC-C used hard affinity for various purposes. The FDR pointed to by that URL has a hard affinity program (affinity.c) listing that is very succinct. If you want to give hard affinity a play, I recommend you cut and paste from that FDR, compile and give it a go. For instance, presume you have compiled affinity.c and wish to round up all the processes for a specific $ORACLE_SID and bind them to CPU 0. You could do something like the following. Note: unless the oracle user has the CAP_SYS_NICE capability, you’ll have to make the affinity program setuid root in order to play with sched_setaffinity(2) on processes you don’t own. Such as:

$ cc -o aff affinity.c
$ sudo chown root aff     # setuid root requires root ownership
$ sudo chmod 4755 aff
$ TARGET_SID=PROD
$ ps -ef | grep ${TARGET_SID} | grep -v grep | awk '{ print $2 }' | while read PID
do
	./aff $PID 1      # second argument selects the CPU (here, CPU 0)
done

Oracle Licensing Again
I like nixCraft as a resource, but not when it comes to giving advice about Oracle licensing. What’s that got to do with hard affinity? I aimed to offer you readers more information about Linux hard affinity (note to self, blog sometime about the difference between hard and soft affinity if a reader ever asks) so I found this nixCraft article. It does offer more information about hard affinity scheduling, but oh so much more as well. In that page, the author is quoting an unspecified source as follows (quoted herein verbatim):

The scheduler attempts to keep processes on the same CPU as long as practical for performance reasons. Therefore, forcing a specific CPU affinity is useful only in certain applications. For example, application such as Oracle (ERP apps) use # of cpus per instance licensed. You can bound Oracle to specific CPU to avoid license problem. This is a really useful on large server having 4 or 8 CPUS

This quote is talking about how modern *nix schedulers implement soft-affinity (aka cache affinity) and thus handcrafting hard affinity is something better left to advanced situations. That’s fine and dandy, but pardon me for a moment because I’m choking on a hairball over that bit about ERP licensing vis a vis CPU count. That is a complete farce. Please do not try to negotiate a 2 CPU license on an 8 core system where you plan to pin Oracle E-Business Suite processes to the first 2 cores of the system using sched_setaffinity(2). I know I’ve provided a quote of a quote, but that idea is about as brain-dead as they come. It would work, but in a license audit it would not fly.

More About Hard Affinity
Be aware that if you run 100% of a given workload under hard affinity to CPU 0, for instance, that is not the same as running on a single CPU box. The other processors are still online and servicing device interrupts. For instance, if LGWR is running on CPU 0, the hardware interrupt for any given log file parallel write can be serviced by any of the online CPUs in the box. If, for instance, LGWR is pinned to CPU 0 and a LGWR flush I/O interrupt is handled by, say, CPU 3, you are getting both a blessing and a curse. The blessing is that LGWR’s CPU didn’t have to handle the noise of the interrupt. The curse is that your workload is getting “help” from CPUs you didn’t really want in the mix, that is, if you are hoping for total isolation of the workload on CPU 0. This is not the same as the sophisticated OS partitioning offerings out there that corral even the hardware interrupts to the CPUs in the partition. That is an entirely different topic.

Summary
No, binding a workload to CPUs with sched_setaffinity(2) is not the same as running on the same number of discrete processors. Whoever authored the words “You can bound Oracle to specific CPU to avoid license problem” is likely neither an English professor nor a contract lawyer. I’d ignore that “advice.” I wonder what “problem” the author was referring to? Is legal use of a software product a “problem?” All that aside, hard affinity can play an important role in affording ample processor bandwidth to key Oracle background processes. There were musings in the comment thread on my LGWR processing post about dedicating an entire CPU to LGWR. Although I have not followed up, that is certainly not the approach. I’ll need to blog about a possible best practice in this concept area soon.

Can’t Get Oracle Off My Mind.

I promise I’ll explain soon the reason I couldn’t get Oracle off my mind during my vacation, in spite of the scenery and activities. This is the time of year I do some fairly aggressive hiking in preparation for my annual hunting trip. Hopefully mere mention of such a seemingly controversial activity will not alienate any readers. Hey, it’s my blog and I’m one of those folks that knows food comes from somewhere, even before it shows up in a market. Anyway, I thought I’d post a few photos. Enjoy!

BLOG UPDATE: Several of you readers asked to see the full size versions of these shots.  Here is a link: Photos on Flickr


Don’t Forget ARCH When Pondering Redo and Throwing Partners Under The Bus For Fun, Not Profit.

In my recent post about LGWR processing, I used a title that fits my recurring Manly Men motif just for fun. The title of the post was Manly Men Only Use Solid State Disk for Redo Logging. The post then set out to show that LGWR performance issues are not always I/O related by digging into what a log file sync wait event really encompasses. Well, I’m learning to choose my words better, the hard way. Let me explain.

The purpose for building the test case for that post using solid state disk (SSD) for redo log files was to establish the fact that even when redo logging I/O is essentially free, it is quite possible to see long duration log file sync wait events. I also mentioned that I did one pass of the test with the initialization parameter _disable_logging set to true and achieved the same results as the SSD run. That establishes the fact that SSD for redo logging is as fast as fast can be. The point of the exercise was to show that when the I/O component is not the root cause of long duration log file sync wait events, it would be foolish to “throw hardware at the problem.” What I failed to point out, in that post only, was that SSD for redo is most certainly a huge win if your system is doing any reasonable amount of redo. This test case only generates about 500 redo writes per second, so it is not a show-case example of when to apply SSD technology for redo. On the other hand, I would not have been able to make such a rich example of LGWR processing without solid state disk.

Biting The Hand That Feeds You
The good folks at Texas Memory Systems were nice enough to loan me the solid state disk that I happened to use in the LGWR processing post. I’m sad to report that, as per email I received on the topic, their reading of the LGWR processing post left the impression that I do not promote the use of solid state disk for redo. Nothing could be further from the truth. The problem is that I am still working on the tests for my latest proof case for solid state using a much higher-end configuration. Regular readers of this blog know that I have consistently promoted solid state disk for high-end redo logging situations. I have been doing so since about 2000 when I got my first solid state disk-an Imperial MegaRam. That technology could not even hold a candle to the Texas Memory Systems device I have in my lab now though.

Forget SSD for Redo Logging, Really?
Regular readers of this blog know my regard for Nuno Souto, but he left a comment on the LGWR processing thread that I have to disagree with. He wrote:

Folks should forget SSDs for logs: they are good to reduce seek and rotational delays, that is NOT the problem in 99% of the cases with redo logs. They might be much better off putting the indexes in the SSDs!

Noons is correct to point out that rotation and seek are generally not a problem with LGWR writes to traditional round-brown spinning thingies, but I’d like to take it a bit further. While it is true that, given a great deal of tuning and knowledge of the workload, it is possible to do a tremendous amount of redo logging to traditional disks, that is generally only possible in isolated test situations. The perfect cases in point are the high end TPC-C results Oracle achieves on SMP hardware. The TPC-C workload generates a lot of redo, and yet Oracle manages to achieve results in the millions of TpmC without solid state disk. That is because there are not as many variables in the TPC workload as there are in production ERP environments. And there is one other dirty little secret: log switches.

Real Life Logging. Don’t Forget Log Switches.
Neither the test case I used for the LGWR processing post nor any audited Oracle TPC-C run is conducted with the database in archive log mode. What? That’s right. The TPC-C specification stipulates that the total system price include disk capacity for a specific number of days’ worth of transaction logging, but it is not required that logs actually be kept, so the databases are never set up in archivelog mode.

One of the most costly aspects of redo logging is not LGWR’s log file parallel write, but instead the cost of ARCH spooling the inactive redo log to the archive log destination. When Oracle performs a log switch, LGWR and ARCH battle for bandwidth to the redo log files.

Noons points out in his comment that rotation and seek are not a problem for LGWR writes, which is generally true. However, all too often folks set up their redo logs on a single LUN. And although the single LUN may have many platters under it, LGWR and ARCH going head to head performing sequential writes and large sequential reads of blocks on the same spindles can introduce significant performance bottlenecks. To make matters worse, it is very common to have many databases stored under one SAN array. Who is paying attention to religiously carve out LUNs for one database or the other based on logging requirements (both LGWR and ARCH)? Usually nobody.

The thing I like the most about solid state for redo logging is that it neutralizes concern for both sides of the redo equation (LGWR and ARCH) and does so regardless of how many databases you have. Moreover, if you log in solid state to start with, you don’t have to worry about which databases are going to become logging-critical, because all the databases get zero-cost LGWR writes and log switches.

Storage Level Caches
Simply put, redo logging wreaks havoc on SAN and NAS caches. Think about it this way, a cache is important for revisiting data. Although DBAs care for redo logs and archived redo logs with religious fervor, very few actually want to revisit that data-at least not for the sake of rolling a database forward (after a database restore). So if redo logs are just sort of a necessary evil, why allow redo I/O to punish your storage cache? I realize that some storage cache technology out there supports tuning to omit the caching of sequential reads or writes and other such tricks, but folks, how much effort can you put into this? After all, we surely don’t want to eliminate all sequential reads and write caching just because the ones LGWR and ARCH perform are trashing the storage cache. And if such tuning could be done on a per-LUN basis, does that really make things that much simpler? I don’t think so. The point I’m trying to make is that if you want to eliminate all facets of redo logging as a bottleneck, while offloading the storage caches, solid state redo is the way to go.

Summary
No, I don’t throw partners under the bus. I just can’t do total brain dumps in every blog entry and yes, I like catchy blog titles. I just hope that people read the blog entry too.

Hey, I’m new to this blogging stuff after all.

Back From Vacation. When The Rubber Hits The Road…

I’ll spare you good readers the highlights. Here is a lowlight. This is what a 10-ply radial does under a 12,000 pound trailer when it suffers belt separation.  The first view isn’t too ugly:

blowout_2_small.jpg

…but the following shot gets the point across. Ugh:

blowout_1_small.jpg

So, now that I’m back home I need to haul the trailer over to the repair shop. When the tire blew it ripped some of the sheet metal skirting to shreds. Nifty!

And yes, it could have been much, much worse.

It’s good to be back. I’ll try to make a real useful blog entry tomorrow.

Manly Men Only Use Solid State Disk For Redo Logging. LGWR I/O is Simple, But Not LGWR Processing

Let’s look at some statspack data from an OLTP Order Entry benchmark run that achieved 133 transactions per second (TPS):

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
log file sync                                   68,965       1,117     16   64.6
db file sequential read                        159,712         241      2   14.0
CPU time                                                       163           9.4
log file switch (checkpoint incomplete)            164         146    890    8.4
db file parallel write                          19,651          31      2    1.8

And now let’s look at the stats from run number 2 of the same benchmark, same system, same disk setup, etc:

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
db file sequential read                        838,810       1,635      2   69.1
CPU time                                                       253          10.7
db file parallel write                         153,334         145      1    6.1
log file sync                                   93,100         143      2    6.0
log file parallel write                         93,417          73      1    3.1

Benchmark run number 2 achieved 200 TPS. That’s right, statspack number 1 and number 2 above are from the same system. The database files are stored in an HP EFS Clustered Gateway NAS device and accessed via NFS, and the redo logs are, get this, stored on a Texas Memory Systems Solid State Disk. Statspack 2 above was a very busy system (a lot of cached SGA work) doing over 3,000 IOPS:

Load Profile                            Per Second       Per Transaction
~~~~~~~~~~~~                       ---------------       ---------------
                  Redo size:          1,138,491.64              7,166.65
              Logical reads:             14,640.16                 92.16
              Block changes:              5,538.83                 34.87
             Physical reads:              1,428.73                  8.99
            Physical writes:              1,742.27                 10.97
                 User calls:              2,352.99                 14.81

So if a) the system is the same, and b) LGWR is writing redo to the fastest disk possible, what is the story? What was different about statspack 1 that made log file sync wait events 8-fold longer, with the associated 34% drop in throughput? More users? No. Nothing that simple. I’ll explain.

LGWR I/O? Who Cares? Simple Stuff
I see myriad information on blogs, forums and all those expert repositories about the I/O aspects of redo logging, but very little about LGWR processing. Yes, a portion of LGWR’s job is to flush the redo buffer to the online redo log, but there’s more to LGWR’s job than writing redo. In fact, I never could understand the fascination with LGWR’s I/O since it is about as simple as it gets: sequential writes to a single file in sizes ranging from 512 bytes to a port-defined maximum (usually 128KB but often 1MB). Yes, it gets a little more complex with multiplexed redo, but that is just concurrent operations of the exact same profile. The part that seems to be hard to grasp for folks is where asynchronous I/O comes into play.

LGWR Async I/O
If LGWR is clearing more than the port-defined maximum from the buffer, asynchronous I/O is important. Let’s consider an example. Let’s say LGWR has been posted to clear 1.75MB of redo from the buffer on a platform that establishes the LGWR I/O maximum to be 1MB. In this case, LGWR will issue a 1MB write asynchronously, immediately followed by a 768KB asynchronous write. These writes are sequential. After the second I/O is in flight, LGWR starts to poll for the completions of the I/Os. So the question often becomes: what if LGWR is only flushing, say, 600KB instead of 1.75MB, isn’t asynchronous I/O crucial in that case as well? The answer is no, unless you are using multiplexed redo. If LGWR is flushing an amount less than the port-defined maximum, asynchronous I/O offers nothing, at least for that specific flushing operation. See, LGWR is a lot different from DBWR. When DBWR has I/O in flight, it also has other work to do, like building the next write batch and so forth. LGWR on the other hand has nothing to do when its I/Os are in flight. It polls for the completions, and once the flushing operation is complete, it then takes action to post the foregrounds that are waiting in a log file sync wait event. So asynchronous I/O has no play in a LGWR buffer flush operation of an amount smaller than the port-defined maximum I/O size.

Oh, I forgot to mention, LGWR will keep up to 4 I/Os in flight (I need to check whether that is higher in later releases) when clearing more than the maximum allowed in a single physical write. All in all, LGWR I/O is very simple. So why do session waits, such as log file sync, generate so much folklore? I don’t know, but I know the folklore isn’t helpful.

Throw A Solid State Disk At It
I love that one. Let’s see, writing to solid state disk is faster than writing to round, brown, spinning thingies, so if any aspect of LGWR’s duties seems out of sorts, let’s get some really, really fast disk. Don’t get me wrong, I’m a big fan of solid state disk, but throwing it at the wrong problem is not the right thing to do.

The most complex topic regarding LGWR is the log file sync wait event. Why? Because it has very little to do with I/O, that’s why. Sure, I/O is a part of it, but there is more to it.

Backgrounder
LGWR spends its time in a work loop consisting of very little, really. Well, in contrast to DBWR, that is. Remember, DBWR has to manipulate both cache buffers chains and LRUs, flush modified SGA buffers and post processes. While DBWR works on a large amount of metadata and buffers, LGWR does not. A glance at ps(1) output makes that clear. Notice in the following ps(1) output that LGWR manages to perform its tasks with a resident set size one-thirtieth the size of DBWR’s, and of course that ratio would lean much further to the DBWR side on a large SGA system, since DBWR has to map all the buffers in the SGA over the life of the instance (the exception to this is the indirect data buffers feature). LGWR on the other hand has a very small working set because it only tends to the redo buffer and exercises only a few latches (redo allocation, redo copy, etc.).

$ ps -eF | egrep 'UID|lgwr|dbw' | grep -v grep
UID        PID  PPID  C    SZ  RSS PSR STIME TTY          TIME CMD
oracle   19062     1  1 377082 851808 0 16:11 ?       00:00:08 ora_dbw0_bench1
oracle   19064     1  2 379943 29816 0 16:11 ?        00:00:09 ora_lgwr_bench1

Log File Sync
When a foreground process commits changes to the database, it allocates space in the redo buffer and copies the redo information into the buffer, protected either by the redo allocation latch itself or a redo copy latch. The process then enqueues itself for service from LGWR and posts LGWR (most ports use IPC semaphores for posting). The process then goes to sleep waiting for the return post from LGWR indicating its redo is flushed to disk. So, we now have a process waiting in a log file sync (LFS) wait event. Let’s call this process waiter1. How long waiter1 waits depends on a lot of factors.

The first factor that will affect waiter1’s LFS wait is what LGWR was doing when waiter1 posted him. Remember, Oracle is a multi-user database on a multi-processor system. There is 1 LGWR and a lot of consumers of his service. On a busy system, odds are quite good that any time LGWR gets posted it is in the process of servicing a batch of prior waiters. That is, when waiter1 posted LGWR, the odds are good it was already busy flushing a bunch of redo for some other group of processes who are themselves waiting in LFS. This is where it gets dicey.

Before LGWR can flush waiter1’s redo, he has to finish flushing the prior batch and post all of those waiters. Let’s say there was 256KB of redo to flush for 2 waiters on a platform with the port-defined maximum LGWR I/O size set at 128KB. On this platform, LGWR will have to issue 2 asynchronous 128KB writes and poll for those to complete. Once the I/O is complete, LGWR will post each of the two waiters using the post-wait mechanism for that port of Oracle, which is an IPC semaphore semop() call on most platforms. So, LGWR has made system calls to write, system calls to poll for I/O completion and finally 2 calls to semop(). LGWR is just a normal user-mode process, so all this dipping into the kernel introduces processing delays. Yes, these system calls I listed are non-blocking, but when a processor transitions from kernel to user mode there is opportunity to switch to higher-priority runnable processes. In fact, the very act of LGWR posting these 2 processes can cause it to lose the CPU. Huh? Sure. What if instead of 2 there were 16 in the batch and there were only 4 CPUs on the system? We call that the thundering herd and it wreaks havoc on LGWR. These process scheduling storms can be quite problematic. Think about it. You have one process (LGWR) that just made 4, 8, 16 or more processes runnable by semop()ing them out of their log file sync wait, which makes them runnable in kernel mode, and that trumps a runnable user-mode process. If that happens on a system with, say, 8 cores, those processes will need to run somewhere. It is a scheduling storm. For this reason you should always set LGWR’s scheduling priority to the highest possible (with renice, for instance).

Without elevated scheduling priority for LGWR, it can often be the case that the processes LGWR wakes up will take the CPU from LGWR. Some platforms (Legacy Unix) support the notion of making processes non-preemptable as well. On those platforms you can make LGWR non-preemptable. Without preemption protection, LGWR can lose CPU before it has exhausted its fair time slice.
Once LGWR loses his CPU it may be quite some time until he gets it back. For instance, if LGWR is preempted in the middle of trying to perform a redo buffer flush, there may be several time slices of execution for other processes before LGWR gets back on CPU. Why am I putting you to sleep about this? Because those time slices that LGWR was off CPU will be charged to foreground processes waiting in a log file sync wait event. Yes, if a foreground is in log file sync wait and LGWR is piled under other processes and can’t get CPU for, say, a few time slices (10ms each), that foreground process’ log file sync will have 10, 20 or more milliseconds of extra wait tacked onto it before LGWR even gets a chance to do I/O. LGWR’s I/O is merely a component of a foreground process’ LFS wait.

So what happened to the benchmark run behind statspack 2 above? CPU starvation, specifically for LGWR! How did I do that? Easy. I have a simple program that uses the Linux sched_setaffinity() API call to assign hard processor affinity to processes. These tests were done on a 4-core x86 system and in both cases all the background processes were started up with hard CPU affinity to CPU 0. Once the database was started, I reassigned LGWR’s hard affinity to CPU 3. The Pro*C benchmark processes and their associated shadow processes all ran with hard affinity split equally between CPUs 1 and 2. This is the model for both benchmark runs. Yes, LGWR is all alone on CPU 3.

In the case where the benchmark only got 133 TPS, I had a “noise” process running on the same CPU as LGWR. This noise process is also pinned to the CPU with hard affinity. What does this “noise” process do that is so heinous as to affect Oracle’s OLTP throughput by 34%? After all, it is only a single process. Yes, single, but also running on the same CPU as LGWR. The noise program is:

#include <math.h>
#include <poll.h>

int main(void)
{
    int i;
    double z;
    for (;;) {
        for (i = 0; i < 1000000; i++)
            z = sqrt((double)i);   /* ~3ms of pure user-mode CPU */
        (void)z;
        poll(0, 0, 1);             /* then sleep for 1ms */
    }
}

Simple enough. That loop of sqrt() calls from libm takes about 3ms on the CPUs in this benchmark system. So the way this works is that about every 3ms, this process will go to sleep for 1ms using the poll() system call. If this process is running when LGWR needs to run, LGWR will have to wait for up to 3ms. If that process isn’t executing, LGWR has no impediment to performing his tasks, LFS waits are a measly 2ms, and OLTP throughput jumps 34%.

This little parasite noise process gets in LGWR’s way a lot too. Basically, any time LGWR gets posted to perform a flush, this noise process is in the way (chalk up 3ms), any time LGWR blocks on an I/O (e.g., a flush that is smaller than the port maximum and therefore a single I/O) this noise process is in the way once the I/O is complete (chalk up 3ms), any time LGWR unluckily loses his CPU to a time slice switch this process will run and take 3ms from LGWR. It all adds up. And in this test case, processes posted by LGWR can never take his CPU (since I’ve set up hard affinity) so this is a nicer test case actually.

Don’t Write At All

Just to be evil, I’ll take it a step further. Did you know that if you set the initialization parameter _disable_logging=TRUE, LGWR does everything it normally does except perform the I/O for a redo flush? Don’t do this if you like your database, because anything other than a shutdown normal is going to leave your database unrecoverable. However, this is a great test, because if performance doesn’t increase when you set this parameter, then you have a LGWR processing problem and not a LGWR I/O problem. So I did a run of the same benchmark like this:

SQL> show parameter _disable

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_disable_logging                     boolean     TRUE

Everything else was the same, but now instead of really fast Solid State Disk I/Os, LGWR doesn’t even bother writing to disk at all. What did that do to the benchmark results? Nothing. I still got 133 TPS.

Summary
LGWR I/O is very simple. LGWR processing on the other hand is a bit more complex.

Oracle11g: Oracle Inventory On Shared Storage. Don’t Bother Trying To Install 11g RAC That Way.

A few days ago there was a thread on the oracle-l email list about adding nodes in an Oracle Database 10g Real Application Clusters environment. The original post showed a problem that Alex Gorbachev reports he’s only seen with shared Oracle Home installs. I found that odd because I’ve done dozens upon dozens of RAC installs on shared Oracle Homes with both CFS and NFS and haven’t seen this error:

Remote 'UpdateNodeList' failed on node: 'af-xxx2'. Refer to
'/apps/oracle/oraInventory/logs/addNodeActions2007-07-08_09-36-12PM.log'
for details.
You can manually re-run the following command on the failed nodes after the
installation:
/apps/oracle/product/10.2/oui/bin/runInstaller -updateNodeList -noClusterEnabled
ORACLE_HOME=/apps/oracle/product/10.2 CLUSTER_NODES=af-xxx1,af-xxx2,af-xxx6
CRS=false "INVENTORY_LOCATION=/apps/oracle/oraInventory" LOCAL_NODE=
<node on which command is to be run>

I never have any problems with shared Oracle Home and I blog about the topic a lot, as can be seen in this list of posts. Nonetheless, Alex pointed out that the error has to do with the Oracle Inventory being on a shared filesystem. Another list participant followed up with the following comment about placing the inventory on a shared drive:

Sharing the oraInventory across nodes is not a good practice in my opinion. It runs counter to the whole concept of redundancy in an HA configuration and RAC was not written to support it.

Well, the Oracle Inventory is not a RAC concept, it is an Oracle Universal Installer concept, but I think I know what this poster was saying. However, the topic at hand is shared Oracle Home. When people use the term shared Oracle Home, they don’t mean shared ORACLE_BASE, they mean shared Oracle Home. Nonetheless, I have routinely shared the 10g inventory without problems, but then my software environments might not be as complex as those maintained by the poster of this comment.

Shared Inventory with Oracle Database 11g
No can do! Well, sort of. Today I was installing 11g RAC on one of my RHEL 4 x86 clusters. In the fine form of not practicing what I preach, I mistakenly pointed Oracle Universal Installer to a shared location (NFS) for the inventory when I was installing CRS. I got CRS installed just fine on 2 nodes and proceeded to install the database with the RAC option. It didn’t take long for OUI to complain as follows:

[image: shared_home_11g.jpg]

Ugh. This is just a test cluster that I need to set up quick and dirty. So I figured I’d just change the contents of /etc/oraInst.loc to point to some new non-shared location. Aren’t I crafty? Well, that got me past the error, but without an inventory with CRS in it, Oracle11g OUI does not detect the cluster during the database install! No node selection screen, no RAC.

I proceeded to blow away all the CRS stuff (ORA_CRS_HOME, inittab entries, /etc/oracle/* and /etc/oraInst.loc) and reinstalled CRS using a non-shared location for the inventory. The CRS install went fine and subsequently OUI detected the cluster when I went to install the database.

This is a significant change from 10g where the inventory content regarding CRS was not needed for anything. With 10g, the cluster is detected based on what /etc/oracle/ocr.loc tells OUI.

Summary
Shared Oracle Home is an option, but shared Oracle Home means shared Oracle Home, not shared Oracle Inventory. Oracle11g enforces this best practice nicely!

Application Server Benchmark Proves PostgreSQL is The Best Enterprise Database Server. New SPECjAppServer2004 Cost Metric Introduced Too!

I was so excited to see this InformationWeek.com article reporting that PostgreSQL has proven itself nearly as good as Oracle. On a day billed as “a good day for Open Source”, Sun published a SPECjAppServer2004 result using the Sun Niagara platform running PostgreSQL backing 3 application servers running Sun Java System Application Server 9.0.

The Sun result was 778.14 JOPS which, as the InformationWeek.com article suggests, is roughly 12% under this 874 JOPS Itanium Oracle result. That’s all fine and dandy, but why hand-pick that Oracle result over this 1111.96 JOPS Xeon “Clovertown” Oracle result? I’ll tell you why: it’s because InformationWeek.com was only reporting on this blog entry by Josh Berkus. Josh is the PostgreSQL Lead, by the way, and Josh knows full well that the 12% in his hand-picked comparison looked a lot better than the 42% thumping Oracle dealt PostgreSQL using 2 little Xeon “Clovertown” processors. But that isn’t what I’m blogging about.

PostgreSQL: The World’s Best Application Server
I’m not blogging about the fact that these guys cherry-picked results in an attempt to put PostgreSQL on par with Oracle. After all, I too dream of the day that everything in the world is free. In the meantime, you get what you pay for.

What gets my ire up about this news piece is the fact that there is no news there. Folks, SPECjAppServer2004 is not a database benchmark. The SPECjAppServer2004 web page puts it like this:

SPECjAppServer2004 (Java Application Server) is a multi-tier benchmark for measuring the performance of Java 2 Enterprise Edition (J2EE) technology-based application servers. SPECjAppServer2004 is an end-to-end application which exercises all major J2EE technologies implemented by compliant application servers as follows:

  • The web container, including servlets and JSPs
  • The EJB container
  • EJB2.0 Container Managed Persistence
  • JMS and Message Driven Beans
  • Transaction management
  • Database connectivity

Yes, there is a database downwind. No this is not a database workload it is an Application Server benchmark. So to say that PostgreSQL made a “good day for Open Source” based upon this result is a lark. But that is not what I’m blogging about.

Cost Metric
SPECjAppServer2004 has no cost metric. By that I mean that there is no cost metric. In fact, the specification lacks a cost metric. Moreover, no published results include cost information. What I’m trying to say is that the benchmark has no cost component. If I started an Open Source project called JOPS Cost Metric Calculator (JCMC), I’d have nothing to work with—but I might get a lot of help I suppose. In fact, if I spent the rest of my life studying the SPECjAppServer2004 specification I wouldn’t find anything about a cost metric. I once saw a Klingon riding a three-legged pink elephant wearing a tee shirt that read, “The SPECjAppServer2004 ghajtaH ghobe’ Cost Metric” which—loosely translated—corroborates my assertion that there is no cost metric in the SPECjAppServer2004. That’s why I’m a little disappointed to see Tom Daley’s blog entry with a cost analysis of these SPECjAppServer2004 results.

Summary
Oracle Database has licensed options. Customers can use the best options at prices that fit their budget to solve their problems. Incidentally, Oracle has that choice as well. So when Oracle is competing in an Application Server benchmark, they can use whatever downwind database options they so choose—after all, I heard a rumor that there is no cost component in the SPECjAppServer2004 specification.

Oracle used Enterprise Edition and partitioning. I bet if they tried they could get the same result with Standard Edition or, since there are only 8 cores involved, Standard Edition One. But why bother? THERE IS NO COST COMPONENT. And PostgreSQL comes nowhere close to what Oracle can do with this workload.

What Did He Say?
That’s right, in spite of the blogospherian hopes that PostgreSQL is a 12% heel-biter to Oracle, the full list of audited SPECjAppServer2004 results tells the real story. Not only did Oracle slam this “PostgreSQL result” with the same number of cores—when we consider the 42% thumping Oracle on Xeon 5355 delivered—but Oracle scales as well, as this 7629.45 JOPS result with 64 cores attests. Oh, and I forgot to point out that the 778.14 Sun result with PostgreSQL downwind used three application servers. Isn’t it interesting how a single Xeon 5355-based application server was able to push 42% more throughput when using Oracle’s application server with an Oracle database back-end? That is, the Sun result required 3 application servers for a total of 12 cores, whereas the Oracle 1111.96 JOPS result used a single Xeon 5355-based application server with 8 cores. Remember, this is an application server benchmark, so Oracle pulls 139 JOPS per application server core. That’s a far cry better than Sun’s 65 JOPS per application server core!

When it comes to the database component, I want to see the PostgreSQL 64-core number. Hey, it’s just a porting effort after all. There must be countless Open Source contributors working on optimizing the PostgreSQL port to HP-UX for the Superdome.

Where’s Waldo?
There are no SPECjAppServer2004 results for SQL Server in this list. That could very well be the real story.

Oracle Doesn’t Scale Without Multiple Log Writer Processes

There has been a bit of a spat going on between Jonathan Lewis and Don Burleson. Don’s summary of the topic can be found here. I don’t care about the spat. I want to point out something clearly. Very clearly. But first, a quote from Don:

[…] Lewis failed to mention the hidden parameter _lgwr_io_slaves nor the Metalink note that clearly states that initially, only one LGWR process is started and that multiple LGWR processes will only appear under high activity.

The Oracle docs are very clear that multiple log writer processes exist:

“Prior to Oracle8i you could configure multiple log writers using the LGWR_IO_SLAVES parameter.”

Note: There is a big difference between LGWR I/O Slaves and LGWR. Don continues:

(Note: Starting in 8.1.7, lgwr_io_slaves became a hidden parameter _lgwr_io_slaves, and it remains a hidden parm in Oracle10g).

Further, Metalink note 109582.1 says that multiple log writer I/O slaves were first introduced almost a decade ago, way back in Oracle8:

Folks, there is only ever one LGWR whose job it is to clear the redo buffer. DBWR, on the other hand, can be instantiated multiple times (support for multiple true DBWR processes), and that is due entirely to the NUMA requirements of certain ports from the late 1990s. Although it is possible for a single DBWR to saturate a CPU, and therefore need multiple DBWR processes, that is rare (generally related to pathological latch contention on cache buffers chains).

As long as LGWR and DBWR have asynchronous I/O support—and you are not running on a NUMA-optimized port—there really should never be a reason to configure multiple DBWR processes, nor should LGWR saturate a CPU.

Let me put it this way: there are examples of large high-end systems doing OLTP that don’t need multiple LGWR processes. Here is a system with roughly 7,000 spindles, 40 FC HBAs, 128 cores and 2TB RAM that needed neither multiple LGWR nor DBWR processes. And if you think you have a workload that generates more redo on a per-CPU basis than TPC-C, you’re probably wrong.

If I ever hear the term I/O slaves again, I’ll:

[image: gross.jpg]

Summary
Oracle demonstrates more LGWR scalability than 99.99942% of all production sites would ever need when they run such high end TPC-C benchmarks as the one I cited above. And please, let’s stop the I/O slave madness!

Manly Men Only Deploy Oracle with Fibre Channel – Part VIII. After All, Oracle Doesn’t Support Async I/O on NFS

In the comment section of my recent post about Tim Hall’s excellent NFS step-by-step Linux RAC install Guide, Tim came full circle to ask a question about asynchronous I/O on NFS. He wrote:

What do you set your filesystemio_options init.ora parameter to when using Oracle over NFS?

Based on what you’ve written before I know NFS supports direct I/O, but I’m struggling to find a specific statement about NFS and asynchronous I/O. So should I use:

filesystemio_options=directIO

or

filesystemio_options=setall

My reply to that was going to remind you folks about my recent rant about old Linux distributions combined with Oracle over NFS. That is, the answer is, “it depends.” It depends on whether you are running a reasonable Linux distribution. But Tim quickly followed up his query with:

I found my answer. Asynchronous I/O is not supported on NFS:

http://download.oracle.com/docs/cd/B19306_01/server.102/b15658/appc_linux.htm#sthref892

Bummer, I didn’t get to answer it.

Word To The Wise
Don’t use old Linux stuff with NAS if you want to do Oracle over NFS. Metalink 279069.1 provides a clear picture as to why I say that. It points out a couple of important things:

1. RHEL 4 U4 and EL4 both support asynchronous I/O on NFS mounts. That makes me so happy because I’ve been doing asynchronous I/O on NFS mounts with Oracle10gR2 for about 16 months. Unfortunately, ML 279069.1 incorrectly states that the critical fix for Oracle async I/O on NFS is U4, when in fact the specific bug (Bugzilla 161362) was fixed in RHEL4 U3, as seen in this Red Hat Advisory from March 2006.

2. Asynchronous I/O on NFS was not supported on any release prior to RHEL4. That’s fine with me because I wouldn’t use any Linux release prior to the 2.6 kernels to support Oracle over NFS!

Summary
The Oracle documentation on the matter was correct when it was produced, long before there was OS support for asynchronous I/O on Linux for Oracle over NFS. Metalink 279069.1 is partly correct in that it states support for asynchronous I/O on systems that have the fix for Bugzilla 161362, but it incorrectly suggests that U4 is the requisite release for that fix; it isn’t. The bug was fixed in U3. And yes, I get really good performance with the following initialization parameter set, and have for about 16 months:

filesystemio_options = setall

Manly Man Post Script
Always remember, the Manly Man series is tongue-in-cheek. Oracle over NFS with async I/O, on the other hand, isn’t.

Oracle Database 11g: The SecureFiles Feature is not “Fast Files”, But Could Be Quite Fast

In my recent post about Oracle Database 11g SecureFiles, I referred to a July 11, 2007 press release that insinuated a name change from SecureFiles to Fast Files.

I just received email from one of the Program Managers (perhaps the 11g Program Manager—I’ll have to ask him) who set me straight that the feature is indeed called SecureFiles.

The press piece I referred to was in error.

As an aside, I’m champing at the bit to do my own testing on just how much faster it is to store LOBs using the SecureFiles feature than in a traditional file system. As I’ve pointed out, I hope it is better. Having more unstructured data support inside the database is a really good thing.

Manly Men Only Deploy Oracle with 64 Bit Linux – Part I. What About a x86 Port on EM64T/AMD64 Hardware?

In the comment thread of my latest installment in the “Manly Man” series, a reader posted a humorous comment that included a serious question:

[…] what are your thoughts about x86 vs. x86-64 Linux, in relation to Oracle and RAC? I’d appreciate a blog entry if you could.

Tim Hall followed suit:

I would have thought it was obvious. The number 64 is twice the size of 32, so it must be twice as good, making you twice as manly!

Yes, when it comes to bitness, some is good so more must be better, right? Tim then continued to point out the political incorrectness of the term Manly Man by suggesting the following:

PS. I think politically correct names for this series of blog entries are:

Personally Persons only deploy…
Humanly Humans only deploy…

Before I actually touch the 32 versus 64-bit question from the thread, I’ll submit the following Manly Man entertainment:

[image: irish_spring.jpg]

If you have enjoyed the Manly Man series at all, or simply need a little background, you must see this YouTube video about Manly Men and Irish Spring.

32 or 64 Bit Linux
When people talk about 32 versus 64 bit Oracle with Linux they are actually talking about 3 topics:

  1. Running 32 bit Oracle on 32 bit Linux with native 32 bit hardware (e.g., Pentium IV (Willamette), Xeon MP (Foster MP) ).
  2. Running 32 bit Oracle on 32 bit Linux with x86_64 hardware.
  3. Running 64 bit Oracle on 64 bit Linux with x86_64 hardware.

The oddball combination folks don’t talk about is 32 bit Oracle running on 64 bit Linux, because Oracle doesn’t support it. I do test that, however, by installing Oracle on a 32 bit server and then NFS-mounting that ORACLE_HOME over on a 64 bit Linux system. Discussing this combination would therefore be moot due to the support aspect.

The most interesting comparison would be between 1 and 2 above provided both systems have precisely the same core clock speed, L2 cache and I/O subsystem. As such, the comparison would come down to how well the 32-bit optimized code is treated by the larger cache line size. There would, of course, be other factors since there are several million more transistors in an EM64T processor than a Pentium IV and other fundamental improvements. I have wished I could make that comparison though. The workload of choice would be one that “fits” in a 32 bit environment (e.g., 1GB SGA, 1GB total PGA) and therefore doesn’t necessarily benefit from 64 bitness.

If anyone were to ask me, I’d say go with 3 above. Oracle on x86_64 Linux is not new.

Bitness
In my recent blog entry about old software configurations in an Oracle over NFS situation, I took my well-deserved swipes at pre-2.6 Linux kernels. Perhaps the most frightening one in my experience was 32-bit RHEL 3. That whole 4/4 split kernel thing was a nightmare—unless you like systems that routinely lock up. But all told I was taking swipes at pre-2.6 kernels without regard for bitness. So long as the 2.6 kernel is on the table, the question of bitness is not necessarily so cut and dried.

In my recent blog entry about Tim Hall’s excellent step-by-step guide for RAC on NFS, a reader shared a very interesting situation he has gone through:

I have a Manly Man question for you. This Manly Man Wanna Be (MMWB) runs a 2-node 10g RAC on Dell 2850s with 2 dual-core Xeon CPUs (total of 4 CPUs). Each server has 16 GB of memory. While MMWB was installing this last year, he struggled mightily with 64-bit RAC on 64-bit Red Hat Linux 4.0. MMWB finally got it working after learning a lot of things about RPMs and such.

However, Boss Of Manly Man Wanna Be (BOMMWB) was nervous about 64-bit being “new,” and all of the difficulties that MMWB had with it, so we reinstalled with 32-bit RAC running on 32-bit Red Hat Linux 4.0.

My naturally petulant reaction would have been to focus on the comment about 64-bit being “new.” I’m glad I didn’t fire off. This topic deserves better treatment.

While I disagree with Boss of Manly Man’s assertion that 64-bit is “new”, I can’t take arms against the fact that this site measured different levels of pain when installing the same release of Oracle on the same release of RHEL4, varying only the bitness. It is unfortunate that this site has committed itself to a 32 bit database based solely upon their experiences during the installation. Yes, the x86_64 install of 10gR2 requires a bit more massaging of the platform vis-à-vis RPMs. In fact, I made a blog entry about 32 bit libraries required on 64 bit RHEL4. While there may occasionally be more headaches during an x86_64 install than an x86 install, I would not deploy Oracle on a 32 bit operating system today unless there was a gun held to my head. All is not lost for this site, however. The database they created with 32 bit Oracle is perfectly usable in place with 64 bit Oracle after a simple dictionary upgrade procedure documented in the Metalink note entitled Changing between 32-bit and 64-bit Word Sizes (ML 62290.1).

Has Anyone Ever Tested This Stuff?
I have…a lot! But I really doubt we are talking about running 32 bit Oracle on 32 bit hardware. Nobody even makes a native 32 bit x86 server these days (that I know of). I think the question at hand is more about 32 bit Oracle on 32 bit Linux with x86_64 hardware.

There has always been the big question about what 64 bit software performance is like when the workload possesses no characteristics that would naturally benefit from the larger address space. For instance, what about a small number of users attached to a 1GB SGA where the total PGA footprint is no more than 1GB? That’s a workload that doesn’t need 64 bit. Moreover, what if the comparison is between 32 bit and 64 bit software running on the same server (e.g., AMD Opteron)? In that case, the question gets more interesting. After all, the processor caches are the same, the memory-to-processor bandwidth is constant, and the drivers can all DMA just fine. The answer is an emphatic yes! But yes, what? Yes, there are occasions where 64 bit code will dramatically outperform 32 bit code on dual-personality 64 bit processors (e.g., AMD Opteron). It is all about porting. Let me explain.

The problem with running sophisticated 32 bit code on 64 bit processors is that the 32 bit code was most likely ported with a different processor cache line size in mind. This is important:

Native 32 bit x86 processors use a 32 byte cache line size whereas 64 bit processors (e.g., AMD64, EM64T) use a 64 byte cache line.
That means, in the case of a native 32 bit processor, load/store and coherency operations are performed on a swath of 32 bytes. Yes, there were exceptions like the Sequent NUMA-Q 2000, which had two different cache line sizes, but that was a prior life for me. Understanding cache line size and how it affects coherency operations is key to database throughput. And unlike Microsoft, who never had to do the hard work of porting (IA64 notwithstanding), Oracle pays very close attention to this topic. In the case of x86 Linux Oracle, the porting teams presumed the code was going to run on native 32 bit processors, a reasonable presumption.

What Aspects of Bitness Really Matter?
The area of the server that this topic impacts the most (by far) is latching. Sure, you use the database server to manage your data, and the accesses to your data seem quite frequent to you (thousands of accesses per second), but that pales in comparison to the rate at which system memory is accessed for latches. Those operations occur on the order of millions of times per second. Moreover, accesses to latches are write-intensive and painfully contended across multiple CPUs, which results in a tremendous amount of bandwidth consumed by cache coherency traffic. Spinlocks (latches) require attention to detail, period. Just sum up the misses and gets on all the latches during a processor-saturated workload sometime and you’ll see what I mean. What does this have to do with 32 versus 64 bit?

It’s All About The Port
At porting time, Oracle pays close attention to ensure that latch structures fit within cache lines in a manner that eliminates false sharing. Remember, processors don’t really read or write single words. They read/write or invalidate the entire line that a given word resides in, at least when processor-to-memory operations occur. Imagine, therefore, a latch structure that is, say, 120 bytes long and that the actual latch word is the first element of the structure. Next, imagine that there are only 2 latches in our imaginary server and we are running on a 32 bit OS on a native 32 bit system such as Pentium IV or Xeon MP (Foster 32 bit), and therefore a 32 byte cache line size. We allocate and initialize our 2 structures at instance startup. These structures will lay out in 240 bytes within a single memory page. Since we were dutiful enough to align our two structures on a page boundary, the first structure resides in the first 120 bytes of the memory page, spanning the first 4 32 byte cache lines. But wait, there are 8 unused bytes left in the 4th cache line. Doesn’t that mean the first 8 bytes of the second latch structure are going to share space in the 4th cache line? Not if you are good at porting. And in our example, we are.

That’s right, we were aware of our cache line size (32 bytes), so we padded the structure by allocating an array of unsigned integers (4 bytes each) two deep as the last element of our structure. Now our latch structure is precisely 128 bytes, or 4 lines. Finally, we have our imaginary 32 bit code optimized for a real 32 bit system (and therefore a 32 byte cache line size). That is, we have optimized our 32 bit software for our presumed platform, which is 32 bit hardware. Now, if half of the CPUs in the box are hammering the first latch, there is no false sharing with the second. What’s this got to do with hardware bitness? The answer lies in the fact that Oracle ports the x86 Linux release with a 32 bit system in mind.

Running 32 bit Code on a 64 bit CPU
The devil is in the details. Thus far our imaginary 2-latch system is optimized for hardware that operates on a 32 byte line. Since our structures fit within 4 32 byte lines, or 2 64 byte lines should we execute on an x86_64 system, there would be no false sharing, so we must also be safe on a system with a 64 byte line, no? Well, true, there will be no false sharing between the two structures, since they now occupy 2 64 byte lines as opposed to 4 32 byte lines, but there is more to it.
Do you think it’s possible that the actual latch word in the structure might be adjacent (within the same cache line) to anything interesting? Remember, heavily contended latches are constantly being tried for by processes on other CPUs. So if the holder of the latch writes to any other word in the cache line that holds the latch word, the processor coherency subsystem invalidates that line. To the other CPUs with processes spinning on the latch, this invalidation “looks” like the latch itself has been freed (changed from on to off), when in fact the latch is still held but an adjacent word in the same line was modified. This sort of madness absolutely thrashes a system. So, the dutiful port engineer rearranges the elements of the latch structure so that nothing else that gets written ever shares the cache line holding the actual latch word. But remember, we ported to a 32 bit system with a 32 byte line. If you run this code on a 64 bit system, and therefore 64 byte lines, all of your effort to isolate the latch word from other write-mostly words was for naught. That is, if the cache line is now 64 bytes, any write by the latch holder in the first 64 bytes of the structure will cause invalidations (cache thrashing) for other processes trying to acquire the latch (spinning) on other CPUs. This isn’t a false sharing issue between 2 structures, but it has about the same effect.

Difficult Choices – 32 bit Software Optimized for 64 bit Hardware.
What if the porting engineer of our imaginary 2-latch system were to somehow know that the majority of 32 bit Linux servers would some day end up being 64 bit servers compatible with 32 bit software? Well, then he’d surely pad out the structure so that there are no frequently written words in the same 64 bytes in which the latch word resides. If the latch structure we have to begin with is 120 bytes, the odds are quite slim that there are enough read-mostly words to fill out the first 64 bytes alongside the latch word. It’s a latch, folks; it is not a read-mostly object! So what to do? Vapor!

Let’s say our 120 byte latch structure is a simple set of 30 words, each being 4 bytes (remember, we are dealing with a 32 bit port here). Let’s say further that there are only 4 read-mostly words in the bunch. In our imaginary 2-latch example, we’d have to set up the structure so that the first word is the latch and the next 16 bytes are the 4 read-mostly elements. That accounts for the first 20 bytes of a 64 byte line. To optimize this 32 bit code for a 64 bit processor, we’ll have to pad the rest of that line out to 64 bytes with junk. So we’ll put an array of unsigned integers 11 deep (44 bytes) immediately after the latch word and our 4 read-mostly words. That fits nicely in 64 bytes, at the cost of wasting 44 bytes of processor cache for every single latch that comes through our processor caches. Think cache buffers chains, folks! We aren’t done though.

We started with 120 bytes (30 4 byte words) and have placed only 5 of those words into their own cache line. We have 25 words, or 100 bytes, left to deal with. Remember, we are the poor porting engineer who is doing an imaginary 32 bit software port optimized for 64 bit servers, since nobody makes 32 bit servers any more. So, we’ll let the first 64 bytes of the remaining 100 fall into their own line. That leaves 36 bytes that we’ll also have to pad out to 64 bytes; there goes another 28 bytes of vapor. All told, we started with 120 bytes and wound up allocating 192 bytes so that our code will perform optimally on a processor that uses a 64 byte cache line. That’s a 60% increase in the footprint we leave on our processor caches, which aren’t very large to start with. That’s how Oracle would have to optimize 32 bit Oracle for a 64 bit processor (x86 code on x86_64 kit). But they don’t, because that would have been crazy. After all, 32 bit Oracle was intended to run on 32 bit hardware.

Porting sounds easy? Let me throw this one in there. It just so happens that the Oracle latch structure was in fact 120 bytes in Oracle8i on certain ports. Oh, and lest I forget, remember that Oracle keeps track of latch misses. What’s that got to do with this? Uh, that means processes that do not hold the latch increment counters in the latch structure when they miss. Imagine having one of those miss count words in the same line as the latch word itself!

This is tricky stuff.
Who Uses 32 bit Linux for Oracle These Days?
Finally, bugler, sound Taps.

A thread on oracle-l the other day got me thinking. The thread was about the difficulties being endured at a particular Linux RAC site that prompted the DBA there to audit what RPMs he has on the system. It appears as though everything installed was a revision “as high or higher” based on Oracle’s documented requirements. In his request for information from the list, I noticed uname(1) output that suggested he is using a 32 bit RHEL 4 system.

One place I always check for configuration information is Oracle’s Validated Configurations web page. This page covers Linux recipes for installation success. I just looked there to see if there was any help I could offer that DBA and found that there are no 32 bit validated configurations!

I know there is a lot of 32 bit x86 hardware out there, but I doubt it is even possible to buy one today. Except for training or testing purposes I just can’t muster a reason to even use 32 bit Linux servers for the database tier at this point and to be honest, running a 32 bit port of Oracle on an x86_64 processor makes very little sense to me as well.

Manly Men Only Deploy Oracle with Fibre Channel – Part VII. A Very Helpful Step-by-Step RAC Install Guide for NFS

Tim Hall has stepped up to the plate to document a step-by-step recipe for setting up Oracle10g RAC on NFS mounts. In Tim’s blog entry, he points out that for testing and training purposes it is true that you can simply export some Ext3 filesystem from a Linux server and use it for all things Oracle. Tim only had 2 systems, so what he did was use one of the servers as the NFS server. The NFS server exported a filesystem and both the servers mounted the filesystem. In this model, you have 2 NFS clients and one is acting as both an NFS client and an NFS server.

This is the link to Tim’s excellent step-by-step guide.

How Simple

If you’ve ever had a difficult time getting RAC going, I think you’d be more than happy with how simple it is with NFS; using Tim’s guide and a couple of low-end test servers would prove that out.

Recently I blogged about the fact that most RAC difficulties are in fact storage difficulties. That is not the case with NFS/NAS.

Thanks Tim!


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.


Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.