Archive Page 27

Hard Drives Are Arcane Technology. So Why Can’t I Realize Their Full Bandwidth Potential?

We are all at different skill and technical sophistication levels, so this post will look like fluff to some readers. It is just a quick post depicting a SAN configuration that wouldn’t do anyone any favors in an Oracle environment.

I showed some snapshots of the lab gear at my disposal in this blog entry. There are a lot of SANs in the lab here, but I can’t say which brand of SAN array I’m blogging about today, and honestly, this problem is not necessarily rooted in the architecture of the particular SAN array controller in question. Yes, this is a low-end array, but the concepts I’m blogging about are relevant all the way through the high end. The following vmstat(1) output shows one of the problems I’m blogging about:

# vmstat 10
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 3  3      0 31577944  81948 653352    0    0    96   597   13    34  0  7 89  4
 5  4      0 31550968  82204 653096    0    0 73732 89913 2919 57992  0 10 60 30
 0  4      0 31576760  82332 652968    0    0 74555 98056 2942 55904  0 10 60 30
 1  6      0 31553824  82460 652840    0    0 80289 85870 2831 61435  0 10 63 28
 0  4      0 31576968  82460 652324    0    0 79466 85397 2860 60537  0 10 62 28
 8  0      0 31573368  82588 654260    0    0 41279 46733 1917 36035  0 32 53 15
 8  0      0 31572896  82588 653744    0    0 11613 13344 1239 13387  0 49 46  5
 5  0      0 31572528  82588 654260    0    0  8875  9039 1193 10727  0 47 49  4
 7  0      0 31572280  82588 653744    0    0  6559  7057 1146  9823  0 47 49  4
 6  0      0 31572176  82588 653744    0    0  6556  7048 1156  9121  0 50 47  3
 4  0      0 31572072  82588 654260    0    0  5736  5956 1136  7955  0 47 50  3
 6  0      0 31571576  82588 654260    0    0  4205  5529 1126  7851  0 47 51  2
 8  0      0 31571584  82588 654260    0    0  5960  5264 1132  7982  0 48 49  3
 7  0      0 31570896  82652 654712    0    0  4239  5167 1119  8086  0 47 51  2
11  0      0 31569968  82652 654712    0    0  5116  4995 1124  7437  0 55 43  2
 7  0      0 31570184  82684 653648    0    0  4782  5082 1125  7880  0 54 44  2

This vmstat(1) output shows that Oracle was getting upwards of 165MB/s (via a single 2Gb FCP path) of combined read/write throughput (bi and bo are 1KB blocks per second, so roughly 74,000 + 90,000 blocks/s works out to about 165MB/s) until the array cache reached its saturation point. At that point, the throughput of the array was relegated to the bandwidth of the spindles being accessed.

It’s All About Balance
I have a pretty simple view of storage and it is worth quoting myself:

Physical I/O is a necessary evil. Bottlenecking downwind of the SAN array cache is silly. Bottlenecking above the disks is foolishness.

Bottlenecking The Array (Silliness)
Configuring LUNs with insufficient spindle count to handle the cache miss (read-through) and write-back overhead is what I call bottlenecking the SAN array. This sort of bottleneck is strictly a configuration issue and therefore falls into the silliness category. The vmstat(1) output above is an example of bottlenecking the array. That is, the array could certainly deliver more throughput, but the spindle count was holding it back. Again, I’m not going to say which vendor’s SAN array controller it was because, as it turns out, any of them will act this way under these circumstances. I had a LUN of some 500GB that consisted of very few spindles, so it was no surprise to me that the throughput was lousy. This LUN had to be configured this way due to other constraints on the array’s usage (read: Kevin had to share some hardware). I got to live the pain I blogged about recently in this post about capacity versus spindle count. I could configure a more reasonable number of drives and get around this performance problem, but only up to the point where the tables turn and the array starts to bottleneck the disks.

Bottlenecking The Disks (Foolishness)
What do I mean by this? Well, most modern SAN arrays will bottleneck long before your application realizes the full bandwidth potential of all the drives the array can support. I call this bottlenecking the disks and it falls into the foolishness category. This is basic foolishness on the SAN array vendor’s part. Why build storage arrays with a horrible imbalance between capacity and performance?

Let me go over a typical case of a SAN array that bottlenecks the disks. Here is a link to some technical detail of a high end array. This particular array supports roughly 140TB of capacity when configured with the maximum spindle count (1024) of 146GB drives. On page 3 of the document, the vendor gives the first bandwidth-related figure by citing internal bandwidth of 15GB/s. What does that mean to the servers connected to this array? Reading further (pg. 15), the document states that there is some 10.6GB/s of aggregate data bandwidth and 5.3GB/s of bandwidth for control. This means that if you get all the king’s horses and all the king’s men to work it out, you could feed 10.6GB/s of data to your servers. Indeed, 10.6GB/s is a great deal of bandwidth. So what am I talking about? Taking another look at page 15, we see the vendor claims that a single 146GB drive can deliver 99.9MB/s (max theoretical). If you wanted to drive all these spindles at full throttle, they would theoretically deliver 99.9GB/s (1024 disks X 99.9MB/s), which is far more than the maximum theoretical data bandwidth of the array. In fact, if you drove “only” about 108 of these spindles at full throttle you’d saturate the array. I quoted the word only because 108 is a lot of disks, but that is only about 10% of what the array supports.
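
To make the arithmetic explicit, using the vendor’s own numbers:

10.6 GB/s x 1024 MB/GB = ~10,854 MB/s of aggregate data bandwidth
10,854 MB/s / 99.9 MB/s per drive = ~108 drives to saturate the array
108 / 1024 supported drives = ~10.5% of the maximum spindle count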

If Oracle needs to do a physical I/O, let’s not bottleneck somewhere in the pipe! Think about it this way: in the technology stack I’m discussing (Oracle, high-end SAN array, etc.), hard drive technology represents the simplest and least advanced component. That is, while hard drives are faster than they were 10 years ago, they have not fundamentally changed. They are still round and brown and they spin. Wouldn’t you think homely old disk drives would be the bane of performance? They aren’t. If we could drive all our disks at their maximum throughput, we’d be in a much better place performance-wise.

Summary
Hard drives are miserably low-tech necessary evils. Will we ever get a storage architecture where the more sophisticated components don’t make matters worse? Yes, we will. I’ll tell you all more as time passes. In the meantime, I bet dollars to donuts that the paltry 128 drives used in the TPC-H benchmark I blogged about were being driven at or near full bandwidth. That, my dear readers, is cool.

It’s Not a Record, But It Is The Most Interesting TPC-H Result! No Manly Man SAN Here.

Not a Record, But It Is In My Book
Not a throughput record that is. This new result at 39,614 QphH comes within 2% of Oracle’s previous top ranking in the 300GB scale. Why would Oracle publish a result that doesn’t even beat their own previous result? Price performance is the answer. I also find the deployment architecture to be very interesting.

Oracle still holds the top ranking at 300GB with their December 2006 result of 40,411 QphH at a price/performance of $18.67/QphH. This new result, however, comes in at only $12.57/QphH, a 33% cost savings for essentially the same performance! What did Oracle do to reduce the cost? Did they use fewer RAC nodes to reduce their own licensing cost? Did they use a better, cheaper SAN? Did they use a cheaper, faster RAC interconnect? No, none of that. In fact:

  • Both results used an 8 node cluster of blade servers
  • Both results used Infiniband for the RAC cluster interconnect
  • Both results used “networked storage”

So what gives?

No SAN.
The December 2006 result of 40,411 QphH used a $393,765 configuration consisting of some 448 disks in a SAN with 32 MSA 1000 controllers (not including spares). This new result, on the other hand, did not use a SAN at all. Instead, there were 8 blade servers (the RAC cluster) attached to 16 storage blades. The storage blades were running RHEL5 and presenting block storage via SRP over Infiniband to the 8 node RAC cluster. What’s that? Yes, Infiniband: a single, unified connectivity model for both RAC interconnect and storage. With this type of storage, Oracle was able to drive the same performance (yeah, I know, within 2%) with only 128 hard drives. All told, the storage configuration was priced at $141,158, a healthy 64% reduction in storage cost!
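
For those keeping score, the storage arithmetic is simply ($393,765 - $141,158) / $393,765, which works out to roughly 64%.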

This is a very, very good day, but not for Manly Man.

I couldn’t be happier!

Another Thought About Oracle Database 11g SecureFiles.

Well, you can close your eyes or just not read the blog, but you are about to start hearing me harp more and more on the Oracle11g SecureFiles feature. I’m about to start some high-end testing with an eye on the performance delta between storing unstructured data in Oracle Database 11g with the SecureFiles feature versus storing it in a traditional filesystem.

I was reading StorageMojo.com today where Robin was discussing the state of commodity filesystems and their (in)ability to properly deal with certain storage failures. The story seems even more grim than I was aware. I recommend you read the IRON paper and think about those LOBs. Oracle can do this stuff so much better.

Simply put, Oracle is not going to corrupt your data, structured or unstructured. Oracle offers features and options ranging from block checksumming to HARD and Data Guard and a myriad of other features and tools to help ensure that this claim is true.

It’s worth thinking about. Besides, if you put your unstructured data in the filesystem it is “owned” by the system administrators. If you use SecureFiles, it’s yours!

Worth thinking about?

Learn How To Obliterate Processor Caches: Configure Lots and Lots of DBWR Processes. Part I.

Oracle8 introduced true multiple DBWR processes configured with the db_writer_processes parameter. I remember that the maximum value one could set this to back then was 10. Oracle8i raised that limit, as I recall. These are processes that build their own write batches (scan LRUs, modify cache buffers chains structs, etc.) and perform their own I/O (synchronously or asynchronously). They also post (e.g., semop(2) on most platforms) processes waiting on free buffer waits. Originally, multiple true DBWR processes were implemented to support NUMA systems that needed processes local to memory and I/O cards to flush buffers, thus eliminating remote I/O operations.

On systems with flat memory systems (non-NUMA) the only reason to implement multiple DBWR processes is if a single DBWR is processor bound. Such a condition is simple to detect because your free buffer waits will shoot through the roof and a cursory glance at top(1) will show DBWR pegged to the top.

I was just giving this NetApp paper a quick read and found the following absolute rule on page 9:

Async I/O is now recommended on all the storage protocols. The recommended value for db_writers_processes is to at least match the number of processor cores on the system.

Besides the fact that the author misspelled the initialization parameter db_writer_processes, this is not very good advice at all. I wouldn’t be as critical if they had recommended a DBWR process per socket bound via taskset(1), but one per core just floating around is going to impact L2 caches. The way this works is that the cache buffers LRUs are split amongst the DBWR processes. Each DBWR process will maintain a set of LRUs, so even with a light amount of flushing, the work will be spread across the DBWR processes. Here is an absolute: Never break up a workload that can be satisfied with a single processor and force it to be distributed across multiple CPUs. When you do that you simply increase the scheduling overhead and increase the amount of work that is done cache-cold (L2). That is, lightly tasked DBWR processes are more likely to get scheduled on CPUs they haven’t recently run on, and thus they are cache-cold.
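
To illustrate the per-socket alternative, here is a hypothetical sketch. The SID (PROD), the process names and the core numbering are all assumptions for the example; it presumes a 2-socket box where cores 0-3 share socket 0 and cores 4-7 share socket 1:

$ DBW0=$(ps -ef | awk '/ora_dbw0_PROD/ && !/awk/ { print $2 }')
$ DBW1=$(ps -ef | awk '/ora_dbw1_PROD/ && !/awk/ { print $2 }')
$ # confine each writer to one socket's cores so it stays L2-warm
$ sudo taskset -pc 0-3 $DBW0
$ sudo taskset -pc 4-7 $DBW1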

I don’t expect anyone to just take my word for it, although it would be nice if at least some did. I do get tired of proving facts that date back to the mid-1990s. Ugh, that sounded grumpy, sorry. I’ll see if I can’t push a run through and provide some performance data comparing the pathological setup (1 per core, no affinity) against everything from 1 process up through 1-per-core with affinity. I haven’t done that in a while. It’ll be fun...yippie.

Summary
If anyone gives you an absolute, ignore it. That includes me. As sure as I’m sitting here I will get at least two emails on this topic. One will tell me that they increased db_writer_processes and “got a performance boost.” The other will tell me to remember that sometimes man bites dog.

Fundamentally, however, a single DBWR process on a flat-memory SMP given kernel async I/O, ample scheduling priority, non-preemption on ports that support it, and ample CPU cycles will be able to keep the SGA clean. If any of these criteria are not met then throwing another DBWR at the situation should help, but that is entirely different than starting with 1 per core.

Nearly Free or Not, GridSQL for EnterpriseDB is Simply Better Than Real Application Clusters. It is Shared-Nothing Architecture After All!

According to this businesswire.com piece, EnterpriseDB made quite the splash at LinuxWorld San Francisco 2007 by introducing a clustered version of EnterpriseDB called GridSQL for EnterpriseDB. It’s no surprise that a product based on open source would get honorary mention at LinuxWorld, but I don’t care about that.

What I’m blogging about is the fact that I’m already seeing blogs and press pieces that clump GridSQL in with Oracle Real Application Clusters (RAC). What’s the problem? The reasoning seems to be that since both products use a cluster, they must be birds of a feather.

Po-tay-toe / Po-tah-toe
Oracle Real Application Clusters is a shared-disk architecture. This GridSQL product is yet another shared-nothing implementation, as is IBM’s DB2 UDB EEE (cousin of SP2). Mainframe DB2 is shared-everything of course. Also in the shared-nothing camp are: Teradata, Informix XPS, Microsoft SQL Server with Distributed Partitioned Views (Egad!), Sybase Navigation Server (there’s a blast from the past), Greenplum and the hardware niche guys like DATAllegro and Netezza. Ingres r3 takes a shared-everything approach like Oracle RAC and mainframe DB2, but I’m not sure what those folks are up to these days.

Everyone Else Is Doing It
Simply put, the fastest way to get a clustered database out the door is to implement shared-nothing. Take your regular database engine, throw some replication and data shipping in there and poof, you have a shared-nothing database. I think everyone got the memo back in 1986 when Michael Stonebraker put his foot down and said that shared-nothing was the way to go, in spite of the fact that shared-nothing architectures include the necessary evil of data shipping. That Stonebraker paper was written a long time ago and most of the data points in it are OLTP-minded. For instance, I’ll quote Stonebraker:

Consider the number of messages which an SN [shared nothing] system must incur in a typical high transaction processing environment. The example consists of a data base with N objects subject to a load consisting entirely of transactions containing exactly k commands, each affecting only one record. (For TP1 the value of k is 4). For any partitioning of the data base, these k commands remain single-site commands. Suppose that there exists a partitioning of the data base into non-overlapping collections of objects such that all transactions are locally sufficient [WONG83]. Such a data base problem will be termed delightful. Most data base applications are nearly delightful. For example, the TP1 in [ANON84] has 85% delightful transactions.

Using TP1 (a batch mode ATM transaction workload) as a case in point for the claim that most applications will get along nicely (thus the term delightful) in a shared-nothing database! Wow, that was then and this is now. Folks, today’s applications are built on large numbers of tables and complex joins. The reason shared-nothing is nothing like RAC is because instead of only shipping functions (or tasks) and lock messages to the clustered nodes, as is the case with RAC, shared-nothing requires the shipping of data. And as soon as you have a problem with too much data shipping you are required to reload and repartition your database to mitigate the problem. Try getting your partitioning right with an application that has 1,000 tables and 1,300 indexes. Ever chew on crushed glass?

OK, let me try it this way. Let’s say shared-nothing is in fact the best approach for data warehousing. Greenplum puts it this way:

Most of today’s general-purpose relational database management systems are designed for Online Transaction Processing (OLTP) applications. By default, business intelligence applications have inherited this less than optimal architecture. The reality is that BI workloads are fundamentally different from OLTP transaction workloads and therefore require a profoundly different architecture.

So in this view it is one or the other. DSS or OLTP, you choose.

So humor me for a moment. Let’s say for instance that a shared-nothing architecture product on the same hardware outperforms RAC by a flat 50% across all measurements. Nifty! Now let me ask, where is your OLTP? If some shared-nothing product does in fact outperform RAC for data warehousing, that benefit gets canceled out by the fact that the same shared-nothing architecture cannot do OLTP at all. At least not ERP. I won’t even touch the fact that most ERP apps are a bit of an amalgam of both OLTP and DSS-style accesses.

If anyone was going to have phenomenal success with shared nothing it would be IBM with DB2 UDB EEE given its long heritage (SP2) of shared nothing. What do the top 10 audited clustered TPC-H results tell us?

  1. 100GB Scale. Oracle chooses not to put much effort into this scale of the benchmark. Neither does IBM, really, since their last entry there was a 2003 result.
  2. 300GB Scale. Of the top 10, Oracle holds spots 1-6. In fact, Oracle’s RAC result in the 6th position has stood since September 2005, so IBM’s shared-nothing product has had ample time to trump at least that result. Instead, the most recent DB2 result at this scale was in 2004, which oddly didn’t even beat their own prior 2003 result in position 7 of the top ten! How bizarre, especially for a fundamentally better architecture for DSS-style workloads! Oracle has held the top spot in this scale since December 2006 so there is clearly no leap-frogging going on. The 300GB scale is a clear lock-down on the part of RAC.
  3. 1000GB Scale. Here again RAC has held the top position for many months, 10 months to be exact. That is plenty of time for any product of superior architecture to advance in the list. Oracle holds the number 3 slot at this scale as well.
  4. 3000GB Scale. Oracle RAC took the top slot at this scale just 2 months ago with a result that more than doubles the IBM DB2 UDB number at only 15% higher cost!
  5. 10000GB Scale. Here IBM holds the top spot, out of only two entries. Two years elapsed from the time Oracle entered this scale of the benchmark before IBM took the lead.

Based upon points 2-4 in the list above, it seems shared-nothing is not somehow inherently superior to shared-disk architecture, even for its supposedly preferred workload, which is warehousing!


Using Linux Processor Affinity To Beat The Oracle Licensing Police.

In my recent blog entry about LGWR processing I set up a proof case that included establishing hard processor affinity for LGWR and other processes in order to prove a point. I’ve received a few emails about what tool I used to affinity running processes to specific CPUs. The answer is not what you might presume. I did not use the taskset(1) command that ships with 2.6 kernel distributions because I’ve seen some irregularities with that tool in the past. I prefer to use a bit of my own code that uses sched_setaffinity(2). I’d provide that code as a reference but it is a bit of a mess and does a lot of other stuff that doesn’t pertain to affinity.

The recent Oracle single socket Xeon “Clovertown” TPC-C used hard affinity for various purposes. The FDR pointed to by that URL has a hard affinity program (affinity.c) listing that is very succinct. If you want to give hard affinity a play I recommend you cut and paste from that FDR, compile and give it a go. For instance, presume you have compiled affinity.c and wish to round up all the processes for a specific $ORACLE_SID and bind them to CPU 0; you could do something like the following. Note, unless the oracle user has the CAP_SYS_NICE capability, you’ll have to make the affinity program setuid root in order to play with sched_setaffinity(2). Such as:

$ cc -o aff affinity.c
$ sudo chown root aff       # setuid only means setuid root if root owns the file
$ sudo chmod 6755 aff
$ TARGET_SID=PROD
$ ps -ef | grep ${TARGET_SID} | grep -v grep | awk '{ print $2 }' | while read PID
do
	./aff $PID 0
done
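
If you’d rather see the shape of such a program before digging through the FDR, here is a minimal sketch. To be clear, this is not the affinity.c listing from the FDR, just my from-memory approximation of what a single-CPU binder looks like on a 2.6 kernel:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* usage: aff <pid> <cpu> -- bind <pid> to the single CPU <cpu> */
int main(int argc, char *argv[])
{
	cpu_set_t mask;

	if (argc != 3) {
		fprintf(stderr, "usage: %s pid cpu\n", argv[0]);
		exit(1);
	}

	CPU_ZERO(&mask);
	CPU_SET(atoi(argv[2]), &mask);

	/* fails with EPERM unless root (or CAP_SYS_NICE) when
	   targeting another user's process */
	if (sched_setaffinity((pid_t)atoi(argv[1]), sizeof(mask), &mask) != 0) {
		perror("sched_setaffinity");
		exit(1);
	}
	return 0;
}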

Oracle Licensing Again
I like nixCraft as a resource, but not when it comes to giving advice about Oracle licensing. What’s that got to do with hard affinity? I aimed to offer you readers more information about Linux hard affinity (note to self, blog sometime about the difference between hard and soft affinity if a reader ever asks) so I found this nixCraft article. It does offer more information about hard affinity scheduling, but oh so much more as well. In that page, the author is quoting an unspecified source as follows (quoted herein verbatim):

The scheduler attempts to keep processes on the same CPU as long as practical for performance reasons. Therefore, forcing a specific CPU affinity is useful only in certain applications. For example, application such as Oracle (ERP apps) use # of cpus per instance licensed. You can bound Oracle to specific CPU to avoid license problem. This is a really useful on large server having 4 or 8 CPUS

This quote is talking about how modern *nix schedulers implement soft affinity (aka cache affinity) and thus handcrafting hard affinity is something better left to advanced situations. That’s fine and dandy, but pardon me for a moment because I’m choking on a hairball over that bit about ERP licensing vis-à-vis CPU count. That is a complete farce. Please do not try to negotiate a 2 CPU license on an 8 core system where you plan to pin Oracle E-Business Suite processes to the first 2 cores of the system using sched_setaffinity(2). I know I’ve provided a quote of a quote, but that idea is about as brain-dead as they come. It would work, but in a license audit it would not fly.

More About Hard Affinity
Be aware that if you run 100% of a given workload under hard affinity to CPU 0, for instance, that is not the same as running on a single CPU box. The other processors are still online and servicing device interrupts. For instance, if LGWR is running on CPU 0, the hardware interrupt for any given log file parallel write can be serviced by any of the online CPUs in the box. If, for instance, LGWR is pinned to CPU 0 and a LGWR flush I/O interrupt is handled by, say, CPU 3, you are getting both a blessing and a curse. The blessing is that LGWR’s CPU didn’t have to handle the noise of the interrupt. The curse is that your workload is getting “help” from CPUs you didn’t really want in the mix, that is, if you are hoping for total isolation of the workload on CPU 0. This is not the same as the sophisticated OS partitioning offerings out there that corral even the hardware interrupts to the CPUs in the partition. That is an entirely different topic.
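
If you want to watch this happen, /proc/interrupts shows per-CPU interrupt counts, and (kernel permitting) /proc/irq/<N>/smp_affinity lets you steer a given IRQ. The IRQ number below is hypothetical:

$ cat /proc/interrupts                  # find the HBA's row; counts climb on every online CPU
# echo 1 > /proc/irq/98/smp_affinity    # as root: confine hypothetical IRQ 98 to CPU 0 (hex mask)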

Summary
No, binding a workload to CPUs with sched_setaffinity(2) is not the same as running on the same number of discrete processors. Whoever authored the words “You can bound Oracle to specific CPU to avoid license problem” is likely neither an English professor nor a contract lawyer. I’d ignore that “advice.” I wonder what “problem” the author was referring to? Is legal use of a software product a “problem?” All that aside, hard affinity can play an important role in affording ample processor bandwidth to key Oracle background processes. There were musings in the comment thread on my LGWR processing post about dedicating an entire CPU to LGWR; although I have not followed up there, that is certainly not the approach. I’ll need to blog about a possible best practice in this concept area soon.

Oracle’s Latest Filesystem Offering. Shades of AdvFS? I Want My OLT!

In this press release, Oracle has started making the existence of the Oracle Linux Test Kit a bit more widely known. I’ve been playing with this kit for a little while now and planned to blog my experiences. It has test components that work against Oracle instances. I’ll blog about my findings as soon as I can.

Long Live AdvFS?
The press release also announces Oracle’s new open source file system called Btrfs. If it was just another file system I wouldn’t mention it in my blog, but the project page mentions plans for features that I think are absolutely critical such as:

  1. Space efficient packing of small files
  2. Writable snapshots
  3. Object level mirroring and striping
  4. Strong integration with device mapper for multiple device support
  5. Efficient incremental backup and FS mirroring

This thing sure smells like a reincarnation of AdvFS, but what do I know?

The project page states that the developers are not even looking into database workloads at the moment, but since it is an extent-based filesystem I see no reason it wouldn’t work quite well with Oracle databases. I’m most keen on the features listed above, though, because they go a long way toward making life better in production Oracle environments. First, number 1 on the list is good handling of small files. This will be a boon for the handling of trace files, logs and a lot of the other ancillary file creation that goes on external to the database. Small files (at least myriads of them) are the bane of most filesystems.

Next on the list are clones, or as the project page calls them, writable snapshots. Veritas has offered these for quite some time and they are a very nice feature for creating working copies of databases that are current up to the moment you create the snapshot. Creating a snapshot doesn’t impact performance and a good implementation of clones will not impact the real database even if you are doing a reasonable amount of writes on the clone. Also very cool!

Then there is number 3 on the list: object-level mirroring and striping. When all good filesystems “grow up” they offer this sort of feature. Being able to implement software RAID on a per-file basis is the ultimate cool and I can’t wait to give this one a play! Being able to hand-pick the Oracle files, or general purpose files, upon which to apply varying software RAID characteristics is a very nice feature.

Number 4 on the list is Device Mapper support. I’ve ranted on my blog and other forums about Linux handling of device names for quite some time and DM goes a long way toward addressing the pet peeves. Seeing this filesystem exploit DM at the design level is a good thing.

Finally, number 5 above suggests good support for filesystem mirroring. This technology has proven itself useful for Veritas and NetApp customers, as well as others. It’s good to see it in Btrfs.

Clustered Btrfs?
Some time back I made a blog entry about ZFS and discussed the notion of “cluster butter.” That notion refers to the idea that you can take any filesystem and “clusterize” it. Generally, adding cluster support after the design phase does not produce much of a cluster filesystem. So, I intend to dive in to see what underpinnings there are in Btrfs for future cluster support.

Can’t Get Oracle Off My Mind.

I promise I’ll explain soon the reason I couldn’t get Oracle off my mind during my vacation, in spite of the scenery and activities. This is the time of year I do some fairly aggressive hiking in preparation for my annual hunting trip. Hopefully mere mention of such a seemingly controversial activity will not alienate any readers. Hey, it’s my blog and I’m one of those folks that knows food comes from somewhere, even before it shows up in a market. Anyway, I thought I’d post a few photos. Enjoy!

BLOG UPDATE: Several of you readers asked to see the full size versions of these shots. Here is a link: Photos on Flickr


Don’t Forget ARCH When Pondering Redo and Throwing Partners Under The Bus For Fun, Not Profit.

In my recent post about LGWR processing, I used a title that fits my recurring Manly Men motif just for fun. The title of the post was Manly Men Only Use Solid State Disk for Redo Logging. The post then set out to show that LGWR performance issues are not always I/O related by digging into what a log file sync wait event really encompasses. Well, I’m learning to choose my words better, the hard way. Let me explain.

The purpose of building the test case for that post using solid state disk (SSD) for redo log files was to establish the fact that even when redo logging I/O is essentially free, it is quite possible to see long duration log file sync wait events. I also mentioned that I did one pass of the test with the initialization parameter _disable_logging set to true and achieved the same results as the SSD run. That establishes the fact that SSD for redo logging is as fast as fast can be. The point of the exercise was to show that when the I/O component is not the root cause of long duration log file sync wait events, it would be foolish to “throw hardware at the problem.” What I failed to point out, in that post only, was that SSD for redo is most certainly a huge win if your system is doing any reasonable amount of redo. This test case only generates about 500 redo writes per second, so it is not a showcase example of when to apply SSD technology for redo. On the other hand, I would not have been able to make such a rich example of LGWR processing without solid state disk.

Biting The Hand That Feeds You
The good folks at Texas Memory Systems were nice enough to loan me the solid state disk that I happened to use in the LGWR processing post. I’m sad to report that, as per email I received on the topic, their reading of the LGWR processing post left the impression that I do not promote the use of solid state disk for redo. Nothing could be further from the truth. The problem is that I am still working on the tests for my latest proof case for solid state using a much higher-end configuration. Regular readers of this blog know that I have consistently promoted solid state disk for high-end redo logging situations. I have been doing so since about 2000 when I got my first solid state disk, an Imperial MegaRam. That technology could not even hold a candle to the Texas Memory Systems device I have in my lab now though.

Forget SSD for Redo Logging, Really?
Regular readers of this blog know my regard for Nuno Souto, but he left a comment on the LGWR processing thread that I have to disagree with. He wrote:

Folks should forget SSDs for logs: they are good to reduce seek and rotational delays, that is NOT the problem in 99% of the cases with redo logs. They might be much better off putting the indexes in the SSDs!

Noons is correct to point out that rotation and seek are generally not a problem with LGWR writes to traditional round-brown spinning thingies, but I’d like to take it a bit further. While it is true that given a great deal of tuning and knowledge of the workload it is possible to do a tremendous amount of redo logging to traditional disks, that is generally only possible in isolated test situations. The perfect case in point is the high end TPC-C results Oracle achieves on SMP hardware. The TPC-C workload generates a lot of redo and yet Oracle manages to achieve results in the millions of TpmC without solid state disk. That is because there are not as many variables with the TPC workload as there are in production ERP environments. And there is one other dirty little secret: log switches.

Real Life Logging. Don’t Forget Log Switches.
Neither the test case I used for the LGWR processing post nor any audited Oracle TPC-C run is conducted with the database in archive log mode. What? That’s right. The TPC-C specification stipulates that the total system price include disk capacity for a specific number of days’ worth of transaction logging, but it is not required that logs actually be kept, so the databases are never set up in archivelog mode.

One of the most costly aspects of redo logging is not LGWR’s log file parallel write, but instead the cost of ARCH spooling the inactive redo log to the archive log destination. When Oracle performs a log switch, LGWR and ARCH battle for bandwidth to the redo log files.

Noons points out in his comment that rotation and seek are not a problem for LGWR writes, which is generally true. However, all too often folks set up their redo logs on a single LUN. And although the single LUN may have many platters under it, LGWR and ARCH going head to head performing sequential writes and sequential large reads of blocks on the same spindles can introduce performance bottlenecks, significant performance bottlenecks. To make matters worse, it is very common to have many databases stored under one SAN array. Who is paying attention to religiously carve out LUNs for one database or the other based on logging requirements (both LGWR and ARCH)? Usually nobody.
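
One mitigation, short of solid state, is to alternate redo log groups across two discrete sets of spindles so that at switch time ARCH reads the just-filled group from one device while LGWR writes the current group on the other. A sketch, with hypothetical paths and sizes:

SQL> ALTER DATABASE ADD LOGFILE GROUP 1 ('/u02/redo/log_g1.rdo') SIZE 512M;
SQL> ALTER DATABASE ADD LOGFILE GROUP 2 ('/u03/redo/log_g2.rdo') SIZE 512M;
SQL> ALTER DATABASE ADD LOGFILE GROUP 3 ('/u02/redo/log_g3.rdo') SIZE 512M;
SQL> ALTER DATABASE ADD LOGFILE GROUP 4 ('/u03/redo/log_g4.rdo') SIZE 512M;

With odd groups on one LUN and even groups on the other, LGWR and ARCH never contend for the same spindles at a log switch, provided /u02 and /u03 really are discrete sets of drives.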

The thing I like the most about solid state for redo logging is that it neutralizes concern for both sides of the redo equation (LGWR and ARCH) and does so regardless of how many databases you have. Moreover, if you log to solid state to start with, you don’t have to worry about which databases are going to become logging-critical because all the databases get zero-cost LGWR writes and log switches.

Storage Level Caches
Simply put, redo logging wreaks havoc on SAN and NAS caches. Think about it this way: a cache is important for revisiting data. Although DBAs care for redo logs and archived redo logs with religious fervor, very few actually want to revisit that data, at least not for the sake of rolling a database forward (after a database restore). So if redo logs are just sort of a necessary evil, why allow redo I/O to punish your storage cache? I realize that some storage cache technology out there supports tuning to omit the caching of sequential reads or writes and other such tricks, but folks, how much effort can you put into this? After all, we surely don’t want to eliminate all sequential read and write caching just because the ones LGWR and ARCH perform are trashing the storage cache. And if such tuning could be done on a per-LUN basis, does that really make things that much simpler? I don’t think so. The point I’m trying to make is that if you want to eliminate all facets of redo logging as a bottleneck, while offloading the storage caches, solid state redo is the way to go.

Summary
No, I don’t throw partners under the bus. I just can’t do total brain dumps in every blog entry, and yes, I like catchy blog titles. I just hope that people read the blog entry too.

Hey, I’m new to this blogging stuff after all.

Back From Vacation. When The Rubber Hits The Road…

I’ll spare you good readers the highlights. Here is a lowlight. This is what a 10-ply radial does under a 12,000 pound trailer when it suffers belt separation.  The first view isn’t too ugly:

[photo: blowout_2_small.jpg]

…but the following shot gets the point across. Ugh:

[photo: blowout_1_small.jpg]

So, now that I’m back home I need to haul the trailer over to the repair shop. When the tire blew it ripped some of the sheet metal skirting to shreds. Nifty!

And yes, it could have been much, much worse.

It’s good to be back. I’ll try to make a real useful blog entry tomorrow.

Manly Men Only Use Solid State Disk For Redo Logging. LGWR I/O is Simple, But Not LGWR Processing

Let’s look at some statspack data from an OLTP Order Entry benchmark run that achieved 133 transactions per second (TPS):

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
log file sync                                   68,965       1,117     16   64.6
db file sequential read                        159,712         241      2   14.0
CPU time                                                       163           9.4
log file switch (checkpoint incomplete)            164         146    890    8.4
db file parallel write                          19,651          31      2    1.8

And now let’s look at the stats from run number 2 of the same benchmark, same system, same disk setup, etc:

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
db file sequential read                        838,810       1,635      2   69.1
CPU time                                                       253          10.7
db file parallel write                         153,334         145      1    6.1
log file sync                                   93,100         143      2    6.0
log file parallel write                         93,417          73      1    3.1

Benchmark run number 2 achieved 200 TPS. That’s right, statspack number 1 and number 2 above are from the same system. The database files are stored in an HP EFS Clustered Gateway NAS device and accessed via NFS, and the redo logs are, get this, stored on a Texas Memory Systems Solid State Disk. Statspack 2 above was from a very busy system (a lot of cached SGA work) doing over 3,000 IOPS:

Load Profile                            Per Second       Per Transaction
~~~~~~~~~~~~                       ---------------       ---------------
                  Redo size:          1,138,491.64              7,166.65
              Logical reads:             14,640.16                 92.16
              Block changes:              5,538.83                 34.87
             Physical reads:              1,428.73                  8.99
            Physical writes:              1,742.27                 10.97
                 User calls:              2,352.99                 14.81

So if a) the system is the same, and b) LGWR is writing redo to the fastest disk possible, what is the story? What changed from statspack 1 to statspack 2 to make log file sync wait events jump 8-fold, with the associated 34% drop in throughput? More users? No. Nothing that simple. I’ll explain.

LGWR I/O? Who Cares: Simple Stuff
I see myriad information on blogs, forums and all those expert repositories about the I/O aspects of redo logging, but very little about LGWR processing. Yes, a portion of LGWR’s job is to flush the redo buffer to the online redo log, but there’s more to LGWR’s job than writing redo. In fact, I never could understand the fascination with LGWR’s I/O since it is about as simple as it gets: sequential writes to a single file of sizes ranging from 512 bytes to a port-defined maximum (usually 128KB but often 1MB). Yes, it gets a little more complex with multiplexed redo, but that is just concurrent operations of the exact same profile. The part that seems to be hard to grasp for folks is where asynchronous I/O comes into play.

LGWR Async I/O
If LGWR is clearing more than the port-defined maximum from the buffer, asynchronous I/O is important. Let’s consider an example. Let’s say LGWR has been posted to clear 1.75MB of redo from the buffer on a platform that establishes the LGWR I/O maximum to be 1MB. In this case, LGWR will issue a 1MB write asynchronously immediately followed by a 750KB asynchronous write. These writes are sequential. After the second I/O is in flight, LGWR starts to poll for the completions of the I/Os. So the question often becomes: what if LGWR is only flushing, say, 600KB instead of 1.75MB, isn’t asynchronous I/O crucial in that case as well? The answer is no, unless you are using multiplexed redo. If LGWR is flushing an amount less than the port-defined maximum, asynchronous I/O offers nothing for that specific flushing operation. See, LGWR is a lot different than DBWR. When DBWR has I/O in flight, it also has other stuff it needs to do, like building the next write batch and so forth. LGWR on the other hand has nothing to do when its I/Os are in flight. It polls for the completions and once the flushing operation is complete, it then takes action to post the foregrounds that are waiting in a log file sync wait event. So asynchronous I/O has no play in a LGWR buffer flush operation of an amount smaller than the port-defined maximum I/O size.
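
For the curious, here is roughly what that 1.75MB flush looks like in terms of Linux kernel async I/O calls. This is only an illustration of the I/O pattern using libaio and O_DIRECT against a scratch file (with the 750KB rounded up to 768KB for alignment), not Oracle code:

/* build: cc -o flush_sketch flush_sketch.c -laio */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb1, cb2, *ios[2] = { &cb1, &cb2 };
	struct io_event ev[2];
	void *buf;
	int fd;

	/* O_DIRECT wants aligned buffers; one 2MB buffer covers both writes */
	if (posix_memalign(&buf, 4096, 2 * 1024 * 1024))
		exit(1);

	fd = open("fake_redo.log", O_WRONLY | O_CREAT | O_DIRECT, 0644);
	if (fd < 0) { perror("open"); exit(1); }
	if (io_setup(8, &ctx) < 0) exit(1);

	/* two sequential writes, both in flight before we wait: 1MB at
	   offset 0 immediately followed by 768KB at the next offset */
	io_prep_pwrite(&cb1, fd, buf, 1048576, 0);
	io_prep_pwrite(&cb2, fd, (char *)buf + 1048576, 786432, 1048576);
	if (io_submit(ctx, 2, ios) != 2) exit(1);

	/* now poll for both completions; unlike DBWR, LGWR has nothing
	   better to do while its I/Os are in flight */
	if (io_getevents(ctx, 2, 2, ev, NULL) != 2) exit(1);

	/* ...a real log writer would now post the commit waiters */
	io_destroy(ctx);
	close(fd);
	return 0;
}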

Oh, I forgot to mention, LGWR will keep up to 4 I/Os in flight (I need to check if that is more in late releases) when clearing more than the maximum allowed in a single physical write. All in all, LGWR I/O is very simple. So why do session waits, such as log file sync, generate so much folklore? I don’t know, but I know the folklore isn’t helpful.

Throw A Solid State Disk At It
I love that one. Let’s see, writing to solid state disk is faster than round-brown spinning thingies, so if any aspect of LGWR’s duties seems out of sorts, let’s get some really, really fast disk. Don’t get me wrong, I’m a big fan of solid state disk, but throwing it at the wrong problem is not the right thing to do.

The most complex topic regarding LGWR is the log file sync wait event. Why? Because it has very little to do with I/O, that’s why. Sure I/O is a part of it, but there is more to it.

Backgrounder
LGWR spends its time in a work loop consisting of very little, really. Well, in contrast to DBWR that is. Remember, DBWR has to manipulate both cache buffers chains and LRUs, flush modified SGA buffers and post processes. While DBWR works on a large amount of metadata and buffers, LGWR does not. A glance at ps(1) output makes that clear. Notice in the following ps(1) output, LGWR manages to perform its tasks with a resident set size one-thirtieth the size of DBWR’s, and of course that ratio would lean much more to the DBWR side on a large SGA system since DBWR has to map all the buffers in the SGA over the life of the instance (the exception to this is the indirect data buffers feature). LGWR on the other hand has very little working set because it only tends to the redo buffer and will only exercise a few latches (redo allocation, copy, etc.).

$ ps -eF | egrep 'UID|lgwr|dbw' | grep -v grep
UID        PID  PPID  C    SZ  RSS PSR STIME TTY          TIME CMD
oracle   19062     1  1 377082 851808 0 16:11 ?       00:00:08 ora_dbw0_bench1
oracle   19064     1  2 379943 29816 0 16:11 ?        00:00:09 ora_lgwr_bench1

Log File Sync
When a foreground process commits changes to the database, it allocates space in the redo buffer and copies the redo information into the buffer, protected either by the redo allocation latch itself or a redo copy latch. The process then enqueues itself for service from LGWR and posts LGWR (most ports use IPC semaphores for posting). The process then goes to sleep waiting for the return post from LGWR indicating its redo is flushed to disk. So, we now have a process waiting on a log file sync event (LFS). Let’s call this process waiter1. How long waiter1 waits depends on a lot of factors.
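
Before getting to those factors, the post-wait dance itself is easy to demonstrate with nothing but an IPC semaphore. This is a grossly simplified sketch of the mechanism using two toy processes, not Oracle internals:

#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int semid = semget(IPC_PRIVATE, 1, 0600);
	struct sembuf wait_op = { 0, -1, 0 };	/* the "log file sync" sleep */
	struct sembuf post_op = { 0, +1, 0 };	/* LGWR's wake-up post */

	if (semid < 0)
		return 1;

	if (fork() == 0) {
		/* the "foreground": commit issued, sleep until posted */
		semop(semid, &wait_op, 1);
		printf("foreground: posted, commit complete\n");
		_exit(0);
	}

	/* the "LGWR": the redo flush I/O would happen here... */
	sleep(1);
	/* ...then post the waiter out of its log file sync wait */
	semop(semid, &post_op, 1);

	wait(NULL);
	semctl(semid, 0, IPC_RMID);
	return 0;
}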

The first factor that will affect waiter1’s LFS wait is what LGWR was doing when waiter1 posted him. Remember, Oracle is a multi-user database on a multi-processor system. There is 1 LGWR and a lot of consumers of his service. On a busy system, odds are quite good that any time LGWR gets posted it is in the process of servicing a batch of prior waiters. That is, when waiter1 posted LGWR the odds are good it was already busy flushing a bunch of redo for some other group of processes who are themselves waiting in LFS. This is where it gets dicey.

Before LGWR can flush waiter1’s redo, he has to finish flushing the prior batch and post all of those waiters. Let’s say there was 256KB of redo to flush for 2 waiters on a platform with the port-defined maximum LGWR I/O size set at 128KB. On this platform, LGWR will have to issue 2 asynchronous 128KB writes and poll for those to complete. Once the I/O is complete, LGWR will post each of the two waiters using the post-wait mechanism for that port of Oracle, which is an IPC semaphore semop() call for most platforms. So, LGWR has made system calls to write, system calls to poll for I/O completion, and finally 2 calls to semop(). LGWR is just a normal user-mode process so all this dipping into the kernel introduces processing delays. Yes, these system calls I listed are non-blocking, but when a processor transitions from kernel to user mode there is opportunity to switch to higher-priority runnable processes. In fact, the very act of LGWR posting these 2 processes can cause it to lose CPU. Huh? Sure. What if instead of 2 there were 16 in the batch and there were only 4 CPUs on the system? We call that the thundering herd and it wreaks havoc on LGWR. These process scheduling storms can be quite problematic. Think about it. You have one process (LGWR) that just made 4, 8, 16 or more processes runnable by semop()ing them from their log file sync wait, which makes them runnable in kernel mode, which trumps a runnable user-mode process. If that happens on a system with, say, 8 cores, those processes will need to run somewhere. It is a scheduling storm. For this reason you should always set LGWR’s scheduling priority to the highest possible (with renice for instance).
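
On Linux that can be as simple as the following, where PROD is a hypothetical SID and -20 is the most favorable nice value:

$ LGWR_PID=$(ps -ef | awk '/ora_lgwr_PROD/ && !/awk/ { print $2 }')
$ sudo renice -20 -p $LGWR_PID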

Without elevated scheduling priority for LGWR, it can often be the case that the processes LGWR wakes up will take the CPU from LGWR. Some platforms (Legacy Unix) support the notion of making processes non-preemptable as well. On those platforms you can make LGWR non-preemptable. Without preemption protection, LGWR can lose CPU before it has exhausted its fair time slice.
Once LGWR loses his CPU it may be quite some time until he gets it back. For instance, if LGWR is preempted in the middle of trying to perform a redo buffer flush, there may be several time slices of execution for other processes before LGWR gets back on CPU. Why am I putting you to sleep about this? Because those time slices that LGWR was off CPU will be charged to foreground processes waiting in a log file sync wait event. Yes, if a foreground is in log file sync wait and LGWR is piled under other processes and can’t get CPU for, say, a few time slices (10ms each), that foreground process’ log file sync will have 10, 20 or more milliseconds of extra wait tacked onto it before LGWR even gets a chance to do I/O. LGWR’s I/O is merely a component of a foreground process’ LFS wait.

So what happened to the benchmark run behind statspack 2 above? CPU starvation, specifically for LGWR! How did I do that? Easy. I have a simple program that uses the Linux sched_setaffinity() API call to assign hard processor affinity to processes. These tests were done on a 4-core x86 system and in both cases all the background processes were started up with hard CPU affinity to CPU 0. Once the database was started, I reassigned LGWR’s hard affinity to CPU 3. The Pro*C benchmark processes and their associated shadow processes are all running split equally with hard affinity between CPUs 1 and 2. This is the model for both benchmark runs. Yes, LGWR is all alone on CPU 3.

In the case where the benchmark only got 133TPS, I have a “noise” process running on the same CPU as LGWR. This noise process also is pinned to the CPU with hard affinity. What does this “noise” process do that is so heinous as to affect Oracle’s OLTP throughput by 34%? After all, it is only a single process. Yes, single, but also running on the same CPU as LGWR. The noise program is:

#include <math.h>
#include <poll.h>

int main(void)
{
	int i; volatile double z;	/* volatile: don't let the loop optimize away */
	for (;;) {
		for (i = 0; i < 1000000; i++) z = sqrt(i);	/* ~3ms of CPU burn */
		poll(0, 0, 1);					/* then sleep 1ms */
	}
}

Simple enough. That loop of sqrt() calls from libm takes about 3ms on the CPUs in this benchmark system. So the way this works is that about every 3ms, this process goes to sleep for 1ms using the poll() system call. If this process is running when LGWR needs to run, LGWR will have to wait for up to 3ms. If that process isn’t executing, LGWR has no impediment to performing his tasks, LFS waits are a measly 2ms and the 34% OLTP throughput loss vanishes.

This little parasite noise process gets in LGWR’s way a lot too. Basically, any time LGWR gets posted to perform a flush, this noise process is in the way (chalk up 3ms); any time LGWR blocks on an I/O (e.g., a flush that is smaller than the port maximum and therefore a single I/O), this noise process is in the way once the I/O is complete (chalk up 3ms); any time LGWR unluckily loses his CPU to a time slice switch, this process will run and take 3ms from LGWR. It all adds up. And in this test case, processes posted by LGWR can never take his CPU (since I’ve set up hard affinity), so this is actually a gentler test case than real life would be.

Don’t Write At All

Just to be evil, I’ll take it a step further. Did you know that if you set the initialization parameter _disable_logging=TRUE, LGWR does everything it normally does except perform the I/O for a redo flush? Don’t do this if you like your database, because anything other than a shutdown normal is going to leave your database unrecoverable. However, this is a great test because if performance doesn’t increase when you set this parameter in this fashion, then you have a LGWR processing problem and not a LGWR I/O problem. So I did a run of the same benchmark like this:

SQL> show parameter _disable

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_disable_logging                     boolean     TRUE

Everything else was the same, but now, instead of really fast solid state disk I/Os, LGWR doesn’t even bother writing to disk at all. What did that do to the benchmark results? Nothing. I still got 133TPS.

Summary
LGWR I/O is very simple. LGWR processing on the other hand is a bit more complex.

Oracle11g: Oracle Inventory On Shared Storage. Don’t Bother Trying To Install 11g RAC That Way.

A few days ago there was a thread on the oracle-l email list about adding nodes in an Oracle Database 10g Real Application Clusters environment. The original post showed a problem that Alex Gorbachev reports he’s only seen with shared Oracle Home installs. I found that odd because I’ve done dozens upon dozens of RAC installs on shared Oracle Homes with both CFS and NFS and haven’t seen this error:

Remote 'UpdateNodeList' failed on node: 'af-xxx2'. Refer to
'/apps/oracle/oraInventory/logs/addNodeActions2007-07-08_09-36-12PM.log'
for details.
You can manually re-run the following command on the failed nodes after the
installation:
/apps/oracle/product/10.2/oui/bin/runInstaller -updateNodeList -noClusterEnabled
ORACLE_HOME=/apps/oracle/product/10.2 CLUSTER_NODES=af-xxx1,af-xxx2,af-xxx6
CRS=false "INVENTORY_LOCATION=/apps/oracle/oraInventory" LOCAL_NODE=
<node on which command is to be run>

I never have any problems with shared Oracle Home and I blog about the topic a lot, as can be seen in this list of posts. Nonetheless, Alex pointed out that the error has to do with the Oracle Inventory being on a shared filesystem. Another list participant followed up with the following comment about placing the inventory on a shared drive:

Sharing the oraInventory across nodes is not a good practice in my opinion. It runs counter to the whole concept of redundancy in an HA configuration and RAC was not written to support it.

Well, the Oracle Inventory is not a RAC concept; it is an Oracle Universal Installer concept, but I think I know what this poster was saying. However, the topic at hand is shared Oracle Home. When people use the term shared Oracle Home, they don’t mean shared ORACLE_BASE, they mean shared Oracle Home. Nonetheless, I have routinely shared the 10g inventory without problems, but then my software environments might not be as complex as those maintained by the poster of this comment.

Shared Inventory with Oracle Database 11g
No can do! Well, sort of. Today I was installing 11g RAC on one of my RHEL 4 x86 clusters. In the fine form of not practicing what I preach, I mistakenly pointed Oracle Universal Installer to a shared location (NFS) for the inventory when I was installing CRS. I got CRS installed just fine on 2 nodes and proceeded to install the database with the RAC option. It didn’t take long for OUI to complain as follows:

[screenshot: shared_home_11g.jpg]

Ugh. This is just a test cluster that I need to set up quick and dirty. So I figured I’d just change the contents of /etc/oraInst.loc to point to some new non-shared location. Aren’t I crafty? Well, that got me past the error, but without an inventory with CRS in it, Oracle11g OUI does not detect the cluster during the database install! No node selection screen, no RAC.
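
For reference, /etc/oraInst.loc is just a two-line file; the path and group below are typical but hypothetical:

inventory_loc=/u01/app/oraInventory
inst_group=oinstall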

I proceeded to blow away all the CRS stuff (ORA_CRS_HOME, inittab entries, /etc/oracle/* and /etc/oraInst.loc) and reinstalled CRS using a non-shared location for the inventory. The CRS install went fine and subsequently OUI detected the cluster when I went to install the database.

This is a significant change from 10g where the inventory content regarding CRS was not needed for anything. With 10g, the cluster is detected based on what /etc/oracle/ocr.loc tells OUI.

Summary
Shared Oracle Home is an option, but shared Oracle Home means shared Oracle Home, not shared Oracle Inventory. Oracle11g enforces this best practice nicely!

Application Server Benchmark Proves PostgreSQL is The Best Enterprise Database Server. New SPECjAppServer2004 Cost Metric Introduced Too!

I was so excited to see this InformationWeek.com article reporting that PostgreSQL has proven itself nearly as good as Oracle. On a day billed as “a good day for Open Source”, Sun published a SPECjAppServer2004 result using the Sun Niagara platform running PostgreSQL backing 3 application servers running Sun Java System Application Server 9.0.

The Sun result was 778.14 JOPS which, as the InformationWeek.com article suggests, is roughly 12% under this 874 JOPS Itanium Oracle result. That’s all fine and dandy, but why hand pick that Oracle result over this 1111.96 JOPS Xeon “Clovertown” Oracle result? I’ll tell you why: it’s because InformationWeek.com was only reporting on this blog entry by Josh Berkus. Josh is the PostgreSQL Lead by the way, and Josh knows full well that the 12% in his hand-picked comparison looked a lot better than the 42% thumping Oracle dealt PostgreSQL using 2 little Xeon “Clovertown” processors. But that isn’t what I’m blogging about.

PostgreSQL: The World’s Best Application Server
I’m not blogging about the fact that these guys cherry picked results in an attempt to put PostgreSQL on par with Oracle. After all, I too dream of the day that everything in the world is free. In the meantime you get what you pay for.

What gets my ire up about this news piece is the fact that there is no news there. Folks, SPECjAppServer2004 is not a database benchmark. The SPECjAppServer2004 web page puts it like this:

SPECjAppServer2004 (Java Application Server) is a multi-tier benchmark for measuring the performance of Java 2 Enterprise Edition (J2EE) technology-based application servers. SPECjAppServer2004 is an end-to-end application which exercises all major J2EE technologies implemented by compliant application servers as follows:

  • The web container, including servlets and JSPs
  • The EJB container
  • EJB2.0 Container Managed Persistence
  • JMS and Message Driven Beans
  • Transaction management
  • Database connectivity

Yes, there is a database downwind. No, this is not a database workload; it is an application server benchmark. So to say that PostgreSQL made a “good day for Open Source” based upon this result is a lark. But that is not what I’m blogging about.

Cost Metric
SPECjAppServer2004 has no cost metric. By that I mean that there is no cost metric. In fact, the specification lacks a cost metric. Moreover, no published results include cost information. What I’m trying to say is that the benchmark has no cost component. If I started an Open Source project called JOPS Cost Metric Calculator (JCMC), I’d have nothing to work with—but I might get a lot of help I suppose. In fact, if I spent the rest of my life studying the SPECjAppServer2004 specification I wouldn’t find anything about a cost metric. I once saw a Klingon riding a three-legged pink elephant wearing a tee shirt that read, “The SPECjAppServer2004 ghajtaH ghobe’ Cost Metric” which—loosely translated—corroborates my assertion that there is no cost metric in the SPECjAppServer2004. That’s why I’m a little disappointed to see Tom Daley’s blog entry with a cost analysis of these SPECjAppServer2004 results.

Summary
Oracle Database has licensed options. Customers can use the best options at prices that fit their budget to solve their problems. Incidentally, Oracle has that choice as well. So when Oracle is competing in an Application Server benchmark, they can use whatever downwind database options they so choose—after all, I heard a rumor that there is no cost component in the SPECjAppServer2004 specification.

Oracle used Enterprise Edition and partitioning. I bet if they tried they could get the same result with Standard Edition or, since there are only 8 cores involved, Standard Edition One. But why bother? THERE IS NO COST COMPONENT. And PostgreSQL comes nowhere close to what Oracle can do with this workload.

What Did He Say?
That’s right: in spite of the blogospherian hopes that PostgreSQL is a 12% heel-biter to Oracle, the full list of audited SPECjAppServer2004 results tells the real story. Not only did Oracle slam this “PostgreSQL result” with the same number of cores—remember the 42% thumping Oracle on Xeon 5355 delivered—but Oracle scales as well, as this 7629.45 JOPS result with 64 cores will attest. Oh, and I forgot to point out that the 778.14 JOPS Sun result with PostgreSQL downwind used three application servers. Isn’t it interesting how a single Xeon 5355-based application server was able to push 42% more throughput when using Oracle’s application server with an Oracle database back-end? That is, the Sun result required 3 application servers for a total of 12 cores, whereas the Oracle 1111.96 JOPS result used a single Xeon 5355-based application server with 8 cores. Remember, this is an application server benchmark, so Oracle pulls roughly 139 JOPS per application server core (1111.96 ÷ 8). That’s a far cry better than Sun’s 65 JOPS per application server core (778.14 ÷ 12)!

When it comes to the database component, I want to see the PostgreSQL 64-core number. Hey, it’s just a porting effort after all. There must be countless Open Source contributors working on optimizing the PostgreSQL port to HP-UX for the Superdome.

Where’s Waldo?
There are no SPECjAppServer2004 results for SQL Server in this list. That could very well be the real story.

Oracle Doesn’t Scale Without Multiple Log Writer Processes

There has been a bit of a spat going on between Jonathan Lewis and Don Burleson. Don’s summary of the topic can be found here. I don’t care about the spat. I want to point out something clearly. Very clearly. But first, a quote from Don:

[…] Lewis failed to mention the hidden parameter _lgwr_io_slaves nor the Metalink note that clearly states that initially, only one LGWR process is started and that multiple LGWR processes will only appear under high activity.

The Oracle docs are very clear that multiple log writer processes exist:

“Prior to Oracle8i you could configure multiple log writers using the LGWR_IO_SLAVES parameter.”

Note: There is a big difference between LGWR I/O Slaves and LGWR. Don continues:

(Note: Starting in 8.1.7, lgwr_io_slaves became a hidden parameter _lgwr_io_slaves, and it remains a hidden parm in Oracle10g).

Further, Metalink note 109582.1 says that multiple log writer I/O slaves were first introduced almost a decade ago, way back in Oracle8.

Folks, there is only ever one LGWR, whose job it is to clear the redo buffer. DBWR, on the other hand, can be instantiated multiple times (support for multiple true DBWR processes), and that is due entirely to the NUMA requirements of certain ports from the late 1990s. Although it is possible for a single DBWR to saturate a CPU, and therefore need multiple DBWR processes, that is rare (generally related to pathological latch contention on cache buffers chains).

As long as LGWR and DBWR have asynchronous I/O support—and you are not running on a NUMA-optimized port—there really should never be a reason to configure multiple DBWR processes, nor should LGWR saturate a CPU.
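
If you are curious where your own system stands, a check along these lines is enough. This is a minimal sketch assuming a Linux host and a locally authenticated SYSDBA login; the egrep pattern simply matches the standard ora_dbwN_<SID> and ora_lgwr_<SID> background process names:

# ps -ef | egrep 'ora_dbw|ora_lgwr'
# sqlplus -s / as sysdba <<EOF
show parameter db_writer_processes
show parameter dbwr_io_slaves
show parameter filesystemio_options
EOF

On a sane configuration you would expect exactly one dbw0 and one lgwr process, db_writer_processes set to 1, and dbwr_io_slaves set to 0.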

Let me put it this way: there are examples of large, high-end systems doing OLTP that don’t need multiple LGWR processes. Here is a system with roughly 7,000 spindles, 40 FC HBAs, 128 cores and 2TB of RAM that needed neither multiple LGWR nor multiple DBWR processes. And if you think you have a workload that generates more redo on a per-CPU basis than TPC-C, you’re probably wrong.

If I ever hear the term I/O slaves again, I’ll:

[image: gross.jpg]

Summary
Oracle demonstrates more LGWR scalability than 99.99942% of all production sites would ever need when it runs such high-end TPC-C benchmarks as the one I cited above. And please, let’s stop the I/O slave madness!

Manly Men Only Deploy Oracle with Fibre Channel – Part VIII. After All, Oracle Doesn’t Support Async I/O on NFS

In the comment section of my recent post about Tim Hall’s excellent NFS step-by-step Linux RAC install Guide, Tim came full circle to ask a question about asynchronous I/O on NFS. He wrote:

What do you set your filesystemio_options init.ora parameter to when using Oracle over NFS?

Based on what you’ve written before I know NFS supports direct I/O, but I’m struggling to find a specific statement about NFS and asynchronous I/O. So should I use:

filesystemio_options=directIO

or

filesystemio_options=setall

My reply was going to point you folks back to my recent rant about old Linux distributions combined with Oracle over NFS. That is, the answer is, “it depends.” It depends on whether you are running a reasonable Linux distribution. But Tim quickly followed up his query with:

I found my answer. Asynchronous I/O is not supported on NFS:

http://download.oracle.com/docs/cd/B19306_01/server.102/b15658/appc_linux.htm#sthref892

Bummer, I didn’t get to answer it.

Word To The Wise
Don’t use old Linux stuff with NAS if you want to do Oracle over NFS. Metalink 279069.1 provides a clear picture as to why I say that. It points out a couple of important things:

1. RHEL 4 U4 and EL4 both support asynchronous I/O on NFS mounts. That makes me so happy, because I’ve been doing asynchronous I/O on NFS mounts with Oracle10gR2 for about 16 months. Unfortunately, ML 279069.1 incorrectly states that the critical fix for Oracle async I/O on NFS is U4, when in fact the specific bug (Bugzilla 161362) was fixed in RHEL4 U3, as seen in this Red Hat Advisory from March 2006. (A quick way to check what you are actually running is sketched after this list.)

2. Asynchronous I/O on NFS was not supported on any release prior to RHEL4. That’s fine with me, because I wouldn’t use any Linux release prior to the 2.6 kernels to support Oracle over NFS!
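
As promised, here is the quick sanity check for a Red Hat-style system; it is nothing more than reading the release file and the running kernel (the comments are rules of thumb, not gospel):

# cat /etc/redhat-release      # wants to report Update 3 or later
# uname -r                     # RHEL4 means a 2.6.9-based kernel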

Summary
The Oracle documentation on the matter was correct when it was produced, long before there was OS support for asynchronous I/O with Oracle over NFS on Linux. Metalink 279069.1 is partly correct in that it states support for asynchronous I/O on systems that have the fix for Bugzilla 161362, but it incorrectly suggests that U4 is the requisite release for that fix—the bug was actually fixed in U3. And yes, I get really good performance with the following initialization parameter set, and have for about 16 months:

filesystemio_options = setall
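
For completeness, a minimal sketch of how that parameter gets set; filesystemio_options is a static parameter, so it takes an spfile change and an instance bounce:

# sqlplus / as sysdba
SQL> alter system set filesystemio_options=setall scope=spfile;
SQL> shutdown immediate
SQL> startup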

Manly Man Post Script
Always remember, the Manly Man series is tongue-in-cheek. Oracle over NFS with Async I/O, on the other hand, isn’t.



