Archive for the 'oracle' Category



A Tip About the ORION I/O Generator Tool

I was recently having a chat with a friend about the Oracle ORION test tool. I like Orion and think it is a helpful tool. However, there is one aspect of Orion I thought I’d blog about because I find a lot of folks don’t know this bit about Orion’s I/O profile.

Generating an OLTP I/O Profile With Orion
If you use Orion to simulate OLTP, be aware that the profile is not exactly like Oracle’s. Orion uses libaio asynchronous I/O routines (e.g., io_submit(2)/io_getevents(2)) for reads as well as writes. This differs from a real Oracle database workload, because the main reads performed by the server in an OLTP workload are db file sequential reads, which are random single-block synchronous reads. For that matter, foreground direct path reads are mostly (if not entirely) blocking single-block requests. The net effect of this difference is that Orion can generate a tremendous amount of I/O traffic without the process-scheduling overhead Oracle incurs with blocking reads.
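To make the contrast concrete, here is a minimal Python sketch of the db file sequential read pattern: a blocking, single-block pread(2) at a random aligned offset, one call after the other. The file path and block size are illustrative, not anything Orion or Oracle uses.

```python
import os
import random

BLOCK_SIZE = 8192  # a common db_block_size

def random_single_block_reads(path: str, count: int) -> int:
    """Issue blocking single-block reads at random aligned offsets --
    the shape of Oracle's db file sequential read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        nblocks = os.fstat(fd).st_size // BLOCK_SIZE
        total = 0
        for _ in range(count):
            blk = random.randrange(nblocks)
            # pread(2) does not return until the data is in the buffer
            total += len(os.pread(fd, BLOCK_SIZE, blk * BLOCK_SIZE))
        return total
    finally:
        os.close(fd)
```

Orion, by contrast, queues many such reads with io_submit(2) and reaps them later with io_getevents(2), so the process never sleeps inside an individual read call.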

There is a paper that includes information about ORION in the whitepapers section of my blog. Also, ORION is downloadable from Oracle’s OTN website.

Trivial Pursuit
Why is it that random single-block synchronous reads are called db file sequential read in Oracle? Because the calls are made sequentially, one after the other. It is not because the target disk blocks are sequential in the file being accessed.

No More Oracle Ports to Solaris/AIX/HP-UX After Oracle11g?

 

BLOG UPDATE: After writing this piece back in December 2006, I found a page on searchdatacenter.com about HP’s promotion to get Sun SPARC customers over to ProLiant servers. Yes, that would be a no-brainer move, but not painless…enter Transitive. Along the same lines of what I stated in this blog entry, that news about HP included the following bit about Transitive:

In conjunction with the certification announcement, HP publicized a new relationship with Transitive Corp., which ports software across multiple processor and operating system pairs.

Transitive’s QuickTransit for Solaris/SPARC-to-Linux/x86-64 solution enables applications that have been compiled for Solaris on SPARC to run on certified 64-bit HP ProLiant platforms running Linux without requiring any source code or binary changes, HP reported.

Oracle no longer has to port the database to such a wide array of platforms for the sake of garnering market share. They won that battle—rightly so. Who else had such an optimized RDBMS on so many platforms? Nobody, period. But most of those platforms are dead. So, imagine the reduced cost for Oracle to produce their database—someday in the future—on only one platform. Guess what platform that would be? Linux of course. “Oh boy, he’s fallen off the deep end”, you say. No, really, read on.

Porting Is Expensive—Beyond Labor
It costs Oracle to maintain multiple ports, notwithstanding the fact that the various product lines supply engineering resources. Even Sequent had 27 engineers stationed on site at Redwood Shores. All that additional free manpower aside, the very existence of so many ports is a cost that is hard to explain to shareholders and analysts when it is an expense that Microsoft clearly doesn’t have to bear. Remember, the parties that matter the most at this point of the game are the analysts and shareholders.

I’m a huge fan of “real systems”. You know, the sorts of systems where life-and-death applications are running and the people managing them can sleep at night knowing their system isn’t going to kill people because it crashes all the time. I’m glad there are still systems like System z9 mainframes, System p running AIX, Itanium2 Superdomes running HP-UX and so on. These are systems that are tried and true. And, no, they are not open source. These systems belong—period. What does this have to do with Oracle?

One Port
OK, if you are still with me, picture this. Oracle stops porting to all instruction sets except x86_64—and only Linux. That reduces the cost of producing the database product (by a very small amount I know) and makes analysts happy. It looks more like what Microsoft does. It looks more like what the open source databases do. It looks young and fresh. By the way, I know you can run MySQL on SPARC. Like I say, “sometimes man bites dog.” I digress.

How Would Oracle Pull This Off?
The same way Apple pulled the PowerPC to Intel switch—Rosetta. Rosetta works, we all know that. What not a lot of people know is that Rosetta is Transitive. I just found that out myself. Transitive works. IBM is already using Transitive to woo customers to run their x86 Linux applications on PowerPC Linux. It all starts to make your head swim.

Introducing the High-End Oracle Server of the Future
OK, so there is only one Oracle distribution—x86_64 Linux. That’s it. Well, the way it could end up is that if you want to run the single Oracle port in maximum performance mode, you run x86_64 hardware. How bad is that? Remember, this is the era of commodity processors delivering more than 50,000 TpmC. And Moore’s Law is on your side. Although the current application of Transitive is mostly to bring non-Intel executables to Intel platforms, it certainly can go both ways. How would the likes of IBM, HP and Sun deliver value in their high end systems? It could wind up that the competitive edge between these high end vendors boils down to nothing more than which platform performs something like Transitive better. Want the really cool things that high end System p offers? Buy it. Load Oracle over Transitive and away you go. Like the power savings of Sun CoolThreads? Buy it. Want to run Oracle on it? You know the drill—Transitive.

I’ll leave you with a final thought. This sort of thing would make more business sense than technical sense. Which do you think has more weight?

 

Mark Rittman Changed His Header Photo. Hey, Where is All That OakTable Grey Hair?

Now that’s cool. I see the header photo over at the Mark Rittman Oracle Blog is a photo of that cool impromptu “Oaktable” they put up for us at UKOUG. That was a great conference.

Oh, by the way, Mark Rittman has a really good blog.

Testing RAC Failover: Be Evil, Make New Friends.

In Alejandro Vargas’ blog entry about RAC & ASM, Crash and Recovery Test Scenarios, some tests were described that would cause RAC failovers. Unfortunately, none of the faults described were of the sort that puts clusterware to the test. The easiest types of failures for clusterware to handle are complete, clean outages. Simply powering off a server, for instance, is no challenge for any clusterware to deal with. The other nodes in the cluster will be well aware that the node is dead. The difficult scenarios for clusterware to respond to are states of flux and compromised participation in the cluster. That is, a server that is alive but not participating. The topic of Alejandro’s blog entry was not a definition of a production readiness testing plan by any means, but it was a good segue into the comment I entered:

These are good tests, yes, but they do not truly replicate difficult scenarios for clusterware to resolve. It is always important to perform manual fault-injection testing such as physically severing storage and network connectivity paths and doing so with simultaneous failures and cascading failures alike. Also, another good test to [run] is a forced processor starvation situation by forking processes in a loop until there are no [process] slots [remaining]. These […] situations are a challenge to any clusterware offering.

Clusterware is Serious Business
As I pointed out in my previous blog entry about Oracle Clusterware, processor saturation is a bad thing for Oracle Clusterware—particularly where fencing is concerned. Alejandro had this to say:

These scenarios were defined to train a group of DBA’s to perform recovery, rather than to test the clusterware itself. When we introduced RAC & ASM we did run stress & resilience tests. The starvation test you suggest is a good one, I have seen that happening at customer sites on production environments. Thanks for your comments.

Be Mean!
If you are involved with a pre-production testing effort involving clustered Oracle, remember, be evil! Don’t force failover by doing operational things like shutting down a server or killing Oracle clusterware processes. You are just doing a functional test when you do that. Instead, create significant server load with synthetic tests such as wild loops of dd(1) to /dev/null using absurdly large values assigned to the ibs argument or shell scripts that fork children but don’t wait for them. Run C programs that wildly malloc(2) memory, or maybe a little stack recursion is your flavor—force the system into swapping, etc. Generate these loads on the server you are about to isolate from the network for instance. See what the state of the cluster is afterwards. Of course, you can purposefully execute poorly tuned Parallel Query workloads to swamp a system as well. Be creative.
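As a starting point, here is a bounded Python sketch of a synthetic load generator along those lines. All the knobs (process count, duration, memory size) are illustrative; a real test would size them up until the box is genuinely starved, and the dd(1), fork-loop, and malloc(2) variations above each stress a different subsystem.

```python
import multiprocessing
import time

def burn(seconds: float, mbytes: int) -> None:
    """Hold some memory and spin the CPU until the deadline passes."""
    hog = bytearray(mbytes * 1024 * 1024)  # keep a reference so it stays resident
    deadline = time.time() + seconds
    x = 0
    while time.time() < deadline:
        x += 1
        hog[x % len(hog)] = x & 0xFF  # touch pages so the allocation is real

def generate_load(nproc: int, seconds: float, mbytes: int) -> list:
    """Fork nproc workers, wait for them, and return their exit codes."""
    procs = [multiprocessing.Process(target=burn, args=(seconds, mbytes))
             for _ in range(nproc)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]
```

Run something like this on the node you are about to isolate from the network, then inject the fault and see what state the cluster lands in.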

Something To Think About
For once, it will pay off to be evil. Just make sure whatever you accept as your synthetic load generator is consistent and reproducible because once you start this testing, you’ll be doing it again and again—if you find bugs. You’ll be spending a lot of time on the phone making new friends.

Oracle Performance on Sun’s “Rock” Processors and Oracle Scalability

The late breaking news of the day is that by December 31st, Sun engineers will have taped out the next generation mega-multi-core processor called “Rock”. Some is good, so more must be better. This one offers 16 cores. News has been out about Rock for several years. Back in 2004 the project had the code name “Project 30x” because it aimed to out-perform the US-III of the day by a factor of 30. On the humor side, this zdnet.com article about the Sun “Rock” chip throws in a twist. It seems if the processor is not taped out by the end of the year, Sun engineers have to wear ties. I’d hate to see that happen.

What Does This Have To Do With Oracle?
These processors are going to be really, really fast. The first generation will be based on 65nm real estate, but the already planned follow-on 45nm offering will up the ante even more. I recently blogged about Sun CoolThreads “Niagara” performance with Oracle. The fact is that shops currently running Oracle on SPARC hardware of the Ultra Enterprise class can easily migrate over to these new systems—particularly for OLTP. Those are 8-core (90nm) packages. With Rock, I think we can expect performance commensurate with the packaging. That means a single multi-core system based on this technology is, quite honestly, nearly too big.

Datacenters today are scrambling to get their processor utilization up and their power consumption down. And, oh, is Oracle going to charge .25 of a CPU license per core? The bottom line for Rock is that there is a tremendous amount of Sun Ultra Enterprise gear out there that will need replacing soon. Maybe some of the replacement business will help Sun continue their trend as the only vendor seeing server revenue increases. All the other vendors seem to be finding that new purchases are being held back by the continuing effort to chop these already “small” servers into smaller servers with virtualization.

No, Really, What Does This Have To Do With Oracle?
Folks, this is the era of multi-core processors (e.g., the Xeon 53XX Clovertown) that achieve TPC-C results of 50,000+ per core. Remember, the highest result ever attained by the venerable Starfire UE 10000 was roughly 155,000—with 64 CPUs. It won’t be long until you are carving up a single core to support your Oracle database. And I posit that sub-core Oracle database deployments will be the lion’s share of the market too. But even for the “heavy” databases, it won’t be long until they too require only some of the cores that a single socket will offer.

Who Scales Better? Oracle? DB2? MySQL?
Oracle raced to the head of the pack throughout the 1990s by offering the most robust SMP scalability. So let me ask, if your application can be back-ended by Oracle executing within a virtual processor that represents only a portion of a socket, how scalable does the RDBMS kernel really need to be?

AMD Quad-Core “Barcelona” Processor For Oracle (Part I)

I haven’t seen much in the Oracle blogosphere on this topic. Let me see if I can get it going…

AMD’s move into quad-core processors has me thinking. First, I like how this arstechnica.com article points out that AMD’s quad-core “Barcelona” processor is a “true” quad-core, as opposed to the Xeon 5300 family, which is actually two dual-core processors mated in a multi-chip module (MCM). The article reads:

AMD touts Barcelona as a “true” quad-core processor, because it features a highly integrated design with all four cores on a single die with some shared parts. This is in contrast to Intel’s “quad-core” Kentsfield parts, which use package-level integration to get two separate dual-core dies in the same socket. For my part, I’m inclined to agree with AMD that Barcelona is real quad-core and Kentsfield isn’t, but I gave up fighting that semantic fight a long time ago. Nowadays, if it has four cores in a single package, I (grudgingly) call it “quad-core.”

I agree with the author on that point.

Just recently I worked the HP demo booth at UKOUG with Steve Shaw of Intel. I actually found myself playing a little po-tay-toe/po-tah-toe regarding just how “true” each of these quad-core packages was. Honestly, I think I held that stance for just a moment, because the point is moot. Let me explain. It is all about Oracle licensing.

Oracle licenses Intel cores at .5 of a CPU, rounded up to the next whole number. So a single-socket, quad-core system is .5 x 4, or 2 full CPU licenses. On the other hand, a single-socket/dual-core system is .5 x 2, or 1 CPU license. With processors this powerful, the challenge is no longer how much you can get; it is how little you can get away with. If the workload can be satisfied with a single socket/dual-core, the price savings in Oracle licensing alone might motivate folks to buy such a system. Oracle is the most expensive thing you buy, after all. What systems offer significant performance in a single socket/dual-core? Itanium. It seems you can order the HP Integrity rx3600 with a single socket. There, I said it. Now I need to go kneel on peach pits or something to make me feel properly chastised.
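The arithmetic is simple enough to sketch. The .5 factor for x86 cores comes from the licensing rule above; any other factor passed in is purely illustrative.

```python
import math

def cpu_licenses(cores: int, per_core_factor: float = 0.5) -> int:
    """Cores times the per-core factor, rounded up to a whole CPU license."""
    return math.ceil(cores * per_core_factor)
```

So a quad-core socket at .5 needs 2 licenses while a dual-core socket needs 1, half the Oracle bill for the same socket count.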

There is more to it than hardware. Oracle ports have always lagged for Itanium Linux. In fact, Oracle10g was released on PowerPC Linux before Itanium.

I just think Intel missed the boat in the late 1990s on getting Merced to market in a package worth having. And who really needed another instruction set? Now I digress.

Using Oracle Disk Manager to Monitor Database I/O

Some of the topics in this post are also covered in the Julian Dyke/Steve Shaw RAC book that came out last summer. I enjoyed being one of the technical reviewers of the book. It is a good book.

Monitoring DBWR Specifically
I have received several emails from existing PolyServe customers asking me why I didn’t just use the Oracle Disk Manager (ODM) I/O monitoring package that is included in the PolyServe Database Utility for Oracle to show the multi-block DBWR writes I blogged about in this post. After all, there is very little left to the imagination when monitoring Oracle using this unique feature of our implementation of the Oracle Disk Manager library specification.

This URL will get you a copy of the I/O Monitoring feature of our Oracle Disk Manager library. It is quite a good feature.

I didn’t use ODM for the first part of that thread because I wanted to discuss using strace(1) for such purposes. Yes, I could have just used the mxodmstat(1) command that comes with that package and I would have seen that the average I/O size was not exactly db_block_size as one would expect. For instance, the following screen shot is an example of cluster-wide monitoring of DBWR processes. The first invocation of the command monitors DBWR only, followed by another execution of the command to monitor LGWR. The average size of the async writes for DBWR is not precisely 8KB (the db_block_size for this database) as it would be if this were Oracle9i:


[screen shot: dom1]

As an aside, the system was pretty busy as the following screen shot will show. This is a non-RAC database on an HP Proliant DL-585 where database writes are peaking at roughly 140MB/s. You can also see that the service times (Ave ms) for the writes are averaging a bit high (as high as 30ms). Looks like I/O subsystem saturation.

[screen shot: odm2]

 

Oh, here’s a quick peek at one nice feature of mxodmstat(1). You can dump out all the active files, clusterwide, for any number of databases/instances and nodes using the -lf options:

[screen shot: odm3]

I hope you take a peek at the User Guide for this feature. It has a lot of examples of what the tool can do. You might find it interesting—perhaps something you should push your vendor to implement?

 

Analyzing Asynchronous I/O Support with Oracle10g

This is not a post about why someone would want to deploy Oracle with a mix of files with varying support for asynchronous I/O. It is just a peek at how Oracle10g handles it. This blog post is a continuation of yesterday’s topic about analyzing DBWR I/O activity with strace(1).

I’ve said many times before that one of the things Oracle does not get sufficient credit for is the fact that the database adapts so well to such a tremendous variety of platforms. Moreover, each platform can be complex. Historically, with Linux for instance, some file systems support asynchronous I/O and others do not. With JFS on AIX, there are mount options to consider, as is the case with Veritas on all platforms. These technologies offer deployment options. That is a good thing.

What happens when the initialization parameter filesystemio_options=asynch is set yet there is a mix of files that do and do not support asynchronous I/O? Does Oracle just crash? Does it offline files? Does it pollute the alert log with messages every time it tries an asynchronous I/O to a file that doesn’t support it? The answer is that it does none of that. It simply deals with it. It doesn’t throw the baby out with the bath water either. Much older versions of Oracle would probably have just dropped the whole instance to the least common denominator (synchronous I/O).

Not Just A Linux Topic
I think the information in this blog post should be considered useful on all platforms. Sure, you can’t use strace(1) on a system that only offers truss(1), but you can do the same general analysis with either. The system calls will be different too. Whereas Oracle has to use the Linux-only libaio routines called io_submit(2)/io_getevents(2), all other ports[1] use POSIX asynchronous I/O (e.g., lio_listio, aio_write, etc.) or other proprietary asynchronous I/O library routines.

Oracle Takes Charge
As I was saying, if you have some mix of technology where some files in the database do not support asynchronous I/O, yet you’ve configured the instance to use it, Oracle simply deals with the issue. There are no warnings. It is important to understand this topic in case you run into it though.

Mixing Synchronous with Asynchronous I/O
In the following screen shot I was viewing strace(1) output of a shadow process doing a tablespace creation. The instance was configured to use asynchronous I/O, yet the CREATE TABLESPACE command I issued was to create a file in a filesystem that does not support asynchronous I/O[2]. Performing this testing on a platform where I can mix libaio asynchronous I/O and libc synchronous I/O with the same instance makes it easy to depict what Oracle is doing. At the first arrow in the screen shot, the OMF datafile is created with open(2) using the O_CREAT flag. The file descriptor returned is 13. The second arrow points to the first asynchronous I/O issued against the datafile. The io_submit(2) call failed with EINVAL indicating to Oracle that the operation is invalid for this file descriptor.


[screen shot: d2_1]

Now, Oracle could have raised an error and failed the CREATE TABLESPACE statement. It did not. Instead, the shadow process simply proceeded to create the datafile with synchronous I/O. The following screen shot shows the same io_submit(2) call failing at the first arrow, but nothing more than the mapping of some shared libraries (the mmap() calls) occurred between that failure and the first synchronous write using pwrite(2)—on the same file descriptor. The file didn’t need to be reopened or any such thing. Oracle simply fires off a synchronous write.

[screen shot: dbw2_2]

What Does This Have To Do With DBWR?
Once the tablespace was created, I set out to create tables in it with CTAS statements. To see how DBWR behaved with this mix of asynchronous I/O support, I once again monitored DBWR with strace(1), sending the trace info to a file called mon.out. The following screen shot shows that the first attempts to flush SGA buffers to the file also failed with EINVAL. All was not lost, however; the screen shot also shows that DBWR continued just fine using synchronous writes to this particular file. Note, DBWR does not have to perform this “discovery” on every flushing operation. Once the file is deemed unsuitable for asynchronous I/O, all subsequent I/O to it will be synchronous. Oracle just continues to work, without alarming the DBA.

[screen shot: dbw2_3]
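The remember-and-fall-back behavior just described can be sketched in Python. The async_submit callable below is a stand-in for io_submit(2), and the class and method names are mine, not Oracle’s.

```python
import errno
import os

class WriteRouter:
    """Try asynchronous submission first; after a file descriptor returns
    EINVAL once, use synchronous pwrite(2) for it from then on."""
    def __init__(self, async_submit):
        self.async_submit = async_submit  # callable(fd, data, offset)
        self.sync_only = set()

    def write(self, fd: int, data: bytes, offset: int) -> str:
        if fd not in self.sync_only:
            try:
                self.async_submit(fd, data, offset)
                return "async"
            except OSError as e:
                if e.errno != errno.EINVAL:
                    raise
                self.sync_only.add(fd)  # no more async attempts for this fd
        os.pwrite(fd, data, offset)     # quiet synchronous fallback
        return "sync"
```

The key point is the set: the EINVAL probe happens once per descriptor, not on every flush.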

How Would a Single DBWR Process Handle This?

So the next question is what does it mean to have a single database writer charged with the task of flushing buffers from the SGA to a mix of files where not all files support asynchronous I/O? It is not good. Now, as I said, Oracle could have just reverted the entire instance to 100% synchronous I/O, but that would not be in the best interest of performance. On the other hand, if Oracle is doing what I’m about to show you, it would be nice if it made one small alert log entry—but it doesn’t. That is why I’m blogging this (actually it is also because I’m a fan of Oracle at the platform level).

In the following screen shot, I use egrep(1) to pull occurrences from the DBWR strace(1) output file where io_submit(2) and pwrite(2) are intermixed. Again, this is a single DBWR flushing buffers from the SGA to files of varying asynchronous I/O support:

[screen shot: dbw2_4]

In this particular case, the very first io_submit(2) call flushed 4 buffers, 2 each to file descriptors 19 and 20. Before calling io_getevents(2) to process the completion of those asynchronous I/Os, DBWR proceeds to issue a series of synchronous writes to file descriptor 24 (another of the non-asynchronous I/O files in this database). By the way, notice that most of those writes to file descriptor 24 were multi-block DBWR writes. The problem with having one DBWR process intermixing synchronous with asynchronous I/O is that any buffers in the write batch bound for a synchronous I/O file will cause a delay in the instantiation of any buffer flushing to asynchronous I/O files. When DBWR walks an LRU to build a batch, it is not considering the lower-level OS support of asynchronous I/O on the file that a particular buffer will be written to. It just builds a batch based on buffer state and age. In short, synchronous I/O requests will cause a delay in the instantiation of subsequent asynchronous requests.

OK, so this is a double-edged sword. Oracle handles this complexity nicely—much credit due. However, it is not entirely inconceivable that some of you out there have databases configured with a mix of asynchronous I/O support for your files. From platform to platform this can vary so much. Please be aware that this is not just a file system topic. It can also be a device driver issue. It is entirely possible to have a file system that generically supports asynchronous I/O created on a device where the device driver does not. This scenario will also result in EINVAL on asynchronous I/O calls. Here too, Oracle is likely doing the right thing—dealing with it.

What To Do?
Just use raw partitions. No, of course not. We should be glad that Oracle deals with such complexity so well. If you configure multiple database writers (not slaves) on a system that has a mix of asynchronous I/O support, you’ll likely never know the difference. But the topic is at least on your mind.

[1] Except Windows of course

[2] The cluster file system in PolyServe’s Database Utility for Oracle uses a mount option to enable both direct I/O and OS asynchronous I/O. However, when using PolyServe’s Oracle Disk Manager (ODM) library, Oracle can perform asynchronous I/O on all mount types. Mount options for direct I/O are quite common, as this is a requirement on UFS and OCFS2 as well.

A Quick Announcement About Scalable NAS

If you, or anyone in your datacenter, is interested in Scalable NAS, this enterprisestorageforum.com article may be of interest. Additionally, if you are interested you can sign up here for a web demonstration of Scalable NAS. Note, the same sign up sheet will allow you to sign up for web demonstrations of PolyServe’s Database Utility for Oracle as well.

Note, clustered storage is really catching on and I think it should be of interest to any forward-looking DBA, Oracle IT Architect, Storage Administrator or Unstructured Data Administrator.

It is possible that NAS will be/should be a part of your Oracle deployment at some point.

 

Analyzing Oracle10g Database Writer I/O Activity on Linux

Using strace(1) to Study Database Writer on Linux
This is a short blog entry for folks that are interested in Oracle10g’s usage of libaio asynchronous I/O routines (e.g., io_submit(2)/io_getevents(2)). For this test, I set up Oracle10g release 10.2.0.1 on Red Hat 4 x86_64. I am using the cluster filesystem bundled with PolyServe’s Database Utility for Oracle, but for all intents and purposes I could have used ext3.

The workload is a simple loop of INSERT INTO SELECT * FROM statements to rapidly grow some tables thereby stimulating Database Writer (DBWR) to flush modified SGA buffers to disk. Once I got the workload running, I simply executed the strace command as follows where <DBWR_PID> was replaced with the real PID of the DBWR process:

$ strace -o dbw -p <DBWR_PID>

NOTE: Using strace(1) imposes a severe penalty on the process being traced.  I do not recommend using strace(1) on a production instance unless you have other really big problems the strace(1) output would help you get under control.

The second argument to the io_submit(2) call is a long integer that represents the number of I/O requests spelled out in the current call. The return value of an io_submit(2) call is the number of iocbs processed. One clever thing to do is combine grep(1) and awk(1) to see what degree of concurrent I/O DBWR is requesting on each call. The following screen shot shows an example of using awk(1) to select the io_submit(2) calls DBWR has made to request more than a single I/O. All told, this should be the majority of DBWR requests.


[screen shot: strace 1]
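If awk isn’t your thing, the same filter can be sketched in Python. This assumes strace lines of the shape shown later in this post, i.e., io_submit(ctx, nr, {…}) = ret, where nr is the request count.

```python
import re

# Matches the context id and request count at the front of an io_submit line.
IO_SUBMIT = re.compile(r'^io_submit\(\d+,\s*(\d+),')

def batched_request_counts(lines, more_than: int = 1):
    """Return the request count of every io_submit(2) call that batched
    more than 'more_than' I/Os in a single system call."""
    counts = []
    for line in lines:
        m = IO_SUBMIT.match(line)
        if m and int(m.group(1)) > more_than:
            counts.append(int(m.group(1)))
    return counts
```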

Another way to do this is to anchor on the return value of the io_submit(2) call. The following screen shot shows an example of using grep(1) to select only the io_submit(2) calls that requested more than 100 I/O transfers in a single call.

[screen shot: strace 2]

What File Descriptors?
When io_submit(2) is called for more than one I/O request, the strace(1) output will string out the details of each individual iocb. Each individual request in a call to io_submit(2) can be for a write to a different file descriptor. In the following text grep(1)ed out of the strace(1) output file, we see that Oracle requested 136 I/Os in a single call and that the first 2 iocbs were requests to write to file descriptor 18:

io_submit(182926135296, 136, {{0x2a973cea40, 0, 1, 0, 18}, {0x2a973e6e10, 0, 1, 0, 18}
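Pulling the target descriptors out of a line like that takes only a couple of lines; here is a sketch keyed off the iocb print format above, where the last field of each {…} group is the file descriptor.

```python
import re

# Each iocb prints as {buf_ptr, a, b, c, fd}; the trailing field is the fd.
IOCB = re.compile(r'\{0x[0-9a-f]+, \d+, \d+, \d+, (\d+)\}')

def iocb_fds(line: str) -> list:
    """Return the file descriptor of every iocb in one io_submit(2) line."""
    return [int(fd) for fd in IOCB.findall(line)]
```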

What About ASM?
You can run ASM with libaio. If you do, you can do this sort of monitoring, but you won’t really be able to figure out what DBWR is writing to, because the file descriptors will just point to raw disk. ASM is raw disk.


Oracle OS Watcher (OSW) Scripts

Dave Moore discusses the relatively new set of OS performance data collection scripts on his webpage here. Dave writes:

OS Watcher is a utility provided on MetaLink (Note 301137.1) primarily for support issues in a RAC environment. I must admit I was captivated by the name and wondered if I could use this tool instead of expensive 3rd party products for monitoring key operating system metrics. The verdict is “no” and I was less than impressed.

OS Watcher is a series of shell scripts that run on AIX, Solaris, HP-UX and Linux. Simple commands such as ps, top, vmstat, netstat and others are executed at regular intervals and their output is appended to a file in a directory specific to that command.

I have not personally taken the time to play with these scripts (I have my own), but I can read the tea leaves. Oracle support will most likely start asking for this data for any problem you might be having (regardless of whether you have a performance related problem). It might be smart to start collecting this data so you don’t hear something like, “Please reproduce the problem after installing OSW.” Just a thought.

I’ll see if I can arrange a test of how heavy the collection of this data is and blog on what I find. I read through the scripts and it looks like some pretty heavy collection. I never liked performance monitoring tools that carry a heavy “tare weight”. Did any of you use CA Unicenter in the old days?

If you have a Metalink account the toolkit is available here.

RAC Expert or Clusters Expert?

Introducing the Oracle SMP Expert. What is a Spinlock?
I am not joking when I tell you that I met an individual last year who billed himself as an “Oracle SMP expert.” That is fine and dandy, but through the course of our discussion I realized that this person had a severely limited understanding of the most crucial concept in SMP software scalability—critical sections. It wasn’t necessarily the concept of critical sections this individual didn’t understand; it was the mutual exclusion that must accompany critical sections on SMP systems. In Oracle terms, this person could not deliver a coherent definition of what a latch is—that is, he didn’t understand what a spinlock is and how Oracle implements them. An “Oracle SMP expert” who lacks even a cursory understanding of mutual exclusion principles is an awful lot like a “RAC expert” who does not have a firm understanding of what the term “fencing” means.
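For the record, the concept is small enough to sketch. This toy Python latch spins on a non-blocking test-and-set (approximated here with Lock.acquire(blocking=False), since Python has no real atomic word) instead of sleeping, which is the whole point of a spinlock: the critical section is expected to be shorter than a context switch.

```python
import threading

class SpinLatch:
    """Toy spinlock: busy-wait on a test-and-set until the holder releases."""
    def __init__(self):
        self._slot = threading.Lock()  # stands in for the atomic latch word

    def acquire(self) -> int:
        spins = 0
        while not self._slot.acquire(blocking=False):  # test-and-set attempt
            spins += 1                                 # spin; do not sleep
        return spins

    def release(self) -> None:
        self._slot.release()
```

Two sessions hammering the same latch serialize here exactly the way they would on a hot latch in the SGA.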

I have met a lot of “RAC experts” in the last 5 years who lack understanding of clusters principles—most notably what the term “fencing” is and what it means to RAC. Fencing is to clusters what critical sections are to SMP scalability.

Is it possible to be a “RAC expert” without being a cluster expert? The following is a digest of this paper about clusterware I have posted on the Oaktable Network website. For that matter, Julian Dyke and Steve Shaw accepted some of this information for inclusion in this RAC book.

Actually, I think getting it in their book was a part of the bribe for the technical review I did of the book (just joking).

I Adore RAC and Fencing is a Cool Sport!
No, not that kind of fencing. Fencing is a generic clustering term relating to how a cluster handles nodes that should no longer have access to shared resources such as shared disk. For example, if a node in the cluster has access to shared disk but has no functioning interconnects, it really no longer belongs in the cluster. There are several different types of fencing. The most common type came from academia and is referred to by the acronym STOMITH, which stands for Shoot The Other Machine In The Head. A more popular variant of this acronym is STONITH, where “N” stands for Node. While STONITH is a common term, there is nothing common about how it is implemented. The general idea is that the healthy nodes in the cluster are responsible for determining that an unhealthy node should no longer be in the cluster. Once such a determination is made, a healthy node takes action to power-cycle the errant node. This can be done with network power switches, for example. All told, STONITH is a “good” approach to fencing because it is generally built upon the notion that healthy nodes monitor and take action to fence unhealthy nodes.
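A toy sketch of that principle follows. The names are mine, and the power_cycle callback stands in for whatever actually does the deed, such as a network power switch or a management-interface driver.

```python
class StonithMonitor:
    """Healthy-node view of STONITH: track peer heartbeats and
    power-cycle any peer that goes silent for too long."""
    def __init__(self, timeout: float, power_cycle):
        self.timeout = timeout
        self.power_cycle = power_cycle  # e.g., a network power switch driver
        self.last_seen = {}

    def heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def check(self, now: float) -> list:
        stale = [n for n, t in self.last_seen.items() if now - t > self.timeout]
        for node in stale:
            self.power_cycle(node)   # the healthy side takes the action
            del self.last_seen[node]
        return stale
```

Notice who does the work: the decision and the action live on the healthy nodes, which is exactly what distinguishes STONITH from a node rebooting itself.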

This differs significantly from the “fencing” model implemented in Oracle Clusterware, which doesn’t implement STONITH at all. In Oracle Clusterware, nodes fence themselves by executing the reboot(8) command out of the /etc/init.d/init.cssd script. This is a very portable approach to “fencing”, but it raises the question of what happens if the node is so unhealthy that it cannot successfully execute the reboot(8) command. Certainly we’ve all experienced systems that were so incapacitated that commands no longer executed (e.g., complete virtual memory depletion). In a cluster it is imperative that nodes be fenced when needed, otherwise they can corrupt data. After all, there is a reason the node is being fenced. Having a node with active I/O paths to shared storage after it is supposed to be fenced from the cluster is not a good thing.

Oracle Clusterware and Vendor Clusterware in Parallel
On all platforms, except Linux and Windows, Oracle Clusterware can execute in an integrated fashion with the host clusterware. An example of this would be Oracle10g using the libskgx[n/p] libraries supplied by HP for the MC ServiceGuard environment. When Oracle runs with integrated vendor clusterware, Oracle makes calls to the vendor-supplied library to perform fencing operations. This blog post is about Linux, so the only relationship between vendor clusterware and Oracle clusterware is when Oracle-validated compatible clusterware runs in parallel with Oracle Clusterware. One such example of this model is Oracle10g RAC on PolyServe Matrix Server.

In situations where Oracle’s fencing mechanism is not able to perform its fencing operation, the underlying validated host clusterware will fence the node, as is the case with PolyServe Matrix Server. It turns out that the criteria used by Oracle Clusterware to trigger fencing are the same criteria that host clusterware uses to take action. Oracle instituted the Vendor Clusterware Validation suites to ensure that underlying clusterware is compatible and complements Oracle clusterware. STONITH is one form of fencing, but far from the only one. PolyServe supports a sophisticated form of STONITH where the healthy nodes integrate with management interfaces such as Hewlett-Packard’s iLO (Integrated Lights-Out) and Dell DRAC. Here again, the most important principle of clustering is implemented—healthy nodes take action to fence unhealthy nodes— which ensures that the fencing will occur. This form of STONITH is more sophisticated than the network power-switch approach, but in the end they do the same thing—both approaches power-cycle unhealthy nodes. However, it is not always desirable to have an unhealthy server power-cycled just for the sake of fencing.

Fabric Fencing
With STONITH, there could be helpful state information lost in the power reset. Losing that information may make cluster troubleshooting quite difficult. Also, if the condition that triggered the fencing persists across reboots, a “reboot loop” can occur. For this reason, PolyServe implements Fabric Fencing as the preferred option for customers running Real Application Clusters. Fabric Fencing is implemented in the PolyServe SAN management layer. PolyServe certifies a comprehensive list of Fibre Channel switches that are tested with the Fabric Fencing code. All nodes in a PolyServe cluster have LAN connectivity to the Fibre Channel switches. With Fabric Fencing, healthy nodes make SNMP calls to the Fibre Channel switch to disable all SAN access from unhealthy nodes. This form of fencing is built upon the sound principle of having healthy servers fence unhealthy servers, but the fenced server is left in an “up” state—yet completely severed from shared disk access. Administrators can log into it, view logs and so on, but before the node can rejoin the cluster, it must be rebooted.

Kernel Mode Clusterware
The most important aspect of host clusterware, such as PolyServe, is that it is generally implemented in Kernel Mode. In the case of PolyServe, the most critical functionality of SAN management, cluster filesystem, volume manager and so on is implemented in Kernel Mode on both Linux and Windows. On the other hand, when fencing code is implemented in User Mode, there is always the risk that the code will not get processor cycles to execute. Indeed, with clusters in general, overly saturated nodes often need to be fenced because they are not responding to status requests by other nodes in the cluster. When nodes in the cluster are getting so saturated as to trigger fencing, having critical clusterware code execute in Kernel Mode provides a higher level of assurance that the fencing operation will succeed. That is, if all the nodes in the cluster are approaching a critical state and a fencing operation is necessary against an errant node, having Kernel Mode fencing architected as either robust STONITH or Fabric Fencing ensures the correct action will take place.

Coming Soon
What about SCSI-3 Persistent Reservation? Isn’t I/O fencing as good as server fencing? No, it isn’t.

Data Direct Networks and Texas Memory Systems. Should Be Fun.

Fun Weekend Ahead
Our cool lab manager has reconfigured the DDN storage allocated to my cluster of DL585s so that I get 65 spindles per LUN. This particular cluster had been hobbled with LUNs derived from only 16 spindles. Looks like I get to do some reconfiguration of RAC on the cluster this weekend. The good thing is that the cluster also has a Texas Memory Systems Solid State Disk (Another PolyServe Partner) configured for Redo logging. I should be able to get some good performance and stability readings.

This cluster always has ASM and PolyServe CFS set up side-by-side. Makes for good fun.

DBWR Multiblock Writes? Yes, Indeed!

Learning Something New
Learning Oracle is a never-ending effort. “OK, tell me something I didn’t know,” you say? You may know this bit I’m about to blog about, but I sure didn’t. I don’t know when the Database Writer I/O profile changed, but it has—somewhere along the way.

I have a simple test of 10gR2 (10.2.0.1) on RHEL4 x86_64 using filesystem files. I was using strace(1) on the single DBWR process I have configured for this particular instance. The database uses an 8KB block size and there are no variable block sizes anywhere (pools are not even configured). The workload I’m running while monitoring DBWR is quite simple. I have a tablespace called TEST created using Oracle Managed Files (OMF)—thus the peculiar filenames you’ll see in the screen shots below. I have 2 tables in the TEST tablespace and am simply looping INSERT INTO SELECT * FROM statements between the 2 tables as the stimulus to get DBWR busy.

In the following screen shot you’ll see that I took a look at the file descriptors DBWR is using by listing a few of them out in /proc/<pid>/fd. The interesting file descriptors for this topic are:

  • FD 18 – The SYSTEM tablespace
  • FD 19 – The UNDO tablespace
  • FD 20 – The SYSAUX tablespace
  • FD 23 – The TEST tablespace

NOTE: Please right-click the image to open it in a viewer. Some readers have reported a problem, but we’ve found that it is as simple as clicking it. I need to investigate whether that is something WordPress is doing.

[Screenshot: DBWR file descriptors listed in /proc/<pid>/fd]

I have purposefully set up this test to not use libaio; thus, filesystemio_options was not set in the parameter file. In the next screen shot I use grep to pull all occurrences of the pwrite(2) system calls DBWR is making that are not 8KB in size. Historically there should be none, since DBWR’s job is to clean scattered SGA buffers by writing single blocks to random file offsets. That has always been DBWR’s lot in life.

[Screenshot: strace output showing DBWR pwrite calls larger than 8KB]

So, as strace(1) is showing, these days DBWR is exhibiting a variation of its traditional I/O profile. In this synchronous I/O case, on this port of Oracle, DBWR is performing synchronous multi-block writes to sequential blocks on disk! That may seem like trivial pursuit, but it really isn’t. First, where are the buffers? The pwrite() system call does not flush scattered buffers as do such routines as writev(), lio_listio() or odm_io()—it is not a gathered write. So if DBWR’s job is to build write batches by walking LRUs and setting up write-lists by LRU age, how is it magically finding SGA buffers that are adjacent in memory and bound for sequential offsets in the same file? Where is the Twilight Zone soundtrack when you need it? For DBWR to issue these pwrite() system calls, the buffers must be contiguous in memory.

Of course, DBWR also behaved in a more “predictable” manner during this test, as the following screen shot shows:

[Screenshot: strace output showing traditional single-block 8KB pwrite calls]

Is This A Big Problem?
No, I don’t think so—unless you’re like me and have had DBWR’s I/O profile cast in stone dating back to Version 6 of Oracle. All this means is that when you are counting DBWR write calls, you can’t presume they are always single block. Now that is something new.

Porting
I often point out that Microsoft has always had it quite easy with SQL Server. I’m not doing that pitiful Microsoft bashing that I despise so much, but think about it. SQL Server started with a fully functional version of Sybase (I was a Sybase fan) and was brought to market on 1 operating system and only 1 platform (x86). Since then their “porting” effort has “exploded” into x86, x86_64 and IA64. That is the simple life.

This DBWR “issue” may only be relevant to this Oracle revision and this specific port. I can say with great certainty that if Sequent were still alive, the Sequent port would have used one of the gathered write calls we had at our disposal in this particular case. With Oracle, so much of the runtime is determined by porting decisions. So the $64,000 question is, why isn’t DBWR just using writev(2), which would nicely clear all those SGA buffers from their memory-scattered locations?

Things That Make You Go Hmmmmm

Marketing Efforts Prove SunFire T2000 Is Not Fit For Oracle.

I’ll try not to make a habit of referencing a comment on my blog as subject matter for a new post, but this one is worth it. One of my blog readers posted this comment. The question posed was what performance effect there would be with DSS or OLTP with the Sun CoolThreads architecture—given it has a single FPU shared by 8 cores. The comment was:

I have heard though that the CoolThread processors are not always great at supporting databases because they only have a single floating point processor? Would you see this as a problem in either a OLTP or DSS environment that don’t have any requirement for calculations that may involve floating points?

Since this is an Oracle blog, I’ll address this with an Oracle-oriented answer. I think most of you know how much I dislike red herring marketing techniques, so I’ll point out that there has been a good deal of web-FUD about the fact that the 8 cores in the CoolThreads architecture all share a single floating point unit (FPU). Who cares?

Oracle and Floating Point
The core Oracle kernel does not, by and large, use floating point operations. There are some floating point ops in layers that don’t execute at high frequency and are therefore not of interest. Let’s stick to the things that happen thousands or tens of thousands of times per second (e.g., buffer gets, latching, etc.). And, yes, there are a couple of new 10g native float datatypes (e.g., BINARY_FLOAT, BINARY_DOUBLE), but how arithmetic operations are performed on these is a porting decision. That is, the team that ports Oracle to a given architecture must choose whether the handling of these datatypes is done with floating point operations or not. Oracle documentation on the matter states:

The BINARY_FLOAT and BINARY_DOUBLE types can use native hardware arithmetic instructions…

Having a background in port-level engineering of Oracle, I’ll point out that the word “can” in this context is very important. I have a query out about whether the Solaris ports do indeed do this, but what is the real impact either way?

At first glance one might expect an operation like select sum(amt_sold) to benefit significantly if the amt_sold column were defined as a BINARY_FLOAT or BINARY_DOUBLE, but that is just not so. Oracle documentation is right to point out that machine floating point types are, uh, not the best option for financial data. The documentation reads further:

These types do not always represent fractional values precisely, and handle rounding differently than the NUMBER types. These types are less suitable for financial code where accuracy is critical.

So those folks out there who are trying to market against CoolThreads based largely on its lack of good FPU support can forget the angle of poor database performance. It is a red herring. Well, OK, maybe there is an application out there that is not financial and would like to benefit from the fact that a BINARY_FLOAT is 4 bytes of storage whereas a NUMBER is 21 bytes. But there again I would have to see real numbers from a real test to believe there is any benefit. Why? Remember that accesses to a row with a BINARY_FLOAT column are prefaced with quite a bit of SGA code that is entirely integer. Not to mention the fact that it is unlikely a table would have only that column in it. All the other adjacent columns add overhead in the caching and fetching of this nice, new, small BINARY_FLOAT column. All the layers of code to parse the query, construct the plan, allocate heaps and so on are mostly integer operations. Then to access each row piece in each block is laden with cache gets/misses (logical I/O) and the necessary physical I/O. For each potential hardware FPU operation on a BINARY_FLOAT column there are orders of magnitude more integer operations.

All that “theory” aside, it is entirely possible to actually measure before we mangle, as the cliché goes. Once again, thanks to my old friend Glenn Fawcett for a pointer to a toolkit for measuring floating point operations.

Why the Passion?
I remember the FUD marketing that competitors tried to use against Sequent when the infamous Pentium FDIV bug was found. That bug had no effect on the Sequent port of Oracle. It seems that was a subtle fact that the marketing personnel working for our competitors missed, because they went wild with it. See, at the time Sequent was an Oracle server powerhouse with systems based deep at the core on Intel processors (envision a 9 square foot board loaded with ASICs, PALs and other goodies with a little Pentium dot in the middle). Sequent was the development platform for Unix Oracle Parallel Server and intra-node Parallel Query. Oracle ran their entire datacenter on Sequent Symmetry systems at the time (picture 100+ refrigerator-sized chassis lined up in rows at Redwood Shores) and Oracle Server Technologies ran their nightly regression testing against Sequent Symmetry systems as well. Boring, I know. But I was in Oracle Advanced Engineering at the time and I didn’t appreciate the FUD marketing that our competitors (whose systems were RISC based) tried to play up with the supposed impact of that bug on Oracle performance on Sequent gear. I do not like FUD marketing. If you are a regular reader of my blog I bet you know what other current FUD marketing I particularly dislike.

More to the point of CoolThreads, I’ve seen web content from companies using what I consider to be red herring marketing against the SunFire T[12]000 family of servers. I am probably one of the biggest proponents of fair play out there and suggesting CoolThreads technology is not fit for Oracle due to poor FPU support is just not right. Now, does that mean I’d choose a SunFire over an industry standard server? Well, that would be another blog entry.


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.