Archive Page 35

Busy Idle Processes. Huh? The AIX KPROC process called “wait”.

A recent thread on the oracle-l email list was about the AIX 5L KPROC process called “wait”. The email that started the thread reads:

We are reviewing processes on our P690 machine and get the following.

I’ve googled a little bit but can’t find anything of interest. Are these processes that I should be concerned with – should we kill them? A normal ps -ef | grep 45078 does not return the process, so I really can’t figure out what these are.

$ ps auxw | head -10

USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
root 45078 9.3 0.0 48 36 – A Oct 13 120026:37 wait

root 40980 9.0 0.0 48 36 – A Oct 13 116428:47 wait

root 36882 8.9 0.0 48 36 – A Oct 13 114010:26 wait

root 32784 8.8 0.0 48 40 – A Oct 13 113205:56 wait

[…output truncated…]

Another participant in the thread followed up with:

you will find the answer in:

http://www-304.ibm.com/jct09002c/isv/tech/faq/individual.jsp?oid=1:89156

And yet another good member of the list added:

Also, the reason you don’t see it with “ps -ef” is that ps doesn’t show kernel processes by default – you have to specify the “-k” flag, e.g.:

/opt/oracle ->ps -efk|grep wait

root 8196 0 0 Nov 11 – 720:31 wait
root 53274 0 0 Nov 11 – 3628:35 wait
root 57372 0 0 Nov 11 – 554:40 wait
root 61470 0 0 Nov 11 – 1883:24 wait
[…output truncated…]

So What Do I Have To Add?
So why am I blogging about this if the mystery has been explained? Well, I think having a kernel process attributed with time when the processor is in the idle loop is just strange. Microprocessors only have two states: running and idle. On a Unix system, the running state is attributed to either user or kernel mode. Attributing the idle state to anything is like charging nothing to something.

Yes, I suppose I’m nit-picking. There is something about the running state that I find many people do not know, and it has to do with processor efficiency. Regardless of which mode—user or kernel—the processor monitoring tools can only report that the processor was idle or not. That’s all. Processor monitoring tools (e.g., vmstat, sar, etc.) cannot report processor efficiency. Remember that a processor is not always getting work done efficiently. Not that there is anything you can do about it, but a processor running in either mode while accessing heavily contended memory is getting very little work done per cycle. The term CPI (cycles per instruction) is used to represent this efficiency. Think of it this way: if a CPU accesses a memory location in cache, the instruction completes in a couple of CPU cycles. If the processor is accessing a word in a memory line that is being completely hammered by other processors (shared memory), that single instruction will stall the processor until it completes. As such, the workload is said to execute with a high CPI.
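If you want to see the effect for yourself, the following is a minimal sketch (my own illustration, nothing Oracle-specific) that uses the x86 timestamp counter to time a chain of dependent loads, first over a small cache-resident array and then over one far too large to cache. The cycles-per-load figure is a crude stand-in for CPI, and it changes dramatically between the two runs even though vmstat would report the processor as 100% busy either way. The file name and array sizes are arbitrary, and it assumes gcc on x86/x86_64.

/* cpi.c -- a rough, home-grown illustration of cycles-per-load (a crude
 * stand-in for CPI).  It times a chain of dependent loads over a small,
 * cache-resident array and then over a large one. */
#include <stdio.h>
#include <stdlib.h>

static inline unsigned long long rdtsc(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}

/* time 'loads' dependent loads through a pointer chain of n elements */
static unsigned long long chase(size_t n, size_t loads)
{
    volatile size_t *chain = malloc(n * sizeof(size_t));
    size_t i, next = 0;
    unsigned long long t0, t1;

    if (chain == NULL)
        return 0;

    /* build a strided permutation so the hardware prefetcher gets little help */
    for (i = 0; i < n; i++)
        chain[i] = (i + 4099) % n;

    t0 = rdtsc();
    for (i = 0; i < loads; i++)
        next = chain[next];          /* each load depends on the previous one */
    t1 = rdtsc();

    free((void *) chain);
    return (t1 - t0) / loads;
}

int main(void)
{
    size_t loads = 10 * 1000 * 1000;

    printf("cache resident (8KB):    ~%llu cycles/load\n",
           chase(8 * 1024 / sizeof(size_t), loads));
    printf("memory resident (256MB): ~%llu cycles/load\n",
           chase(256UL * 1024 * 1024 / sizeof(size_t), loads));
    return 0;
}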

There you have it, some trivial pursuit.

What Does This Have To Do With Oracle?
Well, I’ll give you an example. A process spinning on a latch is executing the test loop in cache. The loop executes at a very, very low CPI. So if you have a lot of processes routinely spinning on latches, you have a low CPI—but that doesn’t mean you are getting any throughput. Latch contention is just a tax, if you will. When the latch is released, the processors that are spinning get a cacheline invalidation. They immediately read the line again. The loading of that line brings the CPI way up for a moment as the line is installed into cache, and on and on it goes. The “ownership” of the memory line with the latch structure just ping-pongs around the box. Envision a bunch of one-armed people standing around passing a hot potato. Yep, that about covers it. No, not actually. Somewhere there has to be a copy of the potato and a race to get back to the original. Hmmm, I’ll have to work on that analogy—or take an interest in hierarchical locking. <smiley>
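To make the spin loop concrete, here is a bare-bones test-and-test-and-set sketch. It is my own illustration, not Oracle’s latch implementation, and it assumes gcc’s __sync builtins. The inner while loop spins on the locally cached copy of the line (the very low CPI part); the atomic exchange is what yanks the line across the box and invalidates every other spinner’s copy (the hot potato part).

/* latchspin.c -- a bare-bones test-and-test-and-set "latch".  My own
 * illustration, not Oracle's latch code.  Assumes gcc atomic builtins. */

static volatile int latch = 0;   /* the shared word; its cache line is the hot potato */

void latch_get(void)
{
    for (;;) {
        /* spin on the locally cached copy: blistering low CPI, zero useful work */
        while (latch != 0)
            ;
        /* the atomic exchange drags the line over in exclusive state and
         * invalidates every other spinner's cached copy */
        if (__sync_lock_test_and_set(&latch, 1) == 0)
            return;              /* we got the latch */
    }
}

void latch_release(void)
{
    __sync_lock_release(&latch); /* store 0; every spinner misses and re-reads the line */
}

int main(void)
{
    latch_get();                 /* single-threaded demo: take it and release it once */
    latch_release();
    return 0;
}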

Therein lies the reason that just a few contended memory lines holding really popular Oracle latches (e.g., redo allocation, hot chains latches, etc.) can account for a reasonable percentage of the work that gets done on an Oracle system. On the other hand, systems with really balanced processor/memory capabilities (e.g., System p, Opteron on HyperTransport, etc.), and systems with very few processors, don’t have much trouble with this stuff. And, of course, Oracle is always working to eliminate singleton latches as well.

 

Analysis and Workaround for the Solaris 10.2.0.3 Patchset Problem on VxFS Files

In the blog entry about the Solaris 10.2.0.3 patchset not functioning on VxFS, I reported that Metalink says the patchset does not work on VxFS. That is true. Since the Metalink notes have not been updated, I’ll blog a bit about what I’ve found out. Note, the Metalink note says not to use the patchset because of this bug. I am not here to fight Oracle support.

It turns out that what is happening is the Solaris porting group is now using an ioctl() that is not supported on VxFS files—but not calling the ioctl(2) directly. The bug results in an error stack a bit like this:

ORA-01501: CREATE DATABASE failed
ORA-00200: control file could not be created
ORA-00202: control file: '/some/path/control01.ctl'
ORA-27037: unable to obtain file status
SVR4 Error: 25: Inappropriate ioctl for device

The text in bug number 5747918 is nice enough to include the output of truss when the problem happens. The ioctl() is _ION. This is the ioctl(2) that is implemented within the directio(3C) library routine. No, don’t believe this developers.sun.com webpage when it refers to directio(3C) as a system call. It isn’t. However, they do provide an example of using the directio(3C) call in this small directio(3C) test program.

The Solaris directio(3C) call is used to push direct I/O onto a file. In the demonstration of the bug (5747918), the 10.2.0.3 patchset is trying to push direct I/O onto the file descriptor held on the control file stored in VxFS. That isn’t how you get direct I/O on VxFS. I wonder if this call to directio(3C) only happens if you have filesystemio_options=DirectIO|setall. That would make sense.

Workaround
If you use ODM on VxFS, this call to directio(3C) does not occur, so you won’t see the problem. Thanks to a reader comment on my blog and my age-old friend still at Veritas (I mean Symantec) for verification that ODM works around the problem.

A Test Program

If you create a file in a VxFS mount called “foo”, like this:

$ dd if=/dev/zero of=foo bs=4096 count=16

And then compile and run the following small program, you will see the same problem Oracle 10.2.0.3 is exhibiting. The same program on UFS should work fine.

$ cat t.c
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/fcntl.h>

int
main(void)
{
	int ret, handle;

	/* note: no error check on open(2); this is just a quick test */
	handle = open("./foo", O_RDONLY);
	if ((ret = directio(handle, DIRECTIO_ON)) < 0)
	{
		printf("Failure : return code is: %d\n", ret);
	}
	else
	{
		printf("The ioctl embedded in the directio(3C) call functions on the file.\n");
	}
	return 0;
} /* End */

Another Potential Workaround
If you want to test the rest of what 10.2.0.3 has to offer without ODM—even with a VxFS database—I think you should be able to explicitly set filesystemio_options=none and get around the problem. Be aware, however, that I have not tested that. The worst thing that could happen is that setting filesystemio_options in this manner is indeed a workaround that would allow you to test the many other reasons you actually need 10.2.0.3!

If you find otherwise, please comment on the blog.

Oracle 10.2.0.3 Patchset is Not Functional with Solaris SPARC 64-bit and Veritas Filesystem

A Faulty Patchset
One of the members of the oracle-l email list just posted that 10.2.0.3 is a non-functional patchset for Solaris SPARC 64-bit if you are using Veritas VxFS for datafiles. I just checked Metalink and found that this is Oracle Bug 5747918, covered in Metalink Note 405825.1.

According to the email on oracle-l:

Applies to:

Oracle Server – Enterprise Edition – Version: 10.2.0.3 to 10.2.0.3
Oracle Server – Standard Edition – Version: 10.2.0.3 to 10.2.0.3
Solaris Operating System (SPARC 64-bit)
Oracle Database Server 10.2.0.3
Sun Solaris Operating System
Veritas Filesystem

Solution

Until a patch for this bug is available, or 10.2.0.3 for Solaris is re-released, you should restore from backups made before the attempted upgrade. Do not attempt to upgrade again until a fix for this issue is available. Bug:5747918 is published on Metalink if you wish to follow the progress.

I’d stay with 10.2.0.2 I think.

Comparing 10.2.0.1 and 10.2.0.3 Linux RAC Fencing. Also, Fencing Failures (Split Brain).

BLOG UPDATE 2011.08.11 : For years my criticism of Oracle Clusterware fencing methodology brought ire from many who were convinced I was merely a renegade. The ranks of “the many” in this case were generally well-intended but overly convinced that Oracle was the only proven clustering technology in existence.  It took many years for Oracle to do so, but they did finally offer support for IPMI fencing integration in the 11.2 release of Oracle Database. It also took me a long time to get around to updating this post.  Whether by graces of capitulation or a reinvention of the wheel, you too can now, finally, enjoy a proper fencing infrastructure. For more information please see: http://download.oracle.com/docs/cd/E11882_01/rac.112/e16794/admin.htm#CHDGIAAD

I’ve covered the clusters concept of fencing quite a bit on this blog (e.g., RAC Expert or Clusters Expert and Now is the Time to Open Source, etc.), in papers such as this paper about clusterware, and in an appendix in the Julian Dyke/Steve Shaw book about RAC on Linux. If I’ve said it once, I’ve said it 1000 times: if you are not a clusters expert you cannot be a RAC expert. Oddly though, Oracle seems to be sending a message that clusterware is commoditized—and it really isn’t. On the other hand, Oracle was brilliant for heading down the road of providing their own clusterware. Until all the kinks are worked out, it is good to know as much as you can about what is under the covers.

Linux RAC “Fencing”
As I’ve pointed out in the above referenced pieces, Oracle “fencing” is not implemented by healthy servers taking action against rogue servers (e.g., STONITH); instead, the server that needs to be “fenced” is sent a message. With that message, the sick server will then reboot itself. Of course, a sick server might not be able to reboot itself. I call this form of fencing ATONTRI (Ask The Other Node To Reboot Itself). This blog entry is not intended to bash Oracle clusterware “fencing”—it is what it is, it works well, and for those who choose there is the option of running integrated legacy clusterware or validated third-party clusterware to fill in the gaps. Instead, I want to blog about a couple of interesting observations and then cover some changes that were implemented to the Oracle init.cssd script in 10.2.0.3 that you need to be aware of.

Logging When Oracle “Fences” a Server
As I mentioned in this blog entry about the 10.2.0.3 CRS patchset, I found 10.2.0.1 CRS—or is that “clusterware”—to be sufficiently stable to just skip over 10.2.0.2. So what I’m about to point out might be old news to you folks. The logging text produced by Oracle clusterware changed between 10.2.0.1 and 10.2.0.3. But, since CRS has a fundamental flaw in the way it logs this text, you’d likely never know it.

Lots of Looking Going On
As an aside, one of the cool things about blogging is that I get to track the search terms folks use to get here. Since the launch of my blog, I’ve had over 11000 visits from readers looking for information about the most common error message returned if you have a botched CRS install on Linux—that text being:

PROT-1: Failed to initialize ocrconfig

No News Must Be Good News
I haven’t yet blogged about the /var/log/messages entry you are supposed to see when Oracle fences a server, but if I had, I don’t think it would be a very common google search string anyway. No, the reason isn’t that Oracle so seldom needs to fence a server. The reason is that the text almost never makes it into the system log. Let’s dig into this topic.

The portion of the init.cssd script that acts as the “fencing” agent in 10.2.0.1 is coded to produce the following entry in the /var/log/messages file via the Linux logger(1) command (line numbers precede code):

194 LOGGER="/usr/bin/logger"
[snip]
1039 *)
1040 $LOGERR "Oracle CSSD failure. Rebooting for cluster integrity."
1041
1042 # We want to reboot here as fast as possible. It is imperative
1043 # that we do not flush any IO to the shared disks. Choosing not
1044 # to flush local disks or kill off processes gracefully shuts
1045 # us down quickly.
[snip]
1081 $EVAL $REBOOT_CMD

Let’s think about this for a moment. If Oracle needs to “fence” a server, the server that is being fenced should produce the following text in /var/log/messages:

Oracle CSSD failure. Rebooting for cluster integrity.

Where’s Waldo?
Why is it that when I google for “Oracle CSSD failure. Rebooting for cluster integrity” I get 3, count them, 3 articles returned? Maybe the logger(1) command simply doesn’t work? Let’s give that a quick test:

[root@tmr6s14 log]# logger "I seem to be able to get messages to the log"
[root@tmr6s14 log]# tail -1 /var/log/messages
Jan 9 15:16:33 tmr6s14 root: I seem to be able to get messages to the log
[root@tmr6s14 log]# uname -a
Linux tmr6s14 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux

Interesting. Why don’t we see the string Oracle CSSD failure when Oracle fences then? It’s because the logger(1) command merely sends a message to syslogd(8) via a socket—and then it is off to the races. Again, back to the 10.2.0.1 init.cssd script:

22 # FAST_REBOOT - take out the machine now. We are concerned about
23 # data integrity since the other node has evicted us.
[…] lines deleted
177 case $PLATFORM in
178 Linux) LD_LIBRARY_PATH=$ORA_CRS_HOME/lib
179 export LD_LIBRARY_PATH
180 FAST_REBOOT="/sbin/reboot -n -f"

So at line 1040, the script sends a message to syslogd(8) and then immediately forces a reboot at line 1081—with the -n option to the reboot(8) command forcing a shutdown without a sync(1). So there you have it: the text is drifting between the bash(1) context executing the init.cssd script and the syslogd(8) process, which would do a buffered write anyway. I think the planets must really be in line for this text to ever get to the /var/log/messages file—and I think the google search for that particular string goes a long way towards backing up that notion. When I really want to see this string pop up in /var/log/messages, I fiddle with putting sync(1) commands and a sleep before line 1081. That is what I do when, for instance, I am pulling physical connections from the Fibre Channel SAN paths and studying how Oracle behaves by default.
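For the curious, here is the race rendered as a hypothetical C miniature. This is my own sketch of the sequence, not Oracle’s code: syslog(3) just queues a datagram for syslogd(8), and an immediate unsynced reboot gives syslogd little or no chance to read it, let alone get it flushed to /var/log/messages. The commented-out sleep/sync pair is essentially the fiddling I described above.

/* A hypothetical miniature of the init.cssd race -- not Oracle's code.
 * Must run as root, and it really does reboot the box, so do not run it
 * anywhere you care about. */
#include <syslog.h>
#include <unistd.h>
#include <sys/reboot.h>

int main(void)
{
    /* like logger(1): hand a datagram to syslogd and return immediately */
    openlog("init.cssd", LOG_PID, LOG_DAEMON);
    syslog(LOG_ERR, "Oracle CSSD failure. Rebooting for cluster integrity.");

    /* the fiddling that makes the message actually land in the file:
     *   sleep(2);
     *   sync();
     */

    reboot(RB_AUTOBOOT);    /* the moral equivalent of /sbin/reboot -n -f */
    return 0;
}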

By the way, the comments at lines 22-23 are the definition of ATONTRI.

Paranoia?
I’ve never understood that paranoia at lines 1042-1043 which state:

We want to reboot here as fast as possible. It is imperative that we do not flush any IO to the shared disks.

It may sound a bit nit-picky, but folks, this is RAC and there are no buffered writes to shared disk! No matter, really; even if there were a sync(1) command at line 1080 in the 10.2.0.1 init.cssd script, the likelihood of getting text to /var/log/messages is still going to be a race, as I’ve pointed out.

Differences in 10.2.0.3
Google searches for fencing articles anchored with the Oracle CSSD failure string are about to get even scarcer. In 10.2.0.3, the text that the script attempts to send to the /var/log/messages file changed—the string no longer contains CSSD, but CRS instead. The following is a snippet from the init.cssd script shipped with 10.2.0.3:

452 *)
453 $LOGERR "Oracle CRS failure. Rebooting for cluster integrity."

A Workaround for a Red Hat 3 Problem in 10.2.0.3 CRS
OK, this is interesting. In the 10.2.0.3 init.cssd script, there is a workaround for some RHEL 3 race condition. I would be more specific about this, but I really don’t care about any problems init.cssd has in its attempt to perform fencing, since for me the whole issue is moot. PolyServe is running underneath it and PolyServe is not going to fail a fencing operation. Nonetheless, if you are not on RHEL 3, and you deploy bare-bones Oracle-only RAC (e.g., no third-party clusterware for fencing), you might take interest in this workaround since it could cause a failed fencing. That’s split-brain to you and me.

Just before the actual execution of the reboot(8) command, every Linux system running 10.2.0.3 will now suffer the overhead of the code starting at line 489, shown in the snippet below. The builtin test of the variable $PLATFORM is pretty much free, but if for any reason you are on RHEL 4, Novell SuSE SLES9 or even Oracle Enterprise Linux (who knows how they attribute versions to that), the code at line 491 is unnecessary and could put a full stop to the execution of this script if the server is in deep trouble—and remember, fencings are supposed to handle deeply troubled servers.

Fiddle First, Fence Later
Yes, the test at line 491 is a shell builtin, no argument, but as line 226 shows, the shell command at line 491 is checking for the existence of the file /var/tmp/.orarblock. I haven’t looked, but bash(1) is most likely just calling stat(2) on the path and returning true from test -e if the call succeeds and false if not. In the end, however, if checking for the existence of a file in /var/tmp is proving difficult at the time init.cssd is trying to “fence” a server, this code is pretty dangerous since it can cause a failed fencing on a Linux RAC deployment. Further, at line 494 the script will need to open a file and write to it. All this on a server that is presumed sick and needs to get out of the cluster. Then again, who is to say that the bash process executing the init.cssd script is not totally swapped out permanently due to extreme low-memory thrashing? Remember, servers being told to fence themselves (ATONTRI) are not healthy. Anyway, here is the relevant snippet of 10.2.0.3 init.cssd (and, just after it, a quick sketch of what those file operations boil down to):

226 REBOOTLOCKFILE=/var/tmp/.orarblock
[snip]
484 # Workaround to Redhat 3 issue with multiple invocations of reboot.
485 # Here if oclsomon and ocssd are attempting a reboot at the same time
486 # then the kernel could lock up. Here we have a crude lock which
487 # doesn't eliminate but drastically reduces the likelihood of getting
488 # two reboots at once.
489 if [ "$PLATFORM" = "Linux" ]; then
490 CEDETO=
491 if [ -e "$REBOOTLOCKFILE" ]; then
492 CEDETO=`$CAT $REBOOTLOCKFILE`
493 fi
494 $ECHO $$ > $REBOOTLOCKFILE
495
496 if [ ! -z "$CEDETO" ]; then
497 REBOOT_CMD="$SLEEP 0"
498 $LOGMSG "Oracle init script ceding reboot to sibling $CEDETO."
499 fi
500 fi
501
502 $EVAL $REBOOT_CMD
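As promised, here is what the check at line 491 and the write at line 494 boil down to, rendered as a hypothetical C fragment of my own (not Oracle’s code). The point is not the code itself; it is that both operations depend on /var/tmp being healthy and responsive on a server we already presume is anything but.

/* What lines 491-494 boil down to -- a hypothetical C rendering, not
 * Oracle's code: an existence check followed by a write into /var/tmp,
 * two filesystem operations a deeply troubled server may never complete. */
#include <sys/stat.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct stat sb;

    /* bash's "test -e" amounts to this stat(2); on a thrashing server, or
     * one that has lost the storage under /var, this call can simply hang */
    if (stat("/var/tmp/.orarblock", &sb) == 0)
        printf("lock file exists: cede the reboot to the PID recorded in it\n");

    /* ...and line 494 then writes our PID into the same file */
    FILE *f = fopen("/var/tmp/.orarblock", "w");
    if (f != NULL) {
        fprintf(f, "%d\n", (int) getpid());
        fclose(f);
    }
    return 0;
}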

Blog Content and Format Change Announcement

This is just a quick announcement to point out that I have done a little format clean-up on the blog. The blog is about clustering and other platform topics related to Oracle, but the fact of the matter is that most IT shops that care about Oracle platform (especially clustering) topics also likely deploy non-Oracle databases.

So, I am going to start posting some stuff here along those lines. Yes, the first thing I’ve just posted there is about SQL Server 2005 consolidation and scale-out (shared database concurrent) reporting. This page has a whitepaper that covers new SQL Server 2005 functionality which allows databases in the PolyServe Database Utility for SQL Server to scale out for concurrent reporting on up to 16 servers in the cluster. Switching from normal OLTP mode to scale-out reporting mode does not require any replication or structural changes to the data, and the mode change occurs in less than 1 minute. Switching back to normal OLTP mode is just the opposite operation and also takes less than one minute. There are no physical storage manipulations (e.g., filesystem remounting) and no server reboots involved.

The clusterdeconfig Tool: Completely Cleaning Up After a Botched Oracle Clusterware Installation

I haven’t seen a lot of chatter about the Oracle Database Deinstallation Tool for Oracle Clusterware and Real Application Clusters on the web. In fact, a search in Metalink for the name of the actual tool—clusterdeconfig—returned no documents or Metalink forum threads with mention of the tool. I found that to be strange. This is a very helpful tool because things can go wrong when installing CRS and having a deinstall tool is better than the typical wild rm(1) command execution that is usually necessary to get back to a clean state for an installation retry.

Finding the Tool
That was a chore but I did find it so I thought I’d pass on a link to you. The following is a link to the Zip file. I hope you have a fast internet connection because it is over 60MB:

http://web51-01.oracle.com/otndocs/products/clustering/deinstall/clusterdeconfig.zip

The sshUserSetup.sh Script
When you unzip the clusterdeconfig.zip file you’ll notice it contains a script called sshUserSetup.sh that you may find helpful in setting up pass-through ssh.

Real Priorities Today
There, I blogged. But the real priority today is to go get some Dim Sum…so I’m about to shut off my lapt <fizzt>

A Successful Application of 10.2.0.3 CRS Patchset on RHEL4 x86_64. So?

Upgrading CRS to 10.2.0.3 on RHEL4 x86_64
It is quite likely I’m the last person to get around to updating my 10gR2 CRS—er, clusterware—with the 10.2.0.3 patchset. Why? Well, upgrades always break something and since 10.2.0.1 CRS was really quite stable for the specific task of node membership services (libskgxn.so), I was happy to stay with it and skip 10.2.0.2. Compared to the offal we referred to as 10.1 CRS, I have been very happy with 10gR2 CRS for the main job of CRS (which is monitoring node health). Fencing is another topic as I’ve blogged about before.

Oh, Great, He’s Blogging Screen Shots of Stuff Working Fine
Well, I can’t think of anything more boring to look at than a screen shot of a successful execution of an upgrade script. With the 10.2.0.3 upgrade it is root102.sh—the root script that OUI instructs you to execute in $ORA_CRS_HOME after it finishes such activities as copying pre-10.2.0.3 files over to ${ORA_CRS_HOME}/install/prepatch10203 and so on. So why am I blogging on a successful application of this patchset?

Knowing How Bad Something Has Failed—and Where
Oftentimes when RAC installations and patch applications go awry—a very frequent ordeal—it is nice to see what you should have seen at the point where things went wrong. Such clues can sometimes be helpful. It is for this reason that when I—and others in my group—write install guides for Oracle products on our Database Utility for Oracle clustering package, I often include a lot of boring screen shots.

Testing a Rolling Application of 10.2.0.3 CRS
As described later in this post it is fully supported to implement a shared ORA_CRS_HOME—as it is on OCFS2 and Red Hat GFS. In fact, there are several permutations of supported configurations to choose from:

  • Local CRS HOME, raw disk OCR/CSS
  • Local CRS HOME, CFS OCR/CSS
  • Local CRS HOME, NFS OCR/CSS
  • Shared CRS HOME, raw disk OCR/CSS
  • Shared CRS HOME, CFS OCR/CSS
  • Shared CRS HOME, NFS OCR/CSS

As a normal part of my testing, I wanted to make sure that storing the OCR and CSS disks on the PolyServe CFS in no way impacts the ability to perform a 10.2.0.3 rolling upgrade of local ORA_CRS_HOME installations. It doesn’t. First, OUI determined it was OK for me to do so because ORA_CRS_HOME on all three nodes of this puny little cluster was installed under /opt on the internal drives. The CRS files (e.g., OCR/CSS), on the other hand, were on PolyServe:

tmr6s15:/opt/oracle/crs/install # grep u02 *
paramfile.crs:CRS_OCR_LOCATIONS=/u02/crs/ocr.dbf
paramfile.crs:CRS_VOTING_DISKS=/u02/crs/css1.dbf,/u02/crs/css2.dbf,/u02/crs/css3.dbf
rootconfig:CRS_OCR_LOCATIONS=/u02/crs/ocr.dbf
rootconfig:CRS_VOTING_DISKS=/u02/crs/css1.dbf,/u02/crs/css2.dbf,/u02/crs/css3.dbf
tmr6s15:/opt/oracle/crs/install # mount | grep u02
/dev/psd/psd1p3 on /u02 type psfs (rw,dboptimize,shared,data=ordered)

The first screen shot shows what to expect when OUI determines a rolling application of this patch is allowed:

NOTE: You may have to right click->view the image (e.g., with firefox I believe)

CRS1

Next, OUI instructs you to stop CRS on a node and then execute the root102.sh script:

CRS2

If all that goes well, you’ll see the following sort of feedback as root102.sh does its work:

CRS3

I was able to move along to the other two nodes and get the same feedback from root102.sh there as well.

To Share or Not to Share ORA_CRS_HOME
Oracle and PolyServe fully support the installation of CRS in either shared or unshared filesystems. The choice is up to the administrator. There are important factors to consider when making this decision. Using a shared ORA_CRS_HOME facilitates a single, central location for maintenance and operations such as log monitoring and so on. Some administrators consider this a crucial factor on larger clusters; it eliminates the need to monitor large numbers of ORA_CRS_HOME locations, each requiring logging into a different server. When ORA_CRS_HOME is shared in the PolyServe cluster filesystem, administrators can access the files from any node in the cluster.

A shared ORA_CRS_HOME does have one important disadvantage—rolling patch application is not supported. However, a patch that manipulates the Oracle Cluster Repository cannot be applied in a rolling fashion anyway. Although 10.2.0.3 is not such a patch, it is not inconceivable that other upgrades could make format changes to the OCR that would be incompatible with the prior versions executing on other nodes. Oracle would, of course, inform you that such a release was not a candidate for rolling upgrade, just as they do with a good number of the Critical Patch Updates (CPUs).
The parallel to shared ORACLE_HOME is apparent. Many Oracle patches for the database require updates to the data dictionary, so a lot of administrators ignore the exaggerated messaging from Oracle Corporation regarding “Rolling Upgrades” of ORACLE_HOME and deploy a shared ORACLE_HOME, eliminating the need to patch several ORACLE_HOME locations whenever a patch is required. This concern is only obvious to large IT shops where there is not just one RAC database, but perhaps 10 or more. These same administrators generally apply this logic to ORA_CRS_HOME. Indeed, having only one location to patch in either the ORA_CRS_HOME or ORACLE_HOME case significantly reduces the time it takes to apply a patch.

To that end, planning a very brief outage to apply patches to a shared ORA_CRS_HOME and/or ORACLE_HOME for up to 16 nodes in a cluster is an acceptable situation for many applications. For those cases where downtime cannot be tolerated, Oracle Data Guard is required anyway, and again the question of shared or unshared ORACLE_HOME and ORA_CRS_HOME arises. The question can only be answered on a per-application basis and the choice is yours.

PolyServe finds that, in general, when an application is migrated from a single large UNIX platform to RAC on Linux, administrators do not have sufficient time to deal with the increased amount of software maintenance. These IT shops generally opt for the “single system feel” that shared software installs for ORACLE_HOME and ORA_CRS_HOME offer. In fact, PolyServe customers have used shared Oracle Home since 2002, first with Oracle9i and then with Oracle10g—it has always been a staple feature of the Database Utility for Oracle. With Oracle10g the choice is yours.

Using The cpuid(1) Linux Command for In-depth Processor Information

Not to be confused with the x86 ISA CPUID instruction (which, by the way, serializes the CPU), I found a nice little tool for in-depth CPU information called cpuid(1). I’ve snipped a bit of the manpage and pasted it below. The RPM for the cpuid(1) tool can be found here.

Let’s take a quick look at the contrast between what this tool reports and what is generically available if you cat /proc/cpuinfo. Once again, I’ll go over to my favorite lab cluster of DL585s fitted with Opteron 850s running the PolyServe Database Utility for Oracle. I’ll use more(1) to get one processor’s worth of information:

$ cat /proc/cpuinfo | more
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 33
model name : AMD Opteron ™ Processor 850
stepping : 0
cpu MHz : 1800.005
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse
sse2 ht syscall nx mmxext lm 3dnowext 3dnow pni
bogomips : 3599.35
TLB size : 1088 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

On the other hand, the cpuid(1) command shows:

$ cpuid | more

CPU 0:
vendor_id = "AuthenticAMD"
version information (1/eax):
processor type = primary processor (0)
family = Intel Pentium 4/Pentium D/Pentium Extreme Edition/Celeron/Xeon/Xeon MP/Itanium2, AMD Athlon 64/Athlon XP-M/Opteron/Sempron/Turion (15)
model = 0x1 (1)
stepping id = 0x0 (0)
extended family = 0x0 (0)
extended model = 0x2 (2)
(simple synth) = AMD Dual Core Opteron (Italy/Egypt JH-E1), 940-pin, 90nm
miscellaneous (1/ebx):
process local APIC physical ID = 0x0 (0)
cpu count = 0x2 (2)
CLFLUSH line size = 0x8 (8)
brand index = 0x0 (0)
brand id = 0x00 (0): unknown
feature information (1/edx):
x87 FPU on chip = true
virtual-8086 mode enhancement = true
debugging extensions = true
page size extensions = true
time stamp counter = true
RDMSR and WRMSR support = true
physical address extensions = true
machine check exception = true
CMPXCHG8B inst. = true
APIC on chip = true
SYSENTER and SYSEXIT = true
memory type range registers = true
PTE global bit = true
machine check architecture = true
conditional move/compare instruction = true
page attribute table = true
page size extension = true
processor serial number = false
CLFLUSH instruction = true
debug store = false
thermal monitor and clock ctrl = false
MMX Technology = true
FXSAVE/FXRSTOR = true
SSE extensions = true
SSE2 extensions = true
self snoop = false
hyper-threading / multi-core supported = true
therm. monitor = false
IA64 = false
pending break event = false
feature information (1/ecx):
PNI/SSE3: Prescott New Instructions = true
MONITOR/MWAIT = false
CPL-qualified debug store = false
VMX: virtual machine extensions = false
Enhanced Intel SpeedStep Technology = false
thermal monitor 2 = false
context ID: adaptive or shared L1 data = false
cmpxchg16b available = false
xTPR disable = false
extended processor signature (0x80000001/eax):
generation = AMD Athlon 64/Opteron/Sempron/Turion (15)
model = 0x1 (1)
stepping = 0x0 (0)
(simple synth) = AMD Dual Core Opteron (Italy/Egypt JH-E1), 940-pin, 90nm
extended feature flags (0x80000001/edx):
x87 FPU on chip = true
virtual-8086 mode enhancement = true
debugging extensions = true
page size extensions = true
time stamp counter = true
RDMSR and WRMSR support = true
physical address extensions = true
machine check exception = true
CMPXCHG8B inst. = true
APIC on chip = true
SYSCALL and SYSRET instructions = true
memory type range registers = true
global paging extension = true
machine check architecture = true
conditional move/compare instruction = true
page attribute table = true
page size extension = true
multiprocessing capable = false
no-execute page protection = true
AMD multimedia instruction extensions = true
MMX Technology = true
FXSAVE/FXRSTOR = true
SSE extensions = true
RDTSCP = false
long mode (AA-64) = true
3DNow! instruction extensions = true
3DNow! instructions = true
extended brand id = 0xe86 (3718):
MSB = reserved (0b111010)
NN = 0x6 (6)
AMD feature flags (0x80000001/ecx):
LAHF/SAHF supported in 64-bit mode = false
CMP Legacy = true
SVM: secure virtual machine = false
AltMovCr8 = false
brand = "AMD Opteron ™ Processor 850"
L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
instruction # entries = 0x8 (8)
instruction associativity = 0xff (255)
data # entries = 0x8 (8)
data associativity = 0xff (255)
L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
instruction # entries = 0x20 (32)
instruction associativity = 0xff (255)
data # entries = 0x20 (32)
data associativity = 0xff (255)
L1 data cache information (0x80000005/ecx):
line size (bytes) = 0x40 (64)
lines per tag = 0x1 (1)
associativity = 0x2 (2)
size (Kb) = 0x40 (64)
L1 instruction cache information (0x80000005/ecx):
line size (bytes) = 0x40 (64)
lines per tag = 0x1 (1)
associativity = 0x2 (2)
size (Kb) = 0x40 (64)
L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
instruction # entries = 0x0 (0)
instruction associativity = L2 off (0)
data # entries = 0x0 (0)
data associativity = L2 off (0)
L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
instruction # entries = 0x200 (512)
instruction associativity = 4-way (4)
data # entries = 0x200 (512)
data associativity = 4-way (4)
L2 unified cache information (0x80000006/ecx):
line size (bytes) = 0x40 (64)
lines per tag = 0x1 (1)
associativity = 16-way (8)
size (Kb) = 0x400 (1024)
Advanced Power Management Features (0x80000007/edx):
temperature sensing diode = 0x1 (1)
frequency ID (FID) control = 0x1 (1)
voltage ID (VID) control = 0x1 (1)
thermal trip (TTP) = 0x1 (1)
thermal monitor (TM) = 0x0 (0)
software thermal control (STC) = 0x0 (0)
TscInvariant = 0x0 (0)
Physical Address and Linear Address Size (0x80000008/eax):
maximum physical address = 0x28 (40)
maximum linear address = 0x30 (48)
Logical CPU cores (0x80000008/ecx):
number of logical CPU cores – 1 = 0x1 (1)
ApicIdCoreIdSize = 0x0 (0)
SVM Secure Virtual Machine (0x8000000a/eax):
SvmRev: SVM revision = 0x0 (0)
SVM Secure Virtual Machine (0x8000000a/edx):
LBR virtualization = false
NASID: number of address space identifiers = 0x0 (0):
(multi-processing synth): multi-core (c=2)
(synth) = AMD Dual Core Opteron (Italy/Egypt JH-E1), 940-pin, 90nm Processor 875

And the manpage:


CPUID(1)
NAME

cpuid – Dump CPUID information for each CPU

SYNOPSIS

cpuid [options…]

DESCRIPTION

cpuid dumps detailed information about the CPU(s) gathered from the CPUID instruction, and also determines the exact model of CPU(s) from that information.

It dumps all information available from the CPUID instruction. The exact collection of information available varies between manufacturers and even between different CPUs from a single manufacturer.

The following information is available consistently on all modern CPUs:

vendor_id

version information (1/eax)

miscellaneous (1/ebx)

feature information (1/ecx)

Oracle Database on CAS, NAS, FCP. Your Choice. Why Not Some of Each?

When it comes to storage protocols, the big storage vendors are sending a clear message: Some is good, more must be better!

NAS, CAS, What a Mess (That Almost Rhymes)
Yes, Oracle cares about Oracle over NFS, and clustered storage is taking off, but the clustered storage offerings are fracturing into structured versus unstructured data optimization and that is a bad choice to have to make.

Back in June 2006, Tony Asaro of the Enterprise Storage Group covered clustered storage in this SearchStorage.com article. He said:

Clustered storage is gaining ground with an increasing number of vendors and systems in today’s market. Over time, clustered storage will be a requisite architectural design element for all storage systems.

The article covers a lot of different clustered storage offerings running the gamut from products like Isilon to CAS technology such as the EMC Centera and mention of PolyServe oddly listed alongside 3Par who partners with PolyServe for scalable NAS. One particular quote in the article stands out:

The Isilon IQ NAS storage system is one of the best examples of a true storage cluster.

Special Purpose Storage
While this may be true, I want to blog about a very important issue that I see arising out of the clustered storage wars. You see, so many of these interesting technologies are very special purpose. Some do streaming media well, others do seismic, others do RDBMS, but few—if not only one—do it all. Deploying special purpose storage technology means you are certain to have more than one kind of storage. For instance, if you adopt EMC Centera for unstructured data, you are going to need some other solution for your structured data—and since this is an Oracle blog we’ll presume Oracle is charged with your structured data.

Centera Storage is Optimized for Databases
“Hold it, Kevin”, you say. I can hear it already. I know, you read this EMC solution brief covering Centera posted on Oracle’s website! It says (emphasis added by me):

[…] Centera’s unique storage capabilities, you can centralize and manage massive volumes of information generated by all aspects of your organization […]

A document on Oracle’s website states Centera handles information generated by all aspects of your organization. Certainly that must also include the things you cram into your ERP database! No, CAS is an EMC term for write-once, or in their terminology “fixed content.” In short they implement WORM on ordinary magnetic media. Centera is not for databases.

So Oracle and EMC both recommend Centera for some of your data. How many different types of storage presentation do you want? What do you do with your database then? Oh, of course, I know, ASM. Centera is a network attached storage device so if you are settling on IP, wouldn’t life be simpler with NAS for the database too? But as I pointed out in this blog entry about ASM over NFS, EMC specifically recommends against combining ASM and NFS. So how many different connectivity models do you want? See, what I don’t get is how the market tolerates having products marketed to them in a way that doesn’t have their best interests in mind. It suits EMC quite well to sell you some Centera and some Celerra for NFS or even a mix of Centera and DMX via FCP (FCP is expensive). Any storage vendor that pushes Content DB will get a head nod from Oracle, but in the end, Content DB runs on all major platforms. So who are the forces behind this drive towards such special purpose and fractured storage management architectures?

Unlike Isilon and EMC file serving, with PolyServe you can buy any commodity hardware. And unlike Isilon and EMC, you can choose Windows Server or Linux—no proprietary embedded operating system. And most importantly—unlike Isilon and EMC—with PolyServe you get general purpose network attached clustered storage. So, sure, do your Content DB and Oracle Database (RAC included) all in one management infrastructure. Makes sense to me, but of course I’m biased.

Isilon: The Best Example of a True Storage Cluster
Yes, Isilon is a true clustered storage, but the product doesn’t support the Oracle database. Yet another special purpose offering. But, as I said here, I wish Isilon well. We are, after all, kindred spirits in this clustered storage wave. 

OK, there, I shamelessly plugged the outfit I work for <smiley face>.

 

Geeks in Cubicles, The “Browser Wars”, Unpaid Workers are “Truly Dedicated”

The Browser Wars Rage On
While reading the latest Time Magazine about how you are the person of the year, I stumbled across some interesting stuff. In this Time Magazine article, we learn that Blake Ross is “Outfoxing Microsoft” with the Firefox web browser. Are there really any living human beings left that care about the “browser wars”? I thought it was all about content now. Oh, well.

Near the beginning of the article, we get this jewel regarding how most software is developed:

Most software is developed exactly the way you think it is: you pay a bunch of geeks in cubicles to write it

Lovely. On the contrary when referring to some of the people that write open source software, the article quotes Blake:

[open source developers] aren’t necessarily professionals

But no worries, when it comes to the commitment level of open source developers, the article quotes Blake as follows:

It also means the people are truly dedicated because there’s no payday

Uh, OK, that’s really nifty. I don’t know about you, but I’m a lot happier with software developed by people that do it because they need to meet their financial obligations. The thought of my local 911 service running on software written by ueber-dedicated, unpaid not-necessarily-professionals makes me restless. Think about it, they might actually have to attend to their day job at some point, or is that where they are getting the best of “their ideas?”

Oh the Hypocrisy!
I used Firefox to post this blog entry. You know what I would have used if Firefox wasn’t free? IE6—I wouldn’t pay for Firefox. When I installed Firefox, there was a welcome-to-Firefox page that read:

Experience the difference. Firefox is developed and supported by Mozilla, a global community working together to make the Web a better place for everyone.

I don’t think whoever wins the nonexistent browser wars can make the Web a better place for everyone. It’s not the browser, it’s the content.

What’s This Have to do with Oracle
Oracle is not open source. I’m glad there are those “geeks in cubicles” developing and maintaining the database server. I know a lot of them, and they deserve a lot of respect.

 

Announcement: Scalable Windows File Serving Web Demo

Yes, this is an Oracle-related blog, but most Oracle sites have file serving requirements and the majority have Windows infrastructure as well. This is just an invitation to you readers that might be interested:

PolyServe Windows Scalable File Serving Web Demo Announcement


AMD Quad-Core “Barcelona” Processor For Oracle (Part III). NUMA Too!

To continue my thread about AMD’s future quad-core processor code named “Barcelona” (a.k.a. K8L), I need to elaborate a bit on my last installment on this thread, where I pointed out that AMD’s marketing material suggests we should expect 70% better OLTP performance from Barcelona than from Socket F (Opteron 2200). To be precise, the marketing materials are predicting a 70% increase on a per-processor basis. That is a huge factor that I need to blog about, so here it is.

“Friendemies”
While doing the technical review for the Julian Dyke/Steve Shaw RAC on Linux book, I got to know Steve Shaw a bit. Since then we have become more familiar with each other, especially after manning the HP booth in the exhibitor hall at UKOUG 2006. Here is a photo of Steve in front of the HP Enterprise File Services Clustered Gateway demo. The EFS is an OEMed version of the PolyServe scalable file serving utility (scalable clustered storage that works).

shaw_4.JPG

People who know me know I’m a huge AMD fan, but they also know I am not a techno-religious zealot. I pick the best, but there is no room for loyalty in high technology (well, on second thought, I was loyal to Sequent to the bitter end…oh well). So over the last couple of years, Steve and I have occasionally agreed to disagree about the state of affairs between Intel and AMD processor fitness for Oracle. Steve and I are starting to see eye to eye a lot more these days because I’m starting to smell the coffee as they say.

It’s All About The Core
When it comes to Oracle performance on industry standard servers, the only thing I can say is, “It’s the core, stupid”—in that familiar Clintonian style of course. Oracle licenses the database at the rate of .5 per core, rounded up. So a quad-core processor is licensed as 2 CPUs. Let’s look at some numbers.

Since AMD’s Quad-core promo video is based on TPC results, I think it is fair to go with them. TPC-C is not representative of what real applications do to a processor, but the workload does one thing really well—it exploits latency issues. For OLTP, memory latency is the most important performance characteristic. Since AMD’s material sets our expectations for some 70% improvement in OLTP over the Opteron 2200, we’ll look at TPC-C.

This published TPC-C result shows that the Opteron 2200 can perform 69,846 TpmC per processor. If the AMD quad-core promotional video proves right, the Barcelona processor will come in at approximately 118,739 TpmC per processor (a 70% improvement).

TpmC/Oracle-license
Since a quad-core AMD is licensed by Oracle as 2 CPUs, it looks like Barcelona will be capable of 59,370 TpmC per Oracle license. Therein lies the rub, as they say. There are a couple of audited TPC-C results with the Intel “Tulsa” processor (a.k.a. Xeon 7140, 7150), such as this IBM System x result, that show this current high-end Xeon processor is capable of some 82,771 TpmC per processor. Since the Xeon 71[45]0 is a dual-core processor, the Oracle-license price factor is 82,771 TpmC per Oracle license. If these numbers hold any water, some 9 months from now when Barcelona ships, we’ll see a processor that is 28% less price-performant from a strict Oracle licensing standpoint. My fear is that it will be worse than that because Barcelona is socket-compatible with Socket F systems—such as the Opteron 2200. I’ve been at this stuff for a while and I cannot imagine the same chipset having enough headroom to feed a processor capable of 70% more throughput. Also, Intel will not stand still. I am comparing current Xeon to future Barcelona.
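Laying the arithmetic out in one place (these are simply the figures quoted above, nothing new):

\begin{aligned}
\text{Barcelona (projected):}\quad & 69{,}846 \times 1.7 \approx 118{,}739 \ \text{TpmC/processor} \;\Rightarrow\; 118{,}739 / 2\ \text{licenses} \approx 59{,}370 \ \text{TpmC per license} \\
\text{Xeon 7140 (audited):}\quad & 82{,}771 \ \text{TpmC/processor} / 1\ \text{license} = 82{,}771 \ \text{TpmC per license} \\
\text{Ratio:}\quad & 59{,}370 / 82{,}771 \approx 0.72,\ \text{i.e., roughly 28\% less throughput per Oracle license}
\end{aligned}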

A Word About TPC-C Analysis
I admit it! I routinely compare TPC-C results on the same processor using results achieved by different databases. For instance, in this post, I use a DB2/SLES on IBM System x to make a point about the Xeon 7150 (“Tulsa”) processor. E-gad, how can I do that with a clear conscience? Well, think about it this way. If DB2 on IBM System x running SuSE can achieve 82,771 TpmC per Xeon 7150 and this HP result shows us that SQL Server 2005 on Proliant ML570G4 (Xeon 7140) can do 79,601 TpmC per CPU, you have to at least believe Oracle would do as well. There are no numbers anywhere that suggest Oracle is head and shoulders above either of these two software configurations on identical hardware. We can only guess because Oracle seems to be doing TPC-C with Itanium exclusively these days. I think that is a bummer, but Steve Shaw likes it (he works for Intel)!

What Does NUMA Have To Do With It?
Uh, Opteron/HyperTransport systems are NUMA systems. I haven’t blogged much about that yet, but I will. I know a bit about Oracle on NUMA—a huge bit.

I hope you’ll stay tuned because we’ll be looking at real numbers.

The “Dread Factor”, Multi-vendor Support, Unbreakable Linux.

Dread the Possible, Ignore the Probable
“One throat to choke” is the phrase I heard the last time I spoke with someone who went to extremes to reduce the number of technology providers in their production Oracle deployment. You know, Unbreakable Linux, single-source support provider, etc. I’m sorry, but if you are running Oracle on Linux there is no way to get single-provider support. We all find this out sooner or later. Sure, you can send your money to a sole entity, but that is just a placebo. If I thought my life depended on single-provider support, I’d buy an IBM System i solution (AS400)—soup to nuts. At least I’d get close.

With Linux there are always going to be multiple providers because it runs on commodity hardware. You then add storage (SAN array, switches, HBAs), load the OS and Oracle and other software. There you go—multiple providers. So why is it that sometimes people take comfort in this theory of single-provider support on the software (OS and Oracle only, of course) side of things? Is it a reality?

Dread Factor
No, single-provider support with Oracle on Linux is not a reality. That is why serious software providers and their careful customers rely on TSANet to ensure all parties play by the rules and do not start pointing fingers at the expense of the customer. Oracle is a participant in TSANet, so is PolyServe.

I was reading an interesting magazine article—also available online—about how we humans fear the wrong things. You know, things like fearing a commercial airliner fatality more than an auto fatality—the latter taking 500-fold more lives per year. The article explains why. We dread an airliner crash more. The article points out:

[…] the more we dread, the more anxious we get, and the more anxious we get, the less precisely we calculate the odds of the thing actually happening. “It’s called probability neglect,”

What Does This Have To Do With Oracle?
Well, we fear how “helpless” we might be in a case where the OS or third-party platform software provider is pointing at Oracle and Oracle is pointing back. By the way, have you ever finger-pointed at an 800lb gorilla? Yes, that is a possible scenario. Is that somehow more calamitous than working with Oracle on a clear, concise Oracle-only bug (e.g., some ORA-00600 crash problem)? Probably not, but fear of the former is an example of what the magazine article calls the Dread Factor.

New Year’s Resolution: Fear the Probable
We have a Wall Street customer that does not run Oracle on our Database Utility for Oracle RAC in their RAC solution, but does use our scalable file serving in their ETL workflow. They run Oracle on Itanium Linux and we don’t do Itanium. But, since we are in there, I know a bit about their operations. In the month of November 2006, one of their operations managers told me they had nearly 90 Oracle TARs open—half of which were ORA-00600/ORA-07445 problems. All those TARs were affecting a single application—a single RAC database. Yes, it is conceivable that they have also faced a multi-vendor problem (e.g., HBA firmware/Red Hat SCSI midlayer) at some point in this deployment. Do you think they really care? In this shop, the database tier is 100% Unbreakable Linux—the old style, not the new style. The old style Unbreakable Linux being RHEL with Oracle and no third-party kernel loadable modules. That’s them—they have a “single throat to choke”. How do you think that is working out for them? It hasn’t made a bit of difference.

Oracle is an awesome database. It is huge and complex. You are going to hit bugs, so it might be a good New Year’s resolution to fear the probable more than the possible. Get the most stable, manageable, supported configuration you can so that you are not dealing with day-to-day headaches between those probable bugs. That is, don’t hinge your deployment on some possible support finger-pointing match. Real, difficult, single-vendor bugs are most probable. Choose your partners well for those possible bugs.

A Case Study
The majority of the suse-oracle email list participants have the “no-third-party” model deployed. They are, if you will, the poster children for Unbreakable Linux. So I keep an eye out there to see how the theory plays out in reality. Let’s take a peek. In a recent thread about an Asynchronous I/O problem in the Linux kernel, the poster wrote:

We already tried this…opened a TAR with Oracle, opened an issue with Novell…got 2 fixes from Novell, but both are not helping around the bug. The database crashes after approx. 1 week of heavy load and you have to restart the machine to free the ipc-resources.

Remember that with an Unbreakable Linux deployment, if you hit a Linux kernel problem you can call Oracle or the provider of your Linux distribution. This person tried both, but the saga continued:

[…] we filed a bug…with both parties, Novell AND Oracle.We escalated this case at Novell, because it’s a kernel bug…no change for the last 4-6 weeks. But…as you see…no solution after about 3 months…

Since Linux is open source, the code is open to all for reading. I’ve blogged before about the dubious value in being able to read the source for the OS or layers such as clustered filesystems since an IT shop is not likely to fix the problem themselves anyway. The customer having this async I/O problem took advantage of that “benefit”:

I took a deep look into the kernel-code, especially the part of the bug in aio.c As far as i see, it looks like a list-corruption of the list of outstanding io-requests. So i don’t think that it is driver-specific…it looks like a general bug.

But, as I routinely point out, having the source really doesn’t help an IT shop much as this installment on the thread shows:

It’s very unfortunate that this bug (bz #165140) is still not resolved
as both Oracle and SUSE eng. teams are looking into problem.

An Historical Example of Good Multi-Vendor Support
Back in the 1990’s Veritas, Oracle and Sun got together to build a program called VOS to ensure their joint customers get the handling they deserve. Kudos to Oracle and Sun. That was typical of Oracle back in the Open Systems days. Things were a lot more “open” back then.

I participate in the oracle-l list. There was a recent thread there about the dreaded “finger-pointing” illusion. In this post a list participant set the record straight. His post points out that having more than “one throat to choke” is better than being all alone:

In the context of clustering, even if you eliminate the third-party cluster-ware products, you still have the other pieces of the pie, like the OS, the storage (SAN, etc.), the interconnect, etc., so the finger-pointing will not go away. I have worked with the VOS support many times in the past and I can tell you that in each conference call, VERITAS support never pointed fingers towards anyone. In fact, their support people were so competent that they even identified issues that were related to SAN and even the analysts from the storage SAN company were not able to identify them.


Lessons From Real Life
Multi-vendor support is a phenomenon across all industries. A good friend of mine has a real job and does real work for a living—dangerous work, with huge dangerous equipment that he owns. He knows that there are certain things he has to do with his machinery that substantially increase the probability of something going wrong. In those cases, he doesn’t fret about the possibility that there may be some political outcome. He focuses on the probable.

A bit over a year ago he experienced “the probable” and took photos for me. While moving a 60,000+ lb piece of machinery, he hit a patch of ice and yes, 30 ton track vehicles do slide on ice just like your co-worker’s red sports car.

In the following shot, the machinery had just slipped off the road so he called in another of his pieces to help.

cat1

In the next shot they had worked at the problem until the tracks were headed in the right direction and the tether was freshly cut loose. He said the anxiety was so thick you could cut it with a knife. It is quite probable he is right. Then again, it is possible he was exaggerating. I’ll let you be the judge.

cat2
I’ll blog another time about where that machine had to go after that photo…it wasn’t pretty.

AMD Quad-core “Barcelona” Processor For Oracle (Part II)

I am a huge AMD fan, but I am now giving up my hopes of finding any substantial information that could be used to predict what Oracle performance might be like on next year’s Barcelona (a.k.a. K8L) quad-core processor. I did, however, find another “interesting blog” while trolling for information on this topic. Note, the quotes! Folks, NOTE THE QUOTES!!! I’m insinuating something there…

Lowered Expectations?
Anyway, what I am finding is that by AMD’s own predictions, we should expect Barcelona to outperform Intel’s Clovertown (Xeon 5355) processor by about 15% or so. The problem is that there really are no real numbers. You can view this AMD video about Barcelona. In it you’ll find a slide that shows their estimated 70% OLTP improvement over the Opteron 2200 SE product. The 2200 is a Socket F processor and luckily for us there is an audited TPC-C result of 34,923 TpmC/core. Note, I’m boiling down TPC results by core to make some sense of this. The Barcelona processor is 100% compatible with the Socket F family. I find it hard to imagine that Barcelona will be able to squeeze out a 70% performance increase from the same chipset. Oh well. But if it did, that would be a TPC-C result of 59,369 per core. So why then is that AMD video so focused on leap-frogging the Xeon 5355 which “only” gets 30,092 TpmC/core? And why the fixation on the Xeon 5355 when the Xeon 7140 “Tulsa” achieves 39,800 TpmC/core? It was nice and convenient to be able to compare the 2200SE, 5355 and 7140 with TPC results based on the same database—SQL Server.

I also see no evidence of IBM, HP or Dell planning to base a server on Barcelona. That’s scary. I’m expecting some quasi-inside information from Sun. Let’s see if that will help any of this make sense.

The following is a shot of the AMD slide predicting a 70% performance improvement over the Xeon 5160 and Opteron 2200 SE (which, as I point out above, is a bit moot). You may have to right-click and view the image to zoom in on it:

[image: AMD-Barcelona2]

OLTP is Old News
Finally, I’m discovering that you don’t get much information about processors when searching for that old, boring OLTP stuff. If I search for “megatasking +AMD”, on the other hand—now that produces a wealth of information! I’ve also learned that “enthusiast” is a buzzword both AMD and Intel are beating on heavily. I was completely unaware that there is actually such a thing as an “enthusiast market”. It seems customers in this particular market buy processors that also wind up in servers running OLTP. I just hope the processors they are making for “enthusiasts” are also reasonably fit for Oracle databases. I’m afraid we aren’t going to know until we find out.

In the meantime, I think I’ll push some megatasking tests through my cluster of DL585s.

Partition, or Real Application Clusters Will Not Work.

OK, that was a come-on title. I’ll admit it straight away. You might find this post interesting nonetheless. Some time back, Christo Kutrovsky made a blog entry on the Pythian site about buffer cache analysis for RAC. I meant to blog about the post, but never got around to it—until today.

Christo’s entry consisted of some RAC theory and a SQL query against the buffer cache contents. I admit I have not yet tested his script against any of my RAC databases. I intend to do so soon, but I can’t right now because they are all under test. However, I wanted to comment a bit on Christo’s take on RAC theory. But first, a word about one statement in his post. He wrote:

There’s a caveat however. You have to first put your application in RAC, then the query can tell you how well it runs.

Not that Christo is saying so, but please don’t get into the habit of using scripts against internal performance tables as a metric of how “well” things are running. Such scripts should be used as tools to approach a known performance problem, a problem measured much closer to the user of the application. Too many DBAs run scripts way downwind of the application and, if they see high cache hit ratios or other such metrics, rest on their laurels. That is bad mojo. It is entirely possible for a script like Christo’s to give a very “bad reading” while application performance is satisfactory, and vice versa. OK, enough said.

Application Partitioning with RAC
The basic premise Christo was trying to get across is that RAC works best when the applications accessing the instances are partitioned in such a way as to not require cross-instance data shipping. Of course that is true, but what lengths do you really have to go to in order to get your money’s worth out of RAC? That is, we all recall how horrible block pings were with OPS—or do we? See, most people who loathed the dreaded block ping in OPS thought the poison was in the disk I/O component of a ping, when in reality the poison was in the IPC (both inter- and intra-instance IPC). OK, what am I talking about? It was quite common for a block ping in OPS to take on the order of 200-250 milliseconds on a system where disk I/O was being serviced with respectable times like 10ms. Where did the time go? IPC.

Remembering the Ping
In OPS, when a shadow process needed a block from another instance, there was an astounding amount of IPC involved in getting the block from one instance to the other. In quick and dirty terms (this is just a brief overview of the life of a block ping), it consisted of the shadow process requesting the local LCK process to communicate with the remote LCK process, which in turn communicated with the DBWR process on that node. That DBWR process then flushed the required block (along with all the modified blocks covered by the same PCM lock) to disk. That DBWR then posted its local LCK, which in turn posted the LCK process back on the node where the original requesting shadow process was waiting. That LCK then posted the shadow process, and the shadow process finally read the block from disk. Whew. Note that at every IPC point, the act of messaging only makes the posted process runnable; it then waits in line for CPU in accordance with its mode and priority. Also, when DBWR was posted on the holding node, it was unlikely to have been idle, so the life of the block ping also included some amount of time spent while DBWR finished the SGA flushing it was already doing when it got posted. All told, there were quite often some 20 points at which the processes involved were in runnable states. Considering the scheduling time quantum is/was 10ms, you routinely got as much as 200ms of overhead on a block ping that was nothing but scheduling delay. What a drag.
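If you want the back-of-the-envelope version, here it is as a minimal sketch. The handoff count, quantum, and disk service time are the round numbers from the description above, not measurements from any particular system:

# Rough model of OPS block ping latency as described above.
# All inputs are illustrative round numbers, not measurements.

SCHED_QUANTUM_MS = 10     # scheduling time quantum; a freshly-posted runnable process can wait ~one quantum
RUNNABLE_HANDOFFS = 20    # roughly 20 points where a posted process sits runnable waiting for CPU
DISK_IO_MS = 10           # a "respectable" disk service time
DISK_IOS = 2              # DBWR writes the block out, the requesting shadow reads it back in

scheduling_delay_ms = RUNNABLE_HANDOFFS * SCHED_QUANTUM_MS   # ~200 ms of pure scheduling delay
io_ms = DISK_IOS * DISK_IO_MS                                # ~20 ms of actual disk I/O

print(f"Estimated block ping latency: ~{scheduling_delay_ms + io_ms} ms "
      f"({scheduling_delay_ms} ms scheduling delay + {io_ms} ms disk I/O)")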

What Does This Have To Do With RAC?
Christo’s post discusses divide-and-conquer style RAC partitioning, and he is right. If you want RAC to perform perfectly for you, you have to make sure that RAC isn’t being used. Oh, he’s gone off the deep end again, you say. No, not really. What I’m saying is that if you completely partition your workload, then RAC is indeed not really being used. I’m not saying Christo is suggesting you have to do that. I am saying, however, that you don’t have to do that. This blog post is not just a shill for Cache Fusion, but folks, we are not talking about block pings here. Cache Fusion—even over Gigabit Ethernet—is actually quite efficient. Applications can scale fairly well with RAC without going to extreme partitioning efforts. I think the best message is that application partitioning should be looked at as a method of exploiting this exorbitantly priced stuff you bought. That is, in the same way we try to exploit the efficiencies gained by fundamental SMP cache-affinity principles, so should attempts be made to localize demand for tables and indexes (and other objects) to instances—when feasible. If it is not feasible to do any application partitioning, and RAC isn’t scaling for you, you have to get a bigger SMP. Sorry. How often do I see that? Strangely, not that often. Why?

Over-configuring
I can’t count how often I see production RAC instances running throughout an entire cluster at processor utilization levels well below 50%. And I’m talking about RAC deployments where no attempt has been made to partition the application. These sites often don’t need to consider such deployment tactics because the performance they are getting meets their requirements. I do cringe and bite my tongue, however, when I see two instances of RAC in a two-node cluster—devoid of any application partitioning—running at, say, 40% processor utilization on each node. If no partitioning effort has been made, that means there is Cache Fusion (GCS/GES) traffic in play—and lots of it. Deployments like that are turning their GbE Cache Fusion interconnect into an extension of the system bus, if you will. If I were the administrator of such a setup, I’d ask Santa to scramble down the chimney and pack that entire workload into one server at roughly 80% utilization. But that’s just me. Oh, actually, packing two 40% RAC workloads into a single server doesn’t necessarily produce 80% utilization. There is more to it than that. I’ll see if I can blog about that one too at some point.
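As a teaser, here is one piece of that picture as a purely hypothetical sketch. The split between application work and Cache Fusion overhead is made up for illustration; the only point is that consolidation removes some work each node was doing, so the packed utilization is not a straight sum:

# Hypothetical illustration only: the 30%/10% split below is invented.
# On each node of the 2-node RAC cluster, part of the observed 40% utilization
# is application work and part is Cache Fusion (GCS/GES) messaging overhead
# that largely goes away when the workload runs in a single instance.

nodes = 2
observed_util_per_node = 0.40
assumed_fusion_overhead = 0.10                       # assumption, not a measurement
app_work_per_node = observed_util_per_node - assumed_fusion_overhead

naive_packed = nodes * observed_util_per_node        # the "just add them" guess: 80%
packed_no_fusion = nodes * app_work_per_node         # with the fusion overhead gone: 60%

print(f"Naive packed utilization:            {naive_packed:.0%}")
print(f"Packed with fusion overhead removed: {packed_no_fusion:.0%}")
# ...and this still ignores queueing effects at higher utilization, which push
# the other way. As I said, there is more to it than this sketch.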

What about High-Speed, Low-Latency Interconnects?
With OLTP, if the processors are saturated on the RAC instances you are trying to scale, a high-speed, low-latency interconnect will not buy you a thing. Sorry. I’ll blog about why in another post.

Final Thought
If you are one of the few out there facing a total partitioning exercise with RAC, why not deploy a larger SMP instead? Comments?


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.


Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.