Archive Page 32

Oracle Updates the Linux RAC Technology Compatibility Matrix

Subsequent to the HP acquisition of PolyServe, Oracle has updated the Linux RAC Technology Compatibility Matrix to include RAC support for PolyServe on HP Proliant Servers:

 

Linux RAC Technology Compatibility Matrix

 

Testing Direct Versus Buffered UFS on Solaris 10 with Swingbench.

Padraig O’Sullivan has a good blog entry on Swingbench. Included is the topic of using Swingbench to test direct versus buffed UFS on Solaris 10. Good post—check it out.

AMD Quad-Core “Barcelona” Processor For Oracle. How Badly Do You Need Enterprise Edition Oracle?

 

This blog entry is 6th in a series about Oracle on AMD’s upcoming quad-core processor code named “Barcelona.” The following is a link to the other installments on this thread:

Oracle on Opteron, K8L, NUMA, etc

Got Quad-Core? Need Enterprise Edition Oracle?
There is quite a buzz today about Oracle’s changes to software licensing for the database products. According to this ZDNet article, the changes are specific to the Standard Edition family of database products. The article refers to Oracle’s multi-core pricing guide which was updated on February 16, 2007. Get out your slide rule and gulp a heaping helping of patience.

Quad-Core x86_64
The ZDNet Article states:

Servers with four quad-core chips are relatively rare right now, but Intel and AMD plan to release processors for that segment later this year.

Um, the Xeon “Cloverdale” processors are quad-core and shipping already. AMD “Barcelona” is coming out this year. So what does this change really mean? If you use one of the Standard Edition products, you are longer limited based on cores, but sockets instead.

Misinformation—Lot’s of It
It’s Christmas for the bean counters. According to this News.com article, you can just simply switch out Enterprise Edition with Standard Edition:

Customers no longer must buy licenses for each of the 16 cores to run the top-end Enterprise Edition, but instead may buy licenses for the four sockets and run Standard Edition. That cuts list licensing prices from between $320,000 and $480,000–depending on Oracle adjustments that factor in multi-core processor performance–to $60,000.

I am still scratching my head about that one. Customers don’t swap out EE for SE at the drop of a hat—or do you? Since the choice would have never been there before to run SE on that many cores, could it be that SE will start to be the preferred multi-core edition? Can you live without the differences between EE and SE?

Barcelona
Folks that if have EE on a 4-Socket F (2200/8200) Opteron system today might be wise to think very hard about whether they can drop to SE because if they plug in Barcelona processors (they are socket-compatible), EE is going to be very, very expensive. That is, if you stay with EE and plug in Barcelona processors you will double your license cost.

I find this to be a very interesting policy change.

HP to Acquire PolyServe to Bolster NAS Offerings with Clustered Storage

You faithful readers of this blog know my position on NAS for Oracle. Clustered Storage is getting hot and HP has just stepped up to the plate by acquiring PolyServe. Here is a link to HP’s website with details:

HP To Acquire PolyServe

As you regular readers can imagine, my blogging will certainly sound a lot different going forward.

Experience Direct I/O Before Experiencing RAC

Bob Sneed makes a good point about direct I/O with regards to preparation for moving to RAC (should you find yourself in that position). I know exactly what he is talking about as I’ve seen people hit with the rude awakening of switching from buffered to unbuffered I/O while trying to implement RAC. The topic is related to the troubles people see when they migrate to RAC from a non-RAC setup where  regular buffered filesystems are being used. Implementing RAC forces you to use direct I/O (or RAW) so if you’ve never seen your application work without the effect of external caching in the OS page cache, going to RAC will include this dramatic change in I/O dynamic. All this at the same time as experiencing whatever normal RAC phenomenon your application may hit as well.

In this blog entry, Bob says:

If you ever intend to move a workload to RAC, tuning it to an unbuffered concurrent storage stack can be a crucial first step! Since there are no RAC storage options that use OS-level filesystem buffering […]

Direct I/O
Bob stipulates “unbuffered concurrent” since Solaris has a lot of different recipes for direct I/O some of which do not throw in concurrent I/O automatically. If you’ve been following my blog, you’ve detected that I think it is a bit crazy that there are still technology solutions out there that do not automatically include concurrent I/O along with direct I/O.

Here are some links to my recent thread on direct I/O :

Standard File Utilities with Direct I/O

Oracle Direct I/O Brought to You By Deranged Monkeys

Direct I/O Can Crash Dataguard. Tricky ORA-01031.

DBWR with CIO on JFS2. Resource Starvation?

What Performs Better, Direct I/O or Direct I/O? There is No Such Thing As a Stupid Question!

 

 

Cache Hit Ratio. Who Needs It?

Some things can never be said often enough. Buffer cache hit ratio is a worthless indicator of performance when Oracle is pounding the daylights out of a few cache buffer chains. I see Hemant Chitale has just blogged on this topic:

Buffer Cache Hit Ratio GOOD or BAD ?

About Logical Reads
All logical reads in Oracle start with a hash algorithm on the database block address. Since there is an unknown number of blocks in the database (dba), this cannot be a “perfect hash” so there are hash collisions. Oracle resolves this by “chaining” dbas with equal hash values. Chains hang off of “buckets” and each bucket has a latch. To walk a chain (looking for the exact dba your session needs), the latch is first aquired. These are db block gets and db block consistent gets depending on the type of block you are looking for (versioning). Applications that clone a lot of blocks can have a “piling up” affect on the buckets that govern these hot chains. Fix that problem at the application level before worrying about hit ratio and long before trying to deal with latch dynamics (e.g., spin count, increasing buckets, etc).

Standard File Utilities with Direct I/O

In my last blog entry about Direct I/O, I covered the topic of what Direct I/O can mean beyond normal Oracle database files. A reader followed up with a comment based on his experience with Direct I/O via Solaris –forcedirectio mount option:

I’ve noticed that on Solaris filesystems with forcedirectio , a “compress” becomes quite significantly slower. I had a database where I was doing disk-based backups and if I did “cp” and “compress” scripting to a forcedirectio filesystem the database backup would be about twice as long as one on a normally mounted filesystem.

I’m surprised it was only twice as slow. He was not alone in pointing this out. A fellow OakTable Network member who has customers using PolyServe had this to say in a side-channel email discussion:

Whilst I agree with you completely, I can’t help but notice that you ‘forgot’ to mention that all the tools in fileutils use 512-byte I/Os and that the response time to write a file to a dboptimised filesystem is very bad indeed…

I do recall at one point cp(1) used 512byte I/Os by default but that was some time ago and it has changed. I’m not going to name the individual that made this comment because if he wanted to let folks know who he is, he would have made the comment on the blog.  However, I have to respectfully disagree with this comment. It is too broad and a little out of date. Oh, and fileutils have been rolled up into coreutils actually. What tools are those? Wikipedia has a good list.

When it comes to the tools that are used to manipulate unstructured data, I think the ones that matter the most are cp, dd, cat, sort, sum, md5sum, split, uniq and tee. Then, from other packages, there are tar and gzip. There are others, but these seem to be the heavy hitters.

Small Bites
As I pointed out in my last blog entry about DIO, the man page for open(2) on Enterprise Linux distributions quotes Linus Torvalds as saying:

The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances

I beg to differ. I think he should have given that title to anyone that thinks a program like cp(1) needs to operate with little itsy-bitsy-teenie-weenie I/Os. The following is the current state of affairs (although not exhaustive) as per measurements I just took with strace on RHEL4:

  • tar: 10KB default, override with –blocking-factor
  • gzip: 32KB in/16KB out
  • cat, md5sum, split, uniq, cp: 4KB

So as you can see these tools vary, but the majority do operate with insidiously ridiculous small I/O sizes. And 10KB as the default for tar? Huh? What a weird value to pick out of the air. At least you can override that by supplying an I/O size using the –blocking-factor option. But still, 10KB? Almost seems like the work of “deranged monkeys.” But is all lost? No.

Open Source
See, I just don’t get it. Supposedly Open Source is so cool because you can read and modify source code to make your life easier and yet people are reluctant to actually do that.  As far as that list of coreutils goes, only cp(1) causes a headache on a direct I/O mounted filesystem because you can’t pipeline it. Can you imagine the intrusive changes one would have to make to cp(1) to stop doing these ridiculous 4KB operations? I can, and have. The following is what I do to the coreutils cp(1):

copy.c:copy_reg()
/* buf_size = ST_BLKSIZE (sb);*/
buf_size = 8388608 ;

Eek! Oh the horror. Imagine the testing! Heaven’s sake! But, Kevin, how can you copy a small file with such large I/O requests? The following is a screen shot of two copy operations on a direct I/O mounted filesystem. I copy once with my cp command that will use a 8MB buffer and then again with the shipping cp(1) which uses a 4KB buffer.

fig4.jpg

Folks, in both cases the file is smaller than the buffer size. The custom cp8M will use an 8MB buffer but can safely (and quickly) copy a 41 byte file the same way the shipping cp(1) does with a 4KB buffer. The file is smaller than the buffer in both cases—no big deal.

So then you have to go through and make custom file tools right? No, you don’t. Let’s look at some other tools.

Living Happily With Direct I/O
…and reaping the benefits of not completely smashing your physical memory with junk that should not be cached. In the following screen shot I copy a redo log to get a working copy. My current working directory is a direct I/O mounted PSFS and I’m on RHEL4 x86_64. After copying I used gzip straight out of the box as they say. I then followed that with a pipeline command of dd(1) reading the infile with 8MB reads and writing to the pipe (stdout) with 8MB writes. The gzip command is reading the pipe with 32KB reads and in both cases is writing the compressed output with 16KB writes.

fig5.jpg

It seems gzip was written by monkeys who were apparently not deranged. The effect of using 32KB input and 16KB output is apparent. There was only a 16% speedup when I slammed 8MB chucks into gzip on the pipeline example. Perhaps the sane monkeys that implemented gzip could talk to the deranged monkeys that implemented all those tools that do 4KB operations.

What if I pipeline so that gzip is reading and writing on pipes but dd is adapted on both sides to do large reads and writes? The following screen shot shows that using dd as the reader and writer does pick up another 5%:

fig6.jpg

So, all told, there is 20% speedup to be had going from canned gzip to using dd (with 8MB I/O) on the left and right hand of a pipeline command. To make that simpler one could easily write the following scripts:

#!/bin/bash

dd if=$1 bs=8M

and

#!/bin/bash

dd of=$1 bs=8M

Make these scripts executable and use as follows:

$ large_read.sh file1.dbf | gzip –c -9 | large_write.sh file1.dbf.gz

But why go to that trouble? This is open source and we are all so very excited that we can tweak the code. A simple change to any of these tools that operate with 4KB buffers is very easy as I pointed out above. To demonstrate the benefit of that little tiny tweak I did to coreutils cp(1), I offer the following screen shot. Using cp8M offers a 95% speedup over cp(1) by moving 42MB/sec on the direct I/O mounted filesystem:

fig7.jpg

More About cp8M
Honestly, I think it is a bit absurd that any modern platform would ship a tool like cp(1) that does really small I/Os. If any of you can test cp(1) on, say, AIX, HP-UX or Solaris you might find that it is smart enough to do large I/O requests if is sees the file is large. Then again, since OS page cache also comes with built-in read-ahead, the I/O request size doesn’t really matter since the OS is going to fire off a read-ahead anyway.

Anyway, for what it is worth, here is the README that we give to our customers when we give them cp8M:

$ more README

INTRODUCTION
Files stored on DBOPTIMIZED mounted filesystems do not get accessed with buffered I/O. Therefore, Linux tools that perform small I/O requests will suffer a performance degradation compared to buffered filesystems such as normal mounted PolyServe CFS , Ext3, etc. Operations such as copying a file with cp(1) will be very slow since cp(1) will read and write small amounts of data for every operation.

To alleviate this problem, PolyServe is providing this slightly modified version of the Open Source cp(1) program called cp8M. The seed source for this tool is from the coreutils-5.2.1 package. The modification to the source is limited to changing the I/O size that cp(1) issues from ST_BLOCKSIZE to 8 MB. The following code snippet is from the copy.c source file and depicts the entirety of source changes to cp(1):

copy.c:copy_reg()

/* buf_size = ST_BLKSIZE (sb);*/

buf_size = 8388608 ;

This program is statically linked and has been tested on the following filesystems on RHEL 3.0, SuSE SLES8 and SuSE SLES9:

* Ext3

* Regular mounted PolyServe CFS

* DBOPTIMIZED mounted PSFS

Both large and small files have been tested. The performance improvement to be expected from the tool is best characterized by the following terminal session output where a 1 GB file is copied using /bin/cp and then with cp8M. The source and destination locations were both DBOPTIMIZED.

# ls -l fin01.dbf

-rw-r–r– 1 root root 1073741824 Jul 14 12:37 fin01.dbf

# time /bin/cp fin01.dbf fin01.dbf.bu
real 8m41.054s

user 0m0.304s

sys 0m52.465s

# time /bin/cp8M fin01.dbf fin01.dbf.bu2

real 0m23.947s

user 0m0.003s

sys 0m6.883s

Oracle Direct I/O Brought to You By Deranged Monkeys

If you have an Linux system, check the “bugs” section of the man page for the open(2) system call and you’ll see the following quote from Linus Torvalds:

The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances -Linus

I’m not joking, read that man page and you’ll see. Now, while I much prefer a mount option approach to direct I/O, I don’t think the O_DIRECT style of direct I/O was the brain child of a deranged monkey. I wonder if Linus is insinuating that the interface would be better if it was written by a sane monkey—or perhaps even a deranged monkey that is not on some serious mind-controlling substances?

There is nothing strange about O_DIRECT and most of the Unix derivations I am aware of are happy to offer it (Solaris being the notable exception offering directio(3C) instead). I’d love to know more about the context of that Linus quote. I’ve been around O_DIRECT since the very early 1990s. Sequent supported O_DIRECT opens on DYNIX/ptx file system files way back in 1991.

The Linux kernel development community still languishes over the fact that software like Oracle does not like to kernel-dive to access buffered data, preferring to do its own buffering instead.

A Mount-Option Approach
Why? Well, if you have programs that perform properly aligned I/O calls (e.g., cat(1), dd(1), cp(1), etc) but you don’t want them “polluting “ your system page cache, then you either need a mount-option approach to do Direct I/O or the tools need to be re-coded to open O_DIRECT. Back in 2001 I had the opportunity to make that choice for PolyServe and I haven’t regretted it once. Let me explain.

Let’s say, for instance, you generate and compress a few gigabytes of archived redo logs per day—or roughly ~40KB per second. It doesn’t sound like much, I know. But let’s look at page cache costs. When ARCH spools an offline redo log to the archive log destination the OS page cache will be used to buffer the I/O. When your compression tool (e.g. compress(P), gzip(1)) reads the file, page cache will once again be used. As the output of compress needs to be written page cache is used. Finally, when the archived redo is copied off the system (e.g., to tape), page cache will again be used. All this caching for data that is not used again—save for emergencies. But really, caching sequentially read archived files and compress output? Makes little sense.

The only way to not cache this sort of data is O_DIRECT, but I/Os issued against an O_DIRECT opened file must be multiples of the underlying disk block size (generally 512 bytes). The buffer in the calling process used for the I/O must also be a aligned on an address that is a multiple of the OS page size. It turns out that most OS tools perform proper alignment of their I/O buffers. So where is the rub? The I/O sizes! Even if you coded your compress tool to use O_DIRECT (deranged monkey syndrome), the odds that the output file will be a multiple of 512 bytes is nil. Let’s look at an example.

Direct I/O for Better Memory Utilization
In the following session I performed 6 steps to see the effect of direct I/O:

  1. Use df to determine space and exact filesystem of my current working directory (CWD)
  2. Check the Mount options. My CWD is a PolyServe PSFS mounted with the DBOptimized mount option which “renders” direct I/O akin to the Solaris –forcedirectio mount option.
  3. List my redo logs. Note, they are OMF files so the names are a bit strange.
  4. Check free memory on the system
  5. Copy a redo log
  6. Check free memory again to see how much memory was used by the OS page cache

fig1.jpg

OK, hold it, in step 5 I copied a 128MB file and yet the free memory available only changed by 176KB (from step 4 to step 6). My copy of an online log closely resembles what ARCH does—it simply copies the inactive online redo log to the archive log destination. I like the ability to not consume 256MB of physical memory to copy a file that is no longer really part of the database! The cp(1) command performs I/O with requests that are 512byte multiples, so the PolyServe CFS mounted in the DBOptimized mode simply “renders” the I/O through the direct I/O code path. No, cp(1) does not open with O_DIRECT, yet I relieved the pressure on free memory by copying with Direct I/O via the mount option. That’s good.

File Compression with Direct I/O Mounted Filesystem
But what about compressing files in a direct I/O filesystem? Let’s take a look. In the next session I did the following:

  1. Check free memory on the system
  2. Used ls(1) to see my copy of the redo log file.
  3. Used gzip(1) with maximum compression on the copy of the redo log file
  4. Used ls(1) to see the file size of the compressed file.
  5. Check free memory on the system to see what OS page cache was used

fig2.jpg

OK, this is good. I take a 128MB redo log file and compress it down to 29,582,800 bytes—which is, of course, 57,778 512 byte chunks plus one 464 byte chunk. According to the differences in free memory from step 1 and step 5, only 64KB of system memory was “wasted” in the act of compressing that file. Why do I say wasted? Because cache is best used for sharing data such as in the SGA, however, here I was able to read in 128MB and write out 28.2MB and only used 64KB of page cache in the process. Memory costs money and efficiency matters. This is the reason I prefer a mount option approach to direct I/O.

Back to the example. How did I write an amount that included a stray 464 bytes with direct I/O? That is not a multiple of the underlying disk driver requirement which is 512 bytes.

Under The Covers
On Linux, gzip(1) uses 32KB reads and 16KB writes. The output file created by gzip(1) is 29,839,295 bytes which is 1,805 writes at 16KB and one last odd-ball write of 9,680 bytes—something that would be impossible to do with direct I/O were it not for the direct I/O mount option. Let’s look at strace. The last write was 9,680 bytes:

fig3.jpg

Direct I/O Without Compile-Time O_DIRECT
I can’t speak about other direct I/O mount implementations, but I can explain how PolyServe does this. All I/O bound for files in a DBOptimized mounted PSFS filesystem are quickly examined to see if the I/O meets the underlying device driver DMA requirements. In the kernel we use simple arithmetic to determine if the I/O size is a multiple of the underlying disk block size (satisfies DMA requirement) and whether the I/O buffer is aligned on a page boundry. If both conditions are true, the I/O is DMAed directly from the process address space to the disk. If not, we simply grab an OS page cache buffer, perform the I/O and then immediately invalidate that page so no other process can read dirty data (PolyServe is sort of big on cache coherency if you get my drift).

Best of Both Worlds
In the end, Linus might be right about O_DIRECT, but sitting here at PolyServe makes me say, “Who cares.” We supported direct I/O on Linux before Linux supported O_DIRECT (it was just a patch at that time). In fact, we did a 10-node Oracle9i RAC, 10 TB, 10,000 user OLTP Proof of Concept way back in 2002—before Linux O_DIRECT was mainstream. Here is a link to the paper if you are interested in that proof point.


Direct I/O Can Crash Dataguard. A Tricky ORA-01031.

 

BLOG CORRECTION: When explaining the use of the ORASYM environment variable to a reader, I noticed I typed it and assigned it to the wrong value. A double-decker bug! This entry corrects that bug. Please read the comments for context.

If improperly configured as per your platform, that is.

A Tricky ORA-01031 Error Case
I noticed something very odd today in the Linux x86_64 port of Oracle10gR2. If you set filesystemio_options=directIO and start up the instance (without connecting through the listener), subsequent connects as SYSDBA through the listener will open the orapw file with O_DIRECT. Huh? Yes, a normal, non-shared file opened O_DIRECT. I don’t know when that started, or what other platforms do this, but if the orapw file is located on a filesystem that does not support open(2) with O_DIRECT, you get an ORA-01031. This was no big deal really, other than the mystery, since I store my database on PolyServe with the direct I/O mount option. That is, the filesystemio_options=directIO is redundant for the database files and—as it turns out—creates a problem if the orapw file is on PSFS with regular mount options. I could simply omit the init.ora setting. But I’d like to cover this anyway.

 

It seems this was hit before by someone back in Oracle9iR2 days as per Metalink 3312751. So, as I was saying, I have a couple of simple solutions for my particular case where I hit this, but I thought there might be someone out there that would enjoy seeing the way I figured out it was the orapw file that was causing me trouble. But first, what did this do to Dataguard in my environment?

 

Mixing Dataguard, O_DIRECT and the orapw File
It may be that nobody on earth will ever hit this problem. I don’t know. But when it happened to me I had shut down my Dataguard primary, set filesystemio_options= directIO (for some test or another) and then went off and did other things. You know the memory is the first thing to go. “All of the sudden”, my Dataguard setup was not working and I could not remember what had changed (test gear). After starting my Dataguard primary back up (with filesystemio_options=directIO), I saw the following little treat in my alert log:

Thu Feb 22 10:12:46 2007
Errors in file /u01/app/oracle/admin/PROD/bdump/prod_arc0_6977.trc:
ORA-01017: invalid username/password; logon denied
ORA-27041: unable to open file
Linux-x86_64 Error: 22: Invalid argument
Additional information: 2
Thu Feb 22 10:12:46 2007
Error 1017 received logging on to the standby
————————————————————
Check that the primary and standby are using a password file
and remote_login_passwordfile is set to SHARED or EXCLUSIVE,
and that the SYS password is same in the password files.
returning error ORA-16191
It may be necessary to define the DB_ALLOWED_LOGON_VERSION
initialization parameter to the value “10”. Check the
manual for information on this initialization parameter.

OK, forget for a moment that Metalink note 259142.1 states that DB_ALLOWED_LOGON_VERSION is replaced by a setting in the sqlnet.ora file yet this Oracle10gR2 Dataguard primary is spitting out mention of that as possible remedy for the problem. I want to go over how I figured out what the problem actually was.

 

Oracle Binary Wrapper
Since the problem at hand was the inability to connect as SYSDBA through the listener, I would not be able to simply run sqlplus under strace (or truss). The listener is the parent of the dedicated server process in that situation. So, I had to implement a wrapper around the Oracle binary. To do this you simply mv $ORACLE_HOME/bin/oracle to something like $ORACLE_HOME/bin/oracle.bin and then create a script like this:

 

-bash-3.00$ cat $ORACLE_HOME/bin/oracle
#!/bin/bash
export ORASYM=/u01/app/oracle/product/10.2.0/db_1/bin/oracle.bin
if [ -f /tmp/trace ]
then
strace -f -o /tmp/trace.$$ /u01/app/oracle/product/10.2.0/db_1/bin/oracle.bin

else

exec /u01/app/oracle/product/10.2.0/db_1/bin/oracle.bin $*
fi

 

I didn’t create the /tmp/trace file until after I started the instance. This is how you can have background processes executing without single stepping (e.g., strace, truss) all Oracle processes. Instead you target a specific connection to the database by creating the /tmp/trace file immediately before connecting and then removing the file so that other connections are normal. I then ran sqlplus and generated the ORA-01031 error:

 

-bash-3.00$ sqlplus /nolog
SQL*Plus: Release 10.2.0.1.0 – Production on Thu Feb 22 12:44:28 2007
Copyright (c) 1982, 2005, Oracle. All rights reserved.
SQL> conn sys@PROD/test as sysdba
ERROR:
ORA-01031: insufficient privileges
SQL> exit

 

The shadow process PID was 20209 so I grep(1)ed for orapw from /tmp/trace.20209 and found the culprit:

NOTE: Right click-> view the image

orapw1.jpg

 

As the strace output revealed, the shadow process created through the listener uses O_DIRECT when opening the orapw file. Trivial pursuit!

Solution
The solution is very platform specific as I was saying, but it is not inconceivable to hit this on other platforms. In my case I simply don’t run with filesystemio_options= directIO because I store all my database files in PolyServe(HP) PSFS with the direct I/O mount option anyway. Works just fine, but I thought I’d elaborate on the point by relocating the orapw file to a mount that supports O_DIRECT and then use a symlink to $ORACLE_HOME/dbs followed by a successful connection as follows:

 

-bash-3.00$ cp /u01/app/oracle/product/10.2.0/db_1/dbs/orapwPROD \ /u01/app/oracle/oradata/PROD/orapwPROD
-bash-3.00$ rm /u01/app/oracle/product/10.2.0/db_1/dbs/orapwPROD
-bash-3.00$ ln -s /u01/app/oracle/oradata/PROD/orapwPROD \ /u01/app/oracle/product/10.2.0/db_1/dbs/orapwPROD
-bash-3.00$ df /u01/app/oracle/oradata/PROD
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/psd/psd2p2 1225017312 3617672 1221399640 1% /u02
-bash-3.00$ mount | grep u02
/dev/psd/psd2p2 on /u02 type psfs (rw,dboptimize,shared,data=ordered)
-bash-3.00$ sqlplus /nolog

SQL*Plus: Release 10.2.0.1.0 – Production on Thu Feb 22 12:52:01 2007
Copyright (c) 1982, 2005, Oracle. All rights reserved.
SQL> conn sys@PROD/test as sysdba
Connected.
SQL> show parameter filesystem

NAME TYPE VALUE
———————————— ———– ——————————
filesystemio_options string directIO

 

Forward Look
I’ll be making a blog entry quite soon about how O_DIRECT can save memory for non-database activity like archived redo log compression.

 

 

DBWR with CIO on JFS2. Resource Starvation?

In a recent post to the oracle-l email list, a participant posted the following:

We are running Oracle 9.2.0.7 & 10.2.0.2 on Aix 5.2.  I am seeing the following warnings in the DB Writer trace file:

Warning: lio_listio returned EAGAIN
Performance degradation may be seen.

There have been several very interesting threads on the list lately about Oracle on AIX with JFS2 filesystems. I’d like to blog quickly about this thread.

The error message in the DBWR trace file indicates that lio_listio() calls are being starved for resources, but what resources?

A lio_listio() Backgrounder
The error message in the DBWR trace file indicates that lio_listio() calls are being starved for resources. For those who don’t know, lio_listio() is the traditional POSIX gathered write system call. Most Oracle ports use it. It supports DBWR’s need to flush multiple buffers to multiple files in one call, but completion processing is a little rough. The way the call works is your program (e.g., DBWR) either issues a long list of requests and blocks until they are all done (mode=LIO_WAIT), or the mode argument is set to LIO_NWAIT which means your process can continue to do other work. When all the LIO_NWAIT requests are completed, your process gets delivered a signal also specified at call time. So, if for some reason, DBWR is doing a lot of small burst writes, it will likely be handling a lot of signals. No big deal really, but that is how it works. The way I/O completion handling works with POSIX AIO is one of the main reasons Oracle was motivated to define the Oracle Disk Manager library specification(ODM).  The differences between ODM and POSIX AIO is a topic for a different blog entry.

About lio_listio() on Filesystem Files
Details, details. The way AIO is implemented from one filesystem to the next varies greatly—unfortunately. I prefer simplicity and that is why the cluster filesystem coupled with PolyServe Matrix Server supports the same Linux kernel internals for AIO as are exercised when using raw devices. If you mount a PSFS in DBoptimized mode, calls to lio_listio(3) are met with the same kernel handling as if the call was targeting raw partitions. But, this blog entry is not about PolyServe since the oracle-l post was about Oracle9i and Oracle10g on AIX.

AIO on JFS2
So, I sent an email to the poster to make sure that this error was being hit on JFS2 and he confirmed it was. You see, AIO on JFS2 is handled by kernel processes of which there are only 10 per processor by default. The number of system wide AIO in flight is limited by the number of these AIO “handlers” as specified in IBM’s AIO documentation (see maxervers). To get around this lio_listio() error, more AIO kernel processes are needed, but how many? That IBM documenation suggests:

The value should be about the same as the expected number of concurrent AIO requests.

How Many is Enough. How Many is Too Many?
Hmmm, that is going to be a very difficult number to guess for the original poster who had both 9i and 10g running on the same pSeries server. Remember, there is no way to control DBWR. If you have a really dirty SGA, DBWR will do really large write batches. That is a good thing.  

In cases like this I tend to generally work out what safe ceilings are. I know Bret Olszewski and if he gives a reasonable value for a tunable, I tend to take his word. In this paper about database performance on JFS, Bret and his co-authors state:

[…] the number of aioservers in the system limits the number of asynchronous I/O operations that can be in progress simultaneously. The maximum number of aioservers that can be created is controlled by the maxservers attribute, which has a default value of 10 per processor. For our experiments, we used a maxservers value of 400 per processor.

The paper also points out that there are 4 CPUs in the System p 680 test system. So while the default is 10 per CPU, it looks like if the collective DBWR in-flight I/O are bursting at high rates—such as as 1600 concurrent I/Os—setting aioservers to 400 per processor is not unreasonable.

The paper also covers a technique used to determine if you have too many aioservers. Basically, if you have a bunch of these kernel processes that do not get CPU time then the I/O load is sufficiently being handled by the processes that are burning CPU. You can then reduce the number if it seems important to you. Makes sense, but I can’t imagine having sleeping kernel processes around costs much.

A Word About Filesystem Performance
OK, if Bret specifies a tunable, I take his advice. On the other hand, I have to call out a problem in that JFS2 paper. In the analysis section, the conclusion is that Concurrent I/O performs 200% better than plain Direct I/O, and comes within 8% of raw logical volumes.That makes sense since CIO implicitly invokes DIO and DIO should be as fast as raw disk with the exception of write serialization (which CIO addresses). The test was 100% processor bound and I/O intensive so any processor overhead in the I/O code path will account for that 8%–there is, after all, more software in the filesystem case than the raw logical volume case but is it really code path that costs the 8 percent? The paper states:

The Concurrent I/O run has a higher %system and lower %user than the raw LV run because of the additional context switches due to traversing down the file system path for I/O, and due to the use of AIO server threads for servicing AIO requests […]

OK, the bit about “additional context switches due to traversing down the file system path” is really weird. There is no such thing unless you are opening files for every I/O—which, of course, Oracle does not do. Oracle keeps files open and I/O is nothing more than a position and read/write  (e.g., pread(), pwrite()) or an asynchronous write (e.g., lio_listio()).  It is true that converting a pathname to an inode costs (e.g., the namei()/lookup() kernel routines), but that is paid in the open(2) call. Once a file is open there is no cost associated with the pathname because I/O routines do not even use a pathname—they use the file descriptor. In short,  there is no “traversing down the file system path” associated with I/O on an open file. If I had $50 to bet, I would wager the 1600 AIO handlers jumping on and off those 4 CPUs would account for the 8%, but what do I know?

 

Those Little Round Brown Spinning Things Will Fail

StorageMojo.com has a lively thread going about hard drive reliability and RAID/Replication mitigation techniques—some of which are completely incompatible with Oracle. RAID5 is not being treated very well in the thread and that is ok with us—remember: BAARF.

 

Installing Oracle on RHEL5

Tim Hall has posted a blog entry about installing Oracle on RHEL 5. There are URLs in the blog entry that provide very step-by-step install directions. I wanted to point out, however, that in one of the URLs the stipulation is made that Secure Linux (SELinux) is presumed to be disabled but I didn’t see tips on how to do that. The topic of disabling SELinux also came up in the comments thread of a post I did about applying the 10.2.0.3 patchset on RHEL4.

Red Hat dedicates a webpage to the topic of disabling SELinux.

All that aside, a friend of mine (works for Red Hat)  is actually doing the Oracle Certification of RHEL5. I think I’ll wait until he is done before I “try the water.”

Database Systems Pioneer Starts Database Company.

Required reading for anyone who is really serious about database internals is what I always called the “big Red Book”. Readings in Database System by Michael Stonebraker sits prominently in my personal library. I recommend it to anyone interested in the fundamentals. Hold it, I just realized that this Amazon link for the book is Amazon.ca and the price is $256.40 Loonies and only US $57.00 on this Amazon.com site. Wow, I bet Stonebraker would rather sell his stuff in Canada! Or then again, he might be too busy.

 

According to this Network World article, Michael Stonebraker has been leading a startup called Vertica Systems. Vertica is going to bring a new database system to market. Vertica has Jerry Held and Ray Lane in the head-shed. I just blogged about some interesting perspectives Ray Lane has on the “traditional software” model in my blog entry entitled Scalable NFS Powered by Open Source Cluster Filesystems. I recommend you check it out to see if those views seem congruous given Ray’s involvement in Vertica. Or is this stealth-news that Vertica is some amalgam of Open Source and “Ad-Revenue funded Software.” Either way, I think bringing yet another database management system to market would be quite the challenge. CA spun off Ingres to go do battle with Oracle recently as well. Doing battle with Oracle is a good way to wind up floating face down in murky water somewhere.

I also noted that the latest round of funding raised by Vertica included money from New Enterprise Associates. Hmm, My company, PolyServe, was also seeded by NEA. Maybe we are all one big happy family. After all, the Network World article states:

It is designed to handle data warehousing, business intelligence, fraud detection and other applications, even in environments with hundreds of terabytes of data.

PolyServe customers build tremendously large clusters for unstructured and structured data alike. I know of one of our Oil and Gas customers that has more than one cluster handling over 250TB but I digress…

We’ll have to keep an eye on Vertica; it is Michael Stonebraker after all!

 

What Performs Better, Direct I/O or Direct I/O? There is No Such Thing As a Stupid Question!

When I was just starting out in IT, one of my first mentors told me that the only stupid question is the one you don’t ask. I really appreciated that mentality because I asked a lot of them—a lot. I’m not blogging about a stupid question, per se, but one that does get asked so often that it seems it shouldn’t have to be answered ever again. The topic is Oracle performance with direct/raw disk I/O versus buffered filesystems.

In a side-channel discussion with a reader, I was asked the following question:

We have created table with same size of data into three different tablespaces respetively. To our surprise, the query was so quick on normal filesystem, unfortunately, query where the table resides on ASM disk groups, ran very slow comparatively to the normal filesystem.

If an AIX filesystem is using both direct I/O and concurrent I/O, there should be no difference in I/O service times between the filesystem and ASM tests. Note to self: The whole direct I/O plus concurrent I/O deserves another blog entry.

My reply to the question was:

If it was dramatically quicker on the AIX filesystem (JFS?) then you are likely getting cached in the OS buffer cache and comparing to raw transfers with ASM.  If the dataset is small enough to stay cached in the OS cache, then it will always be quicker than raw/direct IO (e.g., ASM or a filesystem with direct mount options).

The reader followed up with:

But, Kevin,

People say, sometime is direct IO, i.e. bypassing the OS cache is more faster.
If so, what we are doing with raw device is the same, i.e. direct IO?

And the thread continued with my reply:

Complex topic. “Faster” is not the word. Faster means service times and there is no way an I/O direct from the disk into the address space of an Oracle process can have faster service times because it is raw. In fact, the opposite is true. If the block being read is in the OS page cache, then the service time (from Oracle’s perspective) will be very fast. If it isn’t in the cache, or if the I/O is a write then the story is entirely different. The overhead associated with acquiring an OS buffer, performing the DMA into that buffer from disk and then copied into the SGA/PGA is too costly in processor terms than most systems can afford. Not to mention at that time the buffer is in memory twice…which is not very efficient by any means.

In the end it really depends on what your workload is. If for some reason you have a workload that you just can’t seem to get resident in Oracle’s buffering, then the OS page cache can be helpful.

In the past I’ve taken the rigid stance that direct or raw I/O is the only acceptable deployment option only to be proven wrong by peculiar customer workloads. Over time I started to realize that it is insane for someone like me—or anyone out there for that matter—to tell a customer that direct or raw I/O is the only answer to their problems in spite of what their actual performance is. I’m not saying that it is common for workloads to benefit from OS buffering, but if it does for some particular workload then fine. No religion here.

It turned out that the reader increased his SGA and found parity between the filesystem and ASM test cases as is to be expected. I’ll add, however, that only a filesystem gives you the option of both buffered and unbuffered I/O including mixed on a per-tablespace basis if it helps solve a problem.

My old buddy Glenn Fawcett puts a little extra coverage on the topic from a Solaris perspective here.

The fact remains that there is no such thing as a stupid question.

 

Network Appliance OnTap GX–Specialized for Transaction Logging.

Density is Increasing, But Certainly Not That Cheap
Netapp’s SEC 10-Q form for their quarter ending in October 2006 has a very interesting prediction. I was reading this post on StorageMojo about Isilon and saw this quote from the SEC form (emphasis added by me):

According to International Data Corporation’s (IDC’s) Worldwide Disk Storage Systems 2006-2010 Forecast and Analysis, May 2006, IDC predicts that the average dollar per petabyte (PB) will drop from $8.53/PB in 2006 to $1.85/PB in 2010.

Yes, Netapp is telling us that IDC thinks we’ll be getting storage at $8.53 per Petabyte within the next three years. Yippie! Here is the SEC filing if you want to see for yourself.

We Need Disks, Not Capacity
Yes, drive density is on the way up so regardless of how off the mark Netapp’s IDC quote is, we are going to continue to get more capacity from fewer little round brown spinning things. That doesn’t bode well for OLTP performance. I blogged recently on the topic of choosing the correct real estate from disks when laying out your storage for Oracle databases. I’m afraid it won’t be long until IT shops are going to force DBAs to make bricks without straw by assigning, say, 3 disks for a fairly large database. Array cache to the rescue! Or not.

Array Cache and NetApp NVRAM Cache Obliterated With Sequential Writes
The easiest way to completely trash an most array caches is to perform sequential writes. Well, for that matter, sequential writes happen to be the bane of NVRAM cache on Filers too. No, Filers don’t handle sequential writes well. A lot of shops get a Filer and dedicate it to transaction logging. But wait, that is a single point of failure. What to do? Get a cluster of Filers just for logging? What about Solid State Disk?

Solid State Disk (SSD) price/capacity is starting to come down to the point where it is becoming attractive to deploy them for the sole purpose of offloading the sequential write overhead generated from Oracle redo logging (and to a lesser degree TEMP writes too). The problem is they are SAN devices so how do you provision them so that several databases are logging on the SSD? For example, say you have 10 databases that, on average, are each thumping a large, SAN array cache with 4MB/s for a total sequential write load of 40MB/s. Sure, that doesn’t sound like much, but to a 4GB array cache, that means a complete recycle every 100 seconds or so. Also, rememeber that buffers in the array cache are pinned while being flushed to back to disk. That pain is certainly not being helped by the fact that the writes are happening to fewer and fewer drives these days as storage is configured for capacity instead of IOPS. Remember, most logging writes are 128KB or less so a 40MB logging payload is derived from some 320, or more, writes per second. Realistically though, redo flushing on real workloads doesn’t tend to benefit from the maximum theoretical piggy-back commit Oracle supports, so you can probably count on the average redo write being 64KB or less—or a write payload of 640 IOPS. Yes a single modern drive can satisfy well over 200 small sequential writes per second, but remember, LUNS are generally carved up such that there are other I/Os happening to the same spindles. I could go on and on, but I’ll keep it short—redo logging is tough on these big “intelligent” arrays. So offload it. Back to the provisioning aspect.

Carving Luns. Lovely. 
So if you decide to offload just the logging aspect of 10 databases to SSD, you have to carve out a minimum of 20 LUNS (2 redo logs per database) zone the Fibre Channel switch so that you have discrete paths from servers to their raw chunks of disk. Then you have to fiddle with raw partitions on 10 different servers. Yuck. There is a better way.

SSD Provisioning Via NFS
Don’t laugh—read on. More and more problems ranging from software provisioning to the widely varying unstructured data requirements today’s applications are dealing with keep pointing to NFS as a solution. Provisioning very fast redo logging—and offloading the array cache while you are at it—can easily be done by fronting the SSD with a really small File Serving Cluster. With this model you can provision those same 10 servers with highly available NFS because if a NAS head in the File Serving Utility crashes, 100% of the NFS context is failed over to a surviving node transparently—and within 20 seconds. That means LGWR file descriptors for redo logs remain completely valid after a failover. It is 100% transparent to Oracle. Moreover, since the File Serving Utility is symmetric clustered storage—unlike clustered Filers like OnTap GX—the entire capacity of the SSD can be provisioned to the NAS cluster as a single, simple LUN. From there, the redo logging space for all those databases are just files in a single NFS exported filesystem—fully symmetric, scalable NFS. The whole thing can be done with one vender too since Texas Memory Systems is a PolyServe reseller. But what about NFS overhead and 1GbE bandwidth?

NFS With Direct I/O (filesystemio_options=directIO|setall)
When the Oracle database—running on Solaris, HP-UX or Linux—opens redo logs on an NFS mount, it does so with Direct I/O. The call overhead is very insignificant for sequential small writes when using Direct I/O on an NFS client. The expected surge in kernel mode cycles due to the NFS overhead really doesn’t happen with simple positioning and read/write calls—especially when the files are open O_DIRECT (or directio(3C) for Solaris). What about latency? That one is easy. LGWR will see 1ms service times 100% of the time, no matter how much load is placed on the down-wind SSD. And bandwidth? Even without bonding, 1GbE is sufficient for logging and these SSDs (I’ve got them in my lab) handle requests in 1ms all the way up to full payload which (depending on model) goes up to 8 X 4Gb FC—outrageous!

Now that is a solution to a problem using real, genuine clustered storage. And, no I don’t think NetApp really believes a Petabyte of disk will be under $9 in the next three years. That must be a typo. I know all about typos as you blog readers can attest.

 


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 741 other subscribers
Oracle ACE Program Status

Click It

website metrics

Fond Memories

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.