Clustered Storage | Kevin Closson's Blog: Platforms, Databases and Storage

Archive for the 'Clustered Storage' Category

Manly Men Only Deploy Oracle with Fibre Channel – Part V. What About Oracle9i on RHAS 2.1? Yippie!

Due to my Manly Man Fibre Channel Series Part I , Part II , Part III and Part IV, my email box is getting loaded with a lot of questions about various Oracle over NFS combinations. The questions run the gamut from how to best tune Oracle9i on Red Hat AS 2.1 to Oracle10g on Red Hat RHEL 3 (all on NAS/NFS of course). And then it dawned on me. When I say I’m a fan of Oracle over NFS, that is just entirely too generic.

It Ain’t Linux Unless It Is a 2.6 Kernel
Honestly folks, Red Hat 3.0-or worse yet, RHAS 2.1? Sheer madness. I’m more than convinced that there are a lot of solid RHEL 3.0 systems out there running Oracle. To those folks I’d say, “If it isn’t broken, don’t fix it.” But RHAS 2.1? That wasn’t even an operating system and to be hyper-critically honest, the “franken-kernel” that was RHEL 3.0 wasn’t really that much better, what with that hugemem 4×4 split garbage and all. SuSE SLES8 was vastly more stable than RHEL 3.0. But I digress. Look, if you are running on a pre-2.6 Kernel Linux distribution you’ve simply got to do yourself a favor and plan an upgrade! Now, back to NAS.

What Oracle on NFS?
I’ll be brief, I wouldn’t even think about using Oracle9i on NAS. I know there are a ton of databases out there doing it, but that is just me. The Oracle Server code specific to NFS (Operating System Dependent code) has gone through some serious evolution/maturation. I’ve watched the changes specifically handling NFS mature from 9i through 10g and now into 11g. Simply put, I didn’t like what I say in Oracle9i-specific to NFS that is. Oracle9i is a perfectly fine release-albeit the port to 64bit Linux was pretty scary. I guess I wasn’t that brief. So I’ll continue.

So, Oracle9i on NAS is a no-go (in my book), what about Oracle10g? There again, I’ll be brief. In my opinion, Oracle10gR1 on NAS was about as elegant as a fish flopping around on a hot sidewalk-not a pretty picture. Yes, I have my reasons why for all this stuff, but this blog entry is purely an assertion of my opinion.

Thus far, I discussed 9i and 10gR1 Linux ports. I cannot speak authoritatively about the Solaris ports of either vis a vis fitness for NFS. If I was a betting man and had two dimes to rub together I would wager them that even the Solaris releases of 9i and 10g were probably pretty shaky on NAS. That leads us to 10gR2.

Solid
Oracle10gR2 on NAS is solid-at least for Linux clients. I have seen Metalink stories about Legacy Unix ports that have RMAN problems with NFS as a near-line backup target. Again, I cannot speak for all these sundry platforms. They are good platforms, but I don’t deal with them day to day.

11g
Don’t jump the gun…tomorrow AM…

Examples
In this May 5, 2007 post on toasters, a list participant posted the following:

We are about to start testing Oracle 9i (single instance) with NetApp NAS (6070) filers. We currently have Oracle running on Solaris 9 with SAN storage attached and VERITAS.

I wouldn’t touch that project with a 10 foot pole. If that database is stable, I wouldn’t switch out the storage architecture-especially on that old of an Oracle release.

I’ve also had a thread going with Chen Shapira who has blogged about Oracle troubles on NAS. Her point throughout that blog entry, and the comments to follow, was that they’ve suffered uptime impact that never really solidly indicts to the storage, but there seems to be a lot of fingers pointed that way. Having read of the types of instability his systems have suffered, I suspected old stuff. It came out in the comment section that they are on RHEL 3.0 64-bit. Now, like I’ve said, RHEL 3.0 is carrying a lot of Oracle databases out there I know, but I wonder how many on NAS? When I say Oracle on NFS, I’m mostly saying Linux Oracle10gR2 releases on Linux 2.6 Kernels—and beyond.

I made a blog entry on this topic back in October of last year as well.

Old Operating System Releases
I take criticism (by true believers mostly) when I point out that running Oracle on a Legacy Unix release that is, say, four years old is not a reason for concern. I wish I could say the same thing about the current state of the art in the Linux world. Dating back to my first high-end Linux project (The Tens–A 10 Node, 10TB, 10,000 User Oracle9i Linux Cluster Project in 2002), I’ve been routinely reminded that Linux stands for:

(L)inux (i)s (n)ot (u)ni(x)

Now, that said, you’ll find much less dissatisfaction with Oracle in general on 2.6 Linux Kernel based systems, but in my opinion, that goes extra for NAS deployments

Standard File System Tools? We Don’t Need No Standard File System Tools!

Published March 16, 2007 Cluster File System , Clustered Storage , filesystem performance , I/O Topics , Linux CFS Performance , linux file system performance , Unstructured Data 3 Comments

Yesterday I posted a blog entry about copying files on Solaris. I received some side channel email on the post such as one with the following tidbit from a very good, long time friend of mine. He wrote:

So optimizing cp() is now your hobby? What’s next….. “ed”… no wait “df”.. boy it sure would be great if I could get a 20% improvement in “ls”… I am sure these commands are limiting the number of orders/hr my business can process :)))

Didn’t that blog entry show a traditional cp(1) implementation utilizing 26% less kernel mode processor cycles? Oh well.

It’s About the Whole System
While those were words spoken in jest, it warrants a blog entry and I’ll tell you why. It is true this is an Oracle related blog and such filesystem tools as cp(1) are not in the Oracle code path. I blog about these things for two reasons: 1) a lot of my readers enjoy learning more about the platform in general and 2) many—perhaps most—Oracle systems have normal file system tools such as cp(1), compress(1) and others running while Oracle is running. For that matter, the Oracle server can call out to the same libraries these tools use for such functionality as BFILE and UTL_FILE. For that reason, I feel these topics are related to Oracle platforms. After all, a garbage-can implementation of the standard filesystem tools—and/or the kernel code paths that service them—is going to take cycles away from Oracle. Now please don’t quote me as saying the mmap()-enabled Solaris cp(1) is a “garbage-can” implementation. I’m just making the point that if such tools are implemented poorly Oracle can be affected even though they are not in the scope of a transaction. It’s about the whole system.

Legacy Code. What Comes Around…Stays Around.
Let’s not think for even a moment that the internals of such tools as ls(1) and df(1) are beyond scrutiny. Both ls(1) and df(1) use the stat(2) system call. We Oracle-minded folks often forget that there is much more unstructured data than structured so it is a good thing there are still some folks like PolyServe (HP) minding the store for the performance of such mundane topics as stat(2). Why? Well, perfect examples are the online photo operations such as Snapfish. Try having thousands of threads accessing tens of millions of files (photos) for fun. See, Snapfish uses the HP Enterprise File Services Clustered Gateway NAS powered by PolyServe. You can bet we pay attention to “mundane” topics like what ls(1) behaves like in a directory with 1, 2 or 100 million small files. The stat(2) system call is extremely important in such situations.

He’s Off His Rocker—This is an Oracle Blog.
What could this possibly have to do with Oracle? Well, if you run Oracle on a platform that only specializes in the code underpinnings of the most common server I/O (e.g., db file sequential read, db file scattered read, direct path read/write, LGWR and DBWR writes), you might not end up very happy if you have to do things that hammer the filesystem with Oracle features like UTL_FILE, BFILE, external tables, imp/exp and so forth, cp(1), tar(1), compress(1) and so on. It’s all about taking a holistic view instead of “camps” that focus on segments of the I/O stack.

As the cliché goes, standard file operations and highly specialized Oracle code paths are often joined at the hip.

HP to Acquire PolyServe to Bolster NAS Offerings with Clustered Storage

Published February 26, 2007 Clustered Storage , Isilon , NFS CFS ASM , Oracle NAS , Oracle NFS 14 Comments

You faithful readers of this blog know my position on NAS for Oracle. Clustered Storage is getting hot and HP has just stepped up to the plate by acquiring PolyServe. Here is a link to HP’s website with details:

HP To Acquire PolyServe

As you regular readers can imagine, my blogging will certainly sound a lot different going forward.

Standard File Utilities with Direct I/O

Published February 23, 2007 Clustered Storage , direct I/O , Linux CFS , Linux CFS Performance , oracle , Oracle Direct I/O 10 Comments

In my last blog entry about Direct I/O, I covered the topic of what Direct I/O can mean beyond normal Oracle database files. A reader followed up with a comment based on his experience with Direct I/O via Solaris –forcedirectio mount option:

I’ve noticed that on Solaris filesystems with forcedirectio , a “compress” becomes quite significantly slower. I had a database where I was doing disk-based backups and if I did “cp” and “compress” scripting to a forcedirectio filesystem the database backup would be about twice as long as one on a normally mounted filesystem.

I’m surprised it was only twice as slow. He was not alone in pointing this out. A fellow OakTable Network member who has customers using PolyServe had this to say in a side-channel email discussion:

Whilst I agree with you completely, I can’t help but notice that you ‘forgot’ to mention that all the tools in fileutils use 512-byte I/Os and that the response time to write a file to a dboptimised filesystem is very bad indeed…

I do recall at one point cp(1) used 512byte I/Os by default but that was some time ago and it has changed. I’m not going to name the individual that made this comment because if he wanted to let folks know who he is, he would have made the comment on the blog. However, I have to respectfully disagree with this comment. It is too broad and a little out of date. Oh, and fileutils have been rolled up into coreutils actually. What tools are those? Wikipedia has a good list.

When it comes to the tools that are used to manipulate unstructured data, I think the ones that matter the most are cp, dd, cat, sort, sum, md5sum, split, uniq and tee. Then, from other packages, there are tar and gzip. There are others, but these seem to be the heavy hitters.

Small Bites
As I pointed out in my last blog entry about DIO, the man page for open(2) on Enterprise Linux distributions quotes Linus Torvalds as saying:

The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances

I beg to differ. I think he should have given that title to anyone that thinks a program like cp(1) needs to operate with little itsy-bitsy-teenie-weenie I/Os. The following is the current state of affairs (although not exhaustive) as per measurements I just took with strace on RHEL4:

tar: 10KB default, override with –blocking-factor
gzip: 32KB in/16KB out
cat, md5sum, split, uniq, cp: 4KB

So as you can see these tools vary, but the majority do operate with insidiously ridiculous small I/O sizes. And 10KB as the default for tar? Huh? What a weird value to pick out of the air. At least you can override that by supplying an I/O size using the –blocking-factor option. But still, 10KB? Almost seems like the work of “deranged monkeys.” But is all lost? No.

Open Source
See, I just don’t get it. Supposedly Open Source is so cool because you can read and modify source code to make your life easier and yet people are reluctant to actually do that. As far as that list of coreutils goes, only cp(1) causes a headache on a direct I/O mounted filesystem because you can’t pipeline it. Can you imagine the intrusive changes one would have to make to cp(1) to stop doing these ridiculous 4KB operations? I can, and have. The following is what I do to the coreutils cp(1):

copy.c:copy_reg()
/* buf_size = ST_BLKSIZE (sb);*/
buf_size = 8388608 ;

Eek! Oh the horror. Imagine the testing! Heaven’s sake! But, Kevin, how can you copy a small file with such large I/O requests? The following is a screen shot of two copy operations on a direct I/O mounted filesystem. I copy once with my cp command that will use a 8MB buffer and then again with the shipping cp(1) which uses a 4KB buffer.

Folks, in both cases the file is smaller than the buffer size. The custom cp8M will use an 8MB buffer but can safely (and quickly) copy a 41 byte file the same way the shipping cp(1) does with a 4KB buffer. The file is smaller than the buffer in both cases—no big deal.

So then you have to go through and make custom file tools right? No, you don’t. Let’s look at some other tools.

Living Happily With Direct I/O
…and reaping the benefits of not completely smashing your physical memory with junk that should not be cached. In the following screen shot I copy a redo log to get a working copy. My current working directory is a direct I/O mounted PSFS and I’m on RHEL4 x86_64. After copying I used gzip straight out of the box as they say. I then followed that with a pipeline command of dd(1) reading the infile with 8MB reads and writing to the pipe (stdout) with 8MB writes. The gzip command is reading the pipe with 32KB reads and in both cases is writing the compressed output with 16KB writes.

It seems gzip was written by monkeys who were apparently not deranged. The effect of using 32KB input and 16KB output is apparent. There was only a 16% speedup when I slammed 8MB chucks into gzip on the pipeline example. Perhaps the sane monkeys that implemented gzip could talk to the deranged monkeys that implemented all those tools that do 4KB operations.

What if I pipeline so that gzip is reading and writing on pipes but dd is adapted on both sides to do large reads and writes? The following screen shot shows that using dd as the reader and writer does pick up another 5%:

So, all told, there is 20% speedup to be had going from canned gzip to using dd (with 8MB I/O) on the left and right hand of a pipeline command. To make that simpler one could easily write the following scripts:

#!/bin/bash

dd if=$1 bs=8M

and

#!/bin/bash

dd of=$1 bs=8M

Make these scripts executable and use as follows:

$ large_read.sh file1.dbf | gzip –c -9 | large_write.sh file1.dbf.gz

But why go to that trouble? This is open source and we are all so very excited that we can tweak the code. A simple change to any of these tools that operate with 4KB buffers is very easy as I pointed out above. To demonstrate the benefit of that little tiny tweak I did to coreutils cp(1), I offer the following screen shot. Using cp8M offers a 95% speedup over cp(1) by moving 42MB/sec on the direct I/O mounted filesystem:

More About cp8M
Honestly, I think it is a bit absurd that any modern platform would ship a tool like cp(1) that does really small I/Os. If any of you can test cp(1) on, say, AIX, HP-UX or Solaris you might find that it is smart enough to do large I/O requests if is sees the file is large. Then again, since OS page cache also comes with built-in read-ahead, the I/O request size doesn’t really matter since the OS is going to fire off a read-ahead anyway.

Anyway, for what it is worth, here is the README that we give to our customers when we give them cp8M:

$ more README

INTRODUCTION
Files stored on DBOPTIMIZED mounted filesystems do not get accessed with buffered I/O. Therefore, Linux tools that perform small I/O requests will suffer a performance degradation compared to buffered filesystems such as normal mounted PolyServe CFS , Ext3, etc. Operations such as copying a file with cp(1) will be very slow since cp(1) will read and write small amounts of data for every operation.

To alleviate this problem, PolyServe is providing this slightly modified version of the Open Source cp(1) program called cp8M. The seed source for this tool is from the coreutils-5.2.1 package. The modification to the source is limited to changing the I/O size that cp(1) issues from ST_BLOCKSIZE to 8 MB. The following code snippet is from the copy.c source file and depicts the entirety of source changes to cp(1):

copy.c:copy_reg()

/* buf_size = ST_BLKSIZE (sb);*/

buf_size = 8388608 ;

This program is statically linked and has been tested on the following filesystems on RHEL 3.0, SuSE SLES8 and SuSE SLES9:

* Ext3

* Regular mounted PolyServe CFS

* DBOPTIMIZED mounted PSFS

Both large and small files have been tested. The performance improvement to be expected from the tool is best characterized by the following terminal session output where a 1 GB file is copied using /bin/cp and then with cp8M. The source and destination locations were both DBOPTIMIZED.

# ls -l fin01.dbf

-rw-r–r– 1 root root 1073741824 Jul 14 12:37 fin01.dbf

# time /bin/cp fin01.dbf fin01.dbf.bu
real 8m41.054s

user 0m0.304s

sys 0m52.465s

# time /bin/cp8M fin01.dbf fin01.dbf.bu2

real 0m23.947s

user 0m0.003s

sys 0m6.883s

Oracle Direct I/O Brought to You By Deranged Monkeys

Published February 23, 2007 Cluster File System , Clustered Storage , ocfs2 , oracle , Oracle Clusters , Oracle Direct I/O , server consolidation 5 Comments

If you have an Linux system, check the “bugs” section of the man page for the open(2) system call and you’ll see the following quote from Linus Torvalds:

The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances -Linus

I’m not joking, read that man page and you’ll see. Now, while I much prefer a mount option approach to direct I/O, I don’t think the O_DIRECT style of direct I/O was the brain child of a deranged monkey. I wonder if Linus is insinuating that the interface would be better if it was written by a sane monkey—or perhaps even a deranged monkey that is not on some serious mind-controlling substances?

There is nothing strange about O_DIRECT and most of the Unix derivations I am aware of are happy to offer it (Solaris being the notable exception offering directio(3C) instead). I’d love to know more about the context of that Linus quote. I’ve been around O_DIRECT since the very early 1990s. Sequent supported O_DIRECT opens on DYNIX/ptx file system files way back in 1991.

The Linux kernel development community still languishes over the fact that software like Oracle does not like to kernel-dive to access buffered data, preferring to do its own buffering instead.

A Mount-Option Approach
Why? Well, if you have programs that perform properly aligned I/O calls (e.g., cat(1), dd(1), cp(1), etc) but you don’t want them “polluting “ your system page cache, then you either need a mount-option approach to do Direct I/O or the tools need to be re-coded to open O_DIRECT. Back in 2001 I had the opportunity to make that choice for PolyServe and I haven’t regretted it once. Let me explain.

Let’s say, for instance, you generate and compress a few gigabytes of archived redo logs per day—or roughly ~40KB per second. It doesn’t sound like much, I know. But let’s look at page cache costs. When ARCH spools an offline redo log to the archive log destination the OS page cache will be used to buffer the I/O. When your compression tool (e.g. compress(P), gzip(1)) reads the file, page cache will once again be used. As the output of compress needs to be written page cache is used. Finally, when the archived redo is copied off the system (e.g., to tape), page cache will again be used. All this caching for data that is not used again—save for emergencies. But really, caching sequentially read archived files and compress output? Makes little sense.

The only way to not cache this sort of data is O_DIRECT, but I/Os issued against an O_DIRECT opened file must be multiples of the underlying disk block size (generally 512 bytes). The buffer in the calling process used for the I/O must also be a aligned on an address that is a multiple of the OS page size. It turns out that most OS tools perform proper alignment of their I/O buffers. So where is the rub? The I/O sizes! Even if you coded your compress tool to use O_DIRECT (deranged monkey syndrome), the odds that the output file will be a multiple of 512 bytes is nil. Let’s look at an example.

Direct I/O for Better Memory Utilization
In the following session I performed 6 steps to see the effect of direct I/O:

Use df to determine space and exact filesystem of my current working directory (CWD)
Check the Mount options. My CWD is a PolyServe PSFS mounted with the DBOptimized mount option which “renders” direct I/O akin to the Solaris –forcedirectio mount option.
List my redo logs. Note, they are OMF files so the names are a bit strange.
Check free memory on the system
Copy a redo log
Check free memory again to see how much memory was used by the OS page cache

OK, hold it, in step 5 I copied a 128MB file and yet the free memory available only changed by 176KB (from step 4 to step 6). My copy of an online log closely resembles what ARCH does—it simply copies the inactive online redo log to the archive log destination. I like the ability to not consume 256MB of physical memory to copy a file that is no longer really part of the database! The cp(1) command performs I/O with requests that are 512byte multiples, so the PolyServe CFS mounted in the DBOptimized mode simply “renders” the I/O through the direct I/O code path. No, cp(1) does not open with O_DIRECT, yet I relieved the pressure on free memory by copying with Direct I/O via the mount option. That’s good.

File Compression with Direct I/O Mounted Filesystem
But what about compressing files in a direct I/O filesystem? Let’s take a look. In the next session I did the following:

Check free memory on the system
Used ls(1) to see my copy of the redo log file.
Used gzip(1) with maximum compression on the copy of the redo log file
Used ls(1) to see the file size of the compressed file.
Check free memory on the system to see what OS page cache was used

OK, this is good. I take a 128MB redo log file and compress it down to 29,582,800 bytes—which is, of course, 57,778 512 byte chunks plus one 464 byte chunk. According to the differences in free memory from step 1 and step 5, only 64KB of system memory was “wasted” in the act of compressing that file. Why do I say wasted? Because cache is best used for sharing data such as in the SGA, however, here I was able to read in 128MB and write out 28.2MB and only used 64KB of page cache in the process. Memory costs money and efficiency matters. This is the reason I prefer a mount option approach to direct I/O.

Back to the example. How did I write an amount that included a stray 464 bytes with direct I/O? That is not a multiple of the underlying disk driver requirement which is 512 bytes.

Under The Covers
On Linux, gzip(1) uses 32KB reads and 16KB writes. The output file created by gzip(1) is 29,839,295 bytes which is 1,805 writes at 16KB and one last odd-ball write of 9,680 bytes—something that would be impossible to do with direct I/O were it not for the direct I/O mount option. Let’s look at strace. The last write was 9,680 bytes:

Direct I/O Without Compile-Time O_DIRECT
I can’t speak about other direct I/O mount implementations, but I can explain how PolyServe does this. All I/O bound for files in a DBOptimized mounted PSFS filesystem are quickly examined to see if the I/O meets the underlying device driver DMA requirements. In the kernel we use simple arithmetic to determine if the I/O size is a multiple of the underlying disk block size (satisfies DMA requirement) and whether the I/O buffer is aligned on a page boundry. If both conditions are true, the I/O is DMAed directly from the process address space to the disk. If not, we simply grab an OS page cache buffer, perform the I/O and then immediately invalidate that page so no other process can read dirty data (PolyServe is sort of big on cache coherency if you get my drift).

Best of Both Worlds
In the end, Linus might be right about O_DIRECT, but sitting here at PolyServe makes me say, “Who cares.” We supported direct I/O on Linux before Linux supported O_DIRECT (it was just a patch at that time). In fact, we did a 10-node Oracle9i RAC, 10 TB, 10,000 user OLTP Proof of Concept way back in 2002—before Linux O_DIRECT was mainstream. Here is a link to the paper if you are interested in that proof point.

Network Appliance OnTap GX–Specialized for Transaction Logging.

Published February 10, 2007 Clustered Storage , Data OnTap GX , LGWR Performance , Oracle NAS , Oracle NFS , Oracle performance , Oracle SAN Topics , Oracle Storage Related Problems , redo logging performance 9 Comments

Density is Increasing, But Certainly Not That Cheap
Netapp’s SEC 10-Q form for their quarter ending in October 2006 has a very interesting prediction. I was reading this post on StorageMojo about Isilon and saw this quote from the SEC form (emphasis added by me):

According to International Data Corporation’s (IDC’s) Worldwide Disk Storage Systems 2006-2010 Forecast and Analysis, May 2006, IDC predicts that the average dollar per petabyte (PB) will drop from $8.53/PB in 2006 to $1.85/PB in 2010.

Yes, Netapp is telling us that IDC thinks we’ll be getting storage at $8.53 per Petabyte within the next three years. Yippie! Here is the SEC filing if you want to see for yourself.

We Need Disks, Not Capacity
Yes, drive density is on the way up so regardless of how off the mark Netapp’s IDC quote is, we are going to continue to get more capacity from fewer little round brown spinning things. That doesn’t bode well for OLTP performance. I blogged recently on the topic of choosing the correct real estate from disks when laying out your storage for Oracle databases. I’m afraid it won’t be long until IT shops are going to force DBAs to make bricks without straw by assigning, say, 3 disks for a fairly large database. Array cache to the rescue! Or not.

Array Cache and NetApp NVRAM Cache Obliterated With Sequential Writes
The easiest way to completely trash an most array caches is to perform sequential writes. Well, for that matter, sequential writes happen to be the bane of NVRAM cache on Filers too. No, Filers don’t handle sequential writes well. A lot of shops get a Filer and dedicate it to transaction logging. But wait, that is a single point of failure. What to do? Get a cluster of Filers just for logging? What about Solid State Disk?

Solid State Disk (SSD) price/capacity is starting to come down to the point where it is becoming attractive to deploy them for the sole purpose of offloading the sequential write overhead generated from Oracle redo logging (and to a lesser degree TEMP writes too). The problem is they are SAN devices so how do you provision them so that several databases are logging on the SSD? For example, say you have 10 databases that, on average, are each thumping a large, SAN array cache with 4MB/s for a total sequential write load of 40MB/s. Sure, that doesn’t sound like much, but to a 4GB array cache, that means a complete recycle every 100 seconds or so. Also, rememeber that buffers in the array cache are pinned while being flushed to back to disk. That pain is certainly not being helped by the fact that the writes are happening to fewer and fewer drives these days as storage is configured for capacity instead of IOPS. Remember, most logging writes are 128KB or less so a 40MB logging payload is derived from some 320, or more, writes per second. Realistically though, redo flushing on real workloads doesn’t tend to benefit from the maximum theoretical piggy-back commit Oracle supports, so you can probably count on the average redo write being 64KB or less—or a write payload of 640 IOPS. Yes a single modern drive can satisfy well over 200 small sequential writes per second, but remember, LUNS are generally carved up such that there are other I/Os happening to the same spindles. I could go on and on, but I’ll keep it short—redo logging is tough on these big “intelligent” arrays. So offload it. Back to the provisioning aspect.

Carving Luns. Lovely.
So if you decide to offload just the logging aspect of 10 databases to SSD, you have to carve out a minimum of 20 LUNS (2 redo logs per database) zone the Fibre Channel switch so that you have discrete paths from servers to their raw chunks of disk. Then you have to fiddle with raw partitions on 10 different servers. Yuck. There is a better way.

SSD Provisioning Via NFS
Don’t laugh—read on. More and more problems ranging from software provisioning to the widely varying unstructured data requirements today’s applications are dealing with keep pointing to NFS as a solution. Provisioning very fast redo logging—and offloading the array cache while you are at it—can easily be done by fronting the SSD with a really small File Serving Cluster. With this model you can provision those same 10 servers with highly available NFS because if a NAS head in the File Serving Utility crashes, 100% of the NFS context is failed over to a surviving node transparently—and within 20 seconds. That means LGWR file descriptors for redo logs remain completely valid after a failover. It is 100% transparent to Oracle. Moreover, since the File Serving Utility is symmetric clustered storage—unlike clustered Filers like OnTap GX—the entire capacity of the SSD can be provisioned to the NAS cluster as a single, simple LUN. From there, the redo logging space for all those databases are just files in a single NFS exported filesystem—fully symmetric, scalable NFS. The whole thing can be done with one vender too since Texas Memory Systems is a PolyServe reseller. But what about NFS overhead and 1GbE bandwidth?

NFS With Direct I/O (filesystemio_options=directIO|setall)
When the Oracle database—running on Solaris, HP-UX or Linux—opens redo logs on an NFS mount, it does so with Direct I/O. The call overhead is very insignificant for sequential small writes when using Direct I/O on an NFS client. The expected surge in kernel mode cycles due to the NFS overhead really doesn’t happen with simple positioning and read/write calls—especially when the files are open O_DIRECT (or directio(3C) for Solaris). What about latency? That one is easy. LGWR will see 1ms service times 100% of the time, no matter how much load is placed on the down-wind SSD. And bandwidth? Even without bonding, 1GbE is sufficient for logging and these SSDs (I’ve got them in my lab) handle requests in 1ms all the way up to full payload which (depending on model) goes up to 8 X 4Gb FC—outrageous!

Now that is a solution to a problem using real, genuine clustered storage. And, no I don’t think NetApp really believes a Petabyte of disk will be under $9 in the next three years. That must be a typo. I know all about typos as you blog readers can attest.

Isilon Leads in Clustered Storage–Without Support for Oracle

Published February 10, 2007 Clustered Storage Leave a Comment

One of my favorite blogs, StorageMojo.com, is covering Isilon and Clustered Storage here. The debate is heating up because there are some folks that think NetApp OnTap GX is clustered storage. I have blogged before about how a clustered namespace is not clustered storage such as:

NetApp’s OnTap GX for Oracle. Clustered Name Space.

And other articles here:

FS, CFS, NFS, ASM Topics

Remember, this is an Oracle blog and Isilon is indeed clustered storage, but they can’t do Oracle. And while OnTap GX can do Oracle, it is not symmetric clustered storage so it won’t scale.

Scalable NFS Powered By Open Source Cluster Filesystems

Published February 9, 2007 Cluster File System , Clustered Storage , High Performance Linux filesystem , I/O Topics , Linux CFS , ocfs , ocfs2 4 Comments

40 Terabytes Per Week With Linux-based Clusters at Dunnhumby
It seems reasonable to think that this company tested the open source clustering stuff, but I don’t know for certain. There are folks out there using Open Source cluster filesystems for “large I/O” processing as is apparent in this recent OCFS2 bug report (emphasis added by me):

During maintenance window, decided to use the OCFS2 filesystem to store a large backup file (about 5-10 gig file). SCP’ed the file from an outside server to node1 of the cluster […]

A little third-party perspective is necessary. Not even back in 1990, with Fujitsu Swallow IV drives, was 10GB considered “large.” The OCFS2 user that filed the bug continued:

After a few minutes, node1 crashed.

Let’s think about that for a moment. The user is bringing unstructured data into the OCFS2 cluster filesystem using scp (1). Just for the heck of it, let’s take the user at his word and do the math. He said, “After a few minutes.” Let’s say a few minutes are 3—180 seconds. That means the scp(1) was likely not trafficked over Gigabit Ethernet because that would be more like enough time to move about 20GB at full bandwidth with a single wire. That pretty much leaves 100BaseT. So, somewhere along 2GB or so, OCFS2 crumbled. Hmmm, lowered expectations. And the fun continued:

Node1 restarted, but crashed again attempting to reenter the cluster.
Leaving Node1 down, attempted reboot of Node2 and Node3.
Both panic crashed during restart attempting to start OCFS2 and join the cluster.
Eventually, found that we had to start Node1 first, then restart the other two nodes.

Good grief, I’m not even going to comment on that bit, but I will point out that the suggested workaround to use the O_DIRECT enabled coreutils seems off mark. The user is trying to scp(1), not cp(1) or mv(1).

If It Isn’t Free, It’s Junk. Ad Revenue Funds Robust Software Development.
In spite of the fact that Ray Lane says traditional software products are soon to be replaced by cobbled together bits and pieces of open source stuff or what Wharton refers to as “ad supported software”, sometimes the good things in life are not free.

Huge Amounts of Unstructured Data
A recent article in Information Week’s Optimize Magazine covered one of PolyServe’s customers, Dunnhumby. These folks manipulate a lot of data using HP Blades as compute nodes accessing data over NFS in a PolyServe File Serving Utility scalable NAS solution. In their own words:

Each week, more than 40 terabytes of data is generated […]

“Hold it”, you say, that’s a comparison of OCFS2 to PolyServe CFS via NFS. What does OCFS2 have to do with NFS? That is a good question. OCFS2 is proclaimed to be a general purpose filesystem (emphasis added by me):

WHAT IS OCFS2?

OCFS2 is the next generation of the Oracle Cluster File System for Linux. It is an extent based, POSIX compliant file system. Unlike the previous release (OCFS), OCFS2 is a general-purpose file system

So why not export OCFS2 filesystems via NFS? That is the sort of thing you do with a general purpose filesystem after all. And, since OCFS2 is a cluster filesystem there shouldn’t be any second thoughts about exporting the same filesystems from multiple nodes—that’s scalable file serving. In fact, that has been tried before. That URL points to a bug report where a user was trying to implement scalable file serving using OCFS2. He reports:

I’m using OCSF2 for backups and to store files used by nfs clients. We have some errors during three file uploading from remote clients. In that case only one node can access those files but the other node receive from dlm a bad lockres error message […]

Right, OK. So what came next? Read on:

So I tried to stop ocfs2 and o2cb services on the second node but I can’t because heartbeat prevents any stop attempt. A stop attempt on the first node instead hungs and I have to reboot the first node because it is impossible to unmount ocfs2 filesystems (even if I use the lazy option).

I’m sure it couldn’t get any worse, right? He continued:

That is a serious problem because to recover the right functionality I had to reboot the first node (o2cb/ocfs2 services hang and after reboot ASM losts spfiles, so problem impacts even the databases running on cluster). There is any kind of action I can do to avoid that?

Surely he must be doing something really convoluted to hit problems so easily! He explains the scenario:

The scenario is:
node X exports filesystem to host Y
node W exports filesystem to host Z

from Y I create a file then I delete it then ls command on Z lists the file but I cannot open it. I receive I lot of messages like this:

Oct 20 08:53:34 proxb31 kernel: (15612,1):ocfs2_populate_inode:234 ERROR:
Invalid dinode: i_ino=9977187, i_blkno=9977187, signature = INODE01, flags = 0x0
Oct 20 08:53:34 proxb31 kernel: (15612,1):ocfs2_read_locked_inode:389 ERROR:
populate inode failed! i_blkno=9977187, i_ino=9977187

Good grief! Cache coherency problems? You mean like this warning about OCFS cache coherency :

Reasons for using odirect cp:

1. Buffered and direct ios are still racy in the kernel. As Oracle is doing directio, doing a normal cp exposes one to the chance of copying a stale page data.

2. Direct ios are less stressful on the page cache. As Oracle datafiles are invariably large, directio is more efficient in the long run.

3. In a clustered environment, the blocks on disk could be updated by any nodes in the cluster. Using odirect io ensures the latest version of the block is always read.

Oh boy. Anyway, back to the bug report. The bug report states that as of January 4, 2007, there is a patch for NFS exported OCFS2 problem being tested at Oracle, however, the following comment was given to help set expectations:

One thing I’m concerned with is having two clients connect to seperate nodes. Since NFSD is not cluster aware, there may be some issues with unlinked inodes being in cache on one node and looked up on another. Is it possibleto confine your nfs exports to a single node for now, until we can get a better handle on that particular issue.

That seems like something that should have been spelled out in the Product Requirements Document, but I’m old-fashioned.

Scalable File Serving with Linux. Who Needs a Cluster-Aware NFSD?
The NAS heads in a PolyServe File Serving Utility configuration (e.g., HP EFS Clustered Gateway), run the enterprise distributions; RHEL4 and SuSE SLES9. So while those folks in the Ray Lane and Wharton’s open source dream world might think that NFSD cannot function in a cluster with data consistency, PolyServe—with that dying traditional software model—seems to have pulled it off. Do you think Dunnhumby pushes 40TB of data per week through a PolyServe File Serving Utility cluster without NFSD scalability or—more importantly—cache coherency? Not a chance.

Real Application Clusters: The Shared Database Architecture for Loosely-Coupled Clusters

Published February 8, 2007 Clustered Storage , Real Application Clusters , server consolidation , Shared Applications Tier , Shared APPL_TOP , shared nothing architecture , shared nothing database 9 Comments

The typical Real Application Clusters (RAC) deployment is a true enigma. Sometimes I just scratch my head because I don’t get it. I’ve got this to say, if you think Shared Nothing Architecture is the way to go, then deploy it. But this is an Oracle blog, so let’s talk about RAC.

RAC is a shared disk architecture, just like DB2 on IBM mainframes. It is a great architecture, one that I agree with as is manifested by my working for shared data clustering companies all these years. Again, since this is an Oracle blog I think arguments about shared disk versus shared nothing are irrelevant.

Dissociative Identity Disorder
The reason I’m blogging this topic is because in my opinion the typical RAC deployment exhibits the characteristics of a person suffering from Dissociative Identity Disorder. Mind you, I’m discussing the architecture of the deployment, not the people that did the deployment. That is, we spend tremendous amounts of money for shared disk database architecture and then throw it into a completely shared nothing cluster. How much sense does that make? What areas of operations does that paradigm affect? Why does Oracle promote shared disk database deployments on shared-nothing clusters? What is the cause of this Dissociative Identity Disorder? The answer: the lack of a general purpose shared disk filesystem that is suited to Oracle database I/O that works on all Unix derivations and Linux. But wait, what about NFS?

Shared “Everything Else”
I can’t figure out any other way to label the principle I’m discussing so I’ll just call it “Shared Everything Else”. However, the term Shared Everything Else (SEE for short) insinuates that there is less importance in that particular content—an insinuation that could not be further from the truth. What do I mean? Well, consider the Oracle database software itself. How do you suppose an Oracle RAC (shared disk architecture) database can exist without having the product installed somewhere.

The product install directory for the database is called Oracle Home. Oracle has supported the concept of a shared Oracle Home since the initial release of RAC—even with Oracle9i. Yes, Metalink note 240963.1 describes the requirement for Oracle9i to have context dependent symbolic links (CDSL), but that was Oracle9i. Oracle10g requires no context dependent symbolic links. Oracle Universal Installer will install a functional shared Oracle Home without a any such requirements.

What if you don’t share a software install? It is very easy to have botched or mismatched product installs—which doesn’t sit well with a shared disk database. In a recent post on the oracle-l list, sent the following call for help:

We are trying to install a 2-node RAC with ASM (Oracle 10.2.0.2.0 on Solaris 10) and getting the error below when using dbca to create the database.The error occurs when dbca is done creating the DB (100%).Any suggestions?

We have tried starting atlprd2 instance manually and get the error below regarding an issue with spfile which is on ASM.

ORA-01565: error in identifying file ‘+SYS_DG/atlprd/spfileatlprd.ora’
ORA-17503: ksfdopn:2 Failed to open file +SYS_DG/atlprd/spfileatlprd.ora
ORA-03113: end-of-file on communication channel

OK, for those who are not Oracle-minded, this sort of deployment is what I call the Dissociative Identity Disorder since the database will be deployed on a bunch of LUNs provisioned, masked and accessed as RAW disk from the OS side—ASM is a collection of RAW disks. This is clearly not a SEE deployment.The original poster followed up with a status of the investigatory work he had to do to try and get around this problem:

[…] we have checked permissions and they are the same.We also checked and the same disk groups are mounted in both ASM instances

also.We have also tried shutting everything down (including reboot of both servers) and starting everything from scratch (nodeapps, asm, listeners, instances), but the second node won’t start.Keep getting the same error […]

What a joy. Deploying a shared disk database in a shared nothing cluster! There he was on each server checking file permissions (I just counted, there are 20,514 files in one of my Oracle10g Oracle Homes), investigating the RAW disk aspects of ASM, rebooting servers and so on. Good thing this is only a 2 node cluster. What if it was an 8 node cluster? What if he had 10 different clusters?

As usual, the oracle-l support channel comes through. Another list participant posted the following:

Seem to be a known issue (Metalink Note 390591.1). We encountered similar issue in Linux RAC cluster and has been resoled by following this note.

The cause was included in his post (emphasis added by me):

Cause

Installing the 10.2.0.2 patchset in a RAC installation on any Unix platform does not correctly update the libknlopt.a file on all nodes. The local node where the installer is run does update libknlopt.a but remote nodes do not get the updated file. This can lead to dumps or internal errors on the remote nodes if Oracle is subsequently relinked.

That was the good and bad, now the ugly—his post continues with the following excerpt from the Oracle Metalink note:

There are two solutions for this problem:

1) Manual copy of the “libknlopt.a” library to the offending nodes:

-ensure all instances are shut down
-manually copy $ORACLE_HOME/rdbms/lib/libknlopt.a from the local node to all remote nodes

-relink Oracle on all nodes :
make -f ins_rdbms.mk ioracle

2) Install the patchset on every node using the “-local” option:

What’s So Bad About Shared Nothing Clusters?
I’m not going to get into that, but one of the central knock-offs Oracle uses against shared-nothing database architecture is the fact that replication is required. Since the software used to access RAC needs to be kept in lock-step, replication is required there as well, and as we see from this oracle-l email thread, replication is not all that simple with a complex software deployment like the Oracle database product. But speaking of complex, the Oracle database software pales in comparison to the Oracle E-Business Suite. How in the world do people manage to deploy E-Biz on anything other than a huge central server? Shared Applications Tier.

Shared Applications Tier
Yes, just like Oracle Home, the huge, complex Oracle E-Business Suite can be installed in a shared fashion as well. It is called a Shared Applications Tier. One of the other blogs I read has been discussing this topic as well, but this is not just a blogosphere topic—it is mainline. Perhaps the best resource for Shared Applications Tier is Metalink note 243880.1, but Metalink notes 384248.1 and 233428.1 should not be overlooked. The long story short is that Oracle supports SEE, but they don’t promote it for who-knows-what-reason.

Is SEE Just About Product Installs?
Absolutely not. Consider intrinsic RAC functionality that doesn’t function at all without a shared filesystem:

External Tables with Parallel Query Option
UTIL_FILE
BFILE

I’m sure there are others (perhaps compiled PL/SQL), but who cares. The product is expensive and if you are using shared disk architecture you should be able to use all the features of shared disk architecture. However, without a shared filesystem, External Tables and the other features listed are not cluster-ready. That is, you can use External Tables, UTIL_FILE and BFILE—but only from one node. Isn’t RAC about multi-node scalability?

So Why the Rant?
The Oracle Universal Installer will install a fully functional Oracle10g shared Oracle Home to simplify things, the complex E-Business Suite software is architected for shared install and there are intrinsic database features that require shared data outside of the database so why deploy a shared database architecture product on a platform that only shares the database? You are going to have to explain it to me like I’m six years old; because I know I’m not going to understand. Oh, yes, and don’t forget that with a shared-nothing platform, all the day to day stuff like imp/exp, SQL*Loader, compressed archive redo, logging, trace, scripts, spool and so on mean you have to pick a server and go. How symmetric is that? Not as symmetric as the software for which you bought the cluster (RAC), that’s for certain.

Shared Oracle Home is a Single Point of Failure
And so is the SYSTEM tablespace in a RAC database, so what is the point?People who choose to deploy RAC on a platform that doesn’t support shared Oracle Home often say this. Yes a single shared Oracle Home is a single point of failure, but like I said, so is the SYSTEM tablespace in every RAC database out there. Shops that espouse shared software provisioning (e.g., shared Oracle Home) are not dolts, so the off-the-cuff single point of failure red herring is just that. When we say shared Oracle Home, do we mean a single shared Oracle Home? Well, not necessarily. If you have, say, a 4 or 8 node RAC cluster, why assume that SEE or not to SEE is a binary choice? It is perfectly reasonable to have 8 nodes share something like 2 Oracle Homes. That is a significant condensing factor and appeases the folks that concentrate on the possible single point of failure aspect of a shared Oracle Home (whilst often ignoring the SYSTEM tablespace single point of failure). A total availability solution requires Data Guard in my opinion, and Data Guard is really good, solid technology.

Choices
All told, NFS is the only filesystem that can be used across all Unix (and Linux) platforms for SEE. However, not all NFS offerings are suffiently scalable and resilient for SEE. This is why there is a significant technology trend towards clustered storage (e.g., NetApp OnTAP GX, PolyServe(HP) EFS Clustered Gateway, etc).

Finally, does anyone think I’m proposing some sort of mix-match NFS here with a little SAN there sort of ordeal? Well, no, I’m not. Pick a total solution and go with it…either NFS or SAN, the choice is yours, but pick a total platform solution that has shared data to complement the database architecture you’ve chosen. RAC and SEE!

Microsoft-Minded People Covering Shared Data Clustering

Published January 20, 2007 Cluster File System , Clustered Storage , Microsoft Certified Professional Online Leave a Comment

Just FYI.

Since shared data clustering is a little foreign to the Microsoft community, I watch how they cover products like PolyServe Matrix Server. Here is an example:

Microsoft Certified Professional Online Coverage of PolyServe Matrix Server

High Availability…MySpace.com Style

Published January 19, 2007 Amazon ASM , Automatic Storage Management , Clustered Storage , Isilon , ISLN , Web 2.0 2 Comments

I was checking out Paul Vallee’s comments about MySpace’s definition of uptime. It seems others are seeing spotty uptime with this poster child of the Web 2.0 phenomenon.

I’m watching MySpace for other reasons though. They have deployed the Isilon IQ Clustered Storage solution for serving up the video content. Isilon is a competitor of my company, PolyServe. Isilon is good at what they do—read intensive workloads (e.g., streaming media). I don’t like the fact that it is a hardware/software solution. I’m a much bigger fan of being free of vendor lock-in. In the end, I’m an Oracle guy and Isilon can’t do Oracle so that’s that.

Anyway, another thing that is interesting about Web 1.0 and now Web 2.0 shops is the odd amount of “IT Street Cred” they seem to get. Folks like Amazon, eBAY and now MySpace are not IT shops, really. They have gargantuan technology staff, and their IT budget is not representative of normal companies. Basically, they can take the oddest of technology combinations and throw tremendous headcount of very gifted people at the problem to make it work. Not your typical COTS shop.

Now, having said that, are these shops solving interesting problem? Sure. Would any normal Oracle shop be able to do things they way, say, Amazon does it? Likely not. Back in 2004, Amazon admitted to an IT budget of USD $64 Million before some $16 Million savings realized in one way or another by deploying Linux.

Using OProfile to Monitor Kernel Overhead on Linux With Oracle

Published December 14, 2006 Cluster File System , Clustered Storage , DBWR Performance , Linux CFS , Linux CFS Performance , OProfile , oracle 9 Comments

Yes, this Blog post does have OProfile examples and tips, but first my obligatory rant…

When it comes to Oracle on clustered Linux, FUD abounds. My favorite FUD is concerning where kernel mode processor cycles are being spent. The reason it is my favorite is because there is no shortage of people that likely couldn’t distinguish between a kernel mode cycle and a kernel of corn hyping the supposed cost of running Oracle on filesystem files—especially cluster filesystems. Enter OProfile.

OProfile Monitoring of Oracle Workloads
When Oracle is executing, the majority of processor cycles are spent in user mode. If, for instance, the processor split is 75/25 (user/kernel), OProfile can help you identify how the 25% is being spent. For instance, what percentage is spent in process scheduling, kernel memory management, device driver routines and I/O code paths.

System Support
The OProfile website says:

OProfile works across a range of CPUs, include the Intel range, AMD’s Athlon and AMD64 processors range, the Alpha, ARM, and more. OProfile will work against almost any 2.2, 2.4 and 2.6 kernels, and works on both UP and SMP systems from desktops to the scariest NUMAQ boxes.

Now, anyone that knows me or has read my blog intro knows that NUMA-Q meant a lot to me—and yes, my Oak Table Network buddies routinely remind me that I still haven’t started attending those NUMA 12-step programs out there. But I digress.

Setting Up OProfile—A Tip
Honestly, you’ll find that setting up OProfile is about as straight forward as explained in the OProfile documentation. I am doing my current testing on Red Hat RHEL 4 x86_64 with the 2.6.9-34 kernel. Here is a little tip: one of the more difficult steps to getting OProfile going is finding the right kernel-debug RPM. It is not on standard distribution medium and hard to find—thus the URL I’ve provided. I should think that most people are using RHEL 4 for Oracle anyway.

OProfile Examples
Perhaps the best way to help get you interested in OProfile is to show some examples. As I said above, a very important bit of information OProfile can give you is what not to worry about when analyzing kernel mode cycles associated with an Oracle workload. To that end, I’ll provide an example I took from one of my HP Proliant DL-585’s with 4 sockets/8 cores attached to a SAN array with 65 disk drives. I’m using an OLTP workload with Oracle10gR2 and the tablespaces are in datafiles stored in the PolyServe Database Utility for Oracle; which is a clustered-Linux server consolidation platform. One of the components of the Utility is the fully symmetric cluster filesystem and that is where the datafiles are stored for this OProfile example. The following shows a portion of a statpack collected from the system while OProfile analysis was conducted.

NOTE: Some browsers require you to right click->view to see reasonable resolution of these screen shots

As the statspack shows, there were nearly 62,000 logical I/Os per second—this was a very busy system. In fact, the processors were saturated which is the level of utilization most interesting when performing OProfile. The following screen shot shows the set of OProfile commands used to begin a sample. I force a clean collection by executing the oprofile command with the –deinit option. That may be overkill, but I don’t like dirty data. Once the collection has started I run vmstat(8) to monitor processor utilization. The screen shot shows that the test system was not only 100% CPU bound, but there were over 50,000 context switches per second. This, of course, is attributed to a combination of factors—most notably the synchronous nature of Oracle OLTP reads and the expected amount of process sleep/wake overhead associated with DML locks, background posting and so on. There is a clue in that bit of information—the scheduler must be executing 50,000+ times per second. I wonder how expensive that is? We’ll see soon, but first the screen shot showing the preparatory commands:

So the next question to ask is how long of a sample to collect. Well, if the workload has a “steady state” to achieve, it is generally sufficient to let it get to that state and monitor about 5 or 10 minutes. It does depend on the ebb and flow of the workload. You don’t really have to invoke OProfile before the workload commences. If you know your workload well enough, watch for the peak and invoke OProfile right before it gets there.

The following screen shot shows the oprofile command used to dump data collected during the sample followed by a simple execution of the opreport command.

OK, here is where it gets good. In the vmstat(8) output above we see that system mode cycles were about 20% of the total. This simple report shows us a quick sanity check. The aggregate of the core kernel routines (vmlinux) account for 65% of that 20%–13% of all processor cycles. Jumping over the cost of running OProfile (23%) to the Qlogics Host Bus Adaptor driver we see that even though there are 13,142 IOPS, the device driver is handling that with only about 6% of system mode cycles—about 1.2% of all processor cycles.

The Dire Cost of Deploying Oracle on Cluster Filesystems
It is true that Cluster Filesystems inject code in the I/O code path. To listen to the FUD-patrol, you’d envision a significant processor overhead. I would if I heard the FUD and wasn’t actually measuring anything. As an example, the previous screen shot shows that by adding the PolyServe device driver and PolyServe Cluster Filesystem modules (psd, psfs) together there is 3.1% of all kernel mode cycles (.6% of all cycles) expended in PolyServe code—even at 13,142 physical disk transfers per second. Someone please remind me the importance of using raw disk again? I’ve been doing performance work on direct I/O filesystems that support asynchronous I/O since 6.0.27 and I still don’t get it. Anyway, there is more that OProfile can do.

The following screen shot shows an example of getting symbol-level costing. Note, I purposefully omitted the symbol information for the Qlogic HBA driver and OProfile itself to cut down on noise. So, here is a trivial pursuit question: what percentage of all processor cycles does RHEL 4 on a DL-585 expend in processor scheduling code when the system is sustaining some 50,000 context switches per second? The routine to look for is schedule() and the following example of OProfile shows the answer to the trivial pursuit question is 8.7% of all kernel mode cycles (1.7% of all cycles).

The following example shows me where PolyServe modules rank in the hierarchy of non-core kernel (vmlinux) modules. Looks like only about 1/3^rd the cost of the HBA driver and SCSI support module combined.

If I was concerned about the cost of PolyServe in the stack, I would use the information in the following screen shot to help determine what the problem is. This is an example of per-symbol accounting. To focus on the PolyServe Cluster Filesystem, I grep the module name which is psfs. I see that the component routines of the filesystem such as the lock caching layer (lcl), cluster wide inode locking (cwil) and journalling are evenly distributed in weight—no “sore thumbs” sticking up as they say. Finally, I do the same analysis for our driver, PSD, and there too see no routine accounting for any majority of the total.

Summary
There are a couple of messages in this blog post. First, since tools such as OProfile exist, there is no reason not to actually measure where the kernel mode cycles go. Moreover, this sort of analysis can help professionals avoid chasing red herrings such as the fairy tales of measurable performance impact when using Oracle on quality direct I/O cluster filesystems. As I like to say, “Measure before you mangle.” To that end, if you do find yourself in a situation where you are losing a significant amount of your processor cycles in kernel mode, OProfile is the tool for you.

	kevinclosson on Announcing SLOB 2.5.4
	Hell Dip on Announcing SLOB 2.5.4
	kevinclosson on Introducing SLOB – The S…
	Amey Bobade on Introducing SLOB – The S…
	Amey Bobade on Introducing SLOB – The S…

Kevin Closson's Blog: Platforms, Databases and Storage

Archive for the 'Clustered Storage' Category

Manly Men Only Deploy Oracle with Fibre Channel – Part V. What About Oracle9i on RHAS 2.1? Yippie!

Standard File System Tools? We Don’t Need No Standard File System Tools!

HP to Acquire PolyServe to Bolster NAS Offerings with Clustered Storage

Standard File Utilities with Direct I/O

Oracle Direct I/O Brought to You By Deranged Monkeys

Isilon Leads in Clustered Storage–Without Support for Oracle

Scalable NFS Powered By Open Source Cluster Filesystems

Real Application Clusters: The Shared Database Architecture for Loosely-Coupled Clusters

Microsoft-Minded People Covering Shared Data Clustering

High Availability…MySpace.com Style

Using OProfile to Monitor Kernel Overhead on Linux With Oracle

DISCLAIMER

Pages

Blogroll

Follow Blog via Email

Recent Posts

Recent Comments

Fond Memories

Copyright

Archive for the 'Clustered Storage' Category

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

DISCLAIMER

Pages

Blogroll

Follow Blog via Email

Recent Posts

Recent Comments

Fond Memories

Copyright