Archive for the 'NFS CFS ASM' Category

Automatic Databases Automatically Detect Storage Capabilities, Don’t They?

Doug Burns has started an interesting blog thread about the Oracle Database 11g PARALLEL_IO_CAP_ENABLED parameter in his blog entry about Parallel Query and Oracle Database 11g. Doug is discussing Oracle’s new concept of built-in I/O subsystem calibration-a concept aimed at more auto-tuning database instances. The idea is that Oracle is trying to make PQ more aware of the down-wind I/O subsystem capability so that it doesn’t obliterate it with a flood of I/O. Yes, a kinder, gentler PQO.

I have to admit that I haven’t yet calibrated this calibration infrastructure. That is, I aim to measure the difference between what I know a given I/O subsystem is capable of and what DBMS_RESOURCE_MANAGER.CALIBRATE_IO thinks it is capable of. I’ll blog the findings of course.

In the meantime, I recommend you follow what Doug is up to.

A Really Boring Blog Entry
Nope, this is not just some look at that other cool blog over there post. At first glance I would hope that all the regular readers of my blog would wonder what value there is in throttling I/O all the way up in the database itself given the fact that there are several points at which I/O can/does get throttled downwind. For example, if the I/O is asynchronous, all operating systems have a maximum number of asynchronous I/O headers (the kernel structures used to track asynchronous I/Os) and other limiting factors on the number of outstanding asynchronous I/O requests. Likewise, SCSI kernel code is fit with queues of fixed depth and so forth. So why then is Oracle doing this up in the database? The answer is that Oracle can run on a wide variety of I/O subsystem architectures and not all of these are accessed via traditional I/O system calls. Consider Direct NFS for instance.

With Direct NFS you get disk I/O implemented via the remote procedure call interface (RPC). Basically, Oracle shoots the NFS commands directly at the NAS device as opposed to using the C library read/write routines on files in an NFS mount-which eventually filters down to the same thing anyway, but with more overhead. Indeed, there is throttling in the kernel for the servicing of RPC calls, as is the case with traditional disk I/O system calls, but I think you see the problem. Oracle is doing the heavy lifting that enables you to take advantage of a wide array of storage options-and not all of them are accessed with age-old traditional I/O libraries. And it’s not just DNFS. There is more coming down the pike, but I can’t talk about that stuff for several months given the gag order. If I could, it would be much easier for you to visualize the importance of DBMS_RESOURCE_MANAGER.CALIBRATE_IO. In the meantime, use your imagination. Think out of the box…way out of the box…

Databases are the Contents of Storage. Future Oracle DBAs Can Administer More. Why Would They Want To?

I’ve taken the following quote from this Oracle whitepaper about low cost storage:

A Database Storage Grid does not depend on flawless execution from its component storage arrays. Instead, it is designed to tolerate the failure of individual storage arrays.

In spite of the fact that the Resilient Low-Cost Storage Initiative program was decommissioned along with the Oracle Storage Compatability Program, the concepts discussed in that paper should be treated as a barometer of the future of storage for Oracle databases-with two exceptions: 1) Fibre Channel is not the future and 2) there’s more to “the database” than just the database. What do I mean by point 2? Well, with features like SecureFiles, we aren’t just talking rows and columns any more and I doubt (but I don’t know) that SecureFiles is the end of that trend.

Future Oracle DBAs
Oracle DBAs of the future become even more critical to the enterprise since the current “stove-pipe” style IT organization will invariably change. In today’s IT shop, the application team talks to the DBA team who talks to the Sys Admin team who tlks to the Storage Admin team. All this to get an application to store data on disk through a Oracle database. I think that will be the model that remains for lightly-featured products like MySQL and SQL Server, but Oracle aims for more. Yes, I’m only whetting your appetite but I will flesh out this topic over time. Here’s food for thought: Oracle DBAs should stop thinking their role in the model stops at the contents of the storage.

So while Chen Shapira may be worried that DBAs will get obviated, I’d predict instead that Oracle technology will become more full-featured at the storage level. Unlike the stock market where past performance is no indicator of future performance, Oracle has consistently brought to market features that were once considered too “low-level” to be in the domain of a Database vendor.

The IT industry is going through consolidation. I think we’ll see Enterprise-level IT roles go through some consolidation over time as well. DBAs who can wear more than “one hat” will be more valuable to the enterprise. Instead of thinking about “encroachment” from the low-end database products, think about your increased value proposition with Oracle features that enable this consolidation of IT roles-that is, if I’m reading the tea leaves correctly.

How to Win Friends and Influence People
Believe me, my positions on Fibre Channel have prompted some fairly vile emails in my inbox-especially the posts in my Manly Man SAN series. Folks, I don’t “have it out”, as they say, for the role of Storage Administrators. I just believe that the Oracle DBAs of today are on the cusp of being in control of more of the stack. Like I said, it seems today’s DBA responsibilities stop at the contents of the storage-a role that fits the Fibre Channel paradigm quite well, but a role that makes little sense to me. I think Oracle DBAs are capable of more and will have more success when they have more control. Having said that, I encourage any of you DBAs who would love to be in more control of the storage to look at my my post about the recent SAN-free Oracle Data Warehouse. Read that post and give considerable thought to the model it discusses. And give even more consideration to the cost savings it yields.

The Voices in My Head
Now my alter ego (who is a DBA, whereas I’m not) is asking, “Why would I want more control at the storage level?” I’ll try to answer him in blog posts, but perhaps some of you DBAs can share experiences where performance or availability problems were further exacerbated by finger pointing between you and the Storage Administration group.

Note to Storage Administrators
Please, please, do not fill my email box with vitriolic messages about the harmony today’s typical stove-pipe IT organization creates. I’m not here to start battles.

Let me share a thought that might help this whole thread make more sense. Let’s recall the days when an Oracle DBA and a System Administrator together (yet alone) were able to provide Oracle Database connectivity and processing for thousands of users without ever talking to a “Storage Group.” Do you folks remember when that was? I do. It was the days of Direct Attach Storage (DAS). The problem with that model was that it only took until about the late 1990s to run out of connectivity-enter the Fibre Channel SAN. And since SANs are spokes attached to hubs of storage systems (SAN arrays), we wound up with a level of indirection between the Oracle server and its blocks on disk. Perhaps there are still some power DBAs that remember how life was with large numbers of DAS drives (hundreds). Perhaps they’ll recall the level of control they had back then. On the other hand, perhaps I’m going insane, but riddle me this (and feel free to quote me elsewhere):

Why is it that the industry needed SANs to get more than a few hundred disks attached to a high-end Oracle system in the late 1990s and yet today’s Oracle databases often reside on LUNs comprised of a handful of drives in a SAN?

The very thought of that twist of fate makes me feel like a fish flopping around on a hot sidewalk. Do you remember my post about capacity versus spindles? Oh, right, SAN cache makes that all better. Uh huh.

Am I saying the future is DAS? No. Can I tell you now exactly what model I’m alluding to? Not yet, but I enjoy putting out a little food for thought.

What Is Good Throughput With Oracle Over NFS?

The comment thread on my blog entry about the simplicity of NAS for Oracle got me thinking. I can’t count how many times I’ve seen people ask the following question:

Is N MB/s good throughput for Oracle over NFS?

Feel free to plug in any value you’d like for N. I’ve seen people ask if 40MB/s is acceptable. I’ve seen 60, 80, name it-I’ve seen it.

And The Answer Is…
Let me answer this question here and now. The acceptable throughput for Oracle over NFS is full wire capacity. Full stop! With Gigabit Ethernet and large Oracle transfers, that is pretty close to 110MB/s. There are some squeak factors that might bump that number one way or the other but only just a bit. Even with the most hasty of setups, you should expect very close to 100MB/s straight out of the box-per network path. I cover examples of this in depth in this HP whitepaper about Oracle over NFS.

The steps to a clean bill of health are really very simple. First, make sure Oracle is performing large I/Os. Good examples of this are tablespace CCF (create contiguous file) and full table scans with port-maximum multi-block reads. Once you verify Oracle is performance large I/Os, do the math. If you are not close to 100MB/s on a GbE network path, something is wrong. Determining what’s wrong is another blog entry. I want to capitalize on this nagging question about expectations. I reiterate (quoting myself):

Oracle will get line speed over NFS, unless something is ill-configured.

Initial Readings
I prefer to test for wire-speed before Oracle is loaded. The problem is that you need to mimic Oracle’s I/O. In this case I mean Direct I/O. Let’s dig into this one a bit.

I need something like a dd(1) tool that does O_DIRECT opens. This should be simple enough. I’ll just go get a copy of the coreutils package that has O_DIRECT tools like dd(1) and tar(1). So here goes:

[root@tmr6s15 DD]# ls ../coreutils-4.5.3-41.i386.rpm
[root@tmr6s15 DD]# rpm2cpio < ../coreutils-4.5.3-41.i386.rpm | cpio -idm
11517 blocks
[root@tmr6s15 DD]# ls
bin  etc  usr
[root@tmr6s15 DD]# cd bin
[root@tmr6s15 bin]# ls -l dd
-rwxr-xr-x  1 root root 34836 Mar  4  2005 dd
[root@tmr6s15 bin]# ldd dd =>  (0xffffe000) => /lib/tls/ (0x00805000)
        /lib/ (0x007ec000)

I have an NFS mount exported from an HP EFS Clustered Gateway (formerly PolyServe):

 $ ls -l /oradata2
total 8388608
-rw-r--r--  1 root root 4294967296 Aug 31 10:15 file1
-rw-r--r--  1 root root 4294967296 Aug 31 10:18 file2
$ mount | grep oradata2
voradata2:/oradata2 on /oradata2 type nfs

Let’s see what the dd(1) can do reading a 4GB file and over-writing another 4GB file:

 $ time ./dd --o_direct=1048576,1048576 if=/oradata2/file1 of=/oradata2/file2 conv=notrunc
4096+0 records in
4096+0 records out

real    1m32.274s
user    0m3.681s
sys     0m8.057s

Test File Over-writing
What’s this bit about over-writing? I recommend using conv=notrunc when testing write speed. If you don’t, the file will be truncated and you’ll be testing write speeds burdened with file growth. Since Oracle writes the contents of files (unless creating or extended a datafile), it makes no sense to test writes to a file that is growing. Besides, the goal is to test the throughput of O_DIRECT I/O via NFS, not the filer’s ability to grow a file. So what did we get? Well, we transferred 8GB (4GB in, 4GB out) and did so in 92 seconds. That’s 89MB/s and honestly, for a single path I would actually accept that since I have done absolutely no specialized tuning whatsoever. This is straight out of the box as they say. The problem is that I know 89MB/s is not my typical performance for one of my standard deployments. What’s wrong?

The dd(1) package supplied with the coreutils has a lot more in mind than O_DIRECT over NFS. In fact, it was developed to help OCFS1 deal with early cache-coherency problems. It turned out that mixing direct and non-direct I/O on OCFS was a really bad thing. No matter, that was then and this is now. Let’s take a look at what this dd(1) tool is doing:

$ strace -c ./dd --o_direct=1048576,1048576 if=/oradata2/file1 of=/oradata2/file2 conv=notrunc
4096+0 records in
4096+0 records out
Process 32720 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 56.76    4.321097        1054      4100         1 read
 22.31    1.698448         415      4096           fstatfs
 10.79    0.821484         100      8197           munmap
  9.52    0.725123         177      4102           write
  0.44    0.033658           4      8204           mmap
  0.16    0.011939           3      4096           fcntl
  0.02    0.001265          70        18        12 open
  0.00    0.000178          22         8           close
  0.00    0.000113          23         5           fstat
  0.00    0.000091          91         1           execve
  0.00    0.000015           2         8           rt_sigaction
  0.00    0.000007           2         3           brk
  0.00    0.000006           3         2           mprotect
  0.00    0.000004           4         1         1 access
  0.00    0.000002           2         1           uname
  0.00    0.000002           2         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    7.613432                 32843        14 total

Eek! I’ve paid for a 1:1 fstatfs(2) and fcntl(2) per read(2) and a mmap(2)/munmap(2) call for every read(2)/write(2) pair! Well, that wouldn’t be a big deal on OCFS since fstatfs(2) is extremely cheap and the structure contents only changes when filesystem attributes change. The mmap(2)/munmap(2) costs a bit, sure, but on a local filesystem it would be very cheap. What I’m saying is that this additional call overhead wouldn’t laden down OCFS throughput with the –o_direct flag-but I’m not blogging about OCFS. With NFS, this additional call overhead is way to expensive. All is not lost.

I have my own coreutils dd(1) that I implements O_DIRECT open(2). You can do this too, it is just GNU after all. With this custom GNU coreutils dd(1) I have, the call profile is nothing more than read(2) and write(2) back to back. Oh, I forgot to mention, the dd(1) doesn’t work with /dev/null or /dev/zero since it tries to throw an O_DIRECT open(2) at those devices which makes the tool croak. My dd(1) checks if in or out is /dev/null or /dev/zero and omits the O_DIRECT for that side of the operation. Anyway, here is what this tool got:

$ time dd_direct if=/oradata2/file1 of=/oradata2/file2 bs=1024k conv=notrunc
4096+0 records in
4096+0 records out

real    1m20.162s
user    0m0.008s
sys     0m1.458s

Right, that’s more like it-80 seconds or 102 MB/s. Shaving those additional calls off brought throughput up 15%.

What About Bonding/Teaming NICS
Bonding NICs is a totally different story as I point out somewhat in this paper about Oracle Database 11g Direct NFS. You can get very mixed results if the network interface over which you send NFS traffic is bonded. I’ve seen 100% scalability of NICs in a bonded pair and I’ve seen as low as 70%. If you are testing a bonded pair, set your expectations accordingly.

Oracle11g: Oracle Inventory On Shared Storage. Don’t Bother Trying To Install 11g RAC That Way.

A few days ago there was a thread on the oracle-l email list about adding nodes in an Oracle Database 10g Real Application Clusters environment. The original post showed a problem that Alex Gorbachev reports he’s only seen with shared Oracle Home installs. I found that odd because I’ve done dozens, upon dozens of RAC installs on shared Oracle Homes with both CFS and NFS and haven’t seen this error:

Remote 'UpdateNodeList' failed on node: 'af-xxx2'. Refer to
for details.
You can manually re-run the following command on the failed nodes after the
/apps/oracle/product/10.2/oui/bin/runInstaller -updateNodeList -noClusterEnabled
ORACLE_HOME=/apps/oracle/product/10.2 CLUSTER_NODES=af-xxx1,af-xxx2,af-xxx6
CRS=false "INVENTORY_LOCATION=/apps/oracle/oraInventory" LOCAL_NODE=
<node on which command is to be run>

I never have any problems with shared Oracle Home and I blog about the topic a lot as can be seen in in this list of posts. Nonetheless, Alex pointed out that the error has to do with the Oracle Inventory being on a shared filesystem. Another list participant followed up with the following comment about placing the inventory on a shared drive:

Sharing the oraInventory across nodes is not a good practice in my opinion. It runs counter to the whole concept of redundancy in an HA configuration and RAC was not written to support it.

Well, the Oracle Inventory is not a RAC concept, it is an Oracle Universal Installer concept, but I think I know what this poster was saying. However, the topic at hand is shared Oracle Home. When people use the term shared Oracle Home, they don’t mean shared ORACLE_BASE, they mean shared Oracle Home. Nonetheless, I have routinely shared the 10g inventory without problems, but then my software environments might not be as complex as those maintained by the poster of this comment.

Shared Inventory with Oracle Database 11g
No can do! Well, sort of. Today I was installing 11g RAC on one of my RHEL 4 x86 clusters. In the fine form of not practicing what I preach, I mistakenly pointed Oracle Universal Installer to a shared location (NFS) for the inventory when I was installing CRS. I got CRS installed just fine on 2 nodes and proceeded to install the database with the RAC option. It didn’t take long for OUI to complain as follows:


Ugh. This is just a test cluster that I need to set up quick and dirty. So I figured I’d just change the contents of /etc/oraInst.loc to point to some new non-shared location-aren’t I crafty. Well, that got me past the error, but without an inventory with CRS in it, Oracle11g OUI does not detect the cluster during the database install! No node selection screen, no RAC.

I proceeded to blow away all the CRS stuff (ORA_CRS_HOME, inittab entries, /etc/oracle/* and /etc/oraInst.loc) and reinstalled CRS using a non-shared locale for the inventory. The CRS install went fine and subsequently OUI detected the cluster when I went to install the database.

This is a significant change from 10g where the inventory content regarding CRS was not needed for anything. With 10g, the cluster is detected based on what /etc/oracle/ocr.loc tells OUI.

Shared Oracle Home is an option, shared Oracle Home means shared Oracle Home not shared Oracle Inventory. Oracle11g enforces this best practice nicely!

Manly Men Only Deploy Oracle with Fibre Channel – Part VIII. After All, Oracle Doesn’t Support Async I/O on NFS

In the comment section of my recent post about Tim Hall’s excellent NFS step-by-step Linux RAC install Guide, Tim came full circle to ask a question about asynchronous I/O on NFS. He wrote:

What do you set your filesystemio_options init.ora parameter to when using Oracle over NFS?

Based on what you’ve written before I know NFS supports direct I/O, but I’m struggling to find a specific statement about NFS and asynchronous I/O. So should I use:




My reply to that was going to remind you folks about my recent rant about old Linux distributions combined with Oracle over NFS.  That is, the answer is, “it depends.” It depends on whether you are running a reasonable Linux distribution. But, Tim quickly followed up his query with:

I found my answer. Asynchronous I/O is not supported on NFS:

Bummer, I didn’t get to answer it.

Word To The Wise
Don’t use old Linux stuff with NAS if you want to do Oracle over NFS. Metalink 279069.1 provides a clear picture as to why I say that. It points out a couple of important things:

1. RHEL 4 U4 and EL4 both support asynchronous I/O on NFS mounts. That makes me so happy because I’ve been doing asynchronous I/O on NFS mounts with Oracle10gR2 for about 16 months. Unfortunately, ML 279069.1 incorrectly states that the critical fix for Oracle async I/O on NFS is U4, when in fact the specific bug (Bugzilla 161362 ) was fixed in RHEL4 U3 as seen in this Red Hat Advisory from March 2006.

2. Asynchronous I/O on NFS was not supported on any release prior to RHEL4. That’s fine with me because I wouldn’t use any Linux release prior to the 2.6 kernels to support Oracle over NFS!

The Oracle documentation on the matter was correct since it was produced long before there was OS support for asynchronous I/O on Linux for Oracle over NFS. Metalink 279069.1 is partly correct in that it states support for asynchronous I/O on systems that have the fix for Bugzilla 161363 but it incorrectly suggests that U4 is the requisite release for that fix, but it isn’t—the bug was fixed in U3. And yes, I get really good performance with the following initialization parameter set and have for about 16 months:

filesystemio_options = setall

Manly Man Post Script
Always remember, the Manly Man series is tongue-in-cheek.  Oracle over NFS with Async I/O on the other hand isn’t.

Manly Men Only Deploy Oracle with Fibre Channel – Part VII. A Very Helpful Step-by-Step RAC Install Guide for NFS

Tim Hall has stepped up to the plate to document a step-by-step recipe for setting up Oracle10g RAC on NFS mounts. In Tim’s blog entry, he points out that for testing and training purposes it is true that you can simply export some Ext3 filesystem from a Linux server and use it for all things Oracle. Tim only had 2 systems, so what he did was use one of the servers as the NFS server. The NFS server exported a filesystem and both the servers mounted the filesystem. In this model, you have 2 NFS clients and one is acting as both an NFS client and an NFS server.

This is the link to Tim’s excellent step-by-step guide.

How Simple

If you’ve ever had a difficult time getting RAC going, I think you’d be more than happy with how simple it is with NFS and using Tim’s guide and a couple of low-end test servers would prove that out.

Recently I blogged about the fact that most RAC difficulties are in fact storage difficulties. That is not the case with NFS/NAS.

Thanks Tim!

Manly Men Only Deploy Oracle with Fibre Channel – Part VI. Introducing Oracle11g Direct NFS!

Since December 2006, I’ve been testing Oracle11g NAS capabilities with Oracle’s revolutionary Direct NFS feature. This is a fantastic feature. Let me explain. As I’ve laboriously pointed out in the Manly Man Series, NFS makes life much simpler in the commodity computing paradigm. Oracle11g takes the value proposition further with Direct NFS. I co-authored Oracle’s paper on the topic:

Here is a link to the paper.

Here is a link to the joint Oracle/HP news advisory.

What Isn’t Clearly Spelled Out. Windows Too?
Windows has no NFS in spite of stuff like SFU and Hummingbird. That doesn’t stop Oracle. With Oracle11g, you can mount directories from the NAS device as CIFS shares and Oracle will access them with high availability and performance via Direct NFS. No, not CIFS, Direct NFS. The mounts only need to be visible as CIFS shares diring instance startup.

Who Cares?
Anyone that likes simplicity and cost savings.

The Worlds Largest Installation of Oracle Databases
…is Oracle’s On Demand hosting datacenter in Austin, Tx. Folks, that is a NAS shop. They aren’t stupid!

Quote Me

The Oracle11g Direct NFS feature is another classic example Oracle implementing features that offer choices in the Enterprise data center. Storage technologies, such as Tiered and Clustered storage (e.g., NetApp OnTAP GX, HP Clustered Gateway), give customers choices—yet Oracle is the only commercial database vendor that has done the heavy lifting to make their product work extremely well with NFS. With Direct NFS we get a single, unified connectivity model for both storage and networking and save the cost associated with Fibre Channel. With built-in multi-path I/O for both performance and availability, we have no worries about I/O bottlenecks. Moreover, Oracle Direct NFS supports running Oracle on Windows servers accessing databases stored in NAS devices—even though Windows has no native support for NFS! Finally, simple, inexpensive storage connectivity and provisioning for all platforms that matter in the Grid Computing era!

Manly Men Only Deploy Oracle with Fibre Channel – Part V. What About Oracle9i on RHAS 2.1? Yippie!

Due to my Manly Man Fibre Channel Series Part I , Part II , Part III and Part IV, my email box is getting loaded with a lot of questions about various Oracle over NFS combinations. The questions run the gamut from how to best tune Oracle9i on Red Hat AS 2.1 to Oracle10g on Red Hat RHEL 3 (all on NAS/NFS of course). And then it dawned on me. When I say I’m a fan of Oracle over NFS, that is just entirely too generic.

It Ain’t Linux Unless It Is a 2.6 Kernel
Honestly folks, Red Hat 3.0-or worse yet, RHAS 2.1? Sheer madness. I’m more than convinced that there are a lot of solid RHEL 3.0 systems out there running Oracle. To those folks I’d say, “If it isn’t broken, don’t fix it.” But RHAS 2.1? That wasn’t even an operating system and to be hyper-critically honest, the “franken-kernel” that was RHEL 3.0 wasn’t really that much better, what with that hugemem 4×4 split garbage and all. SuSE SLES8 was vastly more stable than RHEL 3.0. But I digress. Look, if you are running on a pre-2.6 Kernel Linux distribution you’ve simply got to do yourself a favor and plan an upgrade! Now, back to NAS.

What Oracle on NFS?
I’ll be brief, I wouldn’t even think about using Oracle9i on NAS. I know there are a ton of databases out there doing it, but that is just me. The Oracle Server code specific to NFS (Operating System Dependent code) has gone through some serious evolution/maturation. I’ve watched the changes specifically handling NFS mature from 9i through 10g and now into 11g. Simply put, I didn’t like what I say in Oracle9i-specific to NFS that is. Oracle9i is a perfectly fine release-albeit the port to 64bit Linux was pretty scary. I guess I wasn’t that brief. So I’ll continue.

So, Oracle9i on NAS is a no-go (in my book), what about Oracle10g? There again, I’ll be brief. In my opinion, Oracle10gR1 on NAS was about as elegant as a fish flopping around on a hot sidewalk-not a pretty picture. Yes, I have my reasons why for all this stuff, but this blog entry is purely an assertion of my opinion.

Thus far, I discussed 9i and 10gR1 Linux ports. I cannot speak authoritatively about the Solaris ports of either vis a vis fitness for NFS. If I was a betting man and had two dimes to rub together I would wager them that even the Solaris releases of 9i and 10g were probably pretty shaky on NAS. That leads us to 10gR2.

Oracle10gR2 on NAS is solid-at least for Linux clients. I have seen Metalink stories about Legacy Unix ports that have RMAN problems with NFS as a near-line backup target. Again, I cannot speak for all these sundry platforms. They are good platforms, but I don’t deal with them day to day.

Don’t jump the gun…tomorrow AM…

In this May 5, 2007 post on toasters, a list participant posted the following:

We are about to start testing Oracle 9i (single instance) with NetApp NAS (6070) filers. We currently have Oracle running on Solaris 9 with SAN storage attached and VERITAS.

I wouldn’t touch that project with a 10 foot pole. If that database is stable, I wouldn’t switch out the storage architecture-especially on that old of an Oracle release.

I’ve also had a thread going with Chen Shapira who has blogged about Oracle troubles on NAS. Her point throughout that blog entry, and the comments to follow, was that they’ve suffered uptime impact that never really solidly indicts to the storage, but there seems to be a lot of fingers pointed that way. Having read of the types of instability his systems have suffered, I suspected old stuff. It came out in the comment section that they are on RHEL 3.0 64-bit. Now, like I’ve said, RHEL 3.0 is carrying a lot of Oracle databases out there I know, but I wonder how many on NAS? When I say Oracle on NFS, I’m mostly saying Linux Oracle10gR2 releases on Linux 2.6 Kernels—and beyond.

I made a blog entry on this topic back in October of last year as well.

Old Operating System Releases
I take criticism (by true believers mostly) when I point out that running Oracle on a Legacy Unix release that is, say, four years old is not a reason for concern. I wish I could say the same thing about the current state of the art in the Linux world. Dating back to my first high-end Linux project (The Tens–A 10 Node, 10TB, 10,000 User Oracle9i Linux Cluster Project in 2002), I’ve been routinely reminded that Linux stands for:

(L)inux (i)s (n)ot (u)ni(x)

Now, that said, you’ll find much less dissatisfaction with Oracle in general on 2.6 Linux Kernel based systems, but in my opinion, that goes extra for NAS deployments

Manly Men Only Deploy Oracle with Fibre Channel – Part IV. SANs are Simple, RAC is Difficult!

Several months back I made a blog entry about the RAC poll put together by Jared Still. The poll can be found here. Thus far there have been about 150 participants through the poll—best I can tell. Some of the things I find interesting about the results are:

1. Availability was cited 46% of the time as the motivating factor for deploying RAC whereas scalability counted for 37%.

2. Some 46% of the participants state that RAC has met between 75% and 100% of their expectations.

3. More participants (52%) say they’d stay with RAC given the choice to revert to non-RAC.

4. 52% of the deployments are Linux (42% Red Hat, 6% Oracle Enterprise Linux, 4% SuSE) and 34% are using the major Legacy Unix offerings (Solaris 17%, AIX 11%, HP-UX 6%).

5. 84% of the deployments are using block storage (e.g., FCP, iSCSI) with 42% of all respondents using ASM on block storage. Nearly one quarter of the respondents say they use a CFS. Only 13% use file storage (NAS via NFS).

Surveys often make for tough cipherin’. It sure would be interesting to see which of the 52% that use Linux also state they’d stay with RAC given the choice to revert or re-deploy with a non-RAC setup. Could they all have said they’d stick with RAC? Point 1 above is also interesting because Oracle markets RAC as a prime ingredient for availability as per MAA.

Of course point 5 is very interesting to me.

RAC is Simple…on Simple Storage
We are talking about RAC here, so the 84% from point 5 above get to endure the Storage Buffet. On the other hand, the 24% of the block storage deployments that layered a CFS over the raw partitions didn’t have it as bad, but the rest of them had to piece together the storage aspects of their RAC setup. That is, they had to figure out what to do with the clusterware files, database, Oracle Home and so forth. The problem with CFS is that there is no one CFS that covers all platforms. That war was fought and lost. NFS on the other hand is ubiquitous and works nicely for RAC. On that note, an email came in to my inbox last Friday on this very topic. The author of that email said:

[…] we did quite a lot of tests in the summer last year and figured out that indeed using Oracle/NFS can make a very good combination (many at [COMPANY XYZ] were spectical, I had no opinion as I had never used it, I wanted to see the fact). So I have convinced our management to go the NFS way (performance ok for the workload under question, way simpler management).

[…] The production setup (46 nodes, some very active, some almost idle accessing 6 NAS “heads”) does its job with satisfying performance […]

What do I see in this email? NFS works well enough for this company that they have deployed 46 nodes—but that’s not all. I pay particular attention to the 3 most important words in that quote: “way simpler management.”

Storage Makes or Breaks Many RAC Deployments
I watched intently as Charles Schultz detailed his first forray into RAC. First, I’ll point out that Charles and I had an email side-bar conversation on this topic. He is aware that I intended to weave his RAC experience into a blog entry of my own. So what’s there to blog about? Well, I’ll just come right out and say it—RAC is usually only difficult when difficult storage is used. How can I say that? Let’s consider Charles’ situation.

First, Charles is an Oracle Certified Master who has no small amount of exposure to large Oracle environments. Charles points out on his blog that the environment they were trying to deploy RAC into has some 150 or more databases consuming some 10TB of storage! That means Charles is no slouch. And being the professional he is, Charles points out that he took specialized RAC training to prepare for the task of deploying Oracle in their environment. So why did Charles struggle with setting up a 2-node RAC cluster to the point of making a post to the oracle-l email list for assistance? The answer is simply that the storage wasn’t simple.

It turned out that Charles’ “RAC difficulty” wasn’t even RAC. I assert that the highest majority of what is termed “RAC difficulty” isn’t RAC at all, but the platform or storage instead. By platform I mean Linux RPM dependency and by storage I mean SAN madness. Charles’ difficulties boiled down to Linux FCP multipathing issues. Specifically, multipathing was causing ASM to see multiple entries for each LUN. I made the following comment on Charles’ blog:

Hmm, RHEL4 and two nodes. Things should not be that difficult. I think what you have is more on your hands than RAC. I’ve seen OCFS2, and ASM [in Charles’ blog thread]. That means you also have simple raw disks for OCR/CSS and since this is Dell, is my guess right that you have EMC storage with PowerPath?

Lot’s on your plate. You know me, I’d say NAS…

Ok, I’m sorry for SPAMing your site, Charles, but your situation is precisely what I talk about. You are a Certified Master who has also been to specific RAC training and you are experiencing this much difficulty on a 2 node cluster using a modern Linux distro. Further, most of your problems seem to be storage related. I think that all speaks volumes.

Charles replied with:

[…] I agree whole-heartedly with your statements; my boss made the same observations after we had already sunk over 40 FTE of 2 highly skilled DBAs plunking around with the installation.

If I read that correctly, Charles and a colleague spent a week trying to work this stuff out and Charles is certainly not alone in these types of situations that generally get chalked up as “RAC problems.” There was a lengthy thread on oracle-l about very similar circumstances not that long ago.

Back To The Poll
It has been my experience that most RAC difficulties are storage related—specifically the storage presentation. As point 5 in the poll above shows, some 84% of the respondents had to deal with raw partitions at one time or another. Indeed, even with CFS, you have to get the raw partitions visible and like-named on each node of the cluster before you can create a filesystem. If I hear of one more RAC deployment falling prey to storage difficulties, I’ll…


Ah, forget that. I use the following mount options on Linux RAC NFS clients:


and I generally widen up a few kernel tunables when using Oracle over NFS:

net.core.rmem_default = 524288
net.core.wmem_default = 524288
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem=4096 524288 16777216
net.ipv4.tcp_wmem=4096 524288 16777216
net.ipv4.tcp_mem=16384 16384 16384

Once the filesystem(s) is/are mounted, I have 100% of my storage requirements for RAC taken care of. Most importantly, however, is to not forget Direct I/O when using NFS, so I set the following init.ora parameter filesystemio_options as follows:


Life is an unending series of choices. Choosing between simple or difficult storage connectivity and provisioning is one of them. If you overhear someone lamenting about how difficult “RAC” is, ask them how they like their block storage (FCP, iSCSI).

Manly Men Only Deploy Oracle with Fibre Channel – Part II. What’s So Simple and Inexpensive About NFS for Oracle?

The things I routinely hear from DBAs leads me to believe that they often don’t understand storage. Likewise, the things I hear from Storage Administrators convinces me they don’t always know what DBAs and system administrators have to do with those chunks of disk they dole out for Oracle. This is a long blog entry aimed at closing that gap with a particular slant to Oracle over NFS. Hey, it is my blog after all.

I also want to clear up some confusion about points I made in a recent blog entry. The confusion was rampant as my email box will attest so I clearly need to fix this.

I was catching up on some blog reading the other day when I ran across this post on Nuno Souto’s blog dated March 18, 2006. The blog entry was about how Noon’s datacenter had just taken on some new SAN gear. The gist of the blog entry is that they did a pretty major migration from one set of SAN gear to the other with very limited impact—largely due to apparent 6-Ps style forethought. Noons speaks highly of the SAN technology they have.

Anyone that participates in the oracle-l email list knows Noons and his important contributions to the list. In short, he knows his stuff—really well. So why am I blogging about this? It dawned on me that my recent post about Manly Men Only Deploy Oracle with Fibre Channel Storage jumped over a lot of ground work. I assure you all that neither Noons nor the tens of thousands of Oracle shops using Oracle on FCP are Manly Men as I depicted in my blog entry. I’m not trying to suggest that people are fools for using Fibre Channel SANs. Indeed, after waiting patiently from about 1997 to about 2001 for the stuff to actually work warrants at least some commitment to the technology. OK, ok, I’m being snarky again. But wait, I do have a point to make.

Deploying Oracle on NAS is Simpler and Cheaper, Isn’t It?
In my blog entry about “Manly Man”, I stated matter-of-factly that it is less expensive to deploy Oracle on NAS using NFS than on SANs. Guess what, I’m right, it is. But I didn’t sufficiently qualify what I was talking about. I produced that blog entry presuming readers would have the collective information of my prior blog posts about Oracle over NFS in mind. That was a weak presumption. No, when someone like Noons says his life is easier with SAN he means it. Bear in mind his post was comparing SAN to DAS, but no matter. Yes, Fibre Channel SAN was a life saver for too many sites to count in the late 90s. For instance, sites that bought into the “server consolidation” play of the late 1990s. In those days, people turned off their little mid-range Unix servers with DAS and crammed the workloads into a large SMP. The problem was that eventually the large SMP couldn’t physically attach any more DAS. It turns out that Fibre was needed first and foremost to get large numbers of disks connected to the huge SMPs of the era. That is an entirely different problem to solve than getting large numbers of servers connected to storage.

Put Your Feet in the Concrete
Most people presume that Oracle over NFS must be exponentially slower than Fibre Channel SAN. They presume this because at face value the wires are faster (e.g., 4Gb FCP versus 1Gb Ethernet). True, 4Gb is more bandwidth than 1Gb, but you can have more than one NFS path to storage and the latencies are a wash. I wanted to provide some numbers so I thought I’d use Network Appliance’s data that suggested a particular test of 8-way Solaris servers running Oracle OLTP over NFS comes within 21% of what is possible on a SAN. Using someone else’s results was mistake number 1. Folks, 21% degredation for NFS compared to SAN is not a number cast in stone. I just wanted to show that it is not a day and night difference and I wanted to use Network Appliance numbers for validity. I would not be happy with 21% either and that is good, because the numbers I typically see are not even in that range to start with. I see more like 10% and that is with 10g. 11g closes the gap nicely.

I’ll be producing data for those results soon enough, but let’s get back to the point. 21% of 8 CPUs worth of Oracle licenses would put quite a cost-savings burden on NAS in order to yield a net gain. That is, unless you accept the fact that we are comparing Oracle on NAS versus Oracle on SAN in which case the Oracle licensing gets cancelled out. And, again, let’s not hang every thought on that 21% of 8 CPUs performance difference because it is by no means a constant.

Snarky Email
After my Manly Man post, a fellow member of the OakTable Network emailed me the viewpoint of their very well-studied Storage Administrator. He calculated the cost of SAN connectivity for a very, very small SAN (using inexpensive 8-port FC switches) and factored in Oracle Enterprise Edition licensing to produce a cost per throughput using the data from that Network Appliance paper—the one with the 21% deficit. That is, he used the numbers at hand (21% degradation), Oracle Enterprise Edition licensing cost and his definition of a SAN (low connectivity requirements) and did the math correctly. Given those inputs, the case for NAS was pretty weak. To my discredit, I lashed back with the following:

…of course he is right that Oracle licensing is the lion’s share of the cost. Resting on those laurels might enable him to end up the last living SAN admin.

Folks, I know that 21% of 8 is 1.7 and that 1.7 Enterprise Edition Licenses can buy a lot of dual-port FCP HBAs and even a midrange Fibre Channel switch, but that is not the point I failed to make. The point I failed to make was that I’m not talking about solving the supposed difficulties of provisioning storage to those one or two remaining refrigerator-sized Legacy Unix boxes you might have. There is no there, there. It is not difficult at all to run a few 4Gb FCP wires to separate 8 or 16 port FC switches and then back to the storage array. Even Manly Man can do that. That is not a problem that needs solved because that is neither difficult nor is it expensive (at least the SAN aspect isn’t). As the adage goes, a picture speaks a thousand words. The following is a visual of a problem that doesn’t need to be solved—a simple SAN connected to a single server. Ironically, what it depicts is potentially millions of dollars worth of server and storage connected with merely thousands of dollars worth of Fibre Channel connectivity gear. In case the photo isn’t vivid enough, I’ll point out that on the left is a huge SMP (e.g., HP Superdome) and on the right is an EMC DMX. In the middle is a redundant set of 8-port switches—cheap, and simple. Even providing private and public Ethernet connectivity in such a deployment is a breeze by the way.


I Ain’t Never Doing That Grid Thing.
Simply put, if the only Oracle you have deployed—now and forever—sits in a couple of refrigerator-sized legacy SMP boxes, I’m going to sound like a loon on this topic. I’m talking about provisioning storage to commodity servers—grid computing. Grid may not be where you are today, but it is in fact where you will be someday. Consider the fact that most datacenters are taking their huge machines and chopping them up into little machines with hardware/software virtualization anyway so we might as well just get to the punch and deploy commodity servers. When we do, we feel the pain of Fibre Channel SAN connectivity and storage provisioning. Because connecting large numbers of servers to storage was not exactly a design center for Fibre Channel SAN technology. Just the opposite is true; SANs were originally meant to connect a few servers to a huge number of disks—more than was possible with DAS.

Commodity Computing (Grid) == Huge SAN
Large numbers of servers connected to a SAN makes the SAN very complex. Not necessarily more disks, but the presentation and connectivity aspects get very difficult to deal with.

If you are unlucky enough to be up to your knees in the storage provisioning, connectivity and cost nightmare associated with even a moderate number of commodity servers in a SAN environment you know what I’m talking about. In these types of environments, people are deploying and managing director-class Fibre Channel switches where each port can cost up to $5,000 and they are deploying more than one switch for redundancy sake. That is, each commodity server needs a 2 port FC HBA and 2 paths to two different switches. Between the HBAs and the FC switch ports, the cost is as much as $10,000-$12,000 just to connect a “pizza box” to the SAN. That’s the connectivity story and the provisioning story is not much prettier.

Once the cabling is done, the Storage Administrator has to zone the switches and provision storage (e.g., create LUNs, LUN masking, etc). For RAC, that would be a minimum of 3 masked LUNs for each database. Then the System Administrator has to make sure Oracle has access to those LUNs. That is a lot of management overhead. NAS on the other hand uses very inexpensive NICs and switches. Ah, now there is an interesting point. Using NAS means each server only has one type of network connectivity instead of two (e.g., FC and Ethernet). Storage provisioning is also simpler—the database server administrator simply mounts the NFS filesystem and the DBA can go straight to work with RAC or non-RAC Oracle databases. How simple. And yes, the Oracle licensing cost is a constant, so in this paradigm, the only way to recuperate cost is in the storage connectivity side. The savings are worth consideration, and the simplicity is very difficult to argue.

It’s time for another picture. The picture below depicts a small commodity server deployment—38 servers that need storage.


Let’s consider the total connectivity problem starting with the constant—Ethernet. Yes, every one of these 38 servers needs both Ethernet and Fibre Channel connectivity. For simplicity, let’s say only 8 of these servers are using RAC. The 8 that host RAC will need a minimum of 4 Gigabit Ethernet NICs/cables—2 for the public interfaces and two for a bonded, private network for Oracle Cache Fusion (GCS, GES) for a total of 32. The remaining 30 could conceivably do fine with 2 public networks each for a subtotal of 60. All told, we have 92 Ethernet paths to deal with before we look at storage networking.

On the storage side, we’ll need redundant paths for all 38 server to multiple switches so we start with 38 dual-port HBAs and 76 front-side Fibre Channel switch ports. Each switch will need a minimum of 2 paths back to storage, but honestly, would anyone try to feed 38 modern commodity servers with 2 4Gb paths worth of storage bandwidth? Likely not. On the other hand, it is unlikely the 30 smaller servers will each need dedicated 4Gb I/O bandwidth to storage so we’ll play zone trickery on the switch and group sets of 2 from the 30 yielding a requirement for 15 back-side I/O paths from each switch for a subtotal of 30 back-side paths. Following in suit, the remaining 8 RAC servers will require 4 back-side paths from each of the two switches for a subtotal of 8 back-side paths. To sum it up, we have 76 front-side and 38 back-side paths for a total of 114 storage paths. Yes, I know this can be a lot simpler by limiting the number of switch-to-storage paths. That’s a game called Which Servers Should We Starve for I/O and it isn’t fun to play. These arrangements are never attempted with small switches. That’s why the picture depicts large, expensive director-class switches.

Here’s our mess. We have 92 Ethernet paths and 114 storage paths. How would NAS make this simpler? Well, Ethernet is the constant here so we simply add more inexpensive Ethernet infrastructure. We still need redundant switches and I/O paths, but Ethernet is cheap and simple and we are down to a single network topology instead of two. Just add some simple NICs and simple Ethernet switches and go. And oh, by the way, the two network-topologies-model (e.g., GbE_+ FCP) generally means two different “owners” since the SAN would generally be owned by the Storage Group and the Ethernet would be owned by the Networking Group. With NAS, all connectivity from the Ethernet switches forward can be owned by the Networking Group freeing the Storage Group to focus on storage—as opposed to storage networking.

And, yes, Oracle11g has features that make the connectivity requirement on the Ethernet side simpler but 10g environments can benefit from this architecture too.

Not a Sales Pitch
Thus far, this blog entry has been the what. This would make a pretty hollow blog entry if I didn’t at least mention the how. The odds are very slim that your datacenter would be able to do a 100% NAS storage deployment. So Network Appliance handles this by offering multiple protocol storage from their Filers. The devil shall not remain with the details.

Total NAS? Nope. Multi-Protocol Storage.
I’ll be brief. You are going to need both FCP and NAS, I know that. If you have SQL Server (ugh) you certainly aren’t going to connect those servers to NAS. There are other reasons FCP isn’t going to go away soon enough. I accept the fact that both protocols are required in real life. So let’s take a look a multi-protocol storage and how it fits into this thread.

Network Appliance Multi-Protocol Support
Network Appliance is an NFS device. If you want to use it for FCP or iSCSI SAN, large files in the Filer’s filesystem (WAFL) are served with either FCP or iSCSI protocol and connectivity. Fine. It works. I don’t like it that much, but it works. In this paradigm, you’d choose to run the simplest connectivity type you deem fit. You could run some FCP to a few huge Legacy SMPs, FCP to some servers running SQL Server (ugh), and most importantly Ethernet for NFS to whatever you choose—including Oracle on commodity servers. Multi-protocol storage in this fashion means total vendor lock-in, but it would allow you to choose between the protocols and it works.

SAN Gateway Multi-Protocol Support
Don’t get rid of your SAN until there is something reasonable to replace it with. How does that statement fit this thread? Well, as I point out in this paper, SAN-NAS gateway devices are worth consideration. Products in this space are the HP Enterprise File Services Clustered Gateway and EMC Celerra. With these devices you leverage your existing SAN by connecting the “NAS Heads” to the SAN using very low-end, simple Fibre Channel SAN connectivity (e.g., small switches, few cables). From there, you can provision NFS mounts to untold numbers of NFS clients—a few, dozens or hundreds. The mental picture here should be a very small amount of the complex, expensive connectivity (Fibre Channel) and a very large amount of the inexpensive, simple connectivity (Ethernet). What a pleasant mental picture that is. So what’s the multi-protocol angle? Well, since there is a down-wind SAN behind the NAS gateway, you can still directly cable your remaining Legacy Unix boxes with FCP. You get native FCP storage (unlike NetApp with the blocks-from-file approach) for the systems that need it and NAS for the ones that don’t.

I’m a Oracle DBA, What’s in It for Me?
Excellent question and the answer is simply simplicity! I’m not just talking simplicity, I’m talking simple, simple, simple. I’m not just talking about simplicity in the database tier either. As I’ve pointed out upteen times, NFS will support you from top to bottom—not just the database tier, but all your unstructured data such as software installations as well. Steve Chan chimes in on the simplicity of shared software installs in the E-Biz world too. After the NFS filesystem is mounted, you can do everything from ORACLE_HOME, APPL_TOP, clusterware files (e.g., the OCR and CSS disks), databases, RMAN, imp/exp, SQL*Loader/External Tables, ETL, compiled PL/SQL, UTL_FILE, BFILE, trace/logging, scripts, and on and on. Without NFS, what sort of mix-match of raw, filesystem, raw+ASM combination would be required? A complex one—and the really ironic part is you’d probably still end up with some NFS mounts in addition to all that raw disk and non-CFS filesystem space as well!

Whew. That was a long blog entry.

Mount Options for Oracle over NFS. It’s All About the Port.

BLOG UPDATE: This post has developed an interesting comment thread worth noting.

I currently have a nearly chaotic set of differing configurations to deal with that run the gamut of x86_64 servers attached to 2/4 Gb FCP SANs and others to NAS via GbE. So sometimes I miss the mark. I just tried to fire up one of my databases on a DL585 running RHEL4 attached to the Enterprise File Services Clustered Gateway NAS device. In the midst of the chaos I mistakenly mounted the filesystem containing the Oracle Database 10g test database using the wrong mount options, so:

$ tail alert*log
Mon Jun 18 15:04:28 2007
WARNING:NFS file system /mnt mounted with incorrect options
WARNING:Expected NFS mount options: rsize>=32768,wsize>=32768,hard
Mon Jun 18 15:04:28 2007
ORA-00202: control file: ‘/u01/app/oracle/product/10.2.0/db_1/rw/DATA/cntlbench_1’
ORA-27054: NFS file system where the file is created or resides is not mounted with correct options
Additional information: 3
Mon Jun 18 15:04:28 2007

I clearly did not mount the filesystems correctly. After remounting with the following options, everything was OK:


But then these mount options are port-specific and as they say in true Clintonian form, “It’s the Port Stupid.”

It’s All About the Port
The only complaint I have about Oracle over NFS is at the port level. I intend to start blogging about the idiosyncrasies between, say, certain Legacy Unix and Linux ports of Oracle with regard to NAS mount options. I think RMAN has the most issues and, again, these are always port level. For instance, certain ports inspect the mount options of the actual mounted filesystem and others will look at the mnttab. And then, in some cases, certain ports do it one way for the instance and then another for functionality such as RMAN. Sometimes when the database or tools don’t like the mount options they return an error message spelling out what is missing and other times just a generic complaint that the mount options are incorrect—and that too varies by port and version of Oracle as well. Recently I found that the HP-UX port of Oracle10g needs the llock mount option which is apparently not documented very well.

In all cases, issues regarding mount options are the responsibility of the Oracle port team for the release. That is where this functionality is built. The layers above the I/O layer (Operating System Dependent code) have no idea whether there is DAS, SAN or NAS down stream. That abstraction is one of the main reasons Oracle is the best database out there. That porting heritage goes back to Oracle version 4. Anyway, I digress…

Yes, these mount option topics are more complicated than they should be, but this situation is not permanent. As we get closer to July 11, I’ll be blogging more about what that means. Regardless, I stand fast in my view that provisioning storage for Oracle via NFS is simpler, simpler, simpler than SANs and that goes for both RAC or non-RAC databases. Just mount the filesystem and go…

In the meantime, if you have a particular port of Oracle10g that isn’t getting along with your NAS, remember our motto, “It’s the port[…]” so log an SR and Oracle will get you on your way.

Manly Men Only Deploy Oracle with Fibre Channel – Part 1. Oracle Over NFS is Weird.

Beware, lot’s of tongue in cheek in this one. If you’re not the least bit interested in storage protocols, saving money or a helpful formula for safely configuring I/O bandwidth for Oracle, don’t read this.

I was reading Pawel Barut’s Log Buffer #48 when the following phrase caught my attention:

For many of Oracle DBAs it might be weird idea: Kevin Closson is proposing to install Oracle over NFS. He states that it’s cheaper, simpler and will be even better with upcoming Oracle 11g.

Yes, I have links to several of my blog entries about Oracle over NFS on my CFS, NFS, ASM page, but that is not what I want to blog about. I’m blogging specifically about Powet’s assertion that “it might be a weird idea”—referring to using NAS via NFS for Oracle database deployments.

I think the most common misconception people have is regarding the performance of such a configuration. True, NFS has a lot of overhead that would surely tax the Oracle server way too much—that is if Oracle didn’t take steps to alleviate the overhead. The primary overhead is in NFS client-side caching. Forget about it. Direct I/O and asynchronous I/O are available to the Oracle server for NFS files with just about every NFS client out there.

Manly Men™ Choose Fibre Channel
I hear it all the time when I’m out in the field or on the phone with prospects. First I see the wheels turning while math is being done in the head. Then, one of those cartoon thought bubbles pops up with the following:

Hold it, that Closson guy must not be a Manly Man™. Did he just say NFS over Gigabit Ethernet? Ugh, I am Manly Man and I must have 4Gb Fibre Channel or my Oracle database will surely starve for I/O!

Yep, I’ve been caught! Gasp, 4Gb has more bandwidth than 1Gb. I have never recommended running a single path to storage though.

Bonding Network Interfaces
Yes, it can be tricky to work out 802.3ad Link Aggregation, but it is more than possible to have double or triple bonded paths to the storage. And yes, scalability of bonded NICs varies, but there is a simplicity and cost savings (e.g., no FCP HBAs or expensive FC switches) with NFS that cannot be overlooked. And, come in closely and don’t tell a soul, you won’t have to think about bonding NICs for Oracle over NFS forever, wink, wink, nudge, nudge.

But, alas, Manly Man doesn’t need simplicity! Ok, ok, I’m just funning around.

No More Wild Guesses
A very safe rule of thumb to keep your Oracle database servers from starving for I/O is:
100Mb I/O per GHz CPU

So, for example, if you wanted to make sure an HP c-Class server blade with 2-socket 2.66 GHz “Cloverdale” Xeon processors had sufficient I/O for Oracle, the math would look like this:

12 * 2.66 * 4 * 2 == 255 MB/s

Since the Xeon 5355 is a quad-core processor and the 480c c-Class blade supports two of them there are 21.28 GHz for the formula. And, 100 Mb is about 12 MB. So if Manly Man configures, say, two 4Gb FC paths (for redundancy) to the same c-Class blade he is allocating about 1000 MB/s bandwidth. Simply put, that is expensive overkill. Why? Well, for starters, the blade would be 100% saturated at the bus level if it did anything with 1000 MB/s so it certainly couldn’t satisfy Oracle performing physical I/O and actually touching the blocks (e.g., filtering, sorting, grouping, etc). But what if Manly Man configured the two 4Gb FCP paths for failover with only 1 path active path (approximately 500 MB/s bandwidth)? That is still overkill.

Now don’t get me wrong. I am well aware that 2 “Cloverdale” Xeons running Parallel Query can scoop up 500MB/s from disk without saturating the server. It turns out that simple light weight scans (e.g., select count(*) ) are about the only Oracle functionality that breaks the rule of 100Mb I/O per GHz CPU. I’ve even proven that countless times such as in this dual processor, single core Opteron 2.8 Ghz proof point. In that test I had IBM LS20 blades configured with dual processor, single-core Opterons clocked at 2.8 GHz. So if I plug that into the formula I’d use 5.6 for the GHz figure which supposedly yields 67 MB/s as the throughput at which those processors should have been saturated. However, on page 16 of this paper I show those two little single-core Opterons scanning disk at the rate of approximately 380MB/s. How is that? The formula must be wrong!

No, it’s not wrong. When Oracle is doing a light weight scan it is doing very, very little with the blocks of data being returned from disk. On the other hand, if you read further in that paper, you’ll see on page 17 that a measly 21MB/s of data loading saturated both processors on a single node-due to the amount of data manipulation required by SQL*Loader. OLTP goes further. Generally, when Oracle is doing OLTP, as few as 3,000 IOps from each processor core will result in total saturation. There is a lot of CPU intensive stuff wrapped around those 3,000 IOps. Yes, it varies, but look at your OLTP workload and take note of the processor utilization when/if the cores are performing on the order of 3,000 IOps each. Yes, I know, most real-world Oracle databases don’t even do 3,000 IOps for an entire server which takes us right back to the point: 100Mb I/O per GHz CPU is a good, safe reference point.

What Does the 800 Pound Gorilla Have To Say?
When it comes to NFS, Network Appliance is the 800lb gorilla. They have worked very hard to get to where they are. See, Network Appliance likely doesn’t care if Manly Man would rather deploy FCP for Oracle instead of NFS since their products do both protocols-and iSCSI too. All told, they may stand to make more money if Manly Man does in fact go with FCP since they may have the opportunity to sell expensive switches too. But, no, Network Appliance dispels the notion that 4Gb (or even 2Gb) FCP for Oracle is a must.

In this NetApp paper about FCP vs iSCSI and NFS, measurements are offered that show equal performance with DSS-style workloads (Figure 4) and only about 21% deficit when comparing OLTP on FCP to NFS. How’s that? The paper points out that the FCP test was fitted with 2Gb Fibre Channel HBAs and the NFS case had two GbE paths to storage yet Manly Man only achieved 21% more OLTP throughput. If NFS was so inherently unfit for Oracle, this test case with bandwidth parity would have surely made the point clear. But that wasn’t the case.

If you look at Figure 2 in that paper, you’ll see that the NFS case (with jumbo frames) spent 31.5% of cycles in kernel mode compared to 22.4% in the FCP case. How interesting. The NFS case lost 28% more CPU to kernel mode overhead and delivered 21% less OLTP throughput. Manly Man must surely see that addressing that 28% extra kernel mode overhead associated with NFS will bring OLTP throughput right in line with FCP and:

– NFS is simpler to configure

– NFS can be used for RAC and non-RAC

– NFS is cheaper since GbE is cheaper (per throughout) than FCP

Now isn’t that weird?

The 28%.

I can’t tell you how and when the 28% additional kernel-mode overhead gets addressed, but, um, it does. So, Manly Man, time to invent the wheel.

Oracle Over NFS. I Need More Monitoring Tools? A Bonded NIC Roller Coaster.

As you can tell by my NFS-related blog entries, I am an advocate of Oracle over NFS. Forget those expensive FC switches and HBAs in every Oracle Database server. That is just a waste. Oracle11g will make that point even more clearly soon enough. I’ll start sharing how and why as soon as I am permitted. In the meantime…

Oracle over NFS requires bonded NICs for redundant data paths and performance. That is an unfortunate requirement that Oracle10g is saddled with. And, no, I’m not going to blog about such terms as IEEE 802.3ad, PAgP, LACP, balance-tlb, balance-alb or even balance-rr. The days are numbered for those terms-at least in the Oracle database world. I’m not going to hint any futher about that though.

Monitoring Oracle over NFS
If you are using Oracle over NFS, there are a few network monitoring tools out there. I don’t like any of them. Let’s see, there’s akk@da and Andrisoft WanGuard. But don’t forget Anue and Aurora, Aware, BasicState, CommandCenter NOC, David, Dummmynet, GFI LANguard, Gomez, GroundWork, Hyperic HQ, IMMonitor, Jiploo, Monolith, moods, Network Weathermap, OidView, Pandetix, Pingdom, Pingwy, skipole-monitor, SMARTHawk, Smarts, WAPT, WFilter, XRate1, arping, Axence NetVision, BBMonitor, Cacti, CSchmidt collection, Cymphonix Network Composer, Darkstat, Etherape, EZ-NOC, Eye-on Bandwidth, Gigamon University, IPTraf, Jnettop, LITHIUM, mrtg-ping-probe, NetMRG, NetworkActiv Scanner, NimTech, NPAD, Nsauditor, Nuttcp, OpenSMART, Pandora FMS, PIAFCTM, Plab, PolyMon, PSentry, Rider, RSP, Pktstat, SecureMyCompany, SftpDrive, SNM, SpeedTest, SpiceWorks, Sysmon, TruePath, Unbrowse, Unsniff, WatchMouse, Webalizer, Web Server Stress Tool, Zenoss, Advanced HostMonitor, Alvias, Airwave, AppMonitor, BitTorrent, bulk, BWCTL, Caligare Flow Inspector, Cittio, ClearSight, Distinct Network Monitor, EM7, EZMgt, GigaMon, Host Grapher II, HPN-SSH, Javvin Packet Analyzer, Just-ping, LinkRank, MoSSHe, mturoute, N-able OnDemand, Netcool, netdisco, Netflow Monitor, NetQoS, Pathneck, OWAMP, PingER, RANCID, Scamper, SCAMPI, Simple Infrastructure Capacity Monitor, Spirent, SiteMonitor, STC, SwitchMonitor, SysUpTime, TansuTCP, thrulay, Torrus, Tstat, VSS Monitoring, WebWatchBot, WildPackets, WWW Resources for Communications & Networking Technologies, ZoneRanger, ABwE, ActivXpets, AdventNet Web NMS, Analyse It, Argus, Big Sister, CyberGauge, eGInnovations, Internet Detective, Intellipool Network Monitor, JFF Network Management System, LANsurveyor, LANWatch, LoriotPro, MonitorIT, Nagios, NetIntercept, NetMon, NetStatus, Network Diagnostic Tool, Network Performance Advisor, NimBUS, NPS, Network Probe, NetworksA-OK, NetStat Live, Open NerveCenter, OPENXTRA, Packeteer, PacketStorm, Packetyzer, PathChirp, Integrien, Sniff’em, Spong, StableNet PME, TBIT, Tcptraceroute, Tping, Trafd, Trafshow, TrapBlaster, Traceroute-nanog, Ultra Network Sniffer, Vivere Networks, ANL Web100 Network Configuration Tester, Anritsu, aslookup, AlertCenter, Alertra, AlertSite, Analyse-it, bbcp, BestFit, Bro, Chariot, CommView, Crypto-PAn, elkMonitor, DotCom, Easy Service Monitor, Etherpeek, Fidelia, Finisar, Fpinger, GDChart, HipLinkXS, ipMonitor, LANExplorer, LinkFerret, LogisoftAR, MGEN, Netarx, NetCrunch, NetDetector, NetGeo, NEPM, NetReality, NIST Net, NLANR AAD, NMIS, OpenNMS PageREnterprise, PastMon, Pathprobe, remstats, RIPmon, RFT, ROMmon, RUDE, Silverback, SmokePing, Snuffle, SysOrb, Telchemy, TCPTune, TCPurify, UDPmon, WebAttack, Zabbix, AdventNet SNMP API, Alchemy Network Monitor, Anasil analyzer, Argent, Autobuf, Bing, Clink, DSLReports, Firehose, GeoBoy, PacketBoy, Internet Control Portal, Internet Periscope, ISDNwatch, Metrica/NPR, Mon, NetPredict, NetTest, Nettimer, Net-One-1, Pathrate, RouteView, sFlow, Shunra, Third Watch, Traceping, Trellian, HighTower, WCAT, What’s Up Gold, WS_FTP, Zinger, Analyzer, bbftp, Big Brother, Bronc, Cricket, EdgeScape, Ethereal (now renamed Wireshark), gen_send/gen_recv, GSIFTP, Gtrace, Holistix, InMon, NcFTP, Natas, NetAlly, NetScout, Network Simulator, Ntop, PingGraph, PingPlotter, Pipechar, RRD, Sniffer, Snoop, StatScope, Synack, View2000, VisualPulse, WinPcap, WU-FTPD, WWW performance monitoring, Xplot, Cheops, Ganymede, hping2, Iperf, JetMon, MeasureNet, MatLab, MTR, NeoTrace, Netflow, NetLogger, Network health, NextPoint, Nmap, Pchar, Qcheck, SAA, SafeTP, Sniffit, SNMP from UCSD, Sting, ResponseNetworks, Tcpshow, Tcptrace WinTDS, INS Net Perf Mgmt survey, tcpspray, Mapnet, Keynote, prtraceroute clflowd flstats, fping, tcpdpriv, NetMedic Pathchar, CAIDA Measurement Tool Taxonomy, bprobe & cprobe, mrtg, NetNow, NetraMet, Network Probe Daemon, InterMapper, Lachesis, Optimal Networks and last but not least, Digex.

Simplicity Please
The networking aspect of Oracle over NFS is the simplest type of networking imaginable. The database server issues I/O to NFS filesystems being exported over simple, age old Class C private networks (192.168.N.N). We have Oracle statspack to monitor what Oracle is asking of the filesystem. However, if the NFS traffic is being sent over a bonded NIC, monitoring the flow of data is important as well. That is also a simple feat on Linux since /proc tracks all that on a per-NIC basis.

I hacked out a very simple little script to monitor eth2 and eth3 on my system. It isn’t anything special, but it shows some interesting behavior with bonded NICS. The following screen shot shows the simple script executing. A few seconds after starting the script, I executed an Oracle full table scan with Parallel Query in another window. Notice how /proc data shows that the throughput has peaks and valleys on a per-second basis. The values being reported are Megabytes so it is apparent that the bursts of I/O are achieving full bandwidth of the GbE network storage paths, but what’s up with the pulsating action? Is that Oracle, or the network? I can’t tell you just yet. Here is the screen shot nonetheless:


For what it is worth, here is a listing of that silly little script. Its accuracy compensates for its lack of elegance. Cut and paste this on a Linux server and tailor eth[23] to whatever you happen to have.

$ cat
function get_data() {
cat /proc/net/dev | egrep “${token1}|${token2}” \
| sed ‘s/^.*://g’ | awk ‘{ print $1 }’ | xargs echo


while true

set – `get_data`

sleep $INTVL

set – `get_data`

echo $a_if1 $b_if1 | awk -v intvl=$INTVL ‘{
printf(“%7.3f\t”, (($1 – $2) / 1048576) / intvl) }’
echo $a_if2 $b_if2 | awk -v intvl=$INTVL ‘{
printf(“%7.3f\n”, (($1 – $2) / 1048576) / intvl) }’


Combining ASM and NAS. Got Proof?

I blogged yesterday about Oracle over NFS performance and NFS protocol for Oracle. In the post I referenced a recent thead on where Oracle over NFS performance was brought into question by a list participant. I addressed that in yesterday’s blog entry. The same individual that questioned Oracle NFS performance also called for proof that Oracle supports using large files in an NFS mount as “disks” in an ASM disk group. I didn’t care to post the reply in c.d.o.s because I’m afraid only about 42 people would ever see the information.

Using ASM on NAS (NFS)
I’ve blogged before about how I think that is a generally odd idea, but there may be cases where it is desirable to do so. In fact, it would be required for RAC on Standard Edition. The point is that Oracle does support it. I find it odd actually that I have to provide a reference as evidence that such a technology combination is supported. No matter, here is the reference:

Oracle Documentation about using NAS devices says:

C.3 Creating Files on a NAS Device for Use with Automatic Storage Management

If you have a certified NAS storage device, you can create zero-padded files in an NFS mounted directory and use those files as disk devices in an Automatic Storage Management disk group. To create these files, follow these steps:


To use files as disk devices in an Automatic Storage Management disk group, the files must be on an NFS mounted file system. You cannot use files on local file systems.

A Dirty Little Trick
If you want to play with ASM, there is an undocumented initialization parameter that enables the server to use ASM with normal filesystem files. The parameter is called _asm_allow_only_raw_disks. Setting it to FALSE allows one to test ASM using zero-filled files in any normal filesystem. And, no, it is not supported in production.

More Information
For more information about ASM on NAS, I recommend:

About the Oracle Storage Compatibility Program

Oracle over NFS Performance is “Glacial”, But At Least It Isn’t “File Serving.”

I assert that Oracle over NFS is not going away anytime soon—it’s only going to get better. In fact, there are futures that make it even more attractive from a performance and availability standpoint, but even today’s technology is sufficient for Oracle over NFS. Having said that, there is no shortage of misunderstanding about the model. The lack of understanding ranges from clear ignorance about the performance characteristics to simple misunderstanding about how Oracle interacts with the protocol.

Perhaps ignorance is not always the case when folks miss the mark about the performance characteristics. Indeed, when someone tells me the performance is horrible with Oracle over NFS—and the say they actually measured the performance—I can’t call them a bold-faced liar. I’m sure nay-sayers in the poor-performance crowd saw what they saw, but they likely had a botched test. I too have seen the results of a lot of botched or ill-constructed tests, but I can’t dismiss an entire storage and connectivity model based on such results. I’ll discuss possible botched tests in a later post. First, I’d like to clear up the common misunderstanding about NFS and Oracle from a protocol perspective.

The 800lb Gorilla
No secrets here; Network Appliance is the stereotypical 800lb gorilla in the NFS space. So why not get some clarity on the protocol from Network Appliance’s Dave Hitz? In this blog entry about iSCSI and NAS, Dave says:

The two big differences between NAS and Fibre Channel SAN are the wires and the protocols. In terms of wires, NAS runs on Ethernet, and FC-SAN runs on Fibre Channel.

Good so far—in part. Yes, most people feed their Oracle database servers with little orange glass, expensive Host Bus Adaptors and expensive switches. That’s the FCP way. How did we get here? Well, FCP hit 1Gb long before Ethernet and honestly, the NFS overhead most people mistakenly fear in today’s technology was truly a problem in the 2000-2004 time frame. That was then, this is now.

As for NAS, Dave stopped short by suggesting NAS (e.g., NFS, iSCSI) runs over Ethernet. There is also IP over Infiniband. I don’t believe NetApp plays Infiniband so that is likely the reason for the omission.

Dave continues:

The protocols are also different. NAS communicates at the file level, with requests like create-file-MyHomework.doc or read-file-Budget.xls. FC-SAN communicates at the block level, with requests over the wire like read-block-thirty-four or write-block-five-thousand-and-two.

What? NAS is either NFS or iSCSI—honestly. However, only NFS operates with requests like “read-file-Budget.xls”. But that is not the full story and herein comes the confusion when the topic of Oracle over NFS comes up. Dave has inadvertently contributed to the misunderstanding. Yes, an NFS client may indeed cause NFS to return an entire Excel spreadsheet, but that is certainly not how accesses to Oracle database files are conducted. I’ll state it simply, and concisely:

Oracle over NFS is a file positioning and read/write workload.

Oracle over NFS is not traditional “file serving.” Oracle on an NFS client does not fetch entire files. That would simply not function. In fact, Oracle over NFS couldn’t possibly have less in common with traditional “file serving.” It’s all about Direct I/O.

Direct I/O with NFS
Oracle running on an NFS client does not double buffer by using both an SGA and the NFS client page cache. All platforms (that matter) support Direct I/O for files in NFS mounts. To that end, the cache model is SGA->Storage Cache and nothing in between—and therefore none of the associated NFS client cache overhead. And as I’ve pointed out in many blog entries before, I only call something “Direct I/O” if it is real Direct I/O. That is, Direct I/O and concurrent I/O (no write ordering locks).

I/O Libraries
Oracle uses the same I/O libraries (in Oracle9i/Oracle10g) to access files in NFS mounts as it does for:

  • raw partitions
  • local file systems
  • block cluster file systems (e.g. GFS, PSFS, GPFS, OCFS2)
  • ASM over NFS
  • ASM on Raw Partitions

Oops, I almost forgot, there is also Oracle Disk Manager. So let me restate. When Oracle is not linked with an Oracle Disk Manager library or ASMLib, the same I/O calls are used for all of the storage options in the list I just provided.

So what’s the point? Well, the point I’m making is that Oracle behaves the same on NFS as it does on all the other storage options. Oracle simply positions within the files and reads or writes what’s there. No magic. But how does it perform?

The Performance is Glacial
There is a recent thread on about 10g RAC that wound up twisting through other topics including Oracle over NFS. When discussing the performance of Oracle over NFS, one participant in the thread stated his view bluntly:

And the performance will be glacial: I’ve done it.

Glacial? That is:
a. Of, relating to, or derived from a glacier.
b. Suggesting the extreme slowness of a glacier: Work proceeded at a glacial pace.

Let me see if I can redefine glacial using modern tested results with real computers, real software, and real storage. This is just a snippet, but it should put the term glacial in a proper light.

In the following screen shot, I list a simple script that contains commands to capture the cumulative physical I/O the instance has done since boot time followed with a simple PL/SQL block that performs full light-weight scans against a table followed by another peek at the cumulative physical I/O. For this test I was not able to come up with a huge amount of storage so I created and loaded a table with order entry history records—about 25GB worth of data. So that the test runs for a reasonable amount of time I scan the table 4 times using the simple PL/SQL block.

NOTE: You may have to right click-> view the image


The following screen shot shows that Oracle scanned 101GB in 466 seconds—223 MB/s scanning throughput. I forgot to mention, this is a DL585 with only 2 paths to storage. Before some slight reconfiguration I had to do I had 3 paths to storage where I was seeing 329MB/s—or about 97% linear scalability when considering the maximum payload on GbE is on the order of 114MB/s for this sort of workload.


NFS Overhead? Cheating is Naughty!
The following screen shot shows vmstat output taken during the full table scanning. It shows that the Kernel mode processor utilization when Oracle uses Direct I/O to scan NFS files falls consistently in range of 22%. That is not entirely NFS overhead by any means either.

Of course Oracle doesn’t know if its I/O is truly physical since there could be OS buffering. The screen shot also shows the memory usage on the server. There was 31 of 32GB free which means I wasn’t scanning a 25GB table that was cached in the OS page cache. This was real I/O going over a real wire.


For more information I recommend:

This paper about Scalable Fault Tolerant NAS and the NFS-related postings on my blog.


I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 2,937 other followers

Oracle ACE Program Status

Click It

website metrics

Fond Memories


All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.

%d bloggers like this: