Archive for the 'NFS CFS ASM' Category

Automatic Databases Automatically Detect Storage Capabilities, Don’t They?

Doug Burns has started an interesting thread about the Oracle Database 11g PARALLEL_IO_CAP_ENABLED parameter in his blog entry about Parallel Query and Oracle Database 11g. Doug is discussing Oracle’s new concept of built-in I/O subsystem calibration, a concept aimed at making database instances more self-tuning. The idea is that Oracle is trying to make PQ more aware of the downwind I/O subsystem’s capability so that PQ doesn’t obliterate it with a flood of I/O. Yes, a kinder, gentler PQO.

I have to admit that I haven’t yet calibrated this calibration infrastructure. That is, I aim to measure the difference between what I know a given I/O subsystem is capable of and what DBMS_RESOURCE_MANAGER.CALIBRATE_IO thinks it is capable of. I’ll blog the findings of course.

In the meantime, I recommend you follow what Doug is up to.

A Really Boring Blog Entry
Nope, this is not just some “look at that other cool blog over there” post. At first glance I would hope that all the regular readers of my blog would wonder what value there is in throttling I/O all the way up in the database itself, given that there are several points at which I/O can be (and is) throttled downwind. For example, if the I/O is asynchronous, all operating systems have a maximum number of asynchronous I/O headers (the kernel structures used to track asynchronous I/Os) and other limiting factors on the number of outstanding asynchronous I/O requests. Likewise, SCSI kernel code is fitted with queues of fixed depth and so forth. So why then is Oracle doing this up in the database? The answer is that Oracle can run on a wide variety of I/O subsystem architectures and not all of these are accessed via traditional I/O system calls. Consider Direct NFS for instance.

With Direct NFS you get disk I/O implemented via the remote procedure call (RPC) interface. Basically, Oracle shoots the NFS commands directly at the NAS device as opposed to using the C library read/write routines on files in an NFS mount, which eventually filters down to the same thing anyway, but with more overhead. Indeed, there is throttling in the kernel for the servicing of RPC calls, as is the case with traditional disk I/O system calls, but I think you see the problem. Oracle is doing the heavy lifting that enables you to take advantage of a wide array of storage options, and not all of them are accessed with age-old traditional I/O libraries. And it’s not just DNFS. There is more coming down the pike, but I can’t talk about that stuff for several months given the gag order. If I could, it would be much easier for you to visualize the importance of DBMS_RESOURCE_MANAGER.CALIBRATE_IO. In the meantime, use your imagination. Think out of the box…way out of the box…

Databases are the Contents of Storage. Future Oracle DBAs Can Administer More. Why Would They Want To?

I’ve taken the following quote from this Oracle whitepaper about low cost storage:

A Database Storage Grid does not depend on flawless execution from its component storage arrays. Instead, it is designed to tolerate the failure of individual storage arrays.

In spite of the fact that the Resilient Low-Cost Storage Initiative program was decommissioned along with the Oracle Storage Compatibility Program, the concepts discussed in that paper should be treated as a barometer of the future of storage for Oracle databases, with two exceptions: 1) Fibre Channel is not the future and 2) there’s more to “the database” than just the database. What do I mean by point 2? Well, with features like SecureFiles, we aren’t just talking rows and columns any more and I doubt (but I don’t know) that SecureFiles is the end of that trend.

Future Oracle DBAs
Oracle DBAs of the future become even more critical to the enterprise since the current “stove-pipe” style IT organization will invariably change. In today’s IT shop, the application team talks to the DBA team who talks to the Sys Admin team who talks to the Storage Admin team. All this to get an application to store data on disk through an Oracle database. I think that will be the model that remains for lightly-featured products like MySQL and SQL Server, but Oracle aims for more. Yes, I’m only whetting your appetite but I will flesh out this topic over time. Here’s food for thought: Oracle DBAs should stop thinking their role in the model stops at the contents of the storage.

So while Chen Shapira may be worried that DBAs will get obviated, I’d predict instead that Oracle technology will become more full-featured at the storage level. Unlike the stock market where past performance is no indicator of future performance, Oracle has consistently brought to market features that were once considered too “low-level” to be in the domain of a Database vendor.

The IT industry is going through consolidation. I think we’ll see Enterprise-level IT roles go through some consolidation over time as well. DBAs who can wear more than “one hat” will be more valuable to the enterprise. Instead of thinking about “encroachment” from the low-end database products, think about your increased value proposition with Oracle features that enable this consolidation of IT roles. That is, if I’m reading the tea leaves correctly.

How to Win Friends and Influence People
Believe me, my positions on Fibre Channel have prompted some fairly vile emails in my inbox, especially the posts in my Manly Man SAN series. Folks, I don’t “have it out”, as they say, for the role of Storage Administrators. I just believe that the Oracle DBAs of today are on the cusp of being in control of more of the stack. Like I said, it seems today’s DBA responsibilities stop at the contents of the storage. That role fits the Fibre Channel paradigm quite well, but it makes little sense to me. I think Oracle DBAs are capable of more and will have more success when they have more control. Having said that, I encourage any of you DBAs who would love to be in more control of the storage to look at my post about the recent SAN-free Oracle Data Warehouse. Read that post and give considerable thought to the model it discusses. And give even more consideration to the cost savings it yields.

The Voices in My Head
Now my alter ego (who is a DBA, whereas I’m not) is asking, “Why would I want more control at the storage level?” I’ll try to answer him in blog posts, but perhaps some of you DBAs can share experiences where performance or availability problems were further exacerbated by finger pointing between you and the Storage Administration group.

Note to Storage Administrators
Please, please, do not fill my email box with vitriolic messages about the harmony today’s typical stove-pipe IT organization creates. I’m not here to start battles.

Let me share a thought that might help this whole thread make more sense. Let’s recall the days when an Oracle DBA and a System Administrator together (yet alone) were able to provide Oracle Database connectivity and processing for thousands of users without ever talking to a “Storage Group.” Do you folks remember when that was? I do. It was the days of Direct Attach Storage (DAS). The problem with that model was that it only took until about the late 1990s to run out of connectivity. Enter the Fibre Channel SAN. And since SANs are spokes attached to hubs of storage systems (SAN arrays), we wound up with a level of indirection between the Oracle server and its blocks on disk. Perhaps there are still some power DBAs that remember how life was with large numbers of DAS drives (hundreds). Perhaps they’ll recall the level of control they had back then. On the other hand, perhaps I’m going insane, but riddle me this (and feel free to quote me elsewhere):

Why is it that the industry needed SANs to get more than a few hundred disks attached to a high-end Oracle system in the late 1990s and yet today’s Oracle databases often reside on LUNs comprised of a handful of drives in a SAN?

The very thought of that twist of fate makes me feel like a fish flopping around on a hot sidewalk. Do you remember my post about capacity versus spindles? Oh, right, SAN cache makes that all better. Uh huh.

Am I saying the future is DAS? No. Can I tell you now exactly what model I’m alluding to? Not yet, but I enjoy putting out a little food for thought.

What Is Good Throughput With Oracle Over NFS?

The comment thread on my blog entry about the simplicity of NAS for Oracle got me thinking. I can’t count how many times I’ve seen people ask the following question:

Is N MB/s good throughput for Oracle over NFS?

Feel free to plug in any value you’d like for N. I’ve seen people ask if 40MB/s is acceptable. I’ve seen 60, 80, name it. I’ve seen it.

And The Answer Is…
Let me answer this question here and now. The acceptable throughput for Oracle over NFS is full wire capacity. Full stop! With Gigabit Ethernet and large Oracle transfers, that is pretty close to 110MB/s. There are some squeak factors that might bump that number one way or the other but only just a bit. Even with the most hasty of setups, you should expect very close to 100MB/s straight out of the box-per network path. I cover examples of this in depth in this HP whitepaper about Oracle over NFS.
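As a rough sanity check on the 110MB/s figure, the sketch below estimates the TCP payload ceiling of a single GbE path from standard Ethernet, IPv4, and TCP framing sizes. The 1500-byte MTU and the TCP timestamp option are assumptions about the setup, and NFS/RPC overhead is not modeled, so treat the result as an upper bound rather than a measurement.

```python
# Rough payload ceiling for TCP over Gigabit Ethernet.
# Assumptions: 1500-byte MTU, IPv4, TCP with the timestamp option.
# NFS/RPC headers are NOT modeled, so real throughput lands a bit lower.

LINK_BITS_PER_SEC = 1_000_000_000   # Gigabit Ethernet line rate
ETH_OVERHEAD = 8 + 14 + 4 + 12      # preamble + header + FCS + inter-frame gap
IP_TCP_HEADERS = 20 + 20 + 12       # IPv4 + TCP + TCP timestamp option


def payload_bytes_per_sec(mtu=1500):
    """Bytes of application payload per second a saturated link can carry."""
    wire_bytes = mtu + ETH_OVERHEAD      # what each frame costs on the wire
    payload = mtu - IP_TCP_HEADERS       # what each frame delivers
    return LINK_BITS_PER_SEC / 8 * payload / wire_bytes


rate = payload_bytes_per_sec()
print(f"{rate / 1_000_000:.1f} MB/s ({rate / (1024 * 1024):.1f} MiB/s)")
```

Whether a tool reports decimal MB or binary MiB moves the number by about 5%, which is worth keeping in mind when comparing results against the ceiling.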

The steps to a clean bill of health are really very simple. First, make sure Oracle is performing large I/Os. Good examples of this are tablespace CCF (create contiguous file) and full table scans with port-maximum multi-block reads. Once you verify Oracle is performing large I/Os, do the math. If you are not close to 100MB/s on a GbE network path, something is wrong. Determining what’s wrong is another blog entry. I want to capitalize on this nagging question about expectations. I reiterate (quoting myself):

Oracle will get line speed over NFS, unless something is ill-configured.

Initial Readings
I prefer to test for wire-speed before Oracle is loaded. The problem is that you need to mimic Oracle’s I/O. In this case I mean Direct I/O. Let’s dig into this one a bit.

I need something like a dd(1) tool that does O_DIRECT opens. This should be simple enough. I’ll just go get a copy of the coreutils package that has O_DIRECT tools like dd(1) and tar(1). So here goes:

[root@tmr6s15 DD]# ls ../coreutils-4.5.3-41.i386.rpm
[root@tmr6s15 DD]# rpm2cpio < ../coreutils-4.5.3-41.i386.rpm | cpio -idm
11517 blocks
[root@tmr6s15 DD]# ls
bin  etc  usr
[root@tmr6s15 DD]# cd bin
[root@tmr6s15 bin]# ls -l dd
-rwxr-xr-x  1 root root 34836 Mar  4  2005 dd
[root@tmr6s15 bin]# ldd dd
         =>  (0xffffe000)
         => /lib/tls/ (0x00805000)
        /lib/ (0x007ec000)

I have an NFS mount exported from an HP EFS Clustered Gateway (formerly PolyServe):

 $ ls -l /oradata2
total 8388608
-rw-r--r--  1 root root 4294967296 Aug 31 10:15 file1
-rw-r--r--  1 root root 4294967296 Aug 31 10:18 file2
$ mount | grep oradata2
voradata2:/oradata2 on /oradata2 type nfs

Let’s see what the dd(1) can do reading a 4GB file and over-writing another 4GB file:

 $ time ./dd --o_direct=1048576,1048576 if=/oradata2/file1 of=/oradata2/file2 conv=notrunc
4096+0 records in
4096+0 records out

real    1m32.274s
user    0m3.681s
sys     0m8.057s

Test File Over-writing
What’s this bit about over-writing? I recommend using conv=notrunc when testing write speed. If you don’t, the file will be truncated and you’ll be testing write speeds burdened with file growth. Since Oracle writes the contents of files (unless creating or extending a datafile), it makes no sense to test writes to a file that is growing. Besides, the goal is to test the throughput of O_DIRECT I/O via NFS, not the filer’s ability to grow a file. So what did we get? Well, we transferred 8GB (4GB in, 4GB out) and did so in 92 seconds. That’s 89MB/s and honestly, for a single path I would actually accept that since I have done absolutely no specialized tuning whatsoever. This is straight out of the box as they say. The problem is that I know 89MB/s is not my typical performance for one of my standard deployments. What’s wrong?
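For anyone who wants to redo the math on their own runs, here is a minimal sketch of the arithmetic behind the 89MB/s figure. It counts bytes read plus bytes written, as the paragraph above does, and treats 1MB as 1,048,576 bytes to match the dd(1) block size; the function name is made up for illustration.

```python
MIB = 1024 * 1024


def throughput_mib_per_sec(bytes_moved, elapsed_seconds):
    """Total bytes moved (read plus written) per second, in MiB/s."""
    return bytes_moved / MIB / elapsed_seconds


# Figures from the dd(1) run above: 4096 one-megabyte records in, 4096 out,
# and 1m32.274s of wall-clock ("real") time.
moved = 2 * 4096 * MIB
elapsed = 92.274

print(f"{throughput_mib_per_sec(moved, elapsed):.0f} MB/s")
```

Use the "real" time from time(1) here, not user or sys, since the point is what the wire delivered end to end.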

The dd(1) package supplied with the coreutils has a lot more in mind than O_DIRECT over NFS. In fact, it was developed to help OCFS1 deal with early cache-coherency problems. It turned out that mixing direct and non-direct I/O on OCFS was a really bad thing. No matter, that was then and this is now. Let’s take a look at what this dd(1) tool is doing:

$ strace -c ./dd --o_direct=1048576,1048576 if=/oradata2/file1 of=/oradata2/file2 conv=notrunc
4096+0 records in
4096+0 records out
Process 32720 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 56.76    4.321097        1054      4100         1 read
 22.31    1.698448         415      4096           fstatfs
 10.79    0.821484         100      8197           munmap
  9.52    0.725123         177      4102           write
  0.44    0.033658           4      8204           mmap
  0.16    0.011939           3      4096           fcntl
  0.02    0.001265          70        18        12 open
  0.00    0.000178          22         8           close
  0.00    0.000113          23         5           fstat
  0.00    0.000091          91         1           execve
  0.00    0.000015           2         8           rt_sigaction
  0.00    0.000007           2         3           brk
  0.00    0.000006           3         2           mprotect
  0.00    0.000004           4         1         1 access
  0.00    0.000002           2         1           uname
  0.00    0.000002           2         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    7.613432                 32843        14 total

Eek! I’ve paid for a 1:1 fstatfs(2) and fcntl(2) per read(2) and a mmap(2)/munmap(2) call for every read(2)/write(2) pair! Well, that wouldn’t be a big deal on OCFS since fstatfs(2) is extremely cheap and the structure contents only change when filesystem attributes change. The mmap(2)/munmap(2) costs a bit, sure, but on a local filesystem it would be very cheap. What I’m saying is that this additional call overhead wouldn’t weigh down OCFS throughput with the --o_direct flag, but I’m not blogging about OCFS. With NFS, this additional call overhead is way too expensive. All is not lost.
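To put a number on that overhead, the sketch below adds up the seconds strace attributed to the four calls that are not read(2) or write(2) and compares them with the total row. The values are copied from the summary above. As far as I know, strace -c reports time spent in the kernel rather than wall-clock time, so this probably understates what the extra calls cost once NFS round trips are involved.

```python
# Seconds per syscall, copied from the strace -c summary above.
seconds = {
    "read": 4.321097,
    "fstatfs": 1.698448,
    "munmap": 0.821484,
    "write": 0.725123,
    "mmap": 0.033658,
    "fcntl": 0.011939,
}
TOTAL = 7.613432  # the "total" row of the same summary

# Calls that a plain read/write copy loop would not need.
EXTRA = ("fstatfs", "munmap", "mmap", "fcntl")


def extra_share(per_call_seconds, total, extra=EXTRA):
    """Fraction of the accounted syscall time spent in the extra calls."""
    return sum(per_call_seconds[name] for name in extra) / total


print(f"{extra_share(seconds, TOTAL):.1%} of syscall time is overhead")
```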

I have my own coreutils dd(1) that implements O_DIRECT open(2). You can do this too; it is just GNU after all. With this custom GNU coreutils dd(1) I have, the call profile is nothing more than read(2) and write(2) back to back. Oh, I forgot to mention, the stock dd(1) doesn’t work with /dev/null or /dev/zero since it tries to throw an O_DIRECT open(2) at those devices, which makes the tool croak. My dd(1) checks if in or out is /dev/null or /dev/zero and omits the O_DIRECT for that side of the operation. Anyway, here is what this tool got:

$ time dd_direct if=/oradata2/file1 of=/oradata2/file2 bs=1024k conv=notrunc
4096+0 records in
4096+0 records out

real    1m20.162s
user    0m0.008s
sys     0m1.458s

Right, that’s more like it: 80 seconds or 102 MB/s. Shaving those additional calls off brought throughput up 15%.
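For readers who would rather not patch coreutils, the following is a minimal Python sketch of the same idea. It is not my dd_direct tool, just an illustration under stated assumptions: a Linux host (os.O_DIRECT), a file size that is a multiple of the block size, and a page-aligned buffer obtained from an anonymous mmap. The names open_flags and copy_notrunc are made up for this example.

```python
import mmap
import os

# O_DIRECT is Linux-specific; fall back to 0 so the sketch imports elsewhere.
O_DIRECT = getattr(os, "O_DIRECT", 0)
SKIP_DIRECT = {"/dev/null", "/dev/zero"}  # devices that reject O_DIRECT opens


def open_flags(path, base_flags, direct=True):
    """Add O_DIRECT unless this side of the copy is /dev/null or /dev/zero."""
    if direct and path not in SKIP_DIRECT:
        return base_flags | O_DIRECT
    return base_flags


def copy_notrunc(src, dst, block_size=1024 * 1024, direct=True):
    """Copy src over an existing dst in block_size chunks, never truncating."""
    buf = mmap.mmap(-1, block_size)  # anonymous mmap memory is page-aligned
    view = memoryview(buf)
    fd_in = os.open(src, open_flags(src, os.O_RDONLY, direct))
    try:
        # No O_TRUNC and no O_CREAT: same intent as conv=notrunc.
        fd_out = os.open(dst, open_flags(dst, os.O_WRONLY, direct))
        try:
            copied = 0
            while True:
                n = os.readv(fd_in, [buf])  # one read(2)-style call per block
                if n == 0:
                    break
                done = 0
                while done < n:  # retry on short writes
                    done += os.write(fd_out, view[done:n])
                copied += n
            return copied
        finally:
            os.close(fd_out)
    finally:
        os.close(fd_in)
        view.release()
        buf.close()
```

Something like copy_notrunc("/oradata2/file1", "/oradata2/file2") would mirror the run above. Pass direct=False to try the loop on a filesystem that does not accept O_DIRECT.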

What About Bonding/Teaming NICs?
Bonding NICs is a totally different story as I point out somewhat in this paper about Oracle Database 11g Direct NFS. You can get very mixed results if the network interface over which you send NFS traffic is bonded. I’ve seen 100% scalability of NICs in a bonded pair and I’ve seen as low as 70%. If you are testing a bonded pair, set your expectations accordingly.

Oracle11g: Oracle Inventory On Shared Storage. Don’t Bother Trying To Install 11g RAC That Way.

A few days ago there was a thread on the oracle-l email list about adding nodes in an Oracle Database 10g Real Application Clusters environment. The original post showed a problem that Alex Gorbachev reports he’s only seen with shared Oracle Home installs. I found that odd because I’ve done dozens upon dozens of RAC installs on shared Oracle Homes with both CFS and NFS and haven’t seen this error:

Remote 'UpdateNodeList' failed on node: 'af-xxx2'. Refer to
for details.
You can manually re-run the following command on the failed nodes after the
/apps/oracle/product/10.2/oui/bin/runInstaller -updateNodeList -noClusterEnabled
ORACLE_HOME=/apps/oracle/product/10.2 CLUSTER_NODES=af-xxx1,af-xxx2,af-xxx6
CRS=false "INVENTORY_LOCATION=/apps/oracle/oraInventory" LOCAL_NODE=
<node on which command is to be run>

I never have any problems with shared Oracle Home and I blog about the topic a lot, as can be seen in this list of posts. Nonetheless, Alex pointed out that the error has to do with the Oracle Inventory being on a shared filesystem. Another list participant followed up with the following comment about placing the inventory on a shared drive:

Sharing the oraInventory across nodes is not a good practice in my opinion. It runs counter to the whole concept of redundancy in an HA configuration and RAC was not written to support it.

Well, the Oracle Inventory is not a RAC concept, it is an Oracle Universal Installer concept, but I think I know what this poster was saying. However, the topic at hand is shared Oracle Home. When people use the term shared Oracle Home, they don’t mean shared ORACLE_BASE, they mean shared Oracle Home. Nonetheless, I have routinely shared the 10g inventory without problems, but then my software environments might not be as complex as those maintained by the poster of this comment.

Shared Inventory with Oracle Database 11g
No can do! Well, sort of. Today I was installing 11g RAC on one of my RHEL 4 x86 clusters. In the fine form of not practicing what I preach, I mistakenly pointed Oracle Universal Installer to a shared location (NFS) for the inventory when I was installing CRS. I got CRS installed just fine on 2 nodes and proceeded to install the database with the RAC option. It didn’t take long for OUI to complain as follows:


Ugh. This is just a test cluster that I need to set up quick and dirty. So I figured I’d just change the contents of /etc/oraInst.loc to point to some new non-shared location. Aren’t I crafty? Well, that got me past the error, but without an inventory with CRS in it, Oracle11g OUI does not detect the cluster during the database install! No node selection screen, no RAC.

I proceeded to blow away all the CRS stuff (ORA_CRS_HOME, inittab entries, /etc/oracle/* and /etc/oraInst.loc) and reinstalled CRS using a non-shared locale for the inventory. The CRS install went fine and subsequently OUI detected the cluster when I went to install the database.

This is a significant change from 10g where the inventory content regarding CRS was not needed for anything. With 10g, the cluster is detected based on what /etc/oracle/ocr.loc tells OUI.
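A pre-install check along these lines would have saved me the reinstall. The sketch below is only an illustration: it assumes /etc/oraInst.loc holds an inventory_loc=... line and that the mount table is in /proc/mounts format, and the set of "shared" filesystem types is my assumption (extend it with whatever CFS you run). The function names and the sample mount lines are hypothetical.

```python
SHARED_FSTYPES = {"nfs", "nfs4"}  # assumption: add your cluster filesystem type


def inventory_loc(orainst_text):
    """Return the inventory_loc value from oraInst.loc text, or None."""
    for line in orainst_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "inventory_loc":
            return value.strip()
    return None


def fstype_for(path, mounts_text):
    """Filesystem type of the longest mount point that contains path."""
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fstype = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fstype
    return best_type


def inventory_is_shared(orainst_text, mounts_text):
    """True if the inventory directory sits on a shared filesystem type."""
    loc = inventory_loc(orainst_text)
    return loc is not None and fstype_for(loc, mounts_text) in SHARED_FSTYPES


# Hypothetical sample data; on a real node read /etc/oraInst.loc and /proc/mounts.
sample_orainst = "inventory_loc=/apps/oracle/oraInventory\ninst_group=oinstall\n"
sample_mounts = "/dev/sda1 / ext3 rw 0 0\nfiler:/apps /apps nfs rw 0 0\n"
print(inventory_is_shared(sample_orainst, sample_mounts))
```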

Shared Oracle Home is an option, but shared Oracle Home means shared Oracle Home, not shared Oracle Inventory. Oracle11g enforces this best practice nicely!

Manly Men Only Deploy Oracle with Fibre Channel – Part VIII. After All, Oracle Doesn’t Support Async I/O on NFS

In the comment section of my recent post about Tim Hall’s excellent NFS step-by-step Linux RAC install Guide, Tim came full circle to ask a question about asynchronous I/O on NFS. He wrote:

What do you set your filesystemio_options init.ora parameter to when using Oracle over NFS?

Based on what you’ve written before I know NFS supports direct I/O, but I’m struggling to find a specific statement about NFS and asynchronous I/O. So should I use:




My reply to that was going to remind you folks about my recent rant about old Linux distributions combined with Oracle over NFS.  That is, the answer is, “it depends.” It depends on whether you are running a reasonable Linux distribution. But, Tim quickly followed up his query with:

I found my answer. Asynchronous I/O is not supported on NFS:

Bummer, I didn’t get to answer it.

Word To The Wise
Don’t use old Linux stuff with NAS if you want to do Oracle over NFS. Metalink 279069.1 provides a clear picture as to why I say that. It points out a couple of important things:

1. RHEL 4 U4 and EL4 both support asynchronous I/O on NFS mounts. That makes me so happy because I’ve been doing asynchronous I/O on NFS mounts with Oracle10gR2 for about 16 months. Unfortunately, ML 279069.1 incorrectly states that the critical fix for Oracle async I/O on NFS is U4, when in fact the specific bug (Bugzilla 161362) was fixed in RHEL4 U3 as seen in this Red Hat Advisory from March 2006.

2. Asynchronous I/O on NFS was not supported on any release prior to RHEL4. That’s fine with me because I wouldn’t use any Linux release prior to the 2.6 kernels to support Oracle over NFS!

The Oracle documentation on the matter was correct since it was produced long before there was OS support for asynchronous I/O on Linux for Oracle over NFS. Metalink 279069.1 is partly correct: it states support for asynchronous I/O on systems that have the fix for Bugzilla 161363, but it incorrectly suggests that U4 is the requisite release for that fix. It isn’t; the bug was fixed in U3. And yes, I get really good performance with the following initialization parameter set and have for about 16 months:

filesystemio_options = setall
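For anyone unsure what setall buys you, the sketch below maps the parameter’s values to the I/O modes they enable, as I understand the Oracle documentation: setall turns on both direct and asynchronous I/O. The helper name io_modes and the init.ora parsing are illustrative assumptions, not an Oracle API.

```python
# Values of filesystemio_options and the I/O modes each one enables.
FILESYSTEMIO_OPTIONS = {
    "none":     {"direct": False, "async": False},
    "directio": {"direct": True,  "async": False},
    "asynch":   {"direct": False, "async": True},
    "setall":   {"direct": True,  "async": True},
}


def io_modes(init_ora_text):
    """Return the modes implied by filesystemio_options, or None if unset."""
    for line in init_ora_text.splitlines():
        line = line.split("#", 1)[0]          # drop trailing comments
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "filesystemio_options":
            return FILESYSTEMIO_OPTIONS[value.strip().strip("'\"").lower()]
    return None


print(io_modes("filesystemio_options = setall"))
```

Returning None when the parameter is absent is deliberate, since the default differs by platform.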

Manly Man Post Script
Always remember, the Manly Man series is tongue-in-cheek.  Oracle over NFS with Async I/O on the other hand isn’t.

Manly Men Only Deploy Oracle with Fibre Channel – Part VII. A Very Helpful Step-by-Step RAC Install Guide for NFS

Tim Hall has stepped up to the plate to document a step-by-step recipe for setting up Oracle10g RAC on NFS mounts. In Tim’s blog entry, he points out that for testing and training purposes it is true that you can simply export some Ext3 filesystem from a Linux server and use it for all things Oracle. Tim only had 2 systems, so what he did was use one of the servers as the NFS server. The NFS server exported a filesystem and both the servers mounted the filesystem. In this model, you have 2 NFS clients and one is acting as both an NFS client and an NFS server.

This is the link to Tim’s excellent step-by-step guide.

How Simple

If you’ve ever had a difficult time getting RAC going, I think you’d be more than happy with how simple it is with NFS. Working through Tim’s guide on a couple of low-end test servers would prove that out.

Recently I blogged about the fact that most RAC difficulties are in fact storage difficulties. That is not the case with NFS/NAS.

Thanks Tim!

Manly Men Only Deploy Oracle with Fibre Channel – Part VI. Introducing Oracle11g Direct NFS!

Since December 2006, I’ve been testing Oracle11g NAS capabilities with Oracle’s revolutionary Direct NFS feature. This is a fantastic feature. Let me explain. As I’ve laboriously pointed out in the Manly Man Series, NFS makes life much simpler in the commodity computing paradigm. Oracle11g takes the value proposition further with Direct NFS. I co-authored Oracle’s paper on the topic:

Here is a link to the paper.

Here is a link to the joint Oracle/HP news advisory.

What Isn’t Clearly Spelled Out. Windows Too?
Windows has no NFS in spite of stuff like SFU and Hummingbird. That doesn’t stop Oracle. With Oracle11g, you can mount directories from the NAS device as CIFS shares and Oracle will access them with high availability and performance via Direct NFS. No, not CIFS, Direct NFS. The mounts only need to be visible as CIFS shares during instance startup.

Who Cares?
Anyone that likes simplicity and cost savings.

The World’s Largest Installation of Oracle Databases
…is Oracle’s On Demand hosting datacenter in Austin, TX. Folks, that is a NAS shop. They aren’t stupid!

Quote Me

The Oracle11g Direct NFS feature is another classic example of Oracle implementing features that offer choices in the Enterprise data center. Storage technologies, such as Tiered and Clustered storage (e.g., NetApp OnTAP GX, HP Clustered Gateway), give customers choices, yet Oracle is the only commercial database vendor that has done the heavy lifting to make their product work extremely well with NFS. With Direct NFS we get a single, unified connectivity model for both storage and networking and save the cost associated with Fibre Channel. With built-in multi-path I/O for both performance and availability, we have no worries about I/O bottlenecks. Moreover, Oracle Direct NFS supports running Oracle on Windows servers accessing databases stored in NAS devices, even though Windows has no native support for NFS! Finally, simple, inexpensive storage connectivity and provisioning for all platforms that matter in the Grid Computing era!


I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.
