Archive for the 'Oracle NFS' Category



Which Version Supports Oracle Over NFS? Oracle9i? Oracle10g?

Recently, a participant on the oracle-l email list asked the following question:

Per note 359515.1 nfs mounts are supported for datafiles with oracle 10. Does anyone know if the same applies for 9.2 databases?

I’d like to point out a correction. While Metalink note 359515.1 does cover Oracle10g-related information about NFS mount options for various platforms, that does not mean Oracle over NFS is limited to Oracle10g. In fact, that couldn’t be further from the truth. But before I get ahead of myself, I’d like to dive into the port-level aspect of this topic.

There is no single set of NFS mount options that works across all Oracle platforms. In spite of that fact, another participant on the oracle-l list replied to the original query with the following:

try :
rw,bg,vers=3,proto=tcp,hard,intr,rsize=32768,wsize=32768,forcedirectio

OK, the problem is that of the 6 platforms that support Oracle over NFS (Solaris, HP-UX, AIX, and Linux x86/x86_64/IA64), the forcedirectio NFS mount option is required only on Solaris and HP-UX. For this reason, I’ll point out that the best references for NFS mount options are Metalink note 359515.1 for Oracle10g and the NAS vendors’ documentation for Oracle9i.
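
To make the platform differences concrete, here is a rough sketch of what the mounts might look like on Linux versus Solaris. The filer name “nasfiler” and the paths are hypothetical, and the authoritative option lists live in Metalink note 359515.1 and your NAS vendor’s documentation, not here:

# Linux (e.g., RHEL 4): there is no forcedirectio mount option; Oracle requests
# direct I/O itself via the filesystemio_options parameter.
mount -t nfs -o rw,bg,hard,nointr,tcp,vers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0 \
  nasfiler:/vol/oradata /u02/oradata

# Solaris: forcedirectio is required on the mount itself.
mount -F nfs -o rw,bg,hard,nointr,proto=tcp,vers=3,rsize=32768,wsize=32768,forcedirectio \
  nasfiler:/vol/oradata /u02/oradata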

Oracle9i
Support for Oracle9i on NFS was a little spottier than it is for Oracle10g, but it was there. The now-defunct Oracle Storage Compatibility Program (OSCP) was very important in ensuring Oracle9i would work with the various NAS offerings. The Oracle server has evolved nicely in its handling of Oracle over NFS, to such a degree that the OSCP is no longer even necessary. That means Oracle10g is sufficiently robust to know whether the NFS mount you are feeding it is valid. That aside, the spotty Oracle9i support I allude to was mostly at the port level. That is, from one port to another, Oracle9i may or may not have required patches to operate efficiently and with integrity. One such example is the Oracle9i port to Linux, where Oracle patch 2448994 was necessary so that Oracle would open files on NFS mounts with the O_DIRECT flag of the open(2) call. But, imagine this, it was not that simple. No, you had to have all of the following correct:

  • The proper mount options specified by the NAS vendor
  • A version of the Linux kernel that supported O_DIRECT
  • Oracle patch 2448994
  • The correct setting for the filesystemio_options init.ora parameter

Whew, what a mess. Well, not that bad really. Allow me to explain. Both of the Linux 2.6 Enterprise kernels (RHEL 4, SuSE 9) support open(2) of NFS files with the O_DIRECT flag. So there is one requirement taken care of, because I assume nobody is using RHAS 2.1. The patch is simple to get from Metalink, and the correct setting of the filesystemio_options parameter is “directIO”. Finally, when it comes to mount options, NAS vendors do a pretty good job of documenting their recommendations. Netapp has an entire website dedicated to the topic of Oracle over NFS. HP OEMs the File Serving Utility for Oracle from PolyServe and documents the mount options in its User Guide as well as in this paper about Oracle on the HP Clustered Gateway NAS.
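
Pulling those four requirements together, here is a minimal sketch of what the Linux side of an Oracle9i deployment looked like, assuming a 2.6 Enterprise kernel, patch 2448994 already applied, an spfile in use, and mount options blessed by your NAS vendor (the filer name and paths are made up):

# Mount the datafile filesystem per the NAS vendor's recommendations.
mount -t nfs -o rw,bg,hard,nointr,tcp,vers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0 \
  nasfiler:/vol/oradata /u02/oradata

# Tell Oracle9i to open its files with O_DIRECT; the parameter is static, so
# the instance has to be restarted afterward.
sqlplus -s "/ as sysdba" <<'EOF'
ALTER SYSTEM SET filesystemio_options = 'directIO' SCOPE=SPFILE;
EOF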

Oracle10g
I’m not aware of any patches for any Oracle10g port to enable Oracle over NFS. I watch the Linux ports closely and I can state that canned, correct support for NFS is built in. If there were any Oracle10g patches required for NFS I think they’d be listed in Metalink 359515.1 which, at this time, does not specify any. As far as the Linux ports go, you simply mount the NFS filesystems correctly and set the init.ora parameter filesystemio_options=setall and you get both Direct I/O and asynchronous I/O.
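
For the Linux ports, a hedged sketch of that looks just like the Oracle9i example above, only with setall (again, take the real mount options from Metalink note 359515.1, not from my illustrations):

# With the NFS filesystems mounted as before, enable both direct I/O and
# asynchronous I/O. The parameter is static, so bounce the instance afterward.
sqlplus -s "/ as sysdba" <<'EOF'
ALTER SYSTEM SET filesystemio_options = 'SETALL' SCOPE=SPFILE;
EOF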

The Decommissioning of the Oracle Storage Compatibility Program

I’ve known about this since December 2006, but since the cat is out of the proverbial bag, I can finally blog about it.

Oracle has taken another step to break down Oracle-over-NFS adoption barriers. In the early days of Oracle supporting deployments of Oracle over NFS, the Oracle Storage Compatibility Program (OSCP) played a crucial role in ensuring a particular NAS device was suited to the needs of an Oracle database. Back then the model was immature, but a lot has changed since then. In short, if you are using Oracle over NFS, storage-related failure analysis is as straightforward as it is with a SAN. That is, it takes Oracle about the same amount of time to determine that the fault is in the storage (downwind of their software) with either architecture. To that end, Oracle has announced the decommissioning of the Oracle Storage Compatibility Program. The URL for the OSCP (click here, or here for a copy of the web page in the Wayback Machine) states the following (typos preserved):

At this time Oracle believes that these three specialized storage technologies are well understood by the customers, are very mature, and the Oracle technology requirements are well know. As of January, 2007, Oracle will no longer validate these products. We thank our partners for their contributions to the OSCP.

Lack of Choice Does Not Enable Success
It will be good for Oracle shops to have even more options to choose from when selecting a NAS provider as an Oracle over NFS platform. I look forward to other players emerging on the scene. This is not just Network Appliance’s party by any means. Although I don’t have first-hand experience, I’ve been told that the BlueArc Titan product is a very formidable platform for Oracle over NFS. And it should come as no surprise that I am opposed to vendor lock-in.

Oracle Over NFS—The Demise of the Fibre Channel SAN
That seems to be the conclusion people draw when Oracle over NFS comes up. That is not the case, so your massive investment in SAN infrastructure was not a poor choice. It was the best thing going at the time. If you have a formidable SAN, you would naturally use a SAN gateway to preserve your SAN investment while reducing the direct SAN connectivity headaches. In this model, deploying another commodity server is as simple as plugging in Cat 5 cabling and mounting an NFS filesystem exported from the SAN gateway. There are no raw partitions to fiddle with on the commodity server, no LUNs to carve out on the SAN, and most importantly, no FCP connectivity overhead. All the while, the data is stored in the SAN, so your existing backup strategy still applies. This model works for Linux, Solaris, HP-UX, and AIX.
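
As a sketch of just how little provisioning that is, bringing a new commodity server online amounts to something like the following (the gateway name “sangw”, the export path, and the mount options are hypothetical; use the options your gateway vendor documents):

mkdir -p /u02/oradata
mount -t nfs -o rw,bg,hard,nointr,tcp,vers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0 \
  sangw:/export/oradata /u02/oradata
# No LUNs to carve, no raw partitions, no HBA zoning. The data still lives on
# the SAN behind the gateway, so the existing backup strategy carries over.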

Oracle over NFS—Who Needs It Anyway?
The commodity computing paradigm is drastically different from the central-server approach we grew to know in the 1990s. You know, one or two huge servers connected to DAS or a SAN. It is very simple to run that little orange cabling from a single cabinet to a couple of switches. These days people throw around terms like grid without ever actually drawing a storage connectivity schematic. Oracle’s concept of a grid is, of course, a huge Real Application Clusters database spread out over a large number of commodity servers. Have you ever tried to build one with a Fibre Channel SAN? I’m not talking about those cases where you meet someone at an Oracle User Group who refers to his 3 clustered Linux servers running RAC as a grid. Oh, how I hate that! I’m talking about connecting, say, 50, 100, or 250 servers all running Oracle (some RAC, but mostly not) to a SAN. I’m talking about commodity computing in the Enterprise, but the model I’m discussing is so compelling it should warrant consideration from even the guy with the 3-node “grid”. I’m talking, again, about Oracle over NFS, the simplest connectivity and storage provisioning model available for Oracle in the commodity computing paradigm.

Storage Connectivity and Provisioning
Providing redundant storage paths for large numbers of commodity servers with Fibre Channel is too complex and too expensive. Many IT shops are spending more than the cost of each server just to provide redundant SAN storage paths, since each server needs 2 Host Bus Adaptors (or a dual-port HBA) and 2 ports in large director-class switches (at approximately USD $4,000 per port). These same servers are also fitted with Gigabit Ethernet. How many connectivity models do you want to deal with? Settle on NFS for Oracle and stick with bonded Gigabit Ethernet as the connectivity model. Very simple! With the decommissioning of the OSCP, Oracle is making the clear statement that Oracle over NFS is no longer an edge-case deployment model. I’d recommend giving it some thought.
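
For what it’s worth, an active-backup bond on a RHEL 4-era box is only a few lines of configuration. The device names and address below are hypothetical, and your distribution’s documentation is the real reference:

# Append to /etc/modprobe.conf so the bonding driver loads for bond0.
cat >> /etc/modprobe.conf <<'EOF'
alias bond0 bonding
options bond0 mode=active-backup miimon=100
EOF

# Define the bonded interface; eth1 and eth2 then get MASTER=bond0 and
# SLAVE=yes in their own ifcfg files.
cat > /etc/sysconfig/network-scripts/ifcfg-bond0 <<'EOF'
DEVICE=bond0
IPADDR=192.168.10.10
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
EOF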

Partition, or Real Application Clusters Will Not Work.

OK, that was a come-on title. I’ll admit it straight away. You might find this post interesting nonetheless. Some time back, Christo Kutrovsky made a blog entry on the Pythian site about buffer cache analysis for RAC. I meant to blog about the post, but never got around to it—until today.

Christo’s entry consisted of some RAC theory and a buffer cache contents SQL query. I admit I have not yet tested his script against any of my RAC databases. I intend to do so soon, but I can’t right now because they are all under test. However, I wanted to comment a bit on Christo’s take on RAC theory. But first I’d like to comment about a statement in Christo’s post. He wrote:

There’s a caveat however. You have to first put your application in RAC, then the query can tell you how well it runs.

Not that Christo is saying so, but please don’t get into the habit of using scripts against internal performance tables as a measure of how “well” things are running. Such scripts should be used as tools for approaching a known performance problem, a problem measured much closer to the user of the application. There are too many DBAs out there who run scripts way downwind of the application and, if they see metrics such as high cache hit ratios, rest on their laurels. That is bad mojo. It is entirely possible that even a script like Christo’s could give a very “bad reading” while application performance is satisfactory, and vice versa. OK, enough said.
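
For the curious, and purely as a sketch rather than Christo’s actual script, the sort of query involved pokes at gv$bh to see which objects each instance is caching. Run it as a DBA user on a test system and adjust to taste:

sqlplus -s "/ as sysdba" <<'EOF'
SELECT bh.inst_id, o.owner, o.object_name, COUNT(*) AS cached_blocks
FROM   gv$bh bh, dba_objects o
WHERE  o.data_object_id = bh.objd
AND    bh.status <> 'free'
GROUP  BY bh.inst_id, o.owner, o.object_name
ORDER  BY cached_blocks DESC;
EOF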

Application Partitioning with RAC
The basic premise Christo was trying to get across is that RAC works best when the applications accessing the instances are partitioned in such a way that they do not require cross-instance data shipping. Of course that is true, but what lengths do you really have to go to in order to get your money’s worth out of RAC? That is, we all recall how horrible block pings were with OPS. Or do we? See, most people who loathed the dreaded block ping in OPS thought the poison was in the disk I/O component of a ping, when in reality the poison was in the IPC (both inter- and intra-instance IPC). OK, what am I talking about? It was quite common for a block ping in OPS to take on the order of 200-250 milliseconds on a system where disk I/O was being serviced with respectable times like 10ms. Where did the time go? IPC.

Remembering the Ping
In OPS, when a shadow process needed a block from another instance, there was an astounding amount of IPC involved in getting the block from one instance to the other. In quick and dirty terms (this is just a brief overview of the life of a block ping), it consisted of the shadow process requesting the local LCK process to communicate with the remote LCK process, which in turn communicated with the DBWR process on that node. That DBWR process then flushed the required block (along with all the modified blocks covered by the same PCM lock) to disk. That DBWR then posted its local LCK, which in turn posted the LCK process back on the node where the original requesting shadow process was waiting. That LCK then posted the shadow process, and the shadow process read the block from disk. Whew. Note that at every IPC point, the act of messaging only makes the posted process runnable. It then waits in line for CPU in accordance with its mode and priority. Also, when DBWR was posted on the holding node, it was unlikely to have been idle, so the life of the block ping event also included some amount of time spent while DBWR finished the SGA flushing it was already doing when it got posted. All told, there were quite often some 20 points at which the processes involved were in runnable states. Considering the time quantum for scheduling is/was 10ms, you routinely got as much as 200ms of overhead on a block ping that was nothing but scheduling delay. What a drag.

What Does This Have To Do With RAC?
Christo’s post discusses divide-and-conquer-style RAC partitioning, and he is right. If you want RAC to perform perfectly for you, you have to make sure that RAC isn’t being used. Oh, he’s gone off the deep end again, you say. No, not really. What I’m saying is that if you completely partition your workload, then RAC is indeed not really being used. I’m not saying Christo is suggesting you have to do that. I am saying, however, that you don’t have to do that. This blog post is not just a shill for Cache Fusion, but folks, we are not talking about block pings here. Cache Fusion, even over Gigabit Ethernet, is actually quite efficient. Applications can scale fairly well with RAC without going to extreme partitioning efforts. I think the best message is that application partitioning should be looked at as a method of exploiting this exorbitantly priced stuff you bought. That is, in the same way we try to exploit the efficiencies gained by fundamental SMP cache-affinity principles, so should attempts be made to localize demand for tables and indexes (and other objects) to instances, when feasible. If it is not feasible to do any application partitioning, and RAC isn’t scaling for you, you have to get a bigger SMP. Sorry. How often do I see that? Strangely, not that often. Why?

Over-configuring
I can’t count how often I see production RAC instances running throughout an entire RAC cluster at processor utilization levels well below 50%. And I’m talking about RAC deployments where no attempt has been made to partition the application. These sites often don’t need to consider such deployment tactics because the performance they are getting meets their requirements. I do cringe and bite my tongue, however, when I see 2 instances of RAC in a two-node cluster, void of any application partitioning, running at, say, 40% processor utilization on each node. If no partitioning effort has been made, that means there is Cache Fusion (GCS/GES) traffic in play, and lots of it. Deployments like that are turning their GbE Cache Fusion interconnect into an extension of the system bus, if you will. If I were the administrator of such a setup, I’d ask Santa to scramble down the chimney and pack that entire workload into one server at roughly 80% utilization. But that’s just me. Oh, actually, packing two 40% RAC workloads back into a single server doesn’t necessarily produce 80% utilization. There is more to it than that. I’ll see if I can blog about that one too at some point.

What about High-Speed, Low-Latency Interconnects?
With OLTP, if the processors are saturated on the RAC instances you are trying to scale, a high-speed, low-latency interconnect will not buy you a thing. Sorry. I’ll blog about why in another post.

Final Thought
If you are one of the few out there that find yourself facing a total partitioning exercise with RAC, why not deploy a larger SMP instead? Comments?

Troubles with Oracle on NAS? Old Stuff Deployed?

In Vidya Bala’s blog post about Oracle on NAS, there is evidence of past problems with that NAS storage under older Linux distributions (e.g., SLES8) and older Oracle releases (e.g., Oracle9i). Most folks know I am a staunch proponent of Oracle on NAS and have blogged about it here and here. The most important thing to remember is that the noac mount option is no substitute for opening files with the O_DIRECT flag of the open(2) call.
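
A quick, admittedly rough way to check whether O_DIRECT is really in play on a Linux test system is to trace the file opens of a dedicated server (shadow) process and then have that session query a segment it has not touched yet. This assumes ORACLE_SID is set and a dedicated server connection exists; I would not do this on a production box:

# Grab one shadow process and watch its open(2) calls for the O_DIRECT flag.
SHADOW_PID=$(pgrep -f "oracle${ORACLE_SID}" | head -1)
strace -e trace=open -p "$SHADOW_PID" 2>&1 | grep O_DIRECT
# Datafile opens carrying O_DIRECT confirm direct I/O; if none appear, the
# stack is likely falling back to the page cache, and noac alone won't fix that.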

 

I’ve blogged that, in my opinion, the first production-quality stack for Oracle on NAS is Oracle10gR2 on 2.6 kernel releases. However, I can’t speak with authority on the legacy Unix capabilities in this space. I’ve got too much Linux around here.


