Manly Men Only Deploy Oracle with Fibre Channel – Part IV. SANs are Simple, RAC is Difficult! | Kevin Closson's Blog: Platforms, Databases and Storage


Several months back I made a blog entry about the RAC poll put together by Jared Still. The poll can be found here. Thus far there have been about 150 participants through the poll—best I can tell. Some of the things I find interesting about the results are:

1. Availability was cited 46% of the time as the motivating factor for deploying RAC whereas scalability counted for 37%.

2. Some 46% of the participants state that RAC has met between 75% and 100% of their expectations.

3. A slim majority of participants (52%) say they’d stay with RAC given the choice to revert to non-RAC.

4. 52% of the deployments are Linux (42% Red Hat, 6% Oracle Enterprise Linux, 4% SuSE) and 34% are using the major Legacy Unix offerings (Solaris 17%, AIX 11%, HP-UX 6%).

5. 84% of the deployments are using block storage (e.g., FCP, iSCSI) with 42% of all respondents using ASM on block storage. Nearly one quarter of the respondents say they use a CFS. Only 13% use file storage (NAS via NFS).

Surveys often make for tough cipherin’. It sure would be interesting to see which of the 52% that use Linux also state they’d stay with RAC given the choice to revert or re-deploy with a non-RAC setup. Could they all have said they’d stick with RAC? Point 1 above is also interesting because Oracle markets RAC as a prime ingredient for availability as per MAA.

Of course point 5 is very interesting to me.

RAC is Simple…on Simple Storage
We are talking about RAC here, so the 84% from point 5 above get to endure the Storage Buffet. On the other hand, the 24% of the block storage deployments that layered a CFS over the raw partitions didn’t have it as bad, but the rest of them had to piece together the storage aspects of their RAC setup. That is, they had to figure out what to do with the clusterware files, database, Oracle Home and so forth. The problem with CFS is that there is no one CFS that covers all platforms. That war was fought and lost. NFS on the other hand is ubiquitous and works nicely for RAC. On that note, an email came in to my inbox last Friday on this very topic. The author of that email said:

[…] we did quite a lot of tests in the summer last year and figured out that indeed using Oracle/NFS can make a very good combination (many at [COMPANY XYZ] were spectical, I had no opinion as I had never used it, I wanted to see the fact). So I have convinced our management to go the NFS way (performance ok for the workload under question, way simpler management).

[…] The production setup (46 nodes, some very active, some almost idle accessing 6 NAS “heads”) does its job with satisfying performance […]

What do I see in this email? NFS works well enough for this company that they have deployed 46 nodes—but that’s not all. I pay particular attention to the 3 most important words in that quote: “way simpler management.”

Storage Makes or Breaks Many RAC Deployments
I watched intently as Charles Schultz detailed his first foray into RAC. First, I’ll point out that Charles and I had an email side-bar conversation on this topic. He is aware that I intended to weave his RAC experience into a blog entry of my own. So what’s there to blog about? Well, I’ll just come right out and say it: RAC is usually only difficult when difficult storage is used. How can I say that? Let’s consider Charles’ situation.

First, Charles is an Oracle Certified Master who has no small amount of exposure to large Oracle environments. Charles points out on his blog that the environment they were trying to deploy RAC into has some 150 or more databases consuming some 10TB of storage! That means Charles is no slouch. And being the professional he is, Charles points out that he took specialized RAC training to prepare for the task of deploying Oracle in their environment. So why did Charles struggle with setting up a 2-node RAC cluster to the point of making a post to the oracle-l email list for assistance? The answer is simply that the storage wasn’t simple.

It turned out that Charles’ “RAC difficulty” wasn’t even RAC. I assert that the vast majority of what is termed “RAC difficulty” isn’t RAC at all, but the platform or storage instead. By platform I mean Linux RPM dependencies and by storage I mean SAN madness. Charles’ difficulties boiled down to Linux FCP multipathing issues. Specifically, multipathing was causing ASM to see multiple entries for each LUN. I made the following comment on Charles’ blog:

Hmm, RHEL4 and two nodes. Things should not be that difficult. I think what you have is more on your hands than RAC. I’ve seen OCFS2, and ASM [in Charles’ blog thread]. That means you also have simple raw disks for OCR/CSS and since this is Dell, is my guess right that you have EMC storage with PowerPath?

Lots on your plate. You know me, I’d say NAS…

Ok, I’m sorry for SPAMing your site, Charles, but your situation is precisely what I talk about. You are a Certified Master who has also been to specific RAC training and you are experiencing this much difficulty on a 2 node cluster using a modern Linux distro. Further, most of your problems seem to be storage related. I think that all speaks volumes.

Charles replied with:

[…] I agree whole-heartedly with your statements; my boss made the same observations after we had already sunk over 40 FTE of 2 highly skilled DBAs plunking around with the installation.

If I read that correctly, Charles and a colleague spent a week trying to work this stuff out and Charles is certainly not alone in these types of situations that generally get chalked up as “RAC problems.” There was a lengthy thread on oracle-l about very similar circumstances not that long ago.

Back To The Poll
It has been my experience that most RAC difficulties are storage related—specifically the storage presentation. As point 5 in the poll above shows, some 84% of the respondents had to deal with raw partitions at one time or another. Indeed, even with CFS, you have to get the raw partitions visible and like-named on each node of the cluster before you can create a filesystem. If I hear of one more RAC deployment falling prey to storage difficulties, I’ll…

Ah, forget that. I use the following mount options on Linux RAC NFS clients:

rw,bg,hard,nointr,tcp,vers=3,timeo=300,rsize=32768,wsize=32768,actimeo=0

and I generally widen up a few kernel tunables when using Oracle over NFS:

net.core.rmem_default = 524288
net.core.wmem_default = 524288
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.ipfrag_high_thresh=524288
net.ipv4.ipfrag_low_thresh=393216
net.ipv4.tcp_rmem=4096 524288 16777216
net.ipv4.tcp_wmem=4096 524288 16777216
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_sack=0
net.ipv4.tcp_window_scaling=1
net.core.optmem_max=524287
net.core.netdev_max_backlog=2500
sunrpc.tcp_slot_table_entries=128
sunrpc.udp_slot_table_entries=128
net.ipv4.tcp_mem=16384 16384 16384
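A minimal Python sketch for checking how far a running Linux box is from a list like the one above. It assumes the usual convention that a sysctl key such as net.core.rmem_max is readable at /proc/sys/net/core/rmem_max. The function names and the `root` parameter are invented for this illustration, and only a few of the settings are repeated here.

```python
from pathlib import Path

# A few of the settings listed above; spacing around "=" varies, so parse loosely.
WANTED = """
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem=4096 524288 16777216
sunrpc.tcp_slot_table_entries=128
"""

def parse_sysctl(text):
    """Return {key: [values]} from sysctl.conf-style 'key = value' lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.split()
    return settings

def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under /proc/sys (assumed layout)."""
    return Path(root, *key.split("."))

def diff_against_running(settings, root="/proc/sys"):
    """Return {key: (current, wanted)} for every key whose live value differs."""
    mismatches = {}
    for key, wanted in settings.items():
        try:
            current = sysctl_path(key, root).read_text().split()
        except OSError:
            current = None  # key not present, e.g. the sunrpc module is not loaded
        if current != wanted:
            mismatches[key] = (current, wanted)
    return mismatches
```

Calling `diff_against_running(parse_sysctl(WANTED))` on the NFS client returns an empty dict when every listed value is already in effect.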

Once the filesystem(s) is/are mounted, I have 100% of my storage requirements for RAC taken care of. Most important, however, is not to forget Direct I/O when using NFS, so I set the init.ora parameter filesystemio_options as follows:

filesystemio_options=setall

Life is an unending series of choices. Choosing between simple or difficult storage connectivity and provisioning is one of them. If you overhear someone lamenting about how difficult “RAC” is, ask them how they like their block storage (FCP, iSCSI).

26 Responses to “Manly Men Only Deploy Oracle with Fibre Channel – Part IV. SANs are Simple, RAC is Difficult!”


1 prodlife July 9, 2007 at 9:37 pm

Kevin,

Both Oracle and Netapp recommended using timeo=600 for the NFS mount. Why did you decide to use a lower threshold?

Also, RAC installation and administration guides give very specific guidelines for the kernel parameters, which are very different from what you use. Some of the parameters you tune are not even mentioned in the Oracle books. Is there an “advanced RAC tuning” guide where I can read more on tuning these parameters?

Thanks for the excellent article.

2 kevinclosson July 9, 2007 at 11:03 pm

prodlife,

You are right about the timeo=600, that is indeed the prescribed value. I should be more careful in stating what I happen to use because I don’t intend to challenge what Oracle or NetApp recommend. In short, I don’t remember when I started using 30 seconds, but to be honest I’m not sure I’d actually want to hear back from a request that took any longer than that anyway! 🙂

I plan to post performance measurements of the generic sysctl settings Oracle and others recommend compared to these that I set as a result of my testing. In the meantime, by all means, set what Oracle recommends!
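For readers wondering how timeo=300 becomes "30 seconds": per the Linux nfs(5) man page, timeo is expressed in tenths of a second. A small Python sketch makes the conversion explicit. The helper names are invented for this illustration.

```python
def parse_mount_options(option_string):
    """Split a comma-separated NFS option string into a dict."""
    options = {}
    for item in option_string.split(","):
        key, sep, value = item.partition("=")
        options[key] = value if sep else True  # bare flags such as "hard" become True
    return options

def timeo_seconds(options):
    """Convert the timeo option (tenths of a second) to seconds."""
    return int(options["timeo"]) / 10

blog_opts = parse_mount_options(
    "rw,bg,hard,nointr,tcp,vers=3,timeo=300,rsize=32768,wsize=32768,actimeo=0"
)
print(timeo_seconds(blog_opts))                         # 30.0
print(timeo_seconds(parse_mount_options("timeo=600")))  # 60.0
```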

3 Jeff July 10, 2007 at 2:52 am

Prodlife,

I’ve gone through a number of Oracle RAC installs on Linux, and found Julian Dyke and Steve Shaw’s RAC book very helpful (Kevin contributed to this book) as well as Werner Puschitz’s web site, which goes through some of the Linux kernel parameters in some detail (hugepages is the one I’m investigating now, as I’m getting servers with more RAM that can take advantage of this feature). Also check out OTN for the Oracle validated configurations, in which the kernel parameters used for the configurations are listed.

I think Charles’ blog hits the nail on the head, in which he states that traditional job roles get a little cloudy when performing RAC installs. To me, RAC is a niche within Oracle, and the DBA has to go beyond a traditional role to successfully setup a RAC environment.

4 cschultz July 11, 2007 at 11:25 am

Thanks for the writeup, Kevin. I think you did a great job at representing the truth.

I will continue to make the disclaimer that we are new to RAC, so some of our choices/decisions may seem novice in that regards. However, I did ask about going the NAS route, and the perspective of the Linux SysAdmin and Storage group (two completely different groups) seems to be “let’s use SAN because that is what we use everywhere else”. As I am just starting to climb the hill of learning in regards to storage, I acquiesced. But Kevin, you are giving me food for thought.

5 billy bathgates July 12, 2007 at 4:37 pm

The poll questions could be better. We are using raw oracle storage devices, but they are implemented by an FC SAN. It’s too early to tell whether RAC is worth it. Our implementation is vendor supplied, and seems very idiosyncratic to me (in the sense that, “why do they do this that way” comes to my mind a lot. Sometimes the answer seems to be, “to sell as many middleware licenses as humanly possible”).

I’ll know a lot more in six months, I guess.

By the way, thank you for your routinely insightful comments here.

6 billy bathgates July 12, 2007 at 5:18 pm

Your storage buffet comments seem a little less than even-handed to me. You kinda skipped the storage allocation steps that are still required when using nfs to access storage. Storage allocated from an FC SAN can just be in the form of filesystems too, if that’s what’s desired.

Admittedly there may be advantages particular to Oracle RAC; I don’t have enough experience in the area to comment.

7 kevinclosson July 12, 2007 at 5:56 pm

Billy,

I love your nom de plume..if that is in fact a nom de plume! I will need to use your comments on the “storage buffet” as the catalyst for another installment in the Manly Man series. You have asked an excellent question and the answer deserves a blog entry.

Thanks for stopping by!

8 kevinclosson July 12, 2007 at 6:33 pm

“The poll questions could be better. We are using raw oracle storage devices, but they are implemented by an FC SAN.”

Billy, the poll (http://www.misterpoll.com/4072127287.html) has a selection for FC SAN RAW or ASM. Whether you use simple RAW files or have ASM manage the RAW space, it is still RAW access from the database perspective so that is why the selections are listed that way. You should just check that one for your situation.

9 Vladimir Barac August 25, 2007 at 12:50 pm

I hope I am not late with reply (question actually)…

I have applied some kernel parameters that are listed above:

net.core.rmem_default = 524288
net.core.rmem_max = 16777216
net.core.wmem_default = 524288
net.core.wmem_max = 16777216

net.ipv4.tcp_rmem = 4096 524288 16777216
net.ipv4.tcp_wmem = 4096 524288 16777216

sunrpc.tcp_slot_table_entries=128

Filer is NetApp FAS3020, o/s Suse SLES10, db is 10.2 standard edition, single instance.

Basic test I have performed is “select count(*) from ” to test performance. Before each test instance is bounced.

I am consistently getting around 8300 blocks/s read performance – this is based on V$SESSION_LONGOPS view (columns total_work and elapsed_seconds). This is around 64MB/s with block size of 8K.

Is it possible to increase this value? Or is this considered “good NFS performance”?
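The arithmetic behind the "around 64MB/s" figure can be sketched in Python. The function name is invented, and the 498,000 blocks over 60 seconds are hypothetical values chosen only to reproduce the roughly 8300 blocks/s quoted above from the total_work and elapsed_seconds columns.

```python
def scan_rate_mib_per_s(blocks_read, elapsed_seconds, block_size=8192):
    """Blocks read over an interval, expressed as MiB per second."""
    return blocks_read / elapsed_seconds * block_size / 2**20

# Hypothetical sample: 498,000 blocks in 60 s is 8300 blocks/s with 8K blocks.
print(round(scan_rate_mib_per_s(498_000, 60), 1))  # 64.8
```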

10 kevinclosson August 25, 2007 at 4:27 pm

Vladimir,

Any chance you can send a statspack? Also, what are your NFS mount options on the SLES10 server?

In short, no, 64MB/s is way off the mark.

11 Vladimir Barac August 26, 2007 at 8:41 am

Yes, I have prepared some statspack report. Contact me by mail and I will send them.

I have one more peculiar performance problem, more interesting one. There is one server that peaks at 10MB/s ~ 12MB/s (again, according to V$SESSION_LONGOPS). Server is identical to the one in my previous post.

Info about the server is as follows:

– server is HP DL385 G2
– SLES 10
– Oracle 10.2.0.3 enterprise edition
– Netapp FAS3020
– mount options are rw,bg,hard,vers=3,proto=tcp,nointr,rsize=32768,wsize=32768,timeo=600
– filesystemio_options=setall
– multi block read count is 128 (default)
– except Oracle, nothing else is running on the server
– kernel parameters same as in previous post.
– same test is performed (select count(*)…)

I don’t have Windows servers available any more, but if I remember well iSCSI (served from the same filer) was performing better.

12 Vladimir Barac August 26, 2007 at 11:15 am

I have posted a reply, however it is not yet visible?

Should I post it once again?

Thanks,
Vladimir

13 Vladimir Barac August 28, 2007 at 12:21 pm

Small update… Server with 10MB/s ~ 12MB/s was actually having network card running at 100Mbps. After replacing network cable, link speed reverted back to 1000Mbps.

Again, we are seeing ~60MB/s speed for NFS.

14 kevinclosson August 28, 2007 at 3:36 pm

Vladimir,

You have a DL385 (2 socket dual-core Opteron). If that query isn’t going parallel (and therefore issuing multiblock reads) then you are getting really good performance for Oracle, but not NFS. See if the reads are db file sequential reads, db file scattered reads or direct path reads.

If Oracle is doing multiblock reads, you **will** see up to about 100-115MB/s on a single GbE hose.
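A back-of-the-envelope Python sketch of where a ceiling in that range comes from. It assumes a 1500-byte MTU, no VLAN tag, and 20-byte IP and TCP headers with no options. It ignores RPC/NFS headers and ACK traffic, so real NFS throughput lands somewhat below these numbers. The names are invented for this illustration.

```python
# Per-frame bytes on the wire that carry no payload:
# preamble/SFD (8) + Ethernet header (14) + FCS (4) + inter-frame gap (12).
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12
IP_TCP_HEADERS = 20 + 20  # assumes no IP or TCP options

def tcp_payload_ceiling(link_mbps, mtu=1500):
    """Upper bound on TCP payload bytes per second for a given link speed."""
    wire_bytes_per_s = link_mbps * 1_000_000 / 8
    payload_per_frame = mtu - IP_TCP_HEADERS
    return wire_bytes_per_s * payload_per_frame / (mtu + ETHERNET_OVERHEAD)

for link_mbps in (100, 1000):
    print(link_mbps, round(tcp_payload_ceiling(link_mbps) / 1e6, 1))
# 100 11.9
# 1000 118.7
```

The 100Mbps figure lines up with the 10MB/s ~ 12MB/s seen on the server whose link had negotiated down, and the 1000Mbps figure sits just above the 100-115MB/s quoted here.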

15 Vladimir Barac August 28, 2007 at 6:10 pm

We can’t use parallel on all instances, as the software is standard edition. Where we can have parallel degree >1 we do see a performance increase.

Table used to test speed is dummy table, having no indexes, just millions of rows to create some workload.

Reads are all “scattered”, block count is 128 (according to v$session_wait).

I am just curious what is so “magical” with ~60MB/s. Am I “chasing my tail” here? 🙂

16 kevinclosson August 28, 2007 at 6:30 pm

Vladimir,

Look at the statspack to see the average number of blocks transferred per read on that tablespace. It just doesn’t sound like Oracle is issuing large reads. If Oracle is issuing large I/O then the problem is likely down-wind of the NIC.

I’m about to blog another approach to testing the wire-throughput before bothering with Oracle. Hint: get a copy of the dd command that Oracle recommends with ocfs 1 and do some dd o-direct using 1MB transfers.
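In the same spirit as the dd hint, here is a minimal Python sketch of a sequential read test in 1MB transfers with O_DIRECT. It is not the dd build mentioned above, only an illustration. It assumes Linux (os.O_DIRECT is not available elsewhere) and a filesystem that accepts O_DIRECT opens. The function name and parameters are invented.

```python
import mmap
import os
import time

def read_throughput(path, block_size=1024 * 1024, direct=True):
    """Read a file sequentially in block_size chunks; return (bytes_read, seconds)."""
    flags = os.O_RDONLY
    if direct:
        flags |= os.O_DIRECT  # Linux-only: bypass the client page cache
    fd = os.open(path, flags)
    # An anonymous mmap is page-aligned, which O_DIRECT requires of the buffer.
    buf = mmap.mmap(-1, block_size)
    total = 0
    start = time.perf_counter()
    try:
        while True:
            n = os.readv(fd, [buf])  # fills buf, returns the number of bytes read
            if n == 0:
                break
            total += n
    finally:
        os.close(fd)
        buf.close()
    return total, time.perf_counter() - start
```

Pointing it at a large file on the NFS mount and dividing bytes by seconds gives a wire-throughput number with Oracle out of the picture. Passing direct=False falls back to ordinary buffered reads.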

17 Vladimir Barac August 29, 2007 at 3:25 am

I’ll need word of advice… How to check how many blocks is Oracle actually reading at a time? V$SESSION_WAIT says count is 128.

Thanks,
Vladimir

18 kevinclosson August 29, 2007 at 5:12 am

Vladimir,

Start a statspack report, run the test, close the report and read the I/O section. It tells you blocks read per I/O.

19 ezaton April 10, 2009 at 8:31 am

I have been reading your posts and I find them interesting, and generally speaking – correct and accurate, however, I have several comments of my own regarding this specific article.
It begins with the fact that a DBA is a DBA, and not a sysadmin. As soon as you, or any other DBA out there will grasp that, some of the issues dealt with badly by DBAs would be shifted to a sysadmin, and would be solved faster and better.

Check out my blog about using udev for persistent raw names. Quite simple, as soon as you grasp that. The sysadmin’s responsibility is to supply the DBA with a working, prepared server environment. The DBA should know that there are these files – /dev/raw/crs1, /dev/raw/crs2, /dev/raw/vote1 …. etc, etc, etc, which are for his utilization. He should not care as to which transport was responsible for them. They should just be there. Same goes for ASM and multipath. The problem can easily be solved by editing /etc/sysconfig/oracleasm and excluding all “sd” devices. I think that this is also being described in Oracle’s documentation and best practices for EMC.

Which brings us to another thing. I HATE the way EMC works. I hate their PowerPath – this buggy and unreliable piece of software, with their entire interoperability matrix which is impossible to match. They are forcing the Linux OS in ways I would not want to see on any Linux I manage. File locations, RPM structure – just name it….

One last thing – for your RAC deployment, your NFS mount options are fine – for the voting, crs and data files. For anything else – they are disastrous. For large amount of files located on such an NFS tree, attribute cache should be enabled, else you will get terrible performance out of this volume – using actimeo=300 or 600 (my preference)

Cheers!
Ez

- 20 kevinclosson April 10, 2009 at 4:09 pm
  
  ezaton,
  
  Thanks for stopping by and for the (mostly) kind words. I need to address the content of your comment:
  
  “It begins with the fact that a DBA is a DBA, and not a sysadmin. As soon as you, or any other DBA out there will grasp that, some of the issues dealt with badly by DBAs would be shifted to a sysadmin, and would be solved faster and better”
  
  Well, I haven’t held a position as a DBA for over 20 years, but I have respect for the difficult things they do on a daily basis and it is for that reason that I harp on the “fit and feel” of Oracle on Fibre Channel SAN storage. Suggesting that DBAs need to rush to “grasp” “simple” things seems like a put-down to DBAs by the way.
  
  NFS is the only unified, totally supported cross-platform storage presentation model out there and it works well. It is no secret that NFS is the go-to storage platform for Oracle’s hosted On Demand business and that is because of simplicity across the board (my assessment).
  
  “…some of the issues dealt with badly by DBAs would be shifted to a sysadmin, and would be solved faster and better”
  
  It is my position that DBAs shouldn’t have to make frequent trips to the Sysadmin group. The organization’s choice of storage presentation models is a huge factor in the frequency of such visits. With Exadata or NFS the aim is to make those visits much less frequent.
  
  “The DBA should know that there are these files – /dev/raw/crs1, /dev/raw/crs2, /dev/raw/vote1 …. etc, etc, etc, which are for his utilization. “
  
  A DBA owns the contents of the files. The fact that the raw SAN storage presentation model forces DBAs to get intimate with character-special devices does not actually aid them in doing their real job. It is overhead.
  
  “The problem can easily be solved by editing /etc/sysconfig/oracleasm and excluding all “sd” devices”
  
  Yet another example of what I consider unnecessary complexity.
  
  “Which brings us to another thing. I HATE the way EMC works. I hate their PowerPath – this buggy and unreliable piece of software, with their entire interoperability matrix which is impossible to match”
  
  I discriminate less (so as not to name powerpath alone). I think all raw disk multi-pathing in Linux clustered environments is messy, prone to mistakes and unnecessary complexity.
  
  “One last thing – for your RAC deployment, your NFS mount options are fine – for the voting, crs and data files. For anything else – they are disastrous. For large amount of files located on such an NFS tree, attribute cache should be enabled, else you will get terrible performance out of this volume – using actimeo=300 or 600 (my preference)”
  
  Thanks for that, in part. I don’t understand why you clump crs and datafiles together regarding mount options, because the mount where CRS files (css,ocr) reside must be mounted with actimeo=600. For Oracle Home and databases actimeo must be set to 0 (see Metalink 359515.1). Remember, Sysadmins are supposed to make DBA’s lives easier…so be careful not to promote unsupported NFS mount options 🙂
  
  I think my paper on the topic provides reasonable evidence that the Oracle-supported NFS mount options do, in fact, support good performance. I’m not sure what sort of Oracle-related filesystem functionality you have in mind when you suggest the Oracle-supported mount options would be “disastrous.” Feel free to elaborate.
  
  As an aside, all the monkey business about mount options is one of the primary benefits of the Direct NFS feature of Oracle Database 11g. Mount options are a non-issue with that feature.
  
21 ezaton April 11, 2009 at 12:02 am

Hi Kevin.
Thanks for your fast response.

I stand corrected as to the NFS mount options. This is also some sort of monkey business, so there is no “leave the sysadmin out of it” solution just yet.
Oracle had some notion in that direction with ASM (you sysadmins only make sure we have our LUNs, and we’ll do fine onwards without you). Not that I have a problem with that, it’s that usually most DBAs don’t have a clue about storage, filesystems and all these containers of bits. Few have. Most don’t.

To stress it out – I have had to “translate” in a conversation between one DBA and one sysadmin which went like this:
DBA: I want the slashes to go the other way
Sysadmin: What do you mean?
DBA: I have this name /vol/oracle/database and I want it to be backwards, since when I snapshot, the slashes kill my “sed” command. Also – if you don’t mind – make it just be “database”. I don’t want this entire /vol/oracle thing there
(here sysadmin starts screaming)

This is a real conversation I have had the pleasure of joining. After getting a glass of water to the sysadmin, I explained the constraints of NetApp to the sed-threatened DBA, and managed to assist with his problem, so, luckily, the /vol and the direction of the slashes were kept. Phew.

Back to the topic. Storage administration is no different than any other administration. Network administration is required as well, and everyone just lives with it. Assuming storage is configured correctly (aha! These sysadmins should do their work, right?), there should not be much trouble – if you understand the constraints and limitations that come with SANs – just like understanding the constraints of networking is required for a working system.

Multipathing is rather simple when you get to know it. IBM’s RDAC (Engenio’s, LSI-Logic, today, if you want to be exact) adds an upper and lower filter which converts all the possible paths into simple /dev/sd* devices, without exposing the underlying paths.

On a different topic – I have seen several RACs with OCFS2 containing the CRS and voting files. This simplifies management enough for the DBAs, right?

Now, just to be clear – I would prefer NFS over any other protocol any day. It is cleaner and easier than the alternative, and it is good and reliable. But I have noticed some issues with human behavior around data centers – people will be more likely to disconnect these gray ethernet cables than the orange optic ones. From a psychological point of view, having your transport rely on the orange (1000 Base FX ethernet, or FC storage), your servers would be less likely to be bothered by human error.
Strong rules and professional people out there. Is it not what we all want?

About NFS and actimeo – Imagine a directory containing thousands of files (non-oracle). That was my meaning. When you run ‘ls’ – this colorful happy RedHat’s ‘ls’ command, you actually retrieve all the attributes of the files. When you do it again, under normal circumstances, you do not. When tens or hundreds of servers do that, this is normal day to day work, however, when actimeo=0, each server has to retrieve the entire meta information on each pass for each file. This means that the amount of small and annoying IOs on this volume is tremendous. Combine that with the storage cache working for other things, and you get your disks doing the worst possible IO workload – very small and very scattered tiny reads. This strains the storage dearly, and in several cases I have witnessed, can mean the difference between 2-3 hours of compilation, for example, and less than 10 minutes. Same storage, different NFS mount options.

Oracle requires its actimeo settings for specific reasons (you need to refresh your meta information every time in a shared cluster), however, for non-oracle (and as you have pointed out correctly – non oracle data files), this reduces performance to a turtle’s pace, with the bonus of massively stressing the storage (you can get only a certain amount of IOPS out of your disk array), and it will be noticed when you have a large number of files on this NFS volume.
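The effect described here can be illustrated with a toy Python model of a client-side attribute cache. This is a deliberately simplified sketch, not how the Linux NFS client is implemented: every name in it is invented, and it only counts how many attribute fetches would have to go to the server when the same directory is listed repeatedly.

```python
class AttrCacheClient:
    """Toy attribute cache: an entry is trusted for `actimeo` seconds."""

    def __init__(self, actimeo):
        self.actimeo = actimeo
        self.cached_at = {}      # path -> time the attributes were last fetched
        self.getattr_calls = 0   # round trips to the (imaginary) server

    def stat(self, path, now):
        fetched = self.cached_at.get(path)
        if fetched is None or now - fetched >= self.actimeo:
            self.getattr_calls += 1
            self.cached_at[path] = now

def simulate(actimeo, files=1000, passes=10, seconds_between_passes=30):
    """List the same `files` entries `passes` times; return server round trips."""
    client = AttrCacheClient(actimeo)
    for p in range(passes):
        now = p * seconds_between_passes
        for i in range(files):
            client.stat(f"/src/file{i}", now)
    return client.getattr_calls

print(simulate(actimeo=0))    # 10000
print(simulate(actimeo=600))  # 1000
```

With actimeo=0 every pass goes back to the server for every file. With a 600-second lifetime only the first pass does, which is the gap between the two numbers.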

Cheers!
Ez

22 Ali Zaidi March 23, 2010 at 12:04 am

Kevin:
you mention mount points are a non-issue with DNFS, what about the kernel tweaks in your post, are they still relevant with DNFS?

thx

- 23 kevinclosson March 26, 2010 at 6:30 pm
  
  I think I was saying that mount options are not important with DNFS…that statement, however, is restricted to database objects, not RAC objects (e.g., OCR,CSS)
  
