Archive for the 'server consolidation' Category

Oracle Direct I/O Brought to You By Deranged Monkeys

If you have an Linux system, check the “bugs” section of the man page for the open(2) system call and you’ll see the following quote from Linus Torvalds:

The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances -Linus

I’m not joking, read that man page and you’ll see. Now, while I much prefer a mount option approach to direct I/O, I don’t think the O_DIRECT style of direct I/O was the brain child of a deranged monkey. I wonder if Linus is insinuating that the interface would be better if it was written by a sane monkey—or perhaps even a deranged monkey that is not on some serious mind-controlling substances?

There is nothing strange about O_DIRECT and most of the Unix derivations I am aware of are happy to offer it (Solaris being the notable exception offering directio(3C) instead). I’d love to know more about the context of that Linus quote. I’ve been around O_DIRECT since the very early 1990s. Sequent supported O_DIRECT opens on DYNIX/ptx file system files way back in 1991.

The Linux kernel development community still languishes over the fact that software like Oracle does not like to kernel-dive to access buffered data, preferring to do its own buffering instead.

A Mount-Option Approach
Why? Well, if you have programs that perform properly aligned I/O calls (e.g., cat(1), dd(1), cp(1), etc) but you don’t want them “polluting “ your system page cache, then you either need a mount-option approach to do Direct I/O or the tools need to be re-coded to open O_DIRECT. Back in 2001 I had the opportunity to make that choice for PolyServe and I haven’t regretted it once. Let me explain.

Let’s say, for instance, you generate and compress a few gigabytes of archived redo logs per day—or roughly ~40KB per second. It doesn’t sound like much, I know. But let’s look at page cache costs. When ARCH spools an offline redo log to the archive log destination the OS page cache will be used to buffer the I/O. When your compression tool (e.g. compress(P), gzip(1)) reads the file, page cache will once again be used. As the output of compress needs to be written page cache is used. Finally, when the archived redo is copied off the system (e.g., to tape), page cache will again be used. All this caching for data that is not used again—save for emergencies. But really, caching sequentially read archived files and compress output? Makes little sense.

The only way to not cache this sort of data is O_DIRECT, but I/Os issued against an O_DIRECT opened file must be multiples of the underlying disk block size (generally 512 bytes). The buffer in the calling process used for the I/O must also be a aligned on an address that is a multiple of the OS page size. It turns out that most OS tools perform proper alignment of their I/O buffers. So where is the rub? The I/O sizes! Even if you coded your compress tool to use O_DIRECT (deranged monkey syndrome), the odds that the output file will be a multiple of 512 bytes is nil. Let’s look at an example.

Direct I/O for Better Memory Utilization
In the following session I performed 6 steps to see the effect of direct I/O:

  1. Use df to determine space and exact filesystem of my current working directory (CWD)
  2. Check the Mount options. My CWD is a PolyServe PSFS mounted with the DBOptimized mount option which “renders” direct I/O akin to the Solaris –forcedirectio mount option.
  3. List my redo logs. Note, they are OMF files so the names are a bit strange.
  4. Check free memory on the system
  5. Copy a redo log
  6. Check free memory again to see how much memory was used by the OS page cache

fig1.jpg

OK, hold it, in step 5 I copied a 128MB file and yet the free memory available only changed by 176KB (from step 4 to step 6). My copy of an online log closely resembles what ARCH does—it simply copies the inactive online redo log to the archive log destination. I like the ability to not consume 256MB of physical memory to copy a file that is no longer really part of the database! The cp(1) command performs I/O with requests that are 512byte multiples, so the PolyServe CFS mounted in the DBOptimized mode simply “renders” the I/O through the direct I/O code path. No, cp(1) does not open with O_DIRECT, yet I relieved the pressure on free memory by copying with Direct I/O via the mount option. That’s good.

File Compression with Direct I/O Mounted Filesystem
But what about compressing files in a direct I/O filesystem? Let’s take a look. In the next session I did the following:

  1. Check free memory on the system
  2. Used ls(1) to see my copy of the redo log file.
  3. Used gzip(1) with maximum compression on the copy of the redo log file
  4. Used ls(1) to see the file size of the compressed file.
  5. Check free memory on the system to see what OS page cache was used

fig2.jpg

OK, this is good. I take a 128MB redo log file and compress it down to 29,582,800 bytes—which is, of course, 57,778 512 byte chunks plus one 464 byte chunk. According to the differences in free memory from step 1 and step 5, only 64KB of system memory was “wasted” in the act of compressing that file. Why do I say wasted? Because cache is best used for sharing data such as in the SGA, however, here I was able to read in 128MB and write out 28.2MB and only used 64KB of page cache in the process. Memory costs money and efficiency matters. This is the reason I prefer a mount option approach to direct I/O.

Back to the example. How did I write an amount that included a stray 464 bytes with direct I/O? That is not a multiple of the underlying disk driver requirement which is 512 bytes.

Under The Covers
On Linux, gzip(1) uses 32KB reads and 16KB writes. The output file created by gzip(1) is 29,839,295 bytes which is 1,805 writes at 16KB and one last odd-ball write of 9,680 bytes—something that would be impossible to do with direct I/O were it not for the direct I/O mount option. Let’s look at strace. The last write was 9,680 bytes:

fig3.jpg

Direct I/O Without Compile-Time O_DIRECT
I can’t speak about other direct I/O mount implementations, but I can explain how PolyServe does this. All I/O bound for files in a DBOptimized mounted PSFS filesystem are quickly examined to see if the I/O meets the underlying device driver DMA requirements. In the kernel we use simple arithmetic to determine if the I/O size is a multiple of the underlying disk block size (satisfies DMA requirement) and whether the I/O buffer is aligned on a page boundry. If both conditions are true, the I/O is DMAed directly from the process address space to the disk. If not, we simply grab an OS page cache buffer, perform the I/O and then immediately invalidate that page so no other process can read dirty data (PolyServe is sort of big on cache coherency if you get my drift).

Best of Both Worlds
In the end, Linus might be right about O_DIRECT, but sitting here at PolyServe makes me say, “Who cares.” We supported direct I/O on Linux before Linux supported O_DIRECT (it was just a patch at that time). In fact, we did a 10-node Oracle9i RAC, 10 TB, 10,000 user OLTP Proof of Concept way back in 2002—before Linux O_DIRECT was mainstream. Here is a link to the paper if you are interested in that proof point.


Real Application Clusters: The Shared Database Architecture for Loosely-Coupled Clusters

The typical Real Application Clusters (RAC) deployment is a true enigma. Sometimes I just scratch my head because I don’t get it. I’ve got this to say, if you think Shared Nothing Architecture is the way to go, then deploy it. But this is an Oracle blog, so let’s talk about RAC.

RAC is a shared disk architecture, just like DB2 on IBM mainframes. It is a great architecture, one that I agree with as is manifested by my working for shared data clustering companies all these years. Again, since this is an Oracle blog I think arguments about shared disk versus shared nothing are irrelevant.

Dissociative Identity Disorder
The reason I’m blogging this topic is because in my opinion the typical RAC deployment exhibits the characteristics of a person suffering from Dissociative Identity Disorder. Mind you, I’m discussing the architecture of the deployment, not the people that did the deployment. That is, we spend tremendous amounts of money for shared disk database architecture and then throw it into a completely shared nothing cluster. How much sense does that make? What areas of operations does that paradigm affect? Why does Oracle promote shared disk database deployments on shared-nothing clusters? What is the cause of this Dissociative Identity Disorder? The answer: the lack of a general purpose shared disk filesystem that is suited to Oracle database I/O that works on all Unix derivations and Linux. But wait, what about NFS?

Shared “Everything Else”
I can’t figure out any other way to label the principle I’m discussing so I’ll just call it “Shared Everything Else”. However, the term Shared Everything Else (SEE for short) insinuates that there is less importance in that particular content—an insinuation that could not be further from the truth. What do I mean? Well, consider the Oracle database software itself. How do you suppose an Oracle RAC (shared disk architecture) database can exist without having the product installed somewhere.

The product install directory for the database is called Oracle Home. Oracle has supported the concept of a shared Oracle Home since the initial release of RAC—even with Oracle9i. Yes, Metalink note 240963.1 describes the requirement for Oracle9i to have context dependent symbolic links (CDSL), but that was Oracle9i. Oracle10g requires no context dependent symbolic links. Oracle Universal Installer will install a functional shared Oracle Home without a any such requirements.

What if you don’t share a software install? It is very easy to have botched or mismatched product installs—which doesn’t sit well with a shared disk database. In a recent post on the oracle-l list, sent the following call for help:

We are trying to install a 2-node RAC with ASM (Oracle 10.2.0.2.0 on Solaris 10) and getting the error below when using dbca to create the database.The error occurs when dbca is done creating the DB (100%).Any suggestions?

We have tried starting atlprd2 instance manually and get the error below regarding an issue with spfile which is on ASM.

ORA-01565: error in identifying file ‘+SYS_DG/atlprd/spfileatlprd.ora’
ORA-17503: ksfdopn:2 Failed to open file +SYS_DG/atlprd/spfileatlprd.ora
ORA-03113: end-of-file on communication channel

OK, for those who are not Oracle-minded, this sort of deployment is what I call the Dissociative Identity Disorder since the database will be deployed on a bunch of LUNs provisioned, masked and accessed as RAW disk from the OS side—ASM is a collection of RAW disks. This is clearly not a SEE deployment.The original poster followed up with a status of the investigatory work he had to do to try and get around this problem:

[…] we have checked permissions and they are the same.We also checked and the same disk groups are mounted in both ASM instances

also.We have also tried shutting everything down (including reboot of both servers) and starting everything from scratch (nodeapps, asm, listeners, instances), but the second node won’t start.Keep getting the same error […]

What a joy. Deploying a shared disk database in a shared nothing cluster! There he was on each server checking file permissions (I just counted, there are 20,514 files in one of my Oracle10g Oracle Homes), investigating the RAW disk aspects of ASM, rebooting servers and so on. Good thing this is only a 2 node cluster. What if it was an 8 node cluster? What if he had 10 different clusters?

As usual, the oracle-l support channel comes through. Another list participant posted the following:

Seem to be a known issue (Metalink Note 390591.1). We encountered similar issue in Linux RAC cluster and has been resoled by following this note.

The cause was included in his post (emphasis added by me):

Cause

Installing the 10.2.0.2 patchset in a RAC installation on any Unix platform does not correctly update the libknlopt.a file on all nodes. The local node where the installer is run does update libknlopt.a but remote nodes do not get the updated file. This can lead to dumps or internal errors on the remote nodes if Oracle is subsequently relinked.

That was the good and bad, now the ugly—his post continues with the following excerpt from the Oracle Metalink note:

There are two solutions for this problem:

1) Manual copy of the “libknlopt.a” library to the offending nodes:

-ensure all instances are shut down
-manually copy $ORACLE_HOME/rdbms/lib/libknlopt.a from the local node to all remote nodes

-relink Oracle on all nodes :
make -f ins_rdbms.mk ioracle

2) Install the patchset on every node using the “-local” option:

What’s So Bad About Shared Nothing Clusters?
I’m not going to get into that, but one of the central knock-offs Oracle uses against shared-nothing database architecture is the fact that replication is required. Since the software used to access RAC needs to be kept in lock-step, replication is required there as well, and as we see from this oracle-l email thread, replication is not all that simple with a complex software deployment like the Oracle database product. But speaking of complex, the Oracle database software pales in comparison to the Oracle E-Business Suite. How in the world do people manage to deploy E-Biz on anything other than a huge central server? Shared Applications Tier.

Shared Applications Tier
Yes, just like Oracle Home, the huge, complex Oracle E-Business Suite can be installed in a shared fashion as well. It is called a Shared Applications Tier. One of the other blogs I read has been discussing this topic as well, but this is not just a blogosphere topic—it is mainline. Perhaps the best resource for Shared Applications Tier is Metalink note 243880.1, but Metalink notes 384248.1 and 233428.1 should not be overlooked. The long story short is that Oracle supports SEE, but they don’t promote it for who-knows-what-reason.

Is SEE Just About Product Installs?
Absolutely not. Consider intrinsic RAC functionality that doesn’t function at all without a shared filesystem:

  • External Tables with Parallel Query Option
  • UTIL_FILE
  • BFILE

I’m sure there are others (perhaps compiled PL/SQL), but who cares. The product is expensive and if you are using shared disk architecture you should be able to use all the features of shared disk architecture. However, without a shared filesystem, External Tables and the other features listed are not cluster-ready. That is, you can use External Tables, UTIL_FILE and BFILE—but only from one node. Isn’t RAC about multi-node scalability?

So Why the Rant?
The Oracle Universal Installer will install a fully functional Oracle10g shared Oracle Home to simplify things, the complex E-Business Suite software is architected for shared install and there are intrinsic database features that require shared data outside of the database so why deploy a shared database architecture product on a platform that only shares the database? You are going to have to explain it to me like I’m six years old; because I know I’m not going to understand. Oh, yes, and don’t forget that with a shared-nothing platform, all the day to day stuff like imp/exp, SQL*Loader, compressed archive redo, logging, trace, scripts, spool and so on mean you have to pick a server and go. How symmetric is that? Not as symmetric as the software for which you bought the cluster (RAC), that’s for certain.

Shared Oracle Home is a Single Point of Failure
And so is the SYSTEM tablespace in a RAC database, so what is the point?People who choose to deploy RAC on a platform that doesn’t support shared Oracle Home often say this. Yes a single shared Oracle Home is a single point of failure, but like I said, so is the SYSTEM tablespace in every RAC database out there. Shops that espouse shared software provisioning (e.g., shared Oracle Home) are not dolts, so the off-the-cuff single point of failure red herring is just that. When we say shared Oracle Home, do we mean a single shared Oracle Home? Well, not necessarily. If you have, say, a 4 or 8 node RAC cluster, why assume that SEE or not to SEE is a binary choice? It is perfectly reasonable to have 8 nodes share something like 2 Oracle Homes. That is a significant condensing factor and appeases the folks that concentrate on the possible single point of failure aspect of a shared Oracle Home (whilst often ignoring the SYSTEM tablespace single point of failure). A total availability solution requires Data Guard in my opinion, and Data Guard is really good, solid technology.

Choices
All told, NFS is the only filesystem that can be used across all Unix (and Linux) platforms for SEE. However, not all NFS offerings are suffiently scalable and resilient for SEE. This is why there is a significant technology trend towards clustered storage (e.g., NetApp OnTAP GX, PolyServe(HP) EFS Clustered Gateway, etc).

Finally, does anyone think I’m proposing some sort of mix-match NFS here with a little SAN there sort of ordeal? Well, no, I’m not. Pick a total solution and go with it…either NFS or SAN, the choice is yours, but pick a total platform solution that has shared data to complement the database architecture you’ve chosen. RAC and SEE!


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 2,953 other followers

Oracle ACE Program Status

Click It

website metrics

Fond Memories

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.

%d bloggers like this: