Archive for the 'NFS CFS ASM' Category



HP to Acquire PolyServe to Bolster NAS Offerings with Clustered Storage

You faithful readers of this blog know my position on NAS for Oracle. Clustered Storage is getting hot and HP has just stepped up to the plate by acquiring PolyServe. Here is a link to HP’s website with details:

HP To Acquire PolyServe

As regular readers can imagine, my blogging will certainly sound a lot different going forward.

Yes Direct I/O Means Concurrent Writes. Oracle Doesn’t Need Write-Ordering.

If Sir Isaac Newton were walking about today dropping apples to prove his theory of gravity, he’d feel about like I do making this blog entry. The topic? Concurrent writes on file system files with Direct I/O.

A couple of months back, I made a blog entry about BIGFILE tablespaces in ASM versus modern file systems. The controversy at hand at the time was the dreadful OS locking overhead that must surely be associated with using large files in a file system. I spent a good deal of time tending to that blog entry, pointing out that the world is no longer flat and that such age-old concerns over OS locking overhead are no longer relevant on modern file systems.

Modern file systems support Direct I/O, and one of the subtleties that seems to have been lost in the definition of Direct I/O is the elimination of the write-ordering locks required for regular file system access. That serialization is normally required so that if two processes write to the same offset in the same file, one entire write completes before the other begins—thus preventing fractured writes. With databases like Oracle, no two processes will write to the same offset in the same file at the same time. So why have the OS impose such locking? It doesn’t with modern file systems that support Direct I/O.
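To make the pattern concrete, here is a minimal sketch of the access pattern in question: two processes writing concurrently to non-overlapping regions of the same file, exactly as Oracle’s writers do. The path is hypothetical, and it assumes a GNU dd(1) recent enough to support oflag=direct:

    # Two concurrent writers, one file, non-overlapping 4KB-aligned regions.
    # oflag=direct makes dd open the file with O_DIRECT. On a direct I/O
    # file system neither dd waits behind an inode write-ordering lock; on
    # a buffered file system these two writes would serialize.
    dd if=/dev/zero of=/u01/dbf bs=4k count=1024 oflag=direct conv=notrunc &
    dd if=/dev/zero of=/u01/dbf bs=4k count=1024 seek=1024 oflag=direct conv=notrunc &
    wait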

Regarding the blog entry called ASM is “not really an optional extra” With BIGFILE Tablespaces, a reader posted the following comment:

“node locks are only an issue when file metadata changes”
This is the first time I’ve heard this. I’ve had a quick scout around various sources, and I can’t find support for this statement.
All the notes on the subject that I can find show that inode/POSIX locks are also used for controlling the order of writes and the consistency of reads. Which makes sense to me….

Refer to:
http://www.ixora.com.au/notes/inode_locks.htm

Sec 5.4.4 of
http://www.phptr.com/articles/article.asp?p=606585&seqNum=4&rl=1

Sec 2.4.5 of
http://www.solarisinternals.com/si/reading/oracle_fsperf.pdf

Table 15.2 of
http://www.informit.com/articles/article.asp?p=605371&seqNum=6&rl=1

Am I misunderstanding something?

And my reply:

…in short, yes. When I contrast ASM to a file system, I only include direct I/O file systems. The list of file systems and file system options that have eliminated write-ordering locks is a very long one, starting, in my experience, with direct async I/O on Sequent UFS as far back as 1991 and continuing with VxFS with Quick I/O, VxFS with ODM, PolyServe PSFS (with the DBOptimized mount option), Solaris UFS post Sol8-U3 with the forcedirectio mount option, and others I’m sure. Databases do their own serialization, so having the file system do so as well is unnecessary.

The ixora and solarisinternals references are very old (2001/2002). As I said, Solaris 8U3 direct I/O completely eliminates write-ordering locks. Further, Steve Adams also points out that Solaris 8U3 and Quick I/O were the only ones he was aware of, but that doesn’t mean VxFS ODM (2001), Sequent UFS (starting in 1992) and ptx/EFS, and PolyServe PSFS (2002) weren’t all supporting completely unencumbered concurrent writes.

Ari, thanks for reading and thanks for bringing these old links to my attention. Steve is a fellow OakTable Network member… I’ll have to let him know about this out-of-date stuff.

There is way too much old (and incomplete) information out there.
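For the record, enabling direct I/O on a couple of the file systems mentioned above looks roughly like this. The device paths and mount points are made up; check your platform’s documentation for the exact options:

    # Solaris UFS (Solaris 8 U3 and later): forcedirectio gives direct I/O
    # with concurrent writes and no write-ordering locks
    mount -F ufs -o forcedirectio /dev/dsk/c1t0d0s6 /u01

    # VxFS: direct I/O via mount options (Quick I/O and ODM are separate,
    # licensed facilities layered on top)
    mount -F vxfs -o convosync=direct,mincache=direct /dev/vx/dsk/dg01/vol01 /u02

PolyServe PSFS gets the same behavior with its DBOptimized mount option.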

A Quick Test Case to Prove the Point
The following screen shot shows a shell process on one of my ProLiant DL585s running RHEL 4 and the PolyServe Database Utility for Oracle. The session is using the PolyServe PSFS filesystem mounted with the DBOptimized mount option, which supports Direct I/O. The test consists of a single dd(1) process overwriting the first 8GB of a file that is a little over 16GB. The first invocation of dd(1) writes 2097152 4KB blocks in 283 seconds, for an I/O rate of 7,410 writes per second. The next test consisted of executing 2 concurrent dd(1) processes, each writing a 4GB portion of the file. Bear in mind that the age-old, decrepit write-ordering locks of yesteryear serialized writes; without bypassing those locks, two concurrent write-intensive processes cannot scale their writes to a single file.

The screen shot shows that the concurrent write test achieved 12,633 writes per second. Although 12,633 represents only 85% scale-up, remember, these are physical I/Os. I have a lot of lab gear, but I’d have to look around for a LUN that can do more than 12,633 IOPS, and I wanted to belt out this post. The point is that on a “normal” file system, the second run of foo.sh with two dd(1) processes would take the same amount of time to complete as the single dd(1) run. Why? Both tests have the same total write payload, so if the second run suffered serialization the completion times would be the same:

[Screenshot: single dd(1) versus two concurrent dd(1) writers on PSFS with Direct I/O]
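The post doesn’t include foo.sh itself, so here is a hypothetical reconstruction of what such a harness might look like, scaling the earlier snippet up to the sizes used in the test (again assuming GNU dd(1) with oflag=direct and a direct I/O mount at /u01):

    #!/bin/bash
    # Hypothetical reconstruction of foo.sh; the original is not shown in
    # the post. FILE is a pre-existing file of a little over 16GB on a
    # direct I/O mount.
    FILE=/u01/bigfile

    # Test 1: one writer overwrites the first 8GB (2097152 x 4KB blocks)
    time dd if=/dev/zero of=$FILE bs=4k count=2097152 oflag=direct conv=notrunc

    # Test 2: two concurrent writers, 4GB each, non-overlapping halves
    time (
      dd if=/dev/zero of=$FILE bs=4k count=1048576 oflag=direct conv=notrunc &
      dd if=/dev/zero of=$FILE bs=4k count=1048576 seek=1048576 oflag=direct conv=notrunc &
      wait
    )

With write-ordering locks in play, Test 2 would run just as long as Test 1, since the total payload is identical. With Direct I/O it finished in roughly 60% of the time: 2097152 blocks in 283 seconds is about 7,410 writes per second for the single writer, versus 12,633 per second for the pair.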

EMC’s MPFSi for Oracle: Enjoy It While It Lasts, or Not.

Regular readers of my blog know that I am a proponent of Oracle over NFS—albeit in the commodity computing space; I’ll leave the Superdomes and IBM System p servers to their direct SAN plumbing. I must therefore be a huge fan of EMC’s Celerra MPFSi—the Multi-Path Filesystem, right? No, I’m not. This blog post is about why not.

In this paper about EMC MPFSi, the pictures speak a thousand words. But first, some of my own—with an Oracle-centric take. MPFSi would be just fine, I suppose, except that it is both an NFS server-side architecture and a proprietary NFS client package. The following diagram shows the basic layout of Celerra with MPFSi. There are three components at the back end. One is the Celerra and another is an MDS 9509 Connectrix. The Celerra is there to service NAS filesystem metadata operations, and the Connectrix, with some iSCSI glue, is there to transfer data requests in block form. That is, if you create a file and immediately write a block to it, the file creation is satisfied by the Celerra and the block write by the Connectrix. The final component is the SAN, since Celerra is a SAN gateway.

There is nothing wrong with SAN gateways by any means. I think SAN gateways are the best way to leverage a SAN for provisioning storage to legacy monolithic Unix systems as well as to the large number of commodity servers sitting on the same datacenter floor. That is, SAN to the legacy Unix systems and SAN-gateway NFS to the commodity servers. That’s tiered storage. Ultimately you have a single SAN holding all the data, but the provisioning and connectivity model of the gateway side is much better suited to large numbers of commodity servers than FCP is. Here is the simplified topology of MPFSi:

[Figure: simplified MPFSi topology (Celerra for NAS metadata, Connectrix/iSCSI for block data, SAN behind the gateway)]

MPFSi requires NFS client-side software. The software presents a filesystem that is compatible with NFS protocols, and an agent intercepts NFS protocol messages and forwards them to the Celerra, which then handles them as the MPFSi architecture dictates, as the following diagram shows.

[Figure: MPFSi client-side software intercepting and forwarding NFS protocol messages]

What’s This Have to do With Oracle?
So what’s the big deal? Well, I suppose if you absolutely need to stay with EMC as your SAN gateway vendor, then this is the choice for you. There are SAN-agnostic choices for SAN gateways, as I’ve pointed out on this blog too many times. What about Oracle? Since Oracle10g supports NFS in the traditional model, I’m sure MPFSi works just fine. What about 11g? We’ve all heard “rumors” that 11g has a significant NFS-improvement focus. NFS is good enough with 10g, but 11g aims to make it an even better I/O model. That is good for Oracle’s On Demand hosting business, since they use NFS exclusively. Will the 11g NFS enhancements function with MPFSi? Only an 11g beta program participant could tell you at the moment, and the beta program legalese essentially states that participants can neither confirm nor deny that they are participants. I’ll leave it at that.

Oracle over NFS is Not a Metadata Problem
When Oracle accesses files over NFS, there is no metadata overhead to speak of. Oracle is a simple lseek, read/write engine as far as NFS is concerned, and there is no NFS client cache to get in the way either: Oracle opens files on NFS filesystems with the O_DIRECT flag, which alleviates a good deal of the overhead NFS typically exhibits. Oracle has an SGA; it doesn’t need an NFS client-side cache. So MPFSi is not going to help where scalable NFS for Oracle is concerned. MPFSi better addresses the traditional problems of scaling home shares and the like.
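For the curious, this is roughly what an Oracle-over-NFS setup looks like on Linux. The server name, export, and mount point below are made up, and the option list reflects commonly published guidance for Oracle on NFS, so verify it against the support documentation for your platform:

    # Mount the NFS filesystem with attribute caching disabled (actimeo=0);
    # Oracle caches in the SGA, so the NFS client cache only gets in the way
    mount -t nfs -o rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,vers=3,timeo=600,actimeo=0 \
        filer01:/export/oradata /u02/oradata

With the init.ora parameter filesystemio_options set to directIO or setall, Oracle then opens its datafiles on that mount with O_DIRECT, which is exactly the behavior described above.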

Using Absolutely Dreadful Whitepapers as Collateral
Watch out if you read this ESG paper on EMC MPFSi, because a belt sander might just drop from the ceiling and grind you to a fine powder as punishment for exposing yourself to such spam. This paper is a real jewel. If you dare risk the belt sander, I’ll leave it to you to read the whole thing. I’d like to point out, however, that it shamelessly uses relative performance numbers without the trouble of filling in any baselines for us in the performance section. For instance, the following shot shows a “graph” in the paper where the author claims that MPFSi performs 300% better than normal NFS. This is typical chicanery—without the actual baseline throughput, we can’t really ascertain what was tested. I have a $5 bet that the baseline was not, say, triple-bonded GbE delivering some 270+ MB/sec.

[Figure: relative performance chart from the ESG paper claiming a 300% MPFSi advantage over NFS, with no baseline throughput given]
