Archive for the 'ocfs2' Category

Oracle’s Latest Filesystem Offering. Shades of AdvFS? I Want My OLT!

In this press release, Oracle has started making the existence of the Oracle Linux Test Kit a bit more widely known. I’ve been playing with this kit for a little while now and planned to blog my experiences. It has test components that work against Oracle instances. I’ll blog about my findings as soon as I can.

Long Live AdvFS?
The press release also announces Oracle’s new open source file system called Btrfs. If it was just another file system I wouldn’t mention it in my blog, but the project page mentions plans for features that I think are absolutely critical such as:

  1. Space efficient packing of small files
  2. Writable snapshots
  3. Object level mirroring and striping
  4. Strong integration with device mapper for multiple device support
  5. Efficient incremental backup and FS mirroring

This thing sure smells like a reincarnation of AdvFS, but what do I know?

The project page states that the developers are not even looking into database workloads at the moment but since it is an extent based filesystem I see no reason it wouldn’t work quite well with Oracle databases. I’m most keen on the list of features I’ve listed above though because they go a long way to make life better in production Oracle environments. First, number 1 on the list is good handling of small files. This will be a boon for the handling of trace and logging and a lot of the other ancillary file creation that goes on external to the database. Small files (at least myriads of them) are the bane of most filesystems.

Next on the list are clones-or as the project page calls them writable snapshots. Veritas has offered these for quite some time and they are a very nice feature for creating working copies of databases that are current up to the moment you create the snapshot. Creating a snapshot doesn’t impact performance and a good implementation of clones will not impact the real database even if you are doing a reasonable amount of writes on the clone. Also very cool!

Then there is number 3 on the list-object level mirroring and striping. When all good filesystems “grow up” they offer this sort of feature. Being able to implement software RAID on a file basis is the ultimate cool and I can’t wait to give this one a play! Being able to hand pick which Oracle files-or general purpose files-upon which to apply varying software RAID characteristics is a very nice feature.

Number 4 on the list is Device Mapper support. I’ve ranted on my blog and other forums about Linux handling of device names for quite some time and DM goes a long way to address the pet-peaves. Seeing this filesystem exploit DM at design level is a good thing.

Finally, number 5 above suggests good support for filesystem mirroring. This technology has proven itself useful for Veritas and NetApp customers-as well as others. It’s good to see it in Btrfs.

Clustered Btrfs?
Some time back I made a blog entry about ZFS and discussed the notion of “cluster butter.” That notion refers the the idea that you can take any filesystem and “clusterize” it. Generally adding cluster support after the design phase does not produce much of a cluster filesystem. So, I intend to dive in to see what underpinnings there are in Btrfs for future cluster support.

Oracle Direct I/O Brought to You By Deranged Monkeys

If you have an Linux system, check the “bugs” section of the man page for the open(2) system call and you’ll see the following quote from Linus Torvalds:

The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances -Linus

I’m not joking, read that man page and you’ll see. Now, while I much prefer a mount option approach to direct I/O, I don’t think the O_DIRECT style of direct I/O was the brain child of a deranged monkey. I wonder if Linus is insinuating that the interface would be better if it was written by a sane monkey—or perhaps even a deranged monkey that is not on some serious mind-controlling substances?

There is nothing strange about O_DIRECT and most of the Unix derivations I am aware of are happy to offer it (Solaris being the notable exception offering directio(3C) instead). I’d love to know more about the context of that Linus quote. I’ve been around O_DIRECT since the very early 1990s. Sequent supported O_DIRECT opens on DYNIX/ptx file system files way back in 1991.

The Linux kernel development community still languishes over the fact that software like Oracle does not like to kernel-dive to access buffered data, preferring to do its own buffering instead.

A Mount-Option Approach
Why? Well, if you have programs that perform properly aligned I/O calls (e.g., cat(1), dd(1), cp(1), etc) but you don’t want them “polluting “ your system page cache, then you either need a mount-option approach to do Direct I/O or the tools need to be re-coded to open O_DIRECT. Back in 2001 I had the opportunity to make that choice for PolyServe and I haven’t regretted it once. Let me explain.

Let’s say, for instance, you generate and compress a few gigabytes of archived redo logs per day—or roughly ~40KB per second. It doesn’t sound like much, I know. But let’s look at page cache costs. When ARCH spools an offline redo log to the archive log destination the OS page cache will be used to buffer the I/O. When your compression tool (e.g. compress(P), gzip(1)) reads the file, page cache will once again be used. As the output of compress needs to be written page cache is used. Finally, when the archived redo is copied off the system (e.g., to tape), page cache will again be used. All this caching for data that is not used again—save for emergencies. But really, caching sequentially read archived files and compress output? Makes little sense.

The only way to not cache this sort of data is O_DIRECT, but I/Os issued against an O_DIRECT opened file must be multiples of the underlying disk block size (generally 512 bytes). The buffer in the calling process used for the I/O must also be a aligned on an address that is a multiple of the OS page size. It turns out that most OS tools perform proper alignment of their I/O buffers. So where is the rub? The I/O sizes! Even if you coded your compress tool to use O_DIRECT (deranged monkey syndrome), the odds that the output file will be a multiple of 512 bytes is nil. Let’s look at an example.

Direct I/O for Better Memory Utilization
In the following session I performed 6 steps to see the effect of direct I/O:

  1. Use df to determine space and exact filesystem of my current working directory (CWD)
  2. Check the Mount options. My CWD is a PolyServe PSFS mounted with the DBOptimized mount option which “renders” direct I/O akin to the Solaris –forcedirectio mount option.
  3. List my redo logs. Note, they are OMF files so the names are a bit strange.
  4. Check free memory on the system
  5. Copy a redo log
  6. Check free memory again to see how much memory was used by the OS page cache

fig1.jpg

OK, hold it, in step 5 I copied a 128MB file and yet the free memory available only changed by 176KB (from step 4 to step 6). My copy of an online log closely resembles what ARCH does—it simply copies the inactive online redo log to the archive log destination. I like the ability to not consume 256MB of physical memory to copy a file that is no longer really part of the database! The cp(1) command performs I/O with requests that are 512byte multiples, so the PolyServe CFS mounted in the DBOptimized mode simply “renders” the I/O through the direct I/O code path. No, cp(1) does not open with O_DIRECT, yet I relieved the pressure on free memory by copying with Direct I/O via the mount option. That’s good.

File Compression with Direct I/O Mounted Filesystem
But what about compressing files in a direct I/O filesystem? Let’s take a look. In the next session I did the following:

  1. Check free memory on the system
  2. Used ls(1) to see my copy of the redo log file.
  3. Used gzip(1) with maximum compression on the copy of the redo log file
  4. Used ls(1) to see the file size of the compressed file.
  5. Check free memory on the system to see what OS page cache was used

fig2.jpg

OK, this is good. I take a 128MB redo log file and compress it down to 29,582,800 bytes—which is, of course, 57,778 512 byte chunks plus one 464 byte chunk. According to the differences in free memory from step 1 and step 5, only 64KB of system memory was “wasted” in the act of compressing that file. Why do I say wasted? Because cache is best used for sharing data such as in the SGA, however, here I was able to read in 128MB and write out 28.2MB and only used 64KB of page cache in the process. Memory costs money and efficiency matters. This is the reason I prefer a mount option approach to direct I/O.

Back to the example. How did I write an amount that included a stray 464 bytes with direct I/O? That is not a multiple of the underlying disk driver requirement which is 512 bytes.

Under The Covers
On Linux, gzip(1) uses 32KB reads and 16KB writes. The output file created by gzip(1) is 29,839,295 bytes which is 1,805 writes at 16KB and one last odd-ball write of 9,680 bytes—something that would be impossible to do with direct I/O were it not for the direct I/O mount option. Let’s look at strace. The last write was 9,680 bytes:

fig3.jpg

Direct I/O Without Compile-Time O_DIRECT
I can’t speak about other direct I/O mount implementations, but I can explain how PolyServe does this. All I/O bound for files in a DBOptimized mounted PSFS filesystem are quickly examined to see if the I/O meets the underlying device driver DMA requirements. In the kernel we use simple arithmetic to determine if the I/O size is a multiple of the underlying disk block size (satisfies DMA requirement) and whether the I/O buffer is aligned on a page boundry. If both conditions are true, the I/O is DMAed directly from the process address space to the disk. If not, we simply grab an OS page cache buffer, perform the I/O and then immediately invalidate that page so no other process can read dirty data (PolyServe is sort of big on cache coherency if you get my drift).

Best of Both Worlds
In the end, Linus might be right about O_DIRECT, but sitting here at PolyServe makes me say, “Who cares.” We supported direct I/O on Linux before Linux supported O_DIRECT (it was just a patch at that time). In fact, we did a 10-node Oracle9i RAC, 10 TB, 10,000 user OLTP Proof of Concept way back in 2002—before Linux O_DIRECT was mainstream. Here is a link to the paper if you are interested in that proof point.


Scalable NFS Powered By Open Source Cluster Filesystems

40 Terabytes Per Week With Linux-based Clusters at Dunnhumby
It seems reasonable to think that this company tested the open source clustering stuff, but I don’t know for certain. There are folks out there using Open Source cluster filesystems for “large I/O” processing as is apparent in this recent OCFS2 bug report (emphasis added by me):

During maintenance window, decided to use the OCFS2 filesystem to store a large backup file (about 5-10 gig file). SCP’ed the file from an outside server to node1 of the cluster […]

A little third-party perspective is necessary. Not even back in 1990, with Fujitsu Swallow IV drives, was 10GB considered “large.” The OCFS2 user that filed the bug continued:

After a few minutes, node1 crashed.

Let’s think about that for a moment. The user is bringing unstructured data into the OCFS2 cluster filesystem using scp (1). Just for the heck of it, let’s take the user at his word and do the math. He said, “After a few minutes.” Let’s say a few minutes are 3—180 seconds. That means the scp(1) was likely not trafficked over Gigabit Ethernet because that would be more like enough time to move about 20GB at full bandwidth with a single wire. That pretty much leaves 100BaseT. So, somewhere along 2GB or so, OCFS2 crumbled. Hmmm, lowered expectations. And the fun continued:

Node1 restarted, but crashed again attempting to reenter the cluster.
Leaving Node1 down, attempted reboot of Node2 and Node3.
Both panic crashed during restart attempting to start OCFS2 and join the cluster.
Eventually, found that we had to start Node1 first, then restart the other two nodes.

Good grief, I’m not even going to comment on that bit, but I will point out that the suggested workaround to use the O_DIRECT enabled coreutils seems off mark. The user is trying to scp(1), not cp(1) or mv(1).

If It Isn’t Free, It’s Junk. Ad Revenue Funds Robust Software Development.
In spite of the fact that Ray Lane says traditional software products are soon to be replaced by cobbled together bits and pieces of open source stuff or what Wharton refers to as “ad supported software”, sometimes the good things in life are not free.

Huge Amounts of Unstructured Data
A recent article in Information Week’s Optimize Magazine covered one of PolyServe’s customers, Dunnhumby. These folks manipulate a lot of data using HP Blades as compute nodes accessing data over NFS in a PolyServe File Serving Utility scalable NAS solution. In their own words:

Each week, more than 40 terabytes of data is generated […]

“Hold it”, you say, that’s a comparison of OCFS2 to PolyServe CFS via NFS. What does OCFS2 have to do with NFS? That is a good question. OCFS2 is proclaimed to be a general purpose filesystem (emphasis added by me):

WHAT IS OCFS2?

OCFS2 is the next generation of the Oracle Cluster File System for Linux. It is an extent based, POSIX compliant file system. Unlike the previous release (OCFS), OCFS2 is a general-purpose file system

So why not export OCFS2 filesystems via NFS? That is the sort of thing you do with a general purpose filesystem after all. And, since OCFS2 is a cluster filesystem there shouldn’t be any second thoughts about exporting the same filesystems from multiple nodes—that’s scalable file serving. In fact, that has been tried before. That URL points to a bug report where a user was trying to implement scalable file serving using OCFS2. He reports:

I’m using OCSF2 for backups and to store files used by nfs clients. We have some errors during three file uploading from remote clients. In that case only one node can access those files but the other node receive from dlm a bad lockres error message […]

Right, OK. So what came next? Read on:

So I tried to stop ocfs2 and o2cb services on the second node but I can’t because heartbeat prevents any stop attempt. A stop attempt on the first node instead hungs and I have to reboot the first node because it is impossible to unmount ocfs2 filesystems (even if I use the lazy option).

I’m sure it couldn’t get any worse, right? He continued:

That is a serious problem because to recover the right functionality I had to reboot the first node (o2cb/ocfs2 services hang and after reboot ASM losts spfiles, so problem impacts even the databases running on cluster). There is any kind of action I can do to avoid that?

Surely he must be doing something really convoluted to hit problems so easily! He explains the scenario:

The scenario is:
node X exports filesystem to host Y
node W exports filesystem to host Z

from Y I create a file then I delete it then ls command on Z lists the file but I cannot open it. I receive I lot of messages like this:

Oct 20 08:53:34 proxb31 kernel: (15612,1):ocfs2_populate_inode:234 ERROR:
Invalid dinode: i_ino=9977187, i_blkno=9977187, signature = INODE01, flags = 0x0
Oct 20 08:53:34 proxb31 kernel: (15612,1):ocfs2_read_locked_inode:389 ERROR:
populate inode failed! i_blkno=9977187, i_ino=9977187

Good grief! Cache coherency problems? You mean like this warning about OCFS cache coherency :

Reasons for using odirect cp:

1. Buffered and direct ios are still racy in the kernel. As Oracle is doing directio, doing a normal cp exposes one to the chance of copying a stale page data.

2. Direct ios are less stressful on the page cache. As Oracle datafiles are invariably large, directio is more efficient in the long run.

3. In a clustered environment, the blocks on disk could be updated by any nodes in the cluster. Using odirect io ensures the latest version of the block is always read.

Oh boy. Anyway, back to the bug report. The bug report states that as of January 4, 2007, there is a patch for NFS exported OCFS2 problem being tested at Oracle, however, the following comment was given to help set expectations:

One thing I’m concerned with is having two clients connect to seperate nodes. Since NFSD is not cluster aware, there may be some issues with unlinked inodes being in cache on one node and looked up on another. Is it possibleto confine your nfs exports to a single node for now, until we can get a better handle on that particular issue.

That seems like something that should have been spelled out in the Product Requirements Document, but I’m old-fashioned.

Scalable File Serving with Linux. Who Needs a Cluster-Aware NFSD?
The NAS heads in a PolyServe File Serving Utility configuration (e.g., HP EFS Clustered Gateway), run the enterprise distributions; RHEL4 and SuSE SLES9. So while those folks in the Ray Lane and Wharton’s open source dream world might think that NFSD cannot function in a cluster with data consistency, PolyServe—with that dying traditional software model—seems to have pulled it off. Do you think Dunnhumby pushes 40TB of data per week through a PolyServe File Serving Utility cluster without NFSD scalability or—more importantly—cache coherency? Not a chance.



							

DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 2,953 other followers

Oracle ACE Program Status

Click It

website metrics

Fond Memories

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.

%d bloggers like this: