Archive for the 'ocfs' Category

Oracle’s Latest Filesystem Offering. Shades of AdvFS? I Want My OLT!

In this press release, Oracle has started making the existence of the Oracle Linux Test Kit a bit more widely known. I've been playing with this kit for a little while now and plan to blog my experiences. It has test components that work against Oracle instances. I'll post my findings as soon as I can.

Long Live AdvFS?
The press release also announces Oracle's new open source file system called Btrfs. If it were just another file system I wouldn't mention it in my blog, but the project page mentions plans for features that I think are absolutely critical, such as:

  1. Space efficient packing of small files
  2. Writable snapshots
  3. Object level mirroring and striping
  4. Strong integration with device mapper for multiple device support
  5. Efficient incremental backup and FS mirroring

This thing sure smells like a reincarnation of AdvFS, but what do I know?

The project page states that the developers are not even looking into database workloads at the moment, but since it is an extent-based filesystem I see no reason it wouldn't work quite well with Oracle databases. I'm most keen on the features listed above, though, because they go a long way toward making life better in production Oracle environments. First, number 1 on the list is good handling of small files. This will be a boon for trace files, logging, and a lot of the other ancillary file creation that goes on external to the database. Small files (at least myriads of them) are the bane of most filesystems.

Next on the list are clones, or as the project page calls them, writable snapshots. Veritas has offered these for quite some time and they are a very nice feature for creating working copies of databases that are current up to the moment you create the snapshot. Creating a snapshot doesn't impact performance, and a good implementation of clones will not impact the real database even if you are doing a reasonable amount of writes on the clone. Also very cool!
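
To give a feel for where this could go, here is a minimal sketch of taking a writable snapshot; the subvolume layout and paths are hypothetical, and I'm assuming the btrfs subvolume snapshot interface found in btrfs-progs rather than whatever the project ships at this early stage:

    # Hypothetical layout: Oracle datafiles live in a btrfs subvolume at /oradata.
    # Quiesce or hot-backup the database as appropriate, then:

    # Take a writable snapshot of the datafile subvolume (near-instant, space-shared).
    btrfs subvolume snapshot /oradata /oradata_clone

    # The snapshot is writable by default; point a test instance at the clone.
    ls /oradata_clone

The appeal is that the clone shares unmodified blocks with the original, so a working copy of a very large database costs almost nothing to create.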

Then there is number 3 on the list: object-level mirroring and striping. When all good filesystems "grow up" they offer this sort of feature. Being able to implement software RAID on a file basis is the ultimate cool and I can't wait to give this one a play! Being able to hand-pick which Oracle files, or general-purpose files, get which software RAID characteristics is a very nice feature.
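
At this stage I can only sketch the whole-filesystem flavor of this; the per-file granularity described on the project page is not something I can demonstrate yet. Assuming the mkfs.btrfs options in btrfs-progs, striping and mirroring characteristics can be requested like so:

    # Stripe data across two devices (raid0) while mirroring metadata (raid1).
    # Device names are hypothetical.
    mkfs.btrfs -d raid0 -m raid1 /dev/sdb /dev/sdc

    mkdir -p /mnt/btr
    mount /dev/sdb /mnt/btr
    btrfs filesystem df /mnt/btr   # shows the Data (RAID0) and Metadata (RAID1) profiles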

Number 4 on the list is Device Mapper support. I've ranted on my blog and in other forums about Linux handling of device names for quite some time, and DM goes a long way toward addressing those pet peeves. Seeing this filesystem exploit DM at the design level is a good thing.

Finally, number 5 above suggests good support for filesystem mirroring. This technology has proven itself useful for Veritas and NetApp customers, as well as others. It's good to see it in Btrfs.

Clustered Btrfs?
Some time back I made a blog entry about ZFS and discussed the notion of "cluster butter." That notion refers to the idea that you can take any filesystem and "clusterize" it. Generally, adding cluster support after the design phase does not produce much of a cluster filesystem. So, I intend to dive in to see what underpinnings there are in Btrfs for future cluster support.

Scalable NFS Powered By Open Source Cluster Filesystems

40 Terabytes Per Week With Linux-based Clusters at Dunnhumby
It seems reasonable to think that this company tested the open source clustering stuff, but I don't know for certain. There are folks out there using open source cluster filesystems for "large I/O" processing, as is apparent in this recent OCFS2 bug report (emphasis added by me):

During maintenance window, decided to use the OCFS2 filesystem to store a large backup file (about 5-10 gig file). SCP’ed the file from an outside server to node1 of the cluster […]

A little third-party perspective is necessary. Not even back in 1990, with Fujitsu Swallow IV drives, was 10GB considered “large.” The OCFS2 user that filed the bug continued:

After a few minutes, node1 crashed.

Let's think about that for a moment. The user is bringing unstructured data into the OCFS2 cluster filesystem using scp(1). Just for the heck of it, let's take the user at his word and do the math. He said, "After a few minutes." Let's say a few minutes is three, or 180 seconds. That means the scp(1) traffic was likely not over Gigabit Ethernet, because 180 seconds at full bandwidth on a single wire would be enough time to move about 20GB. That pretty much leaves 100BaseT. So, somewhere around 2GB or so, OCFS2 crumbled. Hmmm, lowered expectations. And the fun continued:
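
Just to show the arithmetic, here is a quick back-of-envelope calculation in shell; the 180-second window and the wire speeds are my assumptions, and protocol and ssh cipher overhead are ignored:

    # How much data moves at wire speed in the stated window?
    ELAPSED_SEC=180
    GBE_MB_PER_SEC=125        # Gigabit Ethernet: ~1000 Mbit/s / 8
    FAST_ETH_MB_PER_SEC=12    # 100BaseT: ~100 Mbit/s / 8, rounded down

    echo "GigE:     $(( ELAPSED_SEC * GBE_MB_PER_SEC / 1024 )) GB"       # ~21 GB
    echo "100BaseT: $(( ELAPSED_SEC * FAST_ETH_MB_PER_SEC / 1024 )) GB"  # ~2 GB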

Node1 restarted, but crashed again attempting to reenter the cluster.
Leaving Node1 down, attempted reboot of Node2 and Node3.
Both panic crashed during restart attempting to start OCFS2 and join the cluster.
Eventually, found that we had to start Node1 first, then restart the other two nodes.

Good grief, I'm not even going to comment on that bit, but I will point out that the suggested workaround to use the O_DIRECT-enabled coreutils seems off the mark. The user is trying to scp(1), not cp(1) or mv(1).
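
For reference, the sort of direct-I/O copy that workaround implies looks roughly like the following. I'm substituting GNU dd's oflag=direct for the O_DIRECT-enabled coreutils build, since the spirit (bypassing the page cache on the destination) is the same; the paths are made up, and none of this helps an scp(1) workload:

    # Copy a backup piece onto the OCFS2 mount using O_DIRECT writes,
    # bypassing the page cache on the destination side.
    # The block size just needs to be something sane; 1MB is a convenient choice.
    dd if=/stage/backup_piece.bkp of=/ocfs2/backup_piece.bkp bs=1M oflag=direct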

If It Isn’t Free, It’s Junk. Ad Revenue Funds Robust Software Development.
In spite of the fact that Ray Lane says traditional software products are soon to be replaced by cobbled-together bits and pieces of open source stuff, or what Wharton refers to as "ad-supported software," sometimes the good things in life are not free.

Huge Amounts of Unstructured Data
A recent article in Information Week’s Optimize Magazine covered one of PolyServe’s customers, Dunnhumby. These folks manipulate a lot of data using HP Blades as compute nodes accessing data over NFS in a PolyServe File Serving Utility scalable NAS solution. In their own words:

Each week, more than 40 terabytes of data is generated […]

"Hold it," you say, "that's a comparison of OCFS2 to PolyServe CFS via NFS. What does OCFS2 have to do with NFS?" That is a good question. OCFS2 is proclaimed to be a general-purpose filesystem (emphasis added by me):

WHAT IS OCFS2?

OCFS2 is the next generation of the Oracle Cluster File System for Linux. It is an extent based, POSIX compliant file system. Unlike the previous release (OCFS), OCFS2 is a general-purpose file system

So why not export OCFS2 filesystems via NFS? That is the sort of thing you do with a general-purpose filesystem, after all. And since OCFS2 is a cluster filesystem, there shouldn't be any second thoughts about exporting the same filesystems from multiple nodes; that's scalable file serving. In fact, that has been tried before. That URL points to a bug report where a user was trying to implement scalable file serving using OCFS2. He reports:
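
Scalable file serving, in this context, means something like the following sketch; the hostnames, paths, and export options are hypothetical, and the point is simply that the same OCFS2 filesystem gets exported from more than one node at once:

    # On node X and node W alike: the same OCFS2 filesystem is mounted at /u02
    # and exported to NFS clients.
    #
    # /etc/exports (identical on both nodes):
    #   /u02  *(rw,sync,no_root_squash)

    exportfs -ra              # re-read /etc/exports
    showmount -e localhost    # confirm the export is visible

    # Clients then spread their mounts across the servers:
    #   host Y:  mount nodeX:/u02 /mnt/data
    #   host Z:  mount nodeW:/u02 /mnt/data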

I’m using OCSF2 for backups and to store files used by nfs clients. We have some errors during three file uploading from remote clients. In that case only one node can access those files but the other node receive from dlm a bad lockres error message […]

Right, OK. So what came next? Read on:

So I tried to stop ocfs2 and o2cb services on the second node but I can’t because heartbeat prevents any stop attempt. A stop attempt on the first node instead hungs and I have to reboot the first node because it is impossible to unmount ocfs2 filesystems (even if I use the lazy option).

I’m sure it couldn’t get any worse, right? He continued:

That is a serious problem because to recover the right functionality I had to reboot the first node (o2cb/ocfs2 services hang and after reboot ASM losts spfiles, so problem impacts even the databases running on cluster). There is any kind of action I can do to avoid that?

Surely he must be doing something really convoluted to hit problems so easily! He explains the scenario:

The scenario is:
node X exports filesystem to host Y
node W exports filesystem to host Z

from Y I create a file then I delete it then ls command on Z lists the file but I cannot open it. I receive I lot of messages like this:

Oct 20 08:53:34 proxb31 kernel: (15612,1):ocfs2_populate_inode:234 ERROR:
Invalid dinode: i_ino=9977187, i_blkno=9977187, signature = INODE01, flags = 0x0
Oct 20 08:53:34 proxb31 kernel: (15612,1):ocfs2_read_locked_inode:389 ERROR:
populate inode failed! i_blkno=9977187, i_ino=9977187

Good grief! Cache coherency problems? You mean like this warning about OCFS cache coherency:

Reasons for using odirect cp:

1. Buffered and direct ios are still racy in the kernel. As Oracle is doing directio, doing a normal cp exposes one to the chance of copying a stale page data.

2. Direct ios are less stressful on the page cache. As Oracle datafiles are invariably large, directio is more efficient in the long run.

3. In a clustered environment, the blocks on disk could be updated by any nodes in the cluster. Using odirect io ensures the latest version of the block is always read.

Oh boy. Anyway, back to the bug report. The bug report states that as of January 4, 2007, there is a patch for the NFS-exported OCFS2 problem being tested at Oracle; however, the following comment was given to help set expectations:

One thing I’m concerned with is having two clients connect to seperate nodes. Since NFSD is not cluster aware, there may be some issues with unlinked inodes being in cache on one node and looked up on another. Is it possibleto confine your nfs exports to a single node for now, until we can get a better handle on that particular issue.

That seems like something that should have been spelled out in the Product Requirements Document, but I’m old-fashioned.
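
Until that patch materializes, the suggested stop-gap amounts to collapsing back to a single NFS server; a sketch, again with hypothetical names, would be:

    # Only node1 exports the OCFS2 filesystem; node2 and node3 export nothing.
    #
    # node1 /etc/exports:
    #   /u02  *(rw,sync,no_root_squash)

    # Every NFS client mounts node1, regardless of which cluster node is nearest:
    #   mount node1:/u02 /mnt/data

That, of course, is not scalable file serving; it is a single NFS server that happens to sit on a cluster filesystem.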

Scalable File Serving with Linux. Who Needs a Cluster-Aware NFSD?
The NAS heads in a PolyServe File Serving Utility configuration (e.g., HP EFS Clustered Gateway) run the enterprise distributions: RHEL4 and SuSE SLES9. So while those folks in Ray Lane's and Wharton's open source dream world might think that NFSD cannot function in a cluster with data consistency, PolyServe, with that dying traditional software model, seems to have pulled it off. Do you think Dunnhumby pushes 40TB of data per week through a PolyServe File Serving Utility cluster without NFSD scalability or, more importantly, cache coherency? Not a chance.


