Archive for the 'filesystem performance' Category

Oracle’s Latest Filesystem Offering. Shades of AdvFS? I Want My OLT!

In this press release, Oracle has started making the existence of the Oracle Linux Test Kit a bit more widely known. I’ve been playing with this kit for a little while now and planned to blog my experiences. It has test components that work against Oracle instances. I’ll blog about my findings as soon as I can.

Long Live AdvFS?
The press release also announces Oracle’s new open source file system called Btrfs. If it was just another file system I wouldn’t mention it in my blog, but the project page mentions plans for features that I think are absolutely critical such as:

  1. Space efficient packing of small files
  2. Writable snapshots
  3. Object level mirroring and striping
  4. Strong integration with device mapper for multiple device support
  5. Efficient incremental backup and FS mirroring

This thing sure smells like a reincarnation of AdvFS, but what do I know?

The project page states that the developers are not even looking into database workloads at the moment but since it is an extent based filesystem I see no reason it wouldn’t work quite well with Oracle databases. I’m most keen on the list of features I’ve listed above though because they go a long way to make life better in production Oracle environments. First, number 1 on the list is good handling of small files. This will be a boon for the handling of trace and logging and a lot of the other ancillary file creation that goes on external to the database. Small files (at least myriads of them) are the bane of most filesystems.

Next on the list are clones-or as the project page calls them writable snapshots. Veritas has offered these for quite some time and they are a very nice feature for creating working copies of databases that are current up to the moment you create the snapshot. Creating a snapshot doesn’t impact performance and a good implementation of clones will not impact the real database even if you are doing a reasonable amount of writes on the clone. Also very cool!

Then there is number 3 on the list-object level mirroring and striping. When all good filesystems “grow up” they offer this sort of feature. Being able to implement software RAID on a file basis is the ultimate cool and I can’t wait to give this one a play! Being able to hand pick which Oracle files-or general purpose files-upon which to apply varying software RAID characteristics is a very nice feature.

Number 4 on the list is Device Mapper support. I’ve ranted on my blog and other forums about Linux handling of device names for quite some time and DM goes a long way to address the pet-peaves. Seeing this filesystem exploit DM at design level is a good thing.

Finally, number 5 above suggests good support for filesystem mirroring. This technology has proven itself useful for Veritas and NetApp customers-as well as others. It’s good to see it in Btrfs.

Clustered Btrfs?
Some time back I made a blog entry about ZFS and discussed the notion of “cluster butter.” That notion refers the the idea that you can take any filesystem and “clusterize” it. Generally adding cluster support after the design phase does not produce much of a cluster filesystem. So, I intend to dive in to see what underpinnings there are in Btrfs for future cluster support.

Apple OS/X with ZFS Shall Rule the World.

The problem with blogging is there is no way to clearly write something tongue-in-cheek. No facial expressions, no smirking, no hand waving—just plain old black and white. That aside, I can’t resist posting a follow-up to a post on StorageMojo about ZFS performance. I’m not taking a swipe at StorageMojo because it is one of my favorite blogs. However, every now and again third party perspective is a healthy thing. The topic at hand is this post about ZFS performance which is a digest of this post on The original post starts with:

I have seen many benchmarks indicating that for general usage, ZFS should be at least as fast if not faster than UFS (directio not withstanding)[…]

Right off the bat I was scratching my head. I have experience with direct I/O dating back to 1991 and have some direct I/O related posts here. I was trying to figure out what the phrase “directio not withstanding” was supposed to mean. The only performance boost direct I/O can possibly yield is due to the elimination of the write-ordering locks generally imposed by POSIX and the double-buffering/memcopys overhead associated with the DMA into the page cache—and subsequent copy into the user address space buffer. So for a read-intensive benchmark, about the only thing one should expect is less processor overhead, not increased throughput.

Anyway, the post continues:

To give a little background : I have been experiencing really bad throughput on our 3510-based SAN. The hosts are X4100s, 12Gb RAM, 2x dual core 2.6Ghz opterons and Solaris 10 11/06. They are each connected to a 3510FC dual-controller array via a dual-port HBA and 2 Brocade SW200e switches, using MxPIO. All fabric is at 2Gb/s […]

OK, the 4100 is a 2-way Opteron 2000 server which, being a lot like an HP Proliant DL385, should have no problem consuming all the data the 2 x 2Gb FCP paths can deliver—roughly 400MB/s. If the post is about a UFS versus ZFS apples-apples comparison, both results should top out at the maximum theoretical throughput of 2x2Gb FCP. The post continues:

On average, I was seeing some pretty average to poor rates on the non-ZFS volumes depending on how they were configured, on average I was seeing 65MB/s write performance and around 450MB/s read on the SAN using a single drive LUN.

I’d be plenty happy with that 450MB/s given that configuration. The post insinuates this is direct I/O so getting 450MB/s out of 2x2Gb FCP should be the end of the contest. Further, getting 65MB/s written to a single drive LUN is quite snappy! What is left to benchmark? It seems UFS cranks the SAN at full bandwidth.

The post continues:

I then tried ZFS, and immediately started seeing a crazy rate of around 1GB/s read AND write, peaking at close to 2GB/s. Given that this was round 2 to 4 times the capacity of the fabric, it was clear something was going awry.

Hmmm…Opterons with HT 2.0 attaining peaks of 2GB/s write throughput via 2x2Gb FCP? There is nothing awry about this, it is simply not writing the data to the storage.

This is a memory test. ZFS is caching, UFS is not.

The post continues:

Many thanks to Greg Menke and Tim Bradshaw on comp.unix.solaris for their help in unravelling this mystery!

What mystery? I’d recommend testing something that runs longer than 1 second (reading 512MB from memory on any Opteron system should take about .6 seconds. I’d recommend a dataset that is about 2 fold larger than physical memory.

How Does ZFS and UFS Compare to Other Filesystems?
Let’s take a look at an equivalent test running on the HP Cluster Gateway internal filesystem (the former PolyServe PSFS). The following is a screen shot of a dd(1) test using real Oracle datafiles on a Proliant DL585. The first test executes a single thread of dd(1) reading the first 512MB of one of the files using 1M read requests. The throughput is 815MB/s. Next, I ran 2 concurrent dd(1) processes each chomping the first 512MB out of two different Oracle datafiles using 1MB read requests. In the concurrent case, the throughput was 1.46GB/s.

NOTE: You may have to right click->view the image.


But wait, in order to prove the vast superiority of the HP Cluster Gateway filesystem over ZFS using similar hardware, I fired off 6 concurrent dd(1) processes each reading the first 512MB out of six different files. Here I measured 2.9GB/s!


SAN Array Cache and Filers Hate Sequential Writes

Or at least they should.

Jonathan Lewis has taken on a recent Oracle-l thread about Thinking Big. It’s a good blog entry and I recommend giving it a read. The original Oracle-l post read like this:

We need to store apporx 300 GB of data a month. It will be OLTP system. We want to use 
commodity hardware and open source database. we are willing to sacrifice performance 
for cost. E.g. a single row search from 2 billion rows table should be returned in 2 sec.

I replied to that Oracle-l post with:

Try loading a free Linux distro and typing :
man dbopen
man hash
man btree
man mppo
man recno

Yes, I was being sarcastic, but on the other hand I have been involved with application projects where we actually used these time-tested “database” primitives…and primitive they are! Anyway, Jonathan’s blog entry actually took on the topic and covers some interesting aspects. He ends with some of the physical storage concepts that would likely be involved. He writes:


SANs can move large amounts of data around very quickly – but don’t work well with very large numbers of extremely random I/Os. Any cache benefit you might have got from the SAN has already been used by Oracle in caching the branch blocks of the indexes. What the SAN can give you is a write-cache benefit for the log writer, database writer, and any direct path writes and re-reads from sorting and hashing.

Love Your Cache, Hate Large Sequential Writes
OK, this is the part about which I’d like to make a short comment—specifically about Log Writer. It turns out that most SAN arrays don’t handle sequential writes well either. All told, arrays shouldn’t be in the business of caching sequential writes (yes, there needs to be a cut-off there somewhere). I’ve had experiences with some that don’t cache sequential writes and that is generally good. I’ve had experiences with a lot that do and when you have a workload that generates a lot of redo, LGWR I/O can literally swamp an array cache. Sure, the blocks should be cached long enough for the write back to disk, but allowing those blocks to push into the array cache any further than the least of the LRU end makes little sense. Marketing words for arrays that handle these subtleties usually sound like, “Adaptive Array Cache”, or words to that effect.

One trick that can be used to see such potential damage is to run your test workload with concurrent sequential write “noise.” If you create a couple of files the same size as your redo logs and loop a couple of dd(1) processes performing 128K writes—without truncating the files on open—you can drive up this sort of I/O to see what it does to the array performance. If the array handles the caching of sequential writes, without polluting the cache, you shouldn’t get very much damage. An example of such a dd(1) command is:

$ dd if=/dev/zero of=pseudo_redo_file_db1 bs=128k count=8192 conv=notrunc &

$ dd if=/dev/zero of=pseudo_redo_file_db2 bs=128k count=8192 conv=notrunc &

$ wait

Looping this sort of “noise workload” will simulate a lot of LGWR I/O for two databases. Considering the typical revisit rate of the other array cache contents, this sort of dd(1) I/O shouldn’t completely obliterate your cache. If it does, you have an array that is too fond of sequential writes.

What Does This Have To Do With NAS?
This sort of workload can kill a filer. That doesn’t mean I’m any less excited about Oracle over NFS—I just don’t like filers. I recommend my collection of NFS related posts and my Scalable NAS for Oracle paper for background on what sequential writes can do to certain NAS implementations.

I’ll be talking about this topic and more at Utah Oracle User Group on March 21st.


Copying Files on Solaris. Slow or Fast, It’s Your Choice. Part III

This is the third installment on this thread. For context, please see:
Copying Files on Solaris. Slow or Fast, It’s Your Choice. Part I
Copying Files on Solaris. Slow or Fast, It’s Your Choice. Part II

What About cp8M Versus Stock cp(1) with Non-forcedirectio?
That is a good question. The saga continues after my post about copying files on Solaris. Once again, Padraig O’Sullivan was kind enough to test cp8M (available here ) versus stock cp(1) using a normal mounted UFS (non-forcedirectio). He reports:

Ok, I ran the benchmark in the same manner as before WITHOUT forcedirectio i.e. I rebooted the machine before each copy of the file.

# time /usr/bin/cp large_file large_file.1
real 2m17.504s
user 0m0.002s
sys 0m14.363s
# time /usr/bin/cp8m large_file large_file.
real 1m56.217s
user 0m0.003s
sys 0m14.264s

I don’t know. I certainly did not expect an increase in kernel mode cycles for the mmap-enabled cp(1). Please refer to Part I in this series to see that the comparison here is between 14.363s versus 10.853s of kernel-mode cycles.  We’re not talking a little increase. No, what Padraig’s measurements show is an increase of 32% in kernel mode cycles when copying a 1000MB file using stock cp(1) on a regular mount compared to the same work on a file in a forcedirectio mount. But hey, at least the performance (in MB/s) was consistently 16% less than cp8M. Yes, that was sarcasm.

I haven’t yet gotten my head around why the standard mmap-enabled cp(1) suffers such a jump in kernel mode processor overhead when switching from direct I/O to a normal UFS mount. I need to think about that a bit.

As usual, a picture speaks a thousand words, so I’ll provide two:



Remember my rant about the “small test?”

Sharing, and Caring
There was a comment by a reader on Part I of this blog thread that is worthy of discussion. The reader commented:

Perhaps a fairly obvious statement this, but notice the use of MAP_SHARED on the mmap call? – (I suspect you’ve spotted that already). This means that multiple processes can attach to the same memory mapped file simultaneously.

That is a good blog comment and evidence of someone giving it some thought. But, I’d like to comment on the sharable aspect of the 8MB map the reader brought out. I replied:

Your point about the kernel bcopy from UFS read buffers to the heap buffer in the address space of the cp(1) process is a good one, but this is a forcedirectio case.

That means there is no copy from the page cache into the virtual address space of process since it is direct I/O. My reply continued:

Since this is an Oracle blog, I would naturally go with the forcedirectio comparison first. It will be interesting to see with a normal UFS mount.

I’ve got a $2 bet that the MAP_SHARED is only there to facilitate copying an already mmapped file…the odds of a process jumping in and sharing a 8MB map that only lives for the duration of an I/O in and an I/O out seems pretty slim to me…but then that is 8MB twice…hmmm…I guess that 8MB mmap could exist for as much as 2-3 seconds if the I/O is headed for a single, simple drive. Sounds like a race just to share an 8MB map to me.

A Closer Look
Yes, when the stock cp(1) mmaps each 8MB segment of the input file it does so with MAP_SHARED. Like I said, I suspect the only thought behind that flag usage was to ensure there wouldn’t be “twinkling” mmap failures by other processes that could potentially be mmapping parts of that file while cp(1) is walking through it.

The reader’s comment continued:

That’s not to say that they all need to be “cp”’s – anything using mmap() on the same file at about same time will yield a benefit – the 8MB chunk paged in by mmap should only be later reclaimed by the pagescanner (or when the last process detaches?).

I already discussed the odds of another process getting in there and benefiting by that very transient mmap. It is 8MB in size and only valid during the read in, and write out—about 2-3 seconds on a really slow disk subsystem.

What’s this about reclaims? Good topic. When the mmap is dissolved through munmap(), the pages of the file are put on the free list (pagecache). Here is where the non-forcedirectio cp8M and cp(1) have a lot in common. In both cases, the blocks from the input file remain in main memory. Now that is where there is some true opportunity for sharing but only in the non-forcedirectio case. All said, it doesn’t take mmap() to get sharing of file contents being copied when you are using UFS with a normal mount.

So the question remains, what’s up with the mmap()-enabled cp(1)?

Is this thread making you sleepy?

Copying Files on Solaris. Slow or Fast, It’s Your Choice. Part II


In my post about copying files on Solaris, I cover a modification that to GNU cp(1) that yields substantial performance improvements over the stock cp(1) when copying files on forcedirectio mounts. See my comments at the end of that post regarding non-forcedirectio mounts.

I’ve gotten a large number of requests for that code. I don’t have Solaris gear around here so I asked Padraig O’Sullivan to do the favor of testing the modified GNU cp(1) versus stock cp(1). Padraig states:

Here is his recipe for making a cp8M for solaris:

I used coreutils version 5.2.1 which can be obtained from here:

In the coreutils-5.2.1/src directory I modified the copy.c file at line 287 with the following modification:

# diff -b coreutils-5.2.1/src/copy.c cp8m/copy.c


<   buf_size = ST_BLKSIZE (sb);

>   /* buf_size = ST_BLKSIZE (sb);*/


>      buf_size = 8388608 ;

I then used the Makefile which is supplied with coreutils to build the cp binary.

The intent of this whole blog thread is really nothing more than diving into filesystem, and to some degree VM, internals topics so I truly hope that others will do this test. It would be interesting to see whether the mmap-enabled stock cp(1) is better than cp8M with, say, normal UFS mounts. The difference is clear to me on forcedirectio and that is, naturally, the type of mount I would take the most interest in.


Standard File System Tools? We Don’t Need No Standard File System Tools!

Yesterday I posted a blog entry about copying files on Solaris. I received some side channel email on the post such as one with the following tidbit from a very good, long time friend of mine. He wrote:

So optimizing cp() is now your hobby? What’s next….. “ed”… no wait “df”.. boy it sure would be great if I could get a 20% improvement in “ls”… I am sure these commands are limiting the number of orders/hr my business can process :)))

Didn’t that blog entry show a traditional cp(1) implementation utilizing 26% less kernel mode processor cycles? Oh well.

It’s About the Whole System
While those were words spoken in jest, it warrants a blog entry and I’ll tell you why. It is true this is an Oracle related blog and such filesystem tools as cp(1) are not in the Oracle code path. I blog about these things for two reasons: 1) a lot of my readers enjoy learning more about the platform in general and 2) many—perhaps most—Oracle systems have normal file system tools such as cp(1), compress(1) and others running while Oracle is running. For that matter, the Oracle server can call out to the same libraries these tools use for such functionality as BFILE and UTL_FILE. For that reason, I feel these topics are related to Oracle platforms. After all, a garbage-can implementation of the standard filesystem tools—and/or the kernel code paths that service them—is going to take cycles away from Oracle. Now please don’t quote me as saying the mmap()-enabled Solaris cp(1) is a “garbage-can” implementation. I’m just making the point that if such tools are implemented poorly Oracle can be affected even though they are not in the scope of a transaction. It’s about the whole system.

Legacy Code. What Comes Around…Stays Around.
Let’s not think for even a moment that the internals of such tools as ls(1) and df(1) are beyond scrutiny. Both ls(1) and df(1) use the stat(2) system call. We Oracle-minded folks often forget that there is much more unstructured data than structured so it is a good thing there are still some folks like PolyServe (HP) minding the store for the performance of such mundane topics as stat(2). Why? Well, perfect examples are the online photo operations such as Snapfish. Try having thousands of threads accessing tens of millions of files (photos) for fun. See, Snapfish uses the HP Enterprise File Services Clustered Gateway NAS powered by PolyServe. You can bet we pay attention to “mundane” topics like what ls(1) behaves like in a directory with 1, 2 or 100 million small files. The stat(2) system call is extremely important in such situations.

He’s Off His Rocker—This is an Oracle Blog.
What could this possibly have to do with Oracle? Well, if you run Oracle on a platform that only specializes in the code underpinnings of the most common server I/O (e.g., db file sequential read, db file scattered read, direct path read/write, LGWR and DBWR writes), you might not end up very happy if you have to do things that hammer the filesystem with Oracle features like UTL_FILE, BFILE, external tables, imp/exp and so forth, cp(1), tar(1), compress(1) and so on. It’s all about taking a holistic view instead of “camps” that focus on segments of the I/O stack.

As the cliché goes, standard file operations and highly specialized Oracle code paths are often joined at the hip.


I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 747 other subscribers
Oracle ACE Program Status

Click It

website metrics

Fond Memories


All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.

%d bloggers like this: