File Systems For A Database? Choose One That Couples Direct I/O and Concurrent I/O. What’s This Have To Do With NFS? Harken Back 5.2 Years To Find Out.

It was only 1,747 days ago that I posted one of the final blog entries in a long series of posts regarding multi-headed scalable NAS suitability for Oracle Database (see index of NAS-related posts). The post, entitled ASM is “not really an optional extra” With BIGFILE tablespaces, aimed to question the assertion that one must use ASM for bigfile tablespaces. At the time there were writings on the web that suggested a black-and-white state of affairs regarding what type of storage can handle concurrent write operations. The assertion was that ASM supported concurrent writes and all file systems imposed POSIX write-ordering semantics, and therefore they’d be bunk for bigfile tablespace support. In so many words I stated that any file system that matters for Oracle supports concurrent I/O when Oracle uses direct I/O. A long comment thread ensued, and instead of rehashing points I made in the long series of prior posts on the matter, I decided to make a fresh entry a few weeks later entitled Yes Direct I/O Means Concurrent Writes. That’s all still over 5 years ago.

Please don’t worry, I’m not blogging about 151,000,000-second-old blog posts. I’m revisiting this topic because a reader posted a fresh comment on the 41,944-hour-old post to point out that Ext derivatives implement write-ordering locks even with O_DIRECT opens. I followed up with:

I’m thinking of my friend Dave Chinner when I say this, “Don’t use file systems that suck!”

I’ll just reiterate what I’ve been saying all along. The file systems I have experience with mate direct I/O with concurrent I/O. Of course, I “have experience” with ext3 but have always discounted the ext variants for many reasons, most importantly the fact that I spent 2001 through 2007 with clustered Linux…entirely. So there was no ext on my plate nor in my cross-hairs.

I then recommended to the reader that he try his tests with NFS to see that the underlying file system (in the NFS server) really doesn’t matter in this regard because NFS supports direct I/O with concurrent writes. I got no response from that recommendation, so I set up a quick proof and thought I’d post the information here. If I haven’t lost you yet for resurrecting a 249-week-old topic, please read on:

File Systems That Matter
I mentioned Dave Chinner because he is the kernel maintainer of XFS. XFS matters, NFS matters, and honestly, most file systems that are smart enough to drop write-ordering locks when supporting direct I/O matter.

To help readers see my point I set up a test wherein:

  1. I used a simple script to measure single-file write scalability from one to two writers with ext3.
  2. I then exported that ext3 file system via loopback and accessed the files through an NFS mount to ascertain single-file write scalability from one to two writers.
  3. I then performed the same test as in step 1 with XFS.
  4. I then exported the XFS file system, mounted it via NFS, and repeated the test from step 2.

Instead of a full-featured benchmark kit (e.g., fio, sysbench, iometer, bonnie, ORION) I used a simple script because a simple script will do. I’ll post links to the scripts at the end of this post.
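For the curious, the driver logic is simple enough to sketch. What follows is a hypothetical reconstruction, not the actual test.sh (the real scripts are linked at the end of this post, so the function name and defaults here are my guesses): it forks N concurrent dd writers against a single pre-created file using direct, non-truncating writes, and reports elapsed wall-clock seconds.

```shell
# Hypothetical reconstruction of the test driver. Launches N concurrent dd
# writers against the SAME file and prints elapsed wall-clock seconds.

run_writers() {
  writers=$1 file=$2 bs=$3 count=$4 flags=$5
  start=$(date +%s)
  i=0
  while [ "$i" -lt "$writers" ]; do
    # every writer hits the same file; oflag=direct bypasses the page cache,
    # conv=notrunc keeps dd from truncating the file out from under its peer
    dd if=/dev/zero of="$file" bs="$bs" count="$count" $flags 2>/dev/null &
    i=$((i + 1))
  done
  wait                      # block until every writer completes
  end=$(date +%s)
  echo $((end - start))     # elapsed seconds, the number fed to tally.sh
}

# The runs in this post: a 4 GB file rewritten in 8 KB blocks (524288 I/Os
# per writer). Commented out here because it moves 4 GB per writer:
# run_writers 2 bigfile 8k 524288 "oflag=direct conv=notrunc"
```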

Test Case 1

The following shows a freshly created ext3 file system, creation of a single 4GB file, and flushing of the page cache. I then execute the test.sh script, first with a single process (dd with oflag=direct and conv=notrunc) and then with two. The result is no scalability.

# mkfs.ext3 /dev/sdd1
mke2fs 1.39 (29-May-2006)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
13123584 inodes, 26216064 blocks
1310803 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
801 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 31 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
# mount /dev/sdd1 /disk
# cd /disk
# tar zxf /tmp/TEST_KIT.tar.gz
# sh -x ./setup.sh
+ dd if=/dev/zero of=bigfile bs=1024K count=4096 oflag=direct
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 9.64347 seconds, 445 MB/s
#
#
# sync;sync;sync;echo 3 > /proc/sys/vm/drop_caches
#
# sh ./test.sh 1
24
# sh ./tally.sh 24
TotIO: 524288 Tm: 24 IOPS: 21845.3
# sh ./test.sh 2
49
# sh ./tally.sh 49
TotIO: 1048576 Tm: 49 IOPS: 21399.5
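
For reference, tally.sh’s arithmetic reduces to total I/O operations divided by elapsed seconds. Here is a minimal sketch (a hypothetical reconstruction; the real tally.sh is linked at the end of this post):

```shell
# tally.sh arithmetic (hypothetical reconstruction): IOPS = total I/Os / seconds.
# A 4 GB file written in 8 KB blocks is 524288 I/Os per writer.
tally() {
  awk -v t="$1" -v n="$2" 'BEGIN { printf "TotIO: %d Tm: %d IOPS: %.1f\n", n, t, n / t }'
}

tally 24 524288    # one ext3 writer  -> TotIO: 524288 Tm: 24 IOPS: 21845.3
tally 49 1048576   # two ext3 writers -> TotIO: 1048576 Tm: 49 IOPS: 21399.5
```

Two writers taking roughly twice as long as one to do twice the I/O is exactly what zero scalability looks like.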

ext3 is a file system I truly do not care about. So what happens if I run the workload accessing the downwind files via NFS?

Test Case 2

The following shows how I set up to serve the ext3 file system via NFS, mounted it loopback-local, and re-ran the test. The baseline suffered a 32% decline in IOPS because a) ext3 isn’t exactly a good embedded file system for a filer and b) I didn’t tune anything. However, the model shows 75% scalability. That’s more than zero scalability.

#  service nfs start
Starting NFS services:                                     [  OK  ]
Starting NFS quotas:                                       [  OK  ]
Starting NFS daemon:                                       [  OK  ]
Starting NFS mountd:                                       [  OK  ]
# mount -t nfs -o rw,bg,hard,nointr,tcp,vers=3,timeo=300,rsize=32768,wsize=32768 localhost:/disk /mnt
# cd /mnt
# rm bigfile
# sh -x ./setup.sh
+ dd if=/dev/zero of=bigfile bs=1024K count=4096 oflag=direct
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 9.83931 seconds, 437 MB/s
# sync;sync;sync;echo 3 > /proc/sys/vm/drop_caches
# pwd
/mnt
# sh ./test.sh 1
37
# sh ./tally.sh 37
TotIO: 524288 Tm: 37 IOPS: 14169.9
# sh ./test.sh 2
49
# sh ./tally.sh 49
TotIO: 1048576 Tm: 49 IOPS: 21399.5

Test Case 3

Next I moved on to test the non-NFS case with XFS. The baseline showed parity with the single-writer ext3 case, but the two-writer case showed a 40% improvement in IOPS. Going from one to two writers exhibited 70% scalability. Don’t hold that against me though; it was a small setup with 6 disks in RAID5. It’s maxed out. Nonetheless, any scalability is certainly more than no scalability, so the test proved my point.

# umount /mnt
# service nfs stop
Shutting down NFS mountd:                                  [  OK  ]
Shutting down NFS daemon:                                  [  OK  ]
Shutting down NFS quotas:                                  [  OK  ]
Shutting down NFS services:                                [  OK  ]
# umount /disk

# mkfs.xfs /dev/sdd1
mkfs.xfs: /dev/sdd1 appears to contain an existing filesystem (ext3).
mkfs.xfs: Use the -f option to force overwrite.
# mkfs.xfs /dev/sdd1 -f
meta-data=/dev/sdd1              isize=256    agcount=16, agsize=1638504 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=26216064, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096  
log      =internal log           bsize=4096   blocks=12800, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0

# mount /dev/sdd1 /disk
# cd /disk
# tar zxf /tmp/TEST_KIT.tar.gz
# sh -x ./setup.sh
+ dd if=/dev/zero of=bigfile bs=1024K count=4096 oflag=direct
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 4.83153 seconds, 889 MB/s
# sync;sync;sync;echo 3 > /proc/sys/vm/drop_caches
# sh ./test.sh 1
24
# sh ./tally.sh 24
TotIO: 524288 Tm: 24 IOPS: 21845.3
# sh ./test.sh 2
35
# sh ./tally.sh 35
TotIO: 1048576 Tm: 35 IOPS: 29959.3

Test Case 4

I then served up the XFS file system via NFS. The baseline (single writer) showed a 16% improvement over the NFS-exported ext3 case. Scalability was 81%. Sandbag the baseline, improve the scalability! 🙂 Joking aside, this proves the point about direct and concurrent I/O on NFS as well.

# cd /
# service nfs start
Starting NFS services:                                     [  OK  ]
Starting NFS quotas:                                       [  OK  ]
Starting NFS daemon:                                       [  OK  ]
Starting NFS mountd:                                       [  OK  ]
# mount -t nfs -o rw,bg,hard,nointr,tcp,vers=3,timeo=300,rsize=32768,wsize=32768 localhost:/disk /mnt
# cd /mnt
# rm bigfile
# sh -x ./setup.sh
+ dd if=/dev/zero of=bigfile bs=1024K count=4096 oflag=direct
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 6.95507 seconds, 618 MB/s
# sync;sync;sync;echo 3 > /proc/sys/vm/drop_caches
# sh ./test.sh 1
32
# sh ./tally.sh 32
TotIO: 524288 Tm: 32 IOPS: 16384.0
# sh ./test.sh 2
40
# sh ./tally.sh 40
TotIO: 1048576 Tm: 40 IOPS: 26214.4
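
Pulling the four runs together: since both two-writer runs issue exactly double the I/O of the baseline, scalability works out to the ratio of one-writer to two-writer elapsed time, where 100% is perfect scaling and 50% means the second writer bought nothing. A quick awk check of the percentages quoted above:

```shell
# Scalability = two-writer aggregate IOPS / (2 x one-writer IOPS), which for
# these runs (double the I/O in the two-writer case) reduces to t1/t2.
awk '
function scal(t1, t2) { return 100 * t1 / t2 }
BEGIN {
  printf "ext3, local:  %.1f%%\n", scal(24, 49)   # i.e. no scaling at all
  printf "ext3 via NFS: %.1f%%\n", scal(37, 49)
  printf "XFS, local:   %.1f%%\n", scal(24, 35)
  printf "XFS via NFS:  %.1f%%\n", scal(32, 40)
}'
```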

Scripts and example script output:

test.sh
tally.sh
example of test.sh output (handle with tally.sh)

The Moral Of This Blog Entry Is?
…multifarious:

  • Don’t leave comments open on threads for 5.2 years
  • Use file systems suited to the task at hand
  • Kevin is (and has always been) a huge proponent of the NFS storage provisioning model for Oracle
  • ASM is not required for scalable writes

6 Responses to “File Systems For A Database? Choose One That Couples Direct I/O and Concurrent I/O. What’s This Have To Do With NFS? Harken Back 5.2 Years To Find Out.”


  1. Noons August 12, 2011 at 6:56 am

    Not directly relevant here, but on AIX I’ve found using if=/dev/zero and of=/dev/null gives me a good baseline for how much each CPU/core/LPAR can give me in I/O terms, assuming an infinite-bandwidth file system.
    IOW: how fast the baseline memory/CPU combo is for I/O.

    • kevinclosson August 12, 2011 at 4:08 pm

      Actually, Noons, I’d have to disagree (unless I’m misinterpreting your point). By default, dd does not touch the data, so it is only a tickle of CPU. If you make dd actually touch the data then sure. Here’s an example of my point on Westmere-EP (Xeon 5600) with Linux 2.6:

      $ dd if=/dev/zero of=/dev/null bs=1M count=8192 ; dd if=/dev/zero of=/dev/null bs=1M count=8192 conv=ucase
      8192+0 records in
      8192+0 records out
      8589934592 bytes (8.6 GB) copied, 0.256025 seconds, 33.6 GB/s
      8192+0 records in
      8192+0 records out
      8589934592 bytes (8.6 GB) copied, 11.5908 seconds, 741 MB/s

      Now, this is pretty extreme because databases rarely peek at every byte as it flows off disk. The conv=ucase on all zeros is a read-only operation. If dd had an option to peek at every Nth byte in its read buffer then that would be better.

      • Noons August 13, 2011 at 3:27 pm

        dd likely behaves differently on AIX than it does on Linux. I don’t need to use conv=ucase to get a rough memory speed. With bs=8192 and count=10000 I get roughly 800MB/s. Adding conv=ucase makes little difference. The difference you show with Linux is interesting; it must be doing some form of optimisation by not even touching the buffers without the conv option!

        • kevinclosson August 13, 2011 at 4:09 pm

          We can be sure AIX and Linux implement dd differently. For sure, on Linux there is no peek at the buffer at all unless you tell it to with conv. There is no reason to peek into the buffer.

          For whatever reason, it seems, AIX deems fit to do more work than it really needs to in this regard.

          So, let’s see what 8K looks like on a few other platforms:

          MacOS 10 on Core i7 is light in this regard as well:

          $ dd if=/dev/zero of=/dev/null bs=8192 count=10000
          10000+0 records in
          10000+0 records out
          81920000 bytes transferred in 0.025381 secs (3227630301 bytes/sec)

          but when it has to peek at every byte:

          $ dd if=/dev/zero of=/dev/null bs=8192 count=10000 conv=ucase
          10000+0 records in
          10000+0 records out
          81920000 bytes transferred in 0.090786 secs (902341967 bytes/sec)

          So that’s Core i7 with the MacOS dd code peeking at every byte at the rate of 860MB/s.

          Now, back on that WSM-EP RHEL 5 server but with 8K instead of my original 1MB:

          gpadmin $ dd if=/dev/zero of=/dev/null bs=8192 count=10000
          10000+0 records in
          10000+0 records out
          81920000 bytes (82 MB) copied, 0.016386 seconds, 5.0 GB/s
          gpadmin $ dd if=/dev/zero of=/dev/null bs=8192 count=10000 conv=ucase
          10000+0 records in
          10000+0 records out
          81920000 bytes (82 MB) copied, 0.175084 seconds, 468 MB/s

          So, Noons, if I had my druthers I’d opt for a processor that can peek at every byte of an 8K block whizzing by at 800MB/s+ and ignore the no-touch case! Very interesting. This is clearly an apples-oranges comparison though because if the code is doing the same thing it should be faster on the WSM-EP than on Core i7 (Nehalem) yet it is nearly 50% slower.

          I’m a huge Power fan…but only from a distance since I rarely get to touch one of those beasts. IBM should set me up for more play time 🙂 Before I left IBM in 2000, I had plenty of dedicated time on a 24-processor S85. Oh, memories.

  2. Alexander J. Maidak August 12, 2011 at 8:48 pm

    What drives are you using? Assuming 200 IOPS per SAS disk, your 6-disk RAID 5 can deliver at best 1200 IOPS. You’re getting 20,000+… Either you’re doing I/O into caches or you’re using SSDs?

    • kevinclosson August 12, 2011 at 8:58 pm

      Hi Alexander,

      Yes, the general rule of thumb is that if it is impossible to do an IOPS rate of X with Y devices then there is caching in the middle. These drives are dangling off of an LSI controller with a small write-back cache. The drain-rate is what is holding the IOPS down to ~20K.

      Good question, but the point of the blog post is further up the stack.

