It was only 1,747 days ago that I posted one of the final entries in a long series of posts regarding multi-headed scalable NAS suitability for Oracle Database (see the index of NAS-related posts). That post, entitled ASM is “not really an optional extra” With BIGFILE tablespaces, aimed to question the assertion that one must use ASM for bigfile tablespaces. At the time there were writings on the web suggesting a black-and-white state of affairs regarding what type of storage can handle concurrent write operations. The assertion was that ASM supported concurrent writes while all file systems imposed POSIX write-ordering semantics and would therefore be bunk for bigfile tablespace support. In so many words, I stated that any file system that matters for Oracle supports concurrent I/O when Oracle uses direct I/O. A long comment thread ensued and, instead of rehashing points I had made in the long series of prior posts on the matter, I made a fresh entry a few weeks later entitled Yes Direct I/O Means Concurrent Writes. That’s all still nearly five years ago.
Please don’t worry, I’m not blogging about 151,000,000-second-old blog posts. I’m revisiting this topic because a reader posted a fresh comment on the 41,944-hour-old post to point out that Ext derivatives implement write-ordering locks even with O_DIRECT opens. I followed up with:
I’m thinking of my friend Dave Chinner when I say this, “Don’t use file systems that suck!”
I’ll just reiterate what I’ve been saying all along: the file systems I have experience with mate direct I/O with concurrent I/O. Of course, I “have experience” with ext3, but I have always discounted the ext variants for many reasons, most importantly the fact that I spent 2001 through 2007 with clustered Linux…entirely. So there was no ext on my plate, nor in my cross-hairs.
I then recommended that the reader try his tests with NFS to see that the underlying file system (in the NFS server) really doesn’t matter in this regard, because NFS supports direct I/O with concurrent writes. I got no response to that recommendation, so I set up a quick proof and thought I’d post the information here. If I haven’t lost you yet by resurrecting a 249-week-old topic, please read on:
File Systems That Matter
I mentioned Dave Chinner because he is the kernel maintainer for XFS. XFS matters, NFS matters and honestly, most file systems that are smart enough to drop write-ordering when supporting direct I/O matter.
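To make the distinction concrete, here is a minimal sketch (the file name and sizes are arbitrary illustration values, not from my test kit) of two writers concurrently rewriting disjoint regions of one preallocated file, each opened with O_DIRECT via dd’s oflag=direct. On a file system that drops write-ordering for direct I/O, both writers proceed in parallel; on one that clings to POSIX write-ordering, they serialize on the inode lock:

```shell
#!/bin/sh
# Two concurrent direct-I/O writers against ONE file. The file name and
# sizes here are arbitrary illustration values.
F=/tmp/concurrent_demo
dd if=/dev/zero of=$F bs=1M count=16 2>/dev/null    # preallocate 16 MB

# Each writer opens the same file with O_DIRECT (oflag=direct) and uses
# conv=notrunc so neither truncates the other; seek= keeps their I/O in
# disjoint 8 MB halves of the file.
dd if=/dev/zero of=$F bs=1M count=8 seek=0 oflag=direct conv=notrunc 2>/dev/null &
dd if=/dev/zero of=$F bs=1M count=8 seek=8 oflag=direct conv=notrunc 2>/dev/null &
wait
ls -l $F
```

Whether those two background dd processes actually overlap in the kernel is precisely the property the tests below measure.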
To help readers see my point I set up a test wherein:
- I used a simple script to measure single-file write scalability from one to two writers with Ext3.
- I then exported that Ext3 file system via loopback and accessed the files via an NFS mount to ascertain single-file write scalability from one to two writers.
- I then performed the same test as in step 1 with XFS.
- I then exported the XFS file system, mounted it via NFS, and repeated the same test as in step 2.
Instead of a full-featured benchmark kit (e.g., fio, sysbench, iometer, bonnie, ORION) I used a simple script because a simple script will do. I’ll post links to the scripts at the end of this post.
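The actual scripts are linked at the end of this post. Purely as an illustration of the shape of such a harness (the 8 KB transfer size matches the I/O counts in the output below, since 524,288 × 8 KB = 4 GB, but the variable names are mine and the defaults are scaled way down from the 4 GB runs), something like this would behave the same way:

```shell
#!/bin/sh
# Sketch of the assumed test.sh/tally.sh logic: N concurrent dd writers
# rewrite one shared file with 8 KB direct writes, then IOPS = total I/Os
# divided by elapsed seconds. The runs in this post used COUNT=524288
# (4 GB); the default here is scaled down to 1024 (8 MB) for illustration.
WRITERS=${1:-1}
COUNT=${COUNT:-1024}          # 8 KB writes per writer
FILE=${FILE:-/tmp/bigfile}    # the post used ./bigfile on the fs under test

# Preallocate the file once so conv=notrunc rewrites hit existing blocks.
[ -f "$FILE" ] || dd if=/dev/zero of="$FILE" bs=8k count="$COUNT" 2>/dev/null

START=$(date +%s)
i=0
while [ "$i" -lt "$WRITERS" ]; do
  dd if=/dev/zero of="$FILE" bs=8k count="$COUNT" \
     oflag=direct conv=notrunc 2>/dev/null &
  i=$((i + 1))
done
wait
TM=$(( $(date +%s) - START ))

# tally.sh equivalent: report total I/Os, elapsed time, and IOPS.
awk -v t=$((WRITERS * COUNT)) -v s="$TM" \
    'BEGIN { if (s == 0) s = 1; printf "TotIO: %d Tm: %d IOPS: %.1f\n", t, s, t/s }'
```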
Test Case 1
The following shows a freshly created Ext3 file system, creation of a single 4GB file, and flushing of the page cache. I then executed the test.sh script, first with a single process (dd with oflag=direct and conv=notrunc) and then with two. The result is no scalability.
# mkfs.ext3 /dev/sdd1
mke2fs 1.39 (29-May-2006)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
13123584 inodes, 26216064 blocks
1310803 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
801 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
        2654208, 4096000, 7962624, 11239424, 20480000, 23887872

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 31 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
# mount /dev/sdd1 /disk
# cd /disk
# tar zxf /tmp/TEST_KIT.tar.gz
# sh -x ./setup.sh
+ dd if=/dev/zero of=bigfile bs=1024K count=4096 oflag=direct
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 9.64347 seconds, 445 MB/s
# sync;sync;sync;echo 3 > /proc/sys/vm/drop_caches
# sh ./test.sh 1
24
# sh ./tally.sh 24
TotIO: 524288 Tm: 24 IOPS: 21845.3
# sh ./test.sh 2
49
# sh ./tally.sh 49
TotIO: 1048576 Tm: 49 IOPS: 21399.5
Ext is a file system I truly do not care about. So what if I run the workload accessing the downwind files via NFS?
Test Case 2
The following shows that I set up the ext3 file system to be served via NFS, mounted it loopback-local, and re-ran the test. The baseline suffered roughly a 35% decline in IOPS because a) ext3 isn’t exactly a good embedded file system for a filer and b) I didn’t tune anything. However, the model shows 75% scalability. That’s more than zero scalability.
# service nfs start
Starting NFS services:  [  OK  ]
Starting NFS quotas:    [  OK  ]
Starting NFS daemon:    [  OK  ]
Starting NFS mountd:    [  OK  ]
# mount -t nfs -o rw,bg,hard,nointr,tcp,vers=3,timeo=300,rsize=32768,wsize=32768 localhost:/disk /mnt
# cd /mnt
# rm bigfile
# sh -x ./setup.sh
+ dd if=/dev/zero of=bigfile bs=1024K count=4096 oflag=direct
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 9.83931 seconds, 437 MB/s
# sync;sync;sync;echo 3 > /proc/sys/vm/drop_caches
# pwd
/mnt
# sh ./test.sh 1
37
# sh ./tally.sh 37
TotIO: 524288 Tm: 37 IOPS: 14169.9
# sh ./test.sh 2
49
# sh ./tally.sh 49
TotIO: 1048576 Tm: 49 IOPS: 21399.5
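For clarity on where the percentages come from: scalability as I use it here is the N-writer IOPS divided by N times the single-writer baseline. A quick awk check against the measured numbers (note the baseline decline versus local ext3 works out to just over 35%):

```shell
# Sanity-check the ratios using the measured IOPS values:
# local ext3 single writer = 21845.3, NFS-exported ext3 single = 14169.9,
# NFS-exported ext3 two writers = 21399.5.
awk 'BEGIN {
  base_local = 21845.3; base_nfs = 14169.9; two_nfs = 21399.5
  printf "baseline decline vs local ext3: %.1f%%\n", (1 - base_nfs/base_local) * 100
  printf "1->2 writer scalability: %.1f%%\n", two_nfs / (2 * base_nfs) * 100
}'
# prints: baseline decline vs local ext3: 35.1%
#         1->2 writer scalability: 75.5%
```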
Test Case 3
Next I moved on to the non-NFS case with XFS. The baseline showed parity with the single-writer Ext3 case, but the two-writer case showed a 40% improvement in IOPS. Going from one to two writers exhibited 70% scalability. Don’t hold that against me, though; it was a small setup with 6 disks in RAID5, and it’s maxed out. Nonetheless, any scalability is certainly more than no scalability, so the test proved my point.
# umount /mnt
# service nfs stop
Shutting down NFS mountd:   [  OK  ]
Shutting down NFS daemon:   [  OK  ]
Shutting down NFS quotas:   [  OK  ]
Shutting down NFS services: [  OK  ]
# umount /disk
# mkfs.xfs /dev/sdd1
mkfs.xfs: /dev/sdd1 appears to contain an existing filesystem (ext3).
mkfs.xfs: Use the -f option to force overwrite.
# mkfs.xfs /dev/sdd1 -f
meta-data=/dev/sdd1              isize=256    agcount=16, agsize=1638504 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=26216064, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal log           bsize=4096   blocks=12800, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
# mount /dev/sdd1 /disk
# cd /disk
# tar zxf /tmp/TEST_KIT.tar.gz
# sh -x ./setup.sh
+ dd if=/dev/zero of=bigfile bs=1024K count=4096 oflag=direct
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 4.83153 seconds, 889 MB/s
# sync;sync;sync;echo 3 > /proc/sys/vm/drop_caches
# sh ./test.sh 1
24
# sh ./tally.sh 24
TotIO: 524288 Tm: 24 IOPS: 21845.3
# sh ./test.sh 2
35
# sh ./tally.sh 35
TotIO: 1048576 Tm: 35 IOPS: 29959.3
Test Case 4
I then served up the XFS file system via NFS. The baseline (single writer) showed a 16% improvement over the NFS-exported ext3 case. Scalability was 80%. Sandbag the baseline, improve the scalability! 🙂 Joking aside, this proves the point about direct/concurrent I/O on NFS as well.
# cd /
# service nfs start
Starting NFS services:  [  OK  ]
Starting NFS quotas:    [  OK  ]
Starting NFS daemon:    [  OK  ]
Starting NFS mountd:    [  OK  ]
# mount -t nfs -o rw,bg,hard,nointr,tcp,vers=3,timeo=300,rsize=32768,wsize=32768 localhost:/disk /mnt
# cd /mnt
# rm bigfile
# sh -x ./setup.sh
+ dd if=/dev/zero of=bigfile bs=1024K count=4096 oflag=direct
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 6.95507 seconds, 618 MB/s
# sync;sync;sync;echo 3 > /proc/sys/vm/drop_caches
# sh ./test.sh 1
32
# sh ./tally.sh 32
TotIO: 524288 Tm: 32 IOPS: 16384.0
# sh ./test.sh 2
40
# sh ./tally.sh 40
TotIO: 1048576 Tm: 40 IOPS: 26214.4

Scripts and example script output: test.sh, tally.sh, example of test.sh output (handle with tally.sh)
The Moral Of This Blog Entry Is?
- Don’t leave comments open on threads for nearly 5 years
- Use file systems suited to the task at hand
- Kevin is (and has always been) a huge proponent of the NFS storage provisioning model for Oracle
- ASM is not required for scalable writes