The comment thread on my blog entry about the simplicity of NAS for Oracle got me thinking. I can’t count how many times I’ve seen people ask the following question:
Is N MB/s good throughput for Oracle over NFS?
Feel free to plug in any value you’d like for N. I’ve seen people ask if 40MB/s is acceptable. I’ve seen 60, 80, you name it, I’ve seen it.
And The Answer Is…
Let me answer this question here and now. The acceptable throughput for Oracle over NFS is full wire capacity. Full stop! With Gigabit Ethernet and large Oracle transfers, that is pretty close to 110MB/s. There are some squeak factors that might bump that number one way or the other, but only just a bit. Even with the most hasty of setups, you should expect very close to 100MB/s straight out of the box, per network path. I cover examples of this in depth in this HP whitepaper about Oracle over NFS.
The steps to a clean bill of health are really very simple. First, make sure Oracle is performing large I/Os. Good examples of this are tablespace CCF (create contiguous file) and full table scans with port-maximum multi-block reads. Once you verify Oracle is performing large I/Os, do the math. If you are not close to 100MB/s on a GbE network path, something is wrong. Determining what’s wrong is another blog entry. I want to capitalize on this nagging question about expectations. I reiterate (quoting myself):
Oracle will get line speed over NFS, unless something is ill-configured.
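The “do the math” step is quick back-of-the-envelope arithmetic. A sketch for one GbE path (the ~88% payload-efficiency figure is my rough estimate for Ethernet/IP/TCP/NFS framing overhead, not a measured number):

```shell
# One GbE path: 1000 megabits/s is 125 MB/s raw.
# Protocol overhead trims that to the ~110MB/s practical ceiling.
awk 'BEGIN {
  raw = 1000 / 8                  # 125 MB/s raw wire speed
  printf "raw GbE: %.0f MB/s, practical ceiling: ~%.0f MB/s\n", raw, raw * 0.88
}'
```

If your large-I/O throughput computes out well below that ceiling, something in the stack is misconfigured.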
I prefer to test for wire-speed before Oracle is loaded. The problem is that you need to mimic Oracle’s I/O. In this case I mean Direct I/O. Let’s dig into this one a bit.
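As an aside for anyone reproducing this on a newer system: GNU coreutils dd later gained per-side direct I/O flags (iflag=direct and oflag=direct), so each side of the copy can be flagged independently and character devices like /dev/null never see O_DIRECT. A sketch using the mount from this post (O_DIRECT can fail on filesystems that don’t support it):

```shell
# Direct read, discard output -- no flag needed on the /dev/null side:
dd if=/oradata2/file1 of=/dev/null bs=1M iflag=direct

# Direct on both sides for a file-to-file overwrite test:
dd if=/oradata2/file1 of=/oradata2/file2 bs=1M iflag=direct oflag=direct conv=notrunc
```

The dd(1) I used at the time (from coreutils 4.5.3, as shown below) predates these flags, hence the hunt for a special build.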
I need something like a dd(1) tool that does O_DIRECT opens. This should be simple enough. I’ll just go get a copy of the oss.oracle.com coreutils package that has O_DIRECT tools like dd(1) and tar(1). So here goes:
[root@tmr6s15 DD]# ls ../coreutils-4.5.3-41.i386.rpm
../coreutils-4.5.3-41.i386.rpm
[root@tmr6s15 DD]# rpm2cpio < ../coreutils-4.5.3-41.i386.rpm | cpio -idm
11517 blocks
[root@tmr6s15 DD]# ls
bin  etc  usr
[root@tmr6s15 DD]# cd bin
[root@tmr6s15 bin]# ls -l dd
-rwxr-xr-x  1 root root 34836 Mar  4  2005 dd
[root@tmr6s15 bin]# ldd dd
        linux-gate.so.1 =>  (0xffffe000)
        libc.so.6 => /lib/tls/libc.so.6 (0x00805000)
        /lib/ld-linux.so.2 (0x007ec000)
I have an NFS mount exported from an HP EFS Clustered Gateway (formerly PolyServe):
$ ls -l /oradata2
total 8388608
-rw-r--r--  1 root root 4294967296 Aug 31 10:15 file1
-rw-r--r--  1 root root 4294967296 Aug 31 10:18 file2
$ mount | grep oradata2
voradata2:/oradata2 on /oradata2 type nfs (rw,bg,hard,nointr,tcp,nfsvers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0,addr=192.168.60.142)
Let’s see what the oss.oracle.com dd(1) can do reading a 4GB file and over-writing another 4GB file:
$ time ./dd --o_direct=1048576,1048576 if=/oradata2/file1 of=/oradata2/file2 conv=notrunc
4096+0 records in
4096+0 records out

real    1m32.274s
user    0m3.681s
sys     0m8.057s
Test File Over-writing
What’s this bit about over-writing? I recommend using conv=notrunc when testing write speed. If you don’t, the file will be truncated and you’ll be testing write speeds burdened with file growth. Since Oracle writes the contents of existing files (unless creating or extending a datafile), it makes no sense to test writes to a file that is growing. Besides, the goal is to test the throughput of O_DIRECT I/O via NFS, not the filer’s ability to grow a file. So what did we get? Well, we transferred 8GB (4GB in, 4GB out) and did so in 92 seconds. That’s 89MB/s and honestly, for a single path, I would actually accept that since I have done absolutely no specialized tuning whatsoever. This is straight out of the box, as they say. The problem is that I know 89MB/s is not my typical performance for one of my standard deployments. What’s wrong?
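The truncation effect is easy to see with any stock dd(1); a small local demonstration (file names under /tmp are illustrative, and /dev/zero stands in for real data):

```shell
# Create an 8MB target file.
dd if=/dev/zero of=/tmp/nfs_write_test.dat bs=1M count=8 2>/dev/null

# Without conv=notrunc: dd truncates the target to zero length first,
# so a 4MB write also pays for re-growing the file from nothing.
dd if=/dev/zero of=/tmp/nfs_write_test.dat bs=1M count=4 2>/dev/null
ls -l /tmp/nfs_write_test.dat     # now 4MB: truncated, then regrown

# Recreate, then overwrite in place with conv=notrunc: no truncation,
# so writes land on already-allocated blocks, the way Oracle's do.
dd if=/dev/zero of=/tmp/nfs_write_test.dat bs=1M count=8 2>/dev/null
dd if=/dev/zero of=/tmp/nfs_write_test.dat bs=1M count=4 conv=notrunc 2>/dev/null
ls -l /tmp/nfs_write_test.dat     # still 8MB: only the first half was overwritten
```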
The dd(1) package supplied with the oss.oracle.com coreutils has a lot more in mind than O_DIRECT over NFS. In fact, it was developed to help OCFS1 deal with early cache-coherency problems. It turned out that mixing direct and non-direct I/O on OCFS was a really bad thing. No matter, that was then and this is now. Let’s take a look at what this dd(1) tool is doing:
$ strace -c ./dd --o_direct=1048576,1048576 if=/oradata2/file1 of=/oradata2/file2 conv=notrunc
4096+0 records in
4096+0 records out
Process 32720 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 56.76    4.321097        1054      4100         1 read
 22.31    1.698448         415      4096           fstatfs
 10.79    0.821484         100      8197           munmap
  9.52    0.725123         177      4102           write
  0.44    0.033658           4      8204           mmap
  0.16    0.011939           3      4096           fcntl
  0.02    0.001265          70        18        12 open
  0.00    0.000178          22         8           close
  0.00    0.000113          23         5           fstat
  0.00    0.000091          91         1           execve
  0.00    0.000015           2         8           rt_sigaction
  0.00    0.000007           2         3           brk
  0.00    0.000006           3         2           mprotect
  0.00    0.000004           4         1         1 access
  0.00    0.000002           2         1           uname
  0.00    0.000002           2         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    7.613432                 32843        14 total
Eek! I’ve paid for a 1:1 fstatfs(2) and fcntl(2) per read(2), and an mmap(2)/munmap(2) call for every read(2)/write(2) pair! Well, that wouldn’t be a big deal on OCFS since fstatfs(2) is extremely cheap and the structure contents only change when filesystem attributes change. The mmap(2)/munmap(2) costs a bit, sure, but on a local filesystem it would be very cheap. What I’m saying is that this additional call overhead wouldn’t weigh down OCFS throughput with the --o_direct flag, but I’m not blogging about OCFS. With NFS, this additional call overhead is way too expensive. All is not lost.
I have my own GNU coreutils dd(1) that implements O_DIRECT open(2). You can do this too; it is just GNU after all. With this custom GNU coreutils dd(1), the call profile is nothing more than read(2) and write(2) back to back. Oh, I forgot to mention, the oss.oracle.com dd(1) doesn’t work with /dev/null or /dev/zero since it tries to throw an O_DIRECT open(2) at those devices, which makes the tool croak. My dd(1) checks whether in or out is /dev/null or /dev/zero and omits the O_DIRECT for that side of the operation. Anyway, here is what this tool got:
$ time dd_direct if=/oradata2/file1 of=/oradata2/file2 bs=1024k conv=notrunc
4096+0 records in
4096+0 records out

real    1m20.162s
user    0m0.008s
sys     0m1.458s
Right, that’s more like it: 80 seconds, or 102MB/s. Shaving those additional calls off brought throughput up 15%.
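For the record, the arithmetic behind those two numbers checks out (8GB moved each run: 4GB read plus 4GB written):

```shell
# Throughput for each run, and the relative gain of the custom dd.
awk 'BEGIN {
  mb = 8 * 1024                    # 8GB total moved per run
  oss = mb / 92; custom = mb / 80  # elapsed: 92s vs. 80s
  printf "%.0f MB/s -> %.0f MB/s, a %.0f%% gain\n",
         oss, custom, (custom / oss - 1) * 100
}'
```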
What About Bonding/Teaming NICs?
Bonding NICs is a totally different story as I point out somewhat in this paper about Oracle Database 11g Direct NFS. You can get very mixed results if the network interface over which you send NFS traffic is bonded. I’ve seen 100% scalability of NICs in a bonded pair and I’ve seen as low as 70%. If you are testing a bonded pair, set your expectations accordingly.