The comment thread on my blog entry about the simplicity of NAS for Oracle got me thinking. I can’t count how many times I’ve seen people ask the following question:
Is N MB/s good throughput for Oracle over NFS?
Feel free to plug in any value you’d like for N. I’ve seen people ask if 40MB/s is acceptable. I’ve seen 60, I’ve seen 80; name it, I’ve seen it.
And The Answer Is…
Let me answer this question here and now. The acceptable throughput for Oracle over NFS is full wire capacity. Full stop! With Gigabit Ethernet and large Oracle transfers, that is pretty close to 110MB/s. There are some squeak factors that might bump that number one way or the other, but only just a bit. Even with the most hasty of setups, you should expect very close to 100MB/s straight out of the box, per network path. I cover examples of this in depth in this HP whitepaper about Oracle over NFS.
The steps to a clean bill of health are really very simple. First, make sure Oracle is performing large I/Os. Good examples of this are tablespace CCF (create contiguous file) and full table scans with port-maximum multi-block reads. Once you verify Oracle is performing large I/Os, do the math. If you are not close to 100MB/s on a GbE network path, something is wrong. Determining what’s wrong is another blog entry. I want to capitalize on this nagging question about expectations. I reiterate (quoting myself):
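To make “do the math” concrete, here is a trivial back-of-the-envelope sketch; the 110MB/s figure is the one quoted above, and the efficiency number simply falls out of it (illustrative arithmetic, not exact protocol accounting):

```python
# Back-of-the-envelope wire-speed math for a single GbE path.
# The 110 MB/s "observed" figure is the one quoted in this post.

GBE_BITS_PER_SEC = 1_000_000_000
raw_mb_per_sec = GBE_BITS_PER_SEC / 8 / 1_000_000   # 125 MB/s raw line rate

observed = 110                        # MB/s, large Oracle transfers over GbE NFS
efficiency = observed / raw_mb_per_sec  # share surviving Ethernet/IP/TCP/RPC framing

print(f"raw line rate: {raw_mb_per_sec:.0f} MB/s")
print(f"observed:      {observed} MB/s (~{efficiency:.0%} of line rate)")
```

In other words, roughly 88% of the raw line rate survives the protocol overhead, which is why anything far below 100MB/s per path should raise an eyebrow.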
Oracle will get line speed over NFS, unless something is ill-configured.
Initial Readings
I prefer to test for wire-speed before Oracle is loaded. The problem is that you need to mimic Oracle’s I/O. In this case I mean Direct I/O. Let’s dig into this one a bit.
I need something like a dd(1) tool that does O_DIRECT opens. This should be simple enough. I’ll just go get a copy of the oss.oracle.com coreutils package that has O_DIRECT tools like dd(1) and tar(1). So here goes:
[root@tmr6s15 DD]# ls ../coreutils-4.5.3-41.i386.rpm
../coreutils-4.5.3-41.i386.rpm
[root@tmr6s15 DD]# rpm2cpio < ../coreutils-4.5.3-41.i386.rpm | cpio -idm
11517 blocks
[root@tmr6s15 DD]# ls
bin  etc  usr
[root@tmr6s15 DD]# cd bin
[root@tmr6s15 bin]# ls -l dd
-rwxr-xr-x  1 root root 34836 Mar  4  2005 dd
[root@tmr6s15 bin]# ldd dd
linux-gate.so.1 => (0xffffe000)
libc.so.6 => /lib/tls/libc.so.6 (0x00805000)
/lib/ld-linux.so.2 (0x007ec000)
I have an NFS mount exported from an HP EFS Clustered Gateway (formerly PolyServe):
$ ls -l /oradata2
total 8388608
-rw-r--r--  1 root root 4294967296 Aug 31 10:15 file1
-rw-r--r--  1 root root 4294967296 Aug 31 10:18 file2
$ mount | grep oradata2
voradata2:/oradata2 on /oradata2 type nfs (rw,bg,hard,nointr,tcp,nfsvers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0,addr=192.168.60.142)
Let’s see what the oss.oracle.com dd(1) can do reading a 4GB file and over-writing another 4GB file:
$ time ./dd --o_direct=1048576,1048576 if=/oradata2/file1 of=/oradata2/file2 conv=notrunc
4096+0 records in
4096+0 records out

real    1m32.274s
user    0m3.681s
sys     0m8.057s
Test File Over-writing
What’s this bit about over-writing? I recommend using conv=notrunc when testing write speed. If you don’t, the file will be truncated and you’ll be testing write speeds burdened with file growth. Since Oracle over-writes the contents of files (unless creating or extending a datafile), it makes no sense to test writes to a file that is growing. Besides, the goal is to test the throughput of O_DIRECT I/O via NFS, not the filer’s ability to grow a file. So what did we get? Well, we transferred 8GB (4GB in, 4GB out) and did so in 92 seconds. That’s 89MB/s, and honestly, for a single path I would actually accept that since I have done absolutely no specialized tuning whatsoever. This is straight out of the box, as they say. The problem is that I know 89MB/s is not my typical performance for one of my standard deployments. What’s wrong?
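Spelled out, the throughput arithmetic for that run is simply:

```python
# The dd run read a 4GB file and over-wrote a 4GB file: 8GB total in 92 seconds.
bytes_moved = 2 * 4 * 1024**3       # 4GB read + 4GB written
elapsed = 92                        # seconds (1m32s wall clock)
mb_per_sec = bytes_moved / 1024**2 / elapsed
print(f"{mb_per_sec:.0f} MB/s")
```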
The dd(1) package supplied with the oss.oracle.com coreutils has a lot more in mind than O_DIRECT over NFS. In fact, it was developed to help OCFS1 deal with early cache-coherency problems. It turned out that mixing direct and non-direct I/O on OCFS was a really bad thing. No matter, that was then and this is now. Let’s take a look at what this dd(1) tool is doing:
$ strace -c ./dd --o_direct=1048576,1048576 if=/oradata2/file1 of=/oradata2/file2 conv=notrunc
4096+0 records in
4096+0 records out
Process 32720 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 56.76    4.321097        1054      4100         1 read
 22.31    1.698448         415      4096           fstatfs
 10.79    0.821484         100      8197           munmap
  9.52    0.725123         177      4102           write
  0.44    0.033658           4      8204           mmap
  0.16    0.011939           3      4096           fcntl
  0.02    0.001265          70        18        12 open
  0.00    0.000178          22         8           close
  0.00    0.000113          23         5           fstat
  0.00    0.000091          91         1           execve
  0.00    0.000015           2         8           rt_sigaction
  0.00    0.000007           2         3           brk
  0.00    0.000006           3         2           mprotect
  0.00    0.000004           4         1         1 access
  0.00    0.000002           2         1           uname
  0.00    0.000002           2         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    7.613432                 32843        14 total
Eek! I’ve paid for a 1:1 fstatfs(2) and fcntl(2) per read(2), and an mmap(2)/munmap(2) call for every read(2)/write(2) pair! Well, that wouldn’t be a big deal on OCFS, since fstatfs(2) is extremely cheap and the structure contents only change when filesystem attributes change. The mmap(2)/munmap(2) costs a bit, sure, but on a local filesystem it would be very cheap. What I’m saying is that this additional call overhead wouldn’t weigh down OCFS throughput with the --o_direct flag, but I’m not blogging about OCFS. With NFS, this additional call overhead is way too expensive. All is not lost.
I have my own coreutils dd(1) that implements O_DIRECT open(2). You can do this too; it is just GNU code, after all. With this custom GNU coreutils dd(1), the call profile is nothing more than read(2) and write(2) back to back. Oh, I forgot to mention: the oss.oracle.com dd(1) doesn’t work with /dev/null or /dev/zero, since it tries to throw an O_DIRECT open(2) at those devices, which makes the tool croak. My dd(1) checks whether the input or output is /dev/null or /dev/zero and omits O_DIRECT for that side of the operation. Anyway, here is what this tool got:
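For readers who want to mimic that /dev/null and /dev/zero special-casing, here is a minimal sketch of the flag-selection logic. This is my reconstruction of the behavior described, not the actual coreutils patch; the function name is made up:

```python
import os

def dd_open_flags(path, writing=False):
    """Choose open(2) flags the way the custom dd does: request O_DIRECT
    except for /dev/null and /dev/zero, which reject the flag."""
    flags = os.O_WRONLY if writing else os.O_RDONLY
    if path not in ("/dev/null", "/dev/zero"):
        flags |= os.O_DIRECT   # os.O_DIRECT is a Linux-specific attribute
    return flags
```

With the flags chosen this way, the copy loop itself needs nothing beyond back-to-back read(2) and write(2), which is exactly why the streamlined dd sheds all those extra calls.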
$ time dd_direct if=/oradata2/file1 of=/oradata2/file2 bs=1024k conv=notrunc
4096+0 records in
4096+0 records out

real    1m20.162s
user    0m0.008s
sys     0m1.458s
Right, that’s more like it: 80 seconds, or 102MB/s. Shaving those additional calls off brought throughput up 15%.
What About Bonding/Teaming NICs?
Bonding NICs is a totally different story as I point out somewhat in this paper about Oracle Database 11g Direct NFS. You can get very mixed results if the network interface over which you send NFS traffic is bonded. I’ve seen 100% scalability of NICs in a bonded pair and I’ve seen as low as 70%. If you are testing a bonded pair, set your expectations accordingly.
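Turning those percentages into concrete expectations (pure arithmetic on the figures in this post, using the ~110MB/s single-path number from above):

```python
# Expected throughput for a bonded GbE pair at the scalability extremes
# mentioned in this post (100% down to 70%).
single_path = 110   # MB/s over one GbE path
for scaling in (1.00, 0.70):
    bonded = 2 * single_path * scaling
    print(f"bonded pair at {scaling:.0%}: {bonded:.0f} MB/s")
```

So a bonded pair should land somewhere between roughly 154MB/s and 220MB/s; treat anything below that range as a configuration problem, not a hardware limit.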
Just a note: the current version of dd from GNU coreutils does now support O_DIRECT. I don’t know exactly which version introduced it, but the iflag=direct and oflag=direct options (input and output flags) allow you to force dd to use direct I/O. They are included with coreutils 5.97 (RHEL5) but not with coreutils 5.2.1 (RHEL4).
After thinking about this for a bit, I realized the 89MB/s is a bit misleading. If the connection is made in full-duplex mode, then sends and receives are happening at the same time.
Steven,
What is your point? I made the point clear that the 89MB/s ceiling was due to that particular dd implementation.
That if you are truly seeing “full wire capacity”, then the time should be almost cut in half. The example churns through 8GB of data, but it’s using both the send/receive portions of the link. It was receiving at 1Gbps while (almost) simultaneously sending that data back down the line at 1Gbps. You only read 4GB of data, and over a 1Gb link, I would expect that to finish in about 40 seconds or so.
What would you get if you read file1 and set of=/dev/null?
BTW, I realize my first post on your site was a criticism, but I really appreciate all of the good stuff you blog about. I’m building a 4 node RAC environment now, RHEL 4 64bit using NFS to a NetApp filer. Your site has been immensely helpful.
Steven wrote:
“That if you are truly seeing “full wire capacity”, then the time should be almost cut in half. ”
…Steven, I’m still confused by what you are trying to get at. I used a dd command implemented with a poor call stack and was throttled to 89MB/s throughput. I switched to a streamlined dd and pushed 102MB/s. If the output were sent to /dev/null, I would have pulled somewhere in the range of 117MB/s, as is common for large O_DIRECT reads over GbE NFS.
…Then, you wrote:
“You only read 4GB of data, and over a 1Gb link, I would expect that to finish in about 40 seconds or so.”
…unless my calculator is broken, that is precisely 102MB/s…since I was writing after reading, the time came in at 80s…still 102MB/s throughput. I must be missing your point altogether.
..Nonetheless, readers, be aware that as Christopher Cashell points out in the first comment on this thread, RHEL 5 implements dd O_DIRECT selectively for input and/or output using the iflag and oflag options. If you have an RHEL 4 NFS environment, you might consider getting that rendition of coreutils and using the dd command in there for any of your file-shuttling efforts…remember, Oracle uses O_DIRECT on files in NFS, and you should make sure any external accesses to large NFS files in that environment follow the same model. If you don’t, you will consume page cache on the Oracle server, which will start a memory usage war you don’t want to mess around with (a battle between the SGA and the page cache).
Sorry for the confusion. Does the client only start writing the file after it has read the file in its entirety? Or does the client start writing blocks as soon as they are read?
Steven wrote:
“Sorry for the confusion. Does the client only start writing the file after it has read the file in its entirety? Or does the client start writing blocks as soon as they are read?”
…Steven,
It is dd: it has always performed a blocking read followed by a blocking write, on every Unix variant I have ever touched, dating back to Unix System III.
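A minimal sketch of that loop (illustrative Python, not coreutils source): each record is read in full, blocking, before the corresponding write is issued.

```python
import io

def dd_loop(src, dst, bs=1024 * 1024):
    """The classic dd pattern: a blocking read of one record, then a
    blocking write of that record, repeated until the read returns EOF."""
    records = 0
    while True:
        buf = src.read(bs)   # nothing is written until this read completes
        if not buf:
            break
        dst.write(buf)
        records += 1
    return records

# Copy 4MB between in-memory streams in 1MB records
src, dst = io.BytesIO(b"x" * (4 * 1024 * 1024)), io.BytesIO()
print(dd_loop(src, dst))   # 4 records
```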
Ah, so it seems my time should have been spent researching how dd functions 😉
That makes sense, now…