The comment thread on my blog entry about the simplicity of NAS for Oracle got me thinking. I can’t count how many times I’ve seen people ask the following question:
Is N MB/s good throughput for Oracle over NFS?
Feel free to plug in any value you’d like for N. I’ve seen people ask if 40MB/s is acceptable. I’ve seen 60, I’ve seen 80; name it, I’ve seen it.
And The Answer Is…
Let me answer this question here and now. The acceptable throughput for Oracle over NFS is full wire capacity. Full stop! With Gigabit Ethernet and large Oracle transfers, that is pretty close to 110MB/s. There are some squeak factors that might bump that number one way or the other, but only just a bit. Even with the most hasty of setups, you should expect very close to 100MB/s straight out of the box, per network path. I cover examples of this in depth in this HP whitepaper about Oracle over NFS.
The steps to a clean bill of health are really very simple. First, make sure Oracle is performing large I/Os. Good examples of this are tablespace CCF (create contiguous file) and full table scans with port-maximum multi-block reads. Once you verify Oracle is performing large I/Os, do the math. If you are not close to 100MB/s on a GbE network path, something is wrong. Determining what’s wrong is another blog entry. I want to capitalize on this nagging question about expectations. I reiterate (quoting myself):
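To make “do the math” concrete, here is a trivial back-of-the-envelope sketch; the 110MB/s figure is the one quoted above, and the efficiency number simply falls out of it (illustrative arithmetic, not exact protocol accounting):

```python
# Back-of-the-envelope wire-speed math for a single GbE path.
# The 110 MB/s "observed" figure is the one quoted in this post.

GBE_BITS_PER_SEC = 1_000_000_000
raw_mb_per_sec = GBE_BITS_PER_SEC / 8 / 1_000_000   # 125 MB/s raw line rate

observed = 110                        # MB/s, large Oracle transfers over GbE NFS
efficiency = observed / raw_mb_per_sec  # share surviving Ethernet/IP/TCP/RPC framing

print(f"raw line rate: {raw_mb_per_sec:.0f} MB/s")
print(f"observed:      {observed} MB/s (~{efficiency:.0%} of line rate)")
```

In other words, roughly 88% of the raw line rate survives the protocol overhead, which is why anything far below 100MB/s per path should raise an eyebrow.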
Oracle will get line speed over NFS, unless something is ill-configured.
Initial Readings
I prefer to test for wire-speed before Oracle is loaded. The problem is that you need to mimic Oracle’s I/O. In this case I mean Direct I/O. Let’s dig into this one a bit.
I need something like a dd(1) tool that does O_DIRECT opens. This should be simple enough. I’ll just go get a copy of the oss.oracle.com coreutils package that has O_DIRECT tools like dd(1) and tar(1). So here goes:
[root@tmr6s15 DD]# ls ../coreutils-4.5.3-41.i386.rpm
../coreutils-4.5.3-41.i386.rpm
[root@tmr6s15 DD]# rpm2cpio < ../coreutils-4.5.3-41.i386.rpm | cpio -idm
11517 blocks
[root@tmr6s15 DD]# ls
bin  etc  usr
[root@tmr6s15 DD]# cd bin
[root@tmr6s15 bin]# ls -l dd
-rwxr-xr-x  1 root root 34836 Mar  4  2005 dd
[root@tmr6s15 bin]# ldd dd
linux-gate.so.1 => (0xffffe000)
libc.so.6 => /lib/tls/libc.so.6 (0x00805000)
/lib/ld-linux.so.2 (0x007ec000)
I have an NFS mount exported from an HP EFS Clustered Gateway (formerly PolyServe):
$ ls -l /oradata2
total 8388608
-rw-r--r--  1 root root 4294967296 Aug 31 10:15 file1
-rw-r--r--  1 root root 4294967296 Aug 31 10:18 file2
$ mount | grep oradata2
voradata2:/oradata2 on /oradata2 type nfs (rw,bg,hard,nointr,tcp,nfsvers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0,addr=192.168.60.142)
Let’s see what the oss.oracle.com dd(1) can do reading a 4GB file and over-writing another 4GB file:
$ time ./dd --o_direct=1048576,1048576 if=/oradata2/file1 of=/oradata2/file2 conv=notrunc
4096+0 records in
4096+0 records out

real    1m32.274s
user    0m3.681s
sys     0m8.057s
Test File Over-writing
What’s this bit about over-writing? I recommend using conv=notrunc when testing write speed. If you don’t, the file will be truncated and you’ll be testing write speeds burdened with file growth. Since Oracle over-writes the contents of files (unless creating or extending a datafile), it makes no sense to test writes to a file that is growing. Besides, the goal is to test the throughput of O_DIRECT I/O via NFS, not the filer’s ability to grow a file. So what did we get? Well, we transferred 8GB (4GB in, 4GB out) and did so in 92 seconds. That’s 89MB/s, and honestly, for a single path I would actually accept that since I have done absolutely no specialized tuning whatsoever. This is straight out of the box, as they say. The problem is that I know 89MB/s is not my typical performance for one of my standard deployments. What’s wrong?
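Spelled out, the throughput arithmetic for that run is simply:

```python
# The dd run read a 4GB file and over-wrote a 4GB file: 8GB total in 92 seconds.
bytes_moved = 2 * 4 * 1024**3       # 4GB read + 4GB written
elapsed = 92                        # seconds (1m32s wall clock)
mb_per_sec = bytes_moved / 1024**2 / elapsed
print(f"{mb_per_sec:.0f} MB/s")
```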
The dd(1) package supplied with the oss.oracle.com coreutils has a lot more in mind than O_DIRECT over NFS. In fact, it was developed to help OCFS1 deal with early cache-coherency problems. It turned out that mixing direct and non-direct I/O on OCFS was a really bad thing. No matter, that was then and this is now. Let’s take a look at what this dd(1) tool is doing:
$ strace -c ./dd --o_direct=1048576,1048576 if=/oradata2/file1 of=/oradata2/file2 conv=notrunc
4096+0 records in
4096+0 records out
Process 32720 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 56.76    4.321097        1054      4100         1 read
 22.31    1.698448         415      4096           fstatfs
 10.79    0.821484         100      8197           munmap
  9.52    0.725123         177      4102           write
  0.44    0.033658           4      8204           mmap
  0.16    0.011939           3      4096           fcntl
  0.02    0.001265          70        18        12 open
  0.00    0.000178          22         8           close
  0.00    0.000113          23         5           fstat
  0.00    0.000091          91         1           execve
  0.00    0.000015           2         8           rt_sigaction
  0.00    0.000007           2         3           brk
  0.00    0.000006           3         2           mprotect
  0.00    0.000004           4         1         1 access
  0.00    0.000002           2         1           uname
  0.00    0.000002           2         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    7.613432                 32843        14 total
Eek! I’ve paid for a 1:1 fstatfs(2) and fcntl(2) per read(2), and an mmap(2)/munmap(2) call for every read(2)/write(2) pair! Well, that wouldn’t be a big deal on OCFS, since fstatfs(2) is extremely cheap and the structure contents only change when filesystem attributes change. The mmap(2)/munmap(2) costs a bit, sure, but on a local filesystem it would be very cheap. What I’m saying is that this additional call overhead wouldn’t weigh down OCFS throughput with the --o_direct flag, but I’m not blogging about OCFS. With NFS, this additional call overhead is way too expensive. All is not lost.
I have my own coreutils dd(1) that implements O_DIRECT open(2). You can do this too; it is just GNU code, after all. With this custom GNU coreutils dd(1), the call profile is nothing more than read(2) and write(2) back to back. Oh, I forgot to mention: the oss.oracle.com dd(1) doesn’t work with /dev/null or /dev/zero, since it tries to throw an O_DIRECT open(2) at those devices, which makes the tool croak. My dd(1) checks whether the input or output is /dev/null or /dev/zero and omits O_DIRECT for that side of the operation. Anyway, here is what this tool got:
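For readers who want to mimic that /dev/null and /dev/zero special-casing, here is a minimal sketch of the flag-selection logic. This is my reconstruction of the behavior described, not the actual coreutils patch; the function name is made up:

```python
import os

def dd_open_flags(path, writing=False):
    """Choose open(2) flags the way the custom dd does: request O_DIRECT
    except for /dev/null and /dev/zero, which reject the flag."""
    flags = os.O_WRONLY if writing else os.O_RDONLY
    if path not in ("/dev/null", "/dev/zero"):
        flags |= os.O_DIRECT   # os.O_DIRECT is a Linux-specific attribute
    return flags
```

With the flags chosen this way, the copy loop itself needs nothing beyond back-to-back read(2) and write(2), which is exactly why the streamlined dd sheds all those extra calls.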
$ time dd_direct if=/oradata2/file1 of=/oradata2/file2 bs=1024k conv=notrunc
4096+0 records in
4096+0 records out

real    1m20.162s
user    0m0.008s
sys     0m1.458s
Right, that’s more like it: 80 seconds, or 102MB/s. Shaving those additional calls off brought throughput up 15%.
What About Bonding/Teaming NICs?
Bonding NICs is a totally different story as I point out somewhat in this paper about Oracle Database 11g Direct NFS. You can get very mixed results if the network interface over which you send NFS traffic is bonded. I’ve seen 100% scalability of NICs in a bonded pair and I’ve seen as low as 70%. If you are testing a bonded pair, set your expectations accordingly.
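Turning those percentages into concrete expectations (pure arithmetic on the figures in this post, using the ~110MB/s single-path number from above):

```python
# Expected throughput for a bonded GbE pair at the scalability extremes
# mentioned in this post (100% down to 70%).
single_path = 110   # MB/s over one GbE path
for scaling in (1.00, 0.70):
    bonded = 2 * single_path * scaling
    print(f"bonded pair at {scaling:.0%}: {bonded:.0f} MB/s")
```

So a bonded pair should land somewhere between roughly 154MB/s and 220MB/s; treat anything below that range as a configuration problem, not a hardware limit.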
Just a note: the current version of dd from GNU coreutils does now support O_DIRECT. I don’t know exactly which version introduced it, but the iflag=direct and oflag=direct options (input and output flags) allow you to force dd to use direct I/O. They are included with coreutils 5.97 (RHEL5) but not with coreutils 5.2.1 (RHEL4).
After thinking about this for a bit, I realized the 89MB/s is a bit misleading. If the connection is made in full-duplex mode, then sends and receives are happening at the same time.
Steven,
What is your point? I made the point clear that the 89MB/s ceiling was due to that particular dd implementation.
That if you are truly seeing “full wire capacity”, then the time should be almost cut in half. The example churns through 8GB of data, but it’s using both the send/receive portions of the link. It was receiving at 1Gbps while (almost) simultaneously sending that data back down the line at 1Gbps. You only read 4GB of data, and over a 1Gb link, I would expect that to finish in about 40 seconds or so.
What would you get if you read file1 and set of=/dev/null?
BTW, I realize my first post on your site was a criticism, but I really appreciate all of the good stuff you blog about. I’m building a 4 node RAC environment now, RHEL 4 64bit using NFS to a NetApp filer. Your site has been immensely helpful.
Steven wrote:
“That if you are truly seeing “full wire capacity”, then the time should be almost cut in half. ”
…Steven, I’m still confused by what you are trying to get at. I used a dd command implemented with a poor call stack and was throttled to 89MB/s throughput. I switched to a streamlined dd and pushed 102MB/s. If the output were sent to /dev/null, I would have pulled somewhere in the range of 117MB/s, as is common for large O_DIRECT reads over GbE NFS.
…Then, you wrote:
“You only read 4GB of data, and over a 1Gb link, I would expect that to finish in about 40 seconds or so.”
…unless my calculator is broken, that is precisely 102MB/s…since I was writing after reading, the time came in at 80s…still 102MB/s throughput. I must be missing your point altogether.
..Nonetheless, readers, be aware that as Christopher Cashell points out in the first comment on this thread, RHEL 5 implements dd O_DIRECT selectively for input and/or output using the iflag and oflag options. If you have an RHEL 4 NFS environment, you might consider getting that rendition of coreutils and using the dd command in there for any of your file-shuttling efforts…remember, Oracle uses O_DIRECT on files in NFS, and you should make sure any external accesses to large NFS files in that environment follow the same model. If you don’t, you will consume page cache on the Oracle server, which will start a memory usage war you don’t want to mess around with (a battle between the SGA and the page cache).
Sorry for the confusion. Does the client only start writing the file after it has read the file in its entirety? Or does the client start writing blocks as soon as they are read?
Steven wrote:
“Sorry for the confusion. Does the client only start writing the file after it has read the file in its entirety? Or does the client start writing blocks as soon as they are read?”
…Steven,
It is dd: it has always performed a blocking read followed by a blocking write, on every Unix variant I have ever touched, dating back to Unix System III.
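A minimal sketch of that loop (illustrative Python, not coreutils source): each record is read in full, blocking, before the corresponding write is issued.

```python
import io

def dd_loop(src, dst, bs=1024 * 1024):
    """The classic dd pattern: a blocking read of one record, then a
    blocking write of that record, repeated until the read returns EOF."""
    records = 0
    while True:
        buf = src.read(bs)   # nothing is written until this read completes
        if not buf:
            break
        dst.write(buf)
        records += 1
    return records

# Copy 4MB between in-memory streams in 1MB records
src, dst = io.BytesIO(b"x" * (4 * 1024 * 1024)), io.BytesIO()
print(dd_loop(src, dst))   # 4 records
```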
Ah, so it seems my time should have been spent researching how dd functions 😉
That makes sense, now…