Oracle over NFS Performance is “Glacial”, But At Least It Isn’t “File Serving.”

I assert that Oracle over NFS is not going away anytime soon—it’s only going to get better. In fact, there are futures that make it even more attractive from a performance and availability standpoint, but even today’s technology is sufficient for Oracle over NFS. Having said that, there is no shortage of misunderstanding about the model. The lack of understanding ranges from clear ignorance about the performance characteristics to simple misunderstanding about how Oracle interacts with the protocol.

Perhaps ignorance is not always the case when folks miss the mark about the performance characteristics. Indeed, when someone tells me the performance is horrible with Oracle over NFS—and they say they actually measured it—I can’t call them a bald-faced liar. I’m sure the naysayers in the poor-performance crowd saw what they saw, but they likely had a botched test. I too have seen the results of a lot of botched or ill-constructed tests, but I can’t dismiss an entire storage and connectivity model based on such results. I’ll discuss possible botched tests in a later post. First, I’d like to clear up the common misunderstanding about NFS and Oracle from a protocol perspective.

The 800lb Gorilla
No secrets here; Network Appliance is the stereotypical 800lb gorilla in the NFS space. So why not get some clarity on the protocol from Network Appliance’s Dave Hitz? In this blog entry about iSCSI and NAS, Dave says:

The two big differences between NAS and Fibre Channel SAN are the wires and the protocols. In terms of wires, NAS runs on Ethernet, and FC-SAN runs on Fibre Channel.

Good so far—in part. Yes, most people feed their Oracle database servers with little orange glass, expensive Host Bus Adaptors and expensive switches. That’s the FCP way. How did we get here? Well, FCP hit 1Gb long before Ethernet and honestly, the NFS overhead most people mistakenly fear in today’s technology was truly a problem in the 2000-2004 time frame. That was then, this is now.

As for NAS, Dave stopped short by suggesting NAS (e.g., NFS, iSCSI) runs only over Ethernet. There is also IP over InfiniBand. I don’t believe NetApp plays in the InfiniBand space, so that is likely the reason for the omission.

Dave continues:

The protocols are also different. NAS communicates at the file level, with requests like create-file-MyHomework.doc or read-file-Budget.xls. FC-SAN communicates at the block level, with requests over the wire like read-block-thirty-four or write-block-five-thousand-and-two.

What? NAS is either NFS or iSCSI—honestly. However, only NFS operates with requests like “read-file-Budget.xls”. But that is not the full story, and herein lies the confusion when the topic of Oracle over NFS comes up. Dave has inadvertently contributed to the misunderstanding. Yes, an NFS client may indeed cause the server to return an entire Excel spreadsheet, but that is certainly not how accesses to Oracle database files are conducted. I’ll state it simply and concisely:

Oracle over NFS is a file positioning and read/write workload.

Oracle over NFS is not traditional “file serving.” Oracle on an NFS client does not fetch entire files. That would simply not function. In fact, Oracle over NFS couldn’t possibly have less in common with traditional “file serving.” It’s all about Direct I/O.

Direct I/O with NFS
Oracle running on an NFS client does not double buffer by using both an SGA and the NFS client page cache. All platforms (that matter) support Direct I/O for files in NFS mounts. To that end, the cache model is SGA->Storage Cache and nothing in between—and therefore none of the associated NFS client cache overhead. And as I’ve pointed out in many blog entries before, I only call something “Direct I/O” if it is real Direct I/O. That is, Direct I/O and concurrent I/O (no write ordering locks).
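
For instance, on Linux the way to ask Oracle for direct (and asynchronous) I/O against NFS-mounted datafiles is the filesystemio_options initialization parameter; Solaris-style clients can also force it at the mount level with forcedirectio. A minimal sketch—not a copy of my test configuration:

-- Ask Oracle to open datafiles with both direct and asynchronous I/O.
-- Takes effect at the next instance restart (SPFILE assumed).
ALTER SYSTEM SET filesystemio_options = 'SETALL' SCOPE = SPFILE;

-- After the restart, confirm the setting.
SELECT value FROM v$parameter WHERE name = 'filesystemio_options';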

I/O Libraries
Oracle uses the same I/O libraries (in Oracle9i/Oracle10g) to access files in NFS mounts as it does for:

  • raw partitions
  • local file systems
  • block cluster file systems (e.g. GFS, PSFS, GPFS, OCFS2)
  • ASM over NFS
  • ASM on Raw Partitions

Oops, I almost forgot, there is also Oracle Disk Manager. So let me restate. When Oracle is not linked with an Oracle Disk Manager library or ASMLib, the same I/O calls are used for all of the storage options in the list I just provided.

So what’s the point? Well, the point I’m making is that Oracle behaves the same on NFS as it does on all the other storage options. Oracle simply positions within the files and reads or writes what’s there. No magic. But how does it perform?

The Performance is Glacial
There is a recent thread on comp.databases.oracle.server about 10g RAC that wound up twisting through other topics including Oracle over NFS. When discussing the performance of Oracle over NFS, one participant in the thread stated his view bluntly:

And the performance will be glacial: I’ve done it.

Glacial? That is:
gla·cial
adj.
1.
a. Of, relating to, or derived from a glacier.
b. Suggesting the extreme slowness of a glacier: Work proceeded at a glacial pace.

Let me see if I can redefine glacial using modern tested results with real computers, real software, and real storage. This is just a snippet, but it should put the term glacial in a proper light.

In the following screen shot, I list a simple script that captures the cumulative physical I/O the instance has done since boot time, runs a simple PL/SQL block that performs full lightweight scans against a table, and then takes another peek at the cumulative physical I/O. For this test I was not able to come up with a huge amount of storage, so I created and loaded a table with order-entry history records—about 25GB worth of data. So that the test runs for a reasonable amount of time, I scan the table 4 times from the PL/SQL block.
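
The screen shot is the record of exactly what ran; the following is only a rough sketch of that style of harness, with an illustrative table name (ORDER_HIST) and illustrative v$sysstat statistics rather than the ones visible in the image:

-- First snapshot of cumulative physical I/O since instance startup.
SELECT name, value FROM v$sysstat
 WHERE name IN ('physical reads', 'physical read total bytes');

-- Four lightweight full scans of the ~25GB table.
DECLARE
  cnt NUMBER;
BEGIN
  FOR i IN 1 .. 4 LOOP
    SELECT /*+ FULL(t) */ COUNT(*) INTO cnt FROM order_hist t;
  END LOOP;
END;
/

-- Second snapshot; the delta is the physical I/O driven by the scans.
SELECT name, value FROM v$sysstat
 WHERE name IN ('physical reads', 'physical read total bytes');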


[Screen shot: nas1.jpg — the script and the first cumulative physical I/O figures]

The following screen shot shows that Oracle scanned 101GB in 466 seconds—223 MB/s of scanning throughput. I forgot to mention, this is a DL585 with only 2 paths to storage. Before some slight reconfiguration I had to make, I had 3 paths to storage and was seeing 329MB/s—about 97% linear scalability, considering that the maximum payload on GbE is on the order of 114MB/s for this sort of workload.
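
A quick sanity check on those numbers, taking ~114MB/s as the usable payload per GbE path:

$$\frac{101\ \mathrm{GB}}{466\ \mathrm{s}} \approx 223\ \mathrm{MB/s} \approx 2 \times 114\ \mathrm{MB/s}, \qquad 329\ \mathrm{MB/s} \approx 0.96 \times \left(3 \times 114\ \mathrm{MB/s}\right)$$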

[Screen shot: nas2.jpg — 101GB scanned in 466 seconds]

NFS Overhead? Cheating is Naughty!
The following screen shot shows vmstat output taken during the full table scans. It shows that kernel-mode processor utilization, while Oracle uses Direct I/O to scan NFS files, falls consistently in the range of 22%. That is not entirely NFS overhead by any means, either.

Of course, Oracle doesn’t know whether its I/O is truly physical, since there could be OS buffering underneath. The screen shot also shows the memory usage on the server: 31 of 32GB were free, which means I wasn’t scanning a 25GB table that was cached in the OS page cache. This was real I/O going over a real wire.

[Screen shot: nas3.png — vmstat output and memory usage during the scans]

For more information, I recommend this paper about Scalable Fault Tolerant NAS and the NFS-related postings on my blog.

17 Responses to “Oracle over NFS Performance is “Glacial”, But At Least It Isn’t “File Serving.””


  1. 1 Sto Rage May 3, 2007 at 3:49 am

    Can you tell us what mount options you recommend for oracle nfs mounts?
    This is what we have for our oracle nfs mounts:
    nfs 2 yes rw,bg,vers=3,proto=tcp,hard,intr,rsize=32768,wsize=32768,forcedirectio,llock

  2. 2 cristiancudizio May 3, 2007 at 8:35 am

    Excuse my ignorance about NFS, but I haven’t understood what happens when Oracle tries to read a block from a file on NFS: does it pull the entire file into the OS cache and then get its block, or is it able to get only the block directly from the NAS?
    Thanks for the answer, and compliments on the post

    bye
    Cristian

  3. 3 kevinclosson May 3, 2007 at 3:42 pm

    “…tries to read a block from a file on NFS: does it pull the entire file into the OS cache and then get its block, or is it able to get only the block directly from the NAS?”

    If Oracle performs a physical I/O for a single Oracle database block, that is exactly what is transferred over the wire.
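
    In other words, a single-block physical read moves exactly one database block worth of data—db_block_size bytes, commonly 8KB:

    -- The unit of a single-block physical read; 8192 bytes is a common value.
    SELECT value FROM v$parameter WHERE name = 'db_block_size';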

  4. 4 Freek May 20, 2007 at 9:06 am

    Can you tell us a little bit more about the test environment you used, like the type of OS, the number of CPUs, and the values used for the filesystemio_options and db_writer_processes parameters?

    regards,

    Freek

  5. 5 guest April 11, 2008 at 9:12 pm

    The throughput for NFS over a GbE connection is about ~33Mb/s.
    For a SATA drive it is ~70Mb/s.
    And so on…

    So 114Mb/s is a bit misleading

  6. 6 kevinclosson April 11, 2008 at 10:15 pm

    Guest:

    I just about censored out your comment, but then I got to thinking that you might actually believe NFS over GbE is limited to 33Mb/s. You couldn’t be further from the truth. How do you suppose NFS eats up the other 967Mb? I think you just mixed up your nomenclature. Nonetheless, I have provided ample evidence that Oracle over NFS can easily drive throughput up to line speed (~114MB/s-118MB/s)…

    …there is nothing misleading about that.
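
    To spell out the nomenclature point: gigabit Ethernet carries 1Gb/s—bits, not bytes—so the raw byte rate is

    $$\frac{1000\ \mathrm{Mb/s}}{8\ \mathrm{bits/byte}} = 125\ \mathrm{MB/s},$$

    of which roughly 114MB/s-118MB/s remains as usable payload after Ethernet/IP/TCP/NFS overhead.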

    Read:

    Click to access 15650%20NAS%20Oracle%20WP%204A2.pdf

  7. 7 realarms May 24, 2008 at 11:20 pm

    I’ve just stumbled upon this blog entry in my quest for any decent information on why Oracle/NFS performance is the way it is.

    Unfortunately, so far I have not been able to get a decent explanation on any vendor site about this potential issue – all the whitepapers simply miss the real point about Oracle performance over NFS, in my humble view.

    Let me say that my profession allows me interesting insights into modern data storage and related technologies.

    But now to the point:

    For me, both the measurements posted here by Kevin and the mentioned speed of only 33 MB/s over Gigabit Ethernet are perfectly valid, true data points.

    Fortunately, Kevin provided the basic key to unlock the secret of what differentiates the two groups (one claiming near-wirespeed NFS performance, the other claiming something between 15 and 40% of wirespeed).

    It all boils down to concurrency and latency.

    Without going into too much detail, let’s investigate what really impacts throughput when running a table scan in oracle:

    Oracle will typically start requesting big chunks of data from its data (NFS) files – usually in the vicinity of 256k to 2M.

    Using O_SYNC and O_DIRECT semantics (supported by mount options such as forcedirectio), that block is requested from the NFS client.

    However, most NFS clients and servers only support a much more limited block size, typically 32k (most whitepapers call for this block size when mounting an NFS export for Oracle).

    This means our 256k Oracle request needs to be split into a number of (8) smaller NFS requests, which can then be sent off to the NFS server.

    The server, after receiving such a small 32k request, will do its best and serve that chunk of data as fast as possible.

    As soon as the NFS client (the Oracle host) has received all the constituent small NFS replies for the large application (Oracle) request, it returns the data to the application. Since the mount options explicitly forbid any client-side caching, the NFS client won’t start doing any read-aheads (prefetches) of its own.

    So far so good. But what does all that mean, performance wise:

    First, I deliberately did not mention one key fact: all these operations take time.
    Second, something very basic has never been mentioned in all those threads: the underlying (Ethernet) infrastructure, as well as the NFS client (which is basically part of that infrastructure).

    Since the NFS server is driven by the client, it’s the client that, more than anything else, can make things good or bad.

    Consider the following:

    Ora OS Srv
    -256k->
    –32k->

    <-32k–

    –32k->
    –32k->

    <-32k–
    <-32k–
    <-256k–

    Anyone noticed the difference?

    Actually, these two exchanges are based on actual observations of NFS client behaviour; the first is for NFS clients based on the Solaris reference client (most commercial Unix flavours, like Solaris, AIX, and HP-UX, perform sync/direct semantics that way). The second is your common Linux 2.6 (and DNFS) client.

    The difference is (and I hope this can be conveyed in this blog entry) that in the first case the NFS client just sits idle while it waits for each single constituent request to be completed by the server before asking for the next block…

    In the second example, the client actually asked immediately for more data (already requested by the application) – to fill the “time gap”.

    Of course, these two examples are single-threaded; but just as “Guest” mentioned, a lot of applications out there are using SQL statements like this – plain vanilla single-threaded. Kevin, in his examples, used an application-level approach “to fill the gaps” – running 16 Oracle threads in parallel (did anyone notice?).

    So, in the real world what does that mean:

    Have a real close look at your Ethernet infrastructure, and tune your TCP stack and NICs as well as you can. For single-thread throughput, each microsecond of delay – in the host, the network, or the NFS server – can ruin your performance.

    Ideally, use dedicated Ethernet switches, and don’t run your Oracle NFS traffic across the datacenter over numerous intermediate switches.

    Some food for thought: the theoretical minimum time to transfer 32kB of NFS payload data across a gigabit Ethernet link is about 260 microseconds. Now add to this the delay per switch hop, the delay on the NFS server (and the delay within the NFS client), and you can easily end up with a total round-trip time – from NFS request to reception of the last TCP frame of the reply – on the order of 600 microseconds.

    Sounds reasonable?

    You are right, but your NFS throughput will then only be around 53 MB/s – for a single thread.

    Now add a typical oversubscribed, over-configured and underpowered core switch (you can probably find Tolly test results on the brand you’ve got), times two within a single data center, and you can easily end up with 1-2 milliseconds of delay.

    Sounds still good?

    Your performance at 2 ms (and with an NFS client behaving as in the first example) will drop to 16 MB/s…

    Now, if your network is perfect (i.e., a direct cable from your Oracle server’s gigabit NIC to the NFS server’s gigabit NIC), but the NFS server has an average latency of, say, 5 milliseconds per 32k request – throughput drops to 6 MB/s.

    I hope this clearly demonstrates the paramount influence of latency on your Oracle single-thread (table scan) performance.
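
    To put the arithmetic in one place: with a strictly serial client issuing one 32k NFS read per round trip, single-thread throughput is bounded by the transfer size divided by the round-trip time, which is where the figures above come from (give or take rounding):

    $$\frac{32\ \mathrm{KiB}}{600\ \mu\mathrm{s}} \approx 53\ \mathrm{MB/s}, \qquad \frac{32\ \mathrm{KiB}}{2\ \mathrm{ms}} \approx 16\ \mathrm{MB/s}, \qquad \frac{32\ \mathrm{KiB}}{5\ \mathrm{ms}} \approx 6.5\ \mathrm{MB/s}$$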

    What can you do about it? At some point – once you have invested in cut-through Ethernet switches (0.1 to 10 microseconds of delay per hop instead of the 150 to 1200 microseconds of store-and-forward switches; basically the same technology as Fibre Channel switches) and top-of-the-line NFS servers with huge amounts of RAM to keep your whole working set in cache – any further investment has diminishing returns.

    Well, use an NFS client as in example 2 – or, as Kevin demonstrated, never perform single-threaded I/O on the Oracle side.

    Examples of NFS clients that can internally make use of concurrency while obeying the semantics necessary for O_SYNC and O_DIRECT calls are – as mentioned – Linux 2.6 and Oracle DNFS.

    You can also demand that closed-source clients be improved to have the same features as the open-source Linux NFS client – or have all your DB applications rewritten…

    In the end, it all boils down to latency and concurrency. Sync/direct performance in most real-world environments is directly affected by high latency (network delay!), and concurrency can only be introduced by an architectural change (in the Oracle app or the NFS client).

  8. 8 realarms May 24, 2008 at 11:24 pm

    And again, one more try:

    Ora OS Srv
    | -256k->
    | –32k->
    |
    |
    |
    |
    |
    |
    |
    |
    |
    |
    |
    |
    |
    |
    |
    | <-32k–
    |
    | –32k->
    | –32k->
    |
    |
    |
    |
    |
    |
    | <-32k–
    | <-32k–
    | <-256k–
    |
    v time

  9. 9 Alonso June 23, 2008 at 2:45 pm

    I would like to throw out one gotcha I encountered. When using direct I/O and Data Pump (which intentionally bypasses the SGA) you can run into big performance problems trying to export an IOT table with an overflow segment. Data Pump wants to keep the entire row together, so it reads an index leaf, then says “oh, the rest of this row is on page 12345”, fetches 12345, then moves on. For the next row, page 12345 is not in the SGA cache, not in a “Data Pump cache”, not in an NFS cache—guess what, you get to do another direct read.

    For me this meant reading something like 200GB of data to export a 12G table.

    It’s enough of a fringe case that I suspect lots of folks have not run into it.

    I don’t know what the best workaround is, other than to not use IOTs or Data Pump.

  10. 11 Dave February 22, 2010 at 11:13 pm

    Why doesn’t Oracle just use DAFS … good stuff.

    • 12 kevinclosson February 22, 2010 at 11:55 pm

      because, google: “closson +DNFS”

      • 13 Matt February 23, 2010 at 2:18 am

        Hi Kevin,

        I’m a little confused – in running the google search you mention, I definitely see a lot of material on dNFS, but in what I’ve been able to skim through, I don’t find any direct address of the question… and in my meager understanding, I’m not sure what DAFS even has to do with dNFS, other than NetApp supporting both… isn’t DAFS much more akin to Linux’s NDB, supporting no-nonsense direct block IO rather than I/O “negotiated” through the NFS protocol?

        Respectfully,
        Matt

  11. 14 James Attard July 27, 2011 at 8:50 am

    Hi Kevin, nice post, but as other commenters have noted, you neglected to state what NFS mount options were used in your experiment. To achieve good speeds I found that I had to switch to UDP, as I described in my article here – http://www.r00tb0x.com/content/tuning-nfs-performance


