DBWR Efficiency, AIO, I/O Libraries with ASM.

I’m sure this information is not new to very many folks, but there might be some interesting stuff in this post…

I’m doing OLTP performance testing using a DL585, 32GB, Scalable NAS, RHEL4 (Real, genuine RHEL4) and 10gR2. I’m doing some oprofile analysis on both the NFS client and server. I’ll blog more about oprofile soon.

This post will sound a bit like a rant. Did you know that the Linux 2.5 Kernel Summit folks spent a bunch of time mulling over features that have been considered absolutely critical for Oracle performance on Unix systems for over 10 years? Take a peek at this partial list and chuckle with me, please. After the list, I’m going to talk about DBWR. A snippet from the list:

Raw I/O has a few problems that keep it from achieving the performance it should get. Large operations are broken down into 64KB batches…
A true asynchronous I/O interface is needed.
Shared memory segments should use shared page tables.
A better, lighter-weight semaphore implementation is needed.
It would be nice to allow a process to prevent itself from being scheduled out when it is holding a critical resource.

Yes, the list included such features as eliminating the crazy smashing of Oracle multiblock I/O into little bits, real async I/O, shared page tables and non-preemption. That’s right. Every Unix variant worth its salt, in the mid to late 90s, had all this and more. Guess how much of the list is still not implemented. Guess how important those missing items are. I’ll blog some other time about the lighter-weight semaphore and non-preemption that fell off the truck.

I Want To Talk About Async I/O
Prior to the 2.6 Linux Kernel, there was no mainline async I/O support. Yes, there were sundry bits of hobby code out there, but really, it wasn’t until 2.6 that async I/O worked. In fact, a former co-worker (from the Sequent days) did the first HP RAC Linux TPC-C and reported here that the kludgy async I/O that he was offered bumped performance 5%. Yippie! I assure you, based on years of TPC-C work, that async I/O will give you much more than 5% if it works at all.

So, finally, 2.6 brought us async I/O. The implementation deviates from POSIX, which is a good thing. However, it doesn’t deviate enough. One of the most painful aspects of POSIX async I/O, from an Oracle perspective, is that each call can only initiate writes to a single file descriptor. At least the async I/O that did make it into the 2.6 Kernel is better in that regard. With the io_submit(2) routine, DBWR can send a batch of modified buffers to any number of datafiles in a single call. This is good. In fact, this is one of the main reasons Oracle developed the Oracle Disk Manager (ODM) interface specification. See, with odm_io(), any combination of reads and writes, whether sync or async, to any number of file descriptors can be issued in a single call. Moreover, while initiating new I/O, prior requests can be checked for completion. It is a really good interface, but it was only implemented by Veritas, NetApp and PolyServe. NetApp’s version died because it was locked too tightly to DAFS, which is dead, really dead (I digress). So, yes, ODM (more info in this, and other papers) is quite efficient at completion processing. Anyone out there look at completion processing on 10gR2 with the 2.6 libaio? I did (a long time ago really).

Here is a screen shot of strace following DBWR while the DL585 is pumping a reasonable I/O rate (approx 8,000 IOPS) from the OLTP workload:

Notice anything weird? There are:

38.8 I/O submit calls per cpu second (batches)

4872 I/O complete processing calls per cpu second (io_getevents())

7785 wall clock time reading calls per cpu second (times + gettimeofday)
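To put those three rates side by side, here is a bit of back-of-the-envelope arithmetic. It uses nothing but the numbers quoted above; the variable names are mine.

```python
# Rates read off the strace summary above (calls per CPU second).
submit_calls = 38.8       # io_submit batches
getevents_calls = 4872    # io_getevents completion checks
time_calls = 7785         # times + gettimeofday

# How many completion checks does DBWR make for each batch it submits?
checks_per_batch = getevents_calls / submit_calls

# How many clock reads accompany each completion check?
time_calls_per_check = time_calls / getevents_calls

print(f"completion checks per submitted batch: {checks_per_batch:.1f}")   # 125.6
print(f"clock reads per completion check:      {time_calls_per_check:.2f}")  # 1.60
```

Roughly 126 completion checks for every batch submitted, and about 1.6 clock reads riding along with each check.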

Does Anyone Really Know What Time It Is?
Database writer, with the API currently being used (when no ODM is in play), is doing what we call “snacking for completions”. This happens for one of many reasons. For instance, if the completion check asks for any number of completions, it may come back with only 1 or 2. What’s with that? If DBWR just flushed, say, 256 modified buffers, why is it taking action on just a couple of completions? Waste of time. It’s because the API offers no more flexibility than that. On the other hand, the ODM specification allows for blocking on a completion check until a certain number of requests, or a certain request, is complete, with an optional timeout. And like I said, that completion check can be done while already in the kernel to initiate new I/O.
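Here is a toy model of the difference between snacking and waiting for a meaningful number of completions. This is not Oracle, ODM or libaio code. The function name `count_reap_calls`, the `wait_for` parameter and the made-up completion ticks are all hypothetical, and the model assumes each check harvests everything that has finished by the time it wakes up.

```python
def count_reap_calls(completion_ticks, wait_for):
    """Count completion checks needed to harvest every I/O in a batch.

    completion_ticks: the (made-up) tick at which each I/O finishes.
    wait_for: how many completions a check blocks for before returning.
    """
    pending = sorted(completion_ticks)
    total = len(pending)
    harvested = 0
    calls = 0
    while harvested < total:
        # Never wait for more completions than are still outstanding.
        need = min(wait_for, total - harvested)
        # The check wakes up when the need-th outstanding I/O completes...
        wake_tick = pending[harvested + need - 1]
        harvested += need
        # ...and also picks up anything else that finished by that tick.
        while harvested < total and pending[harvested] <= wake_tick:
            harvested += 1
        calls += 1
    return calls


# A batch of 256 writes that trickle in one per tick.
batch = list(range(256))
print(count_reap_calls(batch, wait_for=1))    # 256 checks: snacking
print(count_reap_calls(batch, wait_for=64))   # 4 checks
print(count_reap_calls(batch, wait_for=256))  # 1 check
```

Same 256 writes harvested in every case; the only thing that changes is how many trips it takes to collect them.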

And yes, DBWR alone is checking wall clock time with a combination of times(2) and gettimeofday(2) at a rate of 7,785 times per cpu second! Wait stats force this. The VOS layer is asking for a timed OS call. The OS can’t help it if DBWR is checking for I/O completions 4,872 times per cpu second, just to harvest I/Os from some 38.8 batch writes per cpu second…ho hum… You won’t be surprised when I tell you that the Sequent port of Oracle had a user-mode gettimeofday(). We looked at it this way: if Oracle wants to call gettimeofday() thousands of times per second, we might as well make it really inexpensive. It is a kernel-mode gettimeofday() on Linux, of course.

What can you do about this? Not much really. I think it would be really cool if Oracle actually implemented (Unbreakable 2.0) some of the stuff they were pressing the Linux 2.5 developers to do. Where do you think the Linux 2.5 Summit got that list from?

What? No Mention of ASM?
As usual, you should expect a tie in with ASM. Folks, unless you are using ASMLib, ASM I/Os are fired out of the normal OSDs (Operating System Dependent code) which is libaio on RHEL4. What I’m telling you is that ASM is storage management, not an I/O library. So, if a db file sequential read, or DBWR write is headed for an offset in an ASM raw partition, it will be done with the same system call as it would be if it was on a CFS or NAS.
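To make the “storage management, not an I/O library” point concrete, here is a small sketch. It is a toy model, not ASM: the extent-map layout, the tiny `AU_SIZE`, and the helper names `resolve`, `read_block` and `make_disk` are all assumptions of mine, and scratch files stand in for raw partitions. The only point is that the lookup decides where the block lives, and the read itself is the ordinary system call.

```python
import os
import tempfile

AU_SIZE = 4096  # hypothetical allocation-unit size, kept small for the demo


def resolve(extent_map, file_offset):
    """Translate a file offset into (device_fd, device_offset)."""
    extent_no, within = divmod(file_offset, AU_SIZE)
    fd, au_no = extent_map[extent_no]
    return fd, au_no * AU_SIZE + within


def read_block(extent_map, file_offset, nbytes):
    """Read nbytes at file_offset (the read must not cross an extent boundary)."""
    fd, dev_offset = resolve(extent_map, file_offset)
    # The storage manager only supplied the location. The I/O is a plain
    # pread, the same call used for a file on any other kind of storage.
    return os.pread(fd, nbytes, dev_offset)


def make_disk(fill_bytes):
    """Create a scratch file standing in for a raw partition, one AU per byte value."""
    f = tempfile.TemporaryFile()
    for b in fill_bytes:
        f.write(bytes([b]) * AU_SIZE)
    f.flush()
    return f


disk_a = make_disk(b"AB")  # AU 0 is all 'A', AU 1 is all 'B'
disk_b = make_disk(b"CD")  # AU 0 is all 'C', AU 1 is all 'D'

# Extent 0 of the "datafile" lives in AU 1 of disk_a, extent 1 in AU 0 of disk_b.
extent_map = [(disk_a.fileno(), 1), (disk_b.fileno(), 0)]

print(read_block(extent_map, 0, 4))             # b'BBBB'
print(read_block(extent_map, AU_SIZE + 10, 4))  # b'CCCC'
```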

10 Responses to “DBWR Efficiency, AIO, I/O Libraries with ASM.”


1 poststop November 2, 2006 at 2:49 am

Kevin,

Wondering what your background is that you actually understand all this stuff at this level of detail? You got more than just book smarts. You seem to have a very good understanding of the internals from point A (Oracle) to Z (disk). Sorry if your bio is already on the blog someplace, I will look around.

– Ethan

2 kevinclosson November 2, 2006 at 5:05 am

Hi Ethan,

I welcome you as a reader. The partial answer to your question is in the intro and “long intro” in the header of my blog…

https://kevinclosson.wordpress.com/long-winded-intro/

3 poststop November 2, 2006 at 5:34 am

Hey thanks, that was the type of thing I was looking for. While I was reading it I thought of this article from Robert Cringely @ PBS.

http://www.pbs.org/cringely/pulpit/2006/pulpit_20061026_001143.html

Since you know a lot about data centers and disk drives you might be interested if you have not already seen it.

4 Noons November 2, 2006 at 6:01 am

Cool post as usual.
You know: if Looneeks insists on gettimeofday() as a kernel int, there is always a libkevin.so lurking somewhere in the back of my mind that could be used to kick it in the teeth! 😉
Something like a direct memory access to a known location?
Or a /proc inode? Or maybe even – who knows? – a Polyserve distro?
(hint,hint)

5 Luca November 2, 2006 at 9:29 am

Hi Kevin,

I find your blog very interesting. Do you have more details on ASMlib and, for example, pros and cons of ASMlib vs rawdevices on a 2.6 kernel?

.Luca

6 Amir Hameed November 2, 2006 at 9:23 pm

Kevin,
In terms of I/O performance, how does your ODM library measure with the one that Veritas provides?

Amir

7 Kevin November 2, 2006 at 11:33 pm

Amir Hameed wrote:

In terms of I/O performance, how does your ODM library measure with the one that Veritas provides?

Amir,

Thanks first for being a reader, and second for the question. This is tough to answer. The best an ODM implementation can do is reduce processor overhead associated with I/O. In the case of DBWR flushing activity, that information I gave about multiple file descriptor single-call writes and completion processing is all about processor efficiency. There is no magic speed-up. Both Veritas (I mean Gary Bloomville) and PolyServe’s ODM libraries get the data to disk as fast as if it was a raw partition–due to direct I/O. I was a Performance Engineer on the Veritas ODM project and I can say that on Solaris 8 it offered about 5-10% performance increase over libc/POSIX I/O on RAW, but only because there were some interesting processor efficiencies. On RHAS2.1 and SLES8, PolyServe’s ODM library was actually more like 15% better than RAW partitions with libc and/or that weird pre-2.6 Async I/O stuff. Those 2.4 Linux releases were really, really bad at I/O and the PolyServe PSD driver in conjunction with ODM was just more efficient.

But, like I said in the post, the 2.6 async I/O (io_submit(2)) is pretty good–albeit the completion processing is just lousy. The net/net however, is that currently our feature-rich ODM library is falling behind on the order of 6-7%–but ONLY on strict OLTP (e.g., majority I/O is db file sequential read, DBWR and LGWR writes) and only at 100% processor saturation. Wouldn’t it be nice if more software companies told the truth? Hint, Hint.

8 Raj November 8, 2006 at 9:54 pm

Interesting post. Could you, for the sake of us non-systems programmers, explain the difference between a “user mode” and a “kernel mode” gettimeofday() call? Why is user mode cheaper than kernel mode?

Raj

9 kevinclosson November 8, 2006 at 10:22 pm

For those few systems that implemented it, what it eliminates is the cost of entering the kernel through the syscall interface. The clock is mapped into the address space of the process so it is as lightweight as reading any other memory location in the process’ virtual address space (not in processor cache). Don’t get me wrong, today’s processors can switch from user to kernel mode extremely fast, but that doesn’t cancel out the fact that the call is being made with such extreme frequency.

It’s all about efficiency…at least it was when there were systems vendors that had to compete…everything is pretty much commoditized now.
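The idea in the reply above can be sketched in a few lines. This is only an illustration, not how the Sequent port (or any real kernel) did it: an anonymous mapping stands in for the clock page, and the names `CLOCK_PAGE`, `publish_time` and `read_time` are hypothetical. A real implementation would also need some way to keep a reader from seeing a half-updated value, which this sketch ignores.

```python
import mmap
import struct

# Anonymous mapping standing in for a clock page mapped into the process.
CLOCK_PAGE = mmap.mmap(-1, mmap.PAGESIZE)


def publish_time(now):
    """What the clock owner would do on every tick: store the time in the page."""
    struct.pack_into("d", CLOCK_PAGE, 0, now)


def read_time():
    """What the process does instead of a system call: a plain memory read."""
    return struct.unpack_from("d", CLOCK_PAGE, 0)[0]


publish_time(1162981320.25)
print(read_time())  # 1162981320.25
```

Once the page is mapped, each read of the clock is just a load from memory, with no trip through the syscall interface.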

1 ASMLIB Performance vs Udev : Ardent Performance Computing Trackback on October 8, 2008 at 9:28 am


Kevin Closson's Blog: Platforms, Databases and Storage