In a recent post to the oracle-l email list, a participant posted the following:
We are running Oracle 9.2.0.7 & 10.2.0.2 on AIX 5.2. I am seeing the following warnings in the DB Writer trace file:
Warning: lio_listio returned EAGAIN
Performance degradation may be seen.
There have been several very interesting threads on the list lately about Oracle on AIX with JFS2 filesystems. I’d like to blog quickly about this thread.
The error message in the DBWR trace file indicates that lio_listio() calls are being starved for resources, but what resources?
A lio_listio() Backgrounder
For those who don’t know, lio_listio() is the traditional POSIX gathered-write system call. Most Oracle ports use it. It supports DBWR’s need to flush multiple buffers to multiple files in one call, but completion processing is a little rough. The way the call works is that your program (e.g., DBWR) either issues a long list of requests and blocks until they are all done (mode=LIO_WAIT), or sets the mode argument to LIO_NOWAIT, which means your process can continue to do other work. When all of the LIO_NOWAIT requests have completed, your process is delivered a signal that is also specified at call time. So, if for some reason DBWR is doing a lot of small burst writes, it will likely be handling a lot of signals. No big deal really, but that is how it works. The way I/O completion handling works with POSIX AIO is one of the main reasons Oracle was motivated to define the Oracle Disk Manager library specification (ODM). The differences between ODM and POSIX AIO are a topic for a different blog entry.
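To make that concrete, here is a minimal sketch of a gathered write issued with lio_listio() in LIO_NOWAIT mode, with a completion signal requested at call time. This is my own illustration, not Oracle code; the file name, buffer count, and choice of SIGUSR1 are arbitrary assumptions.

/*
 * Minimal sketch, not Oracle's DBWR code: a gathered write of several
 * buffers issued with lio_listio() in LIO_NOWAIT mode, with SIGUSR1
 * requested as the batch-completion signal.  The file name, buffer
 * count, and signal choice are illustrative assumptions.
 */
#include <aio.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NREQ  4              /* buffers in this write batch              */
#define BUFSZ 8192           /* pretend these are 8 KB database blocks   */

int main(void)
{
    /* Block SIGUSR1 so the completion signal can be collected with sigwait(). */
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    sigprocmask(SIG_BLOCK, &set, NULL);

    int fd = open("datafile.dbf", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    static char bufs[NREQ][BUFSZ];
    struct aiocb cbs[NREQ];
    struct aiocb *list[NREQ];

    for (int i = 0; i < NREQ; i++) {
        memset(bufs[i], 'x', BUFSZ);
        memset(&cbs[i], 0, sizeof cbs[i]);
        cbs[i].aio_fildes     = fd;
        cbs[i].aio_buf        = bufs[i];
        cbs[i].aio_nbytes     = BUFSZ;
        cbs[i].aio_offset     = (off_t)i * BUFSZ;  /* scatter across the file */
        cbs[i].aio_lio_opcode = LIO_WRITE;
        list[i] = &cbs[i];
    }

    /* Ask for SIGUSR1 once the whole batch is done (mode = LIO_NOWAIT). */
    struct sigevent sev;
    memset(&sev, 0, sizeof sev);
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo  = SIGUSR1;

    if (lio_listio(LIO_NOWAIT, list, NREQ, &sev) != 0) {
        perror("lio_listio");     /* EAGAIN: no AIO resources available */
        return 1;
    }

    /* The process is free to do other work here; we simply wait. */
    int sig;
    sigwait(&set, &sig);

    for (int i = 0; i < NREQ; i++)
        printf("request %d: error=%d bytes=%ld\n",
               i, aio_error(&cbs[i]), (long)aio_return(&cbs[i]));

    close(fd);
    return 0;
}

With mode=LIO_WAIT the call itself blocks until the whole list completes and no signal is involved. Depending on the platform, this links against the realtime/AIO library (e.g., -lrt on Linux).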
About lio_listio() on Filesystem Files
Details, details. The way AIO is implemented varies greatly from one filesystem to the next, unfortunately. I prefer simplicity, and that is why the cluster filesystem coupled with PolyServe Matrix Server exercises the same Linux kernel internals for AIO as raw devices do. If you mount a PSFS in DBoptimized mode, calls to lio_listio(3) are met with the same kernel handling as if the call were targeting raw partitions. But this blog entry is not about PolyServe, since the oracle-l post was about Oracle9i and Oracle10g on AIX.
AIO on JFS2
So, I sent an email to the poster to make sure that this error was being hit on JFS2, and he confirmed it was. You see, AIO on JFS2 is handled by kernel processes, of which there are only 10 per processor by default. The number of system-wide AIO requests in flight is limited by the number of these AIO “handlers,” as specified in IBM’s AIO documentation (see maxservers). To get around this lio_listio() error, more AIO kernel processes are needed, but how many? The IBM documentation suggests:
The value should be about the same as the expected number of concurrent AIO requests.
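For illustration only, here is a hedged, function-level sketch of how a caller could react when lio_listio() fails with EAGAIN because no AIO handler is available: fall back to a blocking retry. I am not claiming this is what DBWR actually does, but a fallback of this sort is exactly the kind of thing that would surface as “Performance degradation may be seen.”

/*
 * Hedged sketch only (not Oracle's DBWR code): submit a batch with
 * LIO_NOWAIT and, if the kernel reports EAGAIN because AIO resources
 * (aioservers/requests) are exhausted, degrade to a blocking LIO_WAIT
 * retry.  The function name submit_batch() is hypothetical.
 */
#include <aio.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>

int submit_batch(struct aiocb *list[], int nent, struct sigevent *done_sig)
{
    if (lio_listio(LIO_NOWAIT, list, nent, done_sig) == 0)
        return 0;                  /* batch is in flight; signal arrives later */

    if (errno == EAGAIN) {
        fprintf(stderr, "Warning: lio_listio returned EAGAIN\n");
        /* Blocking retry: the caller stalls until every request completes. */
        return lio_listio(LIO_WAIT, list, nent, NULL);
    }
    return -1;                     /* some other failure */
}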
How Many is Enough? How Many is Too Many?
Hmmm, that is going to be a very difficult number to guess for the original poster who had both 9i and 10g running on the same pSeries server. Remember, there is no way to control DBWR. If you have a really dirty SGA, DBWR will do really large write batches. That is a good thing.
In cases like this I generally work out what the safe ceilings are. I know Bret Olszewski, and if he gives a reasonable value for a tunable, I tend to take his word. In this paper about database performance on JFS2, Bret and his co-authors state:
[…] the number of aioservers in the system limits the number of asynchronous I/O operations that can be in progress simultaneously. The maximum number of aioservers that can be created is controlled by the maxservers attribute, which has a default value of 10 per processor. For our experiments, we used a maxservers value of 400 per processor.
The paper also points out that there are 4 CPUs in the System p 680 test system. So while the default is 10 per CPU, it looks like, if the collective DBWR in-flight I/O is bursting at high rates (such as 1600 concurrent I/Os), setting aioservers to 400 per processor is not unreasonable.
The paper also covers a technique for determining whether you have too many aioservers. Basically, if you have a bunch of these kernel processes that never get CPU time, then the I/O load is already being handled sufficiently by the processes that are burning CPU, and you can reduce the number if it seems important to you. Makes sense, but I can’t imagine that having sleeping kernel processes around costs much.
A Word About Filesystem Performance
OK, if Bret specifies a tunable, I take his advice. On the other hand, I have to call out a problem in that JFS2 paper. In the analysis section, the conclusion is that Concurrent I/O performs 200% better than plain Direct I/O and comes within 8% of raw logical volumes. That makes sense, since CIO implicitly invokes DIO, and DIO should be as fast as raw disk with the exception of write serialization (which CIO addresses). The test was 100% processor bound and I/O intensive, so any processor overhead in the I/O code path will account for that 8%; there is, after all, more software in the filesystem case than in the raw logical volume case. But is it really code path that costs the 8 percent? The paper states:
The Concurrent I/O run has a higher %system and lower %user than the raw LV run because of the additional context switches due to traversing down the file system path for I/O, and due to the use of AIO server threads for servicing AIO requests […]
OK, the bit about “additional context switches due to traversing down the file system path” is really weird. There is no such thing unless you are opening files for every I/O, which, of course, Oracle does not do. Oracle keeps files open, and an I/O is nothing more than a positioned read/write (e.g., pread(), pwrite()) or an asynchronous write (e.g., lio_listio()). It is true that converting a pathname to an inode has a cost (e.g., the namei()/lookup() kernel routines), but that cost is paid in the open(2) call. Once a file is open there is no cost associated with the pathname, because the I/O routines do not even use a pathname; they use the file descriptor. In short, there is no “traversing down the file system path” associated with I/O on an open file. If I had $50 to bet, I would wager that the 1600 AIO handlers jumping on and off those 4 CPUs account for the 8%, but what do I know?
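To illustrate that point, here is a trivial sketch (the file name and sizes are made up): the pathname-to-inode work happens once in open(), and every subsequent write names only the file descriptor.

/*
 * Sketch illustrating the point above: the pathname is resolved once, in
 * open(); every subsequent I/O uses only the file descriptor, so there is
 * no per-I/O "traversal of the file system path".  File name and sizes
 * are illustrative.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Pathname-to-inode lookup happens here, once. */
    int fd = open("datafile.dbf", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char block[8192];
    memset(block, 'x', sizeof block);

    /* Thousands of I/Os later, the kernel still only sees the descriptor. */
    for (off_t i = 0; i < 1000; i++)
        if (pwrite(fd, block, sizeof block, i * (off_t)sizeof block) < 0) {
            perror("pwrite");
            break;
        }

    close(fd);
    return 0;
}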
Hello Kevin,
sorry if I did not catch it, but how can the performance degradation
be estimated, even approximately?
Can I get some estimate by tracing DBWR during peak-load hours or by monitoring the number of AIO servers?
Thank you very much for your explanation about lio_listio.
Regards,
Andrey Goryunov
Andrey,
DBWR is doing a good enough job as long as you don’t witness free buffer waits in Oracle. AIO is also used with PQO, so there I would just run enough AIO processes so that (as was discussed in Bret’s paper) there are at least a few that are not utilizing CPU. With some not utilizing CPU, you have a clear indication that you have “enough.”
Hi,
the workaround for the

Warning: lio_listio returned EAGAIN
Performance degradation may be seen

is to increase MAXSERVERS to 32 per CPU and ‘Maximum number of REQUESTS’ to 51200. If you still get the warning, then increase ‘Maximum number of REQUESTS’ to 102400.
Please observe that one needs to reboot the system to make the changes effective.
best wishes