Archive for the 'Oracle Asynchronous I/O' Category

Automatic Databases Automatically Detect Storage Capabilities, Don’t They?

Doug Burns has started an interesting blog thread about the Oracle Database 11g PARALLEL_IO_CAP_ENABLED parameter in his blog entry about Parallel Query and Oracle Database 11g. Doug is discussing Oracle’s new concept of built-in I/O subsystem calibration-a concept aimed at more auto-tuning database instances. The idea is that Oracle is trying to make PQ more aware of the down-wind I/O subsystem capability so that it doesn’t obliterate it with a flood of I/O. Yes, a kinder, gentler PQO.

I have to admit that I haven’t yet calibrated this calibration infrastructure. That is, I aim to measure the difference between what I know a given I/O subsystem is capable of and what DBMS_RESOURCE_MANAGER.CALIBRATE_IO thinks it is capable of. I’ll blog the findings of course.

In the meantime, I recommend you follow what Doug is up to.

A Really Boring Blog Entry
Nope, this is not just some look at that other cool blog over there post. At first glance I would hope that all the regular readers of my blog would wonder what value there is in throttling I/O all the way up in the database itself given the fact that there are several points at which I/O can/does get throttled downwind. For example, if the I/O is asynchronous, all operating systems have a maximum number of asynchronous I/O headers (the kernel structures used to track asynchronous I/Os) and other limiting factors on the number of outstanding asynchronous I/O requests. Likewise, SCSI kernel code is fit with queues of fixed depth and so forth. So why then is Oracle doing this up in the database? The answer is that Oracle can run on a wide variety of I/O subsystem architectures and not all of these are accessed via traditional I/O system calls. Consider Direct NFS for instance.

With Direct NFS you get disk I/O implemented via the remote procedure call interface (RPC). Basically, Oracle shoots the NFS commands directly at the NAS device as opposed to using the C library read/write routines on files in an NFS mount-which eventually filters down to the same thing anyway, but with more overhead. Indeed, there is throttling in the kernel for the servicing of RPC calls, as is the case with traditional disk I/O system calls, but I think you see the problem. Oracle is doing the heavy lifting that enables you to take advantage of a wide array of storage options-and not all of them are accessed with age-old traditional I/O libraries. And it’s not just DNFS. There is more coming down the pike, but I can’t talk about that stuff for several months given the gag order. If I could, it would be much easier for you to visualize the importance of DBMS_RESOURCE_MANAGER.CALIBRATE_IO. In the meantime, use your imagination. Think out of the box…way out of the box…

Using Oracle Disk Manager to Monitor Database I/O

Some of the topics in this post are also covered in the Julian Dyke/Steve Shaw RAC book that came out last summer. I enjoyed being one of the technical reviewers of the book. It is a good book.

Monitoring DBWR Specifically
I have received several emails from existing PolyServe customers asking me why I didn’t just use the Oracle Disk Manager (ODM) I/O monitoring package that is included in the PolyServe Database Utility for Oracle to show the multi-block DBWR writes I blogged about in this post. After all, there is very little left to the imagination when monitoring Oracle using this unique feature of our implementation of the Oracle Disk Manager library specification.

This URL will get you a copy of the I/O Monitoring feature of our Oracle Disk Manager library. It is quite a good feature.

I didn’t use ODM for the first part of that thread because I wanted to discuss using strace(1) for such purposes. Yes, I could have just used the mxodmstat(1) command that comes with that package and I would have seen that the average I/O size was not exactly db_block_size as one would expect. For instance, the following screen shot is an example of cluster wide monitoring of DBWR processes. The first invocation of the command is used to monitor DBWR only followed by another execution of the command to monitor LGWR. The average size of the async writes for DBWR are not precisely 8KB (the db_block_size for this database) as they would be if this was Oracle9i:

NOTE: You may have to right click->view the screen shot


As an aside, the system was pretty busy as the following screen shot will show. This is a non-RAC database on an HP Proliant DL-585 where database writes are peaking at roughly 140MB/s. You can also see that the service times (Ave ms) for the writes are averaging a bit high (as high as 30ms). Looks like I/O subsystem saturation.



Oh, here’s a quick peek at one nice feature of mxodmstat(1). You can dump out all the active files, clusterwide, for any number of database/instances and nodes using the –lf options:


I hope you take peek at the User Guide for this feature. It has a lot of examples of what the tool can do. You might find it interesting—perhaps something you should push your vendor to implement?


Analyzing Asynchronous I/O Support with Oracle10g

This is not a post about why someone would want to deploy Oracle with a mix of files with varying support for asynchronous I/O. It is just a peek at how Oracle10g handles it. This blog post is a continuation of yesterday’s topic about analyzing DBWR I/O activity with strace(1).

I’ve said many times before that one of the things Oracle does not get sufficient credit for is the fact that the database adapts so well so such a tremendous variety of platforms. Moreover, each platform can be complex. Historically, with Linux for instance, some file systems support asynchronous I/O and others do not. With JFS on AIX, there are mount options to consider as is the case with Veritas on all platforms. These technologies offer deployment options. That is a good thing.

What happens when the initialization parameter filesystemio_options=asynch yet there are a mix of files that do and do not support asynchronous I/O? Does Oracle just crash? Does it offline files? Does it pollute the alert log with messages every time it tries an asynchronous I/O to a file that doesn’t support it? The answer is that it does not of that. It simply deals with it. It doesn’t throw the baby out with the bath water either. Much older versions of Oracle would probably have just marked the whole instance to the least common demoninator (synchronous).

Not Just A Linux Topic
I think the information in this blog post should be considered useful on all platforms. Sure, you can’t use strace(1) on a system that only offers truss(1), but you can do the same general analysis with either. The system calls will be different too. Whereas Oracle has to use the Linux-only libaio routines called io_submit(2)/io_getevents(2), all other ports[1] use POSIX asynchronous I/O (e.g., lio_listio,aio_write,etc) or other proprietary asynchronous I/O library routines.

Oracle Takes Charge
As I was saying, if you have some mix of technology where some files in the database do not support asynchronous I/O, yet you’ve configured the instance to use it, Oracle simply deals with the issue. There are no warnings. It is important to understand this topic in case you run into it though.

Mixing Synchronous with Asynchronous I/O
In the following screen shot I was viewing strace(1) output of a shadow process doing a tablespace creation. The instance was configured to use asynchronous I/O, yet the CREATE TABLESPACE command I issued was to create a file in a filesystem that does not support asynchronous I/O[2]. Performing this testing on a platform where I can mix libaio asynchronous I/O and libc synchronous I/O with the same instance makes it easy to depict what Oracle is doing. At the first arrow in the screen shot, the OMF datafile is created with open(2) using the O_CREAT flag. The file descriptor returned is 13. The second arrow points to the first asynchronous I/O issued against the datafile. The io_submit(2) call failed with EINVAL indicating to Oracle that the operation is invalid for this file descriptor.

NOTE: Firefox users report that you need to right click->view the image to see these screen shots


Now, Oracle could have raised an error and failed the CREATE TABLESPACE statement. It did not. Instead, the shadow process simply proceeded to create the datafile with synchronous I/O. The following screen shot shows the same io_submit(2) call failing at the first arrow, but nothing more than the invocation of some shared libraries (the mmap() calls) occurred between that failure and the first synchronous write using pwrite(2)—on the same file descriptor. The file didn’t need to be reopened or any such thing. Oracle simply fires off a synchronous write.


What Does This Have To Do With DBWR?
Once the tablespace was created, I set out to create tables in it with CTAS statements. To see what DBWR behaved like with this mix of asynchronous I/O support, I once again monitored DBWR with strace(1) sending the trace info to a file called mon.out. The following screen shot shows that the first attempts to flush SGA buffers to the file also failed with EINVAL. All was not lost however, the screen shot also shows that DBWR continued just fine using synchronous writes to this particular file. Note, DBWR does not have to perform this “discovery” on every flushing operation. Once the file is deemed unsuitable for asynchronous I/O, all subsequent I/O will be synchronous. Oracle just continues to work, without alarming the DBA.


How Would a Single DBWR Process Handle This?

So the next question is what does it mean to have a single database writer charged with the task of flushing buffers from the SGA to a mix of files where not all files support asynchronous I/O? It is not good. Now, as I said, Oracle could have just reverted the entire instance to 100% synchronous I/O, but that would not be in the best interest of performance. On the other hand, if Oracle is doing what I’m about to show you, it would be nice if it made one small alert log entry—but it doesn’t. That is why I’m blogging this (actually it is also because I’m a fan of Oracle at the platform level).

In the following screen shot, I use egrep(1) to pull occurrences from the DBWR strace(1) output file where io_submit(2) and pwrite(2) are intermixed. Again, this is a single DBWR flushing buffers from the SGA to files of varying asynchronous I/O support:


In this particular case, the very first io_submit(2) call flushed 4 buffers, 2 each to file descriptors 19 and 20. Before calling io_getevents(2) to process the completion of those asynchronous I/Os, DBWR proceeds to issue a series of synchronous writes to file descriptor 24 (another of the non-asynchronous I/O files in this database). By the way, notice that most of those writes to file descriptor 24 were multi-block DBWR writes. The problem with having one DBWR process intermixing synchronous with asynchronous I/O is that any buffers in the write batch bound for a synchronous I/O file will cause a delay in the instantiation of any buffer flushing to asynchronous I/O files. When DBWR walks an LRU to build a batch, it is not considering the lower-level OS support of asynchronous I/O on the file that a particular buffer will be written to. It just builds a batch based on buffer state and age. In short, synchronous I/O requests will cause a delay in the instantiation of subsequent asynchronous requests.

OK, so this is a two edged sword. Oracle handles this complexity nicely—much credit due. However, it is not entirely inconceivable that some of you out there have databases configured with a mix of asynchronous I/O support for your files. From platform to platform this can vary so much. Please be aware that this is not just a file system topic. It can also be a device driver issue. It is entirely possible to have a file system that generically supports asynchronous I/O created on a device where the device driver does not. This scenario will also result in EINVAL on asynchronous I/O calls. Here too, Oracle is likely doing the right thing—dealing with it.

What To Do?
Just use raw partitions. No, of course not. We should be glad that Oracle deals with such complexity so well. If you configure multiple database writers (not slaves) on a system that has a mix of asynchronous I/O support, you’ll likely never know the difference. But the topic is at least on your mind.

[1] Except Windows of course

[2] The cluster  file system in PolyServe’s Database Utility for Oracle uses a mount option to enable both direct I/O and OS asynchronous I/O. However, when using PolyServe’s Oracle Disk Manager (ODM) Library Oracle can perform asynchronous I/O on all mount types. Mount options for direct I/O is quite common as this is a requirement on UFS and OCFS2 as well.

Analyzing Oracle10g Database Writer I/O Activity on Linux

Using strace(1) to Study Database Writer on Linux
This is a short blog entry for folks that are interested in Oracle10g’s usage of libaio asynchronous I/O routines (e.g., io_submit(2)/io_getevents(2)). For this test, I set up Oracle10g release on Red Hat 4 x86_64. I am using the cluster filesystem bundled with the PolyServe’s Database Utility for Oracle, but for all intents and purposes I could have used ext3.

The workload is a simple loop of INSERT INTO SELECT * FROM statements to rapidly grow some tables thereby stimulating Database Writer (DBWR) to flush modified SGA buffers to disk. Once I got the workload running, I simply executed the strace command as follows where <DBWR_PID> was replaced with the real PID of the DBWR process:

$ strace -o dbw -p <DBWR_PID>

NOTE: Using strace(1) imposes a severe penalty on the process being traced.  I do not recommend using strace(1) on a production instance unless you have other really big problems the strace(1) output would help you get under control.

The second argument to the io_submit(2) call is a long integer that represents the number of I/O requests spelled out in the current call. The return value to an io_submit(2) call is the number of iocb’s processed. One clever thing to do is combine grep(1) and awk(1) to see what degree of concurrent I/O DBWR is requesting on each call. The following screen shot shows an example of using awk(1) to select the io_submit(2) calls DBWR has made to request more than a single I/O. All told, this sould be the majority of DBWR requests.

NOTE: You may have to right click->view the image. Some readers of this blog have reported this. Sorry

strace 1

Another way to do this is to anchor on the return to the io_submit(2) call. The following screen shot shows an example of grep(1) to selecting only the io_submit(2) calls that requested more than 100 I/O transfers in a single call.

strace 2

What File Descriptors?
When io_submit(2) is called for more than one I/O request, the strace(1) output will string out details of each individual iocb. Each individual request in a call to io_submit(2) can be for a write to a different file descriptor. In the following text grep(1)ed out of the strace(1) output file, we see that Oracle requested 136 I/Os in a single call and the first 2 iocbs were requests to write on file descriptor 18.:

io_submit(182926135296, 136, {{0x2a973cea40, 0, 1, 0, 18}, {0x2a973e6e10, 0, 1, 0, 18}

What About ASM?
You can run ASM with libaio. If you do, you can do this sort of monitoring, but you wont really be able to figure out what DBWR is writing to because the file descriptors will just point to raw disk. ASM is raw disk.


I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 2,933 other followers

Oracle ACE Program Status

Click It

website metrics

Fond Memories


All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.

%d bloggers like this: