Learning Something New
Learning Oracle is a never ending effort. “OK, tell me something I didn’t know”, you say? You may know this bit I’m about to blog about, but I sure didn’t. I don’t know when the Database Writer I/O profile changed, but it has—somewhere along the way.
I have a simple test of 10gR2 (10.2.0.1) on RHEL4 x86_64 using filesystem files. I was using strace(1) on the single DBWR process I have configured for this particular instance. The database uses an 8KB block size and there are no variable block sizes anywhere (pools are not even configured). The workload I’m running while monitoring DBWR is quite simple. I have a tablespace called TEST created using Oracle Managed Files (OMF)—thus the peculiar filenames you’ll see in the screen shots below. I have 2 tables in the TEST tablespace and am simply looping INSERT INTO SELECT * FROM statements between the 2 tables as the stimulus to get DBWR busy.
In the following screen shot you’ll see that I took a look at the file descriptors DBWR is using by listing a few of them out in /proc/<DBWR PID>/fd. The interesting file descriptors for this topic are:
- FD 18 – System Tablespace
- FD 19 – UNDO Tablespace
- FD 20 – SYSAUX Tablespace
- FD 23 – The TEST Tablespace
NOTE: Please right click the image to open in a viewer. Some of you readers have reported a problem, but we’ve found that it is as simple as clicking it. I need to investigate if that is something that wordpress is doing.
I have purposefully set up this test to not use libaio, thus filesystemio_options was not set in the parameter file. In the next screen shot I use grep to pull all occurrences of the pwrite(1) system calls that DBWR is making that are not 8KB in size. Historically there should be none since DBWR’s job is to clean scattered SGA buffers by writing single blocks to random file offsets. That has always been DBWR’s lot in life.
So, as strace(1) is showing, these days DBWR is exhibiting a variation of its traditional I/O profile. In this synchronous I/O case, on this port of Oracle, DBWR is performing synchronous multi-block writes to sequential blocks on disk! That may seem like trivial pursuit, but it really isn’t. First, where are the buffers? The pwrite system call does not flush scattered buffers as do such routines as writev(),lio_listio() or odm_io()—it is not a gathered write. So if DBWR’s job is to build write batches by walking LRUs and setting up write-lists by LRU age, how is it magically finding SGA buffers that are adjacent in memory and bound for sequential offsets in the same file? Where is the Twilight Zone soundtrack when you need it? The only way DBWR can issue these system calls requires the buffers to be contiguous in memory. That would surely require a memory copy from the SGA buffers to a contiguous memory location of equal size in either the PGA or some non-buffer pool heap in the SGA.
Of course DBWR also behaved in a more “predictable” manner during this test as well as the following screen shot shows:
Is This A Big Problem?
No, I don’t think so—unless you’re like me and have had DBWR’s I/O profile cast in stone dating back to Version 6 of Oracle. All this means is that when you are counting DBWR write calls, you can’t presume they are always single block. Now that is something new.
Is This A Direct Path Write?
I don’t yet know if this is accounted for as a direct path write in Oracle statistics. I think it should be, because it look, smells and feels just like a direct path write, but I need to look into it.
I often point out that Microsoft has always had it quite easy with SQL Server. I’m not doing that pitiful Microsoft bashing that I despise so much, but think about it. SQL Server started with a fully functional version of Sybase (I was a Sybase fan) and brought it to market on 1 Operating System and only 1 platform (x86). Since then their “porting” effort as “exploded” into x86, x86_64 and IA64. That is the simple life.
This DBWR “issue” may only be relevant to this Oracle revision and this specific port. I can say with great certainty that if Sequent were still alive, the Sequent port would have used one of the gathered write calls we had at our disposal in this particular case. With Oracle, so much of the runtime is determined by porting decisions. So the $64,000 dollar question is, why isn’t DBWR just using writev(1) which would nicely clear all those SGA buffers from their memory-scattered locations?