Learning Something New
Learning Oracle is a never ending effort. “OK, tell me something I didn’t know”, you say? You may know this bit I’m about to blog about, but I sure didn’t. I don’t know when the Database Writer I/O profile changed, but it has—somewhere along the way.
I have a simple test of 10gR2 (10.2.0.1) on RHEL4 x86_64 using filesystem files. I was using strace(1) on the single DBWR process I have configured for this particular instance. The database uses an 8KB block size and there are no variable block sizes anywhere (pools are not even configured). The workload I’m running while monitoring DBWR is quite simple. I have a tablespace called TEST created using Oracle Managed Files (OMF)—thus the peculiar filenames you’ll see in the screen shots below. I have 2 tables in the TEST tablespace and am simply looping INSERT INTO SELECT * FROM statements between the 2 tables as the stimulus to get DBWR busy.
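For those who want to reproduce the stimulus, the loop is nothing more elaborate than something along these lines. This is only a sketch of what I described above; T1, T2 and the test/test login are placeholder names, not the actual ones from my test:

#!/bin/bash
# Sketch of the stimulus workload: ping-pong rows between two tables in the
# TEST tablespace so DBWR has a steady supply of dirty buffers to flush.
# T1/T2 and test/test are placeholders.
while true
do
  sqlplus -s test/test <<'EOF'
INSERT INTO t2 SELECT * FROM t1;
INSERT INTO t1 SELECT * FROM t2;
COMMIT;
EOF
done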
In the following screen shot you’ll see that I took a look at the file descriptors DBWR is using by listing a few of them out in /proc/<PID>/fd. The interesting file descriptors for this topic are:
- FD 18 – System Tablespace
- FD 19 – UNDO Tablespace
- FD 20 – SYSAUX Tablespace
- FD 23 – The TEST Tablespace
NOTE: Please right-click the image to open it in a viewer. Some readers have reported a problem viewing the images, but we’ve found the fix is as simple as right-clicking. I need to investigate whether that is something WordPress is doing.
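For those who would rather pull the same information from the shell than squint at a screen shot, something like this shows the mapping (a sketch; it assumes the default ora_dbw0_<SID> background process naming):

# List the datafile behind each of DBWR's open file descriptors.
DBWR_PID=$(pgrep -f ora_dbw0_)
ls -l /proc/${DBWR_PID}/fd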
I have purposely set up this test not to use libaio, so filesystemio_options was not set in the parameter file. In the next screen shot I use grep to pull out all occurrences of pwrite(2) system calls DBWR is making that are not 8KB in size. Historically there should be none, since DBWR’s job is to clean scattered SGA buffers by writing single blocks to random file offsets. That has always been DBWR’s lot in life.
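The pipeline for that looks roughly like this (a sketch; the exact name strace prints for the call, pwrite versus pwrite64, and its output format vary by platform and strace version):

# Attach to DBWR and keep only the pwrite() calls whose request size
# is not 8192 bytes (the block size in this test).
strace -p $(pgrep -f ora_dbw0_) 2>&1 | grep pwrite | grep -v ', 8192,'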
So, as strace(1) is showing, these days DBWR is exhibiting a variation of its traditional I/O profile. In this synchronous I/O case, on this port of Oracle, DBWR is performing synchronous multi-block writes to sequential blocks on disk! That may seem like trivial pursuit, but it really isn’t. First, where are the buffers? The pwrite system call does not flush scattered buffers as do such routines as writev(), lio_listio() or odm_io(); it is not a gathered write. So if DBWR’s job is to build write batches by walking LRUs and setting up write-lists by LRU age, how is it magically finding SGA buffers that are adjacent in memory and bound for sequential offsets in the same file? Where is the Twilight Zone soundtrack when you need it? For DBWR to issue these pwrite() system calls requires the buffers to be contiguous in memory.
Of course, DBWR also behaved in the more “predictable” manner during this test as well, as the following screen shot shows:
Is This A Big Problem?
No, I don’t think so—unless you’re like me and have had DBWR’s I/O profile cast in stone dating back to Version 6 of Oracle. All this means is that when you are counting DBWR write calls, you can’t presume they are always single block. Now that is something new.
Porting
I often point out that Microsoft has always had it quite easy with SQL Server. I’m not doing that pitiful Microsoft bashing that I despise so much, but think about it. Microsoft started SQL Server with a fully functional version of Sybase (I was a Sybase fan) and brought it to market on one operating system and only one platform (x86). Since then their “porting” effort has “exploded” into x86, x86_64 and IA64. That is the simple life.
This DBWR “issue” may only be relevant to this Oracle revision and this specific port. I can say with great certainty that if Sequent were still alive, the Sequent port would have used one of the gathered write calls we had at our disposal in this particular case. With Oracle, so much of the runtime is determined by porting decisions. So the $64,000 question is, why isn’t DBWR just using writev(2), which would nicely clear all those SGA buffers from their memory-scattered locations?
Hmmm, very interesting. It does seem like direct path write, but without the APPEND hint or parallel DML I don’t see how that could occur.
I couldn’t open the attached images though, Kevin. Some link problem I think.
Hi David,
I’m not happy with the fact that I have to attach screen shots, but on the other hand I have a terrible time cutting and pasting from an xterm into WordPress’s editor… it doesn’t look anything like what I was seeing in the xterm… just a mess… Have you tried right-clicking? I’ve had other folks here look at this blog post with Firefox on Linux and Windows as well as IE6 and they have no problems… but then I also had another person report a problem… aargh!
The reason the screen shot is SO critical is that it shows I was stracing DBWR, so messing with parallel DML won’t change its activity. If this were PQ parallel DML, stracing DBWR would be very boring. It is really weird. This is DBWR coalescing and doing a synchronous write, but not a vectored write (a la writev)… just weird.
Yeah, I thought I couldn’t view the images at first because I just tried clicking them but a right-click/view image does the trick.
Interesting.
I concur with Doug Burns on this – right click and ‘view image’ does the trick for viewing the screen shots.
thanks guys … I will be posting a LOT to this blog with vt100/xterm stuff … I’ll be working out how to cut and paste, but in the meantime it is simpler for me to go this route…the important thing is that I find something quickly that works for everyone that reads my blog
Yup works .. cheers fellas. Put me down as “baffled by technology”.
Not about DBWR but about new things.
A few months ago I discovered that on 10.2, index range scans and index full scans can use multiblock reads of eight blocks (when possible) at a time.
Rows     Row Source Operation
-------  ---------------------------------------------------
    200  TABLE ACCESS BY INDEX ROWID TEST (cr=206 pr=69 pw=0 time=654773 us)
    200   INDEX FULL SCAN I_TEST (cr=4 pr=3 pw=0 time=25281 us)(object id 3996531)

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                       3        0.00          0.00
  SQL*Net message from client                     3        0.00          0.00
  db file sequential read                         5        0.01          0.04
  db file scattered read                          8        0.01          0.07
Well, direct path writes go directly from the shadow processes’ PGA to disk. Obviously, DBWR’s role is to flush dirty buffers from the SGA buffer cache, so these just can’t be direct path writes.
By the way, when I traced DBWR with event 10046 with async I/O enabled, it didn’t post any event to the trace file, and I was scratching my head over how DBWR could be idle during checkpoints… until I attached to it with strace. It was on 10.2 (I think 10.2.0.1) on a 32-bit x86 platform.
Yes, Alex, you are correct in your definition of direct path writes. I have not yet looked at the accounting of these DBWR I/Os, so there is still some guessing going on…
Your assertion that DBWR’s role is to flush from the SGA to disk is no longer entirely correct. That is why I posted this whole thing. DBWR now flushes both with its traditional I/O profile of single-block SGA buffer writes and, when it deems fit, with these coalesced (copied) sets of buffers. I’ll be posting more info as soon as possible.
Because of your blog post about this, I took a look and noticed a change in the read and write statistics in v$sysstat.
If you search for ‘physical write total%’, you’ll notice ‘physical write total multi block requests’. I have no evidence this truly reflects DBWR making multiblock writes, but the name suggests it.
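In case anyone wants to look for themselves, a query along these lines lists them (a sketch; system/manager is only a placeholder login):

# Pull the 'physical write total%' statistics from v$sysstat.
sqlplus -s system/manager <<'EOF'
SELECT name, value FROM v$sysstat WHERE name LIKE 'physical write total%';
EOF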
I noticed while browsing the 10.2 documentation the following comment:
http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14220/process.htm#i7259
“In all cases, DBWn performs batched (multiblock) writes to improve efficiency. The number of blocks written in a multiblock write varies by operating system.”
Thanks, David… good find. The wording of that document, however, cannot reflect reality. It says “In all cases”, which we all know to be preposterous. It is actually not the norm to have two cold, adjacent buffered blocks to be written at the aged end of an LRU every time DBWR is posted to flush. In fact, I hope the server doesn’t spend much effort determining the co-locality of buffers. I’d love to be able to benchmark with and without this feature. If I had 50 dollars to spare, I’d bet TPC-Cs are run with this feature compiled out, because we all know that TPC-C is a random thrash of customer and stock.
Thanks for stopping by, David.