Over-Configuring DBWR Processes – Part II.

In my recent blog entry about over-configuring DBWR processes, I mentioned the processor cache effect of having too many DBWR processes doing work that a single DBWR process could handle. I’d like to provide more detail on the matter.

Backgrounder
Few people really understand what DBWR does for a living. Yes, everyone knows DBWR is responsible for flushing modified buffers from the SGA, but that is a very high-level view. More specifically, DBWR (a toy code sketch of these duties follows the list):

  1. Builds “write batches” by examining the state of buffer headers on the LRU lists, scanning from the LRU end and considering the age of each buffer. At instance boot time, each DBWR process is assigned a set of LRUs to tend.
  2. Manipulates cache buffers chains elements for such reasons as marking a buffer busy while it is in flight for an I/O and marking it written after the flush completes.
  3. Performs I/O using the OSDs available to it. OSDs are routines in the Oracle kernel that are Operating System (port) specific; they are the routines through which Oracle gets support for asynchronous and/or direct I/O via calls to various libraries (libc, libaio, libodm, and so on).
  4. Posts waiters, such as foreground processes waiting on the free buffer waits event.
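
Pulling those four duties together, here is a toy sketch of a single DBWR write cycle in C. To be clear, this is not Oracle code; the buffer states, batch size, and field names are invented purely to illustrate the scan/mark/write/post sequence listed above.

/* Toy model of one DBWR write cycle (illustrative only; not Oracle code). */
#include <stdio.h>

#define LRU_LEN   16   /* buffers on this DBWR's LRU (hypothetical) */
#define BATCH_MAX  4   /* max buffers per write batch (hypothetical) */

enum buf_state { CLEAN, DIRTY, WRITE_IN_FLIGHT };

struct buf_hdr {
    int            dba;    /* pretend data block address */
    enum buf_state state;
};

int main(void)
{
    struct buf_hdr lru[LRU_LEN];
    int batch[BATCH_MAX];
    int nbatch = 0, i;

    /* seed the LRU with a few modified (dirty) buffers */
    for (i = 0; i < LRU_LEN; i++) {
        lru[i].dba   = 1000 + i;
        lru[i].state = (i % 3 == 0) ? DIRTY : CLEAN;
    }

    /* 1. scan from the cold end of the LRU, building a write batch */
    for (i = 0; i < LRU_LEN && nbatch < BATCH_MAX; i++) {
        if (lru[i].state == DIRTY) {
            lru[i].state = WRITE_IN_FLIGHT;   /* 2. mark the buffer busy */
            batch[nbatch++] = i;
        }
    }

    /* 3. hand the batch to the OSD layer (asynchronous I/O in real life) */
    for (i = 0; i < nbatch; i++)
        printf("writing dba %d\n", lru[batch[i]].dba);

    /* on I/O completion: mark the buffers written, then 4. post waiters */
    for (i = 0; i < nbatch; i++)
        lru[batch[i]].state = CLEAN;
    printf("batch of %d written; posting free buffer waiters\n", nbatch);

    return 0;
}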

The problem with over-configuring DBWR processes is twofold:

  1. Processor cache thrashing. DBWR processes have huge working sets since they deal with so many LRU and cache buffers chains structures. Having more of these processes than necessary migrating from CPU to CPU puts a huge dent in the L2 caches with both their page table and data footprints.
  2. Scheduling anomalies. If you have more than 1 DBWR process, you run an increased risk of having DBWR processes that can’t get on CPU. Consider a checkpoint. All LRUs will have modified buffers that need to be flushed during a checkpoint, and any given LRU can only be cleaned by the DBWR process it was assigned to at boot time. That means a checkpoint requires action from all DBWR processes, and it’s easier for 1 process to get scheduled on a CPU on, say, an 8-CPU system than it is for 8 processes.

Point 2 above is the part I suspect most folks aren’t aware of. I think folks envision that the various DBWR processes will only get busy if their peers aren’t keeping up. That is not the case. LRUs get evenly loaded by foreground processes because when a process needs a buffer (for file I/O, a sort buffer, block cloning, etc.), it first tries the LRU latch for the last LRU it used (which improves cache warmth), and if it can’t get that latch with a no-wait get, it jumps to a different latch and does another no-wait get. This continues until it either obtains an LRU latch on a no-wait get or it loops back around to the first latch it attempted and does a wait-get. I recall that in older releases of Oracle the next LRU to try was the very next LRU. That could have changed to something like a random selection, but I don’t know (I haven’t seen the code in years, but that’s about to change). No matter; the LRUs are evenly loaded. And since the LRUs are evenly loaded, the DBWR processes are equally tasked. I’m not very smart, but this I know:

On SMP systems it is not wise to break up a task that a single CPU can do and distribute the work evenly amongst multiple CPUs. You introduce processor scheduling overhead and you hammer the processor caches.

How true, and even more so when the processes share resources as DBWR processes do; they “share” cache buffers chains and other goodies. Anyway, it might be smart to do that sort of task decomposition on realtime or other specialized systems, but not on Oracle systems. Let’s take a closer look.
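
Before the test results, here is a rough sketch of the no-wait latch walk described above, using POSIX mutexes to stand in for LRU latches. The latch count and the neighbor-style “next latch” policy are assumptions on my part, not something lifted from Oracle source.

/* Sketch of a foreground's LRU latch selection; assumptions noted above. */
#include <pthread.h>
#include <stdio.h>

#define NUM_LRU_LATCHES 8   /* hypothetical number of LRU latches */

static pthread_mutex_t lru_latch[NUM_LRU_LATCHES] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

/* Returns the index of the latch we end up holding. */
static int get_lru_latch(int last_used)
{
    int i, idx;

    /* No-wait gets: start with the latch used last time (cache warmth),
     * then walk through the others. */
    for (i = 0; i < NUM_LRU_LATCHES; i++) {
        idx = (last_used + i) % NUM_LRU_LATCHES;
        if (pthread_mutex_trylock(&lru_latch[idx]) == 0)
            return idx;                 /* got one without waiting */
    }

    /* Every no-wait get failed: wait-get on the first latch we tried. */
    pthread_mutex_lock(&lru_latch[last_used]);
    return last_used;
}

int main(void)
{
    int held = get_lru_latch(3);   /* pretend latch 3 was the last one used */
    printf("holding LRU latch %d\n", held);
    pthread_mutex_unlock(&lru_latch[held]);
    return 0;                      /* compile with: cc -pthread latchwalk.c */
}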

Moderate OLTP Load, db_writer_processes=8
I just did an OLTP run on my DL585 8-core system with db_writer_processes set to 8. After the instance was booted, but before the Pro*C OLTP clients were triggered, I ran a ps(1) command grepping for all the DBWR processes. As soon as the 900-second run completed, the driver script ran the same ps(1) command. All output goes to nohup.out, so I can grep for the DBWR processes and see how much CPU time they consumed during the run:

$ grep ora_dbw nohup.out
oracle   14507     1  2 13:46 ?        00:00:01 ora_dbw0_bench1
oracle   14509     1  2 13:46 ?        00:00:01 ora_dbw1_bench1
oracle   14511     1  2 13:46 ?        00:00:02 ora_dbw2_bench1
oracle   14513     1  1 13:46 ?        00:00:01 ora_dbw3_bench1
oracle   14515     1  1 13:46 ?        00:00:01 ora_dbw4_bench1
oracle   14517     1  2 13:46 ?        00:00:01 ora_dbw5_bench1
oracle   14519     1  1 13:46 ?        00:00:01 ora_dbw6_bench1
oracle   14521     1  2 13:46 ?        00:00:01 ora_dbw7_bench1
oracle   14507     1  2 13:46 ?        00:00:23 ora_dbw0_bench1
oracle   14509     1  2 13:46 ?        00:00:23 ora_dbw1_bench1
oracle   14511     1  2 13:46 ?        00:00:23 ora_dbw2_bench1
oracle   14513     1  2 13:46 ?        00:00:23 ora_dbw3_bench1
oracle   14515     1  2 13:46 ?        00:00:23 ora_dbw4_bench1
oracle   14517     1  2 13:46 ?        00:00:23 ora_dbw5_bench1
oracle   14519     1  2 13:46 ?        00:00:23 ora_dbw6_bench1
oracle   14521     1  2 13:46 ?        00:00:23 ora_dbw7_bench1

The ps(1) output shows us that each of the 8 DBWR processes accumulated 23 seconds of CPU. This clearly reflects the fact that all LRUs are loaded evenly and that each DBWR has an equal part in keeping them clean. So what’s the point? Well, 8 DBWR processes accumulating 184 CPU seconds (23 * 8) during a 7,200 CPU-second (8 * 900) OLTP run means this DBWR workload could have been serviced by a single CPU. In fact, 184 CPU seconds is only 20% of a single CPU core during a 900-second run, yet by configuring 8 DBWR processes we spread that workload over multiple CPUs. I know, you’re thinking this must be a pretty lightly loaded system, right? No, the processors hovered around 50% utilization with peaks of 70%. The datafile I/O rate was some 6,980 IOPS, of which 4,548 per second were writes:

Load Profile                            Per Second       Per Transaction
~~~~~~~~~~~~                       ---------------       ---------------
                  Redo size:          3,222,269.99              4,570.66
              Logical reads:             66,770.61                 94.71
              Block changes:             18,097.81                 25.67
             Physical reads:              2,342.71                  3.32
            Physical writes:              4,548.36                  6.45
                 User calls:             11,795.67                 16.73
                     Parses:              3,597.63                  5.10
                Hard parses:                  0.27                  0.00
                      Sorts:                  1.00                  0.00
                     Logons:                  0.00                  0.00
                   Executes:             10,121.28                 14.36
               Transactions:                704.99
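
For the curious, the CPU arithmetic above boils down to this trivial back-of-the-envelope check; the constants are simply the numbers from this 8-writer run plugged in.

/* Back-of-the-envelope check of the DBWR CPU arithmetic used in this post. */
#include <stdio.h>

int main(void)
{
    double run_secs     = 900.0;  /* benchmark run length */
    double cpu_per_dbwr = 23.0;   /* CPU seconds per DBWR process, from ps(1) */
    int    num_dbwr     = 8;
    int    num_cores    = 8;

    double total = cpu_per_dbwr * num_dbwr;   /* 184 CPU seconds */

    printf("aggregate DBWR CPU : %.0f seconds\n", total);
    printf("share of one core  : %.0f%%\n", 100.0 * total / run_secs);
    printf("share of the box   : %.1f%%\n",
           100.0 * total / (run_secs * num_cores));
    return 0;
}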

Moderate OLTP Load, db_writer_processes=1
Let’s take a look at the same workload after a shutdown, database restore and reboot of the instance:

$ grep ora_db noh*
oracle   15652     1  0 14:23 ?        00:00:00 ora_dbw0_bench1
oracle   15652     1 17 14:23 ?        00:02:55 ora_dbw0_bench1

What’s that? A single DBWR process was able to service the same workload with only 175 CPU seconds? That’s the same work done with 5% fewer processor cycles. But what did it do to throughput? Well, this is the medium model of the benchmark, where the highest processor utilization peaks were roughly 70%. Reducing CPU utilization on a system that has ample CPU cycles to spare doesn’t get you much. In fact, throughput only improved 2.4%, from 46.3 Transactions Per Minute (TPM) to 47.4 TPM. Some would say that is within the margin of statistical error, but I bet those same people don’t have a setup where a database restore followed by an instance reboot produces benchmark results repeatable to within .5%, like this setup does. Let’s just say it’s calibrated. Have you really read this far just to hear about a paltry 2.4% difference in OLTP throughput, though? No, that’s not all. The impact of too many DBWR processes is scheduling overhead and poor processor cache efficiency, so we need to look at a run with higher CPU utilization.

Heavy OLTP Load, db_writer_processes=8
The following ps(1) data shows us that with the increased OLTP load, the 8 DBWR processes each consumed 32 CPU seconds, for an aggregate of 256 CPU seconds. That’s still much less than the cycles a single CPU had to offer; 28% of a single CPU, in fact.

$ grep ora_db noh*
oracle   20763     1  0 15:59 ?        00:00:00 ora_dbw0_bench1
oracle   20765     1  0 15:59 ?        00:00:00 ora_dbw1_bench1
oracle   20767     1  0 15:59 ?        00:00:00 ora_dbw2_bench1
oracle   20769     1  0 15:59 ?        00:00:00 ora_dbw3_bench1
oracle   20771     1  0 15:59 ?        00:00:00 ora_dbw4_bench1
oracle   20773     1  0 15:59 ?        00:00:00 ora_dbw5_bench1
oracle   20775     1  0 15:59 ?        00:00:00 ora_dbw6_bench1
oracle   20777     1  0 15:59 ?        00:00:00 ora_dbw7_bench1
oracle   20763     1  3 15:59 ?        00:00:32 ora_dbw0_bench1
oracle   20765     1  3 15:59 ?        00:00:32 ora_dbw1_bench1
oracle   20767     1  3 15:59 ?        00:00:32 ora_dbw2_bench1
oracle   20769     1  3 15:59 ?        00:00:32 ora_dbw3_bench1
oracle   20771     1  3 15:59 ?        00:00:32 ora_dbw4_bench1
oracle   20773     1  3 15:59 ?        00:00:32 ora_dbw5_bench1
oracle   20775     1  3 15:59 ?        00:00:32 ora_dbw6_bench1
oracle   20777     1  3 15:59 ?        00:00:32 ora_dbw7_bench1

So here again the DBWR workload would “fit” within a single CPU. With this configuration, the OLTP throughput was 57.9 TPM. How does that compare to the same workload running against an instance with a single DBWR process?

Heavy OLTP Load, db_writer_processes=1
The following ps(1) output shows that a single DBWR process was able to service the same OLTP load with 291 CPU seconds, which is still much less (32%) than the bandwidth of a single CPU (there are 900 CPU seconds per core during the test).

$ grep ora_db noh*
oracle   21426     1  0 16:26 ?        00:00:00 ora_dbw0_bench1
oracle   21426     1 29 16:26 ?        00:04:51 ora_dbw0_bench1

But what about performance? When reducing the DBWR count to 1, the OLTP throughput increased 8%, from 57.9 to 62.6 TPM. What was the I/O load like, though? Here is a snippet from Statspack:

Load Profile                            Per Second       Per Transaction
~~~~~~~~~~~~                       ---------------       ---------------
                  Redo size:          5,918,344.96              4,355.71
              Logical reads:            125,684.76                 92.50
              Block changes:             33,622.21                 24.74
             Physical reads:              2,066.64                  1.52
            Physical writes:              8,483.06                  6.24
                 User calls:             21,996.51                 16.19
                     Parses:              6,737.41                  4.96
                Hard parses:                  0.27                  0.00
                      Sorts:                  1.03                  0.00
                     Logons:                  0.00                  0.00
                   Executes:             18,852.22                 13.87
               Transactions:              1,358.76

We see that with a single DBWR, the OLTP workload pushed through some 10,549 IOPS of which 8,483 per second were writes. That means a single DBWR on fairly outdated AMD Opteron processors (DL585) can service 8,483 writes per second with only 32% of a single CPU. So why configure more than that?
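
Put differently, here is a rough service-rate calculation, under the assumption that the write rate held for the full 900-second run; the inputs come from the Load Profile and ps(1) output above.

/* Rough service-rate calculation for the single-DBWR heavy run. */
#include <stdio.h>

int main(void)
{
    double run_secs       = 900.0;    /* benchmark run length */
    double writes_per_sec = 8483.06;  /* physical writes/s from the Load Profile */
    double dbwr_cpu_secs  = 291.0;    /* DBWR CPU time from ps(1) */

    double total_writes = writes_per_sec * run_secs;

    printf("total physical writes      : %.0f\n", total_writes);
    printf("writes per DBWR CPU second : %.0f\n", total_writes / dbwr_cpu_secs);
    return 0;
}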

Where Did The Time Go?
Were there any significant waits in the 8 DBWR case? Yes, there were. Below are the Top 5 timed events from both runs; let’s see if anyone can guess which one is which:

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
CPU time                                                     3,148          35.8
log file sync                                1,207,863       2,905      2   33.1
log file switch (checkpoint incomplete)          4,627         828    179    9.4
db file scattered read                         172,724         670      4    7.6
log file parallel write                        532,373         464      1    5.3
          -------------------------------------------------------------

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
CPU time                                                     2,858          31.0
log file sync                                1,076,670       2,509      2   27.2
log file switch (checkpoint incomplete)          4,808       1,939    403   21.0
db file scattered read                         173,566         664      4    7.2
log file parallel write                        491,346         419      1    4.5
          -------------------------------------------------------------

Summary
Did he really make this long of a blog entry to discuss some 8% OLTP throughput increase? Well, yes and no. The effect of processor-cache thrashing will be more severe on larger SMP systems; this test system only has 4 sockets (8 cores). I’ve seen crazy DBWR configurations on 32- and 64-CPU systems account for more than 8%. Even if 8% were a constant, who wants to lose as much as 3 to 5 CPUs’ worth of bandwidth in today’s IT environment?

13 Responses to “Over-Configuring DBWR Processes – Part II.”


  1. Jason Bucata August 17, 2007 at 6:15 am

    From what’s written here, it looks like you only tested 1 and 8. Is it possible that the global maximum is actually somewhere between? Maybe 2 processes is slightly faster than 1, for example? It would be more interesting to see the numbers for all those different values.

    I’m also curious whether trying to pin the different processes to the same core, or to different cores on a socket, or to different sockets, has any effect on the numbers–it would go along nicely with Part I.

    Of course I’m not the one actually running the tests, so it’s really easy for me to crank out test case opportunities… 😉

  2. Chris August 17, 2007 at 8:50 am

    Wow Kevin, Nice test case.

    Can you explain the log file sync waits in the top 5 timed events at the end? And which runs are the top 5 waits from?

  3. Richard Burakowski August 17, 2007 at 6:53 pm

    If I were to hazard a guess, the top 5 lists appear to be 1 dbw followed by 8 dbw.

    Why? The largest discrepancy is in “log file switch (checkpoint incomplete)” total wait time for a similar number of waits. It could take longer to wait on 8 processes to run as opposed to 1 process on a CPU bound system.

    Curious if you have anything else to add about the effect of the number of dbw on the other differences, given the same workload.

    Thanks for your posts. Always a good read.

  4. Chris August 17, 2007 at 9:42 pm

    >>Can you explain the log file sync waits in the top 5 timed events at the end. And which runs are the top 5 waits from?

    I actually meant to type log file switch (checkpoint incomplete)

  5. vlad August 17, 2007 at 11:40 pm

    Something else to factor in: the buffer cache size. With a huge buffer cache (dozens of GB), one dbwr may get less benefit from the CPU cache due to the higher memory consumption of all the cache buffers chains structures, while several dbwrs would each have a smaller CPU cache footprint. Of course they can be bounced from one CPU to another, but a single dbwr can also use different CPUs (OK, CPU affinity, but CPU affinity should also work for multiple dbwrs). I guess one dbwr is less likely to be ousted from a CPU if other CPUs are available, so it can benefit from the CPU cache longer.
    Another aspect is NUMA, where I think the buffer cache is split across various physical memory locations, and one dbwr for each of these is (supposed to be) beneficial.

  6. kevinclosson August 18, 2007 at 12:06 am

    Vlad,

    Thanks for stopping by. I beg to differ with your point about chains, though. All DBWR processes will hammer the chains equally. LRUs, on the other hand, are divvied up among the various DBWR processes. In layman’s terms, the chains are complete chaos as far as cache goes, because all DBWR processes **AND** all foreground processes manipulate these structures quite equally. On the other hand, each foreground tends to settle on a “favorite” LRU unless things go haywire.

    Your point about NUMA is well taken. Trust me, I know what multiple segmented DBWR does for NUMA optimization. In fact, the NUMA ports *should* be partitioning the LRUs to foregrounds too. That is, any given foreground should only work a set of LRUs that it and the other peers on that NUMA resource domain are using. This eliminates LRU latch contention between NUMA resource domains (RAD, QUAD, whatever the relevant term is for the port). Been there, done that…

  7. qihua August 18, 2009 at 7:19 am

    We have a database with heavy physical writes. The redo log size is 2G. Based on a 15-minute statspack report, most of the physical writes are due to checkpoints, so I increased the redo log size from 2G to 4G hoping to reduce physical writes. But it did not help after switching to the 4G redo logs. Is “physical writes non checkpoint” misleading? Or is something other than redo log switches triggering the checkpoints?
    Instance Activity Stats  DB/Inst: FND/FND  Snaps: 7568-7569

    Statistic                            Total   per Second   per Trans
    -------------------------------- ---------- ------------ -----------
    physical writes                   3,246,476      3,603.2       142.0
    physical writes non checkpoint      960,700      1,066.3        42.0

    Thanks,
    Qihua

  8. maclean November 3, 2010 at 1:19 pm

    Hey, once I set db_writer_processes to more than 1, I got poor IO performance!

  9. heliodias April 1, 2014 at 6:01 am

    Reblogged this on Oracle Logbook.


  1. Over-Configuring DBWR Processes - Part III « Kevin Closson’s Oracle Blog: Platform, Storage & Clustering Topics Related to Oracle Databases Trackback on August 17, 2007 at 11:49 pm
  2. Over-Configuring DBWR Processes - Part IV « Kevin Closson’s Oracle Blog: Platform, Storage & Clustering Topics Related to Oracle Databases Trackback on August 21, 2007 at 12:43 am
  3. DBWR and the Devil « die Seilerwerks Trackback on December 1, 2007 at 3:13 pm
  4. Notes about Various Oracle Parameters « Charles Hooper's Oracle Notes Trackback on January 25, 2010 at 6:08 am
