In my recent blog entry about over-configuring DBWR processes, I mentioned the processor cache effect of having too many DBWR processes doing work that a single DBWR process could handle. I’d like to provide more detail on the matter.
Backgrounder
Few people really understand what DBWR does for a living. Yes, everyone knows DBWR is responsible for flushing modified buffers from the SGA, but that is really high level. DBWR:
- 1. Builds “write batches” by examining the state of buffer headers on the LRU lists. It scans from the LRU end, taking buffer age into account. At instance boot time, each DBWR process is assigned a set of LRUs to tend to.
- 2. Manipulates cache buffers chains elements for such reasons as marking the buffer busy when it is in flight for an I/O and marking it as written after the flush operation.
- 3. Performs I/O using the OSDs that are available to it. OSDs are routines in the Oracle kernel that are Operating System (port) specific. These are the routines through which Oracle gets support for asynchronous and/or direct I/O via calls to various libraries (e.g., libc, libaio, libodm).
- 4. Posts waiters, such as foreground processes waiting on the free buffer waits event (a quick way to check for that wait appears just after this list).
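Since the whole question here is whether foregrounds are actually starving behind DBWR, a quick way to check is to look at the free buffer waits event and the current DBWR setting. This is only a sketch, assuming SQL*Plus access as SYSDBA on the instance; note that time_waited in v$system_event is reported in centiseconds:

$ sqlplus -S / as sysdba <<'EOF'
show parameter db_writer_processes
-- are foregrounds actually piling up behind DBWR? (time_waited is in centiseconds)
select event, total_waits, time_waited
from v$system_event
where event = 'free buffer waits';
EOF

If that event barely registers, there is no free buffer starvation for additional DBWR processes to cure.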
The problem with over-configuring DBWR processes is two-fold:
- 1. Processor Cache Thrashing. DBWR processes have huge working sets since they deal with so many LRU and cache buffers chains structures. Having more of these processes than necessary migrating from CPU to CPU puts a huge dent in the L2 caches with both page table and data footprint (one way to watch that migration is sketched just after this list).
- 2. Scheduling anomalies. If you have more than 1 DBWR process, you run an increased risk of having DBWR processes that can’t get on CPU. Consider a checkpoint. All LRUs will have modified buffers that need to be flushed during a checkpoint. Any given LRU can only be cleaned by the DBWR process it was assigned to at boot time. That means a checkpoint requires action from all DBWR processes. It’s easier for 1 process to get scheduled on a CPU on, say, an 8-CPU system than it is for 8 processes.
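One crude way to watch the migration from point 1 is to sample which processor each DBWR process last ran on during a run. This is a sketch; the psr column is Linux ps(1) behavior, and the ora_dbw pattern matches the bench1-style process names shown later in this post:

$ # sample every 5 seconds and watch the PSR (last-ran-on CPU) column move around
$ while true; do ps -eo pid,psr,comm | grep '[o]ra_dbw'; echo '---'; sleep 5; done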
Point 2 above is the part that I suspect most folks aren’t aware of. I think folks envision the various multiple DBWR processes only getting busy if their peers aren’t keeping up. That is not the case. LRUs get evenly loaded by foreground processes because when a process needs a buffer (for file I/O, sort buffer, block cloning, etc.) it first tries the LRU latch for the last LRU it used (which improves cache warmth), and if it can’t get that latch with a no-wait get, it jumps to a different latch and does another no-wait get. This continues until it either obtains an LRU latch on a no-wait get or loops back around to the first attempted latch and does a wait-get. I recall that in older releases of Oracle the next LRU to try was simply the very next LRU. That could have changed to something like a random selection, but I don’t know (I haven’t seen the code in years, but that’s about to change). No matter, the LRUs are evenly loaded. Since LRUs are evenly loaded, DBWR processes are equally tasked. I’m not very smart, but this I know:
On SMP systems it is not wise to break up a task that a single CPU can do and distribute the work evenly amongst multiple CPUs. You introduce processor scheduling overhead and you hammer the processor caches.
How true, and even more so when the processes share resources as DBWR processes do; they “share” cache buffers chains and other goodies. Anyway, it might be smart to do that sort of task decomposition on real-time or other specialized systems, but not on Oracle systems. The sketch below makes the no-wait/wait-get probing concrete; after that, let’s take a closer look at the numbers.
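Here is a rough shell analogy of that probing pattern. To be clear, this is not Oracle code: flock(1) file locks stand in for LRU latches, the latch count and file names are made up, and it assumes bash 4.1 or later for the {fd} redirection. The point is the control flow: a no-wait attempt on the last-used latch, round-robin no-wait attempts on the rest, and only then a blocking wait-get on the first one tried.

#!/bin/bash
# Analogy only: flock(1) file locks stand in for LRU latches.
NLATCH=8
last_used=3                        # the "latch" this process used last time

for ((i = 0; i < NLATCH; i++)); do
    idx=$(( (last_used + i) % NLATCH ))
    exec {fd}>"/tmp/lru_latch.$idx"
    if flock -n "$fd"; then        # no-wait get
        echo "got latch $idx without waiting"
        # ... work the LRU protected by this latch ...
        flock -u "$fd"
        exit 0
    fi
    exec {fd}>&-                   # missed it; try the next one
done

# every no-wait get failed, so fall back to a wait-get on the first latch tried
exec {fd}>"/tmp/lru_latch.$last_used"
flock "$fd"
echo "got latch $last_used after waiting"
flock -u "$fd"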
Moderate OLTP Load, db_writer_processes=8
I just did an OLTP run on my DL585 8-core system with db_writer_processes set to 8. After the instance is booted, but before the Pro*C OLTP clients are triggered, I did a ps(1) command grepping for all the DBWR processes. As soon as the 900 second run was complete, the driver script does the same ps(1) command. All output goes to nohup.out so I can grep for DBWR processes and see how much CPU time they consumed during the run:
$ grep ora_dbw nohup.out
oracle 14507 1 2 13:46 ? 00:00:01 ora_dbw0_bench1
oracle 14509 1 2 13:46 ? 00:00:01 ora_dbw1_bench1
oracle 14511 1 2 13:46 ? 00:00:02 ora_dbw2_bench1
oracle 14513 1 1 13:46 ? 00:00:01 ora_dbw3_bench1
oracle 14515 1 1 13:46 ? 00:00:01 ora_dbw4_bench1
oracle 14517 1 2 13:46 ? 00:00:01 ora_dbw5_bench1
oracle 14519 1 1 13:46 ? 00:00:01 ora_dbw6_bench1
oracle 14521 1 2 13:46 ? 00:00:01 ora_dbw7_bench1
oracle 14507 1 2 13:46 ? 00:00:23 ora_dbw0_bench1
oracle 14509 1 2 13:46 ? 00:00:23 ora_dbw1_bench1
oracle 14511 1 2 13:46 ? 00:00:23 ora_dbw2_bench1
oracle 14513 1 2 13:46 ? 00:00:23 ora_dbw3_bench1
oracle 14515 1 2 13:46 ? 00:00:23 ora_dbw4_bench1
oracle 14517 1 2 13:46 ? 00:00:23 ora_dbw5_bench1
oracle 14519 1 2 13:46 ? 00:00:23 ora_dbw6_bench1
oracle 14521 1 2 13:46 ? 00:00:23 ora_dbw7_bench1
The ps(1) output shows us that each of the 8 DBWR processes accumulated 23 seconds of CPU. This clearly reflects the fact that all LRUs are loaded evenly and that each DBWR has an equal part in keeping them clean. So what’s the point? Well, 8 DBWR processes accumulating 184 CPU seconds (23 * 8) during a 7,200 CPU-second (8 * 900) OLTP run means this DBWR workload could have been serviced by a single CPU. In fact, 184 CPU seconds is only about 20% of a single CPU core during a 900 second run, yet by configuring 8 DBWR processes we spread that workload over multiple CPUs. I know, you’re thinking this must be a pretty lightly loaded system, right? No, the processors hover around 50% utilization with peaks of 70%. The datafile I/O rate was some 6,891 IOPS (physical reads plus physical writes from the load profile below), of which 4,548 per second were writes:
Load Profile Per Second Per Transaction
~~~~~~~~~~~~ --------------- ---------------
Redo size: 3,222,269.99 4,570.66
Logical reads: 66,770.61 94.71
Block changes: 18,097.81 25.67
Physical reads: 2,342.71 3.32
Physical writes: 4,548.36 6.45
User calls: 11,795.67 16.73
Parses: 3,597.63 5.10
Hard parses: 0.27 0.00
Sorts: 1.00 0.00
Logons: 0.00 0.00
Executes: 10,121.28 14.36
Transactions: 704.99
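As a sanity check on that arithmetic, the DBWR CPU time can be totaled straight from the second set of ps(1) lines above. This is a sketch: it assumes the standard ps -ef column layout (TIME is the seventh field), treats the last 8 lines of the grep as the end-of-run sample, and ignores the second or so each DBWR had already accumulated before the clients started.

$ grep ora_dbw nohup.out | tail -8 | \
    awk '{ split($7, t, ":"); tot += t[1]*3600 + t[2]*60 + t[3] }
         END { printf "DBWR CPU: %d sec = %.0f%% of one core over 900 sec\n", tot, 100*tot/900 }'

For the run above that works out to 184 seconds, or about 20% of one core.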
Moderate OLTP Load, db_writer_processes=1
Let’s take a look at the same workload after a shutdown, database restore and reboot of the instance:
$ grep ora_db noh*
oracle 15652 1 0 14:23 ? 00:00:00 ora_dbw0_bench1
oracle 15652 1 17 14:23 ? 00:02:55 ora_dbw0_bench1
What’s that? A single DBWR was able to service the same workload with only 175 CPU seconds? That’s the same work with 5% fewer processor cycles. But what did it do to throughput? Well, this is the medium model of the benchmark, where the highest processor utilization peaks were roughly 70%. Reducing CPU consumption on a system that has ample CPU cycles to spare doesn’t buy much. In fact, throughput only improved 2.4%, from 46.3 Transactions Per Minute (TPM) to 47.4 TPM. Some would say that is within the margin of statistical error, but I bet those same people don’t have a setup where a database restore followed by an instance reboot will produce benchmark results within 0.5%, like this setup does. Let’s just say it’s calibrated. Have you really read this far to hear about a paltry 2.4% difference in OLTP throughput though? No, that’s not all. The impact of too many DBWR processes is scheduling overhead and poor processor cache efficiency, so we need to look at a run with higher CPU utilization.
Heavy OLTP Load, db_writer_processes=8
The following ps(1) data shows us that with the increased OLTP load the 8 DBWR processes each consumed 32 CPU seconds, for an aggregate of 256 CPU seconds. That’s still much less than the cycles a single CPU had to offer; 28% of a single CPU, in fact.
$ grep ora_db noh*
oracle 20763 1 0 15:59 ? 00:00:00 ora_dbw0_bench1
oracle 20765 1 0 15:59 ? 00:00:00 ora_dbw1_bench1
oracle 20767 1 0 15:59 ? 00:00:00 ora_dbw2_bench1
oracle 20769 1 0 15:59 ? 00:00:00 ora_dbw3_bench1
oracle 20771 1 0 15:59 ? 00:00:00 ora_dbw4_bench1
oracle 20773 1 0 15:59 ? 00:00:00 ora_dbw5_bench1
oracle 20775 1 0 15:59 ? 00:00:00 ora_dbw6_bench1
oracle 20777 1 0 15:59 ? 00:00:00 ora_dbw7_bench1
oracle 20763 1 3 15:59 ? 00:00:32 ora_dbw0_bench1
oracle 20765 1 3 15:59 ? 00:00:32 ora_dbw1_bench1
oracle 20767 1 3 15:59 ? 00:00:32 ora_dbw2_bench1
oracle 20769 1 3 15:59 ? 00:00:32 ora_dbw3_bench1
oracle 20771 1 3 15:59 ? 00:00:32 ora_dbw4_bench1
oracle 20773 1 3 15:59 ? 00:00:32 ora_dbw5_bench1
oracle 20775 1 3 15:59 ? 00:00:32 ora_dbw6_bench1
oracle 20777 1 3 15:59 ? 00:00:32 ora_dbw7_bench1
So here again the DBWR workload would “fit” within a single CPU. With this configuration, the OLTP throughput was 57.9 TPM. How does that compare to the same workload running against an instance with a single DBWR process?
Heavy OLTP Load, db_writer_processes=1
The following ps(1) output shows that a single DBWR process was able to service the same OLTP load with 291 CPU seconds, which is still much less (32%) than the bandwidth of a single CPU (there are 900 CPU seconds per core during the test).
$ grep ora_db noh*
oracle 21426 1 0 16:26 ? 00:00:00 ora_dbw0_bench1
oracle 21426 1 29 16:26 ? 00:04:51 ora_dbw0_bench1
But what about performance? With the DBWR count reduced to 1, OLTP throughput increased 8%, from 57.9 to 62.6 TPM. What was the I/O load like though? Here is a snippet from statspack:
Load Profile Per Second Per Transaction
~~~~~~~~~~~~ --------------- ---------------
Redo size: 5,918,344.96 4,355.71
Logical reads: 125,684.76 92.50
Block changes: 33,622.21 24.74
Physical reads: 2,066.64 1.52
Physical writes: 8,483.06 6.24
User calls: 21,996.51 16.19
Parses: 6,737.41 4.96
Hard parses: 0.27 0.00
Sorts: 1.03 0.00
Logons: 0.00 0.00
Executes: 18,852.22 13.87
Transactions: 1,358.76
We see that with a single DBWR, the OLTP workload pushed through some 10,549 IOPS, of which 8,483 per second were writes. That means a single DBWR on fairly outdated AMD Opteron processors (DL585) can service 8,483 writes per second with only 32% of a single CPU. So why configure more than one?
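To put a finer point on it: 291 CPU seconds spread over 8,483 writes per second for 900 seconds is roughly 38 microseconds of DBWR CPU per physical write, and that includes all of DBWR’s LRU scanning and write-batch assembly. A quick back-of-the-envelope with bc(1):

$ echo 'scale=1; 291 * 1000000 / (8483 * 900)' | bc
38.1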
Where Did The Time Go?
Were there any significant waits in the 8 DBWR case? Yes, there were. The two Top 5 lists below come from the 8 DBWR and 1 DBWR runs; let’s see if anyone can guess which is which:
Top 5 Timed Events Avg %Total
~~~~~~~~~~~~~~~~~~ wait Call
Event Waits Time (s) (ms) Time
----------------------------------------- ------------ ----------- ------ ------
CPU time 3,148 35.8
log file sync 1,207,863 2,905 2 33.1
log file switch (checkpoint incomplete) 4,627 828 179 9.4
db file scattered read 172,724 670 4 7.6
log file parallel write 532,373 464 1 5.3
-------------------------------------------------------------
Top 5 Timed Events Avg %Total
~~~~~~~~~~~~~~~~~~ wait Call
Event Waits Time (s) (ms) Time
----------------------------------------- ------------ ----------- ------ ------
CPU time 2,858 31.0
log file sync 1,076,670 2,509 2 27.2
log file switch (checkpoint incomplete) 4,808 1,939 403 21.0
db file scattered read 173,566 664 4 7.2
log file parallel write 491,346 419 1 4.5
-------------------------------------------------------------
Summary
Did he really make this long of a blog entry to discuss some 8% OLTP throughput increase? Well, yes and no. The effect of processor-cache thrashing will be more severe on larger SMP systems; this test system only has 4 sockets (8 cores). I’ve seen crazy DBWR configurations on 32 and 64 CPU systems account for more than 8%. Even if 8% were a constant, who wants to lose as much as 3 or 5 CPUs’ worth of bandwidth (8% of 32 or 64 CPUs) in today’s IT environment?