Manly Men Only Use Solid State Disk For Redo Logging. LGWR I/O is Simple, But Not LGWR Processing

Let’s look at some statspack data from an OLTP Order Entry benchmark run that achieved 133 transactions per second (TPS):

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
log file sync                                   68,965       1,117     16   64.6
db file sequential read                        159,712         241      2   14.0
CPU time                                                       163           9.4
log file switch (checkpoint incomplete)            164         146    890    8.4
db file parallel write                          19,651          31      2    1.8

And now let’s look at the stats from run number 2 of the same benchmark, on the same system, same disk setup, etc.:

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
db file sequential read                        838,810       1,635      2   69.1
CPU time                                                       253          10.7
db file parallel write                         153,334         145      1    6.1
log file sync                                   93,100         143      2    6.0
log file parallel write                         93,417          73      1    3.1

Benchmark run number 2 achieved 200 TPS. That’s right, statspack 1 and statspack 2 above are from the same system. The database files are stored on an HP EFS Clustered Gateway NAS device and accessed via NFS, and the redo logs are, get this, stored on a Texas Memory Systems Solid State Disk. Statspack 2 above came from a very busy system (a lot of cached SGA work) doing over 3,000 IOPS:

Load Profile                            Per Second       Per Transaction
~~~~~~~~~~~~                       ---------------       ---------------
                  Redo size:          1,138,491.64              7,166.65
              Logical reads:             14,640.16                 92.16
              Block changes:              5,538.83                 34.87
             Physical reads:              1,428.73                  8.99
            Physical writes:              1,742.27                 10.97
                 User calls:              2,352.99                 14.81

So if a) the system is the same, and b) LGWR is writing redo to the fastest disk possible, what is the story? What was different about run 1 that made log file sync waits jump 8-fold, with the associated 34% drop in throughput? More users? No. Nothing that simple. I’ll explain.

LGWR I/O? Who Cares. Simple Stuff
I see myriad information on blogs, forums and all those expert repositories about the I/O aspects of redo logging, but very little about LGWR processing. Yes, a portion of LGWR’s job is to flush the redo buffer to the online redo log, but there’s more to LGWR’s job than writing redo. In fact, I never could understand the fascination with LGWR’s I/O since it is about as simple as it gets: sequential writes to a single file, of sizes ranging from 512 bytes to a port-defined maximum (usually 128KB, but often 1MB). Yes, it gets a little more complex with multiplexed redo, but that is just concurrent operations of the exact same profile. The part that seems to be hard for folks to grasp is where asynchronous I/O comes into play.
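
To make the point about how plain that I/O profile is, here is a minimal C sketch of the shape of a single, synchronous redo-style write: one file descriptor, a 512-byte-aligned buffer, and an offset that only marches forward. The file name and sizes are hypothetical and this is only an illustration of the profile described above, not Oracle code:

/* redo_write_shape.c -- illustrative only: one sequential write of
 * between 512 bytes and the port-defined maximum. File name, size and
 * offset are made up for the example. */
#include <sys/types.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define WRITE_SZ (128 * 1024)            /* assume a 128KB port maximum */

int main(void)
{
    void  *buf;
    off_t  next_off = 0;                 /* redo writes march forward   */
    int    fd = open("redo01.log", O_WRONLY);

    posix_memalign(&buf, 512, WRITE_SZ); /* 512-byte-aligned buffer     */
    memset(buf, 0, WRITE_SZ);

    /* one flush == one sequential write at the current tail offset */
    pwrite(fd, buf, WRITE_SZ, next_off);
    next_off += WRITE_SZ;

    close(fd);
    return 0;
}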

LGWR Async I/O
If LGWR is clearing more than the port-defined maximum from the buffer, asynchronous I/O is important. Let’s consider an example. Let’s say LGWR has been posted to clear 1.75MB of redo from the buffer on a platform that establishes the LGWR I/O maximum to be 1MB. In this case, LGWR will issue a 1MB asynchronous write immediately followed by a 750KB asynchronous write. These writes are sequential. After the second I/O is in flight, LGWR starts to poll for the completions of the I/Os. So the question often becomes: what if LGWR is only flushing, say, 600KB instead of 1.75MB, isn’t asynchronous I/O crucial in that case as well? The answer is no, unless you are using multiplexed redo. If LGWR is flushing an amount less than the port-defined maximum, asynchronous I/O offers nothing for that specific flushing operation. See, LGWR is a lot different from DBWR. When DBWR has I/O in flight, it also has other stuff it needs to do, like building the next write batch and so forth. LGWR, on the other hand, has nothing to do while its I/Os are in flight. It polls for the completions and, once the flushing operation is complete, it takes action to post the foregrounds that are waiting in a log file sync wait event. So asynchronous I/O plays no role in an LGWR buffer flush of an amount smaller than the port-defined maximum I/O size.
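
Here is a minimal Linux libaio sketch of that 1.75MB case, assuming a 1MB port maximum: both writes are submitted back to back and only then does the process block to reap the completions. The file name, buffer handling and lack of error checking are assumptions of the sketch (link with -laio); it illustrates the pattern described above, not Oracle’s implementation:

/* lgwr_async_sketch.c -- illustrative only: two asynchronous, sequential
 * writes (1MB then 750KB) followed by a blocking reap of both completions. */
#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    io_context_t     ctx = 0;
    struct iocb      cb1, cb2;
    struct iocb     *cbs[2] = { &cb1, &cb2 };
    struct io_event  events[2];
    void            *b1, *b2;
    int              fd = open("redo01.log", O_WRONLY | O_DIRECT);

    posix_memalign(&b1, 4096, 1024 * 1024);
    posix_memalign(&b2, 4096, 750 * 1024);

    io_setup(4, &ctx);                    /* small queue: few I/Os in flight */

    /* both writes are in flight before the process waits on anything */
    io_prep_pwrite(&cb1, fd, b1, 1024 * 1024, 0);
    io_prep_pwrite(&cb2, fd, b2, 750 * 1024, 1024 * 1024);
    io_submit(ctx, 2, cbs);

    /* "poll for completions": nothing else to do until both are done */
    io_getevents(ctx, 2, 2, events, NULL);

    io_destroy(ctx);
    close(fd);
    return 0;
}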

Oh, I forgot to mention, LGWR will keep up to 4 I/Os in flight (I need to check if that is more in late releases) when clearing more than the maximum allowed in a single physical write. All in all, LGWR I/O is very simple. So why do session waits, such as log file sync, generate so much folklore? I don’t know, but I know the folklore isn’t helpful.

Throw A Solid State Disk At It
I love that one. Let’s see, writing to solid state disk is faster than round, brown, spinning thingies, so if any aspect of LGWR’s duties seems out of sorts, let’s get some really, really fast disk. Don’t get me wrong, I’m a big fan of solid state disk, but throwing it at the wrong problem is not the right thing to do.

The most complex topic regarding LGWR is the log file sync wait event. Why? Because it has very little to do with I/O, that’s why. Sure, I/O is a part of it, but there is more to it.

Backgrounder
LGWR spends its time in a work loop consisting of very little, really. Well, in contrast to DBWR, that is. Remember, DBWR has to manipulate both cache buffers chains and LRUs, flush modified SGA buffers, and post processes. While DBWR works on a large amount of metadata and buffers, LGWR does not. A glance at ps(1) output makes that clear. Notice in the following ps(1) output that LGWR manages to perform its tasks with a resident set size one thirtieth the size of DBWR’s, and of course that ratio would lean much further toward DBWR on a large SGA system, since DBWR has to map all the buffers in the SGA over the life of the instance (the exception to this is the indirect data buffers feature). LGWR, on the other hand, has a very small working set because it only tends to the redo buffer and only exercises a few latches (redo allocation, redo copy, etc.).

$ ps -eF | egrep 'UID|lgwr|dbw' | grep -v grep
UID        PID  PPID  C    SZ  RSS PSR STIME TTY          TIME CMD
oracle   19062     1  1 377082 851808 0 16:11 ?       00:00:08 ora_dbw0_bench1
oracle   19064     1  2 379943 29816 0 16:11 ?        00:00:09 ora_lgwr_bench1

Log File Sync
When a foreground process commits changes to the database, it allocates space in the redo buffer and copies the redo information into the buffer, protected either by the redo allocation latch itself or a redo copy latch. The process then enqueues itself for service from LGWR and posts LGWR (most ports use IPC semaphores for posting). The process then goes to sleep waiting for the return post from LGWR indicating its redo is flushed to disk. So, we now have a process waiting in a log file sync (LFS) wait event. Let’s call this process waiter1. How long waiter1 waits depends on a lot of factors.
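
For readers who have not seen the post-wait mechanism, this is the shape of the semop() handshake just described, reduced to a toy: the foreground decrements its semaphore and sleeps, and LGWR increments it to wake the waiter. The key, semaphore numbering and missing error handling are invented for illustration; Oracle’s real post-wait layer is considerably more involved.

/* postwait_sketch.c -- illustrative IPC-semaphore post/wait handshake. */
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* foreground side: "wake me when my redo is on disk" */
static void fg_wait(int semid, int mysem)
{
    struct sembuf wait_op = { mysem, -1, 0 };     /* decrement: block       */
    semop(semid, &wait_op, 1);                    /* sleep in LFS wait      */
}

/* LGWR side: post one waiter after its redo is flushed */
static void lgwr_post(int semid, int waiter_sem)
{
    struct sembuf post_op = { waiter_sem, 1, 0 }; /* increment: wake        */
    semop(semid, &post_op, 1);                    /* a trip into the kernel */
}

int main(void)
{
    int semid = semget(IPC_PRIVATE, 16, IPC_CREAT | 0600);

    /* in real life these two calls happen in different processes */
    lgwr_post(semid, 0);
    fg_wait(semid, 0);
    return 0;
}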

The first factor that will affect waiter1’s LFS wait is what LGWR was doing when waiter1 posted it. Remember, Oracle is a multi-user database on a multi-processor system. There is one LGWR and a lot of consumers of his service. On a busy system, odds are quite good that any time LGWR gets posted it is in the process of servicing a batch of prior waiters. That is, when waiter1 posted LGWR, the odds are good it was already busy flushing a bunch of redo for some other group of processes who are themselves waiting in LFS. This is where it gets dicey.

Before LGWR can flush waiter1’s redo, he has to finish flushing the prior batch and post all of those waiters. Let’s say there was 256KB of redo to flush for 2 waiters on a platform with the port-defined maximum LGWR I/O size set at 128KB. On this platform, LGWR will have to issue 2 asynchronous 128KB writes and poll for those to complete. Once the I/O is complete, LGWR will post each of the two waiters using the post-wait mechanism for that port of Oracle, which is an IPC semaphore semop() call on most platforms. So, LGWR has made system calls to write, system calls to poll for I/O completion, and finally 2 calls to semop(). LGWR is just a normal user-mode process, so all this dipping into the kernel introduces processing delays. Yes, the system calls I listed are non-blocking, but when a processor transitions from kernel to user mode there is an opportunity to switch to higher-priority runnable processes. In fact, the very act of LGWR posting these 2 processes can cause it to lose its CPU. Huh? Sure. What if instead of 2 there were 16 in the batch and there were only 4 CPUs on the system? We call that the thundering herd, and it wreaks havoc on LGWR. These process scheduling storms can be quite problematic. Think about it. You have one process (LGWR) that just made 4, 8, 16 or more processes runnable by semop()ing them out of their log file sync wait, which makes them runnable in kernel mode, and that trumps a runnable user-mode process. If that happens on a system with, say, 8 cores, those processes need to run somewhere. It is a scheduling storm. For this reason you should always set LGWR’s scheduling priority to the highest possible (with renice, for instance).
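
At the system-call level, “renice LGWR” boils down to something like the following sketch. The PID is passed in rather than looked up, root (or CAP_SYS_NICE) is assumed, and -20 is simply the strongest conventional nice value, not an Oracle-blessed setting:

/* raise_prio.c -- what `renice -20 -p <lgwr_pid>` does under the covers. */
#include <sys/types.h>
#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <lgwr_pid>\n", argv[0]);
        return 1;
    }
    pid_t lgwr_pid = (pid_t) atoi(argv[1]);

    /* -20 is the most favorable nice value the scheduler accepts */
    if (setpriority(PRIO_PROCESS, lgwr_pid, -20) != 0) {
        perror("setpriority");
        return 1;
    }
    printf("pid %d reniced to -20\n", (int) lgwr_pid);
    return 0;
}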

Without elevated scheduling priority for LGWR, it can often be the case that the processes LGWR wakes up will take the CPU from LGWR. Some platforms (Legacy Unix) support the notion of making processes non-preemptable as well. On those platforms you can make LGWR non-preemptable. Without preemption protection, LGWR can lose CPU before it has exhausted its fair time slice.
Once LGWR loses his CPU it may be quite some time until he gets it back. For instance, if LGWR is preempted in the middle of trying to perform a redo buffer flush, there may be several time slices of execution for other processes before LGWR gets back on CPU. Why am I putting you to sleep about this? Because those time slices that LGWR was off CPU will be charged to foreground processes waiting in a log file sync wait event. Yes, if a foreground is in log file sync wait and LGWR is piled under other processes and can’t get CPU for, say, a few time slices (10ms each), that foreground process’ log file sync will have 10, 20 or more milliseconds of extra wait tacked onto it before LGWR even gets a chance to do I/O. LGWR’s I/O is merely a component of a foreground process’ LFS wait.

So what happened to the benchmark run behind statspack 1 above? CPU starvation, specifically for LGWR! How did I do that? Easy. I have a simple program that uses the Linux sched_setaffinity() API call to assign hard processor affinity to processes. These tests were done on a 4-core x86 system, and in both cases all the background processes were started with hard CPU affinity to CPU 0. Once the database was started, I reassigned LGWR’s hard affinity to CPU 3. The Pro*C benchmark processes and their associated shadow processes all run with hard affinity, split equally between CPUs 1 and 2. This is the model for both benchmark runs. Yes, LGWR is all alone on CPU 3.
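
The affinity assignments described above come down to the Linux sched_setaffinity() call. A stripped-down version of the pinning program might look like this (the program name is made up and error paths are simplified):

/* pin_to_cpu.c -- hard CPU affinity, as used in the experiment above. */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <pid> <cpu>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t) atoi(argv[1]);
    int   cpu = atoi(argv[2]);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);                   /* exactly one CPU allowed */

    if (sched_setaffinity(pid, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}

So pinning LGWR would be something along the lines of ./pin_to_cpu $(pgrep -f ora_lgwr_bench1) 3, borrowing the instance name from the ps(1) output above.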

In the case where the benchmark only got 133 TPS, I have a “noise” process running on the same CPU as LGWR. This noise process is also pinned to the CPU with hard affinity. What does this “noise” process do that is so heinous as to affect Oracle’s OLTP throughput by 34%? After all, it is only a single process. Yes, single, but also running on the same CPU as LGWR. The noise program is:

#include <math.h>    /* sqrt() */
#include <poll.h>    /* poll() */
/* burn roughly 3ms of CPU in the sqrt() loop, then sleep 1ms, forever */
int main() { int i; double z; for (;;) { for (i = 0; i < 1000000; i++) { z = sqrt(i); } poll(0, 0, 1); } }

Simple enough. That loop of sqrt() calls from libm takes about 3ms on the CPUs in this benchmark system. So the way this works is that about every 3ms, this process goes to sleep for 1ms using the poll() system call. If this process is running when LGWR needs to run, LGWR will have to wait for 3ms. If that process isn’t executing, LGWR has no impediment to performing his tasks, LFS waits are a measly 2ms, and OLTP throughput jumps 34%.
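
If you want to sanity-check the ~3ms figure on your own hardware, a quick timing harness for the same loop might look like this (compile with -lm; gettimeofday granularity is plenty at the millisecond scale):

/* time_noise_loop.c -- measure the 1,000,000-iteration sqrt() loop. */
#include <math.h>
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval  t0, t1;
    volatile double z = 0;            /* volatile: keep the loop honest */
    int i;

    gettimeofday(&t0, NULL);
    for (i = 0; i < 1000000; i++)
        z = sqrt((double) i);
    gettimeofday(&t1, NULL);

    long us = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
    printf("loop: %ld us (z=%f)\n", us, z);
    return 0;
}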

This little parasite noise process gets in LGWR’s way a lot too. Basically, any time LGWR gets posted to perform a flush, this noise process is in the way (chalk up 3ms), any time LGWR blocks on an I/O (e.g., a flush that is smaller than the port maximum and therefore a single I/O) this noise process is in the way once the I/O is complete (chalk up 3ms), any time LGWR unluckily loses his CPU to a time slice switch this process will run and take 3ms from LGWR. It all adds up. And in this test case, processes posted by LGWR can never take his CPU (since I’ve set up hard affinity) so this is a nicer test case actually.

Don’t Write At All

Just to be evil, I’ll take it a step further. Did you know that if you set the initialization parameter _disable_logging=TRUE, LGWR does everything it normally does except perform the I/O for a redo flush? Don’t do this if you like your database, because anything other than a shutdown normal is going to leave your database unrecoverable. However, this is a great test, because if performance doesn’t increase when you set this parameter, then you have an LGWR processing problem and not an LGWR I/O problem. So I did a run of the same benchmark like this:

SQL> show parameter _disable

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_disable_logging                     boolean     TRUE

Everything else was the same, but now instead of really fast solid state disk I/Os, LGWR doesn’t even bother writing to disk at all. What did that do to the benchmark results? Nothing. I still got 133 TPS.

Summary
LGWR I/O is very simple. LGWR processing on the other hand is a bit more complex.

64 Responses to “Manly Men Only Use Solid State Disk For Redo Logging. LGWR I/O is Simple, But Not LGWR Processing”


  1. 1 Amir Hameed July 22, 2007 at 3:40 am

    Kevin,
    Excellent and very educational blog. I have a question that I would like to ask. On systems implemented with buffered I/O and asynchronous I/O enabled (like Oracle on Solaris), asynchronous I/O is implemented via LW threads, with each thread doing its own I/O synchronously. When LGWR issues an I/O that is below the port maximum, is that I/O carried out by one of the LWPs? If not, then how is it done?

  2. 2 Senthil Subramanian July 22, 2007 at 12:23 pm

    Kevin,

    Thanks for an excellent and informative blog on the mechanism of LGWR process.

  3. 3 kevinclosson July 22, 2007 at 4:31 pm

    Senthil,

    Thanks for the kind words and thanks for stopping by. Note, there is a lot I left out about process priorities and preemption having to do with post/wait, I’ll see if I can’t turn this into a mini-series.

    BTW, I speak at a reasonable number of OUG gatherings. I proposed an abstract for this sort of material a few times and those folks never think it is presentation-worthy. Maybe it’s not.

  4. 4 Dominic Delmolino July 23, 2007 at 1:39 pm

    Excellent post, Kevin. It’s definitely presentation-worthy. Make it provocative and people will eat it up — “Should LGWR get a dedicated CPU?” 🙂 It’s funny — for years people have been saying to make sure the redo logs are placed on dedicated disks — so the dedicated CPU statement should resonate. Good luck with it!

  5. 5 kevinclosson July 23, 2007 at 3:02 pm

    Dominic,

    When I worked on large systems (e.g., 30 CPU), I was willing to give LGWR a dedicated CPU since that would be some 3%. I wouldn’t consider that on an 8-core system. I’d probably pile DBWR and LGWR on the same CPU in that case. They both yield the CPU frequently enough…shouldn’t be a problem. I’d nice LGWR 20 and DBWR less. Some of the boutique ports (on good OSes) supported preemption protection while holding latches. That was a good thing too. Those old ports had a lot of specialization because the OSes were so hard-wired to the hardware.

  6. 6 Noons July 24, 2007 at 12:55 am

    Kevin: please turn this into a prezzie! And do indeed make it as widely available as possible.

    About time folks realize log tuning is not a secret art: it can be done if people sit down and think, instead of the usual knee-jerk reaction of “throw hardware at it”.

    In a recent sql server course I attended, we had a “tuning round table” discussion on how to achieve this in SS. First cab off the rack? “Stick the logfile in the fastest sequential access device you can find”. No mention whatsoever of solid-state disks! All Microslop recommends is: fastest write speed disk you can find and increase priority of logfile handler task. There are of course other details, I’m summarising.

    Logfile access is mostly sequential I/O, therefore this is what we must tune for. And the mechanism for controlling and tuning sequential write output is a well known classic: async I/O, fast writes by isolating from other random I/O interference, jack up the priority of the process issuing the I/O requests. But it MUST be all three of them, not just bits and pieces!

    I’m reminded of a BAAG moment of mine back in the 90s, when I recommended a client stick the redo logs in a dedicated LUN in a SAN: I got a dedicated chunk of the very fast battery-backed cache, but my redo log writing waits stayed where they were! It took me a while to click into what was wrong. When we “re-niced” lgwr, things took off!

    Folks should forget SSDs for logs: they are good to reduce seek and rotational delays, that is NOT the problem in 99% of the cases with redo logs. They might be much better off putting the indexes in the SSDs!

  7. 7 cristiancudizio July 24, 2007 at 3:49 pm

    Compliments Kevin, another interesting post. Please forgive me if I annoy you with another question: I’ve not well understood the first step you described in the “Log File Sync” paragraph, with

    “When a foreground process commits changes to the database, it allocates space in the redo buffer and copies the redo information into the buffer-protected either by the redo allocation latch itself or a redo copy latch. ”

    I was thinking that the foreground process allocates space in the redo log buffer with every statement that generates redo. Why do you say that it allocates space in the redo buffer when it commits? Where does it copy it then?
    Thanks a lot for all your information.
    Cristian

  8. 8 joel garry July 27, 2007 at 12:45 am

    thankyouthankyouthankyou.

    I’ve asked about this in several places over the years with a general response of crickets chirping.

  9. 9 Nataraj-CorePerf Engineer August 3, 2007 at 3:58 pm

    Hi Kevin,

    It’s a really interesting post and all your comments are really worthy. As far as my experience is concerned, sometimes it is very difficult to predict the behaviour of our BG processes, specifically in environments with hybrid workloads. What you’ve posted in SP-1 is exactly in sync with our workload profile, as we extensively perform DMLs on LOB data on various objects concurrently. My suggestion of moving to dedicated disks/RAID 0+1 was not approved due to GOVT regulations, but we’ve still tried to offset LGWR load by disabling logging for non-critical objects. It would be great to know the impact of nicing LGWR on an 18-core system.
    Finally, a lot of people have a lot of misconceptions about LGWR.

    By the way, have you tried playing around with the _log_io_size parameter during your benchmarks?

  10. 10 Nataraj-CorePerf Engineer August 3, 2007 at 3:59 pm

    Finally, a lot of people have a lot of misconceptions about LGWR; your pointers will definitely erase them. Let’s continue this thread for more experiences here.

  11. 11 Amir Hameed August 9, 2007 at 3:54 pm

    Kevin,
    If a process is bound to a specific CPU, what will happen if that CPU goes bad and the OS off-lines it, as happens from time to time on large SMP systems?

  12. 12 kevinclosson August 9, 2007 at 5:02 pm

    Hello Amir,

    Since I know a small bit about your specific environment I understand the question. The concept of a system surviving a processor failure is generally the domain of high-end boutique SMP systems. Such systems have handling for a processor failure in the scheduler code. A process bound to an offlined CPU isn’t banished; it’s just returned to the larger scheduling pool. This handling in the scheduling software merely complements the heavy lifting down in the hardware to handle a processor going offline.

    The Linux scheduler has provisions for CPUs coming and going with its CPU Hotplug support, but I don’t know what happens when a CPU fails. Indeed, since the P6, Xeons have supported the Machine Check Architecture, which exposes information to software about whether the processor is working correctly or not, but to what extent that is being used by Linux is something I haven’t looked into.

    I’ve never seen a CPU fail on a Linux system that didn’t result in a full-blown NMI PANIC. There are folks in the industry trying to add value in this space with x86 processors inside, such as the Bull R620 servers, which internally run parallel lock-step execution on different CPUs and compare the result of each instruction, checking for differences. If there is a difference, it supposedly figures out what to do, but I have no experience with this. They offer this higher-level hardware fault tolerance with either Woodcrest or Clovertown Xeons inside, but seemingly only targeted at Windows:

    http://www.wcm.bull.com/internet/pr/rend.jsp?DocId=220981&lang=en

  13. 13 Mike Killough September 5, 2007 at 1:38 pm

    Kevin,

    This was a very interesting article. Thanks for educating us on this. I always had assumed that ‘log file sync’ waits were always slow disk, and in my experience they always were. I would have been very surprised if I had moved them to fast disks and seen no benefit. I think this is the type of topic that should be presented more often. Do you think it would be a best practice to always renice the LGWR, or wait until the log file waits are an issue?

    Mike

  14. 14 kevinclosson September 5, 2007 at 2:26 pm

    Mike,

    I don’t like to see unbroken things fixed, so I’d stay off the renice unless you have generally high processor utilization. Now, far be it from me to practice what I preach, because I always renice LGWR and DBWR. How’s that for a slippery answer? 🙂

  15. 15 Aman Sharma September 13, 2007 at 3:49 am

    Hi sir,
    First of all, congrats on joining the Oracle server technology group. I am sure you will be among the most valuable assets of that group.
    I have read this post 3 times but I am still trying to understand it and collect the pieces to make them fit right, thanks a lot to my tiny brain :-(. But it’s the best post I have read so far on LGWR’s workings. I hope I shall understand it soon.
    May I ask one stupid doubt, sir? From the data buffer cache, DBWR does I/O in terms of Oracle block multiples, and from the redo buffer, LGWR in terms of O/S I/O multiples, is that right?
    Thanks and best regards,
    Aman….

  16. 16 kevinclosson September 13, 2007 at 5:40 am

    Aman,

    There is no such thing as a stupid question…just ask the people who mentored me over the last 20 years…trust me! The answer is that, yes, DBWR’s I/Os are a multiple of the block size, but it’s not so concise for LGWR. When you say “OS I/O multiple”, I’m not sure what you really mean there, so I have to say no. If you mean a multiple of the SCSI driver block size (512b), then yes. I do recall a problem with VME drives on a Unix system I used once that limited the max write size to a multiple of 512b alright, but not a power of 2…that device driver supported a max single DMA of only 120KB! That was weird, but you shouldn’t be surprised that at the Oracle port level we had to determine whether the redo logs were on one of these devices and then limit LGWR’s writes to 120KB. I wasn’t too sad to see VME go away 🙂 That was a long time ago.

    That was a longer answer than you wanted for sure.

  17. 17 Tom September 14, 2007 at 5:57 pm

    Very interesting article. One question…

    Is there any way to “renice” DBWR and LGWR on Windows where they are not separate processes but instead threads?

    • 18 Mike Timonin June 11, 2009 at 1:05 pm

      Tom,
      there’s a registry parameter ORACLE_PRIORITY which you can use. More info in MetaLink note 11608.1.
      It’s not as flexible as dynamic renicing on Unix, but you can try.

  18. 19 kevinclosson September 14, 2007 at 7:19 pm

    Tom,

    That is such a great question. The problem is that although I am no expert on Windows Oracle, I do know that the concepts I wrote about vis-à-vis the possible scheduling anomalies component of LFS are completely off the radar when discussing Windows Oracle. As you point out, LGWR is a thread and that changes everything. Since I have not studied LFS on Windows, I have to shut up now.

  19. 20 Zoran September 17, 2007 at 11:05 am

    Hi Kevin,

    This is a very good post and you should make the presentation around this topic.
    I remember testing solid state disks a couple of years ago for the same reason, fighting log file sync, as we had systems with a billion inserts per day into the same table.
    At that time my conclusion about the SSD was the same.
    I like to joke that Oracle “does not like fast I/O”. Oracle is designed to have I/O 🙂
    You just move the bottleneck from the disk to the CPU, where your possibilities for tuning may be more limited.
    When you have SSD your database starts experiencing all kinds of latching issues and CPU contention, as you clearly described.
    I had big problems tuning redo latches with SSD, for example, to somehow beat the non-SSD solution.
    In the end it was not worth it at all.
    Another point is the integration of SSDs into the overall existing HW platform, introducing more possible failure points.

    Regards,
    -Zoran

  20. 21 Joel October 3, 2007 at 12:46 pm

    How do you decide where to renice lgwr, I noticed 20 mentioned once above? Also, consider that the server may have 2 or 6 (or any other number) of production databases running. renice them all — and all the same number?

  21. 22 Purav November 26, 2007 at 5:50 am

    Hi Kevin, many thanks for this insight. We have many sites where our products are deployed. The DBs running at these sites are high-DML DBs (a high number of inserts/updates per second). Most of the DBs show high wait times for the event ‘log file sync’. However, the CPU utilization is as low as 3% and at times a high of 10%. In that case, I believe, changing the nice value will have no impact because lgwr is not waiting for CPU anyway, since CPU is already available in abundance.
    However, when I check iostat, what I see is that I/O bandwidth is also available (because at other times I have seen it going as high as 15MB per sec., whereas during the period when I see ‘log file sync’ waits, it’s just at 5MB per sec.).

    So then what is the reason it’s showing high wait times for ‘log file sync’?

    Thanks.

  22. 23 Adrian April 1, 2008 at 8:16 am

    Great article. This is certainly something I have seen in performance testing, particularly under higher utilisation when queuing is in effect on the database server CPU.

    We do also see average Log File Sync waits that are much larger than Log File Parallel Write when there are a large number of sessions committing and CPU utilisation is 20% - 30%.

    It is possible for a session committing to issue its commit a fraction after the current write is in progress; presumably it will then have to wait for the current write to complete and for the next one, which contains its own commit marker. It is therefore not unreasonable to see a Log File Sync wait which is almost double the Log File Parallel Write time, which will affect the average.

    Thanks

  23. 24 oracle tom July 21, 2008 at 9:34 am

    Thanks a lot. Your blog is really enlightening!

  24. 25 Krishna Manoharan July 24, 2008 at 4:10 pm

    Hi Kevin,

    Would the hidden parameter _log_parallelism help in these cases? Also, where would I find information on the port-specific maximum size of an I/O for the log writer?

    In real life, I have seen more bottlenecks from the arch and redo processes stomping on each other due to poor physical/redo group layout, rather than from redo I/O to disk.

    Thanks
    Krishna Manoharan

  25. 26 Mike Timonin June 11, 2009 at 1:14 pm

    Kevin,

    thanks for this excellent post. However, it’s probably worth mentioning that changing lgwr priority is not going to help things in the case of general CPU starvation. Please consider the following example (Oracle 9.2, Windows):

    1. OLTP test (Swingbench) on default priorities. Log file sync is topping the wait list, and the next one is latch free, which points at a lack of CPU power (and that is indeed present).

    Top 5 Timed Events
    ~~~~~~~~~~~~~~~~~~                                                % Total
    Event                                        Waits    Time (s)  Ela Time
    --------------------------------------- ------------ ----------- -------
    log file sync                                 441,554      29,045   78.71
    latch free                                     96,298       4,534   12.29
    CPU time                                                    2,192    5.94
    db file sequential read                        83,953       1,033    2.80
    log file parallel write                        11,942          27     .07

    2. We change LGWR priority to “highest” and get the following

    Top 5 Timed Events
    ~~~~~~~~~~~~~~~~~~                                                % Total
    Event                                        Waits    Time (s)  Ela Time
    --------------------------------------- ------------ ----------- -------
    latch free                                    211,786      38,858   84.29
    log file sync                                 442,901       2,943    6.38
    CPU time                                                    2,166    4.70
    db file sequential read                        84,486       2,064    4.48
    log file parallel write                       416,613          18     .04

    Yes, log file sync is now 15 times less than it was, hooray! But at the same time latching went up as well and all the benefit is lost – the tests demonstrate the same throughput.

    • 27 kevinclosson June 11, 2009 at 3:40 pm

      “However, it’s probably worth metioning that changing lgwr priority not going to help things in case of general CPU starvation.”

      …Mike, I wouldn’t expect removing wait events to improve anything in a CPU-starvation case.

      PS. 9iR2 on Windows? Wow, I haven’t even thought of 9i for about 6 years!

      PPS. What particular latch was so hammered in your case? Swingbench on 9i? Was it library cache? Perhaps Redo allocation or copy?

      • 28 Mike Timonin June 11, 2009 at 4:49 pm

        Kevin,

        I’d say it depends on how you do it. Asynchronous commit in 10g could greatly improve (by removing a wait!) such a situation, if your application can afford it.

        Re the latch – it’s cache buffers chains.
        The number of requests for the latch stayed the same, but the times are 10 times higher:
        1. cache buffers chains 184,151,078 3430
        2. cache buffers chains 185,411,883 35978
        Probably I’ve made the concurrency level too high (100 sessions on a 4-core box), but the message is clear – something for something.

        PS lucky you 🙂

        • 29 kevinclosson June 11, 2009 at 5:37 pm

          You have a situation where CPU saturation is causing the holders of chains latches to lose CPU while walking the chain. In a perfect world, latch holders never lose CPU during their quantum for anything other than the CPU taking a hardware interrupt (don’t want to starve those). Also, I bet if you dug into this you’d find there are probably only about 5 or so really hot chains…not a smooth problem across the thousands of chain latches…. it would be interesting to see the same test with 10g or 11g with specific focus on the behavior of chains latches….

  26. 30 Martin Berger July 28, 2009 at 4:39 am

    Hi Kevin,
    can you leave one or two words about multiplexed redo, please?
    If there are multiplexed logfiles, does the log writer need to wait for all multiplexed files to have a successful write, or only for the fastest (first) write?
    As always, thank you for the great article!
    Martin

    • 31 kevinclosson July 28, 2009 at 5:32 am

      Hi Martin,

      Multiplexed redo is only as fast as the slowest target device. I’ve wondered for quite some time what point there is in multiplexed redo when you can have high redundancy ASM. I should chat as soon as possible with those guys I hold in highest regard: the MAA Group. After all, there could be some funky Data Guard benefit from multiplexed redo and if that should be the case, I acknowledge my complete ignorance on that facet of the topic.

  27. 32 qihua November 16, 2009 at 2:28 am

    Hi Kevin,
    Thanks for sharing the internals of how lgwr works. I have one question about the following statement:

    ==>So, LGWR has made system calls to write, system calls to poll for I/O completion, and finally 2 calls to semop().

    Do you mean poll is called by lgwr frequently? But the truss output doesn’t show such a call. I trussed lgwr, then inserted a large amount of data and committed, but no call named poll appears in the truss output.

    MSGALERT-sqwdb083$> ps -ef|grep lgw
    oracle 22891 20289 0 19:20:26 pts/4 0:00 grep lgw
    oracle 24029 1 0 Oct 14 ? 14:39 ora_lgwr_MSGALERT
    MSGALERT-sqwdb083$> truss -p 24029
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) (sleeping…)
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) Err#11 EAGAIN
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) (sleeping…)
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) = 0
    pwrite(260, “03F9\f EB0 )E7 <03".., 512, 0x188B6000) = 512
    semctl(77, 195, SETVAL, 1) = 0
    times(0xFFFFFFFF7FFFDF10) = 656626590
    times(0xFFFFFFFF7FFFDE60) = 656626590
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) Err#11 EAGAIN
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) (sleeping…)
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) Err#11 EAGAIN
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) (sleeping…)
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) Err#11 EAGAIN
    times(0xFFFFFFFF7FFFDF10) = 656627240
    times(0xFFFFFFFF7FFFDE60) = 656627240
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) (sleeping…)
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) Err#11 EAGAIN
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) (sleeping…)
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) Err#11 EAGAIN
    times(0xFFFFFFFF7FFFDF10) = 656627842
    times(0xFFFFFFFF7FFFDE60) = 656627842
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) (sleeping…)
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) = 0
    pwrite(260, "03F9\f EB1 )E7 <11".., 1048576, 0x188B6200) = 1048576
    pwrite(260, "03F9\f MB1 )E7 <11".., 2048, 0x189B6200) = 2048
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) Err#11 EAGAIN
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) (sleeping…)
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) Err#11 EAGAIN
    pwrite(260, "03F9\f MB5 )E7 <15".., 292352, 0x189B6A00) = 292352
    times(0xFFFFFFFF7FFFDF10) = 656628456
    times(0xFFFFFFFF7FFFDE60) = 656628456
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) (sleeping…)
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) Err#11 EAGAIN
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) (sleeping…)
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) Err#11 EAGAIN
    open("/proc/24015/psinfo", O_RDONLY) = 19
    read(19, "0201 ]CF".., 416) = 416
    close(19) = 0
    times(0xFFFFFFFF7FFFDF10) = 656629058
    times(0xFFFFFFFF7FFFDE60) = 656629058
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) (sleeping…)
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) Err#11 EAGAIN
    semtimedop(76, 0xFFFFFFFF7FFFD974, 1, 0xFFFFFFFF7FFFD960) = 0
    pwrite(260, "03F9\f OF0 )E7 <1F".., 419840, 0x189FE000) = 419840
    semtimedop(76, 0xFFFFFFFF7FFFDD24, 1, 0xFFFFFFFF7FFFDD10) = 0
    pwrite(260, "03F9\f S $ )E7 <1F".., 1048576, 0x18A64800) = 1048576

    • 33 kevinclosson November 17, 2009 at 12:19 am

      Hello qihua,

      When I used the term “poll” I meant to convey “completion processing” or “reap.” That is different than saying poll(2).

      In looking at your pasted truss output I can surmise that this info was not gleaned from a Linux system (because of truss). However, I also see pwrite() calls, so no asynchronous I/O. In your specific case you wouldn’t see any “poll” or “completion processing” because LGWR isn’t using async I/O. I can only guess you are on a Solaris system.

      On Linux, the “completion processing”, or polling/reaping, is done with the io_getevents() call, for what it is worth.

      Can you enable asynchronous I/O and truss some more?

      • 34 qihua April 9, 2010 at 9:12 am

        Actually we are using async I/O; the pwrite calls are against the controlfile.

        I also noticed a weird thing: the number of “redo write” operations in Oracle statistics is only about 1/4 to 1/6 of the number of write I/Os at the OS level (we are using QIO, and we can get the number of write I/Os from vxstat). So why is one Oracle redo write split into so many OS-level I/Os?

        Test$> grep pwrite xxx
        /7: pwrite(256, “1502\n01 ” SAF”.., 8192, 81920) = 8192
        /9: pwrite(257, “1502\n01 ” SAF”.., 8192, 81920) = 8192
        /11: pwrite(258, “1502\n01 ” SAF”.., 8192, 81920) = 8192
        /5: pwrite(256, “1502\tFB01 ” SAF”.., 8192, 0x013F6000) = 8192
        /7: pwrite(257, “1502\tFB01 ” SAF”.., 8192, 0x013F6000) = 8192
        /9: pwrite(258, “1502\tFB01 ” SAF”.., 8192, 0x013F6000) = 8192
        /1: pwrite(256, “15020101 ” SB0″.., 8192, 8192) = 8192
        /11: pwrite(257, “15020101 ” SB0″.., 8192, 8192) = 8192
        /5: pwrite(258, “15020101 ” SB0″.., 8192, 8192) = 8192
        /1: pwrite(262, “13C89301 *A114B2”.., 512, 512) = 512
        /1: pwrite(260, “13C89201 *A114B2”.., 512, 512) = 512
        /7: pwrite(256, “1502\t01 ” SB0″.., 8192, 73728) = 8192
        /9: pwrite(257, “1502\t01 ” SB0″.., 8192, 73728) = 8192
        /11: pwrite(258, “1502\t01 ” SB0″.., 8192, 73728) = 8192
        /7: pwrite(257, “150203 501 ” SB0″.., 8192, 6725632) = 8192
        /5: pwrite(256, “150203 501 ” SB0″.., 8192, 6725632) = 8192
        /9: pwrite(258, “150203 501 ” SB0″.., 8192, 6725632) = 8192
        /11: pwrite(256, “15020701 ” SB0″.., 8192, 57344) = 8192
        /5: pwrite(257, “15020701 ” SB0″.., 8192, 57344) = 8192
        /7: pwrite(258, “15020701 ” SB0″.., 8192, 57344) = 8192
        /9: pwrite(256, “1502\tFC01 ” SB0″.., 8192, 0x013F8000) = 8192
        /5: pwrite(258, “1502\tFC01 ” SB0″.., 8192, 0x013F8000) = 8192
        /11: pwrite(257, “1502\tFC01 ” SB0″.., 8192, 0x013F8000) = 8192
        /1: pwrite(256, “15020101 ” SB1″.., 8192, 8192) = 8192
        /7: pwrite(257, “15020101 ” SB1″.., 8192, 8192) = 8192
        /9: pwrite(258, “15020101 ” SB1″.., 8192, 8192) = 8192
        Test$> grep kaio xxx|more
        /1: kaio(AIOWRITE, 260, 0x381513E00, 4608, 0x04F42E0007469790) = 0
        /1: kaio(AIOWAIT, 0xFFFFFFFF7FFFD890) = 4417034128
        /1: kaio(AIOWRITE, 260, 0x381515000, 7680, 0x04F440000747BDC0) = 0
        /1: kaio(AIOWAIT, 0xFFFFFFFF7FFFD890) = 4417109440
        /1: kaio(AIOWRITE, 260, 0x381516E00, 13824, 0x04F45E0007456D70) = 0
        /1: kaio(AIOWAIT, 0xFFFFFFFF7FFFD890) = 4416957808
        /1: kaio(AIOWRITE, 260, 0x38151A400, 7168, 0x04F4940007469790) = 0
        /1: kaio(AIOWRITE, 260, 0x38151C000, 1024, 0x04F4B0000747BDC0) = 0
        /1: kaio(AIOWAIT, 0xFFFFFFFFFFFFFFFF) = 4417034128
        /1: kaio(AIOWAIT, 0xFFFFFFFF7FFFD890) = 4417109440

  28. 35 Dmitri March 10, 2011 at 11:33 am

    Hi Kevin

    I love this post. I come back and re-read it every month or so in the hope that one day it will all sink in completely.

    They say there’s no such thing as a dumb question, but I’d like to test that assertion if I may.

    You say: “For this reason you should always set LGWR’s scheduling priority to the highest possible (with renice for instance).”

    Why doesn’t Oracle do this by default? There are parameters in some flavours of Oracle which change the priority of processes (e.g. HPUX_SCHED_NOAGE) after all.

    Also, is it supported to renice an Oracle process manually like this? It seems like a no-brainer but you never know.

    Thanks

    Dmitri

  29. 36 John August 19, 2011 at 9:30 am

    Kevin,

    I am trying to digest this… 🙂

    You wrote:
    >>If LGWR is clearing more than the port-defined maximum from the buffer

    How do I find the “port-defined maximum”?

    Regards,
    John

    • 37 kevinclosson August 22, 2011 at 5:31 pm

      We used to put that sort of port-specific stuff in the release notes for the port I worked on (DYNIX/ptx). If it isn’t listed there, you could ask support, because it can vary. It’s 128KB on Linux and, to the best of my knowledge, it is 128KB on most platforms. I don’t have access to the code anymore so I can’t say for certain about other platforms.

  30. 38 manish February 13, 2013 at 4:53 pm

    I just read it and it was wonderful. I worked at Oracle as well but unfortunately could not get this deep an insight there either… so thanks a ton 🙂

  31. 39 SQL Sasquatch January 16, 2014 at 7:03 am

    Hello Kevin! In the blog post:
    “LGWR will keep up to 4 I/Os in flight (I need to check if that is more in late releases)”
    Still 4 inflight I/O max for LGWR in 12c? Been searching, couldn’t even find another mention of the max in any version yet.

    • 40 SQL Sasquatch January 16, 2014 at 7:18 am

      So now I’ve seen references in various pages to 4 max in flight for LGWR as far back as 2003… but nothing versioned for 11.2.0.3 or later…

      • 41 kevinclosson January 16, 2014 at 5:05 pm

        The number of I/Os in flight from background processes like LGWR is controlled by the port. I’m stale on the code (having left the Oracle Exadata development org in 2011) so I can’t say what it is in modern releases. Sorry.

  32. 42 Mr Lallemand (@lacuadra) April 26, 2016 at 8:14 am

    I love your topic, but I have to say this: the old and tired LF Sync issue is an Oracle design problem!!

    It is an old design that needs to be reworked.

    A single synchronous log writer is outdated; it is just ridiculous to expect a 21st-century DB system to post changes based on a 1970s-concept world.

    The log writer has to be able to multi-stream and flush data accordingly to accommodate a large amount of transactional throughput. Imagine a garden hose spraying information and receiving constant feedback. (RAC is not the answer, as networking has not replaced the electronic cohesion of memory banks.)

    How do you do this…? That is up to the big boys to figure out. If I knew the answer I would be the next Larry Ellison!








