Oracle OTLP | Kevin Closson's Blog: Platforms, Databases and Storage

Let’s look at some statspack data from an OLTP Order Entry benchmark run that achieved 133 transaction per second (TPS):

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
log file sync                                   68,965       1,117     16   64.6
db file sequential read                        159,712         241      2   14.0
CPU time                                                       163           9.4
log file switch (checkpoint incomplete)            164         146    890    8.4
db file parallel write                          19,651          31      2    1.8

And now let’s look at the stats from run number 2 of the same benchmark, same system, same disk setup, etc:

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
db file sequential read                        838,810       1,635      2   69.1
CPU time                                                       253          10.7
db file parallel write                         153,334         145      1    6.1
log file sync                                   93,100         143      2    6.0
log file parallel write                         93,417          73      1    3.1

Benchmark run number 2 achieved 200 TPS. That’s right, statspack number 1 and number 2 above are from the same system. The database files are stored in an an HP EFS Clustered Gateway NAS device and accesed via NFS and the redo logs are, get this, stored on a Texas Memory Systems Solid State Disk. Statspack 2 above was a very busy system (a lot of cached SGA work) and doing over 3,000 IOPS:

Load Profile                            Per Second       Per Transaction
~~~~~~~~~~~~                       ---------------       ---------------
                  Redo size:          1,138,491.64              7,166.65
              Logical reads:             14,640.16                 92.16
              Block changes:              5,538.83                 34.87
             Physical reads:              1,428.73                  8.99
            Physical writes:              1,742.27                 10.97
                 User calls:              2,352.99                 14.81

So if a) system is the same, and b) LGWR is writing redo to the fastest disk possible, what is the story? What changed from statspack 1 to statspack 2 to make log file sync wait events jump 8-fold-with the associated 34% drop in throughput? More users? No. Nothing that simple. I’ll explain.

LGWR I/O? Who Cares-Simple Stuff
I see myriads of information on blogs, forums and all those expert repositories about the I/O aspects of redo logging, but very little about LGWR processing. Yes, a portion of LGWR’s job is to flush the redo buffer to the online redo log, but there’s more to LGWR’s job than writing redo. In fact, I never could understand the fascination with LGWR’s I/O since it is about as simple as it gets: sequential writes to a single file of sizes ranging from 512 bytes to a port-defined maximum (usually 128KB but often 1MB). Yes, it gets a little more complex with multiplexed redo, but that is just concurrent operations of the exact same profile. The part that seems to be hard to grasp for folks is where asynchronous I/O comes in to play.

LGWR Async I/O
If LGWR is clearing more than the port-defined maximum from the buffer, asynchronous I/O is important. Let’s consider an example. Let’s say LGWR has been posted to clear 1.75MB of redo from the buffer on a platform that establishes the LGWR I/O maximum to be 1MB. In this case, LGWR will issue a 1MB write asynchronously immediately followed by a 750KB asynchronous write. These writes are sequential. After the second I/O is in flight, LGWR starts to poll for the completions of the I/Os. So the question often becomes what if LGWR is only flushing, say, 600KB instead of 1.75MB, isn’t asynchronous I/O crucial in that case as well? The answer is no-unless you are using multiplexed redo. If LGWR is flushing an amount less than the port-defined maximum, asynchronous I/O offers nothing-for that specific flushing operation. See, LGWR is a lot different than DBWR. When DBWR has I/O in flight, it also has other stuff it needs to do like building the next write batch and so forth. LGWR on the other hand has nothing to do when its I/Os are in flight. It polls for the completions and once the flushing operation is complete, it then takes action to post the foregrounds that are waiting in a log file sync wait event. So asyncnronous I/O has no play in a LGWR buffer flush operation of an amount smaller than the port-defined maximum I/O size.

Oh, I forgot to mention, LGWR will keep up to 4 I/Os in flight (I need to check if that is more in late releases) when clearing more than the maximum allowed in a single physical write. All in all, LGWR I/O is very simple. So why do session waits, such as log file sync, generate so much folklore? I don’t know, but I know the folklore isn’t helpful.

Throw A Solid State Disk At It
I love that one. Let’s see, writing to solid state disk is faster than round-brown spinning thingies, so if any aspect of LGWR’s duties seem out of sorts, let’s get some really, really fast disk. Don’t get me wrong, I’m a big fan of solid state disk, but throwing it at the wrong problem is not the right thing to do.

The most complex topic regarding LGWR is the log file sync wait event. Why? Because it has very little to do with I/O, that’s why. Sure I/O is a part of it, but there is more to it.

Backgrounder
LGWR spends its time in a work loop consisting of very little, really. Well, in contrast to DBWR that is. Remember, DBWR has to manipulate both cache buffers chains and lrus and flush modified SGA buffers and post processes. While DBWR works on a large amount of metadata and buffers, LGWR does not. A glance at ps(1) output makes that clear. Notice in the following ps(1) output, LGWR manages to perform its tasks with a resident set size one 30^th the size of DBWR-and of course that ratio would lean much more to the DBWR side on a large SGA system since DBWR has to map all the buffers in the SGA over the life of the instance (the exception to this is the indirect data buffers feature). LGWR on the other hand has very little working set because it only tends to the redo buffer and will only exercise a few latches (redo allocation, copy, etc).

$ ps -eF | egrep 'UID|lgwr|dbw' | grep -v grep
UID        PID  PPID  C    SZ  RSS PSR STIME TTY          TIME CMD
oracle   19062     1  1 377082 851808 0 16:11 ?       00:00:08 ora_dbw0_bench1
oracle   19064     1  2 379943 29816 0 16:11 ?        00:00:09 ora_lgwr_bench1

Log File Sync
When a foreground process commits changes to the database, it allocates space in the redo buffer and copies the redo information into the buffer-protected either by the redo allocation latch itself or a redo copy latch. The process then enqueues itself for service from LGWR and posts the LGWR (most ports use IPC semaphores for posting). The process then goes to sleep waiting for the return post from LGWR indicating its redo is flushed to disk. So, we now have a process waiting for log file sync event (LFS). Let’s call this process waiter1. How long waiter1 waits depends on a lot of factors.

The first factor that will affect waiter1’s LFS wait is what LGWR was doing when it posted him. Remember, Oracle is a multi-user database on a multi-processor system. There is 1 LGWR and a lot of consumers of his service. On a busy system, odds are quite good that any time LGWR get posted it is in the process of servicing a batch of prior waiters. That is, when waiter1 posted LGWR the odds are good it was already busy flushing a bunch of redo for some other group of processes who are themselves waiting in LFS. This is where it gets dicey.

Before LGWR can flush waiter1’s redo, he has to finish flushing the prior batch and post all of those waiters. Let’s say there was 256KB of redo to flush for 2 waiters on a platform with the port-defined maximum LGWR I/O size set at 128KB. On this platform, LGWR will have to issue 2 asynchronous 128KB writes and poll for those to complete. Once the I/O is complete, LGWR will post each of the two waiters using the post-wait mechanism for that port of Oracle-which is an IPC Semaphore semop() call for most platforms. So, LGWR has made system calls to write, and system calles to poll for I/O completion and finally 2 calls to semop(). LGWR is just a normal user-mode process so all this dipping into the kernel introduces processing delays. Yes, these system calls I listed are non-blocking, but when a processor transitions from kernel to user mode there is opportunity to switch to higher-priority runable processes. In fact, the very act of LGWR posting these 2 processes can cause it to lose CPU. Huh? Sure. What if instead of 2 there were 16 in the batch and there were only 4 CPUs on the system. We call that the thundering herd and it wreaks havoc on LGWR. These process scheduling storms can be quite problematic. Think about it. You have one process (LGWR) that just made 4,8 or 16 or more processes runable by semop()ing them from their log file sync wait which makes them runable in kernel mode which trumps a runable user-mode process. If that happens on a system with, say 8 cores, those processes will need to run somewhere. It is a scheduling storm. For this reason you should always set LGWR’s scheduling priority to the highest possible (with renice for instance).

Without elevated scheduling priority for LGWR, it can often be the case that the processes LGWR wakes up will take the CPU from LGWR. Some platforms (Legacy Unix) support the notion of making processes non-preemptable as well. On those platforms you can make LGWR non-preemptable. Without preemption protection, LGWR can lose CPU before it has exhausted its fair time slice.
Once LGWR loses his CPU it may be quite some time until he gets it back. For instance, if LGWR is preempted in the middle of trying to perform a redo buffer flush, there may be several time slices of execution for other processes before LGWR gets back on CPU. Why am I putting you to sleep about this? Because those time slices that LGWR was off CPU will be charged to foreground processes waiting in a log file sync wait event. Yes, if a foreground is in log file sync wait and LGWR is piled under other processes and can’t get CPU for, say, a few time slices (10ms each), that foreground process’ log file sync will have 10, 20 or more milliseconds of extra wait tacked onto it before LGWR even gets a chance to do I/O. LGWR’s I/O is merely a component of a foreground process’ LFS wait.

So what happened to the benchmark run behind statspack 2 above? CPU starvation specifically for LGWR! How did I do that? Easy. I have a simple program that uses the Linux sched_setaffinity() API call to assign hard processor affinity to processes. These tests were done on a 4-core x86 system and in both cases all the background processes were started up with hard CPU affinity to CPU 0. Once the database was started, I reassign LGWR’s hard affinity to CPU 3. The Pro*C benchmark processes and their associated shadow processes are all running split equally with hard affinity between CPUs 1 and 2. This is the model for both benchmark runs. Yes, LGWR is all alone on CPU 3.

In the case where the benchmark only got 133TPS, I have a “noise” process running on the same CPU as LGWR. This noise process also is pinned to the CPU with hard affinity. What does this “noise” process do that is so heinous as to affect Oracle’s OLTP throughput by 34%? After all, it is only a single process. Yes, single, but also running on the same CPU as LGWR. The noise program is:

double sqrt();
main () { int i; double z; for (;;) { for (i = 0; i < 1000000; i++) { z = sqrt (i); } poll (0, 0, 1); } }

Simple enough. That loop of sqrt() calls from libm takes about 3ms on the CPUs in this benchmark system. So the way this works is that about every 3ms, this process will go to sleep for 1ms using the poll() system call. If this process is running when LGWR needs to run, LGWR will have to wait for 3ms. If that process isn’t executing, LGWR has no impediment to peforming his tasks and LFS are a measely 2ms and OLTP throughput jumps 34%.

This little parasite noise process gets in LGWR’s way a lot too. Basically, any time LGWR gets posted to perform a flush, this noise process is in the way (chalk up 3ms), any time LGWR blocks on an I/O (e.g., a flush that is smaller than the port maximum and therefore a single I/O) this noise process is in the way once the I/O is complete (chalk up 3ms), any time LGWR unluckily loses his CPU to a time slice switch this process will run and take 3ms from LGWR. It all adds up. And in this test case, processes posted by LGWR can never take his CPU (since I’ve set up hard affinity) so this is a nicer test case actually.

Don’t Write At All
Just to be evil, I’ll take it a step further. Did you know that if you set the initialization parameter _disable_logging=TRUE LGWR does everything it normally does except it omits performing the I/O for a redo flush. Don’t do this if you like your database because anything other than a shutdown normal is going to leave your database unrecoverable. However, this is a great test because if performance doesn’t increase when you set this parameter in this fashion, then you have a LGWR I/O processing problem and not a LGWR I/O problem. So I did a run of the same benchmark like this:

SQL> show parameter _disable

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_disable_logging                     boolean     TRUE

Everything else was the same but now instead of really fast Solid State Disk I/Os, LGWR doesn’t even bother writing to disk at all. What did that do to the benchmark results? Nothing. I still get 133TPS.

Summary
LGWR I/O is very simple. LGWR processing on the other hand is a bit more complex.

	kevinclosson on Announcing SLOB 2.5.4
	Hell Dip on Announcing SLOB 2.5.4
	kevinclosson on Introducing SLOB – The S…
	Amey Bobade on Introducing SLOB – The S…
	Amey Bobade on Introducing SLOB – The S…

Kevin Closson's Blog: Platforms, Databases and Storage

Archive for the 'Oracle OTLP' Category

What Oracle Topic Is On Your Coffee Table?

Manly Men Only Use Solid State Disk For Redo Logging. LGWR I/O is Simple, But Not LGWR Processing

DISCLAIMER

Pages

Blogroll

Follow Blog via Email

Recent Posts

Recent Comments

Fond Memories

Copyright