Using Linux Perf(1) To Analyze Database Processes – Part I

Troubleshooting Runaway Processes
Everyone reads Tanel Poder’s material—for good reason.

I took particular interest in his recent post about investigating where an apparently runaway, cpu-bound Oracle foreground process is spending its time. That article can be found here.

I’ve been meaning to do some blogging about analyzing Oracle execution with perf(1), and I think Tanel’s post is a good segue for me to do so. You might ask, however, why I would bother attempting to add value in a space Tanel has already covered. To that I would simply say that modern systems professionals need as many tools as they can get their hands on.

Tanel, please ping me if you object, for any reason, to my direct links to your scripts.

Monitoring Window
Tanel’s approach to monitoring his cpu-bound foreground process is based on the pstack(1) utility. Once he identified the spinning process he fired off 100 pstack commands in a loop. For each iteration, the output of pstack was piped through a text processing script (os_explain). From my reading of that script I estimate it takes roughly 10 milliseconds to process the data flowing into it. I’m estimating here, so please bear with me. If pstack execution itself required zero wall clock time, these 100 snoops at the process stack would complete in about 1 second of wall clock time. According to Tanel, the SQL takes about 23 minutes to complete. If the foreground process is looping through just a small amount of code then I see no problem monitoring a small portion of its overall execution time. Since you can see what Tanel is doing, you know it would be simple to grab the process every so often throughout its life and monitor for the approximately 1 second the 100 iterations of pstack require. The technique and tools Tanel shares here are extensible. That is good.
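
For readers who want to experiment, the following is a minimal sketch of that sampling loop, assuming the target process ID is passed as the first argument. Tanel’s os_explain script does the real decoding of the stack frames; the awk/sort/uniq pipeline at the end is merely a crude stand-in that ranks the raw function names pstack reports:

#!/bin/bash
# Sketch only: sample the stack of the process given in $1 one
# hundred times and rank the most frequently seen functions.
# Substitute Tanel's os_explain for the awk | sort | uniq -c
# stand-in to get properly annotated Oracle kernel functions.
SPID=$1
for i in {1..100}
do
	pstack $SPID
done | awk ' $3 == "in" { print $4 } ' | sort | uniq -c | sort -rn | head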

Perturbation
The Linux pstack utility stops the execution of a process in order to read its stack and produce output to standard out. Performance engineering techniques should always be weighed against the perturbation levied by the monitoring method.  I recall the olden days of CA Unicenter brutally punishing a system just to report performance metrics. Ouch.

Monitoring a cpu-bound process, even periodically, with pstack will perturb performance. There is such a thing as an acceptable amount of perturbation, though. One must decide for oneself what that acceptable level is.

I would like to offer an example of pstack perturbation. For this exercise I will use the silly core-stalling program I call “fat.” This program suffers a lot of processor stalls because it has very poor cache locality. The program can be found here along with its cousin, “skinny.” Aptly named, skinny fits in cache and does not stall. The following shows that the program executes in 51.251 seconds on my Nehalem Xeon server when executed in isolation. However, when I add 100 pokes with pstack, the run time suffers roughly 5% degradation.

$
$ cat p.sh
#!/bin/bash

time ./fat &
p=`ps -f | awk ' $NF ~ /.fat/ { print $2 }'`

for i in {1..100}
do
	pstack $p
done  > /dev/null 2>&1
wait
$
$ sh ./p.sh

real	0m53.836s
user	0m50.604s
sys	0m0.024s
$
$ time ./fat

real	0m51.251s
user	0m51.221s
sys	0m0.013s
$

I know 5% is not a lot, but this is a cpu-bound process so that is a lot of cycles. Moreover, this is a process that is not sharing any data nor exercising any application concurrency mechanisms (e.g., spinlocks, semaphores). I understand there is little danger in perturbing the performance of some runaway processes, but if such a process is also sharing data there can be ancillary effects when troubleshooting a large-scale problem. Before I move on I’d like to show how the execution time of “fat” changes when I gather 100 pstacks as soon as the process is invoked and then wait 10 seconds before gathering another 100 stack readings. As the following shows, the perturbation meter climbs to 13%:

$ cat p2.sh
#!/bin/bash

time ./fat &
p=`ps -f | awk ' $NF ~ /.fat/ { print $2 }'`

for t in 1 2
do
	for i in {1..100}
	do
		pstack $p
	done  > /dev/null 2>&1
	sleep 10
done

wait
$ sh ./p2.sh

real	0m57.930s
user	0m51.480s
sys	0m0.050s
$
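
For the record, the arithmetic behind those two percentages is easy to check at the prompt (bc(1) shown here just to make the figures explicit):

$ echo 'scale=3; (53.836 - 51.251) * 100 / 51.251' | bc
5.043
$ echo 'scale=3; (57.930 - 51.251) * 100 / 51.251' | bc
13.031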

Performance Profiling With perf(1)
As I said, I’ve been trying to find the time to blog about how important perf(1) should be to you in your Oracle monitoring, or any performance analysis for that matter. It requires a modern Linux distribution (e.g., RHEL 6, OEL 6), though. I can’t write a full tutorial on perf(1) in this post. However, since I’m on the topic of perturbation via monitoring tools, I’ll offer an example of perf-record(1) using my little “fat” program:

$ time perf record ./fat
[ perf record: Woken up 8 times to write data ]
[ perf record: Captured and wrote 1.812 MB perf.data (~79187 samples) ]

real	0m52.299s
user	0m52.234s
sys	0m0.046s
$

$ perf report --stdio | grep fat
# cmdline : /usr/bin/perf record ./fat
    99.81%      fat  fat                [.] main
     0.14%      fat  [kernel.kallsyms]  [k] __do_softirq
     0.02%      fat  libc-2.12.so       [.] __memset_sse2
     0.01%      fat  [kernel.kallsyms]  [k] clear_page_c
     0.01%      fat  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     0.00%      fat  [kernel.kallsyms]  [k] get_page_from_freelist
     0.00%      fat  [kernel.kallsyms]  [k] free_pages_prepare

So, we see that monitoring with perf(1) does levy a small tax of roughly 2%. However, bear in mind that I just monitored the entire execution of the program.
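
If even that small tax is a concern, perf-record(1) will trade resolution for overhead. As a sketch, lowering the sampling rate with the -F flag collects far fewer samples per second than the default (which is in the thousands):

$ # Sketch: sample at roughly 99 Hz to further shave monitoring cost.
$ # Fewer samples means coarser attribution, so a longer monitoring
$ # window is needed for the same statistical confidence.
$ perf record -F 99 ./fat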

Monitor Oracle Foreground Process With perf-record(1)
Now I’ll run the program Tanel used in his example. I’ll record 5 minutes of execution by identifying the process ID of the spinning shadow process and then using the -p option to perf-record(1). When given a command to launch, perf-record(1) monitors that command’s entire execution; when attached to an existing process with -p, it collects until the command supplied on the command line exits. Now don’t be confused. In the following example I’m telling perf-record(1) to monitor the shadow process. The usage of sleep 300 is just my way of telling it to finish in 5 minutes.

$ cat t.sh

sqlplus / as sysdba < /dev/null 2>&1 &
$ sh ./t.sh &
[1] 462
$ ps -f
UID        PID  PPID  C STIME TTY          TIME CMD
oracle     462  2259  0 10:11 pts/0    00:00:00 sh ./t.sh
oracle     463   462  0 10:11 pts/0    00:00:00 sqlplus   as sysdba
oracle     465  2259  0 10:11 pts/0    00:00:00 ps -f
oracle    2259  2258  0 Feb14 pts/0    00:00:00 -bash
$ ps -ef | grep 463 | grep -v grep
oracle     463   462  0 10:11 pts/0    00:00:00 sqlplus   as sysdba
oracle     464   463 89 10:11 ?        00:00:14 oracleSLOB (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
$ sudo perf record -p 464 sleep 300
[ perf record: Woken up 35 times to write data ]
[ perf record: Captured and wrote 8.755 MB perf.data (~382519 samples) ]
$

$ perf report
$ perf report --stdio
# ========
# captured on: Mon Feb 18 10:17:02 2013
# ========
#
# Events: 286K cpu-clock
#
# Overhead  Command      Shared Object                       Symbol
# ........  .......  .................  ...........................
#
    29.84%   oracle  oracle             [.] kglic0
    11.08%   oracle  oracle             [.] kgxExclusive
    11.02%   oracle  oracle             [.] kglGetHandleReference
     6.66%   oracle  oracle             [.] kglGetMutex
     6.47%   oracle  oracle             [.] kgxRelease
     4.91%   oracle  oracle             [.] kglGetSessionUOL
     4.48%   oracle  oracle             [.] kglic_cbk
     3.79%   oracle  oracle             [.] kgligl
     3.44%   oracle  oracle             [.] kglMutexHeld
     3.12%   oracle  oracle             [.] kgligp
     1.96%   oracle  oracle             [.] kqlpgCallback
     1.92%   oracle  oracle             [.] kglReleaseMutex
     1.88%   oracle  oracle             [.] kglGetBucketMutex
     1.80%   oracle  oracle             [.] kglReleaseBucketMutex
     1.47%   oracle  oracle             [.] kglIsMutexHeld
     1.18%   oracle  oracle             [.] kglMutexNotHeld
     1.04%   oracle  oracle             [.] kglReleaseHandleReference
     0.51%   oracle  oracle             [.] kghalf
     0.33%   oracle  oracle             [.] qerfxFetch
     0.32%   oracle  oracle             [.] lnxmin
     0.31%   oracle  oracle             [.] qerfxGCol
     0.26%   oracle  oracle             [.] qeruaRowProcedure
     0.24%   oracle  oracle             [.] kqlfnn

$ perf report --stdio | grep oracle | sed 's/\%//g' | awk '{ t=t+$1 } END { print t }'
99.98
$

If you study Tanel’s post and compare it to mine you’ll see differences in the cost accounting Tanel has associated with certain Oracle kernel routines. That’s no huge problem. I’d like to explain what’s happening.

When I used perf-record(1) to monitor the shadow process for 300 seconds it collected approximately 382,519 samples. When using a pstack approach, on the other hand, just be aware that you are getting a glimpse of whatever happens to be on the stack at the instants the program is stopped. You might be missing a lot of important events. Allow me to offer an illustration of this effect.
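
Incidentally, perf(1) is not limited to instruction-pointer samples; it can capture stacks too, which narrows the gap with the pstack approach. A sketch, reusing the process ID from the example above:

$ # Sketch: -g records a call graph with each sample; the report then
$ # shows full stacks. Frame-pointer unwinding can be incomplete on
$ # heavily optimized binaries, so treat the deeper frames with care.
$ sudo perf record -g -p 464 sleep 300
$ perf report --stdio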

Do You See What I See?
Envision a sentry guarding a wall and taking a photo every 1 second. Intruders jumping over the wall with great agility are less likely to be in view when he snaps his photo. On the other hand, an intruder taking a lot longer to cross the wall (carrying a large bag of “loot” for instance) suffers greater odds of showing up in one of his photos. The sentry might get 10 photos of slow intruders while missing hundreds of intruders who happen to be more nimble, thus getting over the wall more quickly. For all we know, the slow intruder is carrying bags of coal while the more nimble intruders are packing pockets full of diamonds. Which intruder matters more? It depends on how cold the sentry is, I guess 🙂 It’s actually a wee bit like that with software performance analysis. Catch me some time and I’ll explain why a CPI (cycles per instruction) of 1 is usually a bad thing. Ugh, I digress.

Tanel’s approach/kit works on a variety of operating systems and versions of Linux. It also does not require any elevated privilege. I only aim to discuss the differences.
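
One difference worth calling out: whether perf(1) requires elevated privilege to attach to another user’s process is governed by the kernel’s perf_event_paranoid tunable. A quick check, as a sketch (the exact semantics of the values vary by kernel version):

$ # Sketch: lower values permit more profiling by unprivileged users.
$ cat /proc/sys/kernel/perf_event_paranoid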

Summary
I’ve lightly covered the topic of performance-monitoring perturbation and have given a glimpse into what I envision will end up as a multi-part series on perf(1).




8 Responses to “Using Linux Perf(1) To Analyze Database Processes – Part I”


  1. goryszewskig February 18, 2013 at 12:36 pm

    Perfect timing, Kevin. Am I right in saying that perf would be useful in tracing mutex issues when things are going really fast?
    Regards
    GregG

  2. Tanel Poder (@TanelPoder) February 19, 2013 at 4:31 pm

    That’s why I said “poor-man’s stack profiler” in my post. Even though the core topic of my post was about drilling down into CPU usage, the important side-topic there was that you do not always need to install additional tools to troubleshoot something right when it happens. Often a text datasource combined with a couple of OS commands or AWK do the trick and you can adapt to changing demands – like just adding “sleep 1” between the stack samples in the loop.

    But perf itself is cool. Even though oprofile can do the same things, usability matters: a single command which is able to show the top vs a sequence of commands to initiate collection and then parse & report is going to win. But I still don’t get perf’s hierarchical profiler output logic; I think stack profiles are much more readable if done the way Windows ProcMon, Mac OS “Sample Process” or my oradebug-based OStackProf do it.

  3. Anantha March 24, 2013 at 10:23 am

    @Kevin, thanks for these posts. Now we’re back to the great technical articles that we’ve come to expect from your blog.


