Fellow Oak Table Network member Charles Hooper has undertaken a critical reading of a recently published book on the topic of Oracle performance. Some folks have misconstrued his coverage as being merely hyper-critical, but as Charles points out, his motive is simply to bring the content alive. It has been an interesting series of blog entries. I’ve commented on a couple of these blog posts, but as I began to comment on his latest installment I realized I should just do my own blog entry on the matter and refer back. The topic at hand is how “system time” relates to Oracle performance.
The quote from the book that Charles is blogging about reads:
System time: This is when a core is spending time processing operating system kernel code. Virtual memory management, process scheduling, power management, or essentially any activity not directly related to a user task is classified as system time. From an Oracle-centric perspective, system time is pure overhead.
To say “[…] any activity not directly related to a user task is classified as system time” is too simplistic to be correct. System time is the time processors spend executing code in kernel mode. Period. But therein lies my point. The fact is the kernel doesn’t do much of anything that is not directly related to a user task. It isn’t as if the kernel is running interference for Oracle. It is only doing what Oracle (or any user mode code for that matter) is driving it to do.
For instance, the quote lists virtual memory, process scheduling and so on. That list is really too short to make the point come alive. It is missing the key kernel internals that have to do with Oracle such as process birth, process death, IPC (e.g., Sys V semaphores), timing (e.g., gettimeofday()), file and network I/O, heap allocations, stack growth and page table internals (yes, Virtual Memory).
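If you want to see that user/kernel split for yourself, the per-process accounting is sitting right there in /proc on Linux. Here is a minimal sketch, not a polished tool, that reads a process's accumulated user-mode and kernel-mode ticks from /proc/&lt;pid&gt;/stat (field layout per proc(5)); point it at any Oracle shadow or background process PID:

```python
#!/usr/bin/env python3
# Minimal sketch: report user-mode vs kernel-mode CPU time for one process
# by reading /proc/<pid>/stat (Linux only; see proc(5) for the field layout).
import os
import sys

def cpu_times(pid):
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # comm (field 2) may contain spaces, so split after the closing ')'
    rest = data[data.rindex(")") + 2:].split()
    utime, stime = int(rest[11]), int(rest[12])   # overall fields 14 (utime) and 15 (stime)
    return utime, stime

if __name__ == "__main__":
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    hz = os.sysconf("SC_CLK_TCK")                 # clock ticks per second
    utime, stime = cpu_times(pid)
    total = utime + stime
    print(f"pid {pid}: user {utime / hz:.1f}s, system {stime / hz:.1f}s "
          f"({100.0 * stime / total if total else 0:.1f}% of CPU time in kernel mode)")
```

Run it against a busy foreground process during an I/O-heavy workload and the kernel-mode share tells you how much of that process's CPU is being spent on its behalf inside the kernel.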
In my opinion, anyone interested in the relationship between Oracle and an operating system kernel must read Section 8.1 of my friend James Morle’s book Scaling Oracle8i. In spite of the fact that the title sounds really out of date, it goes a long way toward making the topic at hand a lot easier to understand.
If this topic is of interest to you, feel free to open the following link and navigate down to Section 8.1 (page 417): Scaling Oracle8i (in PDF form).
How Normal Are You?
The quote on Charles’ blog entry continues:
From an Oracle-centric perspective, system time is pure overhead. It’s like paying taxes. It must be done, and there are good reasons (usually) for doing it, […]
True, processor cycles spent in kernel mode are a lot like a tax. However, as James pointed out in his book, the VOS layer, and the associated OSD underpinnings, have historically allowed for platform-specific optimizations. That is, the exact same functionality on one platform may impose a larger tax than on others. That is the nature of porting. The section of James’ book starting at page 421 shows some of the types of things that ports have done historically to lower the “system time” tax.
Finally, Charles posts the following quote from the book he is reviewing:
Normally, Oracle database CPU subsystems spend about 5% to 40% of their active time in what is called system mode.
No, I don’t know what “CPU subsystems” is supposed to mean. That is clearly a nickname for something. But that is not what I’m blogging about.
If you are running Oracle Database (any version since about 8i) on a server dedicated to Oracle and running on the hardware natively (not a Virtual Machine), I simply cannot agree with that upper-bound figure of 40%. That is an outrageous amount of kernel-mode overhead. I should think the best way to get to that cost level would be to use file system files without direct I/O. Can anyone with a system losing 40% to kernel mode please post a comment with any specifics about what is driving that much overhead and whether you are happy with the performance of your server?
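For anyone who wants to check where their own server sits before commenting, the arithmetic behind the user/system split that vmstat and mpstat report is simple. Here is a minimal sketch, Linux-specific, that samples the aggregate cpu counters in /proc/stat over a few seconds; the five-second sample interval is just an arbitrary choice for illustration:

```python
#!/usr/bin/env python3
# Minimal sketch: sample the aggregate "cpu" line in /proc/stat twice and
# report what share of the busy CPU time went to kernel mode (Linux only).
import time

def cpu_fields():
    with open("/proc/stat") as f:
        parts = f.readline().split()          # first line holds the aggregate counters
    # user, nice, system, idle, iowait, irq, softirq, steal
    return [int(x) for x in parts[1:9]]

before = cpu_fields()
time.sleep(5)
after = cpu_fields()

delta = [a - b for a, b in zip(after, before)]
user = delta[0] + delta[1]                    # user + nice
system = delta[2] + delta[5] + delta[6]       # system + irq + softirq
busy = sum(delta) - delta[3] - delta[4]       # everything except idle and iowait
if busy:
    print(f"user {100 * user / busy:.1f}%  system {100 * system / busy:.1f}% of busy CPU")
```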
I saw such a system (way more than 50% system time) once, some 6 or 7 years ago. IIRC, it was an 8.1.7 on AIX. My “working hypothesis” at the time was that maybe AIX was mis-accounting the time the AIO (asynchronous I/O) servers spent waiting on I/O as CPU time in kernel mode. This was based on a large discrepancy between CPU usage as reported by the instance, and what AIX was reporting for the Oracle processes.
In hindsight that hypothesis seems somewhat misguided, but I didn’t have the tools, time, and permission (production DB at a customer site) at the time to investigate deeper. I haven’t seen anything like this ever since.
I don’t have the reports to hand, but the closest I’ve come to that is on a server with Veritas filesystems and no Quick I/O or Storage Foundation for Oracle licences, configured with 12 (yes, 12!) DB writer processes, so no direct, async or even concurrent I/O, and performing a lot of very small transactions. Even there we only managed to get to about 30% kernel mode and still couldn’t persuade the original systems architect that we really did need to buy the filesystem options or move the whole thing to ASM to fix it. Interestingly, when we then implemented Data Guard the I/O on the (identical) standby server was even worse.
>Can anyone with a system losing 40% to kernel mode please post a comment …
Not 40%, but very close to:
>…with any specifics about what is driving that much overhead …
http://pastebin.com/VGMnQug9
>… and whether you are happy with the performance of your server?
The server isn’t mine. There are issues reported which include the word “performance”, but I don’t have any details (yet, probably).
We just solved this problem of over 40% system time.
It was running on Linux with NFS (NetApp) storage and no direct IO. Yep performance was terrible … Every time a large datafile IO operation occurred we saw a big performance hit.
We had someone add 2 large data files to the database in the middle of the day and the entire database froze while the OS wrote those files to disk.
Today we run on a fiber SAN with ASM and have less than 5% system time.
Hi David,
That sounds like throwing the baby out with the bathwater 😦 filesystemio_options=SETALL is fully supported on NFS so that would not have been an issue with this init.ora parameter set…
Surely you’ve read the Manly Man series 🙂
Joking aside, I fully understand cooking 40% in kernel mode with non-O_DIRECT NFS… no surprise at all… so, as the title of the blog post goes, I would not consider this normal 🙂
We did some testing with filesystemio_options; that was my first choice. But due to the nature of the underlying hardware, and admittedly inefficiencies in the application (full table scans etc.), the overall response times were better having the OS cache the files rather than pushing the load to the disk system. System time went down as expected, but overall response times went up. The NetApp was fairly old and could not keep up with the current load and use of the system. Overall activity on the application had seen huge growth over a 3-year period.
I guess it may come down to how you define “normal”. This performance profile on the system was normal to the people who designed, built and maintained the environment over a several year period. Sometimes normal evolves at a slow pace without anyone realizing it. For us, the good news is everyone is much happier with the new normal.
David
David,
Thanks for that. I see your point. It’s all too common for folks to mitigate poor storage by double-buffering in RAM. It’s a sad thing, but true and I understand how things get to that point. I consider it a normal action to take in an abnormal situation 🙂 But any system running with double-buffering shouldn’t be considered “normal.” Not in this day and age…
Hi,
The easiest way to achieve about 40–60% CPU system time is to overestimate the memory for Oracle or any other process. Swapping activity could easily increase system time to that level. Sad to say, but I have seen a few production Oracle servers running on 1 GB RAM machines with the SGA set to 900 MB. They were Windows boxes, so system time was reported in a different way, but it was still between 40 and 70%.
Performance? I’m not sure that is the correct word for that system’s behavior. Login time, AFAIK, was about 5 minutes.
regards,
Marcin
Hello Marcin,
Right. So, not “normal” as the title of the blog post goes …
The theme here is that upwards of 40% is an abnormality.
Here, I have seen some RAC instances taking about 45% system CPU (user CPU at 40%). They run on large SuperDome Itanium partitions with HP-UX. 45% was exceptional, but generally on these boxes system CPU represents half of the global CPU consumption.
Performance is fine most of the time.
My feeling is the system CPU consumption is caused by the clusterware stack we use (Oracle CRS + Veritas VCS/CFS/CVM/DMP). Cache fusion management, multi-pathed I/Os, clustered file systems
do not come for free …
Didier,
Thanks for that. CRS cycles are nil during runtime so it won’t be pushing the CPUs into kernel mode. Cache fusion (generically speaking) will drive CPUs into kernel mode for skgxp() and spend more time in kernel mode if the interconnect is UDP over GbE, that is for sure. The more CR overhead, the more dives into skgxp(), thus more time in the UDP stack and down-wind drivers. I have experience with Veritas multipathing so I can understand some cycles lost in kernel mode for the balancing (locking), especially if there are a lot of CPUs and a lot of HBAs and a lot of I/O of course. CFS with direct I/O (required with RAC) should not take many cycles. A CFS burns cycles when file metadata changes, not when the contents of files are read/written. Is there any of that Veritas Cached Quick I/O in play perhaps? Are there, perhaps, VxFS snapshots/clones in play? That could easily burn a lot of kernel-mode cycles for the COW overhead on a block change.
45% is huge! Are you folks happy with the application throughput with that much overhead?
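If you get the chance, it would be interesting to see exactly which processes are accumulating the kernel-mode time. On Linux that accounting sits in /proc; here is a minimal sketch (Linux only, so on your HP-UX boxes the equivalent would come from tools like Glance) that ranks processes by accumulated system time:

```python
#!/usr/bin/env python3
# Minimal sketch: rank processes by accumulated kernel-mode CPU ticks,
# read from /proc/<pid>/stat (Linux only).
import os

def comm_and_stime(pid):
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        return None                           # process exited between listdir and open
    comm = data[data.index("(") + 1:data.rindex(")")]
    rest = data[data.rindex(")") + 2:].split()
    return comm, int(rest[12])                # overall field 15 = stime

rows = []
for entry in os.listdir("/proc"):
    if entry.isdigit():
        row = comm_and_stime(int(entry))
        if row:
            rows.append((row[1], int(entry), row[0]))

# Top ten kernel-mode CPU consumers since process start
for stime, pid, comm in sorted(rows, reverse=True)[:10]:
    print(f"{stime:>10} ticks  pid {pid:<7} {comm}")
```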
Kevin, many thanks for the detailed answer.
http://docs.google.com/View?id=ddxr9kb_19gkhvbsgj
On this partition, we have 4 Oracle instances running; one of them got unusual activity for 3 hours which led to a sustained 45 % system CPU consumption.
At that time, response time for some services was degraded (but still within an acceptable range). AFAIK, users did not complain, and the box was still able to cope with the throughput. I must say it worked better than I expected.
Regarding the interconnect, I don’t think we use raw UDP. All the traffic is channeled through the lltd daemon, which is frequently one of the top system CPU consumers (the other ones are vxiod, vxfsd, vxglmd, crsd.bin and ocssd.bin, in addition to the normal Oracle processes). During the peak, we can correlate the system CPU consumption to network activity more than I/O activity (but of course we had both of them).
Regarding Veritas Cached Quick I/O, I cannot tell if we use this option or not – will check later. Also, I don’t think we had an active VxFS snapshot at peak time.
Anyway, I still consider this as an exceptional situation (so not “normal”), and this is not the CPU consumption we are usually comfortable with. That said, even if you exclude peak time on the above graph, I still find system CPU consumption quite high compared to user CPU consumption on this hardware.
Whatever system CPU limit you settle on to define “normality”, my opinion is you can add at least 10% for hardware with complex stacked clusterware or I/O subsystems.
Regards,
Didier.
I’ve seen 40% and higher on systems with large SGAs (more than 16G) and no hugepages configured.
-Bill
Hi BillT,
Most likely 16G with a lot of dedicated connections, right? That would be due to page table thrashing and perhaps desperation VM side-effects, I should think. So this scenario might also be difficult to classify as “normal” because we all know how critical hugepages are for large SGAs and large connection counts.
https://kevinclosson.wordpress.com/kevin-closson-index/2009/07/28/quantifying-hugepages-memory-savings-with-oracle-database-11g/
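For what it’s worth, a quick way to see the page table bloat, and whether hugepages are actually being used, is to pull a few counters out of /proc/meminfo. A minimal sketch, Linux only:

```python
#!/usr/bin/env python3
# Minimal sketch: print the page-table and hugepage counters from
# /proc/meminfo. A huge "PageTables" figure with a large SGA and many
# dedicated connections is the classic sign hugepages are not in use.
fields = ("PageTables", "HugePages_Total", "HugePages_Free", "Hugepagesize")
with open("/proc/meminfo") as f:
    for line in f:
        if line.split(":")[0] in fields:
            print(line.rstrip())
```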
Kevin,
Maybe I was thinking 40% or higher in system-mode was “normal” for a less than ideal configuration. In fact, up until I saw your site March of 09, I thought I had to accept such high levels of system-mode utilization. Your posts about hugepages proved to be a crucial setting I needed on my large database systems. I cannot thank you enough for that.
-Bill
Hi BillT,
I’m glad the information helped. And, let’s all please remember that I can’t put my tone of voice in the blog… My continual pressing on the word “normal” refers to the idea that a 40% kernel mode system could ever in fact be considered “normal.” Now, that aside, I do know that if a system is delivering on its SLA and has 40% or more kernel mode overhead, the issue is moot. The ultimate criterion is always whether the system is doing what it needs to do.
I’m hoping someone might chime in with experience based on a heavy UTL_FILE type deployment. There must be servers out there that lean on externally stored files buffered in the page cache that exhibit very high kernel mode overhead. Of course the remedy for that would be SecureFiles…
I’ve seen this with databases with a lot of dedicated connections and not necessarily large SGAs.
The fix was to enable shared connections (multi-threaded server).
Thanks for that NetComrade… so that would be a process switching (context switch) storm. Cycles in that realm are generally lost in kernel mode to locking associated with scheduling, and to the cache pounding it takes to run a process that hasn’t run for a while, which generally means the cache holds few or no PTEs and the process stack is likely cache-cold… so each switch is followed by kernel mode stalls to fill the cache… a CPU-busy, throughput-challenged scenario…
This would also not make one feel “normal” 🙂
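For the curious, a quick way to gauge whether you’re in that sort of storm is to sample the kernel’s context-switch counter for a few seconds. A minimal sketch (Linux only; the five-second sample is an arbitrary choice):

```python
#!/usr/bin/env python3
# Minimal sketch: sample the system-wide context-switch counter in
# /proc/stat ("ctxt" counts switches since boot) and report the rate.
import time

def ctxt():
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("ctxt"):
                return int(line.split()[1])

before = ctxt()
time.sleep(5)
after = ctxt()
print(f"{(after - before) / 5:.0f} context switches/sec")
```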
Can anyone with a system losing 40% to kernel mode…
…having to turn off direct NFS because of a flakey AIX implementation 😦
But hopefully, that will eventually become an abnormal situation 🙂
Hi Connor,
Thanks for stopping by. Sad to hear about the AIX dNFS situation. I wonder how bad these kernel NFS implementations are these days. Glenn Fawcett recently proved some 10GbE non-dNFS throughput using the Affinity Card Test Database schema (Winter Corporation… google it). Glenn got 1GB/s from 10GbE at **very** polite processor overhead…
I understand this posting was from June 2010, and it is now Feb 2011. I thought I would add that there are many other reasons for high kernel cpu, at least on Solaris 10.
First, make sure you are patched at the OS layer. We have run into some Solaris bugs that cause high system mutex calls, causing mutex spins.
Second, make sure you check your interrupt processing for your boards. There are ways to split your interrupt processing across resources within your CPUs, basically binding interrupt processing to only CPUs 1-4. If you have a 32-thread system, this provides ample resources for interrupt processing.
Also, check to see how high your context switching really is and which executable is causing it.
I have been starting to use Solaris DTrace to find and fix these issues and have found it to be a very worthwhile adventure.
Brian,
Right, so you are pointing out pathological scenarios that burn cycles in kernel mode. The post was about a publication that suggested 40% is some magical acceptable loss to kernel cycles. My point was, and is, that such a high amount of kernel mode overhead is ridiculous. Your comment brings up some ideas of broken things to look for if one is unhappy with the amount of CPU they are losing to kernel mode overhead. So the question is: what is your opinion of a reasonable amount of overhead? Do you side with the publication (topic of this post)?
>”I simply cannot agree with that upper-bound figure of 40%. That is an outrageous amount of kernel-mode overhead.”
I agree that 40% kernel is huge and NOT acceptable.
On a balanced system, kernel CPU utilization should always be less than 15% (IMHO), periodically spiking a bit higher. Anything over 20% usually means periods of saturation. Now, the causes are varied, and the solution to high kernel time certainly involves crossing the border from database tuning to system tuning and back again, eliminating all bottlenecks until the CPU is saturated. Purchase more CPU, re-tune, etc.
I do agree that kernel overhead is like a “tax” that must be paid.
So, my point is that if you are experiencing 40% kernel CPU, don’t take it lightly. Causes external to the database operations do exist, including bugs at the OS level and OS-level tuning and configuration issues – including interrupt binding, processor set/CPU binding, mutex issues, CPU migrations, swapping, ISM/DISM configuration, etc.
In the end, 40% kernel is not acceptable (unless you are out of money). If you complete a tuning cycle and the CPU is saturated, it’s probably time to buy more CPU, swap architectures (like moving to ASM), and review approaches.
Hope that clarifies.
I have seen pretty high CPU on 32-bit systems configured with PAE, especially on RAC using use_indirect_data_buffers=true and a large db_block_buffers parameter. Not sure it was 40% though. The most I have seen is 25-30%.
Ashish,
Yes, that sounds like a recipe for disaster in 2011.
Hey Kevin,
This was back in 2005-2007 on RHEL3 32 bit and Oracle 9i RAC. Of course, thankfully things are very different now in 2011 🙂
Hi Ashish,
Yeah, I figured you were speaking of a past-life system. For what it’s worth, the development platform (Sequent DYNIX/ptx) for the indirect data buffers functionality incurred very, very low overhead for exposing the indirect buffers into a process’s address space. That OS was very hardwired for it (and patented in that regard as well: http://www.patents.com/us-6055617.html ). There were a lot of systems in that era that supported “large” physical memories (e.g., 40 bits) but only a 32-bit user address space. So there was quite a race on to see which of Oracle’s partners could figure out the cheapest way to support large buffer capacity for the SGA in those scenarios.

Before virtwin(2SEQ) came into existence there were some Sequent folks outside of database engineering fumbling around with the idea of map/remap of the entire SGA. I reminded them that spinlocks (latch structures) reside in the SGA (fixed and variable sections) and that no system would be able to survive a system call under a spinlock, especially when the memory needing to be unmapped/remapped held the lock structure! So, one evening I was chatting with Brent (see the patent) and shared words to the effect of, “…if only the OS supported the ability to map/remap a small set of pages within a SysV IPC Shared Memory segment.” Off he went and implemented support for my wish list, just like all those Sequent kernel engineers did every time we in Database Engineering needed support from the OS for something that improved the platform for Oracle database. In my opinion they were the most talented kernel brain-trust of all time for SMPs, clustered Unix systems and NUMA.
The “Oracle ecosystem” was a lot different back then. Customers were #1 and partners were #2 on the list. The “list”, as it were, is a lot different these days. I can see plainly who is not #1, and #2 doesn’t even exist. Alone and angry is a bad place to be. I digress.
Years later the Linux port of Oracle showed up with a very poorly implemented version of similar functionality, called use_indirect_data_buffers, with extremely costly overhead. The OS was not hard-wired for it. Little details matter.
Thanks a lot Kevin for providing such wonderful background on the origins of use_indirect_data_buffers. Sounds like fun times :).
Yeah, around 2004/5 we got x86 machines with a lot of RAM, but the OSes for x86 were still 32-bit, and PAE/use_indirect_data_buffers was the only way out. Glad AMD came up with 64-bit extensions and Intel adopted them after.
Ashish,
Yes, we owe AMD a debt of gratitude. When Opterons came on the scene in 2003, Intel was churning out scant innovation in the x86 space, hoping instead to drive people to Itanium for high-end throughput and scalability. As we know, AMD Opterons were tremendously superior to the Xeons of that era. It wasn’t until about the Woodcrest Xeon (Xeon 5100) that Intel started to play catch-up. Of course Nehalem changed all that.
Industry competition is very good for consumers. Any IT solution that limits customer choice should be avoided like the plague. That’s one reason I joined the Greenplum team at EMC. Greenplum is a software-only solution but is also offered in an appliance for customers who prefer that route. Enterprise software that limits platform choice just doesn’t seem like the wave of the future to me. Didn’t the industry try the vendor lock-in approach all the way through the 1980s?
From an Oracle-centric perspective, system time is pure overhead.
Nothing wrong with that quote.
Oracle process = user process;
kernel process = system process.
Even if a user call invokes a system call, from Oracle’s perspective it is still “necessary” overhead.
It’s the necessary evil.
This occurred on a Solaris VM running Oracle 12c. Kernel went to 99%.