
Oracle11g Automatic Memory Management – Part III. A NUMA Issue.

Now I'm glad I did that series about Oracle on Linux, The NUMA Angle. In my post about the difference between NUMA and SUMA and "Cyclops", I shared a lot of information about the dynamics of Oracle running with all the SGA allocated from one memory bank on a NUMA system. Déjà vu.

Well, we’re at it again. As I point out in Part I and Part II of this series, Oracle implements Automatic Memory Management in Oracle Database 11g with memory mapped files in /dev/shm. That got me curious.

Since I exclusively install my Oracle bits on NFS mounts, I thought I'd sling my 11g ORACLE_HOME over to a DL385 I have available in my lab setup. Oh boy am I going to miss that lab when I take on my new job September 4th. Sob, sob. See, when you install Oracle on NFS mounts, the installation is portable. I install the 32bit Linux port from a 32bit server into an NFS mount and I can take it anywhere. In fact, since the database is on an NFS mount (HP EFS Clustered Gateway NAS) I can take ORACLE_HOME and the database mounts to any system running a RHEL4 OS-and that includes RHEL4 x86_64 servers even though the ORACLE_HOME is 32bit. That works fine, except 32bit Oracle cannot use libaio on 64bit RHEL4 (unless you invoke everything under the linux32 command environment, that is). I don't care about that since I use either Oracle Disk Manager or, better yet, Oracle11g Direct NFS. Note, running 32bit Oracle on a 64bit Linux OS is not supported for production, but for my case it helps me check certain things out. That brings us back to /dev/shm on AMD Opteron (NUMA) systems. It turns out the only Opteron system I could test 11g AMM on happens to have x86_64 RHEL4 installed-but, again, no matter.
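As an aside, the linux32 command environment I just mentioned is nothing exotic. Here is a minimal sketch of it; the ORACLE_HOME path and SID are made-up names for illustration:

# Give the 32bit Oracle stack an i386 personality on x86_64 RHEL4 so that
# anything checking uname -m sees i686 (path and SID are illustrative).
linux32 bash
export ORACLE_HOME=/mnt/ohome/11g/db_1
export ORACLE_SID=bench1
export PATH=$ORACLE_HOME/bin:$PATH
uname -m        # reports i686 inside this shell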

Quick Test

[root@tmr6s5 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3585 MB
node 1 size: 4095 MB
node 1 free: 3955 MB
[root@tmr6s5 ~]# dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024
1024+0 records in
1024+0 records out
[root@tmr6s5 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3585 MB
node 1 size: 4095 MB
node 1 free: 2927 MB

Uh, that’s not good. I dumped some zeros into a file on /dev/shm and all the memory was allocated from socket 1. Lest anyone forget from my NUMA series (you did read that didn’t you?), writing memory not connected to your processor is, uh, slower:

[root@tmr6s5 ~]# taskset -pc 0-1 $$
pid 9453's current affinity list: 0,1
pid 9453's new affinity list: 0,1
[root@tmr6s5 ~]# time dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
1024+0 records in
1024+0 records out

real    0m1.116s
user    0m0.005s
sys     0m1.111s
[root@tmr6s5 ~]# taskset -pc 1-2 $$
pid 9453's current affinity list: 0,1
pid 9453's new affinity list: 1
[root@tmr6s5 ~]# time dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
1024+0 records in
1024+0 records out

real    0m0.931s
user    0m0.006s
sys     0m0.923s

Yes, 20% slower.
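For what it's worth, the same local-versus-remote comparison can be made explicit with numactl(8) instead of taskset(1). This is only a sketch; on older numactl builds the option is spelled --cpubind rather than --cpunodebind:

# The pages backing /dev/shm/foo already live in node 1's memory, so only
# the CPU placement differs between these two runs.
time numactl --cpunodebind=0 dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc   # remote memory writes
time numactl --cpunodebind=1 dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc   # local memory writes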

What About Oracle?
So, like I said, I mounted that ORACLE_HOME on this Opteron server. What does an AMM instance look like? Here goes:

SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3587 MB
node 1 size: 4095 MB
node 1 free: 3956 MB
SQL> startup pfile=./amm.ora
ORACLE instance started.

Total System Global Area 2276634624 bytes
Fixed Size                  1300068 bytes
Variable Size             570427804 bytes
Database Buffers         1694498816 bytes
Redo Buffers               10407936 bytes
Database mounted.
Database opened.
SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 1331 MB
node 1 size: 4095 MB
node 1 free: 3951 MB

Ick. This means that Oracle11g AMM on Opteron servers is a Cyclops. Odd how this allocation came from memory attached to socket 0 when the file creation with dd(1) landed in socket 1’s memory. Hmm…

What to do? SUMA? Well, it seems as though I should be able to interleave tmpfs memory and use that for /dev/shm-at least according to the tmpfs documentation. And should is the operative word. I have been tweaking for a half hour trying to get the mpol=interleave mount option to work (with and without the -o remount technique) to no avail. Bummer!
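For the record, what I was attempting is the sort of thing shown below. It is only a sketch straight from the tmpfs documentation, and it requires a kernel whose tmpfs actually honors the NUMA policy mount options-which is presumably why it got me nowhere here:

# Recreate /dev/shm with interleaved page placement (kernel support required)
umount /dev/shm
mount -t tmpfs -o size=3584m,mpol=interleave shmfs /dev/shm
# ...or try to change the policy on the live mount
mount -o remount,mpol=interleave /dev/shm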

Impact
If AMD can’t get the Barcelona and/or Budapest Quad-core off the ground (and into high-quality servers from HP/IBM/DELL/Verari), none of this will matter. Actually, come to think of it, unless Barcelona is really, really fast, you won’t be sticking it into your existing Socket F motherboards because that doubles your Oracle license fee (unless you are on standard edition which is priced on socket count). That leaves AMD Quad-core adopters waiting for HyperTransport 3.0 as a remedy. I blogged all this AMD Barcelona stuff already.

Given the NUMA characteristics of /dev/shm, I think I'll test AMM versus MMM on NUMA, and then test again on SUMA-if I can find the time.

If anyone can get /dev/shm mounted with the mpol option, please let me know because, at times, I can be quite a dolt and I'd love this to be one of those times.

Using Oracle Clusterware for Non-RAC Purposes

In a recent post on the oracle-l list, a participant asked:

Hi, has anyone used 10.2 Clusterware with OCFS2 on RHEL5 to get single instance failover from one host to another?

My buddy Matt Zito (we’ve had beers before so we’re buddies) of GridApp followed up with:

I have a customer that does that – it apparently works very well […text deleted…]

However, the downside of CRS as single-instance is that both sides of the cluster need to be licensed for Oracle (as I understand the CRS license).


Licensing
Licensing is always a topic for interesting conversation. To get to the bottom of this, I sent an email to the first Oracle person I ever heard pitch the idea of CRS for non-RAC purposes-Marshall Presser. Hmm, I think I can call him my old buddy too since we also had beers. Or then again, if I'm not mistaken, Marshall is an old Pyramid Technology guy and since I am an old Sequent Computer Systems guy, we are sort of long-lost cousins. Anyway, back to the topic. Marshall was nice enough to send me a very current reference for Oracle's licensing terms for using CRS for non-RAC purposes with a quote from Oracle® Database Licensing Information 11g Release 1 (11.1) Part Number B28287-01:

Oracle Clusterware can be installed and used to protect any Oracle or third-party software provided any of the following conditions are met:

1. The software being protected is from Oracle.

2. The software being protected uses an Oracle Database.

3. The software being protected is running on Oracle Unbreakable Linux.

4. The software being protected is running in a cluster where at least one machine involved in the cluster is licensed using the appropriate metric for either Oracle Database Enterprise Edition or Oracle Database Standard Edition. A cluster is defined to include all the machines that share the same Oracle Cluster Registry (OCR) and Voting Disk

Unclear Clarity
So, as is usually the case with licensing, we have unclear clarity. And, yes, I know this is 11g information and the original query was about 10g, but it stands to reason that with some digging there would be a 10g equivalent. I wonder why criterion 1 above is stated. Since only one criterion needs to be met, I suppose we can interpret the terms as follows:

  • You can use CRS on Unbreakable Linux for anything you want (rule 3)
  • You can protect non-RAC Oracle databases on any platform (rule 1)
  • You can protect any software that connects to an Oracle database on any platform (rule 2)
  • You can protect anything on any cluster as long as one node in the cluster is running an instance of EE or SE (rule 4)

These are pretty liberal rules. I think Oracle is keen on widespread adoption of Oracle Clusterware for general purpose HA, but then I could be misreading the tea leaves.

What Does This Really Mean?
What we're talking about here is using CRS to monitor ("check" in CRS parlance, "probe" in generic industry terms) an instance of Oracle and take action if the check fails. In general failover HA terms, probes (checks in CRS terms) fail as follows:

  1. The server is up but the database is down
  2. The server is down

Failover
In case 1 above, the HA engine will restart the database and in case 2 it will fail the database over to another server. The HA engine (in this case CRS) is smart enough to fail the service over to a system that is actually alive and has functional disk access and network interfaces. That is one of the roles of any HA clusterware (e.g., CRS, SteelEye, VCS, Service Guard, HACMP, Red Hat Cluster Suite, PolyServe, etc).

Time Outs
The other way the HA engine will take action is if your probe (check script/program) seizes (times out). In that case, most HA engines will execute a "restart" action, which is generally a stop action followed by a start action and another probe (check). This is not an endless loop though. Most HA engines have a tunable maximum for retries (restart attempts in CRS) and then they fail over to the defined backup server. Be aware, however, that a seized service (such as a non-RAC database instance) could be so locked up that it didn't stop when the HA engine tried its restart action. In that case, you have Oracle processes with files open. If you fail over to a server that accesses the database on a shared filesystem such as NFS or OCFS, you have some things to be concerned about. You won't be able to start the instance until the $ORACLE_HOME/dbs/lk${ORACLE_SID} file is removed, but simply removing it still leaves that other catatonic instance up on the ill server. These solutions can become complex.
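To make that concern concrete, here is a rough sketch of the sort of logic a failover start action ends up needing when ORACLE_HOME lives on a shared filesystem. The SID and path are made up, and removing the lock file is only safe once you are certain the failed node's instance is truly dead:

# Hypothetical fragment of a start action for a shared-ORACLE_HOME failover
export ORACLE_HOME=/u01/app/oracle/product/11/db_1
export ORACLE_SID=bench1
export PATH=$ORACLE_HOME/bin:$PATH
LKFILE=$ORACLE_HOME/dbs/lk${ORACLE_SID}
if [ -f "$LKFILE" ]; then
        # The ill node may still have a catatonic instance with files open;
        # removing the lock does nothing about that, so verify it is dead first.
        echo "Stale instance lock $LKFILE found"
        rm -f "$LKFILE"
fi
sqlplus / as sysdba <<EOF
startup
exit;
EOF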

The topic of what probe (check) actions are appropriate is the subject matter of very long, drawn-out discussions rife with theory and prejudice. I've been there and I've done that. I bet most folks that use CRS to start/stop and check non-RAC databases will use the script interface. Note, as with all HA engines out there, you can write a C probe (or CRS action program) because all the engine is looking for is a return code (success/failure).

I think the most clever probe action I’ve heard to date came from fellow OakTable Network member Tim Gorman. Tim once suggested that a great probe action would be to make a purposeful failed attempt to connect such as:

$ sqlplus foo/bar <<EOF 2>&1 | grep 1017
> REM There is no user called foo...expect ORA 1017
> exit;
> EOF
ORA-01017: invalid username/password; logon denied
$ echo $?
0

If you get anything other than ORA-01017, something is ill. In this case, a success for grep(1) is a success for the probe/check. That is, if grep(1) gets its text, the server returned ORA-01017 and thus the instance was well enough to perform the functionality of user authentication. Your check script would get this in grep(1)'s return code ($?).

Trying to connect as a bogus user actually tests quite a bit of server functionality (SQL parsing, user authentication and so forth). I think this may actually create a temporary session as well. It certainly tests the server’s ability to fork(2) sqlplus and exec(2) $ORACLE_HOME/bin/oracle so you are testing the OS VM, process slots, etc.  All in all, it is a very clever probe (check action). If you wanted to use CRS to check both the health of SQL*Net and a non-RAC database instance, then you could do this same bogus connect attempt through the listener. If the listener is down, you’ll get the appropriate error text. Then again, if you wanted to make a heavy probe/check, you could connect as an application user and update a dummy row in a table or something like that. The sky is the limit with this sort of HA kit.
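A bare-bones check action built around Tim's bogus-connect idea might look something like the following sketch. The SID, path, and the 0/1 exit convention are my own illustrative choices; all CRS really needs back is success or failure:

#!/bin/bash
# Hypothetical CRS check action: ORA-01017 means the instance is alive and
# well enough to parse SQL and authenticate, so report success.
export ORACLE_SID=bench1
export ORACLE_HOME=/u01/app/oracle/product/11/db_1
export PATH=$ORACLE_HOME/bin:$PATH

sqlplus -L foo/bar <<EOF 2>&1 | grep -q 'ORA-01017'
exit;
EOF
if [ $? -eq 0 ]; then
        exit 0    # healthy
else
        exit 1    # no ORA-01017 came back: down, hung or otherwise ill
fi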

Additional Material
Oracle has more information in the form of whitepapers:

Oracle11g Automatic Memory Management – Part II. Automatically Stupid?

Oracle Database 10g Automatic Memory Management (AMM) might have been this smart, but I don’t know. I’m playing a bit with Oracle Database 11g AMM and find that so far it is a pretty smart cookie-at least in the sense I’m blogging about in this entry.

Automatic So Long As You Do Everything Correctly 
One thing I always hate is a feature that causes the server either to not function at all or to function in some degraded state if the configuration is not perfect for the feature. In my mind, if something is called automatic it should be automatic. Like I say, don't make me kneel on peach pits to benefit from an automatic feature. So how about a quick test?

I’ve got 11g x86 on a Proliant DL380 fit with 4GB RAM. As I mention in this blog entry, Oracle uses memory mapped files in /dev/shm for the SGA when you use 11g AMM. On a 4GB system, the default /dev/shm size is about half of physical memory. I want to set up a larger, custom size and see if 11g AMM will try to cram 10lbs of rocks into a 5lb bag.

# umount /dev/shm
# mount -t tmpfs shmfs -o size=3584m /dev/shm
# df /dev/shm
Filesystem           1K-blocks      Used Available Use% Mounted on
shmfs                  3670016         0   3670016   0% /dev/shm

There, I now have 3.5GB of space for /dev/shm. I don't want to use that much because leaving Linux with .5GB for the kernel will likely cause some chaos. I want to see if 11g AMM is smart enough to allocate just enough to fill the virtual address space of the Oracle processes. So, I set AMM larger than I know will fit in the address space of a 32-bit Linux process:

SQL> !grep MEMORY amm.ora
MEMORY_TARGET=3500M
MEMORY_MAX_TARGET=3500M

So what happened? Well, I haven’t relocated the SGA so I shouldn’t expect more than about 2GB. I wouldn’t expect more than about 1.7GB for buffers. Did AMM try to over-allocate? Did it get nervous and under-allocate? Did it tell me to help it be more automatic through some configuration task I need to perform? Let’s see:

SQL> startup pfile=./amm.ora
ORACLE instance started.

Total System Global Area 2276634624 bytes
Fixed Size                  1300068 bytes
Variable Size             570427804 bytes
Database Buffers         1694498816 bytes
Redo Buffers               10407936 bytes
Database mounted.
Database opened.

Nice, AMM was smart enough to pile in about 1.6GB of buffers and the appropriate amount of variable region to go with it. A look at DBWR's address space shows that the first 16MB /dev/shm granule file (my term) was mapped in at virtual address 512MB. The last 16MB segment fit in at 2688MB. The top of that last granule is 2704MB. If I subtract the sum of 2276634624 bytes (show sga) and 512MB (the attach address) from that 2704MB, I'm left with a little over 20MB, which is most likely Oracle rounding up for page alignment and other purposes.

# pmap `pgrep -f dbw` | grep 'dev.shm' | head -1
20000000      4K r-xs-  /dev/shm/ora_bench1_1441803_0
# pmap `pgrep -f dbw` | grep 'dev.shm' | tail -1
a8000000  16384K rwxs-  /dev/shm/ora_bench1_1507341_7
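As a convenience, the /dev/shm footprint in DBWR's address space can be totalled straight from the same pmap(1) pipeline; just a quick sketch:

# Sum the Kbytes column for all /dev/shm mappings in DBWR's address space
pmap `pgrep -f dbw` | grep 'dev.shm' | awk '{ sum += $2 } END { print sum " KB of /dev/shm mapped" }'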

Summary
I don’t really expect folks to be running Oracle in production on 32bit Linux servers in modern times, but I was pleasantly surprised to see 11g AMM is smart enough to poke around until it filled the address space. I asked for more than it could give me (MEMORY_TARGET=3500M) and instead of failing and suggesting I request less automatic memory, it did the right thing. I like that.

Oracle11g Automatic Memory Management – Part I. Linux Hugepages Support.

I spent the majority of my time in the Oracle Database 11g Beta program testing storage-related aspects of the new release. To be honest, I didn’t even take a short peek at the new Automatic Memory Management feature. As I pointed out the other day, Tanel Poder has started blogging about the feature.

If you read Tanel’s post you’ll see that he points out AMM-style shared memory does not use hugepages. This is because AMM memory segments are memory mapped files in /dev/shm. At this time, the major Linux distributions do not implement backing memory mapped files with hugepages as they do with System V-style IPC shared memory. The latter supports the SHM_HUGETLB flag passed to the shmget(P) call. It appears as though there was an effort to get hugepages support for memory mapped pages by adding MAP_HUGETLB flag support for the mmap(P) call as suggested in this kernel developer email thread from 2004. I haven’t been able to find just how far that proposed patch went however. Nonetheless, I’m sure Wim’s group is more than aware of that proposed mmap(P) support and if it is really important for Oracle Database 11g Automatic Memory Management, it seems likely there would be a 2.6 Kernel patch for it someday. But that begs the question: just how important are hugepages? Is it blasphemy to even ask the question?

Memory Mapped Files and Oracle Ports
The concept of large page tables is a bit of a porting nightmare. It will be interesting to see how the other ports deal with OS-level support for the dynamic nature of Automatic Memory Management. Will the other ports also use memory mapped files instead of IPC Shared Memory? If so, they too will have spotty large page table support for memory mapped files. For instance, Solaris 9 supported large page tables for mmap(2) pages, but only if it was an anonymous mmap (e.g., a map without a file) or a map of /dev/zero-neither of which would work for AMM. I understand that Solaris 10 supports large page tables for mmap(2) regions that are MAP_SHARED mmap(2)s of files-which is most likely how AMM will look on Solaris, but I’m only guessing. Other OSes, like Tru64-and I’m quite sure most others-don’t support large page tables for mmap(2)ed files. This will be interesting to watch.

Performance, Large Page Table, Etc
I remember back in the mid-90s when Sequent implemented shared large page tables for IPC Shared memory on our Unix variant-DYNIX/ptx. It was a very significant performance enhancement. For instance, 1024 shadow processes attached to a 1GB SGA required 1GB of physical memory-for the page tables alone! That was significant on systems that had very small L2 caches and only supported 4GB physical memory. Fast forward to today. I know people with Oracle 10g workloads that absolutely seize up their Linux (2.6 Kernel) systems unless they use hugepages. Now I should point out that these sites I know of have a significant mix of structured and unstructured data. That is, they call out to LOBs in the filesystem (give me SecureFiles please). So the pathology they generally suffered without hugepages was memory thrashing between Oracle and the OS page cache (filesystem buffer cache). The salve for those wounds was hugepages since that essentially carves out and locks down the memory at boot time. Hugepages memory can never be nibbled up for page cache. To that end, benefiting from hugepages in this way is actually a by-product. The true point behind hugepages is not the fact that the memory is reserved at boot time, but the fact that CPUs don't have to thrash to maintain the virtual to physical translations (TLB). In general, hugepages are a lot more polite on processor caches and they reduce RAM overhead for page tables. Compared to the mid 1990s, however, RAM is about the least of our worries these days. Manageability is the most important and AMM aims to help on that front.
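To put a number on that page table claim, the back-of-the-envelope arithmetic (assuming 4KB pages and 4-byte page table entries, which is roughly what that era of hardware used) goes like this:

1GB SGA / 4KB per page           =  262,144 pages to map
262,144 pages x 4 bytes per PTE  =  1MB of page table per attached process
1MB x 1024 shadow processes      =  1GB of physical memory for page tables alone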

Confusion
Of all things Oracle and Linux, I think one of the topics that gets mangled the most is hugepages. The terms and knobs to twist run the gamut. There's hugepages, hugetlb, hugetlbfs, hugetlbpool and so on. Then there are the differences from one Linux distribution and Linux kernel to the next. For instance, you can't use hugepages on SuSE unless you turn off vm.disable_cap_mlock (need a few double negatives?). Then there is the question of boot-time versus /proc or sysctl(8) to reserve the pages. Finally, there is the fact that if you don't have enough hugepages when you boot Oracle, Oracle will not complain-you just don't get hugepages. I think Metalink 361323.1 does a decent job explaining hugepages with old and recent Linux in mind, but I never see it explained as succinctly as follows:

  1. Use OEL 4 or RHEL 4 with Oracle Database 10g or 11g
  2. Set oracle hard memlock N in /etc/security/limits.conf where N is a value large enough to cover your SGA needs
  3. Set vm.nr_hugepages in /etc/sysctl.conf to a value large enough to cover your SGA (see the sketch below).
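Here is a minimal sketch of steps 2 and 3. The 2GB SGA figure and the oracle user name are just placeholders, and remember that memlock values in limits.conf are expressed in KB:

# /etc/security/limits.conf -- let the oracle user lock a 2GB SGA
#   oracle   soft   memlock   2097152
#   oracle   hard   memlock   2097152
# /etc/sysctl.conf -- reserve 1024 x 2MB hugepages (2GB) for the SGA
#   vm.nr_hugepages = 1024
sysctl -p                   # apply now; heavily fragmented memory may need a reboot
grep Huge /proc/meminfo     # confirm HugePages_Total before booting the instance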

Further Confusion
Audited TPC results don't help. For instance, on page 125 of this Full disclosure report from a recent Oracle10g TPC-C, there are listings of sysctl.conf and lilo showing the setting of the hugetlbpool parameter. That would be just fine if this were a RHEL3 benchmark, but vm.hugetlbpool doesn't exist in RHEL4.

Performance
I admit I haven't done a great deal of testing with AMM, but generally a quick I/O-intensive OLTP test on a system with 4 processor cores utilized at 100% speaks volumes to me. So I did just such a test.

Using an order-entry workload accessing the schema detailed in this Oracle Whitepaper about Direct NFS, I tested two configurations:

Automatic Memory Management (AMM). Just like it says, I configured the simplest set of initialization parameters I could:

UNDO_TABLESPACE=rb1
UNDO_MANAGEMENT = AUTO
compatible = 10.1.0.0
control_files                  = ( /u01/app/oracle/product/11/db_1/rw/DATA/cntlbench_1 )
db_block_size                   = 4096
MEMORY_TARGET=1500M
db_files                        = 100
db_writer_processes = 1
db_name                         = bench
processes                       = 200
sessions                        = 400
cursor_space_for_time           = TRUE  # pin the sql in cache
filesystemio_options=setall

Manual Memory Management (MMM). I did my best to tailor the important SGA regions to match what AMM produced. In my mind, for an OLTP workload the most important SGA regions are the block buffers and the shared pool.

UNDO_TABLESPACE=rb1
UNDO_MANAGEMENT = AUTO
compatible = 10.1.0.0
control_files                  = ( /u01/app/oracle/product/11/db_1/rw/DATA/cntlbench_1 )
db_block_size                   = 4096
#MEMORY_TARGET=1500M
db_cache_size = 624M
shared_pool_size=224M
db_files                        = 100
db_writer_processes = 1
db_name                         = bench
processes                       = 200
sessions                        = 400
cursor_space_for_time           = TRUE  # pin the sql in cache
filesystemio_options=setall

The following v$sgainfo output shows just how closely configured the AMM and MMM cases were.

AMM:

SQL> select * from v$sgainfo ;

NAME                                  BYTES RES
-------------------------------- ---------- ---
Fixed SGA Size                      1298916 No
Redo Buffers                       11943936 No
Buffer Cache Size                 654311424 Yes
Shared Pool Size                  234881024 Yes
Large Pool Size                    16777216 Yes
Java Pool Size                     16777216 Yes
Streams Pool Size                         0 Yes
Shared IO Pool Size                33554432 Yes
Granule Size                       16777216 No
Maximum SGA Size                 1573527552 No
Startup overhead in Shared Pool    83886080 No

NAME                                  BYTES RES
-------------------------------- ---------- ---
Free SGA Memory Available                 0

MMM:

SQL> select * from v$sgainfo ;
NAME                                  BYTES RES
-------------------------------- ---------- ---
Fixed SGA Size                      1302592 No
Redo Buffers                        4964352 No
Buffer Cache Size                 654311424 Yes
Shared Pool Size                  234881024 Yes
Large Pool Size                           0 Yes
Java Pool Size                     25165824 Yes
Streams Pool Size                         0 Yes
Shared IO Pool Size                29360128 Yes
Granule Size                        4194304 No
Maximum SGA Size                  949989376 No
Startup overhead in Shared Pool    75497472 No

NAME                                  BYTES RES
-------------------------------- ---------- ---
Free SGA Memory Available                 0

The server was an HP DL380 with 4 processor cores and the storage was an HP EFS Clustered Gateway NAS. Before each test I did the following:

  1. Restore Database
  2. Reboot Server
  3. Mount NFS filesystems
  4. Boot Oracle

Before the MMM case I set vm.nr_hugepages=600 and after the database was booted, hugepages utilization looked like this:

$ grep Huge /proc/meminfo
HugePages_Total:   600
HugePages_Free:    145
Hugepagesize:     2048 kB

So, given all these conditions, I believe I am making an apples-apples comparison of AMM to MMM where AMM does not get hugepages support but MMM does. I think this is a pretty stressful workload since I am maxing out the processors and performing a significant amount of I/O-given the size of the server.
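Verifying which configuration actually lands on hugepages is straightforward, by the way; a quick sketch:

grep Huge /proc/meminfo     # HugePages_Free only drops in the MMM (System V shm) case
ipcs -m                     # MMM: one large System V shared memory segment for the SGA
ls -l /dev/shm              # AMM: granule files backed by ordinary 4KB tmpfs pages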

Test Results
OK, so this is a very contained case and Oracle Database 11g is still only available on x86 Linux. I hope I can have the time to do a similar test with more substantial gear. For the time being, what I know is that losing hugepages support for the sake of gaining AMM should not make you lose sleep. The results measured in throughput (transactions per second) and server statistics are in:

Configuration   OLTP Transactions/sec   Logical IO/sec   Block Changes/sec   Physical Read/sec   Physical Write/sec
AMM                   905                   36,742             10,195              4,287               2,817
MMM                   872                   36,411             10,101              4,864               2,928

Looks like 4% in favor of AMM to me and that is likely attributed to the 13% more physical I/O per transaction the MMM case had to perform. That part of the results has me baffled for the moment since they both have the same buffering as the v$sgainfo output above shows. Well, yes, there is a significant difference in the amount of Large Pool in the MMM case, but this workload really shouldn't have any demand on Large Pool. I'm going to investigate that further. Perhaps an interesting test would be to reduce the amount of buffering the AMM case gets to force more physical I/O. That could bring it more in line. We'll see.

Summary
I'm not saying hugepages are no help across the board. What I am saying is that I would weigh heavily the benefits AMM offers, because losing hugepages might not make any difference for you at all. If it is, in fact, a huge problem across the board, then it looks like there has been work done in this area for the 2.6 Kernel and it seems reasonable that such a feature (hugepages support for mmap(P)) could be implemented. We'll see.

This Is a SPAM Blog Entry

An old Sueqnet bdudy of mnie snet the fowolling to me tdoay

I cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The phaonmneal pweor of the hmuan mnid, aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it dseno’t mtaetr in waht oerdr the ltteres in a wrod are, the olny iproamtnt tihng is taht the frsit and lsat ltteer be in the rghit pclae. The rset can be a taotl mses and you can sitll raed it whotuit a pboerlm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

…hmmm…he must be a regular reader of my blog I guess 🙂

Oracle Database 11g Automatic Memory Management

Our OakTable Wunderkind, Tanel Poder, has started blogging about Oracle Database 11g Automatic Memory Management. Do pay his site a visit! In fact, his site is worth a regular visit.

Multi-Core Processors? Lots, and Lots of Cores!

One of my fellow OakTable Network members sent me email about these guys today. I had never heard of them. But then I'm no sleuth since I didn't hear about PANTA Systems until they audited a very nice Infiniband-enabled TPC-H last year. I'm talking about TILERA. Today TILERA announced a 64-core x86 compatible processor. Since this processor is marketed for embedded applications such as network switches and so on I won't bore you with a long blog entry. One point I found very interesting about this processor, however, is that it has 2 10GbE interfaces on the chip itself as opposed to jumping over PCI. Now that is one heck of a way to hose down the cores!

Come to think of it, maybe this is worth blogging. I did see recently that Intel has an 80-core (not x86 compatible though) chip in demo state. Sun, of course, has the Niagara and now there is this 64-core chip where each core can run its own copy of Linux. Even though the TILERA isn't intended for servers, it represents technology that doesn't seem that far off. Interesting stuff.

 

Over-Configuring DBWR Processes – Part IV

In Part II of this small series of blog entries on DBWR, a reader made note of the fact that the test I was using suffered log file switch (checkpoint incomplete) wait events and wanted an explanation as to why those were happening. I’d like to answer that and cover a couple of mistakes I made as well.

When I configured the OLTP database for that set of tests I sized the online redo logs small enough to ensure I'd get some log switches into the overall mix as well. My rationale for that move was to take measurements at the varying DBWR counts with a workload that put DBWR under as much duress as possible. To that end, some may consider the test too artificial and honestly I don't blame them. After all, the remedy for log file switch (checkpoint incomplete) is to either make the redo logs larger or address DBWR's ability to flush buffers. Of course the approach for the latter would be either a faster I/O subsystem or improving DBWR's ability to do his job-which is what this thread is about. So that begs the question: what are the numbers if redo logs are large enough to rule out checkpoint bottlenecks? Good question. After changing my redo logs to very large sizes I ran another set of tests that led me to believe that on systems with a small number of processor cores and a normal DBWR workload (e.g., reasonable checkpoints) there is no appreciable difference in throughput when going with as many as 1 DBWR per processor core. The results surprised me…read on…

Admissions of Error
The first of two admissions of error that is. After reading my own blog entries on this thread a couple of times I believe I need to clarify some things. I’ve stated matter-of-factly that over-configuring DBWR processes will impact performance. Well, that is true as my test example showed but is it a completely true statement? No. I met a guy once that couldn’t admit when he was wrong. I didn’t like him much.

Here is the situation: over-configuring DBWR processes is not going to impact performance if your workload isn't a challenge for DBWR functionality. I have myriads of proof that normal DBWR workloads can be handled by multiple DBWR processes without any problem. It's more likely the case that if your system has unnecessary multiple writers and you reconfigure to a single writer you won't see any difference. That's because multiple writers in and of themselves are not a bad thing. And most production systems do not torture the I/O subsystem like I do here. My intent was simply to dig into the topic a bit deeper. That leads me to admission number 2.

Earlier in this series I mentioned the cache effect of multiple writers. Well, I was digging around some audited TPC-C full disclosure reports and noticed the practice of setting db_block_checksum=FALSE. This jarred my memory to the fact that since Oracle9i, DBWR has in fact been peeking at block contents by default. Let me explain. There was a day when DBWR’s accesses to memory differed significantly from that of foreground processes. Those days were pre-Oracle9i and in those releases DBWR didn’t peek at the contents of buffers. Instead, DBWR’s work consisted of figuring out what to write, writing it and performing the necessary housekeeping so that the written buffer would have the correct state. To that end, DBWR processes manipulated much more SGA metadata than did foreground processes. While that is still true to a large degree, Oracle9i changed that situation because DBWR now peeks at buffer contents.

With Oracle9i and later releases, DBWR processes peek at the contents of the buffers to calculate the block checksum and then store the value in the block itself. When re-read, the checksum is calculated by the reading process to ensure the block hadn't been corrupted after it was flushed from the SGA. Now don't get me wrong, these buffer reads are nowhere near as data intensive as what foreground processes do for a living. Consider, for instance, a block of data manipulated by a foreground process servicing a payroll application. The foreground process exercises myriads of code that DBWR never would (SQL, transaction layer, etc), touches shared resources DBWR never would (shared pool, redo, etc) and most importantly, manipulates the row pieces in the block. When DBWR flushes the buffer with block check summing, it will peek at the contents of the block, which requires the memory to be installed into DBWR's processor cache. That wasn't the case "in the old days." DBWR used to only manipulate metadata.

With db_block_checksum=FALSE, DBWR doesn't peek at the contents of the block. When DBWR simply writes a buffer (without peeking into it), the contents of that buffer do not get installed into the processor cache. The page is in DBWR's address space, but that is not the same as a memory access. Don't get me wrong, I'm not saying db_block_checksum=TRUE is miserably expensive, because I haven't tested the difference on modern systems so I don't know. To be honest, I wouldn't run a database without it because it sure seems like good insurance. I would think systems with really large processor caches would see the most performance hit, but that would just be my best guess. After all, most commodity systems have small processor caches that are simply overwhelmed long before any of these concepts come into play. As for huge systems, well, I've unfortunately been out of that space for a few years. It's good to have friends, so I'll quote the comment my old buddy Glenn Fawcett made on Part I of this series:

I agree… One DBWR process per cpu is not sage advice.

On large NUMA based servers, such as the Sun E25K, Oracle will by *default* create one dbwr process per system board. This allows the DBWR process to search LRUs out of local memory which is very efficient… Kevin could explain the benefits of this far better than I could.

The best advice I could give on “db_writer_processes” is don’t set it AT ALL. Let Oracle figure out the best setting.

Glenn knows really big systems due to his role in Sun’s SAE and other roles.

I think all this was my long-winded way to say that it is good to dive into DBWR topics, but it is best to remember that things like db_block_checksum defaults change from time to time. Further, there should be no panic over having multiple writers and finally, mistakes happen.

Oracle Enterprise Linux: Making Oracle on Linux Even Simpler? Introducing the oracle-validated-1.0.0-4.el4.x86_64.rpm Package

I currently don’t have any systems I can point at Oracle’s Unbreakable Linux Network. If I did, I’d get a copy of the new oracle-validated-1.0.0-4.el4.x86_64.rpm package. This package aims to collect all the necessary system RPMs and install them as well as configure all the sundry system parameters required for Oracle. According to this Oracle Linux Validated Configuration description, the package does the following:


 Additional packages from Oracle:

  # This package automatically pulls in package dependencies and sets
  # system parameters recommended above.  It is not required but is
  # recommended in order to save time in getting the system setup.  It is
  # available from ULN by subscribing to the "Oracle Software" channel and
  # then using up2date to install it.  The dependent packages will be
  # installed and parameters will be adjusted as part of this up2date process.
  oracle-validated-1.0.0-4.el4.x86_64.rpm [ Optional package, but recommended ]

If anyone happens to have a copy lying around and wouldn't mind shooting me a copy, I'd like to rpm2cpio(8) it and take a peek. I've been hoping for quite some time that Oracle Enterprise Linux would include some things that make Oracle on Linux simpler. This could be a start.
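If a copy does land in my inbox, the peek will be nothing fancier than this sort of thing:

# Unpack the payload without installing it, then look at what it would do
rpm2cpio oracle-validated-1.0.0-4.el4.x86_64.rpm | cpio -idmv
rpm -qp --scripts  oracle-validated-1.0.0-4.el4.x86_64.rpm   # pre/post install scripts
rpm -qp --requires oracle-validated-1.0.0-4.el4.x86_64.rpm   # the dependency pull-in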

This One’s Not About Oracle

(Photo: sunset0001.jpg)

Over-Configuring DBWR Processes – Part III

Sometimes the comment thread on a blog entry is important enough to warrant yet another blog entry. This is one of those times. In my blog entry called Over-Configuring DBWR Processes Part II I offered some test results showing that it really is best to keep a single writer until such time as that single writer becomes CPU-bound. But first a disclaimer.

DISCLAIMER: There are ports of Oracle where specialized multiple DBWR code is implemented for specific hardware architecture (e.g., NUMA). I ought to know, I worked on one of the very first (if not the first) to do so with the ports to Sequent DYNIX/ptx.

There, disclaimer aside, I am talking about the Linux ports of 10g and 11g which have no such hardware specialization.

The first question in that comment thread was:

From what’s written here, it looks like you only tested 1 and 8. Is it possible that the global maximum is actually somewhere between? Maybe 2 processes is slightly faster than 1, for example?

Great question with a simple answer: use 1 writer as long as that single writer has ample bandwidth. That is a truth, but I figured I’d do some runs to milk out a fact or two. I configured 2 DBWR processes and re-ran the workload I describe in Part II of this series. I followed that up with 4 DBWR processes. Hey, it’s all auto-pilot stuff with my setup so doing this was no big deal. Let’s compare the results. Just to summarize, the previous configurations performed as follows:

Configuration   OLTP Throughput (TPM)   Aggregate DBWR CPU Usage (seconds)
1 DBWR                  62.6                         291
8 DBWR                  57.9                         256

And now with the 2 and 4 DBWR cases:

Configuration   OLTP Throughput (TPM)   Aggregate DBWR CPU Usage (seconds)
1 DBWR                  62.6                         291
2 DBWR                  58.7                         273
4 DBWR                  58.4                         269
8 DBWR                  57.9                         256

The way this goes is that more DBWR processes impact throughput. On the other hand, more throughput will require more DBWR work so the optimal case of 1 DBWR process will take more CPU, but remember that it is also getting more throughput. I only provide these CPU numbers to show what percentage of a single CPU any given configuration utilizes. I don’t think it is necessary to run the 6 DBWR case. But what about a wildcat configuration such as multiple DBWR processes that have hard processor affinity?

Processor Affinity and Oracle Background Processes
Well, I'd never bind an Oracle background process to a single processor (core) for obvious reasons, but with the Linux taskset(1) command it is possible to bind a running process to a socket. For example, the following command binds pid 13276 to the second socket of a multi-socket system:

# taskset -pc 2-3 13276

Building on that idea, you can grab DBWR processes (or any process for that matter) once they are running and pin them to a CPU core or a set of CPU cores. The following stupid bash shell function can be used to do this sort of thing. Note, since there are tens of thousands of skilled computer programmers out there these days, all shell scripts are considered stupid of course. The function takes an argument that it then uses in its grep regular expression for globbing ps(1) output. An example would be:

aff ora_dbw

aff()
{
local GLOB="$1"         # pattern used to pick the target processes out of ps(1) output
local CORE=0
local TWIN=0

# Pin each matching PID to a pair of cores (one socket on this box),
# round-robining through the 8 cores.
ps -ef | grep $GLOB | grep -v grep | awk '{ print $2 }' | while read PID
do
        (( TWIN = CORE + 1 ))

        set -x
        sudo taskset -pc ${CORE}-${TWIN} $PID
        set +x

        (( CORE = CORE + 2 ))
        [[ $CORE = 8 ]] && CORE=0 && TWIN=0
done
}

I used this script to affinity 2 DBWR processes to a single socket and the numbers didn’t change. Neither did throwing 2 DBWR processes one each to two different sockets. The fact simply remains that DBWR doesn’t benefit from having peers unless it is CPU-bottlenecked.

Yet Another Excellent RAC Install Guide

Tim Hall sent me email to point me to a recent step-by-step install tip he produced for Oracle11g with NAS storage (NFS).  In the email he asked me if I had any experience with the new Oracle11g Direct NFS (DNFS) feature. The answer is, yes, I have a lot of DNFS experience as I hinted to with my blog post entitled Manly Men Only Deploy Oracle With Fibre Channel Part VI. Introducing Oracle11g Direct NFS. If I haven’t plugged the “Manly Man” series lately I am doing so again now. I think anyone interested in storage with an Oracle slant would take interest.  The full series of Manly Man posts can be found easily through my index of CFS/NFS/ASM Topics as well as this entry about a recent Oracle 300GB TPC-H result. That TPC-H result is very interesting-especially if you are trying to get out of SAN/DAS/NAS rehab. Yes, that was supposed to be humorous.

Back to the point. Here is the link to Tim’s (very good) step-by-step Oracle Database 11g RAC on Linux setup for NFS environments. I especially liked the mention of Direct NFS since I think it is very important technology as my jointly-authored Oracle Whitepaper on the topic should attest.

Oracle Clusterware and Fencing…Again?

Coming back from vacation and failing to catch up on oracle-l list topics is a bad mistake. I was wondering why Kirk McGowan decided to make a post about fencing in the context of Oracle Clusterware. After finally catching up on my oracle-l backlog, I see that the stimulus for Kirk’s blog entry was likely this post to the oracle-l list where the list member was asking whether Oracle Clusterware implements STONITH as its fencing model. It seems the question was asked after the list member watched this Oracle webcast about RAC where slide 11 specifically states:

IO Fencing via Stonith algorithm (remote power reset)

The list member was conflicted over the statement in Oracle’s webcast. It seems he had likely seen my blog entry entitled RAC Expert or Clusters Expert where I discuss the clusters concept of fencing. In that blog entry, and in the paper I reference therein, I point out that Oracle Clusterware doesn’t implement STONITH because it doesn’t. Oh boy, there he goes again contradicting Oracle. Well, no, I’m not. The quote from Oracle’s webcast says they implement their fencing using a “STONITH algorithm” and they do. The bit about remote power reset is splitting hairs a bit since the way the fenced node excuses itself from the cluster is by executing an immediate shutdown (e.g., Linux reboot(8) command). Kirk correctly points out that the correct term is actually suicide. Oracle uses algorithms common to STONITH implementations to determine what nodes need to get fenced. When a node is alerted that it is being “fenced” it uses suicide to carry out the order.

What Time Is It?
From about 2003 through 2005 I had dozens of people ask me for in-depth clusters concepts information with both a generic view and an Oracle-centric view-I was working for a clustering company and had a long history of clustered Oracle behind me after all. It seems people were getting confused as to why there were options to use vendor-integrated host clusterware on all platforms except Linux and Windows. People wanted to better understand both generic clustering concepts as well as Oracle Clusterware. It seems some merely wanted to "know what time it is" while others wanted to "know how to tell time" and some even wanted to know "how the clock works." About the time Oracle implemented the Third Party Clusterware Validation Program I decided I needed to write a paper on the matter, so I did and posted it on the OakTable Network site. In the paper, and my blog post, I point out that Oracle Clusterware is not STONITH-clinically speaking-and indeed it isn't. STONITH requires healthy servers to take action against ill servers via:

  • Remote Power Reset. This technology is not expensive, nor spooky. In fact, here is a network power switch for $199 that allows SNMP commands to power cycle outlets. Academic (and some commercial) approaches use these sorts of devices when implementing clusters. A healthy server will simply issue an SNMP command to power off the ill server. Incidentally, not all servers that run Oracle even have a power cord (think blades) and some don't even use AC (see Rackable's DC-powered servers) so Oracle couldn't use this approach without horrible platform-specific porting issues.
  • Remote System Management. There are a plethora of remote system management technologies (e.g., power cycle a server remotely) such as DRAC, IPMI, iLO, ALOM, RSC. Oracle is not crazy enough to tailor their fencing requirements around each of these. What a porting nightmare that would be. Oracle has stated more than once that there are no standards in this space and thus no useable APIs. The closest thing would have been either IPMI or OPMA, but the industry hasn’t seemed to want a cross-platform standard in this space.

The lack of standards where cluster fencing is concerned leaves us with a wide array of vendor clusterware such as Service Guard, HACMP, VCS, PolyServe, Red Hat Cluster Suite and on and on. I had a lot of Oracle customers asking me to inform them of the fundamental differences between these various clusterware and Oracle’s Clusterware so I did.

Gasp, Oracle Doesn’t Implement STONITH!
Henny Penny: the sky is falling. So Oracle doesn't really implement STONITH. So what. That doesn't mean nobody wants to understand the general topic of clustering-and fencing in particular-a little bit better. It would not be right to tell them that their quest for information is moot just because other cluster approaches are not embodied in Oracle Clusterware. However, the importance of Oracle's choice of fencing method is probably summed up best in that oracle-l email thread, which dried up and died within 24 hours after another member posted the following:

Has anyone see a RAC data corruption due to Clusterware unable to shoot itself?


I can assure you all that if anyone reading the oracle-l list had such a testimonial we would have heard it. The oracle-l list membership is huge and there are also a lot of consultants on the list who have contacts with a lot of production sites. The thread dried up, dropped to the ground and died. I think what I just wrote mirrors Kirk McGowan’s position on the matter.

What Would It Take
No clustering approach is perfect. Whether STONITH, fabric fencing or suicide, clusters can melt down. That is, after all, why Oracle offers an even higher level of protection in their Maximum Availability Architecture through such technology as DataGuard.

What would it take for an Oracle Clusterware fencing breach, and why would I blog such a taboo topic? It takes a lot of unlikely (yet possible) circumstances, and I blog it because some people want to know. With Oracle Clusterware, a fencing breach would require:

  • Failed Suicide. If for any reason Oracle’s Clusterware process is not able to successfully execute a software reboot of the ill server.
  • Hangcheck Failure. The hangcheck kernel module executes off a kernel timer. If the system is so ill that these kernel timer events are not getting triggered, then that would mean hardclock interrupts are not working and I should think the system would likely PANIC. All told, a PANIC is really just as good as the hangcheck timer succeeding. Nonetheless, it is possible that such a situation could arise.

A Waste of My Time
So over the last few years I spent a little time explaining clusters concepts to people with Oracle in mind. In my writings I discussed such topics as fencing, kernel mode/user mode clusterware, skgxp(), skgxn() and a host of other RAC-related material. Was it a waste of my time? No. Do I agree with Kirk McGowan’s post? Yes. Most importantly, however, I hang my hat on the oracle-l thread that dead-ended when the last poster on the thread asked:

Has anyone see a RAC data corruption due to Clusterware unable to shoot itself?

…and then there was silence.

Over-Configuring DBWR Processes – Part II.

In my recent blog entry about over-configuring DBWR processes, I mentioned the processor cache effect of having too many DBWR processes doing work that a single DBWR process could handle. I’d like to provide more detail on the matter.

Backgrounder
Few people really understand what DBWR does for a living. Yes, everyone knows DBWR is responsible for flushing modified buffers from the SGA, but that is really high level. DBWR:

  1. Builds “write batches” by examining the state of buffer headers on the LRU lists. It scans the LRU end considering the age of the buffer. At instance boot time, each DBWR process is assigned a set of LRUs to tend to.
  2. Manipulates cache buffers chains elements for such reasons as marking the buffer busy when it is in flight for an I/O and marking it as written after the flush operation.
  3. Performs I/O using the OSDs that are available to it. OSDs are routines in the Oracle kernel that are Operating System (port) specific. These are the routines through which Oracle gets support for asynchronous and/or direct I/O via calls to various libraries (e.g., libc, libaio, libodm, etc).
  4. Posts waiters, such as foreground processes waiting in a free buffer wait event.

The problem with over-configuring DBWR processes is two-fold:

  1. Processor Cache Thrashing. DBWR processes have huge working sets since they deal with so many LRU and cache buffers chains structures. Having more of these processes than necessary migrating from CPU to CPU places a huge dent in the L2 caches with both page table and data footprint.
  2. Scheduling anomalies. If you have more than 1 DBWR process, you run the increased risk of having DBWR processes that can't get on CPU. Consider a checkpoint. All LRUs will have modified buffers that need to be flushed during a checkpoint. Any given LRU can only be cleaned by the boot-time assigned DBWR process. That means that a checkpoint requires action from all DBWR processes. It's easier for 1 process to get scheduled on a CPU on, say, an 8 CPU system than it is for 8 processes.

Point 2 above is the part that I suspect most folks aren't aware of. I think folks envision that the various DBWR processes will only get busy if their peers aren't keeping up. That is not the case. LRUs get evenly loaded by foreground processes because when a process needs a buffer (for file I/O, sort buffer, block cloning, etc) it first tries the LRU latch for the last LRU it used (improves cache warmth) and if it can't get that latch with a no-wait get, it jumps to a different latch and does another no-wait get. This continues until it either obtains an LRU latch on a no-wait get or it loops around to the first attempted latch and does a wait-get. I recall in older releases of Oracle the next LRU to try was the very next LRU. That could have changed to something like a random selection but I don't know (I haven't seen the code in years but that's about to change). No matter, the LRUs are evenly loaded. Since LRUs are evenly loaded, that means that DBWR processes are equally tasked. I'm not very smart, but this I know:

On SMP systems it is not wise to break up a task that a single CPU can do and distribute the work evenly amongst multiple CPUs. You introduce processor scheduling overhead and you hammer the processor caches.

How true, and even more so when the processes share resources as DBWR processes do-they “share” cache buffers chains and other goodies. Anyway, it might be smart to do such task decomposition on realtime or other specialized systems, but not Oracle systems. Let’s take a closer look.

Moderate OLTP Load, db_writer_processes=8
I just did an OLTP run on my DL585 8-core system with db_writer_processes set to 8. After the instance was booted, but before the Pro*C OLTP clients were triggered, I did a ps(1) command grepping for all the DBWR processes. As soon as the 900 second run was complete, the driver script did the same ps(1) command. All output goes to nohup.out so I can grep for DBWR processes and see how much CPU time they consumed during the run:

$ grep ora_dbw nohup.out
oracle   14507     1  2 13:46 ?        00:00:01 ora_dbw0_bench1
oracle   14509     1  2 13:46 ?        00:00:01 ora_dbw1_bench1
oracle   14511     1  2 13:46 ?        00:00:02 ora_dbw2_bench1
oracle   14513     1  1 13:46 ?        00:00:01 ora_dbw3_bench1
oracle   14515     1  1 13:46 ?        00:00:01 ora_dbw4_bench1
oracle   14517     1  2 13:46 ?        00:00:01 ora_dbw5_bench1
oracle   14519     1  1 13:46 ?        00:00:01 ora_dbw6_bench1
oracle   14521     1  2 13:46 ?        00:00:01 ora_dbw7_bench1
oracle   14507     1  2 13:46 ?        00:00:23 ora_dbw0_bench1
oracle   14509     1  2 13:46 ?        00:00:23 ora_dbw1_bench1
oracle   14511     1  2 13:46 ?        00:00:23 ora_dbw2_bench1
oracle   14513     1  2 13:46 ?        00:00:23 ora_dbw3_bench1
oracle   14515     1  2 13:46 ?        00:00:23 ora_dbw4_bench1
oracle   14517     1  2 13:46 ?        00:00:23 ora_dbw5_bench1
oracle   14519     1  2 13:46 ?        00:00:23 ora_dbw6_bench1
oracle   14521     1  2 13:46 ?        00:00:23 ora_dbw7_bench1

The ps(1) output shows us that each of the 8 DBWR processes accumulated 23 seconds of CPU. This clearly reflects the fact that all LRUs are loaded evenly and that each DBWR has an equal part in keeping them clear. So what’s the point? Well, 8 DBWR processes accumulating 184 CPU seconds (23*8) during a 7200 (8 * 900) CPU-second OLTP run means this DBWR workload could have been serviced by a single CPU. In fact 184 CPU seconds is only 20% of a single CPU core during a 900 second run, yet by configuring 8 DBWR processes we spread that workload over multiple CPUs. I know, you’re thinking this must be a pretty lightly loaded system, right? No, the processors hover around 50% utilization with peaks of 70%. The datafile I/O rate was some 6,980 IOPS of which 4,548/second were writes:

Load Profile                            Per Second       Per Transaction
~~~~~~~~~~~~                       ---------------       ---------------
                  Redo size:          3,222,269.99              4,570.66
              Logical reads:             66,770.61                 94.71
              Block changes:             18,097.81                 25.67
             Physical reads:              2,342.71                  3.32
            Physical writes:              4,548.36                  6.45
                 User calls:             11,795.67                 16.73
                     Parses:              3,597.63                  5.10
                Hard parses:                  0.27                  0.00
                      Sorts:                  1.00                  0.00
                     Logons:                  0.00                  0.00
                   Executes:             10,121.28                 14.36
               Transactions:                704.99

Moderate OLTP Load, db_writer_processes=1
Let’s take a look at the same workload after a shutdown, database restore and reboot of the instance:

$ grep ora_db noh*
oracle   15652     1  0 14:23 ?        00:00:00 ora_dbw0_bench1
oracle   15652     1 17 14:23 ?        00:02:55 ora_dbw0_bench1

What's that? A single DBWR was able to service the same workload with only 175 CPU seconds? That's the same work with 5% fewer processor cycles. But what did it do to throughput? Well, this is the medium model of the benchmark where the highest processor utilization peaks were roughly 70%. Reducing CPU utilization on a system that has ample CPU cycles to spare doesn't buy much. In fact, throughput only improved 2.4% from 46.3 Transactions Per Minute (TPM) to 47.4 TPM. Some would say that is within the margin of statistical error, but I bet those same people don't have a setup where a database restore followed by an instance reboot will produce benchmark results within .5% like this setup does. Let's just say it's calibrated. Have you really read this far to hear about a paltry 2.4% difference in OLTP throughput though? No, that's not all. The impact of too many DBWR processes is scheduling overhead and poor processor cache efficiency, so we need to look at a run with higher CPU utilization.

Heavy OLTP Load, db_writer_processes=8
The following ps(1) data shows us that with the increased OLTP load the 8 DBWR processes each consumed 32 CPU seconds for an aggregate of 256 CPU seconds. That's still much less than the amount of cycles a single CPU had to offer-28% of a single CPU in fact.

$ grep ora_db noh*
oracle   20763     1  0 15:59 ?        00:00:00 ora_dbw0_bench1
oracle   20765     1  0 15:59 ?        00:00:00 ora_dbw1_bench1
oracle   20767     1  0 15:59 ?        00:00:00 ora_dbw2_bench1
oracle   20769     1  0 15:59 ?        00:00:00 ora_dbw3_bench1
oracle   20771     1  0 15:59 ?        00:00:00 ora_dbw4_bench1
oracle   20773     1  0 15:59 ?        00:00:00 ora_dbw5_bench1
oracle   20775     1  0 15:59 ?        00:00:00 ora_dbw6_bench1
oracle   20777     1  0 15:59 ?        00:00:00 ora_dbw7_bench1
oracle   20763     1  3 15:59 ?        00:00:32 ora_dbw0_bench1
oracle   20765     1  3 15:59 ?        00:00:32 ora_dbw1_bench1
oracle   20767     1  3 15:59 ?        00:00:32 ora_dbw2_bench1
oracle   20769     1  3 15:59 ?        00:00:32 ora_dbw3_bench1
oracle   20771     1  3 15:59 ?        00:00:32 ora_dbw4_bench1
oracle   20773     1  3 15:59 ?        00:00:32 ora_dbw5_bench1
oracle   20775     1  3 15:59 ?        00:00:32 ora_dbw6_bench1
oracle   20777     1  3 15:59 ?        00:00:32 ora_dbw7_bench1

So here again the DBWR workload would “fit” within a single CPU. With this configuration, the OLTP throughput was 57.9 TPM. How does that compare to the same workload running against an instance with a single DBWR process?

Heavy OLTP Load, db_writer_processes=1
The following ps(1) output shows that a single DBWR process was able to service the same OLTP load with 291 CPU seconds, still much less (32%) than the bandwidth of a single CPU (there are 900 CPU seconds per core during the test).

$ grep ora_db noh*
oracle   21426     1  0 16:26 ?        00:00:00 ora_dbw0_bench1
oracle   21426     1 29 16:26 ?        00:04:51 ora_dbw0_bench1

But what about performance? With the DBWR count reduced to 1, OLTP throughput increased 8%, from 57.9 to 62.6 TPM. What was the I/O load like though? Here is a snippet from statspack:

Load Profile                            Per Second       Per Transaction
~~~~~~~~~~~~                       ---------------       ---------------
                  Redo size:          5,918,344.96              4,355.71
              Logical reads:            125,684.76                 92.50
              Block changes:             33,622.21                 24.74
             Physical reads:              2,066.64                  1.52
            Physical writes:              8,483.06                  6.24
                 User calls:             21,996.51                 16.19
                     Parses:              6,737.41                  4.96
                Hard parses:                  0.27                  0.00
                      Sorts:                  1.03                  0.00
                     Logons:                  0.00                  0.00
                   Executes:             18,852.22                 13.87
               Transactions:              1,358.76

We see that with a single DBWR, the OLTP workload pushed through some 10,549 IOPS of which 8,483 per second were writes. That means a single DBWR on fairly outdated AMD Opteron processors (DL585) can service 8,483 writes per second with only 32% of a single CPU. So why configure more than that?
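
If you would rather derive that IOPS figure from a saved report than add up the load profile lines by hand, a little awk does it. The file name sp_load_profile.txt is hypothetical; point it at wherever you saved the statspack report text:

$ awk '/Physical (reads|writes):/ { gsub(",", "", $3); iops += $3 }
       END { printf "approx datafile IOPS: %.0f\n", iops }' sp_load_profile.txt
approx datafile IOPS: 10550

That is simply the per-second column summed: 2,066.64 reads plus 8,483.06 writes from the single-DBWR report above.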

Where Did The Time Go?
Were there any significant waits in the 8 DBWR case? Yes, there were. Below are the Top 5 Timed Events from both configurations under the heavy load; let’s see if anyone can guess which report is which (a quick interactive way to pull the same waits between runs is sketched after the two reports):

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
CPU time                                                     3,148          35.8
log file sync                                1,207,863       2,905      2   33.1
log file switch (checkpoint incomplete)          4,627         828    179    9.4
db file scattered read                         172,724         670      4    7.6
log file parallel write                        532,373         464      1    5.3
          -------------------------------------------------------------

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
CPU time                                                     2,858          31.0
log file sync                                1,076,670       2,509      2   27.2
log file switch (checkpoint incomplete)          4,808       1,939    403   21.0
db file scattered read                         173,566         664      4    7.2
log file parallel write                        491,346         419      1    4.5
          -------------------------------------------------------------
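
Those two reports came from statspack, but if you just want to spot-check the same waits interactively between runs, a query along these lines against v$system_event does the trick. This is only a sketch; note that time_waited and average_wait are reported in centiseconds:

$ sqlplus -s / as sysdba <<'EOF'
set linesize 120
column event format a40
select event, total_waits,
       time_waited / 100 as seconds_waited,
       average_wait * 10 as avg_ms
from   v$system_event
where  event in ('log file sync',
                 'log file switch (checkpoint incomplete)',
                 'log file parallel write')
order  by time_waited desc;
EOF

Since each run starts with a freshly bounced instance, the cumulative numbers line up well enough with the per-run statspack deltas for a sanity check.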

Summary
Did he really write this long a blog entry to discuss an 8% OLTP throughput increase? Well, yes and no. The effect of processor-cache thrashing will be more severe on larger SMP systems; this test system has only 4 sockets (8 cores). I’ve seen crazy DBWR configurations on 32 and 64 CPU systems account for more than 8%. And even if 8% were a constant, who wants to lose as much as 3 to 5 CPUs’ worth of bandwidth in today’s IT environment?

Feels Like A Sound Bite

Interesting to see this eWEEK.com interview finally show up. It sure sounds different when it is presented there, as if it were the complete conversation. In fact, it was a small part of a long set of interviews back in February 2007. I wish now that I had been clearer about what ‘production’ means. There are production applications at the core of the enterprise and there are production applications at the edge. Any application that can be developed and deployed using an open source database would most certainly not require Oracle’s carrier-grade edition (Enterprise Edition) were it to be developed and deployed with Oracle. That is, any application that can be deployed with an open source database could be deployed with a limited-feature edition of Oracle such as Standard Edition (or perhaps even Standard Edition One). Conversely, it is unreasonable to think that one could migrate enterprise-class applications to any of the open source offerings out there (or even most of the closed-source ones, for that matter). I think I’ve made that point pretty clear in my thread about EnterpriseDB for warehousing and many other posts here in the blog.

