Archive for the 'oracle' Category



RASUI and RAC Polls. Acronyms are Cool.

Not much to blog about. No, actually, I have an Olympic-sized swimming pool full of things to blog about, but I can’t. No time.

Since I’m working on a future Oracle product I’ve got RASUI on my mind. I thought this video seemed appropriate.

Polls Are Useless Without Critical Mass
Have all you readers participated in my (yet another) RAC poll?  By the numbers I’d say not. Please take a moment and visit the poll. Thanks.

The Death of Row-Oriented RDBMS Technology.

Well, I’ve been in Oracle Server Technologies for a whopping week and in that short time I’m reassured of two constants:

  • Blogging takes time (that I don’t have)
  • Focusing on a single product, and more specifically a feature within a single product suite, makes a guy very pin-point minded

But, alas, I still need to stay well-rounded. So, thanks to a friend (and VP at a company I once worked for) for alerting me to the fact that Michael Stonebraker and others are blogging at The Database Column.

Yes, I’ll be reading that one. I don’t know if I’d recommend it to my readers though. That may sound odd, but most of my readers are practitioners of real, existing technology filling a production purpose. I expect the stuff on that blog to be laced with theory. Now, having said that, I have to remind myself that I believe well over 90% of my blog content is not exactly what one would call “ready to use” information.

All that aside, I just thought I’d poke my head up from the hole I’m in and make a quick blog entry. OK, there, I did…now it is back into the dungeon…but wait.

First Installment
Just because I said I was going to read it doesn’t imply I’m going to take it as some gulp from the fountain of omniscience.

For instance, in Michael Stonebraker’s post about row versus column orientation, he states:

[…] Vertica can be set-up and data loaded, typically in one day. The major vendors require weeks. Hence, the “out of box” experience is much friendlier. Also, Vertica beats all row stores on the planet – typically by a factor of 50. This statement is true for software only row stores as well as row stores with specialized hardware (e.g. Netezza, Teradata, Datallegro). The only engines that come closer are other column stores, which Vertica typically beats by around a factor of 10.

In my opinion that looks cut and pasted from a Vertica data sheet-but I don’t know. One thing I do know is that it seems unwise to say things like “all” and “typically by a factor of” in the same sentence. And this bit about “set-up and loaded” taking “weeks” for the “major vendors” is just plain goofy.

That’s my opinion. No, hold it, that’s my experience. I’ve loaded Oracle databases sized in the tens of terabytes that certainly didn’t take weeks!

Oracle ACE(s) Who Nominate Themselves…Oh, That Again?

This is a link to an Oracle ACE who didn’t nominate himself.

Databases are the Contents of Storage. Future Oracle DBAs Can Administer More. Why Would They Want To?

I’ve taken the following quote from this Oracle whitepaper about low cost storage:

A Database Storage Grid does not depend on flawless execution from its component storage arrays. Instead, it is designed to tolerate the failure of individual storage arrays.

In spite of the fact that the Resilient Low-Cost Storage Initiative program was decommissioned along with the Oracle Storage Compatability Program, the concepts discussed in that paper should be treated as a barometer of the future of storage for Oracle databases-with two exceptions: 1) Fibre Channel is not the future and 2) there’s more to “the database” than just the database. What do I mean by point 2? Well, with features like SecureFiles, we aren’t just talking rows and columns any more and I doubt (but I don’t know) that SecureFiles is the end of that trend.

Future Oracle DBAs
Oracle DBAs of the future become even more critical to the enterprise since the current “stove-pipe” style IT organization will invariably change. In today’s IT shop, the application team talks to the DBA team who talks to the Sys Admin team who tlks to the Storage Admin team. All this to get an application to store data on disk through a Oracle database. I think that will be the model that remains for lightly-featured products like MySQL and SQL Server, but Oracle aims for more. Yes, I’m only whetting your appetite but I will flesh out this topic over time. Here’s food for thought: Oracle DBAs should stop thinking their role in the model stops at the contents of the storage.

So while Chen Shapira may be worried that DBAs will get obviated, I’d predict instead that Oracle technology will become more full-featured at the storage level. Unlike the stock market where past performance is no indicator of future performance, Oracle has consistently brought to market features that were once considered too “low-level” to be in the domain of a Database vendor.

The IT industry is going through consolidation. I think we’ll see Enterprise-level IT roles go through some consolidation over time as well. DBAs who can wear more than “one hat” will be more valuable to the enterprise. Instead of thinking about “encroachment” from the low-end database products, think about your increased value proposition with Oracle features that enable this consolidation of IT roles-that is, if I’m reading the tea leaves correctly.

How to Win Friends and Influence People
Believe me, my positions on Fibre Channel have prompted some fairly vile emails in my inbox-especially the posts in my Manly Man SAN series. Folks, I don’t “have it out”, as they say, for the role of Storage Administrators. I just believe that the Oracle DBAs of today are on the cusp of being in control of more of the stack. Like I said, it seems today’s DBA responsibilities stop at the contents of the storage-a role that fits the Fibre Channel paradigm quite well, but a role that makes little sense to me. I think Oracle DBAs are capable of more and will have more success when they have more control. Having said that, I encourage any of you DBAs who would love to be in more control of the storage to look at my my post about the recent SAN-free Oracle Data Warehouse. Read that post and give considerable thought to the model it discusses. And give even more consideration to the cost savings it yields.

The Voices in My Head
Now my alter ego (who is a DBA, whereas I’m not) is asking, “Why would I want more control at the storage level?” I’ll try to answer him in blog posts, but perhaps some of you DBAs can share experiences where performance or availability problems were further exacerbated by finger pointing between you and the Storage Administration group.

Note to Storage Administrators
Please, please, do not fill my email box with vitriolic messages about the harmony today’s typical stove-pipe IT organization creates. I’m not here to start battles.

Let me share a thought that might help this whole thread make more sense. Let’s recall the days when an Oracle DBA and a System Administrator together (yet alone) were able to provide Oracle Database connectivity and processing for thousands of users without ever talking to a “Storage Group.” Do you folks remember when that was? I do. It was the days of Direct Attach Storage (DAS). The problem with that model was that it only took until about the late 1990s to run out of connectivity-enter the Fibre Channel SAN. And since SANs are spokes attached to hubs of storage systems (SAN arrays), we wound up with a level of indirection between the Oracle server and its blocks on disk. Perhaps there are still some power DBAs that remember how life was with large numbers of DAS drives (hundreds). Perhaps they’ll recall the level of control they had back then. On the other hand, perhaps I’m going insane, but riddle me this (and feel free to quote me elsewhere):

Why is it that the industry needed SANs to get more than a few hundred disks attached to a high-end Oracle system in the late 1990s and yet today’s Oracle databases often reside on LUNs comprised of a handful of drives in a SAN?

The very thought of that twist of fate makes me feel like a fish flopping around on a hot sidewalk. Do you remember my post about capacity versus spindles? Oh, right, SAN cache makes that all better. Uh huh.

Am I saying the future is DAS? No. Can I tell you now exactly what model I’m alluding to? Not yet, but I enjoy putting out a little food for thought.

Some Blog Changes

Blog Changes
Now that I’ve left HP/PolyServe, I’ve gotten a few emails from readers looking for files that I linked to that resided on HP/PolyServe systems. I’ve also gotten quite a few emails about pages found by search engines, but don’t load (Error 404) once you come to my site. As for the latter, there are a good number of posts that I’ve taken offline. Sorry, but that is just how things go. As for the files content, I’ll try to make them available on the OakTable website. The first to get this treatment is the Silly Little Benchmark that I spoke of in some of my NUMA-related posts such as Oracle on Opteron with Linux–The NUMA Angle. Introducing the Silly Little Benchmark.

I’ll be fixing those refrences. In the meaintime, here is a link to the Silly Little Benchmark.

A New Adventure

There is a difference between jumping ship and making an orderly transfer from one ship to another. I tend to think of jumping ship as something one does when it is sinking. I’m not jumping ship.

For the last 6 years at PolyServe, and 10 years before that at Sequent Computer Systems, I’ve had the honor to work with some of the brightest software folks around. See, the brain power behind PolyServe is a lot of the heavy-hitting Unix kernel engineers from Sequent Computer Systems. They left after IBM bought Sequent. I joined them shortly thereafter, leaving Veritas to do so. I think it is fitting to quote my long time friend and fellow OakTable Network member, James Morle. In Oracle Insights: Tales of the Oak Table, James says:

…around the same time Sequent appeared with a good story, great attitude, and some of the best software technicians I have worked with from a hardware company.

I know what he means. But I’m leaving. I did my part at PolyServe, HP acquired us. Getting taken up in a corporate takeover means a new adventure. When faced with a new adventure it sometimes makes sense to open up to other, sometimes better, opportunity. This is one of those cases.

Next week I start a new chapter. I’ll be taking a role in Oracle’s Server Technologies group. Fortunately for me, I’ve had pretty close ties with a lot of these folks dating back to the mid-1990s and my work in Sequent’s Advanced Oracle Engineering group.

So, I’m transferring to a different ship. A ship that I am very familiar with. A ship with crew that I respect and with whom it’ll be an honor to work.

What Is Good Throughput With Oracle Over NFS?

The comment thread on my blog entry about the simplicity of NAS for Oracle got me thinking. I can’t count how many times I’ve seen people ask the following question:

Is N MB/s good throughput for Oracle over NFS?

Feel free to plug in any value you’d like for N. I’ve seen people ask if 40MB/s is acceptable. I’ve seen 60, 80, name it-I’ve seen it.

And The Answer Is…
Let me answer this question here and now. The acceptable throughput for Oracle over NFS is full wire capacity. Full stop! With Gigabit Ethernet and large Oracle transfers, that is pretty close to 110MB/s. There are some squeak factors that might bump that number one way or the other but only just a bit. Even with the most hasty of setups, you should expect very close to 100MB/s straight out of the box-per network path. I cover examples of this in depth in this HP whitepaper about Oracle over NFS.

The steps to a clean bill of health are really very simple. First, make sure Oracle is performing large I/Os. Good examples of this are tablespace CCF (create contiguous file) and full table scans with port-maximum multi-block reads. Once you verify Oracle is performance large I/Os, do the math. If you are not close to 100MB/s on a GbE network path, something is wrong. Determining what’s wrong is another blog entry. I want to capitalize on this nagging question about expectations. I reiterate (quoting myself):

Oracle will get line speed over NFS, unless something is ill-configured.

Initial Readings
I prefer to test for wire-speed before Oracle is loaded. The problem is that you need to mimic Oracle’s I/O. In this case I mean Direct I/O. Let’s dig into this one a bit.

I need something like a dd(1) tool that does O_DIRECT opens. This should be simple enough. I’ll just go get a copy of the oss.oracle.com coreutils package that has O_DIRECT tools like dd(1) and tar(1). So here goes:

[root@tmr6s15 DD]# ls ../coreutils-4.5.3-41.i386.rpm
../coreutils-4.5.3-41.i386.rpm
[root@tmr6s15 DD]# rpm2cpio < ../coreutils-4.5.3-41.i386.rpm | cpio -idm
11517 blocks
[root@tmr6s15 DD]# ls
bin  etc  usr
[root@tmr6s15 DD]# cd bin
[root@tmr6s15 bin]# ls -l dd
-rwxr-xr-x  1 root root 34836 Mar  4  2005 dd
[root@tmr6s15 bin]# ldd dd
        linux-gate.so.1 =>  (0xffffe000)
        libc.so.6 => /lib/tls/libc.so.6 (0x00805000)
        /lib/ld-linux.so.2 (0x007ec000)

I have an NFS mount exported from an HP EFS Clustered Gateway (formerly PolyServe):

 $ ls -l /oradata2
total 8388608
-rw-r--r--  1 root root 4294967296 Aug 31 10:15 file1
-rw-r--r--  1 root root 4294967296 Aug 31 10:18 file2
$ mount | grep oradata2
voradata2:/oradata2 on /oradata2 type nfs
(rw,bg,hard,nointr,tcp,nfsvers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0,addr=192.168.60.142)

Let’s see what the oss.oracle.com dd(1) can do reading a 4GB file and over-writing another 4GB file:

 $ time ./dd --o_direct=1048576,1048576 if=/oradata2/file1 of=/oradata2/file2 conv=notrunc
4096+0 records in
4096+0 records out

real    1m32.274s
user    0m3.681s
sys     0m8.057s

Test File Over-writing
What’s this bit about over-writing? I recommend using conv=notrunc when testing write speed. If you don’t, the file will be truncated and you’ll be testing write speeds burdened with file growth. Since Oracle writes the contents of files (unless creating or extended a datafile), it makes no sense to test writes to a file that is growing. Besides, the goal is to test the throughput of O_DIRECT I/O via NFS, not the filer’s ability to grow a file. So what did we get? Well, we transferred 8GB (4GB in, 4GB out) and did so in 92 seconds. That’s 89MB/s and honestly, for a single path I would actually accept that since I have done absolutely no specialized tuning whatsoever. This is straight out of the box as they say. The problem is that I know 89MB/s is not my typical performance for one of my standard deployments. What’s wrong?

The dd(1) package supplied with the oss.oracle.com coreutils has a lot more in mind than O_DIRECT over NFS. In fact, it was developed to help OCFS1 deal with early cache-coherency problems. It turned out that mixing direct and non-direct I/O on OCFS was a really bad thing. No matter, that was then and this is now. Let’s take a look at what this dd(1) tool is doing:

$ strace -c ./dd --o_direct=1048576,1048576 if=/oradata2/file1 of=/oradata2/file2 conv=notrunc
4096+0 records in
4096+0 records out
Process 32720 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 56.76    4.321097        1054      4100         1 read
 22.31    1.698448         415      4096           fstatfs
 10.79    0.821484         100      8197           munmap
  9.52    0.725123         177      4102           write
  0.44    0.033658           4      8204           mmap
  0.16    0.011939           3      4096           fcntl
  0.02    0.001265          70        18        12 open
  0.00    0.000178          22         8           close
  0.00    0.000113          23         5           fstat
  0.00    0.000091          91         1           execve
  0.00    0.000015           2         8           rt_sigaction
  0.00    0.000007           2         3           brk
  0.00    0.000006           3         2           mprotect
  0.00    0.000004           4         1         1 access
  0.00    0.000002           2         1           uname
  0.00    0.000002           2         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    7.613432                 32843        14 total

Eek! I’ve paid for a 1:1 fstatfs(2) and fcntl(2) per read(2) and a mmap(2)/munmap(2) call for every read(2)/write(2) pair! Well, that wouldn’t be a big deal on OCFS since fstatfs(2) is extremely cheap and the structure contents only changes when filesystem attributes change. The mmap(2)/munmap(2) costs a bit, sure, but on a local filesystem it would be very cheap. What I’m saying is that this additional call overhead wouldn’t laden down OCFS throughput with the –o_direct flag-but I’m not blogging about OCFS. With NFS, this additional call overhead is way to expensive. All is not lost.

I have my own coreutils dd(1) that I implements O_DIRECT open(2). You can do this too, it is just GNU after all. With this custom GNU coreutils dd(1) I have, the call profile is nothing more than read(2) and write(2) back to back. Oh, I forgot to mention, the oss.oracle.com dd(1) doesn’t work with /dev/null or /dev/zero since it tries to throw an O_DIRECT open(2) at those devices which makes the tool croak. My dd(1) checks if in or out is /dev/null or /dev/zero and omits the O_DIRECT for that side of the operation. Anyway, here is what this tool got:

$ time dd_direct if=/oradata2/file1 of=/oradata2/file2 bs=1024k conv=notrunc
4096+0 records in
4096+0 records out

real    1m20.162s
user    0m0.008s
sys     0m1.458s

Right, that’s more like it-80 seconds or 102 MB/s. Shaving those additional calls off brought throughput up 15%.

What About Bonding/Teaming NICS
Bonding NICs is a totally different story as I point out somewhat in this paper about Oracle Database 11g Direct NFS. You can get very mixed results if the network interface over which you send NFS traffic is bonded. I’ve seen 100% scalability of NICs in a bonded pair and I’ve seen as low as 70%. If you are testing a bonded pair, set your expectations accordingly.

Oracle11g Automatic Memory Management – Part I. Linux Hugepages Support.

I spent the majority of my time in the Oracle Database 11g Beta program testing storage-related aspects of the new release. To be honest, I didn’t even take a short peek at the new Automatic Memory Management feature. As I pointed out the other day, Tanel Poder has started blogging about the feature.

If you read Tanel’s post you’ll see that he points out AMM-style shared memory does not use hugepages. This is because AMM memory segments are memory mapped files in /dev/shm. At this time, the major Linux distributions do not implement backing memory mapped files with hugepages as they do with System V-style IPC shared memory. The latter supports the SHM_HUGETLB flag passed to the shmget(P) call. It appears as though there was an effort to get hugepages support for memory mapped pages by adding MAP_HUGETLB flag support for the mmap(P) call as suggested in this kernel developer email thread from 2004. I haven’t been able to find just how far that proposed patch went however. Nonetheless, I’m sure Wim’s group is more than aware of that proposed mmap(P) support and if it is really important for Oracle Database 11g Automatic Memory Management, it seems likely there would be a 2.6 Kernel patch for it someday. But that begs the question: just how important are hugepages? Is it blasphemy to even ask the question?

Memory Mapped Files and Oracle Ports
The concept of large page tables is a bit of a porting nightmare. It will be interesting to see how the other ports deal with OS-level support for the dynamic nature of Automatic Memory Management. Will the other ports also use memory mapped files instead of IPC Shared Memory? If so, they too will have spotty large page table support for memory mapped files. For instance, Solaris 9 supported large page tables for mmap(2) pages, but only if it was an anonymous mmap (e.g., a map without a file) or a map of /dev/zero-neither of which would work for AMM. I understand that Solaris 10 supports large page tables for mmap(2) regions that are MAP_SHARED mmap(2)s of files-which is most likely how AMM will look on Solaris, but I’m only guessing. Other OSes, like Tru64-and I’m quite sure most others-don’t support large page tables for mmap(2)ed files. This will be interesting to watch.

Performance, Large Page Table, Etc
I remember back in the mid-90s when Sequent implemented shared large page tables for IPC Shared memory on our Unix variant-DYNIX/ptx. It was a very significant performance enhancement. For instance, 1024 shadow processes attached to a 1GB SGA required 1GB of physical memory-for the page tables alone! That was significant on systems that had very small L2 caches and only supported 4GB physical memory. Fast forwarding to today. I know people with Oracle 10g workloads that absolutely seize up their Linux (2.6. Kernel) system unless they use hugepages. Now I should point out that these sites I know of have a significant mix of structured and unstructured data. That is, they call out to LOBs in the filesystem (give me SecureFiles please). So the pathology they generally suffered without hugepages was memory thrashing between Oracle and the OS page cache (filesystem buffer cache). The salve for those wounds was hugepages since that essentially carves out and locks down the memory at boot time. Hugepages memory can never be nibbled up for page cache. To that end, benefiting from hugepages in this way is actually a by-product. The true point behind hugepages not the fact that it is reserved at boot time, but the fact that CPUs don’t have to thrash to maintain the physical to virtual translations (tlb). In general, hugepages are a lot more polite on processor caches and they reduce RAM overhead for page tables. Compared to the mid 1990s, however, RAM is about the least of our worries these days. Manageability is the most important and AMM aims to help on that front.

Confusion
Of all things Oracle and Linux, I think one of the topics that gets mangled the most is hugepages. The terms and nobs to twist run the gamut. There’s hugepages, hugetlb, hugetlbfs, hugetlbpool and so on. Then there are the differences from one Linux distribution and Linux kernel to the other. For instance, you can’t use hugepages on SuSE unless you turn off vm.disable_cap_mlock (need a few double negatives?). Then there is the question of boot-time versus /proc or sysctl(8) to reserve the pages. Finally, there is the fact that if you don’t have enough hugepages when you boot Oracle, Oracle will not complain-you just don’t get hugepages. I think Metalink 361323.1 does a decent job explaining hugepages with old and recent Linux in mind, but I never see it explained as succinctly as follows:

  1. Use OEL 4 or RHEL 4 with Oracle Database 10g or 11g
  2. Set oracle hard memlock N in /etc/security/limits.conf where N is a value large enough to cover your SGA needs
  3. Set vm.nr_hugepages in /etc/sysctl.conf to a value large enough to cover your SGA.

Further Confusion
Audited TPC results don’t help. For instance, on page 125 of this Full disclosure report from a recent Oracle10g TPC-C, there are listings of sysctl.conf and lilo showing the setting of the hugetlbpool parameter. That would be just fine if this was a RHEL3 benchmark since vm.hugetlbpool doesn’t exist in RHEL4.

Performance
I admit I haven’t done a great deal of testing with AMM, but generally a quick I/O-intensive OLTP test on a system with 4 processor cores utilized at 100% speak volumes to me. So I did just such a test.

Using an order-entry workload accessing the schema detailed in this Oracle Whitepaper about Direct NFS, I tested two configurations:

Automatic Memory Management (AMM). Just like it says, I configured the simplest set of initialization parameters I could:

UNDO_TABLESPACE=rb1
UNDO_MANAGEMENT = AUTO
compatible = 10.1.0.0
control_files                  = ( /u01/app/oracle/product/11/db_1/rw/DATA/cntlbench_1 )
db_block_size                   = 4096
MEMORY_TARGET=1500M
db_files                        = 100
db_writer_processes = 1
db_name                         = bench
processes                       = 200
sessions                        = 400
cursor_space_for_time           = TRUE  # pin the sql in cache
filesystemio_options=setall

Manual Memory Management(MMM). I did my best to tailor the important SGA regions to match what AMM produced. In my mind, for an OLTP workload the most important SGA regions are the block buffers and the shared pool.

UNDO_TABLESPACE=rb1
UNDO_MANAGEMENT = AUTO
compatible = 10.1.0.0
control_files                  = ( /u01/app/oracle/product/11/db_1/rw/DATA/cntlbench_1 )
db_block_size                   = 4096
#MEMORY_TARGET=1500M
db_cache_size = 624M
shared_pool_size=224M
db_files                        = 100
db_writer_processes = 1
db_name                         = bench
processes                       = 200
sessions                        = 400
cursor_space_for_time           = TRUE  # pin the sql in cache
filesystemio_options=setall

The following v$sgainfo output justifies just how closely configured the AMM and MMM cases were.

AMM:

SQL> select * from v$sgainfo ;

NAME                                  BYTES RES
-------------------------------- ---------- ---
Fixed SGA Size                      1298916 No
Redo Buffers                       11943936 No
Buffer Cache Size                 654311424 Yes
Shared Pool Size                  234881024 Yes
Large Pool Size                    16777216 Yes
Java Pool Size                     16777216 Yes
Streams Pool Size                         0 Yes
Shared IO Pool Size                33554432 Yes
Granule Size                       16777216 No
Maximum SGA Size                 1573527552 No
Startup overhead in Shared Pool    83886080 No

NAME                                  BYTES RES
-------------------------------- ---------- ---
Free SGA Memory Available                 0

MMM:

SQL> select * from v$sgainfo ;
NAME                                  BYTES RES
-------------------------------- ---------- ---
Fixed SGA Size                      1302592 No
Redo Buffers                        4964352 No
Buffer Cache Size                 654311424 Yes
Shared Pool Size                  234881024 Yes
Large Pool Size                           0 Yes
Java Pool Size                     25165824 Yes
Streams Pool Size                         0 Yes
Shared IO Pool Size                29360128 Yes
Granule Size                        4194304 No
Maximum SGA Size                  949989376 No
Startup overhead in Shared Pool    75497472 No

NAME                                  BYTES RES
-------------------------------- ---------- ---
Free SGA Memory Available                 0

The server was a HP DL380 with 4 processor cores and the storage was an HP EFS Clustered Gateway NAS. Before each test I did the following:

  1. Restore Database
  2. Reboot Server
  3. Mount NFS filesystems
  4. Boot Oracle

Before the MMM case I set vm.nr_hugepages=600 and after the database was booted, hugepages utilization looked like this:

$ grep Huge /proc/meminfo
HugePages_Total:   600
HugePages_Free:    145
Hugepagesize:     2048 kB

So, given all these conditions, I believe I am making an apples-apples comparison of AMM to MMM where AMM does not get hugepages support but MMM does. I think this is a pretty stressful workload since I am maxing out the processors and performing a significant amount of I/O-given the size of the server.

Test Results
OK, so this is a very contained case and Oracle Database 11g is still only available on x86 Linux. I hope I can have the time to do a similar test with more substantial gear. For the time being, what I know is that losing hugepages support for the sake of gaining AMM should not make you lose sleep. The results measured in throughput (transactions per second) and server statistics are in:

Configuration OLTP Transactions/sec Logical IO/sec Block Changes/sec Physical Read/sec Physical Write/sec
AMM 905 36,742 10,195 4,287 2,817
MMM 872 36,411 10,101 4,864 2,928

Looks like 4% in the favor of AMM to me and that is likely attributed to the 13% more physical I/O per transaction the MMM case had to perform. That part of the results has me baffled for the moment since they both have the same buffering as the v$sgainfo output above shows. Well, yes, there is a significant difference in the amount of Large Pool in the MMM case, but this workload really shouldn’t have any demand on Large Pool. I’m going to investigate that further. Perhaps an interesting test would be to reduce the amount buffering the AMM case gets to force more physical I/O. That could bring it more in line. We’ll see.

Summary
I’m not saying hugepages is no help across the board. What I am saying is that I would weigh heavily the benefits AMM offers because losing hugepages might not make any difference for you at all. If it is, in fact, a huge problem across the board then it looks like there has been work done in this area for the 2.6 Kernel and it seems reasonable that such a feature (hugepages support for mmap(P)) could be implemented. We’ll see.

This Is a SPAM Blog Entry

An old Sueqnet bdudy of mnie snet the fowolling to me tdoay

I cdnuolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg. The phaonmneal pweor of the hmuan mnid, aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it dseno’t mtaetr in waht oerdr the ltteres in a wrod are, the olny iproamtnt tihng is taht the frsit and lsat ltteer be in the rghit pclae. The rset can be a taotl mses and you can sitll raed it whotuit a pboerlm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

…hmmm…he must be a regular reader of my blog I guess 🙂

Oracle Database 11g Automatic Memory Management

Our OakTable Wunderkind, Tanel Poder, has started blogging about Oracle Database 11g Automatic Memory Management. Do pay his site a visit! In fact, his site is worth a regular visit.

Multi-Core Processors? Lots, and Lots of Cores!

One of my fellow OakTable Network members sent me email about this these guys today. I had never heard of them. But then I’m no sleuth since I didn’t hear about PANTA Systems until they audited a very nice Infiniband-enabled TPC-H last year. I’m talking about TILERA. Today TILERA announced a 64-core x86 compatible processor. Since this processor is marketed for embedded applications such as network switches and so on I won’t bore you with a long blog entry. One point I found very interesting about this processor, however, is that it has 2 10GbE interfaces on the chip itself as opposed to jumping over PCI. Now that is one heck of a way to hose down the cores! 

Come to think of it, maybe this is worth blogging. I did see recently that Intel has an 80-core (not x86 compatible though) chip in demo state. Sun, of course, has the Niagra and now there is this 64-core chip where each core can run its own copy of Linux. Even though the TILERA isn’t intended for servers, it represents technology that doesn’t seem that far off. Interesting stuff.

 

Over-Configuring DBWR Processes – Part IV

In Part II of this small series of blog entries on DBWR, a reader made note of the fact that the test I was using suffered log file switch (checkpoint incomplete) wait events and wanted an explanation as to why those were happening. I’d like to answer that and cover a couple of mistakes I made as well.

When I configured the OLTP database for that set of tests I sized the online redo logs small enough to ensure I’d get some log switches into the overall mix as well. My rationale for that move was to take measurements at the varying DBWR counts with a workload that put DBWR under as much duress as possible. To that end, some may consider the test too artificial and honestly I don’t blame them. Afterall, the remedy for log file switch (checkpoint incomplete) is to either make the redo logs larger or address DBWR’s ability to flush buffers. Of course the approach for the latter would be either a faster I/O subsystem or improve DBWR’s ability to do his job-which is what this thread is about. So that begs the question, what are the numbers if redo logs are large enough to rule out checkpoint bottlenecks. Good question. After changing my redo logs to very large sizes I ran another set of tests that lead me to believe that on systems with a small number of processor cores and a normal DBWR workload (e.g., reasonable checkpoints) there is no appreciable difference in throughput when going with as many as 1 DBWR per processor core. The results surprised me…read on…

Admissions of Error
The first of two admissions of error that is. After reading my own blog entries on this thread a couple of times I believe I need to clarify some things. I’ve stated matter-of-factly that over-configuring DBWR processes will impact performance. Well, that is true as my test example showed but is it a completely true statement? No. I met a guy once that couldn’t admit when he was wrong. I didn’t like him much.

Here is the situation: over-configuring DBWR processes is not going to impact performance if your workload isn’t a challenge for DBWR functionality. I have myriads of proof that normal DBWR workloads can be handled by multiple DBWR processes without any problem. It’s more likely the case that if your system has unnecessary multiple writers and you reconfigure to a single writer you wont see any difference. That’s because multiple writers in and of themselves is not a bad thing. And most production systems do not torture the I/O subsystem like I do here. My intent was simply to dig into the topic a bit deeper. That leads me to admission number 2.

Earlier in this series I mentioned the cache effect of multiple writers. Well, I was digging around some audited TPC-C full disclosure reports and noticed the practice of setting db_block_checksum=FALSE. This jarred my memory to the fact that since Oracle9i, DBWR has in fact been peeking at block contents by default. Let me explain. There was a day when DBWR’s accesses to memory differed significantly from that of foreground processes. Those days were pre-Oracle9i and in those releases DBWR didn’t peek at the contents of buffers. Instead, DBWR’s work consisted of figuring out what to write, writing it and performing the necessary housekeeping so that the written buffer would have the correct state. To that end, DBWR processes manipulated much more SGA metadata than did foreground processes. While that is still true to a large degree, Oracle9i changed that situation because DBWR now peeks at buffer contents.

With Oracle9i and later releases, DBWR processes peek at the contents of the buffers to calculate the block checksum and then store the value in the block itself. When re-read, the checksum is calculated by the reading process to ensure the block hadn’t been corrupted after it was flushed from the SGA. Now don’t get me wrong, these buffer reads are nowhere as data intensive as what foreground processes do for a living. Consider, for instance, a block of data manipulated by a foreground process servicing a payroll application. The foreground process exercises myriads of code that DBWR never would (SQL, transaction layer, etc), touches shared resources DBWR never would (shared pool, redo, etc) and most importantly, manipulates the row pieces in the block. When DBWR flushes the buffer with block check summing, it will peek at the contents of the block which will require the memory be installed into DBWR’s processor cache. That wasn’t the case “in the old days.” DBWR used to only manipulate metadata.

With db_block_checksum=FALSE, DBWR doesn’t peek at the contents of the block. When DBWR simply writes a buffer (without peeking into it), the contents of that buffer do not get installed into the processor cache. The page is in DBWR’s address space, but that is not the same as a memory access. Don’t get me wrong, I’m not saying db_block_checksum=TRUE is miserably expensive, because I haven’t tested the difference on modern systems so it I don’t know. To be honest, I wouldn’t run a database without it because it sure seems like good insurance. I would think systems with really large processor caches would see the most performance hit, but that would just be my best guess. After all, most commodity systems have small processor caches that are simply overwhelmed long before any of these concepts come into play. As for huge systems, well, I’ve been unfortunate to have been out of that space for a few years. It’s good to have friends, so I’ll quote the comment my old buddy Glenn Fawcett made on Part I of this series:

I agree… One DBWR process per cpu is not sage advice.

On large NUMA based servers, such as the Sun E25K, Oracle will by *default* create one dbwr process per system board. This allows the DBWR process to search LRUs out of local memory which is very efficient… Kevin could explain the benefits of this far better than I could.

The best advice I could give on “db_writer_processes” is don’t set it AT ALL. Let Oracle figure out the best setting.

Glenn knows really big systems due to his role in Sun’s SAE and other roles.

I think all this was my long-winded way to say that it is good to dive into DBWR topics, but it is best to remember that things like db_block_checksum defaults change from time to time. Further, there should be no panic over having multiple writers and finally, mistakes happen.

This One’s Not About Oracle

sunset0001.jpg

Over-Configuring DBWR Processes – Part III

Sometimes the comment thread on a blog entry are important enough to warrant yet another blog entry. This is one of those times. In my blog entry called Over-Configuring DBWR Processes Part II I offered some test results showing that it really is best to keep a single writer until such time as a single writer becomes CPU-bound. But first a disclaimer.

DISCLAIMER: There are ports of Oracle where specialized multiple DBWR code is implemented for specific hardware architecture (e.g., NUMA). I ought to know, I worked on one of the very first (if not the first) to do so with the ports to Sequent DYNIX/ptx.

There, disclaimer aside, I am talking about the Linux ports of 10g and 11g which have no such hardware specialization.

The first question in that comment thread was:

From what’s written here, it looks like you only tested 1 and 8. Is it possible that the global maximum is actually somewhere between? Maybe 2 processes is slightly faster than 1, for example?

Great question with a simple answer: use 1 writer as long as that single writer has ample bandwidth. That is a truth, but I figured I’d do some runs to milk out a fact or two. I configured 2 DBWR processes and re-ran the workload I describe in Part II of this series. I followed that up with 4 DBWR processes. Hey, it’s all auto-pilot stuff with my setup so doing this was no big deal. Let’s compare the results. Just to summarize, the previous configurations performed as follows:

Configuration

OLTP Throughput (TPM)

Aggregate DBWR CPU Usage

1 DBWR

62.6

291

8 DBWR

57.9

256

And now with the 2 and 4 DBWR cases:

Configuration

OLTP Throughput (TPM)

Aggregate DBWR CPU Usage

1 DBWR

62.6

291

2 DBWR

58.7

273

4 DBWR

58.4

269

8 DBWR

57.9

256

The way this goes is that more DBWR processes impact throughput. On the other hand, more throughput will require more DBWR work so the optimal case of 1 DBWR process will take more CPU, but remember that it is also getting more throughput. I only provide these CPU numbers to show what percentage of a single CPU any given configuration utilizes. I don’t think it is necessary to run the 6 DBWR case. But what about a wildcat configuration such as multiple DBWR processes that have hard processor affinity?

Processor Affinity and Oracle Background Processes
Well, I’d never bind an Oracle background process to a singe processor (core) for obvious reasons, but with the Linux taskset(1) command it is possible to bind a running process to a socket. For example, the following command binds pid 13276 to the seconds socket of a mulit-socket system:

# taskset –pc 2-3 13276

Building on that idea, you can grab DBWR processes (or any process for that matter) once they are running and pin them to a CPU core or a set of CPU cores. The following stupid bash shell function can be used to do this sort of thing. Note, since there are tens of thousands of skilled computer programmers out there these days, all shell scripts are considered stupid of course. The function takes an argument that it then uses in its grep regular expression for globbing ps(1) output. An example would be:

aff ora_dbw
aff()
{
local GLOB="$1"
local CORE=0
local TWIN=0

ps -ef | grep $GLOB | grep -v grep | awk '{ print $2 }'  | while read PID
do
        (( TWIN = $CORE + 1 ))

        set -x
        sudo taskset -pc ${CORE}-${TWIN} $PID
        set -    

        (( CORE = CORE + 2 ))
        [[ $CORE = 8 ]] && CORE=0 && TWIN=0
done
}

I used this script to affinity 2 DBWR processes to a single socket and the numbers didn’t change. Neither did throwing 2 DBWR processes one each to two different sockets. The fact simply remains that DBWR doesn’t benefit from having peers unless it is CPU-bottlenecked.

Feels Like A Sound Bite

Interesting to see this eWEEK.com interview finally show up. It sure sounds different when it is presented there; as if this was the complete conversation. In fact, this was a small part of a long set of interviews back in February 2007. I wish now that I would have been clearer about what ‘production’ means. There are production applications at the core of the enterprise and there are production applications at the edge. Any application that can be developed and deployed using any open source database would most certainly not require Oracle’s carrier-grade edition (Enterprise Edition) were it to be developed and deployed with Oracle. That is, any application that can be deployed with an open source database could be deployed with a limited-feature edition of Oracle such as Standard Edition (or even perhaps Standard Edition One). On the contrary, it is unreasonable to think that one could migrate enterprise-class applications to any of the open source stuff out there (or even most of the closed-source stuff for that matter). I think I’ve made that point pretty clear in my thread about EnterpriseDB for warehousing and many other posts here in the blog.


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 741 other subscribers
Oracle ACE Program Status

Click It

website metrics

Fond Memories

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.