Archive Page 31

Oracle11g New Features Look Like Good Stuff. Not Just a World Domination Play.

I feel fortunate to be one of 350 people invited to Oracle HQ this week for the Xtreme PTS: The Total Oracle Partner Technology Connection. This is a week long 11g briefing, technical deep dive and hands-on event hosted by Oracle’s Platform Technology Solutions group.

I’ve been involved with pre-release Oracle technology in one fashion or another with every release since 5.1.17 so I’ve been to this type of event several times. Over the years these events have been sort of “This is our software, with it we shall dominate the world.” That was not the case at this event. The presentations have been very good, focused on the technology and honestly a pleasure to attend. Andy Mendelsohn gave the first key note presentation covering 11g new features. I have to admit I was curious what sort of war drum the 11g new features presentation would sound like. Would I hear, “We are Linux” or “Microsoft Sucks?”

As soon as Andy started to speak, I set my stop watch. I wanted to time how long it would take for the word Linux to pop up. I was pleasantly surprised that the first 21 minutes of the presentation consisted of a good coverage of the new features with complete lack of any “World Dominance” sort of mantra. At 22 minutes, the word Linux appeared right next to Windows in the context of provisioning! See, I don’t think we really want to hear all this “Rule the world with Linux” sort of stuff. After all, the majority of Oracle customers are running their business with Oracle on Unix systems primarily, Windows just below that and then Linux. I’m not bashing Linux, but come on, it is just an OS. We like Oracle—The Database, the platform. No need for Linux religion really.

Richard Sarwal followed with an update on Enterprise Manager. I’ve known Richard since the mid-90s. What I learned from his presentation is that EM is getting more and more focus at Oracle. I’m glad because as it stabilizes it appears to me more and more crucial in the grand scheme of operations over all. I have to admit that I have not kept myself up to date with some of the provisioning and patch management capabilities of EM. Small awakenings like that are why I like to attend these types of events.

Near the end of the day yesterday I spent about an hour and a half briefing the Sr. Director in charge of VOS development in Server Technologies on some testing I’m doing in our labs. One of his engineers that I’ve known, and held in the highest regard, since the mid-90s was there as well. After the meeting she told me she reads my blog. She didn’t chew me out about the content. I was happy.

Oracle versus SQL Server
I’ve pointed out many times on this blog that I am not a Microsoft-basher. I do have one thing to say, however. The more I see about 11g combined with the February SE licensing (as I discussed in this blog entry) change, Oracle is going to put the hurt on SQL Server they way they did Informix, Sybase and Ingres. And for those of you who can’t remember those wars, they were bloody.

World Dominance
OK, so I harp on that one a bit too much. It isn’t that I don’t think Oracle should pursue world dominance; it’s just that I don’t want to constantly hear about it. Too much preaching to the converted gets tiring. Look, if Oracle doesn’t remain dominant in the market they won’t be there to support our deployments. None of us would benefit from Oracle suffering in the market.  But world dominance?

It’s OK To Blame the “Storage.” Oracle Problems When “Storage” is “Just Fine”

“It’s not the storage.” Now that is a famous quote. I hear about it all the time. Some bizarre Oracle problems crop up and the DBA team do all the analysis they possibly can and then start talking to the System Administration and eventually, the Storage Administration folks. Let me tell you a tale…

I had a fully functional 4 node Oracle10gR2 RAC cluster running on RHEL4 x86_64. Everything was just fine, until Wednesday this week. The 30+ hours that ensued were quite interesting—because of how many different teams of people were involved. I’d like to share the strangest set of Oracle failures that turned out to be “storage related.” Why the quotes?

Storage Starts When Oracle Makes an I/O Call
That is my opinion and it always has been. As far as the Oracle server is concerned, everything “downwind” of a physical I/O (e.g., db file sequential read, direct path write, log write parallel write, db file scattered read, CF read, etc) is storage related—generically speaking. Some of that is in the I/O library (user mode), some in the Kernel I/O stack, and more in the I/O adapter firmware, storage array firmware and all the hardware along the way of course. So why did my database spend about an entire day down when there were “no storage errors?” Ever hear that from the Storage Group in your IT shop? Sure you have. The problems my database had were not because blocks of data where failing to hit their rightful spot on some little round, brown spinning thing somewhere. No, this “storage problem” was up-wind from that point.

Diagnosis
Yes, the best way to get to the bottom of a problem is to ask 3 questions:

  1. Has this ever worked
  2. When did it stop working
  3. What changed

If nothing changed in my situation, that database would have most likely happily continued to run. But we were changing a lot of things—I admit. We were testing the I/O stack comparing RHEL4 kernel versions 2.6.9-34 and 2.6.9-42. The following problems are not typical of “I/O related issues”:

Wed Mar 28 19:08:00 2007
Errors in file /u01/app/oracle/admin/PROD/udump/prod1_ora_7140.trc:
ORA-00600: internal error code, arguments: [2116], [900], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [3708], [910], [], [], [], [], [], []
Wed Mar 28 19:08:05 2007
Trace dumping is performing id=[cdmp_20070328190805]
Wed Mar 28 19:08:38 2007
Errors in file /u01/app/oracle/admin/PROD/bdump/prod1_ckpt_7018.trc:
ORA-00600: internal error code, arguments: [2103], [0], [4], [1], [900], [], [], []

Well, OK, CF Enqueue timeout could be I/O related, and ORA-00600 [3708] is not that uncommon, but 2116 is not very common. The point is that I had no I/O errors anywhere. The problem persisted:

Thu Mar 29 10:36:20 2007
GES: Potential blocker (pid=6870) on resource DM-0x0-0x1;
enqueue info in file /u01/app/oracle/admin/PROD/bdump/prod1_lmd0_6735.trc and DIAG trace file
Thu Mar 29 10:42:06 2007
Errors in file /u01/app/oracle/admin/PROD/udump/prod1_ora_6870.trc:
ORA-00600: internal error code, arguments: [2116], [900], [], [], [], [], [], []

$ more /u01/app/oracle/admin/PROD/udump/prod1_ora_6870.trc
*** 2007-03-29 10:13:11.558
Dumping diagnostic information for CKPT:
OS pid = 6776
loadavg : 1.34 0.97 0.54
memory info: free memory = 0.00M
swap info: free = 0.00M alloc = 0.00M total = 0.00M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 D oracle 6776 1 0 75 0 – 129285 wait_f 09:51 ? 00:00:00 ora_ckpt_PROD1
(no debugging symbols found)
Using host libthread_db library “/lib64/tls/libthread_db.so.1”.
Attaching to program: /proc/6776/exe, process 6776
skgpgpstack: fgets() timed out after 60 seconds
*** 2007-03-29 10:14:45.135
*** 2007-03-29 10:14:55.139
Waited for detached process: CKPT for 310 seconds:

This cluster was quite ill as should be apparent. But is it apparent that “Storage” is the problem?

What is “Storage” Again?
The moral of this story is to remember that I/O starts when Oracle makes a physical I/O call. This problem turned out to be an accidental association of .34 kernel modules getting used by a .42 kernel. Most things worked just fine—but I sure could not boot my Oracle database.

Once the right kernel modules were sorted out, the instances booted just fine and it started chugging along:

SQL> @io.sql

READS READMB WRITES WRITEMB
———- ———- ———- ———-
6648 41.6289063 97259 403.9375

SQL> list
1 select sum(PHYRDS) reads,sum(PHYBLKRD * 4 )/1024 readMB,
2 sum(PHYWRTS) writes,sum(PHYBLKWRT * 4 )/1024 writeMB
3 from dba_data_files,gv$filestat
4* where dba_data_files.file_id = gv$filestat.file#

The Moral
This blog entry won’t be life changing for anyone—except perhaps someone who is googling for ORA-0600 errors that start popping up “out of nowhere” on a database that has otherwise been quite healthy. Or perhaps even, someone that suspects they have a “storage problem”, but the blocks of data seem to be getting to disk just fine as far as the Storage Group can tell.

I hope my viewpoint about where “Storage” begins resonates with readers though.

Oracle on Virtual Machines. Going Fishing? Intel “Nehalem” Xeon Quad-Core with CSI Floats!

CRN.com has coverage of the Xeon “Penryn” processor and some info about the micro-architecture change that will following in 2008 with the “Nehalem” processor. I think the following is an astounding comment:

Meanwhile, Intel is also preparing its next-generation Nehalem platform, which represents the company’s most significant shift in system architecture since the Pentium Pro debuted in 1996, Gelsinger said.

If you remember the P6 Orion chipset with the Pentium Pro, you’ll recall that it was Intel’s first MCM with 4 Pentium processors. It offered 48 bit memory support (kernel address space), 3 cycle shared L2 cache, and was quite the leap over the Pentium. The article states that the off-chip memory controller will be gone (good) and the interconnect (CSI) will be more like AMD HyperTransport. I think that means a bit of a NUMA feel, but I’m not sure yet. The architecture of Nehalem will support up to 8 cores as well.

What Does This Have To Do With Oracle
These are quad core processors that are going to pack a very significant punch—much more so than the AMD Barcelona processor expected later this year. That means single socket, quad core servers with more power than most 4 socket systems today. So if you have, say, a Proliant DL585 (great box) with idle cycles, you will likely have a lot of idle cycles when you refresh with these servers. That means virtualization—get use to it. The article hints towards 32nm processors in the 2010 timeframe. My oh my.

Where and What is a Nehalem, Really?
It is a North American Indian tribe. There is also a river about 40 miles from where I live and it is, in fact, precisely what Intel named this processor after. Intel has named other processors after rivers in the Pacific Northwest region of the states in the past (e.g., Willamette). I’ve been fishing the Nehalem for many, many years. I’m told blogs are better with photos, so here goes.

I’m sure the concept of fishing will wound the tender sentiment of at least a few readers. I’m sorry. You can’t make everyone happy, but I’ll throw a bone. The main species we fish for in the Nehalem is Steelhead which is an anadromous salmonid related to trout. Basically, it is a trout that lives in salt water but spawns in fresh water. Unlike true salmon, it can repeat that cycle. For that reason, game management in my home state enforce a great deal of “catch and release” and artificial bait regulations. That is in fact what I was doing when I caught the “Nehalem Bright”, as they are called, in the following photo. Caught, photographed and placed gently back into the water.

Nehalem_Brite

Real Application Clusters on x86_64 RHEL? Don’t Forget Those 32bit Libraries!

The required list of RPMs for installing Oracle10g Release 2 on RHEL4 x86_64 is listed on the web in many places. What I don’t find, however, is ample coverage of the errors one sees if things are not completely in order. I think this blog entry might help out someone that makes it well into the Oracle clusterware install and hits this problem. I think there is enough of this error stack to anchor future googlers.

As one of my co-workers, Sergei, discovered yesterday, if you don’t have glibc-devel-2.3.4-2.25.i386.rpm (yes the 32bit glibc-devel) installed, you will see the following error stack when you run the root.sh in $ORA_CRS_HOME:

CSS is active on these nodes.
qar14s21
qar14s22
qar14s23
qar14s24

CSS is active on all nodes.
Waiting for the Oracle CRSD and EVMD to start
Oracle CRS stack installed and running under init(1M)
Running vipca(silent) for configuring nodeapps
PRKH-1010 : Unable to communicate with CRS services.
[PRKH-1000 : Unable to load the SRVM HAS shared library
[PRKN-1008 : Unable to load the shared library “srvmhas10″ or a dependent library, from LD_LIBRARY_PATH=”/u01/app/oracle/product/10.2.0/crs_1/jdk/jre/lib/i386/client:\
/u01/app/oracle/product/10.2.0/crs_1/jdk/jre/lib/i386:\
/u01/app/oracle/product/10.2.0/crs_1/jdk/jre/../lib/i386:\
/u01/app/oracle/product/10.2.0/crs_1/lib32:\
/u01/app/oracle/product/10.2.0/crs_1/srvm/lib32:\
/u01/app/oracle/product/10.2.0/crs_1/lib:\
/u01/app/oracle/product/10.2.0/crs_1/srvm/lib:\
/u01/app/oracle/product/10.2.0/crs_1/lib”
[java.lang.UnsatisfiedLinkError: /u01/app/oracle/product/10.2.0/crs_1/lib32/libsrvmhas10.so: libclntsh.so.10.1: cannot open shared object file: No such file or directory]]]

Not the most intuitive error stack given the origin of the problem especially since it happens so late in the game. I hope this helps somebody one day. That is a part of what blogging is supposed to be about I think.

Thanks to Sergei as well.

Steve Jobs Burns Money for the Betterment of Mankind

I haven’t done the Stupid Quote of the Day in a while, so I submit:

I was worth about $1 million when I was 23 and more than $10 million when I was 24, and more than $100 million when I was 25, and it wasn’t that important because I never did it for the money.

Steve Jobs

Gee, thanks Steve. We know, you did it for the betterment of mankind. Is there a pile of ashes in his back yard where he burned that money? I prefer truthful people:

Money doesn’t make you happy. I now have $50 million, but I was just as happy when I had $48 million.

Arnold Schwarzenegger

 

Do Not Exhaust Free Space Under ORA_CRS_HOME. Who Was The Culprit?

In my blog post entitled DBCA is Fond of ORA-30012 In RAC Environments, I mentioned that I am doing a lot of 4-node 10gR2 Linux RAC testing at the moment. I’ve got 1 cluster each of:

  • RHEL4 x86_64
  • RHEL4 x86
  • SLES9 x86_64
  • SLES9 x86

At one point I tried to start one of my RAC databases and got the following complaint from srvctl:

$ srvctl start database -d PROD
PRKH-1010 : Unable to communicate with CRS services.
[Communications Error(Native: prsr_initCLSS:[3])]
PRKO-2005 : Application error: Failure in getting Cluster Database Configuration for: PROD

That is a very generic error stack. My working directory was somewhere under $ORACLE_HOME (under /u01) I recall. I tried to cat(1) some text into a file and found there was no space available:

$ df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/cciss/c0d0p1 32993108 4671004 26646128 15% /
none 517148 0 517148 0% /dev/shm
/dev/psv/psv1 4489172 4489172 0 100% /u01
/dev/psv/psv2 9011188 1969292 7041896 22% /u02
/dev/psv/psv3 9011188 16264 8994924 1% /u03
/dev/psv/psv4 9011188 50540 8960648 1% /u04

The srvctl/CRS Connection
Right, running out of space in ORACLE_HOME is not good, but what does that have to do with srvctl? Well, this is one configuration where I have ORA_CRS_HOME in the same mount as ORACLE_HOME. When starting a database, srvctl uses relies on services from CRS. Having ORA_CRS_HOME on the same mount as ORACLE_HOME works just fine…unless…

I found the culprit taking all the space in the mount point where ORACLE_HOME and ORA_CRS_HOME reside, but first I want to show what happened next. I cleaned up some “stuff” and ran df(1):

$ df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/cciss/c0d0p1 32993108 4668384 26648748 15% /
none 517148 0 517148 0% /dev/shm
/dev/psv/psv1 4489172 3523208 965964 79% /u01
/dev/psv/psv2 9011188 1969296 7041892 22% /u02
/dev/psv/psv3 9011188 16264 8994924 1% /u03
/dev/psv/psv4 9011188 50540 8960648 1% /u04

I reclaimed some space, so…

$ srvctl start database -d PROD
PRKP-1001 : Error starting instance PROD2 on node qar11s10
CRS-1005: Failed to get required resources
CRS-0223: Resource ‘ora.PROD.PROD2.inst’ has placement error.
PRKP-1001 : Error starting instance PROD4 on node qar11s12
CRS-0215: Could not start resource ‘ora.PROD.PROD4.inst’.

OK, what now? I started investigating ORA_CRS_HOME for the node called qar11s10 and found the following:

# cd /u01/app/oracle/product/10.2.0/crs_1/log/qar11s10/crsd
# ls -l
total 95072
-rw——- 1 root root 463667200 Mar 19 10:52 core.6155
-rw-r–r– 1 root root 3519955 Mar 20 08:08 crsd.log

I thought that was really strange. Who was sitting in the logging directory for CRS when they dumped core? Let’s take a peek (emphasis added by me):

# gdb -c core.6155
GNU gdb Red Hat Linux (6.3.0.0-1.132.EL4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions.
Type “show copying” to see the conditions.
There is absolutely no warranty for GDB. Type “show warranty” for details.
This GDB was configured as “i386-redhat-linux-gnu”.
Core was generated by `/u01/app/oracle/product/10.2.0/crs_1/bin/crsd.bin reboot’.
Program terminated with signal 6, Aborted.
#0 0x003647a2 in ?? ()

So for some reason crsd.bin was executing with its CWD in the logging directory and dumped core. OK, whatever. But why a 442MB core dump after sending itself a SIGABRT (by calling abort(P))? Unfortunately those are good questions I can’t answer. I haven’t started on the path investigating why crsd.bin trashed its address space—I’ll get to that later. I also wasn’t done with checking the other nodes. Since this is a shared ORA_CRS_HOME, I simply changed directories up two parent levels (../../) and set off a find. The find command returned the core file I already knew about and one dropped from crsd.bin on node 1 of the cluster. I thought 442MB was special—how about 534MB:

# pwd
/u01/app/oracle/product/10.2.0/crs_1/log
# find . -name “core.*” -print
./qar11s10/crsd/core.6155
./qar11s9/crsd/core.6206
# find . -name “core.*” -exec ls -l {} \;
-rw——- 1 root root 463667200 Mar 19 10:52 ./qar11s10/crsd/core.6155
-rw——- 1 root root 560902144 Mar 19 17:06 ./qar11s9/crsd/core.6206

Why Am I Blogging About This?
If ORA_CRS_HOME runs out of space, availability can be affected.

Where Were You The First Time You Ever …?
After having seen this, I started to investigate just how big crsd.bin is when it isn’t hitting a bug—and what its CWD is when executing normally. The following was taken from a different cluster, but no matter:

# cd /
# ps -ef | grep crsd.bin
root 9880 1 0 10:53 ? 00:02:36 /u01/app/oracle/product/10.2.0/crs_1/bin/crsd.bin reboot
root 23005 24382 0 17:01 pts/0 00:00:00 grep crsd.bin
# cd /proc/9880
# ls -l cwd
lrwxrwxrwx 1 root root 0 2007-03-22 16:38 cwd -> /u01/app/oracle/product/10.2.0/crs_1/log/tmr18s1/crsd

Eek! I was surprised to find that crsd.bin executes with its CWD in a logging directory, but no matter—just be aware that this logging directory under ORA_CRS_HOME can get some good sized bombs (core files) dropped there if things go bad.

Back to the Core Files
With 442MB and 534MB seen so far, it made me wonder just how big crsd.bin normally is. In fact, I thought it best to look at how large csrd.bin is before any databases are even booted:

# ps -ef | grep pmon
root 14147 10188 0 17:21 pts/1 00:00:00 grep pmon
# ps -ef | grep crsd.bin
root 5953 1 0 12:09 ? 00:01:10 /u01/app/oracle/product/10.2.0/crs_1/bin/crsd.bin reboot
root 14329 10188 0 17:21 pts/1 00:00:00 grep crsd.bin
# cat /proc/5953/statm
127760 4207 1859 875 0 121306 0

So, I know this seems a bit like trivial pursuit, but according to /proc crsd.bin has a base memory utilization of roughly 499MB (first word in statm in KB). If crsd.bin dumps core smaller than that it is because core dumps don’t have the entire process context. Program text, for instance, is not dumped.

Start, Disable. Same Thing.
Then there was this other thing I saw today…

Did I really ask to disable the database in the following command?

$ srvctl start database -d PROD
PRKP-1019 : Database PROD already disabled.

If you try to start a database after disabling it, this is the message you’ll get. It is a bit misleading.

DBCA is Fond of ORA-30012 In RAC Environments

I’ve been doing a lot of 4 node 10gR2 RAC stuff lately on 4 different clusters. I ran into a problem that I thought I’d share since I can’t find very much on the web or in Metalink. It is a real headache, but “easy” to fix. I put easy in quotes because it is only easy to fix if you know what to do. First, the problem.

ORA-30012
I was running a Real Application Clusters stress test that I developed called thrash. It is not sophisticated, but it does put a lot of pain on the instances, servers and storage. It consists of staggered instance rebooting followed by the creation of a small tablespace—one per instance. Once the tablespaces are created, a set of sqlplus sessions alter the tablespaces adding a significant number of random sized data files. The tablespaces are dropped and the instances are rebooted in a staggered fashion. As I said, each instance is sustaining this workload. I know it is nothing like a production workload. It’s just one test I run in hopes of exposing file manipulation bugs associated with tablespace creation, datafile addition and tablespace dropping. That is, I’m looking for filesystem bugs exposed by Oracle tablespace manipulation.

Bumps in the Road
On occasion I was getting instances that would stop participating in the thrash. I read the alert logs and found that the error was ORA-30012—which made no sense to me since it was happening out of the blue. In fact, it was happening after running thrash for as long as 12 hours. To show you the oddity of these alert log entries, I’ll provide grep(1) output from all the alert logs (ORACLE_BASE is on a CFS):

$ grep ‘does not exist or of wrong type’ *log
alert_PROD2.log:ORA-30012: undo tablespace ‘UNDOTBS2’ does not exist or of wrong type
alert_PROD2.log:ORA-30012: undo tablespace ” does not exist or of wrong type
alert_PROD3.log:ORA-30012: undo tablespace ‘UNDOTBS3’ does not exist or of wrong type
alert_PROD3.log:ORA-30012: undo tablespace ” does not exist or of wrong type
alert_PROD4.log:ORA-30012: undo tablespace ‘UNDOTBS4’ does not exist or of wrong type

How strange! What is an undo tablespace with a NULL name? Part of the SPFILE is shown later in this post establishing the fact that I assign undo to instances explicity so what gives?

The following is a snippet to show that the error was occuring during ALTER DATABASE OPEN:

Errors in file /u01/app/oracle/admin/PROD/udump/prod2_ora_1569.trc:
ORA-30012: undo tablespace ‘UNDOTBS2’ does not exist or of wrong type
Fri Mar 16 04:41:38 2007
Error 30012 happened during db open, shutting down database
USER: terminating instance due to error 30012
Instance terminated by USER, pid = 1569
ORA-1092 signalled during: ALTER DATABASE OPEN…

The trace file wasn’t much help:

$ more /u01/app/oracle/admin/PROD/udump/prod2_ora_1569.trc
/u01/app/oracle/admin/PROD/udump/prod2_ora_1569.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 – 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /u01/app/oracle/product/10.2.0/rac_1
System name:    Linux
Node name:      qar14s22
Release:        2.6.9-42mxs351RHELupdate4
Version:        #1 SMP Tue Mar 6 16:37:53 PST 2007
Machine:        x86_64
Instance name: PROD2
Redo thread mounted by this instance: 2
Oracle process number: 16
Unix process pid: 1569, image: oracle@qar14s22 (TNS V1-V3)
*** SERVICE NAME:() 2007-03-16 04:41:35.038
kspgetpeeq:  kspasci not KSPASCNOP (0x110001 != 0x0)
*** SESSION ID:(137.25) 2007-03-16 04:41:38.788
ORA-30012: undo tablespace ‘UNDOTBS2’ does not exist or of wrong type

After exhausting my patience spelunking for information on the web and Metalink, I asked my fellow OakTable Network members. Jože Senegačnik of dbprof.com replied with:

If this is a RAC database then you need to specify the instance name in the init.ora or spfile together with the undo_tablespace parameter.

Where did he come up with that? I asked him if he’d actually hit this before. His answer was:

Yes, a couple of weeks ago I have experienced it on RAC on Windows. One node had problems with the undo_tablespace parameter after an unplanned database restart. The undo_tablespace parameter was changed in runtime months before but obviously it lacked the SID information and this caused problems after database restart.

A DBCA NO-NO
Folks, this database was created with DBCA. I don’t know how I haven’t seen this issue before, but the problem is that DBCA does not configure the SPFILE with explicit assignments for the INSTANCE_NAME parameter. For example, the following are a couple of strings(1)|grep(1) command pipelines that would return the 4 assigments of the INSTANCE_NAME parameter had DBCA set it up that way:

$ strings – spfilePROD.ora | grep –I instance_name
$ strings – spfilePROD.ora | grep ‘^PROD’
PROD2.__db_cache_size=71303168
PROD4.__db_cache_size=67108864
PROD3.__db_cache_size=75497472
PROD1.__db_cache_size=71303168
PROD3.__java_pool_size=4194304
PROD2.__java_pool_size=4194304
PROD4.__java_pool_size=4194304
PROD1.__java_pool_size=4194304
PROD3.__large_pool_size=4194304
PROD2.__large_pool_size=4194304
PROD4.__large_pool_size=4194304
PROD1.__large_pool_size=4194304
PROD2.__shared_pool_size=79691776
PROD4.__shared_pool_size=83886080
PROD3.__shared_pool_size=75497472
PROD1.__shared_pool_size=79691776
PROD3.__streams_pool_size=0
PROD2.__streams_pool_size=0
PROD4.__streams_pool_size=0

PROD1.__streams_pool_size=0
PROD3.instance_number=3
PROD4.instance_number=4
PROD2.instance_number=2
PROD1.instance_number=1
PROD3.thread=3
PROD2.thread=2
PROD4.thread=4
PROD1.thread=1
PROD1.undo_tablespace=’UNDOTBS1′
PROD3.undo_tablespace=’UNDOTBS3′
PROD4.undo_tablespace=’UNDOTBS4′
PROD2.undo_tablespace=’UNDOTBS2′

Remedy
It was exactly what Joze said. I set explicit assignments for the ISNTANCE_NAME parameters as follows and the problem has gone away.

PROD1.INSTANCE_NAME=PROD1
PROD2.INSTANCE_NAME=PROD2
PROD3.INSTANCE_NAME=PROD3
PROD4.INSTANCE_NAME=PROD4

Hope this helps some googler someday.

SAN Array Cache and Filers Hate Sequential Writes

Or at least they should.

Jonathan Lewis has taken on a recent Oracle-l thread about Thinking Big. It’s a good blog entry and I recommend giving it a read. The original Oracle-l post read like this:

We need to store apporx 300 GB of data a month. It will be OLTP system. We want to use 
commodity hardware and open source database. we are willing to sacrifice performance 
for cost. E.g. a single row search from 2 billion rows table should be returned in 2 sec.

I replied to that Oracle-l post with:

Try loading a free Linux distro and typing :
man dbopen
man hash
man btree
man mppo
man recno

Yes, I was being sarcastic, but on the other hand I have been involved with application projects where we actually used these time-tested “database” primitives…and primitive they are! Anyway, Jonathan’s blog entry actually took on the topic and covers some interesting aspects. He ends with some of the physical storage concepts that would likely be involved. He writes:

 

SANs can move large amounts of data around very quickly – but don’t work well with very large numbers of extremely random I/Os. Any cache benefit you might have got from the SAN has already been used by Oracle in caching the branch blocks of the indexes. What the SAN can give you is a write-cache benefit for the log writer, database writer, and any direct path writes and re-reads from sorting and hashing.

Love Your Cache, Hate Large Sequential Writes
OK, this is the part about which I’d like to make a short comment—specifically about Log Writer. It turns out that most SAN arrays don’t handle sequential writes well either. All told, arrays shouldn’t be in the business of caching sequential writes (yes, there needs to be a cut-off there somewhere). I’ve had experiences with some that don’t cache sequential writes and that is generally good. I’ve had experiences with a lot that do and when you have a workload that generates a lot of redo, LGWR I/O can literally swamp an array cache. Sure, the blocks should be cached long enough for the write back to disk, but allowing those blocks to push into the array cache any further than the least of the LRU end makes little sense. Marketing words for arrays that handle these subtleties usually sound like, “Adaptive Array Cache”, or words to that effect.

One trick that can be used to see such potential damage is to run your test workload with concurrent sequential write “noise.” If you create a couple of files the same size as your redo logs and loop a couple of dd(1) processes performing 128K writes—without truncating the files on open—you can drive up this sort of I/O to see what it does to the array performance. If the array handles the caching of sequential writes, without polluting the cache, you shouldn’t get very much damage. An example of such a dd(1) command is:

$ dd if=/dev/zero of=pseudo_redo_file_db1 bs=128k count=8192 conv=notrunc &

$ dd if=/dev/zero of=pseudo_redo_file_db2 bs=128k count=8192 conv=notrunc &

$ wait

Looping this sort of “noise workload” will simulate a lot of LGWR I/O for two databases. Considering the typical revisit rate of the other array cache contents, this sort of dd(1) I/O shouldn’t completely obliterate your cache. If it does, you have an array that is too fond of sequential writes.

What Does This Have To Do With NAS?
This sort of workload can kill a filer. That doesn’t mean I’m any less excited about Oracle over NFS—I just don’t like filers. I recommend my collection of NFS related posts and my Scalable NAS for Oracle paper for background on what sequential writes can do to certain NAS implementations.

I’ll be talking about this topic and more at Utah Oracle User Group on March 21st.

 

Copying Files on Solaris. Slow or Fast, It’s Your Choice. Part III

This is the third installment on this thread. For context, please see:
Copying Files on Solaris. Slow or Fast, It’s Your Choice. Part I
Copying Files on Solaris. Slow or Fast, It’s Your Choice. Part II

What About cp8M Versus Stock cp(1) with Non-forcedirectio?
That is a good question. The saga continues after my post about copying files on Solaris. Once again, Padraig O’Sullivan was kind enough to test cp8M (available here ) versus stock cp(1) using a normal mounted UFS (non-forcedirectio). He reports:

Ok, I ran the benchmark in the same manner as before WITHOUT forcedirectio i.e. I rebooted the machine before each copy of the file.

# time /usr/bin/cp large_file large_file.1
real 2m17.504s
user 0m0.002s
sys 0m14.363s
#
# time /usr/bin/cp8m large_file large_file.
real 1m56.217s
user 0m0.003s
sys 0m14.264s
#

Why?
I don’t know. I certainly did not expect an increase in kernel mode cycles for the mmap-enabled cp(1). Please refer to Part I in this series to see that the comparison here is between 14.363s versus 10.853s of kernel-mode cycles.  We’re not talking a little increase. No, what Padraig’s measurements show is an increase of 32% in kernel mode cycles when copying a 1000MB file using stock cp(1) on a regular mount compared to the same work on a file in a forcedirectio mount. But hey, at least the performance (in MB/s) was consistently 16% less than cp8M. Yes, that was sarcasm.

I haven’t yet gotten my head around why the standard mmap-enabled cp(1) suffers such a jump in kernel mode processor overhead when switching from direct I/O to a normal UFS mount. I need to think about that a bit.

As usual, a picture speaks a thousand words, so I’ll provide two:

cp8m1.jpg

cp8m2.jpg

Remember my rant about the “small test?”

Sharing, and Caring
There was a comment by a reader on Part I of this blog thread that is worthy of discussion. The reader commented:

Perhaps a fairly obvious statement this, but notice the use of MAP_SHARED on the mmap call? – (I suspect you’ve spotted that already). This means that multiple processes can attach to the same memory mapped file simultaneously.

That is a good blog comment and evidence of someone giving it some thought. But, I’d like to comment on the sharable aspect of the 8MB map the reader brought out. I replied:

Your point about the kernel bcopy from UFS read buffers to the heap buffer in the address space of the cp(1) process is a good one, but this is a forcedirectio case.

That means there is no copy from the page cache into the virtual address space of process since it is direct I/O. My reply continued:

Since this is an Oracle blog, I would naturally go with the forcedirectio comparison first. It will be interesting to see with a normal UFS mount.

I’ve got a $2 bet that the MAP_SHARED is only there to facilitate copying an already mmapped file…the odds of a process jumping in and sharing a 8MB map that only lives for the duration of an I/O in and an I/O out seems pretty slim to me…but then that is 8MB twice…hmmm…I guess that 8MB mmap could exist for as much as 2-3 seconds if the I/O is headed for a single, simple drive. Sounds like a race just to share an 8MB map to me.

A Closer Look
Yes, when the stock cp(1) mmaps each 8MB segment of the input file it does so with MAP_SHARED. Like I said, I suspect the only thought behind that flag usage was to ensure there wouldn’t be “twinkling” mmap failures by other processes that could potentially be mmapping parts of that file while cp(1) is walking through it.

The reader’s comment continued:

That’s not to say that they all need to be “cp”’s – anything using mmap() on the same file at about same time will yield a benefit – the 8MB chunk paged in by mmap should only be later reclaimed by the pagescanner (or when the last process detaches?).

I already discussed the odds of another process getting in there and benefiting by that very transient mmap. It is 8MB in size and only valid during the read in, and write out—about 2-3 seconds on a really slow disk subsystem.

Reclaims
What’s this about reclaims? Good topic. When the mmap is dissolved through munmap(), the pages of the file are put on the free list (pagecache). Here is where the non-forcedirectio cp8M and cp(1) have a lot in common. In both cases, the blocks from the input file remain in main memory. Now that is where there is some true opportunity for sharing but only in the non-forcedirectio case. All said, it doesn’t take mmap() to get sharing of file contents being copied when you are using UFS with a normal mount.

So the question remains, what’s up with the mmap()-enabled cp(1)?

Is this thread making you sleepy?

Copying Files on Solaris. Slow or Fast, It’s Your Choice. Part II

 

In my post about copying files on Solaris, I cover a modification that to GNU cp(1) that yields substantial performance improvements over the stock cp(1) when copying files on forcedirectio mounts. See my comments at the end of that post regarding non-forcedirectio mounts.

I’ve gotten a large number of requests for that code. I don’t have Solaris gear around here so I asked Padraig O’Sullivan to do the favor of testing the modified GNU cp(1) versus stock cp(1). Padraig states:

Here is his recipe for making a cp8M for solaris:

I used coreutils version 5.2.1 which can be obtained from here:

http://ftp.gnu.org/pub/gnu/coreutils/

In the coreutils-5.2.1/src directory I modified the copy.c file at line 287 with the following modification:

# diff -b coreutils-5.2.1/src/copy.c cp8m/copy.c

287c315

<   buf_size = ST_BLKSIZE (sb);

>   /* buf_size = ST_BLKSIZE (sb);*/

288a317,319

>      buf_size = 8388608 ;

I then used the Makefile which is supplied with coreutils to build the cp binary.

Motives
The intent of this whole blog thread is really nothing more than diving into filesystem, and to some degree VM, internals topics so I truly hope that others will do this test. It would be interesting to see whether the mmap-enabled stock cp(1) is better than cp8M with, say, normal UFS mounts. The difference is clear to me on forcedirectio and that is, naturally, the type of mount I would take the most interest in.

 

Standard File System Tools? We Don’t Need No Standard File System Tools!

Yesterday I posted a blog entry about copying files on Solaris. I received some side channel email on the post such as one with the following tidbit from a very good, long time friend of mine. He wrote:

So optimizing cp() is now your hobby? What’s next….. “ed”… no wait “df”.. boy it sure would be great if I could get a 20% improvement in “ls”… I am sure these commands are limiting the number of orders/hr my business can process :)))

Didn’t that blog entry show a traditional cp(1) implementation utilizing 26% less kernel mode processor cycles? Oh well.

It’s About the Whole System
While those were words spoken in jest, it warrants a blog entry and I’ll tell you why. It is true this is an Oracle related blog and such filesystem tools as cp(1) are not in the Oracle code path. I blog about these things for two reasons: 1) a lot of my readers enjoy learning more about the platform in general and 2) many—perhaps most—Oracle systems have normal file system tools such as cp(1), compress(1) and others running while Oracle is running. For that matter, the Oracle server can call out to the same libraries these tools use for such functionality as BFILE and UTL_FILE. For that reason, I feel these topics are related to Oracle platforms. After all, a garbage-can implementation of the standard filesystem tools—and/or the kernel code paths that service them—is going to take cycles away from Oracle. Now please don’t quote me as saying the mmap()-enabled Solaris cp(1) is a “garbage-can” implementation. I’m just making the point that if such tools are implemented poorly Oracle can be affected even though they are not in the scope of a transaction. It’s about the whole system.

Legacy Code. What Comes Around…Stays Around.
Let’s not think for even a moment that the internals of such tools as ls(1) and df(1) are beyond scrutiny. Both ls(1) and df(1) use the stat(2) system call. We Oracle-minded folks often forget that there is much more unstructured data than structured so it is a good thing there are still some folks like PolyServe (HP) minding the store for the performance of such mundane topics as stat(2). Why? Well, perfect examples are the online photo operations such as Snapfish. Try having thousands of threads accessing tens of millions of files (photos) for fun. See, Snapfish uses the HP Enterprise File Services Clustered Gateway NAS powered by PolyServe. You can bet we pay attention to “mundane” topics like what ls(1) behaves like in a directory with 1, 2 or 100 million small files. The stat(2) system call is extremely important in such situations.

He’s Off His Rocker—This is an Oracle Blog.
What could this possibly have to do with Oracle? Well, if you run Oracle on a platform that only specializes in the code underpinnings of the most common server I/O (e.g., db file sequential read, db file scattered read, direct path read/write, LGWR and DBWR writes), you might not end up very happy if you have to do things that hammer the filesystem with Oracle features like UTL_FILE, BFILE, external tables, imp/exp and so forth, cp(1), tar(1), compress(1) and so on. It’s all about taking a holistic view instead of “camps” that focus on segments of the I/O stack.

As the cliché goes, standard file operations and highly specialized Oracle code paths are often joined at the hip.

Oracle’s Realworld Performance Team and a New Blog

Greg Rahn of Oracle’s Realworld Performance Group (under Andrew Holdsworth) has fired up a blog. His first entry is about the art of problem diagnosis and I wish I said it this way first, but I didn’t…I quote:

In times like this I find it very productive to create a list of my observations at the different levels: host/system, database, application, etc. Once all of the observations are documented, relationships between them often become more apparent. Remember, it’s better to spend more time and correctly diagnose an observation than to prematurely label it incorrectly.

Sage words indeed. I hold people like Andrew Holdsworth in the highest regard and since Greg is in his group I’m looking for good things from this new blog.

Greg’s blog is :  Structured Data

Copying Files on Solaris. Slow or Fast, It’s Your Choice. Part I

In my recent blog post entitled Standard File Utilities with Direct I/O, I covered the concept of using direct I/O filesystems for storing files to eliminate the overhead of caching them. Consider such files as archived redo. It makes no sense polluting memory with spooled archived redo logs. Likewise, if you compress archived redo it makes little sense to have those disk blocks hanging out in main memory. However, the thread discusses the fact that most operating system file tools do their work by issuing ridiculously small I/O requests. If you haven’t yet, I encourage you to read that blog entry.

The blog entry seeded a lively thread of comments—some touched on theory, others were evidence of entirely missing the point. One excellent comment came in that refreshed some long-lost memories of Solaris. The reader wrote:

For what its worth, cp on Solaris 10 uses mmap64 in 8MB chunks to read in data and 8MB write()s to write it out.

I did in fact know that but it had been a while since I have played around on Solaris.

The Smell Test
I wasn’t fortunate enough to have genius passed to me through genetics. Oh how it seems life would be simpler if that were the case. But it isn’t. So, my motto is, “99% Perspiration, 1% Inspiration.” To that end, I learned early on in my career to develop the skills needed to spot something that cannot be correct—before bothering myself with whether or not it is in fact correct. There is a subtle difference. As soon as the Solaris mmap() enabled cp(1) thing cropped up, I spent next to no time at all pondering how that must certainly be better than normal reads (e.g., pread()) since it failed my smell test.

Ponder Before We Measure
What did I smell? There is just no way that walking through the input file by mapping, unmapping and remapping 8MB at a time could be faster than simply reusing a heap buffer. No way at all. After all, mmap() has to make VM adjustments that are not exactly cheap so taxing every trip to disk with a vigorous jolt of VM overhead makes little sense.

There must have been some point in time when a cp(1) implemented with mmap() was faster, but I suspect that was long ago. For instance, perhaps back in the day before pread(1)/pwrite(). Before these calls, positioning and reading a file required 2 kernel dives (one to seek and the other to read). Uh, but hold it. We are talking about cp(1) here—not random reads—where each successful read on the input file automatically adjusts the file pointer. That is, the input work loop would never have been encumbered with a pair of seek and read. Hmmm. Anyway, we can guess all day long why the Solaris folks chose to have cp(1) use mmap(2) as its input work horse, but in the end we’ll likely never know.

A Closer Look
In the following truss output, the Solaris cp(1) is copying a 5.8MB file to an output file called “xxx.” After getting a file descriptor for the input file, the output file is created. Next, mmap() is used on the input file (reading all 5.8MB since it is smaller than the 8MB operation limit). Next, the write call is used to write all 6161922 bytes from the mmap()ed region out to the output file (fd 4).

open64("SYSLOG-4", O_RDONLY) = 3
creat64("xxx", 0777) = 4
stat64("xxx", 0x00028640) = 0
fstat64(3, 0x000286D8) = 0
mmap64(0x00000000, 6161922, PROT_READ, MAP_SHARED, 3, 0) = 0xFEC00000
write(4, " F e b 5 1 0 : 3 6".., 6161922) = 6161922
munmap(0xFEC00000, 6161922) = 0

Of course if the file happened to be larger than 8MB, cp(1) would unmap and then remap the next chunk and on it would proceed in a loop until the input EOF is reached. That is a lot more “moving parts” than simply calling read(2) over and over clobbering the contents of a buffer allocated at the onset of the copy operation—without continual agitation of the VM subsystem with mmap().

I couldn’t imagine how cp(1) using mmap() would be any faster than read(2)/write(2). But then, it actually only replaces the input read with an mmap() while using write(2) on the output side. I couldn’t imagine how replacing just the input portion with mmap() would be faster than a cp() that uses a static heap buffer with read/write pairs. Moreover, I couldn’t picture how the mmap() approach would be easier on resources.

Measure Before Blogging
Not exactly me since I don’t have much Solaris gear around here. I asked Padraig O’Sullivan to compare the stock cp(1) of Solaris 10 to a GNU cp(1) with the modification I discuss in this blog entry. The goal at hand was to test whether the stock cp(1) constantly mapping and unmapping the input file is somehow faster or more gentle on processor cycles than starting out with a heap buffer and reusing it. The latter is exactly what GNU cp(1) does of course. Padraig asked:

One question about the benchmark you want me to run (I want to make sure I get all the data you want) – is this strategy ok?

1. Mount a UFS filesystem with the forcedirectio option
2. Create a 1 GB file on this filesystem
3. Copy the file using the standard cp(1) utility and record timing statistics
4. Copy the file using your modified cp8M utility and record timing statistics

Let me know if this is ok or if you want more information for the benchmark.

 

There was something else. I wanted a fresh system reboot just prior to each copy operation to make sure there were no variables. Padraig had the following to report:

[…] manage to run the benchmark in the manner you requested this morning […] Below is the results. I rebooted the machine before performing the copy each time.

# ls -l large_file
-rw-r–r– 1 root root 1048576000 Mar 5 10:25 large_file

# time /usr/bin/cp large_file large_file.1

real 2m17.894s
user 0m0.001s
sys 0m10.853s

# time /usr/bin/cp8m large_file large_file.2

real 1m57.932s
user 0m0.002s
sys 0m8.057s

Look, I’m just an old Sequent hack and yet the results didn’t surprise me. The throughput increased roughly 16% from 7.3MB/s to 8.5MB/s. What about resources? The tweaked GNU cp8M utilized roughly 26% less kernel mode processor cycles to do the same task. That’s not trivial since we didn’t actually eliminate any I/O. What’s that? Yes, cp8M reduces the wall clock time it takes to copy a 1000MB file by 16%–without eliminating any physical I/O!

Controversy
Yes, blogging is supposed to be a bit controversial. It cultivates critical thought and the occasional gadfly, fly-by commentary I so dearly appreciate. Here we have a simple test that shows a “normal” cp(1) is slightly faster and significantly lighter on resources than the magical mmap()-enabled cp(1) that ships with Solaris. Does that mean there isn’t a case where the mmap() style is better? I don’t know. It could be that some really large Solaris box with dozens of concurrent cp(1) operations would show the value of the mmap() approach. If so, let us know.

What Next?
I hope someone will volunteer to test cp8M on a high-end Solaris system. Maybe the results will help us understand why cp(1) uses mmap() on Solaris for its input file. Maybe not. Guess which way I’d wager.

Multiple Buffer Pools with Oracle

The multiple buffer pools feature of Oracle has always seemed much underappreciated. Back in the late 90s we found the feature invaluable when running audited TPC-C benchmarks on Sequent gear. With that workload it is critical to be able to size blocks according to the row size and modify rate of certain relations. I think there may be a perception that this feature is a benchmark special when in fact it can be very helpful.

Half Full or Half Empty
Some folks don’t think about their SGA block buffer pool much beyond the fundamental aspect of having blocks there to satisfy logical reads. Since it is managed with a least recently used (LRU) algorithm, I think folks generally leave it at that. The problem is that if you take a pessimistic look at LRU it is not always sufficient. That is, evicting the least recently used blocks isn’t sufficient if the blocks shouldn’t really be in the cache in the first place. There are some blocks that should simply not remain in the cache and LRU is not aggressive enough to “protect the cache”, as I put it. As an aside, this sort of dovetails in with the thread I have going about saving the overhead of caching certain filesystem data (e.g., compressed archived redo log files) by implementing direct I/O.

The Art of Multiple Buffer Pools Tuning
I think it is an art. The general approach is to figure out what is consuming cache and whether or not you want that particular object to be consuming the cache. The following is a quote from a whitepaper I wrote entitled Buffer Cache Performance Analysis and Multiple Buffer Pools in Oracle9i:

Any table or index sustaining a large number of reads should also have a commensurate presence in the cache. If not, it is safe to surmise that the blocks are being read at random and therefore are not being shared.

Although that paper is old, I think the principles apply quite well even today. Note, I have not done this testing with Oracle10g yet so the queries in the paper may not be correct.

A Current Topic
In a recent thread on the oracle-l list, fellow OakTable Network member Mark Bobak wrote:

I went through an exercise a while ago where I tried to determine which segments were candidates for a KEEP or RECYCLE pool.

The query I came up with is:

select vss.owner,
vss.object_name, vss.statistic_name, vss.value,
ds.bytes segsize, ds.buffer_pool

from v$segment_statistics vss, dba_segments ds
where vss.statistic_name =’physical reads’
and vss.value > 5000000 —You may need to play with this threshold value for your environment
and ds.segment_type = vss.object_type
and ds.segment_name = vss.object_name
and ds.owner=vss.owner
and ds.buffer_pool = ‘DEFAULT’
order by value
/

Now, this helped me identify the segments that had the largest levels of physical IO, and the size of the segment. The idea is, “show me which segments are in the DEFAULT pool and are doing the largest numbers of physical reads, and show me how big those segments are.”

Mark continued with:

Next, ask yourself these questions:

– How much memory do I have to dedicate to KEEP and/or RECYCLE pools?

– Of the segments returned from the query above, which ones are small, and easy to entirely cache? These are candidates for KEEP pool.

– Which are entirely too large to consider caching? These are candidates for RECYCLE pool.

I hope my old Oracle9i paper and Mark’s oracle-l post are helpful.

 

Concurrrent Instance Booting == Trivial Pursuit

Blog Frequency
I have found out the hard way that it is very difficult to keep the blogging stream going when one of your best friends of over 20 years passes away after a very short, fierce battle with cancer and your company gets swept up in a corporate takeover 2 days later. So, I’m understandably behind in my series about Oracle on Linux with NUMA hardware and a lot of other things. I actually have the material on the NUMA thread, but just haven’t been able to construct an entry. So, I thought I’d drop one in on a lighter note.

A Really Weird ps(1) Listing
What are the odds of this? In the following screen shot I have a ps(1) listing from one of my lab systems where I had two non-RAC instances running—one called PROD and the other called PRODD. What are the odds of having two PMON processes with sequential Linux process ids? Remember, the foreground process performing the instance start forks itself and execs off each background process which in turn execs again to change argv[0] (yes there are other reasons the server does a double exec()).   Anyway, it looked cool to me.

Check it out:

NOTE: Some browsers require you to right click->view for a clear image.

pps_1.jpg

 slash_proc


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 741 other subscribers
Oracle ACE Program Status

Click It

website metrics

Fond Memories

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.