In my blog post titled DBCA is Fond of ORA-30012 In RAC Environments, I mentioned that I am doing a lot of 4-node 10gR2 Linux RAC testing at the moment. I've got one cluster each of:
- RHEL4 x86_64
- RHEL4 x86
- SLES9 x86_64
- SLES9 x86
At one point I tried to start one of my RAC databases and got the following complaint from srvctl:
$ srvctl start database -d PROD
PRKH-1010 : Unable to communicate with CRS services.
[Communications Error(Native: prsr_initCLSS:[3])]
PRKO-2005 : Application error: Failure in getting Cluster Database Configuration for: PROD
That is a very generic error stack. As I recall, my working directory was somewhere under $ORACLE_HOME (mounted under /u01). I tried to cat(1) some text into a file and found there was no space available:
$ df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/cciss/c0d0p1 32993108 4671004 26646128 15% /
none 517148 0 517148 0% /dev/shm
/dev/psv/psv1 4489172 4489172 0 100% /u01
/dev/psv/psv2 9011188 1969292 7041896 22% /u02
/dev/psv/psv3 9011188 16264 8994924 1% /u03
/dev/psv/psv4 9011188 50540 8960648 1% /u04
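Spotting this condition before srvctl gets confused is easy to script. The following is a minimal sketch of my own pre-flight check, not anything Oracle ships; the /u01 mount point and the 95% threshold are just the values that fit this cluster:

```shell
# Sketch: warn when the filesystem holding a given path is nearly full.
# This is a home-grown check; mount point and threshold are examples.
check_space() {
  mount=$1 threshold=$2
  # df -P prints one POSIX-format record per filesystem; field 5 is Use% ("100%")
  used=$(df -P "$mount" 2>/dev/null | awk 'NR==2 {sub(/%/,"",$5); print $5}')
  if [ -n "$used" ] && [ "$used" -ge "$threshold" ]; then
    echo "WARNING: $mount is ${used}% full" >&2
  fi
}

check_space /u01 95   # /u01 holds ORACLE_HOME and ORA_CRS_HOME here
```

Run from cron, this would have flagged /u01 at 100% long before srvctl started speaking in PRKH-1010 riddles.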
The srvctl/CRS Connection
Right, running out of space in ORACLE_HOME is not good, but what does that have to do with srvctl? Well, this is one configuration where I have ORA_CRS_HOME on the same mount as ORACLE_HOME. When starting a database, srvctl relies on services from CRS. Having ORA_CRS_HOME on the same mount as ORACLE_HOME works just fine…unless…
I found the culprit taking all the space in the mount point where ORACLE_HOME and ORA_CRS_HOME reside, but first I want to show what happened next. I cleaned up some “stuff” and ran df(1):
$ df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/cciss/c0d0p1 32993108 4668384 26648748 15% /
none 517148 0 517148 0% /dev/shm
/dev/psv/psv1 4489172 3523208 965964 79% /u01
/dev/psv/psv2 9011188 1969296 7041892 22% /u02
/dev/psv/psv3 9011188 16264 8994924 1% /u03
/dev/psv/psv4 9011188 50540 8960648 1% /u04
I reclaimed some space, so…
$ srvctl start database -d PROD
PRKP-1001 : Error starting instance PROD2 on node qar11s10
CRS-1005: Failed to get required resources
CRS-0223: Resource 'ora.PROD.PROD2.inst' has placement error.
PRKP-1001 : Error starting instance PROD4 on node qar11s12
CRS-0215: Could not start resource 'ora.PROD.PROD4.inst'.
OK, what now? I started investigating ORA_CRS_HOME for the node called qar11s10 and found the following:
# cd /u01/app/oracle/product/10.2.0/crs_1/log/qar11s10/crsd
# ls -l
total 95072
-rw------- 1 root root 463667200 Mar 19 10:52 core.6155
-rw-r--r-- 1 root root   3519955 Mar 20 08:08 crsd.log
I thought that was really strange. Who was sitting in the logging directory for CRS when they dumped core? Let's take a peek:
# gdb -c core.6155
GNU gdb Red Hat Linux (6.3.0.0-1.132.EL4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu".
Core was generated by `/u01/app/oracle/product/10.2.0/crs_1/bin/crsd.bin reboot'.
Program terminated with signal 6, Aborted.
#0 0x003647a2 in ?? ()
So for some reason crsd.bin was executing with its CWD in the logging directory and dumped core. OK, whatever. But why a 442MB core dump after sending itself a SIGABRT (by calling abort(P))? Unfortunately, those are good questions I can't answer yet. I haven't started investigating why crsd.bin trashed its address space; I'll get to that later. I also wasn't done checking the other nodes. Since this is a shared ORA_CRS_HOME, I simply changed directories up two parent levels (../../) and set off a find. The find command returned the core file I already knew about plus one dropped by crsd.bin on node 1 of the cluster. I thought 442MB was special; how about 534MB:
# pwd
/u01/app/oracle/product/10.2.0/crs_1/log
# find . -name "core.*" -print
./qar11s10/crsd/core.6155
./qar11s9/crsd/core.6206
# find . -name "core.*" -exec ls -l {} \;
-rw------- 1 root root 463667200 Mar 19 10:52 ./qar11s10/crsd/core.6155
-rw------- 1 root root 560902144 Mar 19 17:06 ./qar11s9/crsd/core.6206
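Since nothing appears to clean these up automatically, a periodic sweep is worth scripting. A sketch, with my own choices baked in: the path is this cluster's CRS home, the 100MB threshold is arbitrary, and -size +100M is GNU find syntax (fine on RHEL4/SLES9):

```shell
# Sketch: flag large core files under the CRS logging tree before they
# fill the mount. Path and threshold are examples; -size +100M is GNU find.
CRS_LOG=${CRS_LOG:-/u01/app/oracle/product/10.2.0/crs_1/log}
if [ -d "$CRS_LOG" ]; then
  find "$CRS_LOG" -name 'core.*' -size +100M -exec ls -l {} \;
fi
```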
Why Am I Blogging About This?
If the filesystem hosting ORA_CRS_HOME runs out of space, availability can be affected.
Where Were You The First Time You Ever …?
After having seen this, I started to investigate just how big crsd.bin is when it isn’t hitting a bug—and what its CWD is when executing normally. The following was taken from a different cluster, but no matter:
# cd /
# ps -ef | grep crsd.bin
root 9880 1 0 10:53 ? 00:02:36 /u01/app/oracle/product/10.2.0/crs_1/bin/crsd.bin reboot
root 23005 24382 0 17:01 pts/0 00:00:00 grep crsd.bin
# cd /proc/9880
# ls -l cwd
lrwxrwxrwx 1 root root 0 2007-03-22 16:38 cwd -> /u01/app/oracle/product/10.2.0/crs_1/log/tmr18s1/crsd
Eek! I was surprised to find that crsd.bin executes with its CWD in a logging directory, but no matter; just be aware that this logging directory under ORA_CRS_HOME can get some good-sized bombs (core files) dropped into it if things go bad.
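That /proc trick generalizes to any daemon. A sketch, assuming a Linux /proc and pgrep (show_cwd is my own helper name, not a standard tool):

```shell
# Sketch: print the current working directory of a named process.
# On Linux, /proc/<pid>/cwd is a symlink to the process's CWD.
show_cwd() {
  pid=$(pgrep -f "$1" | head -1)
  if [ -n "$pid" ]; then
    readlink "/proc/$pid/cwd"
  fi
}

show_cwd crsd.bin   # prints nothing if no such process is running
```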
Back to the Core Files
With 442MB and 534MB seen so far, it made me wonder just how big crsd.bin normally is. In fact, I thought it best to look at how large crsd.bin is before any databases are even booted:
# ps -ef | grep pmon
root 14147 10188 0 17:21 pts/1 00:00:00 grep pmon
# ps -ef | grep crsd.bin
root 5953 1 0 12:09 ? 00:01:10 /u01/app/oracle/product/10.2.0/crs_1/bin/crsd.bin reboot
root 14329 10188 0 17:21 pts/1 00:00:00 grep crsd.bin
# cat /proc/5953/statm
127760 4207 1859 875 0 121306 0
So, I know this seems a bit like trivial pursuit, but according to /proc, crsd.bin has a base memory utilization of roughly 499MB (the first word of statm is the total program size in pages; 127760 pages at 4KB each is about 499MB). If crsd.bin dumps a core smaller than that, it is because core dumps don't contain the entire process context. Program text, for instance, is not dumped.
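The pages-versus-kilobytes arithmetic is easy to reproduce on any Linux box. A sketch using the current shell as the guinea pig (vsize_kb is a hypothetical helper name of mine):

```shell
# Sketch: compute a process's virtual size from /proc/<pid>/statm.
# The first statm field is the total program size in PAGES, not KB,
# so multiply by the system page size before converting.
vsize_kb() {
  pages=$(awk '{print $1}' "/proc/$1/statm")
  echo $(( pages * $(getconf PAGESIZE) / 1024 ))
}

vsize_kb $$   # virtual size of the current shell, in KB
```

For crsd.bin above: 127760 pages * 4096 bytes = 523,304,960 bytes, i.e. roughly 499MB.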
Start, Disable. Same Thing.
Then there was this other thing I saw today…
Did I really ask to disable the database in the following command?
$ srvctl start database -d PROD
PRKP-1019 : Database PROD already disabled.
If you try to start a database after disabling it, this is the message you’ll get. It is a bit misleading.
Kevin,
statm is in pages. So that's 499 MB.
Remember at RMOUG we discussed the memory footprint of CRSD ? It’s huge.
Christo, you are right. It is pages. The man page on RHAS3, RHEL4 and SuSE all say so. Guess who has it wrong?
http://www.redhat.com/docs/manuals/enterprise/RHEL-3-Manual/ref-guide/s1-proc-directories.html
Quote from that RH webpage:
statm — The status of the memory in use by the process. Below is a sample /proc/statm file:
263 210 210 5 0 205 0
The seven columns relate to different memory statistics for the process. From left to right, they report the following aspects of the memory used:
1. Total program size, in kilobytes.
2. Size of memory portions, in kilobytes.
I will fix the blog entry but leave this comment thread in here for historical purposes.
Thanks for stopping by.
Kevin,
Just wondering what you did to get the DB started again after your full disk issue. I have freed up space, but I still get:
[oracle@etudtk01 etudtk01]$ srvctl start database -d pr
PRKP-1001 : Error starting instance pr1 on node etudtk01
CRS-1028: Dependency analysis failed because of:
CRS-0223: Resource 'ora.pr.pr1.inst' has placement error.
[oracle@etudtk01 etudtk01]$
Any ideas?
Eamon.
On my (newer) kernel the /proc/PID/status is easier. I do not know if it is supported in older kernels.
cat /proc/5952/status | grep VmSize
Out of curiosity, has anyone investigated why crsd.bin is using so much memory? On Red Hat 5.2, x86-64 with 11.1.0.7, I find that the allocated memory is closer to 600 MB. I guess I need to look deeper into the smaps information. Thanks for the heads up.
I would recommend talking to support. My 11.1.0.7 crsd.bin has about 160MB total size and only about 30MB resident:
# ps -elF | egrep '^F|7824' | grep -v grep
F S UID PID PPID C PRI NI ADDR SZ WCHAN RSS PSR STIME TTY TIME CMD
4 S root 7824 5609 0 75 0 – 162037 184466 30588 0 May05 ? 00:01:17 /u01/crs/bin/crsd.bin reboot
The reason clients like srvctl and other commands fail when $CRS_HOME is full is a failure to authenticate. This blog is focused on the size of the core; however, what prevents the DB from being started by srvctl is that the client commands (srvctl) cannot be authenticated. The authentication is implemented such that the client creates a token under a directory in the CRS_HOME. When space is full, the directory creation fails, so the client is not authenticated and therefore fails.
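Given that explanation, a quick sanity check when srvctl throws PRKH-1010 is to verify that the filesystem under the CRS home will accept a new file at all. A sketch (the path is from this post's cluster; the actual token directory is internal to CRS, so this only tests general writability):

```shell
# Sketch: confirm the filesystem under ORA_CRS_HOME accepts writes.
# If this fails, srvctl's token-based authentication will fail too.
crs_writable() {
  t="$1/.srvctl_write_test.$$"
  if touch "$t" 2>/dev/null; then
    rm -f "$t"
    echo "$1 is writable"
  else
    echo "$1 is NOT writable (full filesystem or wrong permissions?)" >&2
  fi
}

crs_writable /u01/app/oracle/product/10.2.0/crs_1   # example path
```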
About the size of crsd.bin being huge: the next patchset on top of 11.2.0.1 has some fixes that will improve this considerably.
Thanks,
Anil Nair
Global Technical Lead /RAC and Oracle Clusterware
Oracle Support.
Thanks for stopping by, Anil.