I’ve been doing a lot of 4 node 10gR2 RAC stuff lately on 4 different clusters. I ran into a problem that I thought I’d share since I can’t find very much on the web or in Metalink. It is a real headache, but “easy” to fix. I put easy in quotes because it is only easy to fix if you know what to do. First, the problem.
ORA-30012
I was running a Real Application Clusters stress test that I developed called thrash. It is not sophisticated, but it does put a lot of pain on the instances, servers and storage. It consists of staggered instance rebooting followed by the creation of a small tablespace—one per instance. Once the tablespaces are created, a set of sqlplus sessions alter the tablespaces adding a significant number of random sized data files. The tablespaces are dropped and the instances are rebooted in a staggered fashion. As I said, each instance is sustaining this workload. I know it is nothing like a production workload. It’s just one test I run in hopes of exposing file manipulation bugs associated with tablespace creation, datafile addition and tablespace dropping. That is, I’m looking for filesystem bugs exposed by Oracle tablespace manipulation.
Bumps in the Road
On occasion I was getting instances that would stop participating in the thrash. I read the alert logs and found that the error was ORA-30012—which made no sense to me since it was happening out of the blue. In fact, it was happening after running thrash for as long as 12 hours. To show you the oddity of these alert log entries, I’ll provide grep(1) output from all the alert logs (ORACLE_BASE is on a CFS):
$ grep ‘does not exist or of wrong type’ *log
alert_PROD2.log:ORA-30012: undo tablespace ‘UNDOTBS2’ does not exist or of wrong type
alert_PROD2.log:ORA-30012: undo tablespace ” does not exist or of wrong type
alert_PROD3.log:ORA-30012: undo tablespace ‘UNDOTBS3’ does not exist or of wrong type
alert_PROD3.log:ORA-30012: undo tablespace ” does not exist or of wrong type
alert_PROD4.log:ORA-30012: undo tablespace ‘UNDOTBS4’ does not exist or of wrong type
How strange! What is an undo tablespace with a NULL name? Part of the SPFILE is shown later in this post establishing the fact that I assign undo to instances explicity so what gives?
The following is a snippet to show that the error was occuring during ALTER DATABASE OPEN:
Errors in file /u01/app/oracle/admin/PROD/udump/prod2_ora_1569.trc:
ORA-30012: undo tablespace ‘UNDOTBS2’ does not exist or of wrong type
Fri Mar 16 04:41:38 2007
Error 30012 happened during db open, shutting down database
USER: terminating instance due to error 30012
Instance terminated by USER, pid = 1569
ORA-1092 signalled during: ALTER DATABASE OPEN…
The trace file wasn’t much help:
$ more /u01/app/oracle/admin/PROD/udump/prod2_ora_1569.trc
/u01/app/oracle/admin/PROD/udump/prod2_ora_1569.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 – 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /u01/app/oracle/product/10.2.0/rac_1
System name: Linux
Node name: qar14s22
Release: 2.6.9-42mxs351RHELupdate4
Version: #1 SMP Tue Mar 6 16:37:53 PST 2007
Machine: x86_64
Instance name: PROD2
Redo thread mounted by this instance: 2
Oracle process number: 16
Unix process pid: 1569, image: oracle@qar14s22 (TNS V1-V3)
*** SERVICE NAME:() 2007-03-16 04:41:35.038
kspgetpeeq: kspasci not KSPASCNOP (0x110001 != 0x0)
*** SESSION ID:(137.25) 2007-03-16 04:41:38.788
ORA-30012: undo tablespace ‘UNDOTBS2’ does not exist or of wrong type
After exhausting my patience spelunking for information on the web and Metalink, I asked my fellow OakTable Network members. Jože Senegačnik of dbprof.com replied with:
If this is a RAC database then you need to specify the instance name in the init.ora or spfile together with the undo_tablespace parameter.
Where did he come up with that? I asked him if he’d actually hit this before. His answer was:
Yes, a couple of weeks ago I have experienced it on RAC on Windows. One node had problems with the undo_tablespace parameter after an unplanned database restart. The undo_tablespace parameter was changed in runtime months before but obviously it lacked the SID information and this caused problems after database restart.
A DBCA NO-NO
Folks, this database was created with DBCA. I don’t know how I haven’t seen this issue before, but the problem is that DBCA does not configure the SPFILE with explicit assignments for the INSTANCE_NAME parameter. For example, the following are a couple of strings(1)|grep(1) command pipelines that would return the 4 assigments of the INSTANCE_NAME parameter had DBCA set it up that way:
$ strings – spfilePROD.ora | grep –I instance_name
$ strings – spfilePROD.ora | grep ‘^PROD’
PROD2.__db_cache_size=71303168
PROD4.__db_cache_size=67108864
PROD3.__db_cache_size=75497472
PROD1.__db_cache_size=71303168
PROD3.__java_pool_size=4194304
PROD2.__java_pool_size=4194304
PROD4.__java_pool_size=4194304
PROD1.__java_pool_size=4194304
PROD3.__large_pool_size=4194304
PROD2.__large_pool_size=4194304
PROD4.__large_pool_size=4194304
PROD1.__large_pool_size=4194304
PROD2.__shared_pool_size=79691776
PROD4.__shared_pool_size=83886080
PROD3.__shared_pool_size=75497472
PROD1.__shared_pool_size=79691776
PROD3.__streams_pool_size=0
PROD2.__streams_pool_size=0
PROD4.__streams_pool_size=0
PROD1.__streams_pool_size=0
PROD3.instance_number=3
PROD4.instance_number=4
PROD2.instance_number=2
PROD1.instance_number=1
PROD3.thread=3
PROD2.thread=2
PROD4.thread=4
PROD1.thread=1
PROD1.undo_tablespace=’UNDOTBS1′
PROD3.undo_tablespace=’UNDOTBS3′
PROD4.undo_tablespace=’UNDOTBS4′
PROD2.undo_tablespace=’UNDOTBS2′
Remedy
It was exactly what Joze said. I set explicit assignments for the ISNTANCE_NAME parameters as follows and the problem has gone away.
PROD1.INSTANCE_NAME=PROD1
PROD2.INSTANCE_NAME=PROD2
PROD3.INSTANCE_NAME=PROD3
PROD4.INSTANCE_NAME=PROD4
Hope this helps some googler someday.
hmmm I checked a recent installation of 10.2.0.2 on hp-ux – well i looked into the alert log file (do not have access to the installation at this moment in time), and found that the instance_name was not to be found under the section:
.
System parameters with non-default values:
.
I do see the instance_number, however.
.
That said, I came across a similar issue – dbca not setting init.ora parameter(s). For me it was the local_listener parameter. The impact of this was that clients needed to have the public hostnames of the nodes in their local hosts files in addition to vip hostname.
I have my network listener running out of the ASM ORACLE_HOME – not sure if this could have been the reason why dbca didn’t plug in the values.
Naqi
Hi.
There are many bugs in DBCA,
so personally i don’t trust it anymore.
I do check generated by DBCA scripts and
then personally create db – this is for production DBs.
I have faced another bug with DBCA on HP-UX
when i was able to create DB using DBCA only once per cluster node.
When i was running DBCA for creating second DB from the same node DBCA just hang after pressing FINISH button.
Started from another node DBCA worked fine
And it was consistent problem.
DBCA is really buggy tool…
Alex
Regarding DBCA not setting local_listener, this is documented in the install guide… When you create scripts for a RAC database from a DBCA template they leave off a few important steps at the end (for example, say, setting CLUSTER_DATABASE=true… not that cluster_database is an important parameter or anything…) Appendix B has instructions for creating RAC databases with DBCA-generated scripts — and basically you have to do a few manual steps after running them (one of which is setting local_listener).
However the documentation seems a bit lacking to me… for example step 7 just says “configure srvctl” with a pointer off to the admin guide. I share the sentiment of being a bit more comfortable with scripts although in practice I tend to use DBCA most of the time. Also, FWIW it’s not hard to see what DBCA is actually doing; if you enable tracing it seems to be pretty verbose about what’s going on.
I agree with Alex in that dbca buggy. DBCA on HPUX 11.23 ia64 (SG SMS CFS for ORacle RAC) is buggy with Veritas ODM even when patched with 10.2.0.3 in that it won’t complete even to creation page whilst complaining that the shared storage is not accessible from all nodes.
The workaround: Veritas ODM has to be unlinked (ie default linkage with dummy ODM driver libodmd10.so) when running DBCA and then relinked after creating the database.
I created the scripts via the custom database option and use those scripts as a sample for that db release for future db installations (been doing that since 9.2.0). I modify most things in the DBNAME.sql and DBNAME.sh with some customization and even pass in the logging directory and listener TCP port for postDBCreation.sql as a parameter to reduce mistakes in editing. This modification is significant if the cluster is to contain more than one database.
Good thing about CFS is it allows a central location for spfile and pfile. The initSID.ora just contains a pointer to either one.
One other blunder I ran into with 10g initially is that registering the db with OCR (srvctl add database/instance commands) has to be done under OSDBA account and not CRS (owned by root).
The reason is that starting it later creates the hidden files in $ORACLE_HOME/racg/init///.connect to be owned by root. Running svrctl start database -d DBNAME will return starting errors and such. Reason? Those files were owned by root instead of OSDBA account and hence we get read permission disguised under “could not stat $ORACLE_HOME/racg/init///.connect” in the log files.
# ls $ORACLE_HOME/racg/init///
drxwr—– 2 root root 96 09:50 DBNAME
instead of
drxwr—– 2 oracle oinstall 96 09:50 DBNAME
Normally dbca will do this but I was running it manually since 9.2.0. The thing is in 9i, the cluster registry file (srvmconfig) is still owned by the OSDBA account rather than root in 10g. I didn’t see put the two toegther until I went spelunking into the directory above and saw the history of the commands I ran as root and OSDBA.
Cheers!