Direct I/O Can Crash Dataguard. A Tricky ORA-01031.

BLOG CORRECTION: When explaining the use of the ORASYM environment variable to a reader, I noticed I typed it and assigned it to the wrong value. A double-decker bug! This entry corrects that bug. Please read the comments for context.

If improperly configured as per your platform, that is.

A Tricky ORA-01031 Error Case
I noticed something very odd today in the Linux x86_64 port of Oracle10gR2. If you set filesystemio_options=directIO and start up the instance (without connecting through the listener), subsequent connects as SYSDBA through the listener will open the orapw file with O_DIRECT. Huh? Yes, a normal, non-shared file opened O_DIRECT. I don’t know when that started, or what other platforms do this, but if the orapw file is located on a filesystem that does not support open(2) with O_DIRECT, you get an ORA-01031. This was no big deal really, other than the mystery, since I store my database on PolyServe with the direct I/O mount option. That is, the filesystemio_options=directIO is redundant for the database files and—as it turns out—creates a problem if the orapw file is on PSFS with regular mount options. I could simply omit the init.ora setting. But I’d like to cover this anyway.

It seems this was hit before by someone back in Oracle9iR2 days as per Metalink 3312751. So, as I was saying, I have a couple of simple solutions for my particular case where I hit this, but I thought there might be someone out there that would enjoy seeing the way I figured out it was the orapw file that was causing me trouble. But first, what did this do to Dataguard in my environment?

Mixing Dataguard, O_DIRECT and the orapw File
It may be that nobody on earth will ever hit this problem. I don’t know. But when it happened to me I had shut down my Dataguard primary, set filesystemio_options= directIO (for some test or another) and then went off and did other things. You know the memory is the first thing to go. “All of the sudden”, my Dataguard setup was not working and I could not remember what had changed (test gear). After starting my Dataguard primary back up (with filesystemio_options=directIO), I saw the following little treat in my alert log:

Thu Feb 22 10:12:46 2007
Errors in file /u01/app/oracle/admin/PROD/bdump/prod_arc0_6977.trc:
ORA-01017: invalid username/password; logon denied
ORA-27041: unable to open file
Linux-x86_64 Error: 22: Invalid argument
Additional information: 2
Thu Feb 22 10:12:46 2007
Error 1017 received logging on to the standby
————————————————————
Check that the primary and standby are using a password file
and remote_login_passwordfile is set to SHARED or EXCLUSIVE,
and that the SYS password is same in the password files.
returning error ORA-16191
It may be necessary to define the DB_ALLOWED_LOGON_VERSION
initialization parameter to the value “10”. Check the
manual for information on this initialization parameter.

OK, forget for a moment that Metalink note 259142.1 states that DB_ALLOWED_LOGON_VERSION is replaced by a setting in the sqlnet.ora file yet this Oracle10gR2 Dataguard primary is spitting out mention of that as possible remedy for the problem. I want to go over how I figured out what the problem actually was.

Oracle Binary Wrapper
Since the problem at hand was the inability to connect as SYSDBA through the listener, I would not be able to simply run sqlplus under strace (or truss). The listener is the parent of the dedicated server process in that situation. So, I had to implement a wrapper around the Oracle binary. To do this you simply mv $ORACLE_HOME/bin/oracle to something like $ORACLE_HOME/bin/oracle.bin and then create a script like this:

-bash-3.00$ cat $ORACLE_HOME/bin/oracle
#!/bin/bash
export ORASYM=/u01/app/oracle/product/10.2.0/db_1/bin/oracle.bin
if [ -f /tmp/trace ]
then
strace -f -o /tmp/trace.$$ /u01/app/oracle/product/10.2.0/db_1/bin/oracle.bin

else

exec /u01/app/oracle/product/10.2.0/db_1/bin/oracle.bin $*
fi

I didn’t create the /tmp/trace file until after I started the instance. This is how you can have background processes executing without single stepping (e.g., strace, truss) all Oracle processes. Instead you target a specific connection to the database by creating the /tmp/trace file immediately before connecting and then removing the file so that other connections are normal. I then ran sqlplus and generated the ORA-01031 error:

-bash-3.00$ sqlplus /nolog
SQL*Plus: Release 10.2.0.1.0 – Production on Thu Feb 22 12:44:28 2007
Copyright (c) 1982, 2005, Oracle. All rights reserved.
SQL> conn sys@PROD/test as sysdba
ERROR:
ORA-01031: insufficient privileges
SQL> exit

The shadow process PID was 20209 so I grep(1)ed for orapw from /tmp/trace.20209 and found the culprit:

NOTE: Right click-> view the image

As the strace output revealed, the shadow process created through the listener uses O_DIRECT when opening the orapw file. Trivial pursuit!

Solution
The solution is very platform specific as I was saying, but it is not inconceivable to hit this on other platforms. In my case I simply don’t run with filesystemio_options= directIO because I store all my database files in PolyServe(HP) PSFS with the direct I/O mount option anyway. Works just fine, but I thought I’d elaborate on the point by relocating the orapw file to a mount that supports O_DIRECT and then use a symlink to $ORACLE_HOME/dbs followed by a successful connection as follows:

-bash-3.00$ cp /u01/app/oracle/product/10.2.0/db_1/dbs/orapwPROD \ /u01/app/oracle/oradata/PROD/orapwPROD
-bash-3.00$ rm /u01/app/oracle/product/10.2.0/db_1/dbs/orapwPROD
-bash-3.00$ ln -s /u01/app/oracle/oradata/PROD/orapwPROD \ /u01/app/oracle/product/10.2.0/db_1/dbs/orapwPROD
-bash-3.00$ df /u01/app/oracle/oradata/PROD
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/psd/psd2p2 1225017312 3617672 1221399640 1% /u02
-bash-3.00$ mount | grep u02
/dev/psd/psd2p2 on /u02 type psfs (rw,dboptimize,shared,data=ordered)
-bash-3.00$ sqlplus /nolog

NAME TYPE VALUE
———————————— ———– ——————————
filesystemio_options string directIO

Forward Look
I’ll be making a blog entry quite soon about how O_DIRECT can save memory for non-database activity like archived redo log compression.

6 Responses to “Direct I/O Can Crash Dataguard. A Tricky ORA-01031.”

Feed for this Entry Trackback Address

1 Alex Gorbachev February 23, 2007 at 4:28 am

It’s really a pity that due to stupid bugs like that we have to introduce more complexity in the areas such as DR that simply cannot afford to be complex.

Well, one more link is not a big deal but who knows when it will be forgotten and who will delete that password file and why would it happen just an hour before failover to standby.

2 kevinclosson February 23, 2007 at 4:40 am

Alex,

Be aware that this is only an issue if you use filesystemio_options=directIO|setall **AND** your Oracle Home is on a filesystem that doesn’t support open O_DIRECT. NFS is safe, Ext3 is safe, PolyServe is safe. I haven’t checked, but I’m fairly sure ocfs2 would be safe. I don’t know about the boutique offering on the Legacy Unix platforms at this point. As for the symlink, I only used it to show that would fix the problem. As I pointed out, filesystemio_options does not get set to directIO on PolyServe so it is a non-issue.

3 Connor March 14, 2007 at 4:17 am

Hi Kevin

What is the significance of the

export ORA_SYM=/u01/app/oracle/product/10.2.0/db_1/bin/oracle

line in your demo ?

Cheers
Connor

4 kevinclosson March 14, 2007 at 4:31 am

Hi Connor,

If Oracle is executing out of $ORACLE_HOME/bin/oracle and that happens to not be the actual binary, trace files are a bit light on info should they need to be generated 🙂 It is old habit, but it is the correct thing to do if you are using a wrapper around the binary.

Now, having said that, you just pointed out a bug 🙂 That should be:

ORASYM=$ORACLE_HOME/bin/oracle.bin

$ uname -a
Linux tmr6s13 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
$ pwd
/u01/app/oracle/product/10.2.0/db_1/bin
$ strings – oracle | grep -i ORASYM
ORASYM
ORASYM

Good thing I didn’t need a stack trace.

- 5 DCT March 18, 2010 at 3:35 pm
  
  Thanks for taking the time to post this – my Oracle familiarity was adaquate to realize the pwd file access was the problem; my Linux-fu was not up to realizing the change to Direct I/O (to resolve some temporary memory issues) was the root cause!

1 Cheat Code » Direct I/O Can Crash Dataguard. Tricky ORA-01031. Trackback on February 24, 2007 at 2:36 pm

	kevinclosson on Announcing SLOB 2.5.4
	Hell Dip on Announcing SLOB 2.5.4
	kevinclosson on Introducing SLOB – The S…
	Amey Bobade on Introducing SLOB – The S…
	Amey Bobade on Introducing SLOB – The S…

Kevin Closson's Blog: Platforms, Databases and Storage