Archive Page 34

Ouch! Migrating an Oracle7 Database from OpenVMS to Oracle10g on Linux.

I’ve been following the thread over at Yet Another Oracle DBA Blog about the effort to migrate a very complex Oracle7 database on OpenVMS (old Digital gear) to Oracle10g on Linux. I hope Herod will give some details about the hardware used for the Linux system. It looks like they are about 6 days into the data migration phase of the project. Here is a snippet describing the system being migrated from:

Oracle database version 7, with an astounding 3,319 tables in a single schema with only one primary key constraint created on one table. No database referential integrity at all. 891 procedures, 4319 triggers, 771 functions (no packages).

That sounds like a mess! This would be quite a project for anyone to take on. Give the site a visit and see what you think:

http://yaodba.blogspot.com/2007/01/new-project-time-for-new-job.html

http://yaodba.blogspot.com/2007/01/conversion-of-07.html

http://yaodba.blogspot.com/2007/01/conversion-of-07-still-processing.html

Using Linux sched_setaffinity(2) To Bind Oracle Processes To CPUs

I have been exploring the effect of process migration between CPUs in a multi-core Linux system while running long-duration Oracle jobs. While Linux schedules processes as best it can for L2 cache affinity, I do see migrations on my HP DL585 Opteron 850 box. Cache affinity is important, and routine migrations can slow down long-running jobs. In fact, when a process gets scheduled to run on a CPU different from the one it last ran on, the CPU will stall immediately while the cache is loaded with the process’s page tables—regardless of cache warmth. That is, the cache might have pages of text, data, stack and shared memory, but it won’t have the right versions of the page tables. Bear in mind that we are talking about really small stalls here, but on long-running jobs they can add up.

CPU_BIND
This Linux Journal webpage has the source for a program called cpu_bind that uses the Linux 2.6 sched_setaffinity(2) library routine to establish hard affinity for a process to a specified CPU. I’ll be covering more of this in my NUMA series, but I thought I’d make a quick blog entry about it now to get the ball rolling.

After downloading the cpu_bind.c program, it is simple to compile and execute. The following session shows compilation and execution to set the PID of my current bash(1) shell to execute with hard affinity on CPU 3:

$ cc -o cpu_bind cpu_bind.c
$ cpu_bind $$ 3
$ while true
> do
> :
> done

The following is a screen shot of top(1) with CPU 3 utilized 100% in user mode by my looping shell. Note, you may have to right-click->view image:

top1

If you wanted to experiment with Oracle, you could start a long running job and execute cpu_bind on its PID once it is running, or do what I did with $$ and then invoke sqlplus, for instance. Also, a SQL*Net listener process could be started with hard affinity to a certain CPU and you could connect to it when running a long CPU-bound job. Just a thought, but I’ll be showing real numbers in my NUMA series soon.

Give it a thought, see what you think.

The NUMA series links are:

Oracle on Opteron with Linux–The NUMA Angle (Part I)

Oracle on Opteron with Linux-The NUMA Angle (Part II)

An 8GHz Pentium Processor

A fellow member of the Oaktable Network sent out an interesting URL. So how badly do you think a Pentium 4 overclocked to 8GHz would stall on memory loads? Can you say, “CPI?”

RMOUG Schedule

Just a quick update about RMOUG. The speaking schedule has been cast in stone so I updated my appearances section of the blog. I hope to see you at RMOUG—it is a very good conference!

Microsoft-Minded People Covering Shared Data Clustering

Just FYI.

Since shared data clustering is a little foreign to the Microsoft community, I watch how they cover products like PolyServe Matrix Server. Here is an example:

Microsoft Certified Professional Online Coverage of PolyServe Matrix Server

High Availability…MySpace.com Style

I was checking out Paul Vallee’s comments about MySpace’s definition of uptime. It seems others are seeing spotty uptime with this poster child of the Web 2.0 phenomenon.

I’m watching MySpace for other reasons though. They have deployed the Isilon IQ Clustered Storage solution for serving up the video content. Isilon is a competitor of my company, PolyServe. Isilon is good at what they do—read intensive workloads (e.g., streaming media). I don’t like the fact that it is a hardware/software solution. I’m a much bigger fan of being free of vendor lock-in. In the end, I’m an Oracle guy and Isilon can’t do Oracle so that’s that.

Anyway, another thing that is interesting about Web 1.0 and now Web 2.0 shops is the odd amount of “IT Street Cred” they seem to get. Folks like Amazon, eBay and now MySpace are not IT shops, really. They have gargantuan technology staffs, and their IT budgets are not representative of normal companies. Basically, they can take the oddest of technology combinations and throw tremendous headcount of very gifted people at the problem to make it work. Not your typical COTS shop.

Now, having said that, are these shops solving interesting problems? Sure. Would any normal Oracle shop be able to do things the way, say, Amazon does? Likely not. Back in 2004, Amazon admitted to an IT budget of USD $64 million before some $16 million in savings realized one way or another by deploying Linux.

Oracle on Opteron with Linux-The NUMA Angle (Part II)

A little more groundwork. Trust me, the Linux NUMA API discussion that is about to begin and the microbenchmark and Oracle benchmark tests will make a lot more sense with all this old boring stuff behind you.

Another Terminology Reminder
When discussing NUMA, the term node is not the same as in clusters. Remember that all the memory from all the nodes (or Quads, QBBs, RADs, etc) appear to all the processors as cache-coherent main memory.

More About NUMA Aware Software
As I mentioned in Oracle on Opteron with Linux–The NUMA Angle (Part I), NUMA awareness is a software term that refers to kernel and user mode software that makes intelligent decisions about how to best utilize resources in a NUMA system. I use the generic term resources because as I’ve pointed out, there is more to NUMA than just the non-uniform memory aspect. Yes, the acronym is Non Uniform Memory Access, but the architecture actually supports the notion of having building blocks with only processors and cache, only memory, or only I/O adaptors. It may sound really weird, but it is conceivable that a very specialized storage subsystem could be built and incorporated into a NUMA system by presenting itself as memory. Or, on the other hand, one could envision a very specialized memory component—no processors, just memory—that could be built into a NUMA system. For instance, think of a really large NVRAM device that presents itself as main memory in a NUMA system. That’s much different than an NVRAM card stuffed into something like a PCI bus and accessed with a device driver. Wouldn’t that be a great place to put an in-memory database for instance? Even a system crash would leave the contents in memory. Dealing with such topology requires the kernel to be aware of the differing memory topology that lies beneath it, and a robust user mode API so applications can allocate memory properly (you can’t just blindly malloc(3) yourself into that sort of thing). But alas, I digress since there is no such system commercially available. My intent was merely to expound on the architecture a bit in order to make the discussion of NUMA awareness more interesting.

In retrospect, these advanced NUMA topics are the reason I think Digital’s moniker for the building blocks used in the AlphaServer GS product line was the most appropriate. They used the acronym RAD (Resource Affinity Domain), which greatly opens up the possible list of ingredients. An API call would return RAD characteristics such as how many processors and how much memory (if any) a RAD consisted of, and so on. Great stuff. I wonder how that compares to the Linux NUMA API? Hmm, I guess I better get to blogging…

When it comes to the current state of “commodity NUMA” (e.g., Opteron and Itanium) there are no such exotic concepts. Basically, these systems have processors and memory “nodes” with varying latency due to locality—but I/O is equally costly for all processors. I’ll speak mostly of Opteron NUMA with Linux since that is what I deal with the most and that is where I have Oracle running.

For the really bored, here is a link to an AlphaServer GS320 diagram.

The following is a diagram of the Sequent NUMA-Q components that interfaced with the SHV Xeon chipset to make systems with up to 64 processors:

lynx1.jpg

OK, I promise, the next NUMA blog entry will get into the Linux NUMA API and what it means to Oracle.

Migrate from Windows to Linux. The Stupid Quote of the Day.

While I prefer Linux over Windows for Oracle (purely personal preference), I think this Linux Journal webpage has the Stupid Quote of the Day Award:

The smartest move for anyone to make is to migrate from Windows to Linux.

Techno-Religious fanaticism at its best! Way to go!

What Does This Have to do With Oracle?
As I pointed out in my blog entry about Oracle revenue from Windows deployments, Larry still makes more money from Windows deployments than Linux. Yes, these are CY2005 numbers, we’ll have to see what 2006 looks like. I suspect more of the same honestly. That is, if those numbers are ever revealed.

Windows or Linux for Oracle is a choice that can only be made by each IT shop. If you are a Windows shop, you’ll choose Windows. If you are a traditional Unix shop, and want to play in the commodity space, you’ll go with Linux.

Yes Direct I/O Means Concurrent Writes. Oracle Doesn’t Need Write-Ordering.

If Sir Isaac Newton were walking about today dropping apples to prove his theory of gravity, he’d feel about like I do making this blog entry. The topic? Concurrent writes on file system files with Direct I/O.

A couple of months back, I made a blog entry about BIGFILE tablespaces in ASM versus modern file systems. The controversy at the time was about the dreadful OS locking overhead that must surely be associated with using large files in a file system. I spent a good deal of time tending to that blog entry, pointing out that the world is no longer flat and that such age-old concerns over OS locking overhead are no longer relevant on modern file systems. Modern file systems support Direct I/O, and one of the subtleties that seems to have been lost in the definition of Direct I/O is the elimination of the write-ordering locks that are required for regular file system access. The serialization is normally required so that if two processes should write to the same offset in the same file, one entire write must occur before the other—thus preventing fractured writes. With databases like Oracle, no two processes will write to the same offset in the same file at the same time. So why have the OS impose such locking? It doesn’t with modern file systems that support Direct I/O.

In regards to the blog entry called ASM is “not really an optional extra” With BIGFILE Tablespaces, a reader posted the following comment:

“node locks are only an issue when file metadata changes”
This is the first time I’ve heard this. I’ve had a quick scout around various sources, and I can’t find support for this statement.
All the notes on the subject that I can find show that inode/POSIX locks are also used for controlling the order of writes and the consistency of reads. Which makes sense to me….

Refer to:
http://www.ixora.com.au/notes/inode_locks.htm

Sec 5.4.4 of
http://www.phptr.com/articles/article.asp?p=606585&seqNum=4&rl=1

Sec 2.4.5 of
http://www.solarisinternals.com/si/reading/oracle_fsperf.pdf

Table 15.2 of
http://www.informit.com/articles/article.asp?p=605371&seqNum=6&rl=1

Am I misunderstanding something?

And my reply:

…in short, yes. When I contrast ASM to a file system, I only include direct I/O file systems. The number of file systems and file system options that have eliminated the write-ordering locks is a very long list starting, in my experience, with direct async I/O on Sequent UFS as far back as 1991 and continuing with VxFS with Quick I/O, VxFS with ODM, PolyServe PSFS (with the DBOptimized mount option), Solaris UFS post Sol8-U3 with the forcedirectio mount option and others I’m sure. Databases do their own serialization so the file system doing so is not needed.

The ixora and solarisinternals references are very old (2001/2002). As I said, Solaris 8U3 direct I/O completely eliminates write-ordering locks. Further, Steve Adams also points out that Solaris 8U3 and Quick I/O were the only ones they were aware of, but that doesn’t mean VxFS ODM (2001), Sequent UFS (starting in 1992) and ptx/EFS, and PolyServe PSFS (2002) weren’t all supporting completely unencumbered concurrent writes.

Ari, thanks for reading and thanks for bringing these old links to my attention. Steve is a fellow Oaktable Network Member…I’ll have to let him know about this out of date stuff.

There is way too much old (and incomplete) information out there.

A Quick Test Case to Prove the Point
The following screen shot shows a shell process on one of my ProLiant DL585s running RHEL 4 and the PolyServe Database Utility for Oracle. The session is using the PolyServe PSFS filesystem mounted with the DBOptimized mount option, which supports Direct I/O. The test consists of a single dd(1) process overwriting the first 8GB of a file that is a little over 16GB. The first invocation of dd(1) writes 2097152 4KB blocks in 283 seconds for an I/O rate of 7,410 writes per second. The next test consisted of executing 2 concurrent dd(1) processes, each writing a 4GB portion of the file. Bear in mind that the age-old, decrepit write-ordering locks of yesteryear serialized writes. Without bypassing those write locks, two concurrent write-intensive processes cannot scale their writes on a single file. The screen shot shows that the concurrent write test achieved 12,633 writes per second. Although 12,633 represents only 85% scale-up, remember, these are physical I/Os—I have a lot of lab gear, but I’d have to look around for a LUN that can do more than 12,633 IOps and I wanted to belt out this post. The point is that on a “normal” file system, the second go-around of foo.sh with two dd(1) processes would take the same amount of time to complete as the single dd(1) run. Why? Because both tests have the same amount of write payload, and if the second foo.sh suffered serialization the completion times would be the same:

conc_write2.JPG

Oracle on Opteron with Linux–The NUMA Angle (Part I)

There are Horrible Definitions of NUMA Out There on the Web
I want to start blogging about NUMA with regard to Oracle because NUMA has reached the commodity hardware scene with Opteron and HyperTransport technology. Yes, I know Opteron has been available for a long time, but it wasn’t until the Linux 2.6 kernel that there were legitimate claims of the OS being NUMA-aware. Before I can start blogging about NUMA/Oracle on Opteron related topics, I need to lay down some groundwork.

First, I’ll just come out and say it, I know NUMA—really, really well. I spent the latter half of the 1990’s inside the Sequent Port of Oracle working out NUMA-optimizations to exploit Sequent NUMA-Q 2000—the first commercially available NUMA system. Yes, Data General, SGI and Digital were soon to follow with AViiON, Origin 2000 and the AlphaServer GS320 respectively. The first port of Oracle to have code within the kernel specifically exploiting NUMA architecture was the Sequent port of Oracle8i.

 

Glossary
I’d like to offer a couple of quick definitions. The only NUMA that matters where Oracle is concerned is Cache Coherent NUMA (a.k.a. CC-NUMA):

NUMA – A microprocessor-based computer system architecture comprised of compute nodes that possess processors and memory and usually disk/network I/O cards. A CC-NUMA system has specialized hardware that presents all the varying memory components as a single memory image to the processors. This has historically been accomplished with crossbar, switch or SCI ring technologies. In the case of Opteron, NUMA is built into the processor since each processor has an on-die memory controller. Understanding how a memory reference is satisfied in a NUMA system is the most important aspect of understanding NUMA. Each memory address referenced by the processors in a NUMA system is essentially “snooped” by the “NUMA memory controller”, which in turn determines if the memory is local to the processor or remote. If remote, the NUMA “engine” must perform a fetch of the memory and install it into the requesting processor cache (which cache depends on the implementation, although most have historically implemented an L3 cache for this remote-memory “staging”). The NUMA “engine” has to be keenly tuned to the processor’s capabilities since all memory-related operations have to be supported, including cache line invalidations and so forth. Implementations have varied wildly since the early 1990s. There have been NUMA systems that were comprised of complete systems linked by a NUMA engine. One such example was the Sequent NUMA-Q 2000, which was built on commodity Intel-based Pentium systems “chained” together by a very specialized piece of hardware that attached directly to each system bus. That specialized hardware was called the Lynx card, which had an OBIC (Orion Bus Interface Controller) and a SCLIC (SCI Line Interface Controller) as well as 128MB of L3 remote cache. On the Lynx card was a 510-pin GaAs ASIC that served as the “data pump” of the NUMA “engine”.
These commodity NUMA “building blocks” were called “Quads” because they had 4 processors, local memory, and local network and disk I/O adaptors—a lot of them. Digital referred to their physical building blocks as QBBs (Quad Building Blocks) and logically (in their API, for instance) as “RADs”, for Resource Affinity Domains. In the case of Opteron, each processor is considered a “node” with only CPU and memory locality. With Opteron, network and disk I/O are uniform.

NUMA Aware – This term applies to software. NUMA-aware software is optimized for NUMA such that the topology is understood and runtime decisions can be made, such as what segment of memory to allocate from or what adaptor to perform I/O through. The latter, of course, does not apply to Opteron. NUMA awareness starts in the kernel, and with a NUMA API, applications too can be made NUMA aware. The Linux 2.6 kernel has had NUMA awareness built in—to a certain extent—and there has been a NUMA API available for just as long. Is the kernel fully NUMA-optimized? Not by any stretch of the imagination. Is the API complete? No. Does that mean the Linux NUMA-related technology is worthless? That is what I intend to blog about.

Some of the good engineers that built NUMA-awareness into the Sequent NUMA-Q operating system—DYNIX/ptx—have contributed NUMA awareness to Linux through their work in the IBM Linux Technology Center. That is a good thing.

This thread on Opteron and Linux NUMA is going to be very Oracle-centric and will come out as a series of installments. But first, a trip down memory lane.

The NUMA Stink
In the year 2000, Sun was finishing a very anti-NUMA campaign. I remember vividly the job interview I had with Sun’s Performance, Availability and Architecture Engineering (PAE) group led by Ganesh Ramamurthy. Those were really good guys, I enjoyed the interview, and I think I even regretted turning down their offer so I could instead work in the Veritas Database Editions group on the Oracle Disk Manager library. One of the prevailing themes during that interview was how hush-hush, wink-wink they were about using the term NUMA to describe forthcoming systems such as StarCat. That attitude even showed in the following Business Review Online article, where Sun’s VP of Enterprise Systems in that time frame stated:

“We don’t think of the StarCat as a NUMA or COMA server,” he said. “This server has SMP latencies, and it is just a bigger, badder Starfire.”

No, it most certainly isn’t a COMA (although it did implement a few of the aspects of COMA) and it most certainly has always been a NUMA. Oops, I forgot to define COMA…next entry…and, oh, Opteron has made saying NUMA cool again!

 

A Day for Typos. Let’s move the “c” and “n” Keys, OK?

Two typos in one session. If ci(1) and nash(8) are important, I think we should move “c” far away from “v” and “n” far from “b” on the QWERTY keyboard. When I think vi and bash, I’m not thinking ci(1) and nash(8)…Votes?

$ uname -a
Linux tmr6s13 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linu
$ cd /tmp
$ mkdir foo
$ cd foo
$ cat > foo.sh
echo this is a dumb script
$ chmod +x foo.sh
$ bash foo.sh
this is a dumb script
$ nash foo.sh
(running in test mode).
Red Hat nash version 4.2.1.6 starting
(echo) this is a dumb script
$ ci foo.sh
foo.sh,v  <--  foo.sh
enter description, terminated with single ‘.’ or end of file:
NOTE: This is NOT the log message!
>>
RCS: Quit
RCS: Cleaning up.
$ type nash
nash is hashed (/sbin/nash)
$ bash --help
GNU bash, version 3.00.15(1)-release-(x86_64-redhat-linux-gnu)
Usage: bash [GNU long option] [option] ...
bash [GNU long option] [option] script-file ...
GNU long options:
--debug
--debugger
--dump-po-strings
--dump-strings
--help
--init-file
--login
--noediting
--noprofile
--norc
--posix
--protected
--rcfile
--rpm-requires
--restricted
--verbose
--version
--wordexp
Shell options:
-irsD or -c command or -O shopt_option (invocation only)
-abefhkmnptuvxBCHP or -o option
Type `bash -c "help set"' for more information about shell options.
Type `bash -c help' for more information about shell builtin commands.
Use the `bashbug' command to report bugs.

The Decommissioning of the Oracle Storage Certification Program

I’ve known about this since December 2006, but since the cat is out of the proverbial bag, I can finally blog about it.

Oracle has taken another step to break down Oracle-over-NFS adoption barriers. In the early days of Oracle supporting deployments of Oracle over NFS, the Oracle Storage Compatibility Program (OSCP) played a crucial role in ensuring a particular NAS device was suited to the needs of an Oracle database. Back then the model was immature, but a lot has changed since then. In short, if you are using Oracle over NFS, storage-related failure analysis is as straightforward as it is with a SAN. That is, it takes Oracle about the same amount of time to determine a fault is in the storage—downwind of their software—with either architecture. To that end, Oracle has announced the decommissioning of the Oracle Storage Compatibility Program. The URL for the OSCP (click here, or here for a copy of the web page in the Wayback Machine) states the following (typos preserved):

At this time Oracle believes that these three specialized storage technologies are well understood by the customers, are very mature, and the Oracle technology requirements are well know. As of January, 2007, Oracle will no longer validate these products. We thank our partners for their contributions to the OSCP.

Lack of Choice Does Not Enable Success
It will be good for Oracle shops to have even more options to choose from when selecting a NAS provider as an Oracle over NFS platform. I look forward to other players emerging on the scene. This is not just Network Appliance’s party by any means. Although I don’t have first-hand experience, I’ve been told that the BlueArc Titan product is a very formidable platform for Oracle over NFS—but it should come as no surprise that I am opposed to vendor lock-in.

Oracle Over NFS—The Demise of the Fibre Channel SAN
That seems to be the conclusion people draw when Oracle over NFS comes up. That is not the case, so your massive investment in SAN infrastructure was not a poor choice. It was the best thing going at the time. If you have a formidable SAN, you would naturally use a SAN gateway to preserve your SAN investment while reducing the direct SAN connectivity headaches. In this model, deploying another commodity server is as simple as plugging in Cat 5 cabling and mounting an exported NFS filesystem from the SAN gateway. No raw partitions to fiddle with on the commodity server, no LUNs to carve out on the SAN, and most importantly, no FCP connectivity overhead. All the while, the data is stored in the SAN so your existing backup strategy applies. This model works for Linux, Solaris, HP-UX, and AIX.
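For what it’s worth, the client-side setup really is that simple. The mount options below are the ones commonly documented for Oracle datafiles over NFS on Linux at the time; the server name and paths are hypothetical placeholders, so check your platform’s certification notes before copying them:

```
# /etc/fstab entry for an Oracle datafile filesystem served by a SAN gateway
# (nasgw and /vol/oradata are hypothetical names)
nasgw:/vol/oradata  /u02/oradata  nfs  rw,bg,hard,nointr,tcp,vers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0  0 0
```

The hard,nointr pairing keeps I/O from silently failing on a server hiccup, and actimeo=0 disables attribute caching, which matters when more than one host has the filesystem mounted.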

Oracle over NFS—Who Needs It Anyway?
The commodity computing paradigm is drastically different from the central server approach we grew to know in the 1990s. You know, one or two huge servers connected to DAS or a SAN. It is very simple to run that little orange cabling from a single cabinet to a couple of switches. These days people throw around terms like grid without ever actually drawing a storage connectivity schematic. Oracle’s concept of a grid is, of course, a huge Real Application Clusters database spread out over a large number of commodity servers. Have you ever tried to build one with a Fibre Channel SAN? I’m not talking about those cases where you meet someone at an Oracle User Group who refers to his 3 clustered Linux servers running RAC as a grid. Oh, how I hate that! I’m talking about connecting, say, 50, 100 or 250 servers all running Oracle—some RAC, but mostly not—to a SAN. I’m talking about commodity computing in the Enterprise—but the model I’m discussing is so compelling it should warrant consideration from even the guy with the 3-node “grid”. I’m talking, again, about Oracle over NFS—the simplest connectivity and storage provisioning model available for Oracle in the commodity computing paradigm.

Storage Connectivity and Provisioning
Providing redundant storage paths for large numbers of commodity servers with Fibre Channel is too complex and too expensive. Many IT shops are spending more than the cost of each server to provide redundant SAN storage paths since each server needs 2 Host Bus Adaptors (or a dual port HBA) and 2 ports in large Director-class switches (at approximately USD $4,000 per). These same servers are also fitted with Gigabit Ethernet. How many connectivity models do you want to deal with? Settle on NFS for Oracle and stick with bonded Gigabit Ethernet as the connectivity model—very simple! With the decommissioning of the OSCP, Oracle is making the clear statement that Oracle over NFS is no longer an edge-case deployment model. I’d recommend giving it some thought.

EMC’s MPFSi for Oracle: Enjoy It While It Lasts, or Not.

Regular readers of my blog know that I am a proponent of Oracle over NFS—albeit in the commodity computing space. I’ll leave those Superdomes and IBM System p servers to their direct SAN plumbing. So I must therefore be a huge fan of EMC’s Celerra MPFSi—the Multi-Path Filesystem, right? No, I’m not. This blog post is about why not MPFSi.

In this paper about EMC MPFSi, pictures speak a thousand words. But first, some of my own—with an Oracle-centric take. MPFSi would be just fine I suppose except it is both an NFS server-side architecture and a proprietary NFS client package. The following screen shot shows a basic diagram of Celerra with MPFSi. First, there are three components at the back end. One is the Celerra and another is an MDS 9509 Connectrix. The Celerra is there to service NAS filesystem metadata operations and the Connectrix with some iSCSI glue is there to transfer data requests in block form. That is, if you create a file and immediately write a block to it, you will have the file creation satisfied by the Celerra and the block write by the Connectrix. The final component is the SAN—since Celerra is a SAN-gateway. There is nothing wrong with SAN gateways by any means. I think SAN gateways are the best way to leverage a SAN for provisioning storage to the legacy monolithic Unix systems as well as the large number of commodity servers sitting on the same datacenter floor. That is, SAN to the legacy Unix system and SAN-gateway-NFS to the commodity servers. That’s tiered storage. Ultimately you have a single SAN holding all the data, but the provisioning and connectivity model of the gateway side is much better suited to large numbers of commodity servers than FCP. Here is the simplified topology of MPFSi:

NOTE, some browsers require you to right click->view.

smallcelerra-1.jpg

 

 

MPFSi requires NFS client-side software. The software presents a filesystem that is compatible with NFS protocols. An agent intercepts NFS protocol messages and forwards them to the Celerra, which then handles them per the MPFSi architecture, as the following screen shot shows.

smallcelerra-2.jpg

What’s This Have to do With Oracle?
So what’s the big deal? Well, I suppose if you absolutely need to stay with EMC as your SAN gateway vendor, then this is the choice for you. There are SAN-agnostic choices for SAN gateways as I’ve pointed out on this blog too many times. What about Oracle? Since Oracle10g supports NFS in the traditional model, I’m sure MPFSi works just fine. What about 11g? We’ve all heard “rumors” that 11g has a significant NFS-improvement focus. It is good enough with 10g, but 11g aims to make it an even better I/O model. That is good for Oracle’s On Demand hosting business since they use NFS exclusively. Will the 11g NFS enhancements function with MPFSi? Only an 11g beta program participant could tell you at the moment. I also know that the beta program legalese essentially states that participants can neither confirm nor deny whether they are, or are not, Oracle11g beta program participants. I’ll leave it at that.

Oracle over NFS is Not a Metadata Problem
When Oracle accesses files over NFS, there is no metadata overhead to speak of. Oracle is a simple lseek, read/write engine as far as NFS is concerned and there is no NFS client cache to get in the way either. Oracle opens files on NFS filesystems with the O_DIRECT flag. This alleviates a good deal of the overhead typical NFS exhibits. Oracle has an SGA, it doesn’t need NFS client-side cache. So MPFSi is not going to help where scalable NFS for Oracle is concerned. MPFSi better addresses the traditional problems with scaling home shares and so on.

Using Absolutely Dreadful Whitepapers as Collateral
Watch out if you read this ESG paper on EMC MPFSi because a belt sander might just drop from the ceiling and grind you to a fine powder as punishment for exposing yourself to such spam. This paper is a real jewel. If you dare risk the belt sander, I’ll leave it to you to read the whole thing. I’d like to point out, however, that it shamelessly uses relative performance numbers without the trouble of filling in any baselines for us in the performance section. For instance, the following shot shows a “graph” in the paper where the author makes the claim that MPFSi performs 300% better than normal NFS. This is typical chicanery—without the actual throughput achieved at the baseline, we can’t really ascertain what was tested. I have a $5 bet that the baseline was not, say, triple-bonded GbE delivering some 270+ MB/sec.

smallcelerra-3.PNG

 

No Blog Entries Over The Weekend!

I’ve been told that a blog without photos is too boring. Well, it just so happens that the reason I didn’t make any blog entries over the past weekend was because I was down at the family farm making a mess and taking photos. The job at hand was to relocate the pump that supplies the house and barn with water from one spring to another. It was messy, but first, a photo from the driveway…nice country.

dig2

When we first arrived we got to see the condition the contractors left their equipment in. We know where the hard ground is and told them how to approach the spring, but they had their ideas and wound up stuck up to the chassis in cold mud with a track missing:

dig6.jpg

That was pretty late in the day. The next morning I had to jump in there for a photo—while I was still nice and tidy.

dig5.jpg

What ensued after that photo was about 6 hours of toiling with the contractors and our family machinery to get that thing out of the mudhole. Next, they relocated the machine into position for digging the new pump location while an old friend of mine and I did some fence repair.

You can’t really see it well, but the machine is tethered to that young Douglas Fir tree behind it—or else the machine was going right into the hole being dug. Or maybe my daughter is just holding it in place…

The next shot shows the hole complete at about 8 feet deep with a 12’ x 4’ culvert positioned on-end to prevent caving. At that point the spring was producing about 200 to 300 GPM into the hole—a very dependable water source. Within hours the water was running crystal clear. The next task is to place some 15 cubic yards of 5″-open rock and a fabric barrier, then the pump goes in and the whole thing is capped off.

dig4.jpg

The next shot puts it into perspective with a view from the house down into the hole where the spring is. Steep country. Nice farm. Good time had by all.

dig3

There, I did it! Another blog entry with photos!

The 10.2.0.3 Patchset with VxFS Saga: An Example of Incorrectly Describing the Incorrectness

In the blog entry entitled “Oracle 10.2.0.3 Patchset is Not Functional with Solaris SPARC 64-bit and Veritas Filesystem”, I pointed out that the 10.2.0.3 patchset was not functional if your database resides on VxFS (bug 5747918). There is updated information now, but first a bit of humor.

In the solution section of the note covering this bug, Metalink note 405825.1 states:

Workaround
————–
Move the entire database to a non-Veritas filesystem

Resolution
———–
Download and apply Metalink patch:5752399.
The instructions to apply patch:5752399 are included in the patch README file.

Move the entire database? Uh, I’d go for the patch for the patchset. Or as I’ve already pointed out, Oracle Disk Manager is not affected by this bug at all.

The Patch for the Patchset
Oracle Patch number 5752399 is considered a mandatory patch for the 10.2.0.3 patchset.

Incorrectly Describing the Incorrectness
Regarding the nature of the bug, Metalink note 405825.1 incorrectly states:

The 10.2.0.3 patchset code changes attempted to use directio with vxfs (Veritas) filesystems, which vxfs does not support.

On the contrary, VxFS does support direct I/O via:

  • Quick I/O
  • ODM
  • VxFS mount options (e.g., convosync)

This documentation on Sun’s website gets it right:

If you are using databases with VxFS and if you have installed a license key for the VERITAS Quick I/O™ for Databases feature, the mount command enables Quick I/O by default. The noqio option disables Quick I/O. If you do not have Quick I/O, mount ignores the qio option. Alternatively, you can increase database performance using the mount option convosync=direct, which utilizes direct I/O.
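The mount invocations the Sun documentation alludes to look roughly like the following. These are illustrative sketches only — the disk group, volume, and mount point names are placeholders of my own.

```shell
# With a Quick I/O license installed, qio is on by default for VxFS mounts:
mount -F vxfs -o qio /dev/vx/dsk/oradg/oravol /u01/oradata

# Without Quick I/O or ODM, convosync=direct converts O_SYNC writes to
# direct I/O, and mincache=direct does the same for ordinary reads/writes:
mount -F vxfs -o convosync=direct,mincache=direct /dev/vx/dsk/oradg/oravol /u01/oradata
```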

Correctly Describing the Incorrectness
Since the Metalink note got it wrong by stating that VxFS doesn’t support “directio” (a.k.a. Direct I/O), I’ll clear it up here. As I stated in this blog entry, the true nature of the bug is that the 10.2.0.3 porting team implemented a call to the Solaris directio(3C) library routine, which is a way to push Direct I/O onto a UFS file but is not supported by VxFS. There, now, doesn’t that make more sense? Am I being a stickler? Yes, because there is a huge difference between the two following phrases:

attempted to use directio with vxfs

attempted to use directio(3C) with vxfs

Workaround
Did they really suggest moving an entire database as a workaround for a misplaced call to directio(3C)?


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.