Archive for the 'oracle' Category



SAN Madness–Don’t Let It Ruin Your Project.

I was recently exchanging email with a fellow participant in the oracle-l mailing list about a problem he was having evaluating Oracle on a SAN. It seems his evaluation was completely blown out by non-stop SAN madness. I wonder if his Storage Administrator is also his Unstructured Data Administrator? I know one thing for certain: this is a shop that needs to join up with the forces of BAARF. When I asked how things were going, he responded with:

Poorly. The installation/setup of Oracle was fine. The problem was with the SAN the SA gave me. It was an [brand name removed]. I know he set it up as RAID 5 but I don’t know the specifics after that. All I know is it was 5x slower than anything we currently have, and they’ve made us use RAID 5 everywhere. We use a lot of T3s with a few SANs sprinkled in. I think 15 days of the 30 day trial was spent hooking up the SAN.

SAN madness indeed! I understand how frustrating infrastructure issues are when they hinder the effort to actually perform the required testing. It isn’t all the fault of the SAN in this situation as it seems there were a few 8i->10g Cost Based Optimizer (CBO) hurdles to overcome as well. He wrote:

I also spent a lot of time just trying to get explain plans to match between my 8i database and the new 10g, in order to compare apples to apples.

I hope he didn’t feel responsible for that. I am convinced that one of the few people who really knows the Cost Based Optimizer in practical application is fellow OakTable member Jonathan Lewis.

The email continued by questioning whether a T2000 could replace a 16 CPU E6500. He wrote:

But after all that I still couldn’t see why they thought we could replace our 16 cpu Sun E6500 running full tilt with one T2000. Anyways, that project has been shelved.

SAN Madness
How sad it is that an entire project can get shelved because of SAN madness, but I am not surprised. On the other hand, the idea that a T2000 can supplant an E6500 is not that far out of line. As my old friend Glenn Fawcett points out, the bandwidth comparison is not even close: the T2K has a 20GB/s backplane whereas the E6500 has only 9.6GB/s. For OLTP, the T2000 would most likely beat a fully loaded Starfire UE10K, which is backplane-limited to 12.8GB/s. And you might even have enough HVAC for it!

Microsoft and LSI–Move Over Novell? Time for More Open Source Patents.

Thanksgiving Day or April Fool’s Day?
Just about everyone knows Cary Millsap of Hotsos and other fame. On Thanksgiving Day, he sent out a real turkey of an email (I just couldn’t resist that one) to us on the OakTable email list. I thought it was a joke when I first saw it.

The email pointed out that LSI had been granted a patent on the Linked List. Yes, the title of the patent is simply “Linked List”. As the holder of a couple of software patents, I tend to play devil’s advocate. I see entirely too much of the open source anarchy mentality when it comes to software patents. However, this particular patent is really out there. I see no way this patent could have survived scrutiny on prior art or novelty. Millions of software applications do exactly what the following quoted summary of the invention describes:

The present invention overcomes the disadvantages and limitations of the prior art by providing a system and method for traversing a list using auxiliary pointers that indicate the next item in a sequence. The sequential list may be created in one sequence, but used in a second sequence without having to resort the list.

Uh, there is nothing novel about that. But that is not all. Look at claim number 4:

A computer system capable of traversing a list having at least two sequential pointers comprising: a plurality of items that are contained in said list to be traversed; at least a primary pointer and a secondary pointer for each of said items of said list such that each of said items has an associated primary pointer and an associated secondary pointer…

That is weird. Find me a computer that can’t do that. Was there a C compiler for the TRS-80? Yes, in fact, there was. Hmmm…I better dig further…
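In fact, claim 4 describes something any first-year data structures student has written. Here it is in a few lines of Python, a toy sketch of my own (nothing from the patent's filing): each item carries a primary pointer threading the insertion sequence and a secondary pointer threading a sorted sequence, so the list can be traversed in a second order without resorting.

```python
class Item:
    """A list item with two 'auxiliary' pointers, per the patent's claim 4."""
    def __init__(self, value):
        self.value = value
        self.next_primary = None    # insertion-order sequence
        self.next_secondary = None  # sorted-order sequence

def build(values):
    """Link items in insertion order, then thread a second chain in sorted order."""
    items = [Item(v) for v in values]
    for a, b in zip(items, items[1:]):
        a.next_primary = b
    s = sorted(items, key=lambda i: i.value)
    for a, b in zip(s, s[1:]):
        a.next_secondary = b
    return items[0], s[0]   # head of each sequence

def walk(head, attr):
    """Traverse the list following one of its pointer chains."""
    out, node = [], head
    while node:
        out.append(node.value)
        node = getattr(node, attr)
    return out

head_ins, head_sorted = build([3, 1, 2])
# walk(head_ins, "next_primary") yields the creation sequence;
# walk(head_sorted, "next_secondary") yields the second sequence, no resort needed.
```

That's the whole "invention": two next-pointers per node.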

I don’t think this patent is a reason to abolish software patents, but clearly something is out of whack. Maybe this gentleman could explain it?

This is Not A Software Patent, Is It?
OK, this other patent granted to LSI this year is interesting. It doesn’t really even describe a software program. I wonder if Microsoft is paying royalties. How is the following abstract from the patent not, in fact, precisely what Microsoft Project does? Oh, I know, MS Project doesn’t send email.

The present invention is a computer-based system for managing projects. It allows the user to input data concerning a project and associate individuals with the project. The system then determines a deadline for completing a task associated with the project and send out reminders accordingly. The system provides the user a number of options not available on the conventional docketing systems, such as automatically increasing the frequency with which reminders are sent as the deadline approaches, and automatically increasing the number of individuals to whom the reminders are sent as the deadline draws near.
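To underline how unremarkable this “invention” is, the entire abstract fits in a few lines of Python. This is a toy of my own making (the thresholds and names are mine, nothing from LSI’s filing):

```python
def reminder_schedule(days_to_deadline, team):
    """Escalate as the deadline nears: remind more often, and copy more people."""
    if days_to_deadline > 30:
        freq_per_week, recipients = 1, team[:1]   # just the task owner
    elif days_to_deadline > 7:
        freq_per_week, recipients = 3, team[:2]   # owner plus a backup
    else:
        freq_per_week, recipients = 7, list(team) # daily, everyone gets nagged
    return freq_per_week, recipients

team = ["owner", "backup", "manager"]
# 60 days out: one reminder a week to the owner;
# 2 days out: daily reminders to the whole team.
```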

Quote of the Day

Not enough people are aware that the trouble at the high-end is no longer scalability as much as power, cooling and storage provisioning/connectivity.  Large scale commodity computing is the new high-end.

~Kevin Closson, 2006

Storagemojo.com is A Really Good Blog

Those of you who know me are aware that I am new to blogging and that Oracle-related storage topics are very near and dear to my heart. I’m sure all you long-time bloggers have likely seen this site already, but I have to say that storagemojo.com is a great blog.

I now know what my blog aspires to be—from the Oracle-centric viewpoint of course.

Pay this site a visit if you are storage minded. It has some serious mojo!

A Better KVM (New OakTable Network Member)

…from Kevin Closson’s blog…

We just pulled up another chair to the OakTable Network. This blog entry is to welcome Kurt Van Meerbeeck to the group. Kurt is the author of PePi, Pretoria, and DUDE. While his initials are KVM, I assure you he is much more complex than a KVM.

Welcome Kurt, it is a pleasure to have you on board!

UKOUG Update

It has been a great conference thus far. I have attended interesting presentations such as the one about Disposable Computing Architecture by James Morle. It certainly parallels my positions on commodity computing, but he threw in some thought-provoking points as well. If you didn’t get to see the presentation, watch for it in the proceedings; he may have it on his website soon. James has been a friend for many years, and if you’ve never been to his website you should check it out.

I enjoy this conference since so many of the Oracle OakTable Network members attend and speak. I have “known” a lot of fellow members for a long time that I have never met face to face.

I spoke about NAS architecture on Tuesday and am about to speak on Clusters Consolidation. Both presentations cover topics in whitepapers available on my blog front page—just click “Papers, etc” if you are interested in such topics.

Shameless Plug Time

One of our customers, Taleo, has been honored with one of the 2006 InfoWorld 100 Awards in the Services category. Taleo puts the PolyServe Database Utility for Oracle to good use. More on the award is available at this InfoWorld webpage.

 This was a very short blog entry since I need to rush in and do this presentation about Clusters Consolidation.

I promise to make a blog entry very soon about ASM on NAS from a slightly updated angle.

Network Appliance OnTap GX for Oracle? Clustered Name Space. A Good User Group (OSWOUG)

Presenting NAS Architecture at OSWOUG
I had a speaking session recently at OSWOUG about NAS architectures as pertaining to Oracle. It was a good group, and I enjoyed the opportunity to speak there. The presentation was an “animated” digest of this paper and covers:

  • single-headed filers
  • clustered single-headed filers
  • asymmetrical multi-headed NAS
  • symmetrical multi-headed NAS

Symmetric NAS versus Clustered Name Space
Since the only model in existence for the symmetrical multi-headed NAS architecture is the HP Enterprise File Services Clustered Gateway (which is an OEMed version of the PolyServe File Serving Utility), I spent some time discussing the underpinnings—the cluster filesystem. At one point I spent just a few moments discussing Cluster Name Space technology, but I wish I had taken it further. I’ll make up for it now.

If You Can’t Cluster Correctly, At Least Cluster Something
Something is better than nothing, right? Many years ago, an old friend of mine was coming back from duty in Antarctica on a Coast Guard icebreaker. You’d have to know Kenny to get the full picture, but at one point, over beer, he pulled out a phrase that still makes me chuckle. I don’t remember the context, but his retort to some rant I was on was, “Well, if you can’t tie a knot, tie a lot”. That, my friends, is the best way to sum up Cluster Name Space NAS technology. I couldn’t find a dictionary definition, so I’ll stake a claim:

clus·ter name space  \ 'klus-ter 'nAm 'spAs \
n.

Software technology which when applied to separate storage devices gives the illusion of scalable filesystem presentation. Cluster name space technology is a filesystem directory aggregator. Files stored in a directory in a cluster name space implementation cannot span the underlying storage devices. Single large files cannot benefit from cluster name space technology. Cluster name space is commonly referred to by the acronym CNS. When applied to Oracle databases, CNS forces the physical partitioning of data across the multiple, aggregated underlying storage devices.

v.

To apply a virtualized presentation layer to a bottlenecked storage architecture: “hey, let’s just CNS it and ship it”
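The noun definition is easy to demonstrate with a toy model. This Python sketch is entirely my own (no vendor code): the aggregated namespace looks like one filesystem, but each file lives wholly on whichever node owns its directory, so a single hot datafile can never be served by more than one head.

```python
# Toy CNS: directories are assigned to nodes; a file lives wholly on one node.
class ClusterNameSpace:
    def __init__(self, directory_to_node):
        # e.g. {"/ora1": "filerA"}: the namespace aggregates directories,
        # it does not stripe file contents across the underlying devices.
        self.directory_to_node = directory_to_node

    def node_for(self, path):
        """Every byte of a file is served by the single node owning its directory."""
        directory = path.rsplit("/", 1)[0] or "/"
        return self.directory_to_node[directory]

cns = ClusterNameSpace({"/ora1": "filerA", "/ora2": "filerB"})
# The presentation is unified, but a single large datafile still has exactly
# one serving node, so that node is the I/O throughput ceiling:
#   cns.node_for("/ora1/big_datafile.dbf") is "filerA", always.
```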

The verb form of the term was demonstrated when Network Appliance bought Spinnaker back in 2003. When you spend 300 million dollars on technology, what is done with it often winds up looking more like a verb than a noun. What do I mean by that? There is nothing about Network Appliance’s OnTap GX technology that was built from the ground up to be clustered and scalable. Network Appliance has had clustered filers for a long time. That doesn’t mean the data in the filers is clustered. It is merely a cabling exercise. What they did between the 2003 purchase of Spinnaker and the release of OnTap GX was a verb—they CNSed their existing technology. I can tell from the emails I get that there are about 42 of you who think I make all this stuff up. Let’s get it from the horse’s mouth, quoting from page 8 of this NetApp whitepaper covering OnTap GX:

When an NFS request comes to a Data ONTAP GX node, if the requested data is local to the node that receives the request [reference removed], that node serves the file. If the data is not local [reference removed], the node routes the request over the private switched cluster network to the node where the data resides. That node then forwards the requested data back to the requesting node to be forwarded to the client.

Perhaps a picture can replace a thousand words. Nobody, not even in the deepest, darkest, remote areas of the Amazon rain forest would mistake what is depicted in the following graphic for a symmetrical file serving technology:

[Figure: diagram of OnTap GX nodes forwarding NFS requests over the private cluster network]

The Press
There has been a lot of press coverage of Data OnTap GX (NetApp is the poster child for CNS), and most articles get the technology just flat wrong. This techtarget.com article, on the other hand, gets it quite right:

NetApp’s SpinServer is a clustered file system for NFS that currently doesn’t support CIFS. SpinServer exports a single namespace across all storage nodes, but files aren’t striped across nodes

And so does this techtarget.com article which exposes the weakness of the CNS verb situation:

…immediately found a problem with the high-availability (HA) failover detection system. A failure could be bad enough that clients could not access data but not bad enough to alert the system.

What Does This Have To Do With Oracle?
OK, this is a blog about Oracle and storage. Let me interpret the quote from NetApp’s paper covering Data OnTap GX. If you have a large Oracle datafile that gets hammered with I/O, this technology will do absolutely nothing to improve your throughput or reduce I/O latencies. A single file is contained within one filer’s storage and thus represents an I/O throughput bottleneck. One of the sites I read, DrunkenData, has a thread on this topic. It appears as though the thread is aimed more at trying to believe in something than at understanding it, but my assessment could be wrong.

The CNS topic is quite simple; the word “cluster” in the CNS context has nothing to do with file-level scalability. Yes, I know Network Appliance is a multi-billion dollar corporation so I’m sure my exposé of this topic will be met with nefarious men wearing trench coats lying in wait to attack…

Cram Some Technology into Your Solution
This storagemagazine.techtarget.com article covers some clustered NAS topics, but there was one bit in there that stood out when I read it and I’d like to comment. The following excerpt recommends CNS technology for Oracle:

…To avoid this problem, consider a global namespace product…The only way to implement a cross-platform global namespace is to replace your NAS infrastructure with, for example, NetApp’s SpinServer or Panasas’ ActiveScale. If Oracle 10 on a Linux cluster is in your future, then the NetApp and Panasas solutions should be on your short list.

Bad mojo! I have said many times before that Oracle on NAS is a good model, and NetApp is obviously the 800lb gorilla in this space. However, I disagree with the idea posted in this quoted article for two reasons. First, Oracle uses large files and CNS does not scale single large files. To get the benefit of CNS, you would have to partition your large files into smaller ones so as to get multiple filers serving the data. But if you wanted to physically partition your data, you would have chosen a different database technology, like DB2 EEE. Second, Oracle has an established program for NAS vendors to certify their capabilities with Oracle databases: the Oracle Storage Compatibility Program (OSCP). Panasas may be good technology, but since it is not an OSCP vendor, it doesn’t get to play with Oracle. On the other hand, the HP EFS-CG, which is multi-headed and scalable, is. Fair is fair.

A Change of Heart
Here was an interesting post in the bioinformatics.org mailing list about “global name space”. I don’t know how the different filesystem technologies got clumped together as they did, but that was two years ago. I found it interesting that the poster was from Verari Systems who now resell the PolyServe File Serving Utility.

Summary
I hope you know more now about CNS than you would ever care to know–literally.

TPC-C Result Proves DB2 is “Better” Than Oracle

Oracle Versus “The Competition”
There is a thread on asktom.oracle.com that started back in 2001 about how Oracle compares to other database servers. While I think today’s competitive landscape in RDBMS technology makes that a bit of a moot question, I followed the thread to see how it would go. The thread is a very long point/counter-point ordeal that Tom handles masterfully. There are bits and pieces about WinFS, MySQL, DB2, PostgreSQL, multi-block read consistency (thank heaven above—and Andy Mendelsohn et al. of course), writers blocking readers, page lock escalation (boo Sybase) and so on. It’s a fun read. The part that motivated me to blog is the set of TPC-C results posted by a participant in the thread. The TPC-C results showed record-breaking non-clustered TPC-C v5 results on a flagship RS/6000 (System p5, I believe it is now branded) server running DB2. Tom replies to the query with:

…do YOU run a TPC-C workload, or do YOU run YOUR system with YOUR transactions?

That was an excellent reply. In fact, that has been the standard remark about TPC-C results given by all players—when their result is not on top. I’m not taking a swipe at Tom. What I’m saying is that TPC-C results have very little to do with the database being tested. As an aside, it is a great bellwether for what software features actually offer enhanced performance and are stable enough to sustain the crushing load. More on that later.

TPC-C is a Hardware Benchmark
See, most people don’t realize that TPC-C is a hardware benchmark first and foremost. The database software approach to getting the best number is to do the absolute least amount of processing per transaction. Don’t get me wrong, all the audited results do indeed comply with the specification, but the “application” code is written to reduce instructions (and more importantly cycles) per transaction any way possible. If ever there was a “perfect” Oracle OLTP workload, it would be the TPC-C kit that Oracle gives to the hardware vendors in order to partake in a competitive, audited TPC-C. That application code, however, looks nothing like any application out in front of any Oracle database instance in production today. To be fair, all the database vendors do the same thing, because they know that it is a hardware benchmark, not a database benchmark.

TPC-C—Keeping it Real
The very benchmark itself is moot—a fact known nearly since the ratification of the specification. I remember sitting in a SIGMOD presentation of this paper back in 1995 where Tandem proved that the workload is fully partitionable and scaled linearly, thus ridiculously large results are only impeded by physical constraints. That is, if you can find the floor space, power, cooling and cable it you too can get a huge result. If only the industry would have listened! What followed has been years of the “arms race” that is TPC-C. How many features have gone into database server products for these benchmarks that do nothing for real datacenter customers? That is a good question. Having worked on the “inside” I could say, but men in trench coats would sweep me away never to be heard from again. Here’s a hint, the software being installed does not have to come from a shipping distribution medium (wink, wink). In fact, the software installation is not a part of the audit. Oops.

History and Perspective
In 1998 I was part of a team that delivered the first non-clustered, Oracle TPC-C result to hover next to what seemed like a sound barrier at the time—get this, 100,000 TpmC! Wow. We toiled, and labored and produced 93,901 TpmC on a Sequent NUMA-Q 2000 with 64 processors as can be seen in these archived version 3 TPC-C results. This and other workloads were the focus of my attention back then. What does this have to do with the AskTom thread?

The thread asking Tom about TPC-C cited a recent DB2 result on IBM’s System p5 595 of 4,033,378 TpmC. I think the point of that comment on the thread was that DB2 must surely be “better” since the closest result with Oracle is 1,601,784 TpmC. For those who don’t follow the rat race, TPC-C results between vendors constantly leap-frog each other. Yes, 4,033,378 is a huge number for a 64 processor (32-socket 64 core) system when compared to that measly number we got some 8 years ago. Or is it?

Moore’s Law?
There have been about 6 Moore’s Law units of time (18 months each) since that ancient 93,901 TpmC result. A lot has changed. The processors used for that result were clocked at 405MHz, had 7.5 million transistors (250nm) and tiny little 512KB L2 caches. With Moore’s Law, therefore, we should expect processors with some 480 million transistors today. Well, somewhere along the way Moore’s Law sloped a bit, so the IBM System p5 595 processors (POWER5+) “only” have 276 million transistors. Packaging, on the other hand, is a huge factor that generally trumps Moore’s Law. The POWER5+ processors are packaged in multi-chip modules (MCM) with some 144MB of L3 cache (36MB/socket, off-chip yes, but 80GB/s) backing up a full 7.6MB L2 cache—all this with 90nm technology. Oh, and that huge TpmC was obtained on a system configured with 2048GB (yes, 2 terabytes) of memory whereas the “puny” Sequent at 93,901 TpmC had 64GB. And, my oh my, how much faster memory access is on this modern IBM—about 6-fold better latency in fact!

The Tale of The Tape
Is 4,033,378 really an astronomical TPC result? Let’s compare the IBM system to the old Sequent:

  • 43x more throughput (TpmC)
  • 32x more memory configured ( with 6x better latency )
  • 37x more transistors per processor (and clocked 6x faster)
  • 15x more processor L2 cache + 36x in L3
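Those multipliers are straight arithmetic on the figures quoted in this post. Here is the whole comparison as a quick Python sanity check:

```python
# Numbers quoted in this post: Sequent NUMA-Q 2000 (1998) vs IBM System p5 595.
tpmc_ratio       = 4_033_378 / 93_901   # ~43x throughput (TpmC)
memory_ratio     = 2048 / 64            # 32x memory configured (GB)
transistor_ratio = 276e6 / 7.5e6        # ~37x transistors per processor
l2_ratio         = 7.6 / 0.512          # ~15x processor L2 cache (MB)

# Moore's Law check: six 18-month doublings from 7.5M transistors...
moores_projection = 7.5e6 * 2**6        # 480M expected; POWER5+ has "only" 276M
```

Forty-three times the throughput from thirty-some times the hardware is hardly a software miracle.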

So, regardless of the fact that I just did a quick comparison of a DB2 result to an old Oracle8 result, I think there should be little surprise that these huge numbers are being produced, especially when you factor in how partitionable TPC-C is. Let’s not forget disk. IBM used 6,400 hard disk drives for this result. If they find the floor space and power, do the cabling, add a bunch more drives, and hook up the upcoming POWER6-based System p server, I’m quite certain they will get a bigger TPC-C result. There’s no doubt in my mind—hint, POWER6 has nearly 3-fold more transistors (on 65nm technology) than POWER5+ and is clocked at 5GHz too.

Tom Kyte is Right
The question is, what does it mean to an Oracle shop? Nothing. So, as usual, Tom Kyte is right. Huge TPC-C results using DB2 don’t mean a thing.

Retrospect and Respect
I’m still proud of that number we got way back when. It was actually quite significant since the closest Oracle result at the time was a clustered 96-CPU 102,542 TpmC Compaq number using Alpha processors. That reminds me, I had a guy argue with me last year at OOW suggesting that result was the first 100K+ non-clustered Oracle result. I couldn’t blackberry the tpc.org results spreadsheet quick enough, I guess. I digress. As I was saying, the effort to get that old result was fun. Not to mention that the 510-pin gallium arsenide data pump linking each set of 4 processors in the NUMA system was pretty cool looking. It was a stable system too. The Operating System technology was a marvel. I’m fortunate to still be working with a good number of those Sequent kernel engineers right here at PolyServe, where one of our products is a general purpose, fully symmetric, distributed cluster filesystem for Windows and Linux.


DBWR Efficiency, AIO, I/O Libraries with ASM.

I’m sure this information is not new to very many folks, but there might be some interesting stuff in this post…

I’m doing OLTP performance testing using a DL585, 32GB, Scalable NAS, RHEL4 (Real, genuine RHEL4) and 10gR2. I’m doing some oprofile analysis on both the NFS client and server. I’ll blog more about oprofile soon.

This post will sound a bit like a rant. Did you know that the Linux 2.5 Kernel Summit folks spent a bunch of time mulling over features that have been considered absolutely critical for Oracle performance on Unix systems for over 10 years! Take a peek at this partial list and chuckle with me please. After the list, I’m going to talk about DBWR. A snippet from the list:

  • Raw I/O has a few problems that keep it from achieving the performance it should get. Large operations are broken down into 64KB batches…
  • A true asynchronous I/O interface is needed.
  • Shared memory segments should use shared page tables.
  • A better, lighter-weight semaphore implementation is needed.
  • It would be nice to allow a process to prevent itself from being scheduled out when it is holding a critical resource.

Yes, the list included such features as eliminating the crazy smashing of Oracle multiblock I/O into little bits, real async I/O, shared page tables and non-preemption. That’s right. Every Unix variant worth its salt, in the mid to late 90s, had all this and more. Guess how much of the list is still not implemented. Guess how important those missing items are. I’ll blog some other time about the lighter-weight semaphore and non-preemption that fell off the truck.

I Want To Talk About Async I/O
Prior to the 2.6 Linux Kernel, there was no mainline async I/O support. Yes, there were sundry bits of hobby code out there, but really, it wasn’t until 2.6 that async I/O worked. In fact, a former co-worker (from the Sequent days) did the first HP RAC Linux TPC-C and reported here that the kludgy async I/O that he was offered bumped performance 5%. Yippie! I assure you, based on years of TPC-C work, that async I/O will give you much more than 5% if it works at all.

So, finally, 2.6 brought us async I/O. The implementation deviates from POSIX, which is a good thing. However, it doesn’t deviate enough. One of the most painful aspects of POSIX async I/O, from an Oracle perspective, is that each call can only initiate writes to a single file descriptor. At least the async I/O that did make it into the 2.6 Kernel is better in that regard. With the io_submit(2) routine, DBWR can send a batch of modified buffers to any number of datafiles in a single call. This is good. In fact, this is one of the main reasons Oracle developed the Oracle Disk Manager (ODM) interface specification. See, with odm_io(), any combination of reads and writes, whether sync or async, to any number of file descriptors can be issued in a single call. Moreover, while initiating new I/O, prior requests can be checked for completion. It is a really good interface, but it was only implemented by Veritas, NetApp and PolyServe. NetApp’s version died because it was tied too tightly to DAFS, which is dead—really dead (I digress). So, yes, ODM (more info in this, and other papers) is quite efficient at completion processing. Anyone out there look at completion processing on 10gR2 with the 2.6 libaio? I did (a long time ago really).

Here is a screen shot of strace following DBWR while the DL585 is pumping a reasonable I/O rate (approx 8,000 IOPS) from the OLTP workload:

[Screenshot: strace call-rate summary for DBWR]

Notice anything weird? There are:

  • 38.8 I/O submit calls per cpu second (batched io_submit())
  • 4,872 I/O completion processing calls per cpu second (io_getevents())
  • 7,785 wall clock time reading calls per cpu second (times() + gettimeofday())

Does Anyone Really Know What Time It Is?
Database writer, with the API currently being used (when no ODM is in play), is doing what we call “snacking for completions”. This happens for one of many reasons. For instance, if the completion check was for any number of completions, there could be only 1 or 2. What’s with that? If DBWR just flushed, say, 256 modified buffers, why is it taking action on just a couple of completions? A waste of time. It’s because the API offers no more flexibility than that. On the other hand, the ODM specification allows for blocking on a completion check until a certain number, or a certain request, is complete—with an optional timeout. And like I said, that completion check can be done while already in the kernel to initiate new I/O.

And yes, DBWR alone is checking wall clock time with a combination of times(2) and gettimeofday(2) at a rate of 7,785 times per cpu second! Wait stats force this. The VOS layer is asking for a timed OS call. The OS can’t help it if DBWR is checking for I/O completes 4,872 times per cpu second—just to harvest I/Os from some 38.8 batch writes per cpu second…ho hum… you won’t be surprised when I tell you that the Sequent port of Oracle had a user mode gettimeofday(). We looked at it this way, if Oracle wants to call gettimeofday() thousands of times per second, we might as well make it really inexpensive. It is a kernel-mode gettimeofday() on Linux of course.
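A little back-of-the-envelope arithmetic on those strace rates makes the snacking plain:

```python
# Strace rates observed per cpu second while DBWR drove ~8,000 IOPS:
io_submit_rate    = 38.8    # batched write submissions (io_submit)
io_getevents_rate = 4872    # completion checks (io_getevents)
clock_rate        = 7785    # times() + gettimeofday() calls

# DBWR "snacks for completions": well over a hundred completion checks
# for every batch it submits...
checks_per_batch = io_getevents_rate / io_submit_rate    # over 125

# ...and wraps nearly every one of those checks in wall-clock reads
# for the wait-stat bookkeeping:
clock_calls_per_check = clock_rate / io_getevents_rate   # roughly 1.6
```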

What can you do about this? Not much really. I think it would be really cool if Oracle actually implemented (Unbreakable 2.0) some of the stuff they were pressing the Linux 2.5 developers to do. Where do you think the Linux 2.5 Summit got that list from?

What? No Mention of ASM?
As usual, you should expect a tie-in with ASM. Folks, unless you are using ASMLib, ASM I/Os are fired out of the normal OSDs (Operating System Dependent code), which on RHEL4 means libaio. What I’m telling you is that ASM is storage management, not an I/O library. So, if a db file sequential read, or a DBWR write, is headed for an offset in an ASM raw partition, it will be done with the same system call as it would be if it were on a CFS or NAS.

Oracle Shared Latches, CAS, Porting. No mention of ASM?

I participate in the Oracle-L mail list managed by fellow OakTable network member Steve Adams. There are a lot of good folks over there. I want to hijack a thread from that list and weave it in to a post I’ve wanted to do about Oracle porting.

Oracle Port-level Optimizations
It is a little known fact that major performance ramifications are the result of port-level implementation decisions. Oracle maintains a “bridge layer” of code between the Oracle Kernel and the lower level routines that interface with the Operating System. This layer is called the Virtual Operating System layer, or VOS. Oracle was brilliant for implementing this layer of the server. Without it, there would be routines at widely varying levels of the Oracle Kernel interfacing with the Operating System—chaos! Under the VOS is where the nitty-gritty happens.

So, what does this have to do with the thread on Oracle-L?

There was a post where a list member stated:

I saw quite a few cas latch waits today, Oracle 9.2.0.4 on HP-UX 11.11 PA-RiSC. Do these CPUs support CAS instructions?

This post sparked a series of follow-ups about Oracle’s usage of shared latches. For the non-Oracle minded reader, the term shared latches means reader/writer locks. You see, for many years, all critical sections in Oracle were protected by complete mutual exclusion—a real waste for mostly-read objects. The post is referring to an optimization where shared latches are implemented using Compare and Swap primitives. Whether or not your port of Oracle has this optimization is a decision made at the port level—either the OS supports it or it doesn’t. If it doesn’t, there are tough choices to make at the porting level. But the topic is bigger than that. When Oracle uses generic terms like “CAS” and “Scattered Read”, a lot is lost in translation. That is, when the VOS calls an underlying “Scattered Read”, is it a simulation? Is it really a single system call that takes a multiblock disk read and populates the SGA buffers with DMA? Or is it more like the age old Berkeley readv(2) which actually just looped with singleton reads in library context? On the other hand, when Oracle executes CAS code, is it really getting a CAS instruction or a set of atomic instructions (with looping)? The latter is generally the case.

Another installment on that particular Oracle-L thread took it to the port level:

Correct, HP doesn’t do CAS. There are some shared read latch operations that Oracle therefore implements through the CAS latch…

Right, HP, or more precisely the PA-RISC instruction set does not have a Compare and Swap instruction—it seems HP took the “reduced” in reduced instruction set computing to the extreme! However, neither does PowerPC or Alpha for that matter. In fact, neither does x86 or IA64. Oh hold it; I’m playing a word game. Both x86 and IA64 do have CAS, but it is called cmpxchg (compare and exchange). But honestly, PowerPC and Alpha do not. So how do these platforms offer CAS?

The porting teams for these various platforms have to make decisions. In the case of offering CAS to the VOS layer, they either have to construct a CAS using an atomic block of instructions or punt and use the reference code which is a spinlock. In the latter case, the wait event CAS latch can pop up. You see, a CAS latch can wait, whereas a real CAS will only stall the CPU. That is, if the port implements a CAS latch where other ports go with a CAS atomic set or single instruction, the former can sleep on a miss and the latter cannot. The processor is going to do that CAS, and nothing else. A contended memory location being hammered by CAS will stall processors, because once the CPU enters that block of (or singleton) instruction(s), it stays there until the work is done. I’m not talking about pathology, just implementation subtleties. So, what does CAS really look like?

Sparc64, x86, IA64, S/370, and, get this, Motorola 68020 all offer a CAS instruction. There are others for sure. On the other hand, PPC and Alpha require an atomic set of instructions built on LL/SC (Load-Linked/Store-Conditional), which on Power is the famed “lorks and storks” (ldarx/stwcx) and on Alpha is ldx_l/stx_c. Finally, what about PA-RISC? Well, you can’t run Oracle on any CPU that totally lacks atomic instructions. In the case of PA-RISC there is a Load and Clear Word (ldcw) instruction.

The point is that the VOS can be given a CAS of one sort or the other, but not all architectures handle the contention that CAS can cause. For whatever reason, it seems the HP porting group punted on 9i and went with latches where other ports use a real, or constructed, CAS. Be aware that just following the masses and implementing a CAS atomic set is not always the right answer. These pieces of code can do really weird things when the words being modified by the CAS straddle a cacheline, among other such issues. Hmmm, trivial pursuit?

How Subtle are These Subtleties?

They can be really really NOT subtle!

I was on a team of folks that implemented an atomic-set CAS for Oracle’s System Change Number in Oracle8, which is actually a multi-word structure with the SCN word and another word representing the wrap value. The SCN value always increases, albeit not serially. It just gets “larger”. We were able to pull the latch that protected the incrementing of these values and replace it with a small block of atomic assembly that incremented it without any locks. I doubt we were the first to do that. The result was a 25% performance increase in TPC-C. Why? The SCN used to be a really big problem. Back then, propeller-heads like me used to collect bus traces on workloads like TPC-C and map the physical addresses back to SGA addresses. It so happened that in older versions of Oracle, 27% of all addresses referenced on the bus (on a 64-CPU system) were to the cacheline that held the SCN structure! Granted, that was for the sum of load, store and coherency ops (e.g., invalidate, cache to cache transfers).

ASM is “not really an optional extra” With BIGFILE Tablespaces

I was reading Chris Foot’s (very good) blog entry about BIGFILE Tablespaces when I choked on the following text (mind you, Foot is quoting someone else):

…ASM is not only recommended, but a requirement because of inode issues: “If you create a truly large bigfile tablespace on a traditional file system, you will suffer from horrendous inode locking issues. That is why ASM is not really an optional extra with these things.”

This is a great topic as I too feel BIGFILE tablespaces are a great feature. I have to point out, though, that the bit about inode locking is a red herring.

Folks, inode locks are only an issue when file metadata changes. The fact is, when you use file system files for Oracle, metadata doesn’t change—the file contents do. If you deploy on a file system that supports direct I/O, one benefit you get is the elimination of atime/mtime updates. Those are the only metadata that would otherwise get changed on every file access—doom! With mtime/atime updates removed from the direct I/O codepath, the only metadata changes left are structural changes to the file—again, not the file contents. That is, if you create, remove, truncate or extend a file (even with direct I/O), then, yes of course, inode locks must be held in an elevated mode. In the case of a cluster file system, that extends to a cluster-wide inode lock. Now, if your cluster file system happens to take some rudimentary central-lock approach, as opposed to a symmetric/distributed (DLM) approach, then there are issues at that level—but only when the file changes (again, not the file contents).

The point being that unless you are extending a BIGFILE tablespace on a freakishly frequent basis, the inode thing is a red herring.

I really do hate it when non-issues like this are used to prop up a technology choice such as ASM.

Troubles with Oracle on NAS? Old Stuff Deployed?

In Vidya Bala’s Blog post about Oracle on NAS, there is evidence of past problems with this NAS storage under older Linux distributions (e.g., SLES8) and older Oracle releases (e.g., Oracle9i). Most folks know I am a staunch proponent of Oracle on NAS and have blogged about it here and here. The most important thing to remember is that a noac mount option is no substitute for open(,O_DIRECT,).


I’ve blogged that, in my opinion, the first production-quality stack for Oracle on NAS is Oracle10gR2 on 2.6 Kernel releases. However, I can’t speak from authority on the Legacy Unix capabilities in this space. I’ve got too much Linux around here.

Nifty “Toys”

Fun Stuff to Play With
I thought I’d post a few photos of the lab gear allocated to my projects. I have the 2 racks right next to the chair. There is a 2 node cluster of heavily loaded HP DL585s, an 8 node cluster of HP DL145s, a 2 node cluster of HP DL385s, a 2 node cluster of IBM xSeries x366, some Chaparral SAN, Imperial Solid State Disk, MSA 1500, DS4x00, and other goodies (e.g., FC and Ethernet switches, etc) … all running Oracle on Redhat and SuSE Linux.

Lab Systems

And the DL585s are cabled to some LUNs in the Data Direct Network cab:

DDN

And, when I schedule it, I am granted LUNs from the following storage arrays:

Storage

That is about 10% of the lab gear here… lots of AC.

Not pictured is the Texas Memory System I have on loan from the nice guys at TMS.

Now Is The Time To Open Source!

BLOG UPDATE 2011.08.11 : For years my criticism of Oracle Clusterware fencing methodology brought ire from many who were convinced I was merely a renegade. The ranks of “the many” in this case were generally well-intended but overly convinced that Oracle was the only proven clustering technology in existence.  It took many years for Oracle to do so, but they did finally offer support for IPMI fencing integration in the 11.2 release of Oracle Database. It also took me a long time to get around to updating this post.  Whether by graces of capitulation or a reinvention of the wheel, you too can now, finally, enjoy a proper fencing infrastructure. For more information please see: http://download.oracle.com/docs/cd/E11882_01/rac.112/e16794/admin.htm#CHDGIAAD

PolyServe to Open Source Products in Wake of “New” Oracle Unbreakable Linux

You all know I currently work for PolyServe. Over the last 5 years I cannot count how many times it has been recommended to us that we open source the PolyServe Database Utility for Oracle. Back when Dave Dargo’s office was pushing the Unbreakable Linux program (circa 2002), the ecosystem for Oracle on Linux started to get dicey really fast. You see, when Unbreakable Linux started, it was a program that:

  • Mandated that all software on your system—except Oracle of course—be open source
  • Required that you still have a support contract with Redhat or SuSE

Meeting these criteria allowed you to call Oracle Support with purely Operating System issues. I’m still waiting for anyone to comment on just how helpful that program was for you. It seemed like very dubious value add to me.

What about Open Source Cluster File Systems?

Over the years, we’ve been asked why our customers pay for our product when OCFS is free and GFS is available. Historically, it was because it worked and the others were seen as emerging technology. I question that supposition because GFS just about died in the hands of Sistina and OCFS has always been too rudimentary for general purpose use. I know OCFS2 is reportedly a general purpose filesystem; if that is true for you in production, that’s good. If you have only heard that rumor, I recommend my philosophy that belief should only be borne of testing.

SuSE was the only distro to adopt OCFS2 since Redhat was working on GFS. It may seem like trivial pursuit, but Novell Corporate IT actually runs Oracle on PolyServe—but I digress.

Dark Dirty Secrets About Free Stuff

Oracle has been stating that OCFS is “good enough” for years, and after OCFS2 was touted as “general purpose”, SuSE decided to be nice and include it in their distribution—but they never told anyone that it doesn’t work. The voice of authority on what OCFS2 can do is actually SuSE, and Lars Marowsky-Brée is the voice of voices there since he works in the group that is trying to make free clustering solutions work. In this suse-oracle email list installment he writes:

So two nodes is not really sufficient with OCFS2 to protect against node failures – you need three at least for the majority quorum to be meaningful.

That was on 11 April 2006. I bet you haven’t heard that two node clusters with OCFS2 are split-brain poster children from anyone but this blog, right? What’s the big deal? If the stuff doesn’t work, why are you being told it does?

Of these two, GFS is at least close to pole position. It has a quality fencing model and actually works—at least its STONITH model does. I wrote about fencing in this paper, and will go into the topic further here once the dust settles from Oracle OpenWorld. But just because it is functional doesn’t mean it performs. Sistina had all sorts of problems working out the locking model with GFS. Redhat didn’t inherit any favors in that regard. Here is an independent study of Linux CFS alternatives for unstructured data—the real kingpin performance metric for file systems.

What About Oracle Support

Now that is a great generic topic—but not for this post. After Oracle shook things up and muddied the water back in 2002 with Unbreakable Linux 1.0, they changed their stance on database support—when using third party clustering solutions—about 42 times. Eventually, Oracle instituted a program for third party cluster filesystem and clusterware compatibility which is discussed here. The RAC Technology Compatibility Matrix in Oracle Metalink is the final say on such matters.

The Linux ecosystem has been a train wreck not due to technology interoperability so much as the constant heavy-handedness by such companies as Oracle when it comes to partnering. In fact, Redhat has consistently tried to freeze out non-open source partners. Maybe this will change their attitude.

Is it really Redhat support that is slowing Linux adoption in the enterprise as Larry says? I think not.

1-800-Call-Larry—Who Ya Gonna Call? TSANet To the Rescue.

Oh, I know, we all want a single-source provider for all support. Uh huh, nothing like feeling really, truly, alone in the world. Have a complex problem and only a single 1-800 number to call? Good luck. Isn’t that why proprietary solutions were so despised? Isn’t the openness of Open Systems why Oracle is where it is today? What happened?

If this trend toward solutions consolidation continues, we are all going to sorely miss the days when there were multiple providers in a given solution who were fighting for your business and success. That is, after all, why TSANet is so crucial and why you need to know more about it. See, if you have a multi-provider solution where the providers are signatories to TSANet, there is no “finger pointing”. Ahh, yes, the fabled “finger pointing”. Unless you get a single provider—soup to nuts—there are going to be multiple players. In a problem resolution scenario, the only finger pointers are the big players. They are the only ones that can afford to lose your business. Does anyone think that, say, a small infrastructure player in a multi-vendor solution can actually get away with being the finger pointer? Heavens no. It is always the biggest player that tries to pass its problem off to the motivated, smaller newcomer. Alas, TSANet has always been the protection against such poor business practice.

Deployment Standards

Imagine a datacenter that had both Windows (with SQL Server) and Linux running critical applications. Imagine further the need to consolidate and provide high availability at the same time. Now, quick, pick your solution. How is Unbreakable Linux (redux) or GFS or OCFS supposed to help at all? I’ve had dozens of you readers ask me why PolyServe. The answer is that only PolyServe solves this problem on both Linux and Windows. Our customers want a cluster deployment model that works for both Windows and Linux—imagine that! Doing things the same way regardless of operating system sounds like a good idea to me.

Open Source—The Perfect Business Model

Well, PolyServe missed the chance to open source our products when the timing was right. Had we, there is a chance that the same thing that just happened to Redhat could have happened to us. What a great business model. You get venture capital funding, build a world-class product, build a support ecosystem, open source it, get hundreds of customers into production, and then someone like Oracle takes over for you. That almost makes the ad-revenue model of the burning-piles-of-cash.com startups of the 1990s look pretty attractive.

So, no, we won’t be doing the open source thing. After Larry’s announcement, it looks less and less attractive every day.

Solaris 10 on AMD anyone? Hmmm…

Gigabit Ethernet NFS is Not Sufficient for Oracle. Forget NAS, or Read On…

BLOG UPDATE – 2012.06.07 – Wayward Googlers resurrected this old post. Using my not-so-canny speed-reading skills I jumped in with comments. A reader emailed me to point out I bit myself with nomenclature (“B” vs “b”). At this point I think the 100Mb-per-1GHz ratio is prime for more scrutiny. I held fast to that ratio in the time frame of this original post. However, my work with Westmere and Sandy Bridge Xeons leads me to believe the ratio is in dire need of updating. I’ll address that topic in an upcoming post and link back to this post.


Calling My Friends Liars–How Fun

I can’t remember the last time I disagreed with Jeff Needham. I realized about 15 years ago (IIRC) that it doesn’t make sense to do so because he is always (5 “nines”) right. In a quick chat today he said:

Polyserve can sell up the notion that the gateway does break the 1Gbps barrier (which mostly people falsely believe is not enough I/O for “them”)

The “gateway” Jeff is referring to is the File Serving Utility for Oracle, but that is not the topic of this post. I want to cowardly disagree with Jeff about his assertion that Oracle IT people are erroneously concerned that the most common NAS bandwidth (1GbE) is not sufficient for their needs. I assert that such a concern may in fact be warranted. The point I want to make, and therefore left-handedly disagree with Jeff about, is that it doesn’t matter.

Yes, 1 Gigabit Ethernet is the most common NAS connectivity medium today, and with very little tuning you can get a realizable payload of roughly 112MB/s—I do. If 112MB/s is starving your CPUs, all is not lost.

A Safe Rule of Thumb
There is a rule of thumb that has stood the test of time regarding the balance between processor capability and I/O bandwidth. Now, I know that sometimes man bites dog, but the vast majority of systems will strike a balance between CPU and I/O according to the following formula:

100Mb I/O bandwidth for each 1GHz of CPU

This formula leans towards DSS-style workloads, so it is certainly a safe bet for OLTP. Oh, by the way, Mb is not MB. I see such notation horribly interchanged all too often, and it makes a big difference when you are trying to stuff 100 pounds (how many kilos is that?) of rocks into a 10 pound bag…

Memories
The last “really big” system I had dedicated to my projects (it was my “personal” lab system) was a Sequent NUMA-Q 2000 with 32 700MHz processors, 32GB RAM and 396 4GB hard disk drives. The formula held true then. It doesn’t sound like much by today’s standards, but about 280MB/s would saturate the system if I was doing heavy lifting such as index creation or complex queries with Parallel Query Option. On the other hand, I assure you that the processors nearly burst into flames running OLTP long before I hit ~280MB/s of random 4KB transfers. After all, using the formula, that system would be able to deliver roughly 70,000 4KB IOps—and that was a lot in those days. By contrast, I’ll blog soon about how uneventful that I/O rate is with modern commodity servers (and I still hate that term, need to blog on it—note to self).

The moral of the story is that if you are running on a legacy Unix system that is near the end of its lease, it is quite likely that the compute power it offers can be replaced by an 8 core AMD Opteron system running 64-bit Oracle. Put that thought on the back burner if you had a short lease on, say, an IBM RS/6000 Regatta though (hey, I still have my favorites). If you are planning a deployment that can be handled by an 8 core AMD Opteron system (very likely), I can all but guarantee you that triple-bonded NFS paths with client-side O_DIRECT will not starve your processors one bit.

Now, if you think a single-headed NAS device (Filer) can really feed a 3-way triple-bonded NFS data path for reads and writes, you need to do some testing and then read this.

So, in the end, I didn’t really disagree at all with Jeff, and for that, I feel good and safe!


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.