Archive Page 37

RAC Expert or Clusters Expert?

Introducing the Oracle SMP Expert. What is a Spinlock?
I am not joking when I tell you that I met an individual last year who billed himself as an “Oracle SMP expert.” That is fine and dandy, but through the course of our discussion I realized that this person had a severely limited understanding of the most crucial concept in SMP software scalability—critical sections. It wasn’t the concept of critical sections this individual didn’t understand so much as the mutual exclusion that must accompany critical sections on SMP systems. In Oracle terms, this person could not deliver a coherent definition of what a latch is—that is, he didn’t understand what a spinlock is and how Oracle implements them. An “Oracle SMP expert” who lacks even a cursory understanding of mutual exclusion principles is an awful lot like a “RAC expert” who does not have a firm understanding of what the term “fencing” means.
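For readers who have never peeked under the hood, a latch is essentially a spinlock: a word in shared memory that a process tries to claim with an atomic instruction, spinning (and eventually sleeping) when another process got there first. The following is a minimal sketch of the idea in C using GCC atomic builtins; it is purely illustrative and is in no way Oracle's actual latch implementation.

    #include <sched.h>

    typedef volatile int latch_t;            /* 0 = free, 1 = held */

    void latch_get(latch_t *l)
    {
        int spins = 0;

        /* Atomically store 1; the returned old value tells us if we won. */
        while (__sync_lock_test_and_set(l, 1) != 0) {
            /* Someone else holds the latch: spin a while, then back off. */
            if (++spins > 2000) {
                sched_yield();               /* a production latch would sleep on a wait event here */
                spins = 0;
            }
        }
    }

    void latch_release(latch_t *l)
    {
        __sync_lock_release(l);              /* store 0 with release semantics */
    }

    int main(void)
    {
        static latch_t cache_buffers_chains; /* a make-believe latch for the example */

        latch_get(&cache_buffers_chains);
        /* ... critical section: walk a hash chain, pin a buffer ... */
        latch_release(&cache_buffers_chains);
        return 0;
    }

The whole point of the mutual exclusion argument above is the miss path: the atomic test-and-set is cheap when the latch is free, and everything interesting (spinning, sleeping, wait events) happens when it is not.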

I have met a lot of “RAC experts” in the last 5 years who lack an understanding of clusters principles—most notably what the term “fencing” means and why it matters to RAC. Fencing is to clusters what critical sections are to SMP scalability.

Is it possible to be a “RAC expert” without being a cluster expert? The following is a digest of this paper about clusterware I have posted on the OakTable Network website. For that matter, Julian Dyke and Steve Shaw accepted some of this information for inclusion in their RAC book.

Actually, I think getting it in their book was a part of the bribe for the technical review I did of the book (just joking).

I Adore RAC and Fencing is a Cool Sport!
No, not that kind of fencing. Fencing is a generic clustering term relating to how a cluster handles nodes that should no longer have access to shared resources such as shared disk. For example, if a node in the cluster has access to shared disk but has no functioning interconnects, it really no longer belongs in the cluster. There are several different types of fencing. The most common type came from academia and is referred to by the acronym STOMITH, which stands for Shoot The Other Machine In The Head. A more popular variant of this acronym is STONITH, where “N” stands for Node. While STONITH is a common term, there is nothing common about how it is implemented. The general idea is that the healthy nodes in the cluster are responsible for determining that an unhealthy node should no longer be in the cluster. Once such a determination is made, a healthy node takes action to power-cycle the errant node. This can be done with network power switches, for example. All told, STONITH is a “good” approach to fencing because it is generally built upon the notion that healthy nodes monitor and take action to fence unhealthy nodes.

This differs significantly from the “fencing” model implemented in Oracle Clusterware, which doesn’t implement STONITH at all. In Oracle Clusterware, nodes fence themselves by executing the reboot(8) command out of the /etc/init.d/init.cssd. This is a very portable approach to “fencing”, but it raises the question of what happens if the node is so unhealthy that it cannot successfully execute the reboot(8) command. Certainly we’ve all experienced systems that were so incapacitated that commands no longer executed (e.g., complete virtual memory depletion, etc.). In a cluster it is imperative that nodes be fenced when needed, otherwise they can corrupt data. After all, there is a reason the node is being fenced. Having a node with active I/O paths to shared storage after it is supposed to be fenced from the cluster is not a good thing.

Oracle Clusterware and Vendor Clusterware in Parallel
On all platforms, except Linux and Windows, Oracle Clusterware can execute in an integrated fashion with the host clusterware. An example of this would be Oracle10g using the libskgx[n/p] libraries supplied by HP for the MC ServiceGuard environment. When Oracle runs with integrated vendor clusterware, Oracle makes calls to the vendor-supplied library to perform fencing operations. This blog post is about Linux, so the only relationship between vendor clusterware and Oracle clusterware is when Oracle-validated compatible clusterware runs in parallel with Oracle Clusterware. One such example of this model is Oracle10g RAC on PolyServe Matrix Server.

In situations where Oracle’s fencing mechanism is not able to perform its fencing operation, the underlying validated host clusterware will fence the node, as is the case with PolyServe Matrix Server. It turns out that the criteria used by Oracle Clusterware to trigger fencing are the same criteria that host clusterware uses to take action. Oracle instituted the Vendor Clusterware Validation suites to ensure that underlying clusterware is compatible with and complements Oracle Clusterware. STONITH is one form of fencing, but far from the only one. PolyServe supports a sophisticated form of STONITH where the healthy nodes integrate with management interfaces such as Hewlett-Packard’s iLO (Integrated Lights-Out) and Dell DRAC. Here again, the most important principle of clustering is implemented—healthy nodes take action to fence unhealthy nodes—which ensures that the fencing will occur. This form of STONITH is more sophisticated than the network power-switch approach, but in the end they do the same thing—both approaches power-cycle unhealthy nodes. However, it is not always desirable to have an unhealthy server power-cycled just for the sake of fencing.

Fabric Fencing
With STONITH, there could be helpful state information lost in the power reset. Losing that information may make cluster troubleshooting quite difficult. Also, if the condition that triggered the fencing persists across reboots, a “reboot loop” can occur. For this reason, PolyServe implements Fabric Fencing as the preferred option for customers running Real Application Clusters. Fabric Fencing is implemented in the PolyServe SAN management layer. PolyServe certifies a comprehensive list of Fibre Channel switches that are tested with the Fabric Fencing code. All nodes in a PolyServe cluster have LAN connectivity to the Fibre Channel switches. With Fabric Fencing, healthy nodes make SNMP calls to the Fibre Channel switch to disable all SAN access from unhealthy nodes. This form of fencing is built upon the sound principle of having healthy servers fence unhealthy servers, but the fenced server is left in an “up” state—yet completely severed from shared disk access. Administrators can log into it, view logs and so on, but before the node can rejoin the cluster, it must be rebooted.
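To make the mechanics a little more concrete, here is a rough sketch in C, using the Net-SNMP library, of what "healthy node disables the unhealthy node's switch port" could look like. The switch name, community string and port index are made up, and the OID shown is simply the standard IF-MIB ifAdminStatus column; PolyServe's actual Fabric Fencing code is certainly more involved than this.

    #include <net-snmp/net-snmp-config.h>
    #include <net-snmp/net-snmp-includes.h>
    #include <string.h>

    /* Hypothetical example: set ifAdminStatus to down(2) for port index 3
     * on a Fibre Channel switch, severing the fenced node's SAN access.  */
    int main(void)
    {
        netsnmp_session session, *ss;
        netsnmp_pdu *pdu, *response = NULL;
        oid port_oid[MAX_OID_LEN];
        size_t port_oid_len = MAX_OID_LEN;

        init_snmp("fabric-fence-sketch");
        snmp_sess_init(&session);
        session.peername = strdup("fc-switch-1");     /* made-up switch name  */
        session.version = SNMP_VERSION_2c;
        session.community = (u_char *)"private";      /* made-up community    */
        session.community_len = strlen("private");

        if ((ss = snmp_open(&session)) == NULL)
            return 1;

        /* IF-MIB::ifAdminStatus.3 expressed numerically */
        read_objid(".1.3.6.1.2.1.2.2.1.7.3", port_oid, &port_oid_len);

        pdu = snmp_pdu_create(SNMP_MSG_SET);
        snmp_add_var(pdu, port_oid, port_oid_len, 'i', "2");   /* 2 = down */

        snmp_synch_response(ss, pdu, &response);
        if (response)
            snmp_free_pdu(response);
        snmp_close(ss);
        return 0;
    }

The essential property is the direction of the action: a healthy node reaches out and cuts the unhealthy node off from the SAN, rather than hoping the unhealthy node can still act on its own behalf.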

Kernel Mode Clusterware
The most important aspect of host clusterware, such as PolyServe, is that it is generally implemented in Kernel Mode. In the case of PolyServe, the most critical functionality—SAN management, the cluster filesystem, the volume manager and so on—is implemented in Kernel Mode on both Linux and Windows. On the other hand, when fencing code is implemented in User Mode, there is always the risk that the code will not get processor cycles to execute. Indeed, with clusters in general, overly saturated nodes often need to be fenced because they are not responding to status requests from other nodes in the cluster. When nodes in the cluster are getting so saturated as to trigger fencing, having critical clusterware code execute in Kernel Mode provides a higher level of assurance that the fencing operation will succeed. That is, if all the nodes in the cluster are approaching a critical state and a fencing operation is necessary against an errant node, having Kernel Mode fencing architected as either robust STONITH or Fabric Fencing ensures the correct action will take place.

Coming Soon
What about SCSI-3 Persistent Reservation? Isn’t I/O fencing as good as server fencing? No, it isn’t.

Data Direct Networks and Texas Memory Systems. Should Be Fun.

Fun Weekend Ahead
Our cool lab manager has reconfigured the DDN storage allocated to my cluster of DL585s so that I get 65 spindles per LUN. This particular cluster had been hobbled with LUNs derived from only 16 spindles. Looks like I get to do some reconfiguration of RAC on the cluster this weekend. The good thing is that the cluster also has a Texas Memory Systems Solid State Disk (Another PolyServe Partner) configured for Redo logging. I should be able to get some good performance and stability readings.

This cluster always has ASM and PolyServe CFS set up side-by-side. Makes for good fun.

DBWR Multiblock Writes? Yes, Indeed!

Learning Something New
Learning Oracle is a never-ending effort. “OK, tell me something I didn’t know”, you say? You may know this bit I’m about to blog about, but I sure didn’t. I don’t know when the Database Writer I/O profile changed, but it has—somewhere along the way.

I have a simple test of 10gR2 (10.2.0.1) on RHEL4 x86_64 using filesystem files. I was using strace(1) on the single DBWR process I have configured for this particular instance. The database uses an 8KB block size and there are no variable block sizes anywhere (pools are not even configured). The workload I’m running while monitoring DBWR is quite simple. I have a tablespace called TEST created using Oracle Managed Files (OMF)—thus the peculiar filenames you’ll see in the screen shots below. I have 2 tables in the TEST tablespace and am simply looping INSERT INTO SELECT * FROM statements between the 2 tables as the stimulus to get DBWR busy.

In the following screen shot you’ll see that I took a look at the file descriptors DBWR is using by listing a few of them out in /proc/<pid>/fd. The interesting file descriptors for this topic are:

  • FD 18 – System Tablespace
  • FD 19 – UNDO Tablespace
  • FD 20 – SYSAUX Tablespace
  • FD 23 – The TEST Tablespace


[Screen shot: listing of DBWR’s open file descriptors in /proc/<pid>/fd]

I have purposefully set up this test to not use libaio, thus filesystemio_options was not set in the parameter file. In the next screen shot I use grep to pull all occurrences of the pwrite(2) system calls that DBWR is making that are not 8KB in size. Historically there should be none, since DBWR’s job is to clean scattered SGA buffers by writing single blocks to random file offsets. That has always been DBWR’s lot in life.

[Screen shot: strace output grepping for DBWR pwrite() calls that are not 8KB in size]

So, as strace(1) is showing, these days DBWR is exhibiting a variation of its traditional I/O profile. In this synchronous I/O case, on this port of Oracle, DBWR is performing synchronous multi-block writes to sequential blocks on disk! That may seem like trivial pursuit, but it really isn’t. First, where are the buffers? The pwrite system call does not flush scattered buffers as do such routines as writev(), lio_listio() or odm_io()—it is not a gathered write. So if DBWR’s job is to build write batches by walking LRUs and setting up write-lists by LRU age, how is it magically finding SGA buffers that are adjacent in memory and bound for sequential offsets in the same file? Where is the Twilight Zone soundtrack when you need it? For DBWR to issue these pwrite() system calls requires the buffers to be contiguous in memory.
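To illustrate why the choice of system call matters, here is a contrived sketch in C. The pwrite() variant needs one physically contiguous buffer, while the writev() variant gathers four scattered 8KB buffers into a single call that still lands on consecutive blocks in the file. This is only an illustration of the two call shapes, with a made-up file name; it is not a claim about how DBWR actually builds its write batches.

    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #define BLKSZ 8192

    int main(void)
    {
        int fd = open("testfile.dbf", O_RDWR | O_CREAT, 0644);  /* made-up file name */
        static char contig[4 * BLKSZ];               /* one contiguous 32KB region    */
        static char b0[BLKSZ], b1[BLKSZ], b2[BLKSZ], b3[BLKSZ]; /* scattered buffers  */
        struct iovec iov[4] = {
            { b0, BLKSZ }, { b1, BLKSZ }, { b2, BLKSZ }, { b3, BLKSZ }
        };

        /* Multi-block write the way strace showed it: one pwrite() of 32KB.
         * The four blocks must already be adjacent in memory.               */
        pwrite(fd, contig, sizeof(contig), 0);

        /* Gathered write: four scattered buffers, still one system call,
         * still landing on consecutive 8KB blocks starting at offset 32KB.  */
        lseek(fd, 4 * BLKSZ, SEEK_SET);
        writev(fd, iov, 4);

        close(fd);
        return 0;
    }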

Of course DBWR also behaved in a more “predictable” manner during this test, as the following screen shot shows:

[Screen shot: strace output showing the traditional single-block 8KB DBWR writes]

Is This A Big Problem?
No, I don’t think so—unless you’re like me and have had DBWR’s I/O profile cast in stone dating back to Version 6 of Oracle. All this means is that when you are counting DBWR write calls, you can’t presume they are always single block. Now that is something new.

Porting
I often point out that Microsoft has always had it quite easy with SQL Server. I’m not doing that pitiful Microsoft bashing that I despise so much, but think about it. SQL Server started with a fully functional version of Sybase (I was a Sybase fan) and brought it to market on 1 Operating System and only 1 platform (x86). Since then their “porting” effort has “exploded” into x86, x86_64 and IA64. That is the simple life.

This DBWR “issue” may only be relevant to this Oracle revision and this specific port. I can say with great certainty that if Sequent were still alive, the Sequent port would have used one of the gathered write calls we had at our disposal in this particular case. With Oracle, so much of the runtime is determined by porting decisions. So the $64,000 question is, why isn’t DBWR just using writev(2), which would nicely clear all those SGA buffers from their memory-scattered locations?

Things That Make You Go Hmmmmm

Marketing Efforts Prove SunFire T2000 Is Not Fit For Oracle.

I’ll try not to make a habit of referencing a comment on my blog as subject matter for a new post, but this one is worth it. One of my blog readers posted this comment. The question posed was what performance effect there would be with DSS or OLTP with the Sun CoolThreads architecture—given it has a single FPU shared by 8 cores. The comment was:

I have heard though that the CoolThread processors are not always great at supporting databases because they only have a single floating point processor? Would you see this as a problem in either a OLTP or DSS environment that don’t have any requirement for calculations that may involve floating points?

Since this is an Oracle blog, I’ll address this with an Oracle-oriented answer. I think most of you know how much I dislike red herring marketing techniques, so I’ll point out that there has been a good deal of web-FUD about the fact that the 8 cores in the CoolThreads packaging all share a single floating point unit (FPU). Who cares?

Oracle and Floating Point
The core Oracle kernel does not, by and large, use floating point operations. There are some floating point ops in layers that don’t execute at high frequency and therefore are not of interest. Let’s stick to the things that happen thousands or tens of thousands of times per second (e.g., buffer gets, latching, etc.). And, yes, there are a couple of new 10g native float datatypes (e.g., BINARY_FLOAT, BINARY_DOUBLE), but how arithmetic operations are performed on these is a porting decision. That is, the team that ports Oracle to a given architecture must choose whether the handling of these datatypes is done with floating point operations or not. Oracle documentation on the matter states:

The BINARY_FLOAT and BINARY_DOUBLE types can use native hardware arithmetic instructions…

Having a background in port-level engineering of Oracle, I’ll point out that the word “can” in this context is very important. I have a query out about whether the Solaris ports do indeed do this, but what is the real impact either way?

At first glance one might expect an operation like select sum(amt_sold) to benefit significantly if the amt_sold column were defined as a BINARY_FLOAT or BINARY_DOUBLE, but that is just not so. Oracle documentation is right to point out that machine floating point types are, uh, not the best option for financial data. The documentation reads further:

These types do not always represent fractional values precisely, and handle rounding differently than the NUMBER types. These types are less suitable for financial code where accuracy is critical.

So those folks out there that are trying to market against CoolThreads based largely on its lack of good FPU support can forget the angle of poor database performance. It is a red herring. Well, OK, maybe there is an application out there that is not financial and would like to benefit from the fact that a BINARY_FLOAT is 4 bytes of storage whereas a NUMBER is 21 bytes. But there again I would have to see real numbers from a real test to believe there is any benefit. Why? Remember that access to a row with a BINARY_FLOAT column is prefaced with quite a bit of SGA code that is entirely integer. Not to mention the fact that it is unlikely a table would have only that column in it. All the other adjacent columns add overhead in the caching and fetching of this nice, new, small BINARY_FLOAT column. All the layers of code to parse the query, to construct the plan, to allocate heaps and so on are mostly integer operations. Then, accessing each row piece in each block is laden with cache gets/misses (logical I/O) and the necessary physical I/O. For each potential hardware FPU operation on a BINARY_FLOAT column there are orders of magnitude more integer operations.

All that “theory” aside, it is entirely possible to actually measure before we mangle, as the cliché goes. Once again, thanks to my old friend Glenn Fawcett for a pointer to a toolkit for measuring floating point operations.

Why the Passion?
I remember the FUD marketing that competitors tried to use against Sequent when the infamous Pentium FDIV bug was found. That bug had no effect on the Sequent port of Oracle. It seems that was a subtle fact the marketing personnel working for our competitors missed, because they went wild with it. See, at the time Sequent was an Oracle server powerhouse with systems built around Intel processors at their core (envision a 9 square foot board loaded with ASICs, PALs and other goodies with a little Pentium dot in the middle). Sequent was the development platform for Unix Oracle Parallel Server and Intra-node Parallel Query. Oracle ran their entire datacenter on Sequent Symmetry systems at the time (picture 100+ refrigerator-sized chassis lined up in rows at Redwood Shores) and Oracle Server Technologies ran their nightly regression testing against Sequent Symmetry systems as well. Boring, I know. But I was in Oracle Advanced Engineering at the time and I didn’t appreciate the FUD marketing that our competitors (whose systems were RISC based) tried to play up with the supposed impact of that bug on Oracle performance on Sequent gear. I do not like FUD marketing. If you are a regular reader of my blog I bet you know what other current FUD marketing I particularly dislike.

More to the point of CoolThreads, I’ve seen web content from companies using what I consider to be red herring marketing against the SunFire T[12]000 family of servers. I am probably one of the biggest proponents of fair play out there and suggesting CoolThreads technology is not fit for Oracle due to poor FPU support is just not right. Now, does that mean I’d choose a SunFire over an industry standard server? Well, that would be another blog entry.

SAN Madness–Don’t Let It Ruin Your Project.

I was recently exchanging email with a fellow participant in the oracle-l mailing list about a problem they were having evaluating Oracle on a SAN. It seems their evaluation was completely blown out due to non-stop SAN madness. I wonder if their Storage Administrator is also their Unstructured Data Administrator? I know one thing for certain: this is a shop that needs to join up with the forces of BAARF. When answering my question about how things were going, he responded with:

Poorly. The installation/setup of oracle was fine. The problem was with the SAN the SA gave me. It was an [brand name removed]. I know he set it up as Raid 5 but I don’t know the specifics after that. All I know it was 5x slower than anything we currently have, and they’ve made us use Raid 5 everywhere. We use alot of T3’s with a few SANs sprinkled in. I think 15 days of the 30 day trial was spent hooking up the SAN.

SAN madness indeed! I understand how frustrating infrastructure issues are when they hinder the effort to actually perform the required testing. It isn’t all the fault of the SAN in this situation as it seems there were a few 8i->10g Cost Based Optimizer (CBO) hurdles to overcome as well. He wrote:

I also spent alot of time just trying to get explain plans to match between my 8i database and the new 10g, in order to compare apples to apples.

I hope he didn’t feel responsible for that. I am convinced that one of the only humans who really knows the cost-based optimizer in practical application is fellow OakTable member Jonathan Lewis.

The email continued by questioning whether a T2000 could replace a 16 CPU E6500. He wrote:

But after all that I still couldn’t see why they thought we could replace our 16 cpu Sun E6500 running full tilt with one T2000. Anyways, that project has been shelved.

SAN Madness
How sad it is that an entire project can get shelved because of SAN madness, but I am not surprised. On the other hand, the idea that a T2000 can supplant an E6500 is not that far out of line. As my old friend Glenn Fawcett points out, the bandwidth isn’t even close: the T2000 has a 20GB/s backplane whereas the E6500 has only 9.6GB/s. For OLTP, the T2000 would most likely beat a fully loaded Starfire UE10K, which is backplane limited to 12.8GB/s. And you might even have enough HVAC for it!

Microsoft and LSI–Move Over Novell? Time for More Open Source Patents.

Thanksgiving Day or April Fool’s Day?
Just about everyone knows Cary Millsap of Hotsos and other fame. On Thanksgiving Day, he sent out a real turkey of an email (I just couldn’t resist that one) to us on the OakTable email list. I thought it was a joke when I first saw it.

The email pointed out that LSI had been granted a patent on the Linked List. Yes, the title of the patent is simply “Linked List”. As the holder of a couple of software patents, I tend to play devil’s advocate. I see entirely too much of the open source anarchy mentality when it comes to software patents. However, this particular patent is really out there. I really see no way that this patent could have made it through scrutiny on prior art or novelty. There are millions of software applications that implement what the following quoted summary of the invention describes:

The present invention overcomes the disadvantages and limitations of the prior art by providing a system and method for traversing a list using auxiliary pointers that indicate the next item in a sequence. The sequential list may be created in one sequence, but used in a second sequence without having to resort the list.

Uh, there is nothing novel about that. But that is not all. Look at claim number 4:

A computer system capable of traversing a list having at least two sequential pointers comprising: a plurality of items that are contained in said list to be traversed; at least a primary pointer and a secondary pointer for each of said items of said list such that each of said items has an associated primary pointer and an associated secondary pointer…

That is weird. Find me a computer that can’t do that. Was there a C compiler for the TRS-80? Yes, in fact, there was. Hmmm…I better dig further…

I don’t think this patent is a reason to abolish software patents, but clearly something is out of whack. Maybe this gentleman could explain it?

This is Not A Software Patent, Is It?
OK, this other patent granted to LSI this year is interesting. It doesn’t even describe a software program really. I wonder if Microsoft is paying royalties. How is the following abstract from the patent not in fact precisely what Microsoft Project does? Oh, I know, MS Project doesn’t send email.

The present invention is a computer-based system for managing projects. It allows the user to input data concerning a project and associate individuals with the project. The system then determines a deadline for completing a task associated with the project and send out reminders accordingly. The system provides the user a number of options not available on the conventional docketing systems, such as automatically increasing the frequency with which reminders are sent as the deadline approaches, and automatically increasing the number of individuals to whom the reminders are sent as the deadline draws near.


Quote of the Day

Not enough people are aware that the trouble at the high-end is no longer scalability as much as power, cooling and storage provisioning/connectivity.  Large scale commodity computing is the new high-end.

~Kevin Closson, 2006

Storagemojo.com is A Really Good Blog

Those of you who know me are aware that I am new to blogging and that Oracle-related storage topics are very near and dear to my heart. I’m sure all you long-time bloggers have likely seen this site, but I have got to say that storagemojo.com is a great blog.

I now know what my blog aspires to be—from the Oracle-centric viewpoint of course.

Pay this site a visit if you are storage minded. It has some serious mojo!

A Better KVM (New Oaktable Network Member)

…from Kevin Closson’s blog…

We just pulled up another chair to the Oaktable Network. This blog entry is to welcome Kurt Van Meerbeeck to the group. Kurt is the author of PePi, Pretoria, and DUDE. While his initials are KVM, I assure you he is much more complex than a KVM.

Welcome Kurt, it is a pleasure to have you on board!


UKOUG Update

It has been a great conference thus far. I have attended interesting presentations such as one about Disposable Computing Architecture by James Morle. It parallels my positions on commodity computing for certain, but he threw in some thought-provoking points. If you didn’t get to see the presentation, watch for it in the proceedings. He may have that presentation on his website, which is here, soon. James has been a friend for many years; if you’ve never been to his website you should check it out.

I enjoy this conference since so many of the Oracle OakTable Network members attend and speak. I have “known” a lot of fellow members for a long time that I have never met face to face.

I spoke about NAS architecture on Tuesday and am about to speak on Clusters Consolidation. Both presentations cover topics in whitepapers available on my blog front page—just click “Papers, etc”, or here if you are interested in such topics.

Shameless Plug Time

One of our customers, Taleo, has been honored with one of the 2006 InfoWorld 100 Awards in the Services category. Taleo puts the PolyServe Database Utility for Oracle to good use. More on the award is available at this InfoWorld webpage.

 This was a very short blog entry since I need to rush in and do this presentation about Clusters Consolidation.

I promise to make a blog entry very soon about ASM on NAS from a slightly updated angle.


Network Appliance OnTap GX for Oracle? Clustered Name Space. A Good User Group (OSWOUG)

Presenting NAS Architecture at OSWOUG
I had a speaking session recently at OSWOUG about NAS architectures as pertaining to Oracle. It was a good group, and I enjoyed the opportunity to speak there. The presentation was an “animated” digest of this paper and covers:

  • single-headed filers
  • clustered single-headed filers
  • asymmetrical multi-headed NAS
  • symmetrical multi-headed NAS

Symmetric NAS versus Clustered Name Space
Since the only model in existence for the symmetrical multi-headed NAS architecture is the HP Enterprise File Services Clustered Gateway (which is an OEMed version of the PolyServe File Serving Utility), I spent some time discussing the underpinnings—the cluster filesystem. At one point I spent just a few moments discussing Cluster Name Space technology, but I wish I had taken it further. I’ll make up for it now.

If You Can’t Cluster Correctly, At Least Cluster Something
Something is better than nothing, right? Many years ago, an old friend of mine was coming back from duty in Antarctica on a coast guard ice cutter. You’d have to know Kenny to get the full picture but at one point, over beer, he pulled out a phrase that still makes me chuckle. I don’t remember the context, but his retort to some rant I was on was, “Well, if you can’t tie a knot, tie a lot”. That, my friends, is the best way to sum up Cluster Name Space NAS technology. I couldn’t find a dictionary definition, so I’ll strike a claim:

clus·ter name space ('klus-ter 'nAm 'spAs)
n.

Software technology which when applied to separate storage devices gives the illusion of scalable filesystem presentation. Cluster name space technology is a filesystem directory aggregator. Files stored in a directory in a cluster name space implementation cannot span the underlying storage devices. Single large files cannot benefit from cluster name space technology. Cluster name space is commonly referred to by the acronym CNS. When applied to Oracle databases, CNS forces the physical partitioning of data across the multiple, aggregated underlying storage devices.

v.

To apply a virtualized presentation layer to a bottlenecked storage architecture: “hey, let’s just CNS it and ship it”

The verb form of the term was demonstrated when Network Appliance bought Spinnaker back in 2003. When you spend 300 million dollars on technology, what is done with it often winds up looking more like a verb than a noun. What do I mean by that? There is nothing about Network Appliance’s OnTap GX technology that was built from the ground up to be clustered and scalable. Network Appliance has had clustered filers for a long time. That doesn’t mean the data in the filers is clustered. It is merely a cabling exercise. What they did between the 2003 purchase of Spinnaker and the release of OnTap GX was a verb—they CNSed their existing technology. I can tell from the emails I get that there are about 42 of you who think I make all this stuff up. Let’s get it from the horse’s mouth, quoting from page 8 of this NetApp whitepaper covering OnTap GX:

When an NFS request comes to a Data ONTAP GX node, if the requested data is local to the node that receives the request [reference removed], that node serves the file. If the data is not local [reference removed], the node routes the request over the private switched cluster network to the node where the data resides. That node then forwards the requested data back to the requesting node to be forwarded to the client.

Perhaps a picture can replace a thousand words. Nobody, not even in the deepest, darkest, remote areas of the Amazon rain forest would mistake what is depicted in the following graphic for a symmetrical file serving technology:

[Diagram: Data ONTAP GX nodes routing a client request over the private cluster network to the node where the data resides]

The Press
There has been a lot of press coverage of Data OnTap GX (NetApp is the poster child for CNS), and most articles get the technology just flat wrong. This techtarget.com article, on the other hand, gets it quite right:

NetApp’s SpinServer is a clustered file system for NFS that currently doesn’t support CIFS. SpinServer exports a single namespace across all storage nodes, but files aren’t striped across nodes

And so does this techtarget.com article which exposes the weakness of the CNS verb situation:

…immediately found a problem with the high-availability (HA) failover detection system. A failure could be bad enough that clients could not access data but not bad enough to alert the system.

What Does This Have To Do With Oracle?
OK, this is a blog about Oracle and storage. Let me interpret the quote from NetApp’s paper covering Data OnTap GX. If you have a large Oracle datafile that gets hammered with I/O, this technology will do absolutely nothing to improve your throughput or reduce I/O latencies. A single file is contained within one filer’s storage and thus represents an I/O throughput bottleneck. One of the sites I read, DrunkenData, has a thread on this topic. It appears as though the thread is aimed more at trying to believe in something than at understanding something, but my assessment could be wrong.

The CNS topic is quite simple; the word “cluster” in the CNS context has nothing to do with file-level scalability. Yes, I know Network Appliance is a multi-billion dollar corporation so I’m sure my exposé of this topic will be met with nefarious men wearing trench coats lying in wait to attack…

Cram Some Technology into Your Solution
This storagemagazine.techtarget.com article covers some clustered NAS topics, but there was one bit in there that stood out when I read it and I’d like to comment. The following excerpt recommends CNS technology for Oracle:

…To avoid this problem, consider a global namespace product…The only way to implement a cross-platform global namespace is to replace your NAS infrastructure with, for example, NetApp’s SpinServer or Panasas’ ActiveScale. If Oracle 10 on a Linux cluster is in your future, then the NetApp and Panasas solutions should be on your short list.

Bad mojo! I have said many times before that Oracle on NAS is a good model, and NetApp is obviously the 800lb gorilla in this space. However, I disagree with the idea posted in this quoted article for two reasons. First, Oracle uses large files and CNS does not scale single large files. To get the benefit of CNS, you will have to partition your large files into smaller ones so as to get multiple filers serving the data. If you wanted to physically partition your data, you would have chosen a different database technology like DB2 EEE. Second, Oracle has an established program for NAS vendors to certify their capabilities with Oracle databases. Panasas may be good technology, but since it is not an OSCP vendor, it doesn’t get to play with Oracle. On the other hand, the HP EFS-CG, which is multi-headed and scalable, is. Fair is fair.

A Change of Heart
Here was an interesting post in the bioinformatics.org mailing list about “global name space”. I don’t know how the different filesystem technologies got clumped together as they did, but that was two years ago. I found it interesting that the poster was from Verari Systems who now resell the PolyServe File Serving Utility.

Summary
I hope you know more now about CNS than you would ever care to know–literally.

TPC-C Result Proves DB2 is “Better” Than Oracle

Oracle Versus “The Competition”
There is a thread on asktom.oracle.com that started back in 2001 about how Oracle compares to other database servers. While I think today’s competitive landscape in RDBMS technology proves that is a bit of a moot question, I followed the thread to see how it would go. The thread is a very long point/counter-point ordeal that Tom handles masterfully. There are bits and pieces about WinFS, MySQL, DB2, PostgreSQL, multi-version read consistency (thank heaven above—and Andy Mendelsohn et al. of course), writers blocking readers, page lock escalation (boo Sybase) and so on. It’s a fun read. The part that motivated me to blog is the set of TPC-C results posted by a participant in the thread. The TPC-C results showed record-breaking non-clustered TPC-C v5 results on a flagship RS/6000 (System p5 I believe it is now branded) server running DB2. Tom replies to the query with:

…do YOU run a TPC-C workload, or do YOU run YOUR system with YOUR transactions?

That was an excellent reply. In fact, that has been the standard remark about TPC-C results given by all players—when their result is not on top. I’m not taking a swipe at Tom. What I’m saying is that TPC-C results have very little to do with the database being tested. As an aside, it is a great bellwether for what software features actually offer enhanced performance and are stable enough to sustain the crushing load. More on that later.

TPC-C is a Hardware Benchmark
See, most people don’t realize that TPC-C is a hardware benchmark first and foremost. The database software approach to getting the best number is to do the absolute least amount of processing per transaction. Don’t get me wrong, all the audited results do indeed comply with the specification, but the “application” code is written to reduce instructions (and more importantly cycles) per transaction any way possible. If ever there was a “perfect” Oracle OLTP workload, it would be the TPC-C kit that Oracle gives to the hardware vendors in order to partake in a competitive, audited TPC-C. That application code, however, looks nothing like any application out in front of any Oracle database instance in production today. Don’t get me wrong, all the database vendors do the same thing because they know that it is a hardware benchmark, not a database benchmark.

TPC-C—Keeping it Real
The very benchmark itself is moot—a fact known nearly since the ratification of the specification. I remember sitting in a SIGMOD presentation of this paper back in 1995 where Tandem proved that the workload is fully partitionable and scales linearly, thus ridiculously large results are impeded only by physical constraints. That is, if you can find the floor space, power and cooling, and cable it all up, you too can get a huge result. If only the industry would have listened! What followed has been years of the “arms race” that is TPC-C. How many features have gone into database server products for these benchmarks that do nothing for real datacenter customers? That is a good question. Having worked on the “inside” I could say, but men in trench coats would sweep me away never to be heard from again. Here’s a hint: the software being installed does not have to come from a shipping distribution medium (wink, wink). In fact, the software installation is not a part of the audit. Oops.

History and Perspective
In 1998 I was part of a team that delivered the first non-clustered, Oracle TPC-C result to hover next to what seemed like a sound barrier at the time—get this, 100,000 TpmC! Wow. We toiled, and labored and produced 93,901 TpmC on a Sequent NUMA-Q 2000 with 64 processors as can be seen in these archived version 3 TPC-C results. This and other workloads were the focus of my attention back then. What does this have to do with the AskTom thread?

The thread asking Tom about TPC-C cited a recent DB2 result on IBM’s System p5 595 of 4,033,378 TpmC. I think the point of that comment on the thread was that DB2 must surely be “better” since the closest result with Oracle is 1,601,784 TpmC. For those who don’t follow the rat race, TPC-C results between vendors constantly leap-frog each other. Yes, 4,033,378 is a huge number for a 64 processor (32-socket 64 core) system when compared to that measly number we got some 8 years ago. Or is it?

Moore’s Law?
There have been about 6 Moore’s Law units of time (18 months each) since that ancient 93,901 TpmC result. A lot has changed. The processors used for that result were clocked at 405MHz, had 7.5 million transistors (250nm) and tiny little 512KB L2 caches. With Moore’s Law, therefore, we should expect processors with some 480 million transistors today. Well, somewhere along the way Moore’s Law sloped a bit, so the IBM System p5 595 processors (POWER5+) “only” have 276 million transistors. Packaging, on the other hand, is a huge factor that generally trumps Moore’s Law. The POWER5+ are packaged in multi-chip modules (MCM) where there is some 144MB of L3 cache (36MB/socket, off-chip yes, but 80GB/s) backing up a full 7.6MB L2 cache—all this with 90nm technology. Oh, and that huge TpmC was obtained on a system configured with 2048GB (yes, 2 Terabytes) of memory whereas the “puny” Sequent at 93,901 TpmC had 64GB. And, my oh my, how much faster loading memory is on this modern IBM! Much faster—about 6-fold in fact!

The Tale of The Tape
Is 4,033,378 really an astronomical TPC result? Let’s compare the IBM system to the old Sequent:

  • 43x more throughput (TpmC)
  • 32x more memory configured (with 6x better latency)
  • 37x more transistors per processor (and clocked 6x faster)
  • 15x more processor L2 cache + 36x in L3

So, regardless of the fact that I just did a quick comparison of a DB2 result to an old Oracle8 result, I think there should be little surprise that these huge numbers are being produced especially when you factor in how partitionable TPC-C is. Let’s not forget disk. IBM used 6,400 hard disk drives for this result. If they get the floor space, power, do the cabling and add a bunch more drives and hook up the upcoming POWER6-based System p server, I’m quite certain they will get a bigger TPC-C result. There’s no doubt in my mind—hint, POWER6 has nearly 3 fold more transistors (on 65nm technology) than POWER5+ and clocked at 5GHz too.

Tom Kyte is Right
The question is, what does it mean to an Oracle shop? Nothing. So, as usual, Tom Kyte is right. Huge TPC-C results using DB2 don’t mean a thing.

Retrospect and Respect
I’m still proud of that number we got way back when. It was actually quite significant since the closest Oracle result at the time was a clustered 96-CPU 102,542 TpmC Compaq number using Alpha processors. That reminds me, I had a guy argue with me last year at OOW suggesting that result was the first 100K+ non-clustered Oracle result. I couldn’t blackberry the tpc.org results spreadsheet quick enough I guess. I digress. As I was saying, the effort to get that old result was fun. Not to mention the 510-pin gallium arsenide data pump that each set of 4 processors linked to in the NUMA system was pretty cool looking. It was a stable system too. The Operating System technology was a marvel. I’m fortunate to still be working with a good number of those Sequent kernel engineers right here at PolyServe where one of our products is a general purpose, fully symmetric, distributed cluster filesystem for Windows and Linux.


DBWR Efficiency, AIO, I/O Libraries with ASM.

I’m sure this information is not new to very many folks, but there might be some interesting stuff in this post…

I’m doing OLTP performance testing using a DL585, 32GB, Scalable NAS, RHEL4 (Real, genuine RHEL4) and 10gR2. I’m doing some oprofile analysis on both the NFS client and server. I’ll blog more about oprofile soon.

This post will sound a bit like a rant. Did you know that the Linux 2.5 Kernel Summit folks spent a bunch of time mulling over features that have been considered absolutely critical for Oracle performance on Unix systems for over 10 years? Take a peek at this partial list and chuckle with me, please. After the list, I’m going to talk about DBWR. A snippet from the list:

  • Raw I/O has a few problems that keep it from achieving the performance it should get. Large operations are broken down into 64KB batches…
  • A true asynchronous I/O interface is needed.
  • Shared memory segments should use shared page tables.
  • A better, lighter-weight semaphore implementation is needed.
  • It would be nice to allow a process to prevent itself from being scheduled out when it is holding a critical resource.

Yes, the list included such features as eliminating the crazy smashing of Oracle multiblock I/O into little bits, real async I/O, shared page tables and non-preemption. That’s right. Every Unix variant worth its salt, in the mid to late 90s, had all this and more. Guess how much of the list is still not implemented. Guess how important those missing items are. I’ll blog some other time about the lighter-weight semaphore and non-preemption that fell off the truck.

I Want To Talk About Async I/O
Prior to the 2.6 Linux Kernel, there was no mainline async I/O support. Yes, there were sundry bits of hobby code out there, but really, it wasn’t until 2.6 that async I/O worked. In fact, a former co-worker (from the Sequent days) did the first HP RAC Linux TPC-C and reported here that the kludgy async I/O that he was offered bumped performance 5%. Yippie! I assure you, based on years of TPC-C work, that async I/O will give you much more than 5% if it works at all.

So, finally, 2.6 brought us async I/O. The implementation deviates from POSIX, which is a good thing. However, it doesn’t deviate enough. One of the most painful aspects of POSIX async I/O, from an Oracle perspective, is that each call can only initiate writes to a single file descriptor. At least the async I/O that did make it into the 2.6 Kernel is better in that regard. With the io_submit(2) routine, DBWR can send a batch of modified buffers to any number of datafiles in a single call. This is good. In fact, this is one of the main reasons Oracle developed the Oracle Disk Manager (ODM) interface specification. See, with odm_io(), any combination of reads and writes, whether sync or async, to any number of file descriptors can be issued in a single call. Moreover, while initiating new I/O, prior requests can be checked for completion. It is a really good interface, but was only developed by Veritas, NetApp and PolyServe. NetApp’s version died because it was tied too tightly to DAFS, which is dead, really dead (I digress). So, yes, ODM (more info in this, and other papers) is quite efficient at completion processing. Anyone out there look at completion processing on 10gR2 with the 2.6 libaio? I did (a long time ago really).
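Here is a bare-bones sketch of the Linux libaio calls being discussed: one io_submit() carrying writes bound for two different file descriptors. The file names are made up and error handling is omitted; this shows the call shape only, not DBWR's real write batching.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLKSZ 8192

    int main(void)
    {
        io_context_t ctx;
        struct iocb cb1, cb2, *batch[2] = { &cb1, &cb2 };
        struct io_event events[2];
        void *buf1, *buf2;
        int fd1 = open("users01.dbf", O_RDWR | O_CREAT | O_DIRECT, 0644);  /* made-up names */
        int fd2 = open("undo01.dbf",  O_RDWR | O_CREAT | O_DIRECT, 0644);

        posix_memalign(&buf1, 4096, BLKSZ);   /* O_DIRECT wants aligned buffers */
        posix_memalign(&buf2, 4096, BLKSZ);
        memset(buf1, 0, BLKSZ);
        memset(buf2, 0, BLKSZ);

        memset(&ctx, 0, sizeof(ctx));
        io_setup(128, &ctx);                  /* create the async I/O context   */

        /* Two writes, two different datafiles, one submit call. */
        io_prep_pwrite(&cb1, fd1, buf1, BLKSZ, 0);
        io_prep_pwrite(&cb2, fd2, buf2, BLKSZ, 0);
        io_submit(ctx, 2, batch);

        /* Block until both complete (min_nr == nr). */
        io_getevents(ctx, 2, 2, events, NULL);

        io_destroy(ctx);
        close(fd1);
        close(fd2);
        return 0;
    }

The point the post makes is that this one submit call replaces what would otherwise be many separate write submissions.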

Here is a screen shot of strace following DBWR while the DL585 is pumping a reasonable I/O rate (approx 8,000 IOPS) from the OLTP workload:

[Screen shot: strace summary of DBWR system calls during the OLTP workload]

Notice anything weird? There are:

  • 38.8 I/O submit calls per cpu second (io_submit() batches)
  • 4,872 I/O completion processing calls per cpu second (io_getevents())
  • 7,785 wall clock time reading calls per cpu second (times() + gettimeofday())

Does Anyone Really Know What Time It Is?
Database writer, with the API currently being used (when no ODM is in play), is doing what we call “snacking for completions”. This happens for one of many reasons. For instance, if the completion check was for any number of completions, there could be only 1 or 2. What’s with that? If DBWR just flushed, say, 256 modified buffers, why is it taking action on just a couple of completions? Waste of time. It’s because the API offers no more flexibility than that. On the other hand, the ODM specification allows for blocking on a completion check until a certain number of requests, or a certain request, is complete—with an optional timeout. And like I said, that completion check can be done while already in the kernel to initiate new I/O.

And yes, DBWR alone is checking wall clock time with a combination of times(2) and gettimeofday(2) at a rate of 7,785 times per cpu second! Wait stats force this. The VOS layer is asking for a timed OS call. The OS can’t help it if DBWR is checking for I/O completions 4,872 times per cpu second—just to harvest I/Os from some 38.8 batch writes per cpu second…ho hum… You won’t be surprised when I tell you that the Sequent port of Oracle had a user mode gettimeofday(). We looked at it this way: if Oracle wants to call gettimeofday() thousands of times per second, we might as well make it really inexpensive. It is a kernel-mode gettimeofday() on Linux of course.
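For the curious, my reconstruction of the call pattern behind those strace numbers looks roughly like the sketch below: each completion check is an io_getevents() call asking for at least one completion, and each such wait is bracketed by wall clock reads so the wait interface can be fed. This is only an illustration of the pattern, not Oracle source.

    #include <libaio.h>
    #include <sys/time.h>

    /* Harvest whatever has completed so far, timing the wait the way a
     * wait interface would. 'ctx' is an io_context_t set up as in the
     * previous sketch. */
    long harvest_some(io_context_t ctx)
    {
        struct io_event events[32];
        struct timeval t0, t1;
        long got;

        gettimeofday(&t0, NULL);                       /* wall clock before...      */
        got = io_getevents(ctx, 1, 32, events, NULL);  /* "at least one" -- this
                                                          often returns just 1 or 2 */
        gettimeofday(&t1, NULL);                       /* ...and after the wait     */

        /* Dozens of these small harvests per submitted batch is the
         * "snacking for completions" visible in the strace output above. */
        return got;
    }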

What can you do about this? Not much really. I think it would be really cool if Oracle actually implemented (Unbreakable 2.0) some of the stuff they were pressing the Linux 2.5 developers to do. Where do you think the Linux 2.5 Summit got that list from?

What? No Mention of ASM?
As usual, you should expect a tie-in with ASM. Folks, unless you are using ASMLib, ASM I/Os are fired out of the normal OSDs (Operating System Dependent code), which means libaio on RHEL4. What I’m telling you is that ASM is storage management, not an I/O library. So, if a db file sequential read or a DBWR write is headed for an offset in an ASM raw partition, it will be done with the same system call as it would be if it were on a CFS or NAS.

Oracle Shared Latches, CAS, Porting. No mention of ASM?

I participate in the Oracle-L mailing list managed by fellow OakTable Network member Steve Adams. There are a lot of good folks over there. I want to hijack a thread from that list and weave it into a post I’ve wanted to do about Oracle porting.

Oracle Port-level Optimizations
It is a little known fact that major performance ramifications are the result of port-level implementation decisions. Oracle maintains a “bridge layer” of code between the Oracle Kernel and the lower level routines that interface with the Operating System. This layer is called the Virtual Operating System layer, or VOS. Oracle was brilliant for implementing this layer of the server. Without it, there would be routines at widely varying levels of the Oracle Kernel interfacing with the Operating System—chaos! Under the VOS is where the nitty-gritty happens.

So, what does this have to do with the thread on Oracle-L?

There was a post where a list member stated:

I saw quite a few cas latch waits today, Oracle 9.2.0.4 on HP-UX 11.11 PA-RiSC. Do these CPUs support CAS instructions?

This post sparked a series of follow-ups about Oracle’s usage of shared latches. For the non-Oracle minded reader, the term shared latches means reader/writer locks. You see, for many years, all critical sections in Oracle were protected by complete mutual exclusion—a real waste for mostly-read objects. The post is referring to an optimization where shared latches are implemented using Compare and Swap primitives. Whether or not your port of Oracle has this optimization is a decision made at the port level—either the OS supports it or it doesn’t. If it doesn’t, there are tough choices to make at the porting level. But the topic is bigger than that. When Oracle uses generic terms like “CAS” and “Scattered Read”, a lot is lost in translation. That is, when the VOS calls an underlying “Scattered Read”, is it a simulation? Is it really a single system call that takes a multiblock disk read and populates the SGA buffers with DMA? Or is it more like the age old Berkeley readv(2) which actually just looped with singleton reads in library context? On the other hand, when Oracle executes CAS code, is it really getting a CAS instruction or a set of atomic instructions (with looping)? The latter is generally the case.

Another installment on that particular Oracle-L thread took it to the port level:

Correct, HP doesn’t do CAS. There are some shared read latch operations that Oracle therefore implements through the CAS latch…

Right, HP, or more precisely the PA-RISC instruction set does not have a Compare and Swap instruction—it seems HP took the “reduced” in reduced instruction set computing to the extreme! However, neither does PowerPC or Alpha for that matter. In fact, neither does x86 or IA64. Oh hold it; I’m playing a word game. Both x86 and IA64 do have CAS, but it is called cmpxchg (compare and exchange). But honestly, PowerPC and Alpha do not. So how do these platforms offer CAS?

The porting teams for these various platforms have to make decisions. In the case of offering CAS to the VOS layer, they either have to construct a CAS using an atomic block of instructions or punt and use the reference code which is a spinlock. In the latter case, the wait event CAS latch can pop up. You see, a CAS latch can wait, whereas a real CAS will only stall the CPU. That is, if the port implements a CAS latch where other ports go with a CAS atomic set or single instruction, the former can sleep on a miss and the latter cannot. The processor is going to do that CAS, and nothing else. A contended memory location being hammered by CAS will stall processors, because once the CPU enters that block of (or singleton) instruction(s), it stays there until the work is done. I’m not talking about pathology, just implementation subtleties. So, what does CAS really look like?

Sparc64, x86, IA64, S/370, and get this, Motorola 68020 all offer a CAS instruction. There are others for sure. On the other hand, PPC and Alpha require an atomic set of instructions built off of LL/SC (Load-Link/Store-Conditional), which on Power is the famed “lorks and storks” (ldarx/stwcx) and on Alpha their ldx_l/stx_c. Finally, what about PA-RISC? Well, you can’t do Oracle on any CPU that totally lacks atomic instructions. In the case of PA-RISC there is a Load and Clear Word (ldcw) instruction.

The point is that the VOS can be given a CAS of one sort or the other, but not all architectures handle the contention that CAS can cause. For whatever reason, it seems the HP porting group punted on 9i and went with latches where other ports use a real, or constructed, CAS. Be aware that just following the masses and implementing a CAS atomic set is not always the right answer. These pieces of code can do really weird things when the words being modified by the CAS straddle a cacheline and other such issues. Hmmm, trivial pursuit?
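As a concrete illustration of what "shared latches built on CAS" means, here is a toy reader/writer latch in C built on the GCC __sync_val_compare_and_swap builtin, which the compiler lowers to cmpxchg, an ldarx/stwcx loop, and so on depending on the architecture. This is a conceptual sketch only and has no relationship to Oracle's actual latch structures.

    #include <sched.h>
    #include <stdint.h>

    #define X_BIT 0x80000000u           /* high bit = exclusive holder present */

    typedef volatile uint32_t slatch_t; /* low bits = shared (reader) count    */

    void slatch_get_shared(slatch_t *l)
    {
        for (;;) {
            uint32_t old = *l;
            if (!(old & X_BIT) &&
                __sync_val_compare_and_swap(l, old, old + 1) == old)
                return;                 /* CAS won: reader count bumped        */
            sched_yield();              /* lost the race or a writer is in     */
        }
    }

    void slatch_get_exclusive(slatch_t *l)
    {
        for (;;) {
            if (__sync_val_compare_and_swap(l, 0, X_BIT) == 0)
                return;                 /* latch was free, now held exclusive  */
            sched_yield();
        }
    }

    void slatch_release_shared(slatch_t *l)    { __sync_fetch_and_sub(l, 1); }
    void slatch_release_exclusive(slatch_t *l) { __sync_and_and_fetch(l, ~X_BIT); }

    int main(void)
    {
        slatch_t l = 0;

        slatch_get_shared(&l);          /* many readers could be here at once  */
        slatch_release_shared(&l);
        slatch_get_exclusive(&l);       /* a lone writer excludes everyone     */
        slatch_release_exclusive(&l);
        return 0;
    }

The readers only ever bump a count with CAS; the writer needs the whole word to itself. That is the essence of a shared latch: mostly-read structures pay only the cost of the atomic increment.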

How Subtle are These Subtleties?

They can be really really NOT subtle!

I was on a team of folks that implemented an atomic set CAS for Oracle’s System Change Number in Oracle8, which is actually a multi-word structure with the SCN word and another word representing the wrap value. The SCN value always increases, albeit not serially. It just gets “larger”. We were able to pull the latch that protected the incrementing of these values and replace it with a small block of atomic assembly that incremented it without any locks. I doubt we were the first to do that. The result was a 25% performance increase in TPC-C. Why? The SCN used to be a really big problem. Back then, propeller-heads like me used to collect bus traces on workloads like TPC-C and map the physical addresses back to SGA addresses. It so happened that in older versions of Oracle, 27% of all addresses referenced on the bus (on a 64 CPU system) were references to the cacheline that held the SCN structure! Granted, that was for the sum of load, store and coherency ops (e.g., invalidate, cache to cache transfers).
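For illustration only, the shape of that optimization is something like the sketch below: treat the two-word base/wrap pair as a single 64-bit value and advance it with a compare-and-swap loop instead of taking a latch. It is written with a GCC builtin rather than the hand-coded assembly the port actually used, and the layout is a toy, not Oracle's SCN structure.

    #include <stdint.h>

    /* Toy two-word SCN: one word is the base value, the other the wrap. */
    typedef union {
        struct { uint32_t base; uint32_t wrap; } scn;
        uint64_t raw;
    } scn_t;

    /* Advance the SCN without any lock: read, compute, CAS, retry on a miss. */
    uint64_t scn_advance(volatile uint64_t *scn_raw)
    {
        for (;;) {
            scn_t old, new_scn;

            old.raw = *scn_raw;
            new_scn = old;
            if (++new_scn.scn.base == 0)    /* base overflowed: bump the wrap  */
                ++new_scn.scn.wrap;

            /* If nobody beat us to it, install the new value and return it.   */
            if (__sync_val_compare_and_swap(scn_raw, old.raw, new_scn.raw) == old.raw)
                return new_scn.raw;
        }
    }

    int main(void)
    {
        static volatile uint64_t scn_raw;   /* shared SCN, both words zero */

        scn_advance(&scn_raw);
        return 0;
    }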

ASM is “not really an optional extra” With BIGFILE Tablespaces

I was reading Chris Foot’s (very good) blog entry about BIGFILE Tablespaces when I choked on the following text (mind you, Foot is quoting someone else):

…ASM is not only recommended, but a requirement because of inode issues: “If you create a truly large bigfile tablespace on a traditional file system, you will suffer from horrendous inode locking issues. That is why ASM is not really an optional extra with these things.”

This is a great topic, as I too feel BIGFILE tablespaces are a great feature. I have to point out, though, that the bit about inode locking is a red herring.

Folks, inode locks are only an issue when file metadata changes. Fact is, when you use file system files for Oracle, metadata doesn’t change; the file contents change. If you deploy on a file system that supports direct I/O, one benefit you get is elimination of atime/mtime updates. These are the only metadata that would get changed on every file access—doom! With mtime/atime updates removed from the direct I/O codepath, the only metadata changes left are structural changes to the file—again, not the file contents. That is, if you create, remove, truncate or extend a file (even with direct I/O), then, yes of course, inode locks must be held in an elevated mode. In the case of a cluster file system, that extends to a cluster-inode lock. Now, if your cluster file system happens to take some rudimentary central lock approach, as opposed to a symmetric/distributed (DLM) approach, then there are issues at that level—but only when the file changes (again, not the file contents).
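For anyone who wants to see the codepath in question, here is a minimal sketch in C of rewriting a block in place through direct I/O. The file name is made up; the point is simply that an in-place block write touches file contents, not file structure, so there is no inode change for anyone to lock against.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLKSZ 8192

    int main(void)
    {
        void *buf;
        /* Open an existing datafile for direct I/O (made-up name). */
        int fd = open("users01.dbf", O_RDWR | O_DIRECT);

        if (fd < 0)
            return 1;

        /* O_DIRECT requires the buffer, offset and length to be aligned. */
        if (posix_memalign(&buf, 4096, BLKSZ))
            return 1;
        memset(buf, 0, BLKSZ);

        /* Rewrite one 8KB block in place: file contents change, but the
         * file's structure (size, block layout) does not.                */
        pwrite(fd, buf, BLKSZ, 10 * BLKSZ);

        free(buf);
        close(fd);
        return 0;
    }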

The point being that unless you are extending a BIGFILE tablespace on a freakishly frequent basis, the inode thing is a red herring.

I really do hate it when non-issues like this are used to prop a technology choice such as ASM.




Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.