In the comment thread of my latest installment in the “Manly Man” series, a reader posted a humorous comment that included a serious question:
[…] what are your thoughts about x86 vs. x86-64 Linux, in relation to Oracle and RAC? I’d appreciate a blog entry if you could.
Tim Hall followed suit:
I would have thought it was obvious. The number 64 is twice the size of 32, so it must be twice as good, making you twice as manly!
Yes, when it comes to bitness, some is good so more must be better, right? Tim then continued to point out the political incorrectness of the term Manly Man by suggesting the following:
PS. I think politically correct names for this series of blog entries are:
Personally Persons only deploy…
Humanly Humans only deploy…
Before I actually touch the 32 versus 64-bit question from the thread, I’ll submit the following Manly Man entertainment:
If you have enjoyed the Manly Man series at all, or simply need a little background, you must see this YouTube Video about Manly Men and Irish Spring.
32 or 64 Bit Linux
When people talk about 32 versus 64 bit Oracle with Linux they are actually talking about 3 topics:
- Running 32 bit Oracle on 32 bit Linux with native 32 bit hardware (e.g., Pentium IV (Willamette), Xeon MP (Foster MP)).
- Running 32 bit Oracle on 32 bit Linux with x86_64 hardware.
- Running 64 bit Oracle on 64 bit Linux with x86_64 hardware.
The oddball combination folks don’t talk about is 32 bit Oracle running on 64 bit Linux, because Oracle doesn’t support it. I have tested that combination, however, by installing Oracle on a 32 bit server and then NFS mounting that ORACLE_HOME over on a 64 bit Linux system. Since it is unsupported, though, discussing it further is moot.
The most interesting comparison would be between 1 and 2 above, provided both systems have precisely the same core clock speed, L2 cache and I/O subsystem. As such, the comparison would come down to how well the 32-bit optimized code is treated by the larger cache line size. There would, of course, be other factors, since an EM64T processor has several million more transistors than a Pentium IV along with other fundamental improvements. I have often wished I could make that comparison. The workload of choice would be one that “fits” in a 32 bit environment (e.g., 1GB SGA, 1GB total PGA) and therefore doesn’t necessarily benefit from 64 bitness.
If anyone were to ask me, I’d say go with 3 above. Oracle on x86_64 Linux is not new.
In my recent blog entry about old software configurations in an Oracle over NFS situation, I took my well-deserved swipes at pre-2.6 Linux kernels. Perhaps the most frightening one in my experience was 32-bit RHEL 3. That whole 4/4 split kernel thing was a nightmare—unless you like systems that routinely lock up. But all told I was taking swipes at pre-2.6 kernels without regard for bitness. So long as the 2.6 kernel is on the table, the question of bitness is not necessarily so cut and dried.
In my recent blog entry about Tim Hall’s excellent step-by-step guide for RAC on NFS, a reader shared a very interesting situation he has gone through:
I have a Manly Man question for you. This Manly Man Wanna Be (MMWB) runs a 2-node 10g RAC on Dell 2850s with 2 dual-core Xeon CPUs (total of 4 CPUs). Each server has 16 GB of memory. While MMWB was installing this last year, he struggled mightily with 64-bit RAC on 64-bit Red Hat Linux 4.0. MMWB finally got it working after learning a lot of things about RPMs and such.
However, Boss Of Manly Man Wanna Be (BOMMWB) was nervous about 64-bit being “new,” and all of the difficulties that MMWB had with it, so we reinstalled with 32-bit RAC running on 32-bit Red Hat Linux 4.0.
My naturally petulant reaction would have been to focus on the comment about 64-bit being “new.” I’m glad I didn’t fire off. This topic deserves better treatment.
While I disagree with Boss of Manly Man’s assertion that 64-bit is “new”, I can’t take arms against the fact that this site measured different levels of pain when installing the same release of Oracle on the same release of RHEL4, varying only the bitness. It is unfortunate that this site has committed itself to a 32 bit database based solely upon its experiences during the installation. Yes, the x86_64 install of 10gR2 requires a bit more massaging of the platform vis-à-vis RPMs. In fact, I made a blog entry about 32 bit libraries required on 64 bit RHEL4. While there may occasionally be more headaches during an x86_64 install than an x86 install, I would not deploy Oracle on a 32 bit operating system today unless there was a gun held to my head. All is not lost for this site, however. The database they created with 32 bit Oracle is perfectly usable in place with 64 bit Oracle after a simple dictionary upgrade procedure documented in the Metalink note entitled Changing between 32-bit and 64-bit Word Sizes (ML62290.1).
Has Anyone Ever Tested This Stuff?
I have…a lot! But I really doubt we are talking about running 32 bit Oracle on 32 bit hardware. Nobody even makes a native 32 bit x86 server these days (that I know of). I think the question at hand is more about 32 bit Oracle on 32 bit Linux with x86_64 hardware.
There has always been the big question about what 64 bit software performance is like when the workload possesses no characteristics that would naturally benefit from the larger address space. For instance, what about a small number of users attached to a 1GB SGA with a total PGA footprint of no more than 1GB? That’s a workload that doesn’t need 64 bit. Moreover, what if the comparison is between 32 bit and 64 bit software running on the same server (e.g., AMD Opteron)? In this case, the question gets more interesting. After all, the processor caches are the same, the memory-to-processor bandwidth is constant, and the drivers can all DMA just fine. So does bitness still matter? The answer is an emphatic yes! But yes, what? Yes, there are occasions where 64 bit code will dramatically outperform 32 bit code on dual-personality 64 bit processors (e.g., AMD Opteron). It is all about porting. Let me explain.
The problem with running sophisticated 32 bit code on 64 bit processors is that the 32 bit code was most likely ported with a different processor cache line size in mind. This is important:
Native 32 bit x86 processors use a 32 byte cache line size whereas 64 bit processors (e.g., AMD64, EM64T) use a 64 byte cache line.
That means, in the case of a native 32 bit processor, load/store and coherency operations are performed on a swath of 32 bytes. Yes, there were exceptions, like the Sequent NUMA-Q 2000, which had two different cache line sizes, but that was a prior life for me. Understanding cache line size and how it affects coherency operations is key to database throughput. And unlike Microsoft, which never had to do the hard work of porting (IA64 notwithstanding), Oracle pays very close attention to this topic. In the case of x86 Linux Oracle, the porting teams presumed the code was going to run on native 32 bit processors, a reasonable presumption.
What Aspects of Bitness Really Matter?
The area of the server that this topic impacts the most (by far) is latching. Sure, you use the database server to manage your data, and the accesses to your data seem quite frequent to you (thousands of accesses per second), but that pales in comparison to the rate at which system memory is accessed for latches. These operations occur on the order of millions of times per second. Moreover, accesses to latches are write-intensive and most painfully contended across multiple CPUs, which results in a tremendous amount of bandwidth used for cache coherency. Spinlocks (latches) require attention to detail, period. Just sum up the misses and gets on all the latches during a processor-saturated workload sometime and you’ll see what I mean. What’s this have to do with 32 versus 64 bit?
It’s All About The Port
At porting time, Oracle pays close attention to ensuring that latch structures fit within cache lines in a manner that eliminates false sharing. Remember, processors don’t really read or write single words. They read/write or invalidate the entire line that a given word resides in, at least when processor-to-memory operations occur. Imagine, therefore, a latch structure that is, say, 120 bytes long and that the actual latch word is the first element of the structure. Next, imagine that there are only 2 latches in our imaginary server and we are running on a 32 bit OS on a native 32 bit system such as Pentium IV or Xeon MP (Foster 32 bit), and therefore a 32 byte cache line size. We allocate and initialize our 2 structures at instance startup. These structures will lay out in 240 bytes within a single memory page. Since we were dutiful enough to align our two structures on a page boundary, the first structure resides in the first 120 bytes of the memory page, spilling into the 4th 32 byte cache line. But wait, there are 8 unused bytes left in that 4th cache line. Doesn’t that mean the first 8 bytes of the second latch structure are going to share space in the 4th cache line? Not if you are good at porting. And in our example, we are.
That’s right, we were aware of our cache line size (32 bytes), so we padded the structure by allocating an array of unsigned integers (4 bytes each) two deep as the last element of our structure. Now our latch structure is precisely 128 bytes, or exactly 4 lines. Finally, we have our imaginary 32 bit code optimized for a real 32 bit system (and therefore a 32 byte cache line size). That is, we have optimized our 32 bit software for our presumed platform, which is 32 bit hardware. Now, if half of the CPUs in the box are hammering the first latch, there is no false sharing with the second. What’s this got to do with hardware bitness? The answer is in the fact that Oracle ports the x86 Linux release with a 32 bit system in mind.
Running 32 bit Code on a 64 bit CPU
The devil is in the details. Thus far our imaginary 2-latch system is optimized for hardware that operates on a 32 byte line. Since our structures fit within four 32 byte lines, or two 64 byte lines should we execute on an x86_64 system, there would be no false sharing, so we must also be safe for a system with a 64 byte line, no? Well, true, there will be no false sharing between the two structures, since they now occupy two 64 byte lines as opposed to four 32 byte lines, but there is more to it.
Do you think it’s possible that the actual latch word in the structure might be adjacent (same 32 bytes) to anything interesting? Remember, heavily contended latches are constantly being tried for by processes on other CPUs. So if the holder of the latch writes on any other word in the cache line that holds the latch word, the processor coherency subsystem invalidates that line. To the other CPUs with processes spinning on the latch, this invalidation “looks” like the latch itself has been freed (changed from on to off) when in fact the latch is still held but an adjacent word in the same line was modified. This sort of madness absolutely thrashes a system. So, the dutiful port engineer rearranges the elements of the latch structure so that nothing else that ever gets written resides in the same cache line as the actual latch word. But remember, we ported to a 32 bit system with a 32 byte line. If you run this code on a 64 bit system, and therefore with 64 byte lines, all of your effort to isolate the latch word from other write-mostly words was for naught. That is, if the cache line is now 64 bytes, any write by the latch holder in the first 64 bytes of the structure will cause invalidations (cache thrashing) for other processes trying to acquire the latch (spinning) on other CPUs. This isn’t a false sharing issue between 2 structures, but it has about the same effect.
Difficult Choices – 32 bit Software Optimized for 64 bit Hardware.
What if the porting engineer of our imaginary 2-latch system were to somehow know that the majority of 32 bit Linux servers would someday end up being 64 bit servers compatible with 32 bit software? Well, then he’d surely pad out the structure so that there are no frequently written words in the same 64 bytes in which the latch word resides. If the latch structure we have to begin with is 120 bytes, odds are quite slim that the percentage of read-mostly words will facilitate our need to pack the first 64 bytes with read-mostly objects alongside the latch word. It’s a latch, folks, it is not a read-mostly object! So what to do? Vapor!
Let’s say our 120 byte latch structure is a simple set of 30 words, each being 4 bytes (remember, we are dealing with a 32 bit port here). Let’s say further that there are only 4 read-mostly words in the bunch. In our imaginary 2-latch example, we’d have to set up the structure so that the first word is the latch, and the next 16 bytes are the 4 read-mostly elements. Now we have 20 bytes that need protection. To optimize this 32 bit code for a 64 bit processor, we’ll have to pad out to 64 bytes with junk. So we’ll put an array of unsigned integers 11 deep (44 bytes) immediately after the latch word and our 4 read-mostly words. That fits nicely in 64 bytes, at the cost of wasting 44 bytes of processor cache for every single latch that comes through our processor caches. Think cache buffers chains, folks! We aren’t done though.
We started with 120 bytes (30 4-byte words) and have placed only 5 of those words into their own cache line. We have 25 words, or 100 bytes, left to deal with. Remember, we are the poor porting engineer doing an imaginary 32 bit software port optimized for 64 bit servers, since nobody makes 32 bit servers any more. So, we’ll let the first 64 bytes of the remaining 100 fall into their own line. That leaves 36 bytes that we’ll also have to pad out to 64 bytes; there goes another 28 bytes of vapor. All told, we started with 120 bytes and wound up allocating 192 bytes so that our code will perform optimally on a processor that uses a 64 byte cache line. That’s a 60% increase in the footprint we leave on our processor caches, which aren’t very large to start with. That’s how Oracle would have to optimize 32 bit Oracle for a 64 bit processor (x86 code on x86_64 kit). But they don’t, because that would have been crazy. After all, 32 bit Oracle was intended to run on 32 bit hardware.
Does porting sound easy? Let me throw this one in there. It just so happens that the Oracle latch structure was in fact 120 bytes in Oracle8i on certain ports. Oh, and lest I forget, remember that Oracle keeps track of latch misses. What’s that got to do with this? It means processes that do not hold the latch increment counters in the latch structure when they miss. Imagine having one of those miss-count words in the same line as the latch word itself!
This is tricky stuff.
Who Uses 32 bit Linux for Oracle These Days?
Finally, bugler, sound Taps.
A thread on oracle-l the other day got me thinking. The thread was about the difficulties being endured at a particular Linux RAC site that prompted the DBA there to audit what RPMs he has on the system. It appears as though everything installed was a revision “as high or higher” based on Oracle’s documented requirements. In his request for information from the list, I noticed uname(1) output that suggested he is using a 32 bit RHEL 4 system.
One place I always check for configuration information is Oracle’s Validated Configurations web page. This page covers Linux recipes for installation success. I just looked there to see if there was any help I could offer that DBA and found that there are no 32 bit validated configurations!
I know there is a lot of 32 bit x86 hardware out there, but I doubt it is even possible to buy one today. Except for training or testing purposes I just can’t muster a reason to even use 32 bit Linux servers for the database tier at this point and to be honest, running a 32 bit port of Oracle on an x86_64 processor makes very little sense to me as well.