
Isilon Leads in Clustered Storage–Without Support for Oracle

One of my favorite blogs, StorageMojo.com, is covering Isilon and Clustered Storage here. The debate is heating up because some folks think NetApp OnTap GX is clustered storage. I have blogged before about how a clustered namespace is not clustered storage, for example in:

NetApp’s OnTap GX for Oracle. Clustered Name Space.

And other articles here:

FS, CFS, NFS, ASM Topics

Remember, this is an Oracle blog and Isilon is indeed clustered storage, but they can’t do Oracle. And while OnTap GX can do Oracle, it is not symmetric clustered storage so it won’t scale.

Scalable NFS Powered By Open Source Cluster Filesystems

40 Terabytes Per Week With Linux-based Clusters at Dunnhumby
It seems reasonable to think that this company tested the open source clustering stuff, but I don’t know for certain. There are folks out there using Open Source cluster filesystems for “large I/O” processing as is apparent in this recent OCFS2 bug report (emphasis added by me):

During maintenance window, decided to use the OCFS2 filesystem to store a large backup file (about 5-10 gig file). SCP’ed the file from an outside server to node1 of the cluster […]

A little third-party perspective is necessary. Not even back in 1990, with Fujitsu Swallow IV drives, was 10GB considered “large.” The OCFS2 user that filed the bug continued:

After a few minutes, node1 crashed.

Let’s think about that for a moment. The user is bringing unstructured data into the OCFS2 cluster filesystem using scp(1). Just for the heck of it, let’s take the user at his word and do the math. He said, “After a few minutes.” Let’s say a few minutes means 3, or 180 seconds. That means the scp(1) was likely not trafficked over Gigabit Ethernet, because 180 seconds is enough time to move roughly 20GB at full bandwidth on a single wire. That pretty much leaves 100BaseT. So, somewhere around 2GB or so, OCFS2 crumbled. Hmmm, lowered expectations.
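Back-of-the-envelope, in case anyone wants to check the math (the wire speeds are the usual theoretical maximums, nothing from the bug report):

$ awk 'BEGIN {
        secs = 180
        printf "GbE (~125 MB/s): ~%.1f GB in %d seconds\n", 125 * secs / 1024, secs
        printf "100BaseT (~12.5 MB/s): ~%.1f GB in %d seconds\n", 12.5 * secs / 1024, secs
}'
GbE (~125 MB/s): ~22.0 GB in 180 seconds
100BaseT (~12.5 MB/s): ~2.2 GB in 180 seconds

And the fun continued: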

Node1 restarted, but crashed again attempting to reenter the cluster.
Leaving Node1 down, attempted reboot of Node2 and Node3.
Both panic crashed during restart attempting to start OCFS2 and join the cluster.
Eventually, found that we had to start Node1 first, then restart the other two nodes.

Good grief, I’m not even going to comment on that bit, but I will point out that the suggested workaround of using the O_DIRECT-enabled coreutils seems off the mark. The user is trying to scp(1), not cp(1) or mv(1).

If It Isn’t Free, It’s Junk. Ad Revenue Funds Robust Software Development.
In spite of the fact that Ray Lane says traditional software products will soon be replaced by cobbled-together bits and pieces of open source software, or what Wharton refers to as “ad-supported software”, sometimes the good things in life are not free.

Huge Amounts of Unstructured Data
A recent article in Information Week’s Optimize Magazine covered one of PolyServe’s customers, Dunnhumby. These folks manipulate a lot of data using HP Blades as compute nodes accessing data over NFS in a PolyServe File Serving Utility scalable NAS solution. In their own words:

Each week, more than 40 terabytes of data is generated […]

“Hold it,” you say, “that’s a comparison of OCFS2 to PolyServe CFS via NFS. What does OCFS2 have to do with NFS?” That is a good question. OCFS2 is proclaimed to be a general-purpose filesystem (emphasis added by me):

WHAT IS OCFS2?

OCFS2 is the next generation of the Oracle Cluster File System for Linux. It is an extent based, POSIX compliant file system. Unlike the previous release (OCFS), OCFS2 is a general-purpose file system

So why not export OCFS2 filesystems via NFS? That is the sort of thing you do with a general purpose filesystem after all. And, since OCFS2 is a cluster filesystem there shouldn’t be any second thoughts about exporting the same filesystems from multiple nodes—that’s scalable file serving. In fact, that has been tried before. That URL points to a bug report where a user was trying to implement scalable file serving using OCFS2. He reports:

I’m using OCSF2 for backups and to store files used by nfs clients. We have some errors during three file uploading from remote clients. In that case only one node can access those files but the other node receive from dlm a bad lockres error message […]

Right, OK. So what came next? Read on:

So I tried to stop ocfs2 and o2cb services on the second node but I can’t because heartbeat prevents any stop attempt. A stop attempt on the first node instead hungs and I have to reboot the first node because it is impossible to unmount ocfs2 filesystems (even if I use the lazy option).

I’m sure it couldn’t get any worse, right? He continued:

That is a serious problem because to recover the right functionality I had to reboot the first node (o2cb/ocfs2 services hang and after reboot ASM losts spfiles, so problem impacts even the databases running on cluster). There is any kind of action I can do to avoid that?

Surely he must be doing something really convoluted to hit problems so easily! He explains the scenario:

The scenario is:
node X exports filesystem to host Y
node W exports filesystem to host Z

from Y I create a file then I delete it then ls command on Z lists the file but I cannot open it. I receive I lot of messages like this:

Oct 20 08:53:34 proxb31 kernel: (15612,1):ocfs2_populate_inode:234 ERROR:
Invalid dinode: i_ino=9977187, i_blkno=9977187, signature = INODE01, flags = 0x0
Oct 20 08:53:34 proxb31 kernel: (15612,1):ocfs2_read_locked_inode:389 ERROR:
populate inode failed! i_blkno=9977187, i_ino=9977187

Good grief! Cache coherency problems? You mean like this warning about OCFS cache coherency:

Reasons for using odirect cp:

1. Buffered and direct ios are still racy in the kernel. As Oracle is doing directio, doing a normal cp exposes one to the chance of copying a stale page data.

2. Direct ios are less stressful on the page cache. As Oracle datafiles are invariably large, directio is more efficient in the long run.

3. In a clustered environment, the blocks on disk could be updated by any nodes in the cluster. Using odirect io ensures the latest version of the block is always read.

Oh boy. Anyway, back to the bug report. It states that as of January 4, 2007, there is a patch for the NFS-exported OCFS2 problem being tested at Oracle; however, the following comment was given to help set expectations:

One thing I’m concerned with is having two clients connect to seperate nodes. Since NFSD is not cluster aware, there may be some issues with unlinked inodes being in cache on one node and looked up on another. Is it possible to confine your nfs exports to a single node for now, until we can get a better handle on that particular issue.

That seems like something that should have been spelled out in the Product Requirements Document, but I’m old-fashioned.

Scalable File Serving with Linux. Who Needs a Cluster-Aware NFSD?
The NAS heads in a PolyServe File Serving Utility configuration (e.g., HP EFS Clustered Gateway) run the enterprise distributions: RHEL4 and SuSE SLES9. So while those folks in Ray Lane’s and Wharton’s open source dream world might think that NFSD cannot function in a cluster with data consistency, PolyServe—with that dying traditional software model—seems to have pulled it off. Do you think Dunnhumby pushes 40TB of data per week through a PolyServe File Serving Utility cluster without NFSD scalability or—more importantly—cache coherency? Not a chance.




Learn Danish Before You Learn About NUMA

I can’t speak Danish, but I have the next best thing—a Danish friend that speaks English. The Danish arm of Computer Reseller News has a video of Mogens Norgaard (founder of the OakTable Network of which I am glad to be a member). I have no idea whatsoever about what he is discussing, but since the video starts out with him pouring a beer I’m sure I’m missing out on something. No, hold it, I did get something. Featured prominently behind him is a well-used copy of my friend James Morle’s book Scaling Oracle8i.

By the way, if you want to be an Oracle expert, that book should be considered mandatory reading. I don’t care that it is based on Oracle8i; it is still rich with correct information. Also, if you are following my series on NUMA/Oracle, I particularly recommend section 8.1.2, which I contributed to the book. It covers the original NUMA port of Oracle—Sequent. Of particular interest should be the section on one of my only claims to fame: Quad-Local Buffer Preference.

I can’t recall, but perhaps that was the topic James and I were discussing in this photo Alex Gorbachev took at one of our pub stops during UKOUG 2006. Or, maybe we (James and I to the right in the photo) were discussing the guys to our right (Mogens and Thomas Presslie) who were wearing skirts—ur, uh, I mean kilts! I do recall that 5AM came early that morning. Not the best way to start my trip home.

Real Application Clusters: The Shared Database Architecture for Loosely-Coupled Clusters

The typical Real Application Clusters (RAC) deployment is a true enigma. Sometimes I just scratch my head because I don’t get it. I’ve got this to say: if you think Shared Nothing Architecture is the way to go, then deploy it. But this is an Oracle blog, so let’s talk about RAC.

RAC is a shared disk architecture, just like DB2 on IBM mainframes. It is a great architecture, one I agree with, as my working for shared-data clustering companies all these years attests. Again, since this is an Oracle blog, I think arguments about shared disk versus shared nothing are irrelevant.

Dissociative Identity Disorder
The reason I’m blogging this topic is because in my opinion the typical RAC deployment exhibits the characteristics of a person suffering from Dissociative Identity Disorder. Mind you, I’m discussing the architecture of the deployment, not the people that did the deployment. That is, we spend tremendous amounts of money for shared disk database architecture and then throw it into a completely shared nothing cluster. How much sense does that make? What areas of operations does that paradigm affect? Why does Oracle promote shared disk database deployments on shared-nothing clusters? What is the cause of this Dissociative Identity Disorder? The answer: the lack of a general purpose shared disk filesystem that is suited to Oracle database I/O that works on all Unix derivations and Linux. But wait, what about NFS?

Shared “Everything Else”
I can’t figure out any other way to label the principle I’m discussing, so I’ll just call it “Shared Everything Else”. However, the term Shared Everything Else (SEE for short) insinuates that the content in question is less important—an insinuation that could not be further from the truth. What do I mean? Well, consider the Oracle database software itself. How do you suppose an Oracle RAC (shared disk architecture) database can exist without having the product installed somewhere?

The product install directory for the database is called Oracle Home. Oracle has supported the concept of a shared Oracle Home since the initial release of RAC—even with Oracle9i. Yes, Metalink note 240963.1 describes the requirement for Oracle9i to have context dependent symbolic links (CDSL), but that was Oracle9i. Oracle10g requires no context dependent symbolic links. Oracle Universal Installer will install a functional shared Oracle Home without any such requirement.
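To make SEE concrete, here is a minimal sketch of a shared Oracle Home at the filesystem level. The NAS server name, export and mount options are assumptions for illustration only; any cluster filesystem or scalable NFS offering fits the same pattern:

# /etc/fstab on every RAC node (hypothetical server and export):
nas1:/export/orahome  /u01/app/oracle/product/10.2.0/db_1  nfs  rw,bg,hard,intr,tcp,vers=3  0 0

Every node mounts the same installation, so there is exactly one copy of the software to install, patch and audit.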

What if you don’t share a software install? It is very easy to end up with botched or mismatched product installs—which doesn’t sit well with a shared disk database. In a recent post on the oracle-l list, a participant sent the following call for help:

We are trying to install a 2-node RAC with ASM (Oracle 10.2.0.2.0 on Solaris 10) and getting the error below when using dbca to create the database. The error occurs when dbca is done creating the DB (100%). Any suggestions?

We have tried starting atlprd2 instance manually and get the error below regarding an issue with spfile which is on ASM.

ORA-01565: error in identifying file ‘+SYS_DG/atlprd/spfileatlprd.ora’
ORA-17503: ksfdopn:2 Failed to open file +SYS_DG/atlprd/spfileatlprd.ora
ORA-03113: end-of-file on communication channel

OK, for those who are not Oracle-minded, this sort of deployment is what I call the Dissociative Identity Disorder, since the database will be deployed on a bunch of LUNs provisioned, masked and accessed as RAW disk from the OS side—ASM is a collection of RAW disks. This is clearly not a SEE deployment. The original poster followed up with a status of the investigatory work he had to do to try and get around this problem:

[…] we have checked permissions and they are the same. We also checked and the same disk groups are mounted in both ASM instances also. We have also tried shutting everything down (including reboot of both servers) and starting everything from scratch (nodeapps, asm, listeners, instances), but the second node won’t start. Keep getting the same error […]

What a joy. Deploying a shared disk database in a shared nothing cluster! There he was on each server checking file permissions (I just counted, there are 20,514 files in one of my Oracle10g Oracle Homes), investigating the RAW disk aspects of ASM, rebooting servers and so on. Good thing this is only a 2 node cluster. What if it was an 8 node cluster? What if he had 10 different clusters?

As usual, the oracle-l support channel comes through. Another list participant posted the following:

Seem to be a known issue (Metalink Note 390591.1). We encountered similar issue in Linux RAC cluster and has been resoled by following this note.

The cause was included in his post (emphasis added by me):

Cause

Installing the 10.2.0.2 patchset in a RAC installation on any Unix platform does not correctly update the libknlopt.a file on all nodes. The local node where the installer is run does update libknlopt.a but remote nodes do not get the updated file. This can lead to dumps or internal errors on the remote nodes if Oracle is subsequently relinked.

That was the good and bad, now the ugly—his post continues with the following excerpt from the Oracle Metalink note:

There are two solutions for this problem:

1) Manual copy of the “libknlopt.a” library to the offending nodes:

-ensure all instances are shut down
-manually copy $ORACLE_HOME/rdbms/lib/libknlopt.a from the local node to all remote nodes

-relink Oracle on all nodes :
make -f ins_rdbms.mk ioracle

2) Install the patchset on every node using the “-local” option:
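Scripting solution 1 is easy enough. A minimal sketch, assuming two remote nodes named node2 and node3 and all instances shut down (the node names are mine, not the note’s):

$ # run on the node where the patchset was installed
$ for n in node2 node3
> do
>   scp $ORACLE_HOME/rdbms/lib/libknlopt.a ${n}:$ORACLE_HOME/rdbms/lib/
> done
$ # then relink on every node:
$ cd $ORACLE_HOME/rdbms/lib && make -f ins_rdbms.mk ioracle

Of course, with a shared Oracle Home there would be exactly one libknlopt.a and nothing to copy—which is rather the point of this post.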

What’s So Bad About Shared Nothing Clusters?
I’m not going to get into that, but one of the central knocks Oracle uses against shared-nothing database architecture is the fact that replication is required. Since the software used to run RAC needs to be kept in lock-step, replication is required there as well, and as we see from this oracle-l email thread, replication is not all that simple with a complex software deployment like the Oracle database product. But speaking of complex, the Oracle database software pales in comparison to the Oracle E-Business Suite. How in the world do people manage to deploy E-Biz on anything other than a huge central server? Shared Applications Tier.

Shared Applications Tier
Yes, just like Oracle Home, the huge, complex Oracle E-Business Suite can be installed in a shared fashion as well. It is called a Shared Applications Tier. One of the other blogs I read has been discussing this topic as well, but this is not just a blogosphere topic—it is mainline. Perhaps the best resource for Shared Applications Tier is Metalink note 243880.1, but Metalink notes 384248.1 and 233428.1 should not be overlooked. The long story short is that Oracle supports SEE, but they don’t promote it for who-knows-what-reason.

Is SEE Just About Product Installs?
Absolutely not. Consider intrinsic RAC functionality that doesn’t function at all without a shared filesystem:

  • External Tables with Parallel Query Option
  • UTL_FILE
  • BFILE

I’m sure there are others (perhaps compiled PL/SQL), but who cares. The product is expensive, and if you are using shared disk architecture you should be able to use all the features of shared disk architecture. However, without a shared filesystem, External Tables and the other features listed are not cluster-ready. That is, you can use External Tables, UTL_FILE and BFILE—but only from one node. Isn’t RAC about multi-node scalability?

So Why the Rant?
The Oracle Universal Installer will install a fully functional Oracle10g shared Oracle Home to simplify things, the complex E-Business Suite software is architected for shared install, and there are intrinsic database features that require shared data outside the database. So why deploy a shared database architecture product on a platform that only shares the database? You are going to have to explain it to me like I’m six years old, because I know I’m not going to understand. Oh, yes, and don’t forget that with a shared-nothing platform, all the day-to-day stuff like imp/exp, SQL*Loader, compressed archived redo, logging, trace, scripts, spool and so on means you have to pick a server and go. How symmetric is that? Not as symmetric as the software for which you bought the cluster (RAC), that’s for certain.

Shared Oracle Home is a Single Point of Failure
And so is the SYSTEM tablespace in a RAC database, so what is the point? People who choose to deploy RAC on a platform that doesn’t support shared Oracle Home often say this. Yes, a single shared Oracle Home is a single point of failure, but like I said, so is the SYSTEM tablespace in every RAC database out there. Shops that espouse shared software provisioning (e.g., shared Oracle Home) are not dolts, so the off-the-cuff single point of failure red herring is just that. When we say shared Oracle Home, do we mean a single shared Oracle Home? Well, not necessarily. If you have, say, a 4 or 8 node RAC cluster, why assume that to SEE or not to SEE is a binary choice? It is perfectly reasonable to have 8 nodes share something like 2 Oracle Homes. That is a significant condensing factor and appeases the folks that concentrate on the possible single point of failure aspect of a shared Oracle Home (whilst often ignoring the SYSTEM tablespace single point of failure). A total availability solution requires Data Guard in my opinion, and Data Guard is really good, solid technology.

Choices
All told, NFS is the only filesystem that can be used across all Unix (and Linux) platforms for SEE. However, not all NFS offerings are sufficiently scalable and resilient for SEE. This is why there is a significant technology trend toward clustered storage (e.g., NetApp OnTap GX, PolyServe (HP) EFS Clustered Gateway, etc.).

Finally, does anyone think I’m proposing some sort of mix-match NFS here with a little SAN there sort of ordeal? Well, no, I’m not. Pick a total solution and go with it…either NFS or SAN, the choice is yours, but pick a total platform solution that has shared data to complement the database architecture you’ve chosen. RAC and SEE!

Which Version Supports Oracle Over NFS? Oracle9i? Oracle10g?

Recently, a participant on the oracle-l email list asked the following question:

Per note 359515.1 nfs mounts are supported for datafiles with oracle 10. Does anyone know if the same applies for 9.2 databases?

I’d like to point out a correction. While Metalink note 359515.1 does cover Oracle10g-related information about NFS mount options for various platforms, that does not mean Oracle over NFS is limited to Oracle10g. In fact, that couldn’t be further from the truth. But before I get ahead of myself, I’d like to dive into the port-level aspect of this topic.

There is no single set of NFS mount options that works across all Oracle platforms. In spite of that fact, another participant on the oracle-l list replied to the original query with the following:

try :
rw,bg,vers=3,proto=tcp,hard,intr,rsize=32768,wsize=32768,forcedirectio

OK, the problem is that of the 6 platforms that support Oracle over NFS (Solaris, HP-UX, AIX, Linux x86/x86_64/IA64), the forcedirectio NFS mount option is required only on Solaris and HP-UX. For this reason, I’ll point out that the best references for NFS mount options are Metalink note 359515.1 for Oracle10g and the NAS vendors’ documents for Oracle9i.

Oracle9i
Support for Oracle9i on NFS was a little spottier than for Oracle10g, but it was there. The now defunct Oracle Storage Compatibility Program (OSCP) was very important in ensuring Oracle9i would work with varying NAS offerings. The Oracle server has evolved nicely to handle Oracle over NFS, to such a degree that the OSCP program is no longer even necessary. That means Oracle10g is sufficiently robust to know whether the NFS mount you are feeding it is valid. That aside, the spotty Oracle9i support I allude to was mostly at the port level. That is, from one port to another, Oracle9i may or may not have required patches to operate efficiently and with integrity. One such example is the Oracle9i port to Linux, where Oracle patch 2448994 was necessary so that Oracle would open files on NFS mounts with the O_DIRECT flag of the open(2) call. But, imagine this, it was not that simple. No, you had to have all of the following correct:

  • The proper mount options specified by the NAS vendor
  • A version of the Linux kernel that supported O_DIRECT
  • Oracle patch 2448994
  • The correct setting for the filesystemio_options init.ora parameter

Whew, what a mess. Well, not that bad really. Allow me to explain. Both of the Linux 2.6 enterprise kernels (RHEL4, SuSE 9) support open(2) of NFS files with the O_DIRECT flag. So there is one requirement taken care of—because I assume nobody is using RHAS 2.1. The patch is simple to get from Metalink, and the correct setting of the filesystemio_options parameter is “directIO”. Finally, when it comes to mount options, NAS vendors do pretty well documenting their recommendations. NetApp has an entire website dedicated to the topic of Oracle over NFS. HP OEMs the File Serving Utility for Oracle from PolyServe and documents the mount options in the User Guide as well as in this paper about Oracle on the HP Clustered Gateway NAS.

Oracle10g
I’m not aware of any patches for any Oracle10g port to enable Oracle over NFS. I watch the Linux ports closely and I can state that canned, correct support for NFS is built in. If there were any Oracle10g patches required for NFS I think they’d be listed in Metalink 359515.1 which, at this time, does not specify any. As far as the Linux ports go, you simply mount the NFS filesystems correctly and set the init.ora parameter filesystemio_options=setall and you get both Direct I/O and asynchronous I/O.
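For the Linux ports, the whole recipe fits in a few lines. This is a sketch only: the NAS server name and export are made up, and the authoritative mount options are the ones in Metalink 359515.1 and your NAS vendor’s documentation:

# /etc/fstab (hypothetical server and export):
nas1:/export/oradata  /u02/oradata  nfs  rw,bg,hard,nointr,tcp,vers=3,rsize=32768,wsize=32768,timeo=600  0 0

# init.ora:
filesystemio_options=setall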

With Friends Like That Who Needs Enemies?

James Morle, my longtime friend, OakTable Network Co-founder, and Director of Scale Abilities has noticed I’m blogging about NUMA topics so he thought it was fitting to share a video of another apparent ex-Sequent employee who, like me, seems to have failed a NUMA 12-step program.

Thanks, James. I needed that.

Oracle on Opteron with Linux-The NUMA Angle (Part V). Introducing numactl(8) and SUMA. Is The Oracle x86_64 Linux Port NUMA Aware?

This blog entry is part five in a series. Please visit here for links to the previous installments.

Opteron-Based Servers are NUMA Systems
Or are they? It depends on how you boot them. For instance, I have 2 HP DL585 servers clustered with the PolyServe Database Utility for Oracle RAC. I booted one of the servers as a non-NUMA system by tweaking the BIOS so that memory is interleaved on a 4KB basis. This is a memory model HP calls Sufficiently Uniform Memory Access (SUMA), as stated in this DL585 Technology Brief (pg. 6):

Node interleaving (SUMA) breaks memory into 4-KB addressable entities. Addressing starts with address 0 on node 0 and sequentially assigns through address 4095 to node 0, addresses 4096 through 8191 to node 1, addresses 8192 through 12287 to node 2, and addresses 12288 […]

Booting in this fashion essentially turns an HP DL585 into a “flat-memory” SMP—or a SUMA in HP parlance. There seem to be conflicting monikers for running Opteron SMPs in this mode. IBM has a Redbook that covers the varying NUMA offerings in their System x portfolio. The abstract for this Redbook states:

The AMD Opteron implementation is called Sufficiently Uniform Memory Organization (SUMO) and is also a NUMA architecture. In the case of the Opteron, each processor has its own “local” memory with low latency. Every CPU can also access the memory of any other CPU in the system but at longer latency.

Whether it is SUMA or SUMO, the concept is cool, but a bit foreign to me given my NUMA background. The NUMA systems I worked on in the 90s consisted of distinct, separate small systems—each with their own memory and I/O cards, power supplies and so on. They were coupled into a single shared memory image with specialized hardware inserted into the system bus of each little system. These cards were linked together and the whole package was a cache coherent SMP (ccNUMA).

Is SUMA Recommended For Oracle?
Since the HP DL585 can be SUMA/SUMO, I thought I’d give it a test. But first I did a little research to see how most folks use these in the field. I know from the BIOS on my system that you actually get a warning and have to override it when setting up interleaved memory (SUMA). I also noticed that in one of HP’s Oracle Validated Configurations, the following statement is made:

Settings in the server BIOS adjusted to allow memory/node interleaving to work better with the ‘numa=off’ boot option

and:

Boot options
elevator=deadline numa=off


I found this to be strange, but I don’t yet fully understand why that recommendation is made. Why did they perform this validation with SUMA? When running a 4-socket Opteron system in SUMA mode, only 25% of all memory accesses will be to local memory. When I say all, I mean all—both user and kernel mode. The Linux 2.6 kernel is NUMA-aware, so it seems like a waste to transform a NUMA system into a SUMA system. How can boiling down a NUMA system with interleaving (SUMA) possibly be optimal for Oracle? I will blog about this more as this series continues.
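For reference, those validated-configuration boot options land on the kernel line in grub.conf. Something like the following, where the kernel version and root device are placeholders of mine, not values from the validated configuration:

# /boot/grub/grub.conf excerpt:
kernel /vmlinuz-2.6.9-34.ELsmp ro root=/dev/sda2 elevator=deadline numa=off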

Is the x86_64 Linux Oracle Port NUMA Aware?
No, sorry, it is not. I might as well just come out and say it.

The NUMA API for Linux is very rudimentary compared to the boutique features in legacy NUMA systems like Sequent DYNIX/ptx and SGI IRIX, but it does support memory and process placement. I’ll blog later about the things it is missing that a NUMA-aware Oracle port would require.

The Linux 2.6 kernel is NUMA aware, but what is there for applications? The NUMA API, which is implemented in a library called libnuma.so. But you don’t have to code to the API to get NUMA awareness. The major 2.6 Linux kernel distributions (RHEL4 and SLES) ship with a command that uses the NUMA API in ways I’ll show later in this blog entry. The command is numactl(8), and it dynamically links to the NUMA API library (emphasis added by me):

$ uname -a
Linux tmr6s13 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
$ type numactl
numactl is hashed (/usr/bin/numactl)
$ ldd /usr/bin/numactl
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003ba3200000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003ba2f00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003ba2d00000)

Whereas the numactl(8) command links with libnuma.so, Oracle does not:

$ type oracle
oracle is /u01/app/oracle/product/10.2.0/db_1/bin/oracle
$ ldd /u01/app/oracle/product/10.2.0/db_1/bin/oracle
libskgxp10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libskgxp10.so (0x0000002a95557000)
libhasgen10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libhasgen10.so (0x0000002a9565a000)
libskgxn2.so => /u01/app/oracle/product/10.2.0/db_1/lib/libskgxn2.so (0x0000002a9584d000)
libocr10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libocr10.so (0x0000002a9594f000)
libocrb10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libocrb10.so (0x0000002a95ab4000)
libocrutl10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libocrutl10.so (0x0000002a95bf0000)
libjox10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libjox10.so (0x0000002a95d65000)
libclsra10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libclsra10.so (0x0000002a96830000)
libdbcfg10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libdbcfg10.so (0x0000002a96938000)
libnnz10.so => /u01/app/oracle/product/10.2.0/db_1/lib/libnnz10.so (0x0000002a96a55000)
libaio.so.1 => /usr/lib64/libaio.so.1 (0x0000002a96f15000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003ba3200000)
libm.so.6 => /lib64/tls/libm.so.6 (0x0000003ba3400000)
libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003ba3800000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003ba7300000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003ba2f00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003ba2d00000)

No Big Deal, Right?
This NUMA stuff must just be a farce then, right? Let’s dig in. First, I’ll use the SLB (http://oaktable.net/getFile/148). Later I’ll move on to what fellow OakTable Network member Anjo Kolk and I refer to as the Jonathan Lewis Oracle Computing Index. The JL Oracle Computing Index is yet another microbenchmark that is very easy to run and compare memory throughput from one server to another using an Oracle workload. I’ll use this next to blog about NUMA effects/affects on a running instance of Oracle. After that I’ll move on to more robust Oracle OLTP and DSS workloads. But first, more SLB.

The SLB on SUMA/SUMO
First, let’s use the numactl(8) command to see what this DL585 looks like. Is it NUMA or SUMA?

$ uname -a
Linux tmr6s13 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
$ numactl --hardware
available: 1 nodes (0-0)
node 0 size: 32767 MB
node 0 free: 30640 MB

OK, this is a single node NUMA—or SUMA since it was booted with memory interleaving on. If it wasn’t for that boot option the command would report memory for all 4 “nodes” (nodes are sockets in the Opteron NUMA world). So, I set up a series of SLB tests as follows:

$ cat example1
echo "One thread"
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./trigger
wait


echo "Two threads, same socket"
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 6
./memhammer 262144 6000 &
./trigger
wait

echo "Two threads, different sockets"
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 5
./memhammer 262144 6000 &
./trigger
wait

echo "4 threads, 4 sockets"
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 5
./memhammer 262144 6000 &
./cpu_bind $$ 3
./memhammer 262144 6000 &
./cpu_bind $$ 1
./memhammer 262144 6000 &
./trigger
wait

echo "8 threads, 4 sockets"
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./memhammer 262144 6000 &
./cpu_bind $$ 5
./memhammer 262144 6000 &
./memhammer 262144 6000 &
./cpu_bind $$ 3
./memhammer 262144 6000 &
./memhammer 262144 6000 &
./cpu_bind $$ 1
./memhammer 262144 6000 &
./memhammer 262144 6000 &
./trigger
wait

And now the measurements:

$ sh ./example1
One thread
Total ops 1572864000 Avg nsec/op 71.5 gettimeofday usec 112433955 TPUT ops/sec 13989225.9
Two threads, same socket
Total ops 1572864000 Avg nsec/op 73.4 gettimeofday usec 115428009 TPUT ops/sec 13626363.4
Total ops 1572864000 Avg nsec/op 74.2 gettimeofday usec 116740373 TPUT ops/sec 13473179.5
Two threads, different sockets
Total ops 1572864000 Avg nsec/op 73.0 gettimeofday usec 114759102 TPUT ops/sec 13705788.7
Total ops 1572864000 Avg nsec/op 73.0 gettimeofday usec 114853095 TPUT ops/sec 13694572.2
4 threads, 4 sockets
Total ops 1572864000 Avg nsec/op 78.1 gettimeofday usec 122879394 TPUT ops/sec 12800063.1
Total ops 1572864000 Avg nsec/op 78.1 gettimeofday usec 122820373 TPUT ops/sec 12806214.2
Total ops 1572864000 Avg nsec/op 78.2 gettimeofday usec 123016921 TPUT ops/sec 12785753.3
Total ops 1572864000 Avg nsec/op 78.5 gettimeofday usec 123527864 TPUT ops/sec 12732868.1
8 threads, 4 sockets
Total ops 1572864000 Avg nsec/op 156.3 gettimeofday usec 245773200 TPUT ops/sec 6399656.3
Total ops 1572864000 Avg nsec/op 156.3 gettimeofday usec 245848989 TPUT ops/sec 6397683.4
Total ops 1572864000 Avg nsec/op 156.4 gettimeofday usec 245941009 TPUT ops/sec 6395289.7
Total ops 1572864000 Avg nsec/op 156.4 gettimeofday usec 246000176 TPUT ops/sec 6393751.5
Total ops 1572864000 Avg nsec/op 156.6 gettimeofday usec 246262366 TPUT ops/sec 6386944.2
Total ops 1572864000 Avg nsec/op 156.5 gettimeofday usec 246221624 TPUT ops/sec 6388001.1
Total ops 1572864000 Avg nsec/op 156.7 gettimeofday usec 246402465 TPUT ops/sec 6383312.8
Total ops 1572864000 Avg nsec/op 156.8 gettimeofday usec 246594031 TPUT ops/sec 6378353.9

SUMA baselines at a 71.5ns average write operation and tops out at about 156ns with 8 concurrent threads of SLB execution (one per core). Let’s see what the SLB on NUMA does.

SLB on NUMA
First, let’s get an idea what the memory layout is like:

$ uname -a
Linux tmr6s14 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
$ numactl --hardware
available: 4 nodes (0-3)
node 0 size: 8191 MB
node 0 free: 5526 MB
node 1 size: 8191 MB
node 1 free: 6973 MB
node 2 size: 8191 MB
node 2 free: 7841 MB
node 3 size: 8191 MB
node 3 free: 7707 MB

OK, this means that there is approximately 5.5GB, 6.9GB, 7.8GB and 7.7GB of free memory on “nodes” 0, 1, 2 and 3 respectively. Why is the first node (node 0) lop-sided? I’ll tell you in the next blog entry. Let’s run some SLB. First, I’ll use numactl(8) to invoke memhammer with the directive that forces allocation of memory on a node-local basis. The first test is one memhammer process per socket:

$ cat ./membind_example.4
./create_sem
numactl --membind 3 --cpubind 3 ./memhammer 262144 6000 &
numactl --membind 2 --cpubind 2 ./memhammer 262144 6000 &
numactl --membind 1 --cpubind 1 ./memhammer 262144 6000 &
numactl --membind 0 --cpubind 0 ./memhammer 262144 6000 &
./trigger
wait

$ bash ./membind_example.4
Total ops 1572864000 Avg nsec/op 67.5 gettimeofday usec 106113673 TPUT ops/sec 14822444.2
Total ops 1572864000 Avg nsec/op 67.6 gettimeofday usec 106332351 TPUT ops/sec 14791961.1
Total ops 1572864000 Avg nsec/op 68.4 gettimeofday usec 107661537 TPUT ops/sec 14609340.0
Total ops 1572864000 Avg nsec/op 69.7 gettimeofday usec 109591100 TPUT ops/sec 14352114.4

This test is the same as the one above called “4 threads, 4 sockets” performed on the SUMA configuration, where the latencies were 78ns. Switching from SUMA to NUMA and executing with NUMA placement brought the latencies down 13% to an average of 68ns. Interesting. Moreover, this test with 4 concurrent memhammer processes actually demonstrates better latencies than the single-process average on SUMA, which was 72ns. That comparison alone is quite interesting because it makes the point quite clear that SUMA in a 4-socket system is a 75% remote memory configuration—even for a single process like memhammer.

The next test was 2 memhammer processes per socket:

$ more membind_example.8
./create_sem
numactl --membind 3 --cpubind 3 ./memhammer 262144 6000 &
numactl --membind 3 --cpubind 3 ./memhammer 262144 6000 &
numactl --membind 2 --cpubind 2 ./memhammer 262144 6000 &
numactl --membind 2 --cpubind 2 ./memhammer 262144 6000 &
numactl --membind 1 --cpubind 1 ./memhammer 262144 6000 &
numactl --membind 1 --cpubind 1 ./memhammer 262144 6000 &
numactl --membind 0 --cpubind 0 ./memhammer 262144 6000 &
numactl --membind 0 --cpubind 0 ./memhammer 262144 6000 &
./trigger
wait

$ sh ./membind_example.8
Total ops 1572864000 Avg nsec/op 95.8 gettimeofday usec 150674658 TPUT ops/sec 10438809.2
Total ops 1572864000 Avg nsec/op 96.5 gettimeofday usec 151843720 TPUT ops/sec 10358439.6
Total ops 1572864000 Avg nsec/op 96.9 gettimeofday usec 152368004 TPUT ops/sec 10322797.2
Total ops 1572864000 Avg nsec/op 96.9 gettimeofday usec 152433799 TPUT ops/sec 10318341.5
Total ops 1572864000 Avg nsec/op 96.9 gettimeofday usec 152436721 TPUT ops/sec 10318143.7
Total ops 1572864000 Avg nsec/op 97.0 gettimeofday usec 152635902 TPUT ops/sec 10304679.2
Total ops 1572864000 Avg nsec/op 97.2 gettimeofday usec 152819686 TPUT ops/sec 10292286.6
Total ops 1572864000 Avg nsec/op 97.6 gettimeofday usec 153494359 TPUT ops/sec 10247047.6

What’s that? Writing memory on the SUMA configuration in the 8 concurrent memhammer case demonstrated latencies on the order of 156ns, but the latency dropped 38% to 97ns by switching to NUMA and using the Linux 2.6 NUMA API. No, of course an Oracle workload is not all random writes, but a system has to be able to handle the difficult aspects of a workload in order to offer good throughput. I won’t ask the rhetorical question of why Oracle is not NUMA aware in the x86_64 Linux ports until my next blog entry, where the measurements will be based not on the SLB but on a real Oracle instance.

Déjà vu
Hold it. Didn’t the Dell PE1900 with a quad-core “Clovertown” Xeon E5320 exhibit ~500ns latencies with only 4 concurrent threads of SLB execution (1 per core)? That was what was shown in this blog entry. Interesting.

I hope it is becoming clear why NUMA awareness is interesting. NUMA systems offer a great deal of potential incremental bandwidth when local memory is preferred over remote memory.

Next up—comparisons of SUMA versus NUMA with the Jonathan Lewis Computing Index and why all is not lost just because the 10gR2 x86_64 Linux port is not NUMA aware.

Oracle on Opteron with Linux-The NUMA Angle (Part IV). Some More About the Silly Little Benchmark.

In my recent blog post entitled Oracle on Opteron with Linux-The NUMA Angle (Part III). Introducing the Silly Little Benchmark, I made available the SLB and hoped to get some folks to measure some other systems using the kit. Well, I got my first results back from a fellow member of the OakTable Network—Christian Antognini of Trivadis AG. I finally got to meet him face to face back in November 2006 at UKOUG.

Christian was nice enough to run it on a brand new Dell PE1900 with, yes, a quad-core “Clovertown” processor of the low-end E5320 variety. As packaged, this Clovertown-based system has a 1066MHz front side bus and the memory is configured with 4x1GB 667MHz DIMMs. The processor is clocked at 1.86GHz.

Here is a snippet of /proc/cpuinfo from Christian’s system:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU E5320 @ 1.86GHz
stepping : 7
cpu MHz : 1862.560

I asked Christian to run the SLB (memhammer) with 1, 2 and 4 threads of execution and to limit the amount of memory per process to 512MB. He submitted the following:

[cha@helicon slb]$ cat example4.sh
./cpu_bind $$ 3
./create_sem
./memhammer 131072 6000 &
./trigger
wait

./cpu_bind $$ 3
./create_sem
./memhammer 131072 6000 &
./cpu_bind $$ 1
./memhammer 131072 6000 &
./trigger
wait

./cpu_bind $$ 3
./create_sem
./memhammer 131072 6000 &
./cpu_bind $$ 2
./memhammer 131072 6000 &
./cpu_bind $$ 1
./memhammer 131072 6000 &
./cpu_bind $$ 0
./memhammer 131072 6000 &
./trigger
wait
[cha@helicon slb]$ ./example4.sh
Total ops 786432000 Avg nsec/op 131.3 gettimeofday usec 103240338 TPUT ops/sec 7617487.7
Total ops 786432000 Avg nsec/op 250.4 gettimeofday usec 196953024 TPUT ops/sec 3992992.8
Total ops 786432000 Avg nsec/op 250.7 gettimeofday usec 197121780 TPUT ops/sec 3989574.4
Total ops 786432000 Avg nsec/op 503.6 gettimeofday usec 396010106 TPUT ops/sec 1985888.7
Total ops 786432000 Avg nsec/op 503.6 gettimeofday usec 396023560 TPUT ops/sec 1985821.2
Total ops 786432000 Avg nsec/op 504.6 gettimeofday usec 396854086 TPUT ops/sec 1981665.4
Total ops 786432000 Avg nsec/op 505.1 gettimeofday usec 397221522 TPUT ops/sec 1979832.3

So, while this is not a flagship Clovertown Xeon (e.g., E5355), the latencies are pretty scary. Contrast these results with the DL585 Opteron 850 numbers I shared in this blog entry. The Opteron 850 delivers 69ns with 2 concurrent threads of execution, some 47% quicker than this system exhibits with only 1 memhammer process running, and the direct comparison of 2 concurrent memhammer processes is an astounding 3.6x slower than the Opteron 850 box. Here we see the true benefit of an on-die memory controller and the fact that Hypertransport is a true 1GHz path to memory. With 4 concurrent memhammer processes, the E5320 bogged down to 500ns! I’ll blog soon about what I see with the SLB on 4 sockets with my DL585 in the continuing NUMA series.

Other Numbers
I’d sure like to get numbers from others. How about Linux Itanium? How about a Power system with AIX? How about some SPARC numbers? Anyone have a Socket F Opteron box they could collect SLB numbers on? If so, get the kit and run the same scripts Christian did on his Dell PE1900. Thanks, Christian.

A Note About The SLB
Be sure to limit the memory allocation such that it does not cause major page faults or, eek, swapping. The first argument to memhammer is the number of 4KB pages to allocate.
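For example, the 512MB-per-process limit Christian used works out like this:

$ # 131072 pages x 4KB per page = 512MB per memhammer process
$ expr 131072 \* 4096 / 1048576
512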

A Non-Trivial MySQL Arithmetic Bug and The Moronic Quote of the Day

We were just having a good laugh about this over on the OakTable Network list. It appears as though MySQL is exhibiting a nasty arithmetic bug. It seems when they ran the pricing numbers through the database, free—or $0.00—became $40,000. MySQL represents the best of the Open Source model—lots of free code, yet the MySQL website says:

For the price of a single CPU of Oracle Enterprise Edition ($40,000 per CPU), you can deploy an unlimited number […]

Whatever happened to the real Open Source database model, which is free download and something like www.mysqlfreaks.com/ for support?

Moronic Quote of the Day
Also at The MySQL Website is this jewel (emphasis added by me):

Not only does open source save money, it provides an architecture that is more scalable for modern web-based applications.

Whether or not your application is “modern” does not make MySQL scale better than Oracle. Sorry, nice try. I wonder if MySQL marketing is a volunteer effort to match the product development?

Dell Compares RAC and Non-RAC Performance and Cost

May 30, 2007. BLOG UPDATE: Note, the author of the papers I discussed in this blog entry has visited and commented. If nothing else, I recommend reading my follow up regarding the fact that these papers don’t even have the word Oracle in them.

It isn’t very often that you get a tier one hardware vendor directly comparing RAC with non-RAC. When it happens, it is generally by accident. That doesn’t stop me from learning from the information. I hope you will find it interesting too.

So, Dell didn’t exactly set out to compare RAC to non-RAC, but they inadvertently did. In October 2006, they released a series of whitepapers that compare Dell with Oracle to Sun with Oracle. I personally think such comparisons are a complete waste of time since Sun shops are going to run Sun and Windows shops are going to run Windows.

The whitepapers take two swipes at the Sun V490 with 8 UltraSPARC IV+ processors. The first is a cluster of Dell 2950s, each with 2 dual-core Xeon 5160 (Woodcrest) processors, running Red Hat Enterprise Linux 4. The second was a single Dell 6850 with 4 dual-core Xeon 7140 (Tulsa) processors running Windows 2003. Oh, if only they had both been Linux. No matter though, the comparison is still very interesting. The papers are available at the following URLs:

Even though the paper was intended to cast stones at the Sun V490, there was one particularly interesting aspect of the testing that makes the results helpful in so many other ways. See, Dell did all this testing with the same SAN. In fact, a good portion of these papers are identical. The description of the SAN used for the V490, Clustered 2950s and the 6850 appears in both papers as follows:

Storage for both the Dell and Sun servers was provided by a Storage Area Network (SAN) attached Dell/EMC CX3-80 fibre channel storage array. Each server was attached to the SAN via two QLogic Host Bus Adapters.

There we have it, 3 configurations with the same application code, the same database schema and the same database size. How tasty!

The Workload
They used Dell’s DVD Store test application suite, which has been available since about 2005 at http://linux.dell.com/dvdstore/. I have used this workload quite a bit myself, actually. It exhibits a lot of the same characteristics as TPC-C—for what it is worth. By the way, the link I provided works; the one in the whitepapers is faulty. I guess that will be my value add.

The Numbers
Like I said, forget about the comparison to Sun. I say look at the comparison of clustered versus non-clustered Oracle. I’ll let you read the papers for the full nitty-gritty, but the summary is worth a lengthy discussion:

Configuration        Cost        Throughput (Orders/Minute)
Dell 6850            $185,747    32,264
Dell 2950 Cluster    $266,852    22,169

Remarkable. And remember, all the important aspects of this sort of test were constant between the two: the application, the database schema, the database size and the storage.

Highly Available
Yes, the Dell 2950 cluster theoretically offers more availability. That is important in the event of a failure, sure, but it delivers 31% less throughput than the 6850 solution when it is fully healthy. The important comparison, I believe, is the 6850 against the “brown-out” effect of running the application on a single surviving node of the 2950 cluster. With only one node surviving in the event of a failure, the 2950 cluster solution would be capable of 11,084 orders per minute—about 66% less throughput than the 6850. I think it breaks down like this: the clustered 2950 solution costs 44% more and performs 31% worse, and in the event of a failure, a surviving 2950 will offer about 1/3rd the throughput of a 6850.
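For those checking my math, the percentages come straight from the table above; a quick sanity check (my arithmetic, not anything Dell published):

$ awk 'BEGIN {
        printf "cluster cost premium: %.0f%%\n", (266852 / 185747 - 1) * 100
        printf "healthy cluster: %.0f%% less throughput\n", (1 - 22169 / 32264) * 100
        printf "one surviving node: %.0f%% less throughput\n", (1 - (22169 / 2) / 32264) * 100
}'
cluster cost premium: 44%
healthy cluster: 31% less throughput
one surviving node: 66% less throughput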

Oracle on Opteron with Linux-The NUMA Angle (Part III). Introducing The Silly Little Benchmark.

In my blog “mini-series” about Oracle on Opteron NUMA, I am about to start covering the Linux 2.6 NUMA API and what it means to Oracle. I will share a lot of statspack information for certain, but first we need to start with micro-benchmark tests. The best micro-benchmark for analysis of memory latency is one that uses the fewest processor cycles to write memory that is most likely not in processor cache. That is, spend the fewest cycles to put the most load on the memory subsystem. To that end, I’d like to make available an SLB—a Silly Little Benchmark.

Introducing the Silly Little Benchmark
I took some old code of mine as the framework and stripped out all but the most simplistic code to make the points of memory latency and locality clear. Now, I’m not suggesting this SLB mimics Oracle at all. There are 3 significant sources of contention missing from this SLB that would bring it closer to what Oracle does to a system:

  1. Shared Memory
  2. Mutual Exclusion (e.g., spinlocks in shared memory)
  3. I/O

It just so happens that the larger code I ripped this SLB from does possess these three characteristics; however, I want to get the simplest form of it out there first. As this NUMA series progresses I’ll make other pieces available. Also, this version of the SLB is quite portable—it should work on about any modern Unix variant.

Where to Download the SLB
The SLB kit is stored in a tar archive here (slb.tar).

Description of the SLB
It is supposed to be very simple and it is. It consists of four parts:

  • create_sem: as simple as it sounds. It creates a single IPC semaphore in advance of memhammer.
  • memhammer: the Silly Little Benchmark driver. It takes two arguments (without options). The first argument is the number of 4KB pages to allocate and the second is the number of loop iterations to perform.
  • trigger: all memhammer processes wait on the semaphore created by create_sem; this program operates the semaphore to trigger a run.
  • cpu_bind: binds a process to a CPU.

The first action for running the SLB is to execute create_sem. Next, fire off any number of memhammer processes, up to the number of processors on the system; it makes no sense to run more memhammer processes than there are processors in the machine. Each memhammer uses malloc(3) to allocate some heap, initializes it all with memset(3), and then waits on the semaphore created by create_sem. Next, execute trigger and the SLB will commence its work loop, which loops through pseudo-random 4KB offsets in the malloc’ed memory and writes an 8-byte location within the first 64 bytes of each page. Why 64 bytes? All the 64-bit systems I know of manage physical memory using a 64-byte cache line. As long as we write on any location residing entirely within a 64-byte line, we cause as much work for the memory subsystem as we would if we scribbled on each of the eight 8-byte words the line can hold. Not scribbling over the entire line relieves us of CPU overhead and allows us to put more duress on the memory subsystem—and that is the goal. The SLB has a very small measured unit of work, but it causes maximum memory stalls. Well, not maximum, that would require spinlock contention, but it is good enough for this point of the NUMA mini-series.

Measuring Time
In prior lives, all of this sort of low-level measuring was performed with x86 assembly that reads the processor cycle counter (RDTSC). However, I’ve found it to be very inconsistent on multi-core AMD processors no matter how much fiddling I do to serialize the processor (e.g., with the CPUID instruction). It could just be me, I don’t know. It turns out that it is difficult to stop predictive reading of the TSC, and I don’t have time to fiddle with it on a pre-Socket F Opteron. When I finally get my hands on a Socket F Opteron system, I’ll change my measurement technique to RDTSCP, which serializes and reads the time stamp counter correctly. Until then, I think performing millions upon millions of operations and dividing the microsecond-resolution gettimeofday(2) elapsed time by the operation count should be about sufficient. Any trip through the work loop that gets nailed by hardware interrupts will unfortunately increase the average, but running the SLB on an otherwise idle system should be a pretty clean test.
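In other words, the reported average is just elapsed time divided by total operations. Checking against the first single-invocation result shown below:

$ # 262144 pages x 6000 iterations = 1572864000 ops in 108281130 usec
$ awk 'BEGIN { printf "%.1f ns/op\n", 108281130 * 1000 / 1572864000 }'
68.8 ns/op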

Example Measurements
Getting ready for the SLB is quite simple. Simply extract and compile:

$ ls -l slb.tar
-rw-r--r-- 1 root root 20480 Jan 26 10:12 slb.tar
$ tar xvf slb.tar
cpu_bind.c
create_sem.c
Makefile
memhammer.c
trigger.c
$ make

cc -c -o memhammer.o memhammer.c
cc -O -o memhammer memhammer.o
cc -c -o trigger.o trigger.c
cc -O -o trigger trigger.o
cc -c -o create_sem.o create_sem.c
cc -O -o create_sem create_sem.o
cc -c -o cpu_bind.o cpu_bind.c
cc -O -o cpu_bind cpu_bind.o

Some Quick Measurements
I used my DL585 with 4 dual-core Opteron 850s to compare 1 single invocation of memhammer to 2 invocations on the same socket. The first test bound the execution to processor number 7, which executed the test in 108.28 seconds with an average write latency of 68.8ns. The next test was executed with 2 invocations, both on the same physical CPU. This caused the result to be a bit “lumpy.” The average of the two was 70.2ns—about 2% more than the single invocation on the same processor. For what it is worth, there was 2.4% average latency variation between the two concurrent invocations:

$ cat example1
./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./trigger
wait

./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 6
./memhammer 262144 6000 &
./trigger
wait

$ sh ./example1

Total ops 1572864000 Avg nsec/op 68.8 gettimeofday usec 108281130 TPUT ops/sec 14525744.2

Total ops 1572864000 Avg nsec/op 69.3 gettimeofday usec 108994268 TPUT ops/sec 14430703.8

Total ops 1572864000 Avg nsec/op 71.0 gettimeofday usec 111633529 TPUT ops/sec 14089530.4

The next test was to dedicate an entire socket to each of two concurrent invocations, which really smoothed out the result. Executing this way resulted in only about 1/10th of 1 percent variance between the average write latencies:

$ cat example2

./cpu_bind $$ 7
./create_sem
./memhammer 262144 6000 &
./cpu_bind $$ 5
./memhammer 262144 6000 &
./trigger
wait

$ sh ./example2

Total ops 1572864000 Avg nsec/op 69.1 gettimeofday usec 108640712 TPUT ops/sec 14477666.5

Total ops 1572864000 Avg nsec/op 69.2 gettimeofday usec 108871507 TPUT ops/sec 14446975.6

Cool And Quiet
Up to this point, the testing was done with the DL585 executing at 2200MHz. Since I have my DL585s set for dynamic power consumption adjustment, I can simply blast a SIGUSR2 at the cpuspeed processes. The cpuspeed processes catch the SIGUSR2 and adjust the P-state of the processor to the lowest power consumption—features supported by AMD Cool And Quiet technology. The following shows how to determine the clock speeds the processor is fit to execute at. In my case, the processor will range from 2200MHz down to 1800MHz. Note, I recommend fixing the clock speed with SIGUSR1 or SIGUSR2 before any performance testing; you might grow tired of inconsistent results otherwise. Note, there is no man page for the cpuspeed executable. You have to execute it to get a help message with command usage.

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
2200000 2000000 1800000

# chkconfig --list cpuspeed
cpuspeed 0:off 1:on 2:on 3:on 4:on 5:on 6:on

# ps -ef | grep cpuspeed | grep -v grep
root 1796 1 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1797 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1798 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1799 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1800 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1801 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1802 1796 0 11:23 ? 00:00:00 cpuspeed -d -n
root 1803 1796 0 11:23 ? 00:00:00 cpuspeed -d -n

# cpuspeed -h 2>&1 | tail -12

To have a CPU stay at the highest clock speed to maximize performance send the process controlling that CPU the SIGUSR1 signal.

To have a CPU stay at the lowest clock speed to maximize battery life send the process controlling that CPU the SIGUSR2 signal.

To resume having a CPU’s clock speed dynamically scaled send the process controlling that CPU the SIGHUP signal.
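So dropping every core to its lowest speed is just a matter of signaling those eight cpuspeed processes. A sketch (the PIDs are from the ps listing above; pkill(1) is the lazy equivalent of listing them all, and the grep simply verifies the result):

# kill -USR2 1796 1797 1798 1799 1800 1801 1802 1803
# pkill -USR2 cpuspeed
# grep MHz /proc/cpuinfo | sort -u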

So, now that the clock speed has been adjusted down 18% from 2200MHz to 1800MHz, let’s see what example1 does:

$ sh ./example1
Total ops 1572864000 Avg nsec/op 81.6 gettimeofday usec 128352437 TPUT ops/sec 12254259.0
Total ops 1572864000 Avg nsec/op 77.0 gettimeofday usec 121043918 TPUT ops/sec 12994159.7
Total ops 1572864000 Avg nsec/op 82.8 gettimeofday usec 130198851 TPUT ops/sec 12080475.3

The slower clock rate brought the single-invocation number up to an 81.6ns average—an 18% increase.

With the next blog entry, I’ll start to use the SLB to point out NUMA characteristics of 4-way Opteron servers. After that it will be time for some real Oracle numbers. Please stay tuned.

Non-Linux Platforms
My old buddy Glenn Fawcett of Sun’s Strategic Applications Engineering Group has collected data from both SPARC and Opteron-based Sun servers. He said to use the following compiler options:

  • -xarch=v9 … for sparc
  • -xarch=amd64 for opteron
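The SLB Makefile just uses cc with default rules, so these options can probably be passed straight through make; a guess at the invocation (untested on my end):

$ make CC=cc CFLAGS="-O -xarch=v9"      # SPARC
$ make CC=cc CFLAGS="-O -xarch=amd64"   # Opteron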

He and I had Martinis the other night, but I forgot to ask him to run it on idle systems for real measurement purposes. Having said that, I’d really love to see numbers from anyone who cares to run this. It would be great to have the output of /proc/cpuinfo and the server make/model. In fact, if someone can run this on a Socket F Opteron system I’d greatly appreciate it. It seems our Socket F systems are not slated to arrive here for a few more weeks.

AMD Quad-Core “Barcelona” Processor For Oracle (Part V). 40% Expected Over Clovertown.

A reader posted an interesting comment on the latest installment on my thread about Oracle licensing on the upcoming AMD Barcelona processor. The comment as posted on my blog article entitled AMD Quad-Core “Barcelona” Processor For Oracle (Part IV) and the Web 2.0 Trolls states:

The problem with your numbers is that they are based on old AMD marketing materials. AMD has had a chance to run their engineering samples at their second stepping (they are now gearing up full production for late Q2 delivery – 12 weeks from wafer starts) and they are currently claiming a 40% advantage on Clovertown versus the 70% over the Opteron 2200 from their pre-A0 stepping marketing material.

The AMD claim was covered in this ZDNet article which quotes AMD Vice President Randy Allen as follows:

“We expect across a wide variety of workloads for Barcelona to outperform Clovertown by 40 percent,” Allen said. The quad-core chip also will outperform AMD’s current dual-core Opterons on “floating point” mathematical calculations by a factor of 3.6 at the same clock rate, he said.

That is a significantly different set of projections than I covered in my article entitled AMD Quad-core “Barcelona” Processor For Oracle (Part II), which covers AMD’s initial projection of a 70% OLTP improvement per processor (socket) over the Opteron 2200. These new projections are astounding, and for the sake of competition I would love to see them hold up. Let’s take a closer look.

Hypertransport Bandwidth
I’m glad AMD has set expectations by stating the 40% uplift over Clovertown would be realized across “a wide variety of workloads.” However, since this is an Oracle blog, I would have much preferred to see OLTP mentioned specifically. The numbers are hard to imagine, and it is all about feeding the processor, not the processor itself. Barcelona is socket-compatible with Socket F, so any improvement over the Opteron 2200/8200 requires existing headroom on the Hypertransport for workloads like OLTP. A lot of headroom; let’s look at the numbers.

The Socket F baseline for AMD’s original projections was 139,693 TpmC. If OLTP is included in the “wide variety of workloads”, then the projected OLTP throughput would be Clovertown’s 222,117 TpmC x 1.4, or 310,963 TpmC, all things being equal. That is 2.2 times the throughput pushed through the very same Socket F/Hypertransport plumbing, which would mean the Opteron 2200 result left more than half of the transport bandwidth idle. Time for a show of hands: how many folks out there think the Opteron 2200 OLTP result of 139,693 TpmC was achieved with more than 50% headroom to spare on the Hypertransport links? I would love to see Barcelona come in with this sort of OLTP throughput, but folks, systems are not built with more than twice the bus bandwidth their processors need. I’m not very hopeful.

Bear in mind that today’s Tulsa processor, as packaged in the IBM System x3950, is capable of 331,087 TpmC with 8 cores. So let’s factor in Oracle licensing and see what the numbers look like if AMD’s projections apply to OLTP:

  • Opteron 2200, 4 cores: 139,693 TpmC, 2 licenses = 69,846 TpmC per license
  • Clovertown, 8 cores: 222,117 TpmC, 4 licenses = 55,529 TpmC per license
  • AMD old projection, 8 cores: 237,478 TpmC, 4 licenses = 59,369 TpmC per license
  • AMD new projection, 8 cores: 310,963 TpmC, 4 licenses = 77,740 TpmC per license
  • Tulsa, 8 cores: 331,087 TpmC, 4 licenses = 82,771 TpmC per license
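
If you want to fiddle with these per-license numbers yourself, here is a minimal back-of-the-envelope calculator. It is only a sketch: it assumes the 0.5 x86 core factor with fractional license counts rounded up, and the TpmC figures are nothing more than the published results and AMD projections quoted above.

#include <stdio.h>
#include <math.h>

/* TpmC per Oracle license: x86 cores are factored at 0.5 and
 * fractional license counts round up. */
static double tpmc_per_license(double tpmc, int cores, double core_factor)
{
        double licenses = ceil(cores * core_factor);
        return tpmc / licenses;
}

int main(void)
{
        printf("Opteron 2200 (4 cores): %d\n", (int)tpmc_per_license(139693.0, 4, 0.5));
        printf("Clovertown (8 cores): %d\n", (int)tpmc_per_license(222117.0, 8, 0.5));
        printf("AMD old projection (8 cores): %d\n", (int)tpmc_per_license(237478.0, 8, 0.5));
        printf("AMD new projection (8 cores): %d\n", (int)tpmc_per_license(310963.0, 8, 0.5));
        printf("Tulsa (8 cores): %d\n", (int)tpmc_per_license(331087.0, 8, 0.5));
        return 0;
}

Compile it with cc calc.c -lm. Changing core_factor to .25 or .75 shows in one keystroke why the core-factoring debate matters so much to these comparisons.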

Barcelona Floating Point
FPU performance doesn’t matter to Oracle, as I point out in this blog entry.

Clock Speed
The news of the expected 40% jump over Clovertown was accompanied by word that Barcelona will clock in at a lower speed than the Opteron 2200/8200 processors. I haven’t mentioned that aspect because, with Oracle, it really doesn’t matter much; the amount of work Oracle gets done entirely in cache is essentially nil. I’ll blog about clock speed with Opterons very soon.

AMD Quad-Core “Barcelona” Processor For Oracle (Part IV) and the Web 2.0 Trolls.

This blog entry is the fourth in a series:

Oracle on Opteron with Linux–The NUMA Angle (Part I)

Oracle on Opteron with Linux-The NUMA Angle (Part II)

Oracle on Opteron with Linux-The NUMA Angle (Part III)

It Really is All About The Core, Not the Processor (Socket)
In my post entitled AMD Quad-core “Barcelona” Processor For Oracle (Part III). NUMA Too!, I had to set a reader straight on his confusion over the terms processor, core and socket. He followed up with:

kevin – you are correct. your math is fine. though, i may still disagree about core being a better term than “physical processor”, but that is neither here, nor there.

He continues:

my gut told me based upon working with servers and knowing both architectures your calculations were incorrect, instead i errored in my math as you pointed out. *but*, i did uncover an error in your logic that makes your case worthless.

So, I am replying here and now. His gut may just be telling him that he ate something bad, or it could be his conscience getting to him for mouthing off over at the Investor Village AMD board, where he called me a moron. His self-proclaimed server expertise is not relevant here, nor is it likely at the level he insinuates.

This is a blog about Oracle; I wish he’d get that through his head. Oracle licenses their flagship software (Real Application Clusters) at a list price of USD $60,000 per CPU. As I’ve pointed out, x86 cores are factored at .5 so a quad-core Barcelona will be 2 licenses—or $120,000 per socket. Today’s Tulsa processor licenses at $60,000 per socket and outperforms AMD’s projected Barcelona performance. AMD’s own promotional material suggests it will achieve a 70% OLTP (TPC-C) gain over today’s Opteron 2200. Sadly that is just not good enough where Oracle is concerned. I am a huge AMD fan, so this causes me grief.

Also, since he is such a server expert, he must certainly be aware that plugging a Barcelona processor into a Socket F board will require 70% headroom on the Hypertransport to attain that projected 70% OLTP increase. We aren’t talking about some CPU-only workload here; we are talking OLTP, as was AMD in that promotional video. OLTP hammers the Hypertransport with tons of I/O, tons of contentious shared memory protected by spinlocks (a MESI snooping nightmare), and very large program text. I have seen no data anywhere suggesting this Socket F (Opteron 2200) TPC-C result of 139,693 TpmC was somehow achieved with 70% headroom to spare on the Hypertransport.

Specialized Hardware
Regarding the comparisons being made between the projected Barcelona numbers and today’s Xeon Tulsa, he states:

you are comparing a commodity chip with a specialized chip. those xeon processors in the ibm TPC have 16MB of L3 cache and cost about 6k a piece. amd most likely gave us the performance increase of the commodity version of barcelona, not a specialized version of barcelona. they specifically used it as a comparison, or upgrade of current socket TDP (65W,89W) parts.

What can I say about that? A specialized version of Barcelona? I’ve seen no indication of huge stepping plans, but that doesn’t matter. People run Oracle on specialized hardware. Period. If AMD had a “specialized” Barcelona in the plans, they wouldn’t have predicted a 70% increase over the Opteron 2200, particularly not in a slide about OLTP that uses published Opteron 2200 TPC-C numbers as the baseline. By the way, the only thing a 16MB cache helps with in an Oracle workload is Oracle’s code footprint; everything else is load/store operations and cache invalidations. The AMD caches are generally too small for that footprint, but because the on-die memory controller delivers excellent memory latency (thanks to Hypertransport), the small caches haven’t mattered much with Opteron 800 and Socket F, though only in comparison to older Xeon offerings. This whole blog thread, however, has been about today’s Xeons and the future Barcelona.

Large L2/L3 Cache Systems with OLTP

Regarding Tulsa Xeon processors used in the IBM System x TPC-C result of 331,087 TpmC, he writes:

the benchmark likely runs in cache on the special case hardware.

Cache-bound TPC-C? Yes, now I am convinced that his gut wasn’t telling him anything useful. I’ve been talking about TPC-C. He, being a server expert, must surely know that TPC-C cannot execute in cache. That Tulsa Xeon number at 331,087 TpmC was attached to 1,008 36.4GB hard drives in a TotalStorage SAN. Does that sound like cache to anyone?

Tomorrow’s Technology Compared to Today’s Technology
He did call for a new comparison that is worth consideration:

we all know the p4 architecture is on the way out and intel has even put an end of line date on the architecture. compare the barcelon to woodcrest

So I’ll reciprocate, gladly. Today’s Clovertown (two Woodcrest processors essentially glued together) delivers TPC-C performance of 222,117 TpmC, as seen in this audited Woodcrest TPC-C result. Since Clovertown is a quad-core processor, Oracle licensing comes to 2 licenses per socket. That means today’s Woodcrest-derived performance is 55,529 TpmC per Oracle license, compared to the projected Barcelona performance of 59,369 TpmC per Oracle license. In other words, if you wait for Barcelona you could get 7% more bang for your Oracle buck than with today’s shipping Xeon quad-core technology. And, like I said, since Barcelona is going to be plugged into a Socket F board, I’m not very hopeful the processor will get the complement of bandwidth required to achieve that projected 70% increase over the Opteron 2200.

Now, isn’t this blogging stuff just a blast? And yes, unless AMD over-achieves on their current marketing projections for Barcelona performance, I’m going to be really bummed out.

Multi-core Oracle Licensing. Proc/Sock/Core…What a Bore!

In this webpage regarding software licensing, AMD appeals to software vendors to license products by the socket as opposed to the core. I wish Oracle would go this way because the .25 (Sun T1), .50 (Intel/AMD) and .75 (Power) core factoring is tedious. The webpage specifically states:

AMD is providing industry-thought leadership by recommending software developers license their software by socket […]

It is hard to tell if this recommendation from AMD has Barcelona in mind or not. As I blogged about in this post about Oracle per-core licensing with regard to Barcelona, I think the performance per Oracle license on Barcelona will be in trouble.

How can we expect normal humans to make good decisions about server purchases for Oracle when the topic of per-core performance, as it applies to Oracle per-core licensing, is so hard to grasp? As a comment from a reader on my blog shows, some people don’t even understand the difference between the terms “processor”, “core” and “socket”. The reader commented:

check you math on the xeon system. tpc is 331,087 and the box has 4 dual core processors for a total of 8 physical processors. 331,087/8 = 41386.

now compare that to the 2 way dual core opteron system. tpc is 139,693 (multiply by 1.7 to estimate barcalona ) = 237478 for 4 physical cpus or 59367.

the barcelona@59367 > xeon@41386 by a factor of 1.44

your welcome… and i’m glad you aren’t my IT buyer.

The comment has been quoted verbatim. As for the bit about being his IT buyer, I’m sure all of you who know me well are certain I wouldn’t buy this person so much as a bottle of water, even if his hair was on fire, after a comment like that on my blog. I did follow up with even more clarification, though, because it is a difficult topic:

The Xeon system at 331,087 is 4 socket, 8 core not “8 physical processors” as you state. The terminology is very important and the term “physical processors” has generally been replaced with the term “socket.”

The Opteron number is 139,693 for 2 sockets, 4 cores. AMD expects an increase of 70% per socket, not core. So you are right, the projected Barcelona number is 1.7x or 237,478, but that would be for a 2 socket system–albeit 8 cores.

This is an Oracle blog and I’m blogging about performance per core. So I’ll reiterate:

Opteron 2200 34,923 TpmC per core (139,693/4)
Barcelona ~29,684 TpmC per core (237,478/8)
Tulsa 41,385 TpmC per core (331,087/8)

Oracle licenses by the core, so performance per Oracle license really is all that matters on this blog.

Gettimeofday() and Oracle on AMD Processors

It is pretty well known that the Oracle database relies quite heavily on gettimeofday(2) for timing everything from I/O calls to latch sleeps. The wait interface is coated with gettimeofday() calls. I’ve blogged about Oracle’s heavy reliance upon gettimeofday(2) before, such as in this entry about DBWR efficiency. In fact, Oracle’s gettimeofday() usage is so high that the boutique platforms of yesteryear went so far as to map the system clock into user space so that a simple CPP macro could read it, eliminating the function overhead and kernel dive associated with the library routine. Well, it looks like there is relief on the horizon for folks running Linux on AMD. According to this AMD webpage about RDTSCP, there is roughly a 30% reduction in processor overhead per call when using a gettimeofday() implementation based upon the new RDTSCP instruction in AMD’s Socket F-compatible processors. The webpage states:

Testing shows that on RDTSCP capable CPUs, vast improvements in the time it takes to make gettimeofday() (GTOD) calls. It takes 324 cycles per call to complete 1 million GTOD calls without RDTSCP and 221 cycles per call with the capability.

Of course, that would be a kernel-mode reduction in CPU consumption, which is even better for an Oracle database system.
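
For anyone who wants to measure this on their own gear, below is a minimal sketch of that sort of microbenchmark: time 1 million gettimeofday() calls against the processor’s timestamp counter. This is my own throwaway code, not AMD’s test harness, and the cycles-per-call figure only means something if the clock is pinned (see the cpuspeed discussion above) and the process stays on one CPU.

#include <stdio.h>
#include <sys/time.h>

/* Read the x86 timestamp counter. */
static inline unsigned long long rdtsc(void)
{
        unsigned int lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
        struct timeval tv;
        const unsigned long iters = 1000000UL; /* 1 million GTOD calls, as in AMD's test */
        unsigned long i;
        unsigned long long start, end;

        start = rdtsc();
        for (i = 0; i < iters; i++)
                gettimeofday(&tv, 0);
        end = rdtsc();

        printf("%.1f cycles per gettimeofday() call\n",
               (double)(end - start) / (double)iters);
        return 0;
}

Running the same binary under strace -c is a quick way to see whether each call actually dives into the kernel or is satisfied from the vsyscall page; if gettimeofday barely registers in the syscall counts, you are already on the fast path.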

I need to get my hands on a Socket F system to see whether the kernel support in RHEL4 U4 and the glibc side of things use this RDTSCP-enabled gettimeofday() right out of the box. If not, it might require the vgettimeofday() routine that is under development. If the latter is true, it will require Oracle to release a patch to make the correct call, but only on AMD. Hmm, porting trickery. Either way, an optimized gettimeofday() can be a nice little boost. I’ll be sure to blog about it when I have the information. In the meantime, it is nice to see folks like AMD trying to address these pain points.

Since Oracle calls gettimeofday() so frequently, and they are so very serious about Linux, I wonder why you are reading this here first?


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.