Archive for the 'General Oracle Topics' Category

Oracle11g Automatic Memory Management – Part III. A NUMA Issue.

Now I’m glad I did that series about Oracle on Linux, The NUMA Angle. In my post about the difference between NUMA and SUMA and “Cyclops”, I shared a lot of information about the dynamics of Oracle running with all the SGA allocated from one memory bank on a NUMA system. Déjà vu.

Well, we’re at it again. As I point out in Part I and Part II of this series, Oracle implements Automatic Memory Management in Oracle Database 11g with memory mapped files in /dev/shm. That got me curious.

Since I exclusively install my Oracle bits on NFS mounts, I thought I’d sling my 11g ORACLE_HOME over to a DL385 I have available in my lab setup. Oh boy am I going to miss that lab when I take on my new job September 4th. Sob, sob. See, when you install Oracle on NFS mounts, the installation is portable. I install 32bit Linux ports via a 32bit server into an NFS mount and I can take it anywhere. In fact, since the database is on an NFS mount (HP EFS Clustered Gateway NAS) I can take ORACLE_HOME and the database mounts to any system running RHEL4, and that includes RHEL4 x86_64 servers even though the ORACLE_HOME is 32bit. That works fine, except 32bit Oracle cannot use libaio on 64bit RHEL4 (unless you invoke everything under the linux32 command environment, that is). I don’t care about that since I use either Oracle Disk Manager or, better yet, Oracle11g Direct NFS. Note, running 32bit Oracle on a 64bit Linux OS is not supported for production, but for my case it helps me check certain things out. That brings us back to /dev/shm on AMD Opteron (NUMA) systems. It turns out the only Opteron system I could test 11g AMM on happens to have x86_64 RHEL4 installed but, again, no matter.

Quick Test

[root@tmr6s5 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3585 MB
node 1 size: 4095 MB
node 1 free: 3955 MB
[root@tmr6s5 ~]# dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024
1024+0 records in
1024+0 records out
[root@tmr6s5 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3585 MB
node 1 size: 4095 MB
node 1 free: 2927 MB

Uh, that’s not good. I dumped some zeros into a file on /dev/shm and all the memory was allocated from socket 1. Lest anyone forget from my NUMA series (you did read that didn’t you?), writing memory not connected to your processor is, uh, slower:

[root@tmr6s5 ~]# taskset -pc 0-1 $$
pid 9453's current affinity list: 0,1
pid 9453's new affinity list: 0,1
[root@tmr6s5 ~]# time dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
1024+0 records in
1024+0 records out

real    0m1.116s
user    0m0.005s
sys     0m1.111s
[root@tmr6s5 ~]# taskset -pc 1-2 $$
pid 9453's current affinity list: 0,1
pid 9453's new affinity list: 1
[root@tmr6s5 ~]# time dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
1024+0 records in
1024+0 records out

real    0m0.931s
user    0m0.006s
sys     0m0.923s

Yes, 20% slower.
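For anyone who wants to check that figure, here is a minimal sketch of the arithmetic. The helper name pct_slower is made up for illustration; the two inputs are the "real" times from the slower and the faster dd run above.

```shell
#!/bin/sh
# pct_slower SLOW FAST (hypothetical helper): print how much longer the
# SLOW run took than the FAST run, as a percentage of the FAST run.
pct_slower() {
    awk -v slow="$1" -v fast="$2" \
        'BEGIN { printf "%.1f\n", (slow - fast) / fast * 100 }'
}

# Elapsed ("real") seconds from the two timed dd runs above.
pct_slower 1.116 0.931
```

With those two timings the helper prints 19.9, which rounds to the 20% quoted above.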

What About Oracle?
So, like I said, I mounted that ORACLE_HOME on this Opteron server. What does an AMM instance look like? Here goes:

SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3587 MB
node 1 size: 4095 MB
node 1 free: 3956 MB
SQL> startup pfile=./amm.ora
ORACLE instance started.

Total System Global Area 2276634624 bytes
Fixed Size                  1300068 bytes
Variable Size             570427804 bytes
Database Buffers         1694498816 bytes
Redo Buffers               10407936 bytes
Database mounted.
Database opened.
SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 1331 MB
node 1 size: 4095 MB
node 1 free: 3951 MB

Ick. This means that Oracle11g AMM on Opteron servers is a Cyclops. Odd how this allocation came from memory attached to socket 0 when the file creation with dd(1) landed in socket 1’s memory. Hmm…
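Eyeballing two numactl listings gets old fast, so here is a rough sketch that diffs the per-node free memory between two saved snapshots. The function name node_delta and the temporary files are assumptions for illustration; the sample figures are the before and after listings from the startup above.

```shell
#!/bin/sh
# node_delta BEFORE AFTER (hypothetical helper): both arguments are files
# holding `numactl --hardware` output. Prints the MB consumed on each node.
node_delta() {
    awk '$1 == "node" && $3 == "free:" {
             if (FNR == NR) before[$2] = $4    # first file: remember free MB
             else printf "node %s used %d MB\n", $2, before[$2] - $4
         }' "$1" "$2"
}

# The free-memory lines from the two listings above, saved to temp files.
before=$(mktemp)
after=$(mktemp)
cat > "$before" <<'EOF'
node 0 free: 3587 MB
node 1 free: 3956 MB
EOF
cat > "$after" <<'EOF'
node 0 free: 1331 MB
node 1 free: 3951 MB
EOF

result=$(node_delta "$before" "$after")
echo "$result"
rm -f "$before" "$after"
```

On a live system the snapshots would come from numactl --hardware > before.txt and so on. For the listings above it reports 2256 MB taken from node 0 and 5 MB from node 1, against an SGA of about 2,171 MB (2276634624 bytes).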

What to do? SUMA? Well, it seems as though I should be able to interleave tmpfs memory and use that for /dev/shm, at least according to the tmpfs documentation. And should is the operative word. I have spent a half hour trying to get the mpol=interleave mount option to work (with and without the -o remount technique), to no avail. Bummer!
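For reference, this is roughly the shape of what I was attempting, plus a small check of what the kernel says afterwards. Treat it as a sketch under assumptions: the mount commands in the comments follow the tmpfs documentation and need root, the helper name shm_mpol is made up, and it assumes the kernel lists a non-default policy among the mount options in /proc/mounts, which is worth verifying on your kernel.

```shell
#!/bin/sh
# Mount attempts per the tmpfs documentation (need root, not run here):
#   mount -o remount,mpol=interleave /dev/shm
#   mount -t tmpfs -o mpol=interleave tmpfs /dev/shm

# shm_mpol MOUNTPOINT (hypothetical helper): read /proc/mounts-style text on
# stdin and print the mpol= value from that mount's options, "default" when
# the mount is present without one, or "not mounted" when it is absent.
shm_mpol() {
    awk -v mp="$1" '
        $2 == mp {
            policy = "default"
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^mpol=/) policy = substr(opts[i], 6)
        }
        END { print (policy == "" ? "not mounted" : policy) }'
}

# Illustrative input only: a made-up mounts line, not captured from my system.
printf 'tmpfs /dev/shm tmpfs rw,mpol=interleave:0-1 0 0\n' | shm_mpol /dev/shm
```

On a real box the check would be shm_mpol /dev/shm < /proc/mounts. If it still says default after the remount, the option did not take.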

Impact
If AMD can’t get the Barcelona and/or Budapest Quad-core off the ground (and into high-quality servers from HP/IBM/DELL/Verari), none of this will matter. Actually, come to think of it, unless Barcelona is really, really fast, you won’t be sticking it into your existing Socket F motherboards because that doubles your Oracle license fee (unless you are on standard edition which is priced on socket count). That leaves AMD Quad-core adopters waiting for HyperTransport 3.0 as a remedy. I blogged all this AMD Barcelona stuff already.

Given the NUMA characteristics of /dev/shm, I think I’ll test AMM versus MMM on NUMA, and then test again on SUMA, if I can find the time.

If anyone can get /dev/shm mounted with the mpol option, please let me know because, at times, I can be quite a dolt and I’d love this to be one of them.

Oracle’s Latest Filesystem Offering. Shades of AdvFS? I Want My OLT!

In this press release, Oracle has started making the existence of the Oracle Linux Test Kit a bit more widely known. I’ve been playing with this kit for a little while now and planned to blog my experiences. It has test components that work against Oracle instances. I’ll blog about my findings as soon as I can.

Long Live AdvFS?
The press release also announces Oracle’s new open source file system called Btrfs. If it was just another file system I wouldn’t mention it in my blog, but the project page mentions plans for features that I think are absolutely critical such as:

  1. Space efficient packing of small files
  2. Writable snapshots
  3. Object level mirroring and striping
  4. Strong integration with device mapper for multiple device support
  5. Efficient incremental backup and FS mirroring

This thing sure smells like a reincarnation of AdvFS, but what do I know?

The project page states that the developers are not even looking into database workloads at the moment but since it is an extent based filesystem I see no reason it wouldn’t work quite well with Oracle databases. I’m most keen on the list of features I’ve listed above though because they go a long way to make life better in production Oracle environments. First, number 1 on the list is good handling of small files. This will be a boon for the handling of trace and logging and a lot of the other ancillary file creation that goes on external to the database. Small files (at least myriads of them) are the bane of most filesystems.

Next on the list are clones or, as the project page calls them, writable snapshots. Veritas has offered these for quite some time and they are a very nice feature for creating working copies of databases that are current up to the moment you create the snapshot. Creating a snapshot doesn’t impact performance and a good implementation of clones will not impact the real database even if you are doing a reasonable amount of writes on the clone. Also very cool!

Then there is number 3 on the list: object level mirroring and striping. When all good filesystems “grow up” they offer this sort of feature. Being able to implement software RAID on a file basis is the ultimate cool and I can’t wait to give this one a play! Being able to hand pick the Oracle files (or general purpose files) to which varying software RAID characteristics apply is a very nice feature.

Number 4 on the list is Device Mapper support. I’ve ranted on my blog and other forums about Linux handling of device names for quite some time and DM goes a long way to address the pet peeves. Seeing this filesystem exploit DM at design level is a good thing.

Finally, number 5 above suggests good support for filesystem mirroring. This technology has proven itself useful for Veritas and NetApp customers, as well as others. It’s good to see it in Btrfs.

Clustered Btrfs?
Some time back I made a blog entry about ZFS and discussed the notion of “cluster butter.” That notion refers to the idea that you can take any filesystem and “clusterize” it. Generally, adding cluster support after the design phase does not produce much of a cluster filesystem. So, I intend to dive in to see what underpinnings there are in Btrfs for future cluster support.

1.2 Transactions Per Second! Enterprise Software is Infinitely Partitionable

I read a post on blogs.zdnet.com about MySQL that I think was interesting. In the post, Dana Blankenhorn posits that MySQL is “enterprise class” using the Booking.com deployment as case-in-point.

What is “Enterprise Class?”
The post got me thinking. What is “Enterprise Class” anyway? Is it any software used in any enterprise datacenter? I tend to think of an enterprise class database server as one that can vertically scale to exploit the largest servers in support of a single, large application. Using those criteria leaves MySQL out I should think. Or am I behind the times on that? Are there any single MySQL databases running on a 64CPU Superdome for instance? It appears as though MySQL is supported on Itanium HP-UX for 2-processor systems.

Enterprise MySQL
In this computerworlduk.com article, it looks as though Booking.com uses something like 20 MySQL database servers to handle “tens of thousands” of bookings for 30,000 hotels spanning some 8,000 destinations. Let’s say for the sake of argument that it is 20 database servers and “tens of thousands” is 100,000 bookings per day. I admit I don’t know anything about the richness of this application, but I don’t see anything too brutal here. These sorts of applications lend themselves to partitioning naturally. It wouldn’t surprise any of us Oracle types to find out that they partition based upon hotel. That seems like a natural line to partition on. If that is the case, I get 1,500 hotels per database server, with the servers splitting a total of about 1.2 transactions per second (100,000 bookings / 86,400 seconds in a day). I know these things are not that simple, but folks, we are talking about 20 database servers. Even if they are 2-socket/dual core systems you’ve got some 80 cores to work with! At first glance it just doesn’t seem as though these systems would be working that hard. And MySQL? Well, it doesn’t have to work that hard at all since the workload is partitionable. Who knows, maybe all workloads are partitionable and we Oracle-types are just missing the ball. Anyway, I can’t seem to find what storage engine is being used at Booking.com. And speaking of MySQL storage engines…
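The back-of-the-envelope numbers above are easy to reproduce. This is only a sketch of my own assumptions (20 servers, 100,000 bookings per day, 2 sockets with 2 cores each), not anything Booking.com has published, and the variable names are made up for illustration.

```shell
#!/bin/sh
# Assumed inputs, taken from the guesses in the paragraph above.
servers=20
hotels=30000
bookings_per_day=100000
sockets=2
cores_per_socket=2

# Integer arithmetic for the counts, awk for the fractional rates.
hotels_per_server=$((hotels / servers))
total_cores=$((servers * sockets * cores_per_socket))
tps=$(awk -v b="$bookings_per_day" 'BEGIN { printf "%.2f", b / 86400 }')
tps_per_server=$(awk -v b="$bookings_per_day" -v s="$servers" \
    'BEGIN { printf "%.3f", b / 86400 / s }')

echo "hotels per server:      $hotels_per_server"
echo "total cores:            $total_cores"
echo "bookings/sec (all):     $tps"
echo "bookings/sec (per box): $tps_per_server"
```

That works out to 1,500 hotels and 0.058 bookings per second per server, 1.16 bookings per second overall, and 80 cores to do it with.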

A 3-legged Pink Elephant
If you’re interested in 3-legged pink elephants, I’ve got one for you since we are on the topic of MySQL. In this computerworlduk.com article we find that MySQL announced support for MySQL on IBM System i (yes, OS/400) with DB2 as the storage engine. Wow, that would be weird. Or it seems so at least.

What’s this Really Have to do with Oracle?
Oracle Database can do everything MySQL can do. The opposite is not true. ‘Nuff said. Oh, did I mention that Oracle Corporation is not a “Database Company” anymore? They’ve got the database, and now they are getting everything else.

 


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.


Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.
