
Kevin Closson Joins EMC Data Computing Division To Focus On Greenplum Performance Engineering!

Last week the email account associated with my blog amassed no fewer than 83 emails from readers asking what I'm up to, in response to the cliff-hanger I left in my post entitled Will Oracle Exadata Database Machine Eventually Support Offload Processing for Everything?

I appreciate all the email and I regret I was unable to answer any of them as I was taking some time away with my family.

I’ve resigned from my position of the last 4 years as performance architect in Oracle’s Exadata development organization and have joined the EMC Data Computing Division to focus on Greenplum in a performance engineering role.  While this is a big and exciting news piece for me personally, I need to make this a small and quick blog entry at this time.

When Total Information Technology Failures Happen What Do You Do? I Drive for 11 Hours And Then Blog About It. Alaska Airlines / Horizon Air Computer Crash Crashes Spring Break 2011.

Alaska Airlines / Horizon Air Computer Crash Crashes Spring Break

As I set out to make this blog entry I considered using the “OT” (off-topic) preface so as to respect readers’ time in case this ended up seeming like a SPAM entry. After typing for a moment I realized this is completely on-topic.  Consider the following quotes from the below-referenced web news pieces (bold font added for effect):

The central computer system for Alaska Airlines […]

We are working to restore the computer system and to accommodate our passengers […]

The computer system is used to plan all flights […]

A statement posted on the airline’s website said technical specialists had made some progress in restoring the system since it first went down at 3 a.m. […]

All of you who are regular readers of this site know why I highlighted certain words in bold font!

Why isn't there any news coverage yet questioning the obvious lack of business continuity: the systems, procedures, and operations that should have switched over to whatever redundant system Alaska Airlines / Horizon Air must certainly have in place?

Here is my take on the computer crash that crashed spring break. Please give it a read and then, perhaps, comment on the DR/BC failure that put this whole blog entry into motion:

 

References:

CNN coverage of the Alaska Airlines / Horizon Air Computer Infrastructure Meltdown

http://www.businessweek.com/ap/financialnews/D9M72I4O0.htm

Will Oracle Exadata Database Machine Eventually Support Offload Processing for Everything?

BLOG UPDATE 24 SEP 2011: This blog entry has been viewed slightly more than 50 times per day, on average, since it was originally posted several months ago.  At this point I’d like to update the post with these words to serve as a bit of a preface to the post itself. The final point made in this post offers a glimpse into one of the technical reasons I resigned my position as Performance Architect in Oracle’s Exadata development organization. 

In my recent post entitled Exadata Database Machine: The Data Sheets Are Inaccurate! Part – I, I drew attention to the fact that there is increasing Exadata-related blog content produced by folks who know what they are talking about. I think that is a good thing, since it would be a disaster if I were the only one providing Exadata-related blog content.

The other day I saw Tanel Poder blogging about objects that are suitable targets for Smart Scan. Tanel has added bitmap indexes to his list. Allow me to quickly interject that the list of what can and cannot be scanned with Smart Scan is not proprietary information. There are views in every running Oracle Database 11g Release 2 instance that can be queried to obtain this information. Tanel's blog entry breaks no taboo.
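To make that concrete, here is a minimal sketch of the sort of query I have in mind. I'm assuming v$sqlfn_metadata here, whose OFFLOADABLE column flags the SQL functions the cells can evaluate during a Smart Scan; verify the view against your own 11.2 instance:

$ sqlplus -s '/ as sysdba' <<'EOF'
-- Count the SQL functions the storage grid can (and cannot) evaluate on
-- behalf of a Smart Scan. OFFLOADABLE is YES/NO per function.
SELECT offloadable, COUNT(*) FROM v$sqlfn_metadata GROUP BY offloadable;
EOF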

So, while Tanel is correct, I think it is also good to simply point out that the seven core Exadata fundamentals do in fact cover this topic. I’ll quote the relevant fundamentals:

Full Scan or Index Fast Full Scan.

  • The required access method chosen by the query optimizer in order to trigger a Smart Scan.

Direct Path Reads.

  • The required buffering model for a Smart Scan. The flow of data from a Smart Scan cannot be buffered in the SGA buffer pool. Direct path reads can be performed for both serial and parallel queries. Direct path reads are buffered in process PGA (heap).

So, another way Tanel could have gone about it would have been to ask, rhetorically, why wouldn’t Exadata perform a Smart Scan on a bitmap index if the plan chooses access method full? The answer would be simple—no reason. It is an index after all and can be scanned with fast full scan.  So why am I blogging about this?

Can I Add Index Organized Tables To That List?
In a recent email exchange, Tanel asked me why Smart Scan cannot attack an index organized table (IOT). Before I go into the outcome of that email exchange I’d like to revert to a fundamental aspect of Exadata that eludes a lot of folks. It’s about the manner in which data is stored in the Exadata Storage Servers and how that relates to offload processing such as Smart Scan.

Data stored in cells is striped by Automatic Storage Management (ASM) across the cells with coarse-grain striping (granularity established by the ASM allocation unit size). With Exadata, the allocation unit size by default—and best-practice—is 4MB. Therefore, tables and indexes are scattered in 4MB chunks across all the cells’ disks.
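If you want to verify the allocation unit size on your own system, a minimal sketch follows (run it against the ASM instance or a database instance using ASM; the column names are from v$asm_diskgroup, and 4194304 bytes is the 4MB best-practice value for Exadata):

$ sqlplus -s '/ as sysdba' <<'EOF'
-- Report the ASM allocation unit size for each disk group, in bytes.
col name format a30
SELECT name, allocation_unit_size FROM v$asm_diskgroup;
EOF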

Smart Scan performs multiple, asynchronous 1MB reads for allocation units (thus four 1MB asynchronous reads for adjacent 1MB storage regions). As the I/O operations complete, Smart Scan performs predicate operations (filtration) upon each storage region (1MB). If the data contained in a 1MB region references another portion of the database (e.g., a chained row), Smart Scan cannot completely process that storage region. The blocks that reference indirect data are sent to the database grid in standard block form (the same form as when reading an ASM disk on conventional storage). The database server then chases the indirection because only it has the code to map the block-level indirection to an ASM AU in some cell, somewhere. Cells cannot ask other cells for data because cells don’t know anything about each other. The storage grid of Exadata is shared-nothing.
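As an aside, you can get a rough feel for how often this happens on your own workload by looking at the chained-row statistics. The sketch below assumes statistic names along the lines of those I recall from 11.2 Exadata systems (e.g., "chained rows processed by cell"); check v$statname on your release:

$ sqlplus -s '/ as sysdba' <<'EOF'
-- Chained-row activity as seen by the cells, alongside the classic
-- "table fetch continued row" statistic maintained by the database.
col name format a45
SELECT name, value FROM v$sysstat
WHERE  name LIKE 'chained rows%' OR name = 'table fetch continued row';
EOF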

Thus far in this blog post, I've taken the recurring question of whether Smart Scan works on a certain type of object (in this case an IOT) and broadened the discussion to focus on a fundamental aspect of Exadata. So what does this broadened scope have to do with Smart Scan on IOTs? Well, when I read that email from Tanel I used logic based on the fundamentals and shot off an answer. Before that hasty reply to Tanel I recalled that IOTs have the concept of an overflow tablespace, and the concept of an overflow tablespace—in my mind—has "indirection" written all over it. Later I became more curious about IOTs, so I scanned through the Oracle source code (server side) and couldn't find any hard barriers against Smart Scan on IOT. I was stumped (trust me, that aspect of the code is not all that straightforward), so I asked the developers who own that specific part of the server. I found out my logic was faulty. I was wrong. It turns out that Smart Scan for IOT is simply not implemented. I'm not insinuating that means "not implemented yet" either. That isn't the point of this blog entry. Neither is admitting I was wrong in my original answer to Tanel. There is more to this train of thought.

Will The List Of Smart Scan Compatible Objects Keep Growing And Growing?
Neither confessing how I shot off a hasty answer to Tanel, nor the specifics of IOT Smart Scan support, is the central point of this blog entry. So, just what is my agenda? Primarily, I wanted to remind folks about the fundamental aspect of Exadata regarding indirection and Smart Scan (e.g., chained rows), and secondarily, I wanted to point out that the list of objects suitable for Smart Scan is limited for reasons other than feasibility. Time to market is important. I know that. If an object like an IOT is not commonly used in the data warehousing use case, it is unnecessary work to implement Smart Scan support for it. But therein lies the third, hidden agenda item for this post, which is to question our continual pondering over the list of objects that support Smart Scan.

Offload processing is a good thing. I wonder, is the goal to offload more and more?  Some is good, certainly more must be better in a scale-out solution. Could offload support grow to the point where Exadata nears a state of “total offload processing?”  Would that be a bad thing? Well,  “total offload processing” is, in fact, impossible since cells do not contain discrete segments of data but instead the scattering of data I wrote about above.  However, more  can be offloaded. The question is just how far does that go and what does it mean in architectural terms? Humor me for another moment in this “total offload processing” train of thought.

If, over time, “everything”—or even nearly “everything”—is offloaded to the Exadata Storage Servers there may be two problems. First, offloading more and more to the cells means the query-processing responsibility in the database grid is systematically reduced. What does that do to the architecture? Second, if the goal is to pursue offloading more and more, the eventual outcome gets dangerously close to “total offload processing.” But, is that really dangerous?

So let me ask: In this hypothetical state of “total offload processing” to Exadata Storage Servers (that do not share data by the way), isn’t the result a shared-nothing MPP?  Some time back I asked myself that very question and the answer I came up with put in motion a series of events leading to a significant change in my professional career. I’ll blog about that as soon as I can.

Exadata Database Machine: The Data Sheets Are Inaccurate! Part – I.

Yes, the title of this blog entry is a come-on. I am ever-so-slightly apologetic (smiley face).

This post follows the longest dry spell in my blogging over the last five years. I haven’t posted since early January and thus I am quite overdue for the next installment in my series regarding the Oracle Database 11g Direct NFS clonedb feature. I set out to make the next installment yesterday but before doing so I visited the analytics for my blog readership to see what’s been happening. I discovered that essentially nobody comes to this blog through Exadata related search terms anymore. That surprised me. Indeed, for the first—what—two or so years after Exadata went into general availability the first page worth of Google search results always included some of my posts. I can’t find any of my Exadata posts in the first several pages Google spoon-feeds me now when I google “Exadata.” This isn’t a wounded-soul post. I do have a point to make. Humor me for a moment while I show the top twenty search terms that have directed readers to my blog since January 1, 2011.

kevin closson 417
oracle performance 320
oracle 11g 290
oracle linux 200
oracle on flash SSD 188
oracle nfs clonedb 182
intel numa 133
oracle on nfs 122
oracle fibre channel 115
huge pages allocated 104
oracle orion 99
real application clusters 92
automatic memory management 82
oracle xeon 80
oracle i/o 78
oracle file systems 75
oracle numa 73
_enable_NUMA_support 73
greenplum versus exadata 70
oracle exadata 69

So, as far as search terms go there seems to be a lack of traffic coming to this site for Exadata-related information. The page views for my Exadata posts are high, but the search terms are lightly weighted. This means folks generally read Exadata-related material here after arriving via an unrelated search term. Oh well. I'd ordinarily say, "so what." However, it is unbelievable to me how many people ask me questions each and every day that would be unnecessary after a quick read of one of the entries I posted before Oracle Open World 2010. That post, entitled Seven Fundamentals Everyone Should Know Before Attending Openworld 2010 Sessions, might be better named You Really Need to Know This Little Bit About Exadata Before Anyone Else Tries to Tell You Anything About Exadata. Folks, if you get a moment and you care at all about Exadata, please do read that short blog entry. It will enhance your experience with your sales folks or any other such Exadata advocates. Indeed, who wants to be introduced to a technology solution by the folks trying to sell it to you? Now, don't get me wrong. I'm not saying Exadata sales folks are prone to offering misinformation. What I'm trying to say is that your interaction with sales folks will be enhanced if you aren't stuck on such remedial ground as the very definition of the product and its most basic fundamentals. That leads me to point out some of the folks who have taken the helm from me where Exadata blog content is concerned.


Oaktable Network Members Booting Up Exadata Blogging

Fellow Oaktable Network member Kerry Osborne blogs about Exadata, in addition to his current efforts to write a book on the topic. I've seen the content of his book in my role as Technical Editor. I think you will all find it a must-read regarding Exadata because it is shaping up to be a very, very good book. I have the utmost respect for fellow Oaktable Network members like Kerry. In addition to Kerry, Frits Hoogland (a recent addition to the Oaktable Network) is also producing helpful Exadata-related content. Oracle's Uwe Hesse blogs frequently about Exadata-related matters as well. So, there, I've pointed out the places people graze for Exadata content these days. But I can't stop there.

We Believe the Oracle Data Sheets
The content I've seen in blogs seems to mostly confirm the performance claims stated in Oracle Data Sheet materials. Let me put it another way. We all know the latest Exadata table/index scan rates (e.g., 25 GB/s HDD full rack or 70 GB/s combined Flash + HDD). We've seen the Data Sheets and we believe the cited throughput numbers. I have an idea—but first let me put on my sarcasm hat. I'm going to predict that the next person to blog about Exadata will start out by blogging something very close to the following:

My big_favorite_table has many columns and a bazillion rows. On disk it requires 200 gigabytes of storage but with mirroring it takes up 400 gigabytes. When I run the following query—even without Exadata Smart Flash Cache—it only takes eight seconds on my full-rack Exadata configuration to get the result:

 
SQL> select count(*) from big_favorite_table where pk_number < 0;
COUNT(*)
----------
0

Don't get me wrong. It is important for folks to validate the Data Sheet numbers with their own personal testing. But folks, please, we believe the light-scan rates are what the marketing literature states. I'm probably not alone in my desire to see blogs on users' experience in solving particularly complex analytical problems involving vast amounts of data stored in Exadata. That sort of blogging is where social networking truly adds value—you know, going "beyond the Data Sheet."
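For what it's worth, going "beyond the Data Sheet" doesn't have to mean an elaborate test harness. A minimal sketch of the sort of sanity check I mean follows; it compares bytes eligible for predicate offload with bytes actually returned by Smart Scan, using statistic names as I recall them from 11.2 (verify them in v$statname):

$ sqlplus -s '/ as sysdba' <<'EOF'
-- System-wide offload indicators. Snapshot before and after your test query
-- and difference the values rather than timing a count(*) in isolation.
col name format a60
SELECT name, value FROM v$sysstat WHERE name LIKE 'cell physical IO%';
EOF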

In Closing
So what does all this have to do with the infrequent nature of my blogging? Well, I’ll just have to leave that for a future entry. And, no, the Data Sheets on Exadata Database Machine are not inaccurate.

Oracle Database 11g Direct NFS Clonedb Feature – Part I (and a half).

In Part I of my series on Oracle Database 11g Release 2 Direct NFS Clonedb, I offered short videos with a presentation and a demonstration of the feature. I received a significant amount of email which essentially asked the same few questions. So, instead of answering a bunch of email individually, I’ll address the questions here:

Q. Why is your demonstration modeled around cloning an RMAN backup stored in a file system? All Oracle customers use ASM!
A. I’ve put out Part I as an introduction to the technology. I’ll have more on ASM later. I’m certain that not all customers use ASM. Some might even comment on this post accordingly.

Q. What are the NFS server requirements for the CLONE_FILE_CREATE_DEST?
A. Any NFS mount that supports Oracle Database 11g Direct NFS. Now, this is a bit tricky. Since we are talking about test and development instances I'm not convinced it has to be a commercial-grade filer. After all, the only data that will be stored on NFS with this model are the changed-block files (copy-on-write) and any new datafiles the test/development instance creates. I have tested on a simple HP ProLiant storage server running Linux and exporting NFS shares, but that shouldn't be misconstrued as a support statement.

Q. What is the My Oracle Support note number mentioned in the Part I?
A. The MOS note is 1210656.1. Keep an eye out for it.

Q. Where can I get the clonedb.pl script?
A. Once the MOS note is online the perl script will be available there. It is just a script that automates a few important tasks and generates SQL (very helpful by the way). I’ll offer a copy at the following link: clonedb.pl

Note, this copy of the script is suitable for clonedb usage with production 11.2.0.2 Oracle. After the performance patch (10403302) is applied this rendition of the script will not work. With that performance patch, the clonedb instance needs to boot with the new init.ora parameter clonedb set to true. The new script will generate the requisite text into the auto-generated clone init.ora.
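For those curious what that looks like, here is a hypothetical excerpt of a clone init.ora as it might appear once the performance patch is in place. Everything except the clonedb parameter is an ordinary init.ora parameter, and the names and paths are made up for illustration; the script generates the real thing:

$ cat ./clone1_init.ora
# Hypothetical excerpt only -- clonedb.pl generates the complete file.
db_name=CLONE1
control_files='/nfs/clone/CLONE1/control01.ctl'
clonedb=true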

Q. Does the RMAN backup need to be stored in an NFS mount?
A. No. The copy-on-write/thin-provisioning aspect of the feature is implemented in libodm.so. For this reason the CLONE_FILE_CREATE_DEST assignment must point to an NFS mount. The RMAN backup can be elsewhere. More on that later.

Q. What about hot RMAN backups and incremental backups?
A. As I mentioned in Part I, I’ll be going into more detail about such topics as how the snapshot features of commercial NFS filers can augment the Oracle Database 11g Direct NFS clonedb feature. I’ll go over hot backups in that post.

Summary
Folks, thus far the intent was to get introductory materials out so we can all end 2010 thinking about a new way to do something old. I’ve left questions unanswered because, after all, we are only at Part I (and a half) in the series.

Oracle Database 11g Direct NFS Clonedb Feature – Part I.

Database Clones Without Storage Snapshot/Clone Technology? Yes, Of Course! You Knew That, Didn't You?

Oracle Database 11g Release 2 has a bit of a "stealth feature" that few are aware of. The feature is called clonedb, and it is functionality built into Oracle Database 11g Direct NFS (DNFS). The best way to explain this feature is to pose a short list of questions:

  1. Do you have NFS mounts?
  2. Do you have DNFS enabled?
  3. Do you have an RMAN backup?
  4. Do you want to quickly and simply provision fully read/write database clones for development/test purposes?
  5. Do you generally provision development/test instances using your vendor NFS snapshot/clone technology?
  6. Do you find it too cumbersome to set up development/test clones using your vendor snapshot/clone technology?

If you say yes to most of these then you’ll appreciate the clonedb feature.
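Regarding item 2 in the list above, a quick way to confirm that Direct NFS is actually in play on a running instance is to query the DNFS views. A minimal sketch (v$dnfs_servers returns no rows if no files are currently open through DNFS):

$ sqlplus -s '/ as sysdba' <<'EOF'
-- List the NFS servers and exports the instance is using through Direct NFS.
col svrname format a25
col dirname format a40
SELECT svrname, dirname FROM v$dnfs_servers;
EOF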

Database Clones Without Storage Snapshot/Clone Technology?
This is Part I in the series so at this stage I’ll clearly point out that with Oracle Database 11g Direct NFS clonedb functionality you can create a fully read/write clone database without storage snapshots or clones. Moreover, the clonedb feature is a thin-provisioning approach.
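Under the covers, the thin-provisioned files are created with the dbms_dnfs package; this is one of the tasks clonedb.pl automates. A minimal sketch with hypothetical file names (the source is the backup datafile, the destination is the sparse copy-on-write file on the Direct NFS mount):

$ sqlplus '/ as sysdba' <<'EOF'
-- One call per datafile; the generated SQL does this for the whole database.
BEGIN
  dbms_dnfs.clonedb_renamefile('/backup/PROD/users01.dbf',
                               '/nfs/clone/CLONE1/users01.dbf');
END;
/
EOF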

I could type a lot more words about this new feature, but this is a Part I blog entry and since I have a video presentation introducing the feature I’ll just offer a link to it:

Summary
I’m excited about this feature. In terms of administrative effort, it is by far the easiest way to provision clones for development/test instances that I am aware of. In my assessment it is simple, stable and it performs. I don’t get to say that as often as I’d like to about technology.

Link to Part I (and a half)

Meet _enable_NUMA_support: The if-then-else Oracle Database 11g Release 2 Initialization Parameter.

Since the release of Oracle Database 11g I have made a few posts about Oracle NUMA awareness and the _enable_NUMA_support parameter. There is an index of most of those posts here.

This is a really short blog entry on a little-known fact about the Oracle Database 11g Release 2 (11.2.0.2) default value for the _enable_NUMA_support initialization parameter in the Linux x86_64 port. The following is the if-then-else logic. There aren't many initialization parameters (that I know of) that have so much logic around the default assignment.

At instance boot time, the booting foreground process performs “discovery” to see if there are Exadata Storage Servers available. If you strace instance startup you’ll see the following:

open("/etc/oracle/cell/network-config/cellinit.ora", O_RDONLY) = -1 ENOENT (No such file or directory)

Also at boot time the NUMA libraries are dynamically linked and API calls are used to determine how many NUMA nodes there are. If there are more than 4 NUMA nodes and Exadata storage is discovered, the _enable_NUMA_support parameter defaults to TRUE.

I have systems that attach to both Exadata and NFS storage at the same time, and I have databases that reside entirely in each storage type as well. For maintenance reasons I needed to sever the Exadata storage. That's why the above discovery call suffered ENOENT. That changed my default setting for _enable_NUMA_support, and in doing so my performance numbers changed dramatically, because I was not explicitly setting _enable_NUMA_support=TRUE on the Sun x4800 system I was testing. The Sun x4800 is an 8-socket Nehalem EX system, and 8-socket EX is not something you want to run without Oracle NUMA awareness. Well, at least not if the instance will be running on all processors.
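The straightforward way to stop depending on the if-then-else default is to set the parameter explicitly. A minimal sketch follows; as with any underscore parameter, do this deliberately and, ideally, with a nod from Oracle Support:

$ sqlplus '/ as sysdba' <<'EOF'
-- Underscore parameters must be double-quoted. Takes effect at the next restart.
ALTER SYSTEM SET "_enable_NUMA_support"=TRUE SCOPE=SPFILE;
EOF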

I doubt any of you would ever run into this, but I thought it was worth a blog entry.

Configuring Linux Hugepages for Oracle Database Is Just Too Difficult! Isn’t It? Part – III. Do You Really Want To Configure The Absolute Minimum Hugepages?

In part I of my recent blog series on Linux hugepages and modern Oracle releases I closed the post by saying that future installments would materialize if I found any pitfalls. I don’t like to blog about bugs, but in cases where there is little material on the matter provided elsewhere I think it adds value. First, however, I’d like to offer links to parts I and II in the series:

The pitfall I'd like to bring to readers' attention is a situation that can arise when the Oracle Database 11g Release 2 (11.2.0.2) parameter use_large_pages is set to "only", thus forcing the instance to either successfully allocate all shared memory from the hugepages pool or fail to boot. As I pointed out in parts I and II, this is a great feature. However, after an instance is booted, other processes (e.g., other Oracle instances) may also use hugepages, drawing down the number of free hugepages. Indeed, other consumers could totally deplete the hugepages pool.

So what happens to a running instance that successfully allocated its shared memory from the hugepages pool and hugepages are later externally drawn down? The answer is nothing. An instance can plod along just fine after instance startup even if hugepages continue to get drawn down to the point of total depletion. But is that the end of the story?

What Goes Up, Must (be able to) Come Down
OK, so for anyone who finds themselves in a situation where an instance is up and happy but HugePages_Free is zero, the following is what to expect:

$ sqlplus '/ as sysdba'

SQL*Plus: Release 11.2.0.2.0 Production on Wed Sep 29 17:32:32 2010

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

Connected to an idle instance.

SQL>
SQL> HOST grep -i huge /proc/meminfo
HugePages_Total:  4663
HugePages_Free:      0
HugePages_Rsvd:     10
Hugepagesize:     2048 kB

SQL> shutdown immediate
ORA-01034: ORACLE not available
ORA-27102: out of memory
Linux-x86_64 Error: 12: Cannot allocate memory
Additional information: 1
Additional information: 6422533
SQL>

Pay particular attention to the fact that sqlplus is telling us that it is attached to an idle instance! I assure you, this is erroneous. The instance is indeed up.

Yes, this is bug 10159556 (I filed it, for what it is worth). The solution is to have ample hugepages as opposed to precisely enough. Note, in another shell a privileged user can dynamically allocate more hugepages (even a single hugepage) and the instance will then be able to be shut down cleanly. As an aside, an instance in this situation can be shut down with abort. I don't mean to insinuate that this is some sort of zombie instance that will not go away.
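The workaround, while you wait on ample hugepages, is exactly what I described: from another shell, nudge the pool up by a few pages and then shut down. A minimal sketch (the count is hypothetical; the session above showed HugePages_Total of 4663, so anything a little above that will do):

$ sudo sysctl -w vm.nr_hugepages=4673
$ grep -i HugePages /proc/meminfo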

Reintroducing SLB – The Silly Little Benchmark.

BLOG UPDATE (09-JUN-2015): The link to the downloadable SLB tar archive has been updated below.

BLOG UPDATE (23-NOV-2010): Please note, I have updated the SLB tarball to address some irregularities found by readers during their testing. I need to update this post further as the numbers in the boxes below were produced with the previous SLB and are therefore no longer relevant.

A few years ago I was working on a series about AMD Opteron, HyperTransport, NUMA and what that meant to Oracle. Along the way I put out the Silly Little Benchmark (SLB) as discussed in my post entitled Oracle on Opteron with Linux-The NUMA Angle (Part III). Introducing The Silly Little Benchmark. I've had a lot of requests recently for updated copies of SLB. If you are looking for SLB, please download it at the following link:

SLB (Silly Little Benchmark) Tar Archive 09 June 2015

New SLB Kit
I’d like to point out a couple of things about the new SLB tar archive.

  1. The code has changed so results from this kit are not comparable to prior kits.
  2. The kit now performs 30 seconds of random memory reads followed by 30 seconds of random memory writes.
  3. The kit includes a wrapper script called runit.sh that runs SLB processes, each with 512MB of physical memory. The argument to runit.sh controls how many SLB processes are run on each invocation of the benchmark.
  4. The kit includes a README that shows how to compile the kit and also offers further explanation of item #3 in this list.

Previous SLB Blog Posts
The following are a few pointers to prior content that dealt with SLB in one way or the other.

Some Recent SLB Results
The following are a few updated SLB results using the new kit.

The first recent result is from a 2-socket Westmere EP (Xeon 5600) system. I passed "6" into runit.sh to see what one socket's worth of performance looks like.

$ ./runit.sh 6
Users: 6
Buffer area size  524288 KB  ADDR 0x2AD3ABA02010
Waiting for semaphore...
Total wops 239999724  Secs 30.1  Avg nsec/op 125 TPUT ops/sec 7978921.92
Total rops 399999540  Secs 30.1  Avg nsec/op 75 TPUT ops/sec 13300920.06
Buffer area size  524288 KB  ADDR 0x2B56949FA010
Waiting for semaphore...
Total wops 379999563  Secs 30.6  Avg nsec/op 80 TPUT ops/sec 12416873.45
Total rops 619999287  Secs 30.1  Avg nsec/op 48 TPUT ops/sec 20623014.23
Buffer area size  524288 KB  ADDR 0x2AAF02293010
Waiting for semaphore...
Total wops 239999724  Secs 30.2  Avg nsec/op 125 TPUT ops/sec 7939962.30
Total rops 459999471  Secs 30.1  Avg nsec/op 65 TPUT ops/sec 15257316.97
Buffer area size  524288 KB  ADDR 0x2AFADC4C9010
Waiting for semaphore...
Total wops 379999563  Secs 31.1  Avg nsec/op 81 TPUT ops/sec 12216920.78
Total rops 599999310  Secs 30.2  Avg nsec/op 50 TPUT ops/sec 19873638.29
Buffer area size  524288 KB  ADDR 0x2AEB7B430010
Waiting for semaphore...
Total wops 379999563  Secs 31.2  Avg nsec/op 82 TPUT ops/sec 12174302.22
Total rops 599999310  Secs 30.1  Avg nsec/op 50 TPUT ops/sec 19941302.38
Buffer area size  524288 KB  ADDR 0x2B6A80F63010
Waiting for semaphore...
Total wops 239999724  Secs 30.2  Avg nsec/op 125 TPUT ops/sec 7938049.67
Total rops 479999448  Secs 31.0  Avg nsec/op 64 TPUT ops/sec 15474601.85

Test Summary: Total wops 1859997861  Total rops  3159996366 Runtime seconds: 31 wops/s 59615316 rops/s 101281934

That was a bit bumpy. I re-ran it with affinity (taskset) and collected the following results:

$ taskset -pc 0-5 $$
pid 15320's current affinity list: 0-23
pid 15320's new affinity list: 0-5
$ sh ./runit.sh 6
Users: 6
Buffer area size  524288 KB  ADDR 0x2B28784C4010
Waiting for semaphore...
Total wops 379999563  Secs 31.0  Avg nsec/op 81 TPUT ops/sec 12238869.35
Total rops 499999425  Secs 30.2  Avg nsec/op 60 TPUT ops/sec 16580155.46
Buffer area size  524288 KB  ADDR 0x2B1241B67010
Waiting for semaphore...
Total wops 379999563  Secs 31.4  Avg nsec/op 82 TPUT ops/sec 12118541.38
Total rops 499999425  Secs 30.4  Avg nsec/op 60 TPUT ops/sec 16446948.61
Buffer area size  524288 KB  ADDR 0x2B4893BFD010
Waiting for semaphore...
Total wops 379999563  Secs 31.3  Avg nsec/op 82 TPUT ops/sec 12136661.49
Total rops 499999425  Secs 30.5  Avg nsec/op 60 TPUT ops/sec 16403891.60
Buffer area size  524288 KB  ADDR 0x2B94FD5AA010
Waiting for semaphore...
Total wops 379999563  Secs 31.0  Avg nsec/op 81 TPUT ops/sec 12272774.98
Total rops 519999402  Secs 30.9  Avg nsec/op 59 TPUT ops/sec 16820126.30
Buffer area size  524288 KB  ADDR 0x2B0D09454010
Waiting for semaphore...
Total wops 379999563  Secs 31.4  Avg nsec/op 82 TPUT ops/sec 12107983.29
Total rops 499999425  Secs 30.5  Avg nsec/op 61 TPUT ops/sec 16368642.72
Buffer area size  524288 KB  ADDR 0x2AAD4513E010
Waiting for semaphore...
Total wops 379999563  Secs 31.4  Avg nsec/op 82 TPUT ops/sec 12097160.14
Total rops 499999425  Secs 30.6  Avg nsec/op 61 TPUT ops/sec 16354937.65

Test Summary: Total wops 2279997378  Total rops  3019996527 Runtime seconds: 31 wops/s 72611381 rops/s 96178233

That result was a lot smoother and the wops (write ops per second) improved 22%. The rops, on the other hand, suffered a small 5% degradation. I'll blog further about that in another post.

Other Results?
It sure would be nice if folks could try this out on other platforms. I’ve compiled and run it on Power6 so I know that it works on AIX 5L.

Configuring Linux Hugepages for Oracle Database Is Just Too Difficult! Isn’t It? Part – II.

After my recent blog entry entitled Configuring Linux Hugepages for Oracle Is Just Too Difficult! Isn't It? Part I, I engaged in a couple of email threads and a thread on oracle-l about when to employ hugepages. In those exchanges I was amazed to find that it is still a borderline issue for folks. I feel it is very cut and dried, and thus I prepared the following guidelines that more or less spell it out.

  1. Reasons for Using Hugepages
    1. Use hugepages if OLTP or ERP. Full stop.
    2. Use hugepages if DW/BI with large numbers of dedicated connections or a large SGA. Full stop.
    3. Use hugepages if you don’t like the amount of memory page tables are costing you (/proc/meminfo). Full stop.
  2. SGA Memory Management Models
    1. AMM does not support hugepages. Full stop.
    2. ASMM supports hugepages.
  3. Instance Type
    1. ASM uses AMM by default. ASM instances do not need hugepages. Full stop.
    2. All non-ASM instances should be considered candidates for hugepages. See items 1.1 through 1.3 above.
  4. Configuration
    1. Limits (multiple layers)
      1. /etc/security/limits.conf establishes limits for hugepages for processes. Note, setting these values does not pre-allocate any resources.
      2. The shell ulimit (ulimit -l) also establishes locked-memory limits for processes.
  5. Allocation
    1. /etc/sysctl.conf vm.nr_hugepages allocates memory to the hugepages pool.
  6. Sizing
    1. Read MOS note 401749.1 for information on tools available to aid in the configuration of vm.nr_hugepages. A minimal configuration sketch follows this list.
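Here is the minimal configuration sketch promised above, covering list items 4 and 5. The memlock and nr_hugepages values are hypothetical and sized for a roughly 9GB hugepages pool of 2MB pages; MOS 401749.1 covers sizing it properly:

$ # Item 4.1: allow the oracle user to lock enough memory (values are in KB).
$ grep oracle /etc/security/limits.conf
oracle   soft   memlock   9437184
oracle   hard   memlock   9437184
$ # Item 5.1: allocate the hugepages pool itself, then apply and verify.
$ grep nr_hugepages /etc/sysctl.conf
vm.nr_hugepages = 4608
$ sudo sysctl -p
$ grep -i huge /proc/meminfo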

To make the point of how urgently  Oracle DBAs need to qualify their situation against list items 1.1 through 1.3 above, please consider the following quote from an internal email I received. The email is real and the screen output came from a real customer system. Yes, 120+ gigabytes of memory wasted in page tables. Fact is often stranger than fiction!

And here is an example of kernel pagetables usage, with a 24GB SGA, and 6000+ connections ..  with no hugepages in use .

# grep PageT /proc/meminfo

PageTables:   123731372 kB

Link to Part III in this series:

Configuring Linux Hugepages for Oracle Database is Just Too Difficult! Isn’t It? Part – III.

Exadata Database Machine X2-2 or X2-8? Sure! Why Not? Part II.

In my recent post entitled Exadata Database Machine X2-2 or X2-8? Sure! Why Not? Part I, I started to address the many questions folks are sending my way about what factors to consider when choosing between Exadata Database Machine X2-8 versus Exadata Database Machine X2-2. This post continues that thread.

As my friend Greg Rahn points out in his recent post about Exadata, the latest Exadata Storage Server is based on Intel Xeon 5600 (Westmere EP) processors. The Exadata Storage Server is the same whether the database grid is X2-2 or X2-8. The X2-2 database hosts are also based on Intel Xeon 5600. On the other hand, the X2-8 database hosts are based on Intel Xeon 7500 (Nehalem EX). This is a relevant distinction when thinking about database encryption.

Transparent Database Encryption

In his recent post, Greg brings up the topic of Oracle Database Transparent Data Encryption (TDE). As Greg points out, the new Exadata Storage Server software is able to leverage Intel Advanced Encryption Standard New Instructions (Intel AES-NI) through the Intel Integrated Performance Primitives (Intel IPP) library, because the processors in the storage servers are Intel Xeon 5600 (Westmere EP). Think of this as "hardware-assist." However, in the case of the database hosts in the X2-8, there is no hardware-assist for TDE as Nehalem EX does not offer support for the necessary instructions. Westmere EX will—someday. So what does this mean?

TDE and Compression? Unlikely Cousins?

At first glance one would think there is nothing in common between TDE and compression. However, in an Exadata environment there is storage offload processing and for that reason roles are important to understand. That is, understanding what gets done is sometimes not as important as who is doing what.

When I speak to people about Exadata I tend to draw the mental picture of an “upper” and “lower” half. While the count of servers in each grid is not split 50/50 by any means, thinking about Exadata in this manner makes understanding certain features a lot simpler. Allow me to explain.

Compression

In the case of compressing data, all work is done by the upper half (the database grid). On the other hand, decompression effort takes place in either the upper or lower half depending on certain criteria.

  • Upper Half Compression. Always.
  • Lower Half Compression. Never.
  • Lower Half Decompression. Data compressed with Hybrid Columnar Compression (HCC) is decompressed in the Exadata Storage Servers when accessed via Smart Scan. Visit my post about what triggers a Smart Scan for more information.
  • Upper Half Decompression. With all compression types, other than HCC, decompression effort takes place in the upper half. When accessed without Smart Scan, HCC data is also decompressed in the upper half.

Encryption

In the case of encryption, the upper/lower half breakout is as follows:

  • Upper Half Encryption. Always. Data is always encrypted by code executing in the database grid. If the processors are Intel Xeon 5600 (Westmere EP), as is the case with X2-2, there is hardware assist via the IPP library. The X2-8 is built on Nehalem EX and therefore does not offer hardware-assist encryption.
  • Lower Half Encryption. Never.
  • Lower Half Decryption. Smart Scan only. If data is not being accessed via Smart Scan the blocks are returned to the database host and buffered in the SGA (see the Seven Fundamentals). Both the X2-2 and X2-8 are attached to Westmere EP-based storage servers. To that end, both of these configurations benefit from hardware-assist decryption via the IPP library. I reiterate, however, that this hardware-assist lower-half decryption only occurs during Smart Scan.
  • Upper Half Decryption. Always in the case of data accessed without Smart Scan. In the case of X2-2, this upper-half decryption benefits from hardware-assist via the IPP library.

That pretty much covers it and now we see commonality between compression and encryption. The commonality is mostly related to whether or not a query is being serviced via Smart Scan.

That’s Not All

If HCC data is also stored in encrypted form, a Smart Scan is able to filter out vast amounts of encrypted data without even touching it. That is, HCC short-circuits a lot of decryption cost. And, even though Exadata is really fast, it is always faster to not do something at all than to shift into high gear and do it as fast as possible.

Intel Sandy Bridge Architecture

I recommend a visit to David Kanter’s Real World Technologies site. David recently published an excellent article on Intel Sandy Bridge architecture.

Intel Sandy Bridge Architecture

Pay close attention to the diagram on page 2. You’ll see this architecture includes an on-die (uncore) PCI controller.

And here I am just chomping at the bits for Westmere EX!

David, sorry I wasn’t able to meet up with you at OOW 2010.

Configuring Linux Hugepages for Oracle Database Is Just Too Difficult! Isn’t It? Part – I.

Allocating hugepages for Oracle Database on Linux can be tricky. The following is a short list of some of the common problems associated with faulty attempts to get things properly configured:

  1. Insufficient Hugepages. You can be short just a single 2MB hugepage at instance startup and Oracle will silently fall back to no hugepages. For example, if an instance needs 10,000 hugepages but there are only 9,999 available at startup, Oracle will create non-hugepages IPC shared memory and the 9,999 (x 2MB) is just wasted memory.
    1. Insufficient hugepages is an even more difficult situation when booting with _enable_NUMA_support=TRUE, as partial hugepages backing is possible.
  2. Improper Permissions. Both the limits.conf(5) memlock setting and the shell ulimit -l must accommodate the desired amount of locked memory, as shown in the quick check below.
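Here is the quick check mentioned in item 2, run as the user that starts the instance. ulimit -l reports the locked-memory limit in kilobytes ("unlimited" is also fine); the grep shows what limits.conf will hand out at the next login:

$ ulimit -l
$ grep memlock /etc/security/limits.conf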

In general, list item 1 above has historically been the most difficult to deal with—especially on systems hosting several instances of Oracle. Since there is no way to determine whether an existing segment of shared memory is backed with hugepages, diagnostics are in short supply.

Enter Oracle Database 11g Release 2 (11.2.0.2). The fix for Oracle bugs 9195408 (unpublished) and 9931916 (published) is available in 11.2.0.2. In a sort of fast forward to the past, the Linux port now supports an initialization parameter to force the instance to use hugepages for all segments or fail to boot. I recall initialization parameters on Unix ports back in the early 1990s that did just that. The initialization parameter is called use_large_pages, and setting it to "only" results in the all-or-none scenario. This, by the way, addresses list item 1.1 above. That is, setting use_large_pages=only ensures an instance will not have some NUMA segments backed with hugepages and others without.

Consider the following example. Here we see that use_large_pages is set to "only" and yet the system has only a very small number of hugepages allocated (800 == ~1.6GB). First I'll boot the instance using an init.ora file that does not force hugepages and then move on to using the one that does. Note, this is 11.2.0.2.

$ sqlplus '/ as sysdba'

SQL*Plus: Release 11.2.0.2.0 Production on Tue Sep 28 08:10:36 2010

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

Connected to an idle instance.

SQL>
SQL> !grep -i huge /proc/meminfo
HugePages_Total:   800
HugePages_Free:    800
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
SQL>
SQL> !grep large_pages y.ora x.ora
use_large_pages=only
SQL>
SQL> startup force pfile=./x.ora
ORACLE instance started.

Total System Global Area 4.4363E+10 bytes
Fixed Size                  2242440 bytes
Variable Size            1406199928 bytes
Database Buffers         4.2950E+10 bytes
Redo Buffers                4427776 bytes
Database mounted.
Database opened.
SQL> HOST date
Tue Sep 28 08:13:23 PDT 2010

SQL>  startup force pfile=./y.ora
ORA-27102: out of memory
Linux-x86_64 Error: 12: Cannot allocate memory

The user feedback is a trite ORA-27102. So the question is, which memory cannot be allocated? Let's take a look at the alert log:

Tue Sep 28 08:16:05 2010
Starting ORACLE instance (normal)
****************** Huge Pages Information *****************
Huge Pages memory pool detected (total: 800 free: 800)
DFLT Huge Pages allocation successful (allocated: 512)
Huge Pages allocation failed (free: 288 required: 10432)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
NUMA Huge Pages allocation on node (1) (allocated: 3)
Huge Pages allocation failed (free: 285 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
Huge Pages allocation failed (free: 285 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
NUMA Huge Pages allocation on node (1) (allocated: 192)
NUMA Huge Pages allocation on node (1) (allocated: 64)

That is good diagnostic information. It informs us that the variable portion of the SGA was successfully allocated and backed with hugepages. It just so happens that my variable SGA component is precisely sized to 1GB. That much is simple to understand. After creating the segment for the variable SGA component, Oracle moves on to create the NUMA buffer pool segments. This is a 2-socket Nehalem EP system and Oracle allocates from the Nth NUMA node and works back to node 0. In this case the first buffer pool creation attempt is for node 1 (socket 1). However, there were insufficient hugepages, as indicated in the alert log. In the following example I allocated another arbitrarily insufficient number of hugepages and tried to start an instance with use_large_pages=only. This particular insufficient-hugepages scenario allows us to see more interesting diagnostics:

SQL>  !grep -i huge /proc/meminfo
HugePages_Total: 12000
HugePages_Free:  12000
HugePages_Rsvd:      0
Hugepagesize:     2048 kB

SQL> startup force pfile=./y.ora
ORA-27102: out of memory
Linux-x86_64 Error: 12: Cannot allocate memory

…and, the alert log:

Starting ORACLE instance (normal)
****************** Huge Pages Information *****************
Huge Pages memory pool detected (total: 12000 free: 12000)
DFLT Huge Pages allocation successful (allocated: 512)
NUMA Huge Pages allocation on node (1) (allocated: 10432)
Huge Pages allocation failed (free: 1056 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
Huge Pages allocation failed (free: 1056 required: 10368)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
Huge Pages allocation failed (free: 1056 required: 5184)
Startup will fail as use_large_pages is set to "ONLY"
******************************************************
NUMA Huge Pages allocation on node (0) (allocated: 704)
NUMA Huge Pages allocation on node (0) (allocated: 320)

In this example we see that 12,000 hugepages were sufficient to back the variable SGA component and only one of the NUMA buffer pools (remember this is Nehalem EP with OS boot string numa=on).

Summary

In my opinion, this is a must-set parameter if you need hugepages. With initialization parameters like use_large_pages, configuring hugepages for Oracle Database is getting a lot simpler.

Next In Series

  1. “[…] if you need hugepages”
  2. More on hugepages and NUMA
  3. Any pitfalls I find.

More Hugepages Articles

Link to Part II in this series: Configuring Linux Hugepages for Oracle Database Is Just Too Difficult! Isn't It? Part – II.

Link to Part III in this series: Configuring Linux Hugepages for Oracle Database is Just Too Difficult! Isn't It? Part – III.

And more:

  • Quantifying hugepages Memory Savings with Oracle Database 11g
  • Little Things Doth Crabby Make – Part X. Posts About Linux Hugepages Makes Some Crabby It Seems. Also, Words About Sizing Hugepages.
  • Little Things Doth Crabby Make – Part IX. Sometimes You Have To Really, Really Want Your Hugepages Support For Oracle Database 11g.
  • Little Things Doth Crabby Make – Part VIII. Hugepage Support for Oracle Database 11g Sometimes Means Using The ipcrm Command. Ugh.
  • Oracle Database 11g Automatic Memory Management – Part I.

Exadata Database Machine X2-2 or X2-8? Sure! Why Not? Part I.

I've been getting a lot of questions about why one would choose Exadata Database Machine X2-8 over Exadata Database Machine X2-2. That's actually a tough question; however, some topics do spring to mind. I'll start a list:

  1. The Exadata Database Machine X2-8 only comes in full-rack configurations. No way to “start small.”
  2. The Exadata Database Machine X2-2 only (immediately) supports Oracle Linux. If Solaris is attractive to you then the X2-2 is not an option at the time of this blog entry. That is slated to change soon.
  3. Database Host RAM. The aggregate database grid RAM in a full-rack X2-2 system is 768 GB, but 2 TB with the X2-8. The list of areas that benefit from the additional memory is quite long. Such topics as large user counts (consolidation or otherwise), join processing, and very large SGAs come to mind. And, regarding large SGAs, don't forget that the Exadata Database Machine supports in-memory Parallel Query as well.

Not on the numbered list is the more sensitive topic of processor power. While these sorts of things are very workload-dependent, I’d go with 16 Intel Xeon 7500 (Nehalem EX) processors over 16 Intel Xeon 5600 (Westmere EP) for most any workload.

So, readers, what reasons would motivate you in one direction or the other?

Intel Xeon 7500 (Nehalem EX) Finds Its Way Into Exadata Database Machine. So Does Solaris!

Many folks have been wondering about when, or if, Oracle will integrate servers based on the Intel Xeon 7500 (Nehalem EX) family of processors into Exadata. As of this morning, there are two freshly-announced packaging options:

  • Exadata Database Machine X2-8 HP Full Rack
  • Exadata Database Machine X2-8 HC Full Rack

Both of these configuration options offer two 8-socket Xeon 7500 systems each with 64 processor cores and 1 TB of physical memory. Also included is support for 8 paths of 10GbE connectivity.

The HP/HC identifiers stand for High Performance and High Capacity respectively and relate to the hard drive options available in the Exadata Storage Servers. The High Performance option is based on the 15,000 RPM 600 GB SAS drives and the High Capacity storage option is based on 2TB 7,200 RPM SAS drives. No SATA option.

The two 8-socket servers are attached to 14 Exadata Storage Servers each based on the Xeon 5600 (Westmere EP) family of Intel processors with 12 hard drives and the same complement of Exadata Smart Flash Cache as was available in the previous generation of X2-2 offerings (384 GB per Storage Server).

I shouldn’t think folks are too surprised that Xeon 7500-based servers have found their way into Exadata packaging. Oracle has been shipping the Sun Fire x4800 for some time now as I blogged in my post entitled Will Oracle Ever Release Sun Servers Based On Westmere EP and Nehalem EX Processors? Yes. However, folks may be (pleasantly) surprised to hear that the Exadata Database Machine X2-8 will offer support for either Solaris 11 Express or Oracle Linux 5 Update 5 with the Unbreakable Enterprise Linux kernel.



