Archive for the 'oracle' Category



Webcast Announcement: Oracle Exadata Storage Server Technical Deep Dive – Part II.

Oracle Exadata Storage Server Technical Deep Dive – Part II.

Thursday, April 16, 2009 12:00 PM – 1:00 PM CDT

This is the second webinar in the series of the Oracle Exadata Storage Server Technical Deep Dive. Kevin Closson will offer a recap of his exciting first webinar on Exadata Storage Server and HP Oracle Database Machine internals and performance characteristics. He will revisit Unanswered Questions from Part I and also offer a new segment:

What About All That “Brainy Software Part I”

– Examination of Index Creation

– Index Smart Scan

The session will conclude with Q&A. Because this is a series, Kevin will try to make Part II feel a bit more “town hall”-like, giving Q&A higher priority than time permitted in Part I.

Oracle Exadata Storage Server Technical Deep Dive: Part II. (Registration Link)

Don’t Blog About What You Intend To Blog About!

Well, after that “memory lane” post I just made (about SGI), and feeling a bit hungry, there is an old blog post of mine that comes to mind. I know it is a faux pas to blog-about-what-you-are-going-to-blog-about, but although I’m working on a good technical entry, it isn’t buttoned up quite yet. So, yes, I’m blogging about the fact that I will soon be blogging about something good (it has to do with column ordinality, but the Vertica guys shouldn’t get too excited).

So, like I said, I’m hungry and reminiscing and wishing I had the time tonight to prepare something worthwhile, but I don’t…but I could, and routinely do:

Wildfleisch Ragout mit Champignon (wild-game ragout with mushrooms)

When Sun Microsystems Got Their First Big System It Was No April Fool’s Day Joke

I don’t think this (SGI Sold for 25 Million Dollars) is an April Fool’s Day joke either!

Wow, what a wild ride that has been. See, SGI holds a special place in my heart. While working in Sequent Computer Systems’ Advanced Oracle Engineering Group in the mid-1990s, I recall SGI selling the technology assets that included Cray’s CS6400 to Sun Microsystems for what was rumored to be about $50 million. That was Sun’s first big system (a.k.a. the UE 10000), thank you very much. Not that the UE 6000 was a loaf, but the UE 6000 was not about to stand toe-to-toe with a period Sequent NUMA-Q 2000, or hold a candle to it for that matter. Before the UE10K, Sun systems were “quick” but very limited-bandwidth machines. It is fairly well known that Sequent management of the time didn’t think to buy and burn the CS6400 technology like they should have. It was, after all, developed no more than 400 meters from Sequent’s HQ. Figuring out a way to buy and burn that system house would have been a better “waste” of money than “The Dragster.”

If only someone, anyone, besides Sun would have bought that CS6400 division… if only…

Sun Microsystems went on to sell over 1,000 of those CS6400 (UE10K) jobbies per year for an annual take of over a billion dollars.

Memories… but it has all brought NUMA back to the forefront of my thinking today…

Enter, Nehalem.

PS. I need to point out to my Oaktable Network friends that this post was indeed a part of my ex-Sequent 12-step program.

What Good Are “Vendor Blogs” Anyway?

Curt Monash is a prolific writer and analyst who maintains several blogs and routinely contributes to online publications. I try to keep up on his writings at DBMS2 as there is plenty of interesting DW/BI-related content there.

In a recent post in Text Technologies Blog, Curt was making some points about what effect social media might have on the future of the “information ecosystem.”  When referring to “Vendor Blogs”, Curt had the following to say:

Presenters of news. Vendors with stories to tell will take increasing responsibility for telling them deeply and well. Their economic motivation is obvious. And sometimes it goes beyond money. One of the most effective vendor blogs is surely Kevin Closson’s, and I know from talking with Kevin’s boss’s boss that Oracle was as surprised as anybody when his blog burst into popularity.

I was quite surprised to see Curt mention my blog out of seemingly thin air because (admittedly to my discredit) he and I have locked antlers a couple of times as a result of my zeal for all things Exadata.

But I’m not blogging about any of that.

Rockets: Red Glare. Blogs: Bursting in Air (or Popularity)
I thought Curt’s quote of my boss’s boss’s surprise was interesting, and it got me curious. Has my blog “burst into popularity?”

I started blogging 11 months before I joined Oracle. So I thought I’d check my WordPress statistics to see what the average page viewing traffic was for:

  • The six months leading up to my starting date with Oracle
  • The 12 months prior to the release of Exadata
  • The 30 days following the release of Exadata
  • The most recent 6 months (from the Exadata release through today)

I’ll treat the average of the first row in the following table as the baseline and represent all other figures relative to that baseline. The data:

Time Frame                           Average Page View Units
6 Months Prior to Joining Oracle     1.000
6 Months Prior to Exadata Release    0.876
30 Days After Exadata Release        1.460
6 Months After Exadata Release       0.986

The 30 days following the release of Exadata was a roller coaster with a 46% jump in reader activity, but I have to admit that losing 1.4% compared to my pre-Oracle days feels more like waning into obscurity than bursting into popularity 🙂
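For the curious, the percentages above come straight from the normalized figures in the table (a quick sketch; the variable names are mine):

```python
# Normalized page-view averages from the table above (baseline = 1.000).
baseline = 1.000          # 6 months prior to joining Oracle
post_exadata_30d = 1.460  # 30 days after the Exadata release
post_exadata_6mo = 0.986  # most recent 6 months

# The 30-day spike relative to baseline: a ~46% jump.
spike_pct = (post_exadata_30d - baseline) / baseline * 100

# The most recent 6 months versus the pre-Oracle baseline: a ~1.4% loss.
loss_pct = (baseline - post_exadata_6mo) / baseline * 100

print(f"{spike_pct:.0f}% jump, {loss_pct:.1f}% loss")
```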

Vendor Blogs
I don’t really consider my blog a “Vendor Blog” per se, but I’m glad to get the hat-tip from Curt nonetheless. I admit there are technology-related topics I’d love to blog about but feel constrained as a corporate employee. I wonder, does that make me a shill?

Oracle Database 11g with Intel Xeon 5570 TPC-C Result: Beware of the Absurdly Difficult NUMA Software Configuration Requirements!

According to this Business Wire article, the Intel Xeon 5500 (a.k.a., Nehalem) is making a huge splash with an Oracle Database 11g TPC-C result of 631,766 TpmC. At 78,970 TpmC/core, that is an outrageous result! I remember when it was difficult to push a 64 CPU system to the level these CPUs get with only one processor core! I had to quickly scurry over to the TPC website to dig in to the disclosures, but, as of 8:12 PM GMT it had not been posted yet:

TPC posting

Jumping the Gun
Ah, but by the time I got around to checking again it had indeed been posted. The first thing I did was check the full disclosure report to see what sort of Oracle NUMA-specific tweaking was done in the init.ora. None. That is very good news to me. The last thing I want to see is a bunch of confusing NUMA-specific tuning. Allow me to quote myself with a saying I’ve been rattling off for years:

The best NUMA system is the best SMP system.

By that I mean it shouldn’t take application software tuning to get your money’s worth out of the platform. Sure, we had to do it back in the mid-to-late 1990s with the pioneer NUMA systems, but that was largely due to the incredible ratio between local memory latency and highly-contended remote memory latency (and due to the concept of remote I/O, which does not apply here). Of course the operating system has to be NUMA-aware. Period.

Speeds
I know what the ratios are on the Xeon 5500 series, but I can’t recall whether the specific number I have in mind is one I obtained under non-disclosure, so I’m not going to go blurting it out. However, it turns out that as long as memory is fairly placed (e.g., not a Cyclops) and the ratio is comfortably below 2:1 (R:L), you’re going to get a real SMP “feel” from the box. Of course, the closer the ratio leans toward 1:1 the better.

Summary
NUMA is a hardware architecture that breaks bottlenecks. It shouldn’t have to break SMP programming principles in the process. The Intel Xeon 5570, it turns out, is the sort of NUMA system you should all be clamoring for. What kind of NUMA system is that? The answer is a NUMA system that is indistinguishable from a flat-memory SMP.

Very cool!

PS. I actually already knew what level of NUMA tuning was used in this TPC-C testing. I just couldn’t blog about it. I also know the precise R:L memory latency ratio for the box. The way I look at it, though, is that since this modern NUMA system gets 78,970 TpmC/core, the R:L ratio is unnecessary minutiae, as are thoughts of NUMA software tuning. I never imagined NUMA would come far enough for me to write that.

Helpful Blogs! Yes, I Read the Documentation, But It Doesn’t Always Sink In.

I was just chatting with my friend Greg Rahn about an External Table-related problem I was hitting when he pointed me to a related post on Tim Hall’s ORACLE-BASE website. Once again (as many times before) Tim’s blog proved extremely informative. It is by far one of my favorite blogs. I don’t know if it is a right-brain/left-brain thing, but Tim’s examples always help me smooth over any problems I’m having with documentation complexities. Come on, admit it, all of us have scratched our heads at least once while staring at a convoluted railroad diagram in the documentation and wished there was an example to bring it to life. Well, like I said, Tim always does that!

Come to think of it, I almost forgot to mention that fellow Oaktable Network member Jared Still has been blogging at Jared Still’s Ramblings for a couple of years. It is a good site and I recommend it. I’ll be adding it to my blogroll.

Poll Results: Stop Blogging.

According to the poll on my recent blog anniversary post, 1% of those participating in the poll recommend I stop blogging. There’s proof positive you can’t please everyone. On the other hand, 6% of the participants wanted more blogging about fishing. I am trying to post an occasional photo on my miscellaneous page. I just uploaded a couple of fishing-related photos for you six-percenters.

Bulk Data Loading Rates. Is 7.3 MB/s per CPU Core Fast, or Fast Enough? Part I.

It turns out that my suspicion about late-breaking competitive loading rates versus loading results was not that far off base. As I discussed in my recent post about Exadata “lagging” behind competitors’ bulk-loading proof points, there is a significant difference between citing a loading rate as opposed to a loading result. While it is true that customers generally don’t stage up a neat, concise set of flat files totaling, say, 1TB and load it with stopwatch in hand, it is important to understand processing dynamics with bulk data loading. Some products can suffer peaks and valleys in throughput as the data is being loaded.

I’m not suspecting any of that where these competitors’ results are concerned. I just want to reiterate the distinction between cited loading rates and loading results. When referring to the Greenplum news of a customer’s 4 TB/h loading rate, I wrote:

The Greenplum customer stated that they are “loading at rates of four terabytes an hour, consistently.” […] Is there a chance the customer loads, say, 0.42 or 1.42 terabytes as a timed procedure and normalizes the result to a per-hour rate?

What if it is 20 gigabytes loading in roughly 18 seconds, repeated every few minutes? That too is a 4TB/h rate.

While reading Curt Monash’s blog I found his reference to Eric Lai’s deeper analysis. Eric quotes a Greenplum representative as having said:

The company’s customer, Fox Interactive Media Inc., operator of MySpace.com, can load 2TB of Web usage data in half an hour

That is a good loading rate. But that isn’t what I’m blogging about. Eric continued to quote the Greenplum representative as saying:

To achieve 4TB/hour load speeds requires 40 Greenplum servers

40 Greenplum servers…now we are getting somewhere. To the best of my knowledge, this Greenplum customer would have the Sun Fire X4500-based Greenplum solution. The X4500 (a.k.a. Thumper) sports two dual-core AMD processors, so the 40-server configuration has 160 processor cores.

While some people choose to quote loading rates in TB/h form, I prefer expressing loading rates in megabytes per second per processor core (MBPS/core). Expressed in MBPS/core, the Greenplum customer is loading data at the rate of 7.28 MBPS/core.
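For those who want to check my arithmetic, the conversion is simple (a sketch; the helper name is mine and 1 TB is taken as 1,048,576 MB):

```python
TB_IN_MB = 1024 * 1024  # one binary terabyte, expressed in megabytes

def mbps_per_core(tb_per_hour: float, cores: int) -> float:
    """Convert a TB/h bulk-loading rate into MB/s per processor core."""
    mb_per_second = tb_per_hour * TB_IN_MB / 3600
    return mb_per_second / cores

# The Greenplum customer: 4 TB/h across 40 servers * 4 cores = 160 cores.
print(f"{mbps_per_core(4, 160):.2f} MBPS/core")  # 7.28 MBPS/core
```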

Summary
Bold statements about loading rates without any configuration information are not interesting.

PS. I almost forgot. I still think Option #2 in this list is absurd.

The HP Oracle Database Machine is Too Large and Too Powerful…Yes, for Some Applications!

Hypothetical Problem Scenario
Imagine your current Oracle data warehouse is performing within, say, 50% of your requirements. You’re a dutiful DBA. You have toiled, and you’ve tuned. Your query plans are in order and everything is running “just fine.” However, the larger BI group you are supporting is showing a significant number of critical queries that are completing in twice the amount of time specified in the original service level agreement. You’ve examined these queries, revisited all the available Oracle Database Data Warehousing features that improve query response time but you’ve determined the problem is boiling down to a plain old storage bottleneck.

Your current system is a two-node Real Application Clusters (RAC) configuration attached to a mid-range storage array (Fibre Channel). Each RAC server has 2 active 4GFC HBA ports (e.g., a single active card). The troublesome queries are scanning tables and indexes at an optimal (for this configuration) rate of 800 MB/s per RAC node for an aggregate throughput of 1.6 GB/s. Your storage group informs you that this particular mid-range array can sustain nearly 3 GB/s. So there is some headroom at that end. However, the troublesome queries are processor-intensive as they don’t merely scan data; they actually think about the data by way of joining, sorting and aggregating. As such, the processor utilization on the hosts inches up to, say, 90% when the “slow” queries are executing.

The 90% utilized hosts have open PCI slots so you could add another one of those dual-port HBAs, but what’s going to happen if you run more “plumbing?” You guessed it. The queries will bottleneck on CPU and will not realize the additional I/O bandwidth.

Life is an unending series of choices:

  • Option 1: Double the number of RAC nodes and provision the 3 GB/s to the 4 nodes. Instead of 1.6 GB/s driving CPU to some 90%, you would see the 3 GB/s drive the new CPU capacity to something like 80% utilization. You’d have a totally I/O-bottlenecked solution, but the queries come closer to making the grade since you’ve increased I/O bandwidth by 88%. CPU is still a problem.
  • Option 2: Totally jump ship. Get the forklift and wheel in entirely foreign technology from one of Oracle’s competitors.
  • Option 3: Wipe out the problem completely by deploying the HP Oracle Database Machine.

The problem with Option 1 is that it is a dead-end on I/O, and it isn’t actually sufficient: you needed to double from 1.6 GB/s, but you hit the wall at 3 GB/s. You’re going to have to migrate something somewhere sometime.
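For what it’s worth, the Option 1 arithmetic is easy to sanity-check (a sketch; the variable names are mine):

```python
current_gb_s = 1.6              # two RAC nodes at 800 MB/s each
array_ceiling_gb_s = 3.0        # what the mid-range array can sustain
needed_gb_s = 2 * current_gb_s  # queries run at half the required speed, so double it

# Going from 1.6 GB/s to the array's 3 GB/s ceiling is an ~88% increase...
increase_pct = (array_ceiling_gb_s - current_gb_s) / current_gb_s * 100

# ...but still 0.2 GB/s shy of the 3.2 GB/s actually needed.
shortfall_gb_s = needed_gb_s - array_ceiling_gb_s

print(f"increase: {increase_pct:.0f}%, shortfall: {shortfall_gb_s:.1f} GB/s")
```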

Option 2 is very disruptive.

And, in your particular case, Option 3 is a bit “absurd.”

He’s Off His Rocker Now
No, honestly, deploying a 14 GB/s solution (HP Oracle Database Machine) to solve a problem that can be addressed by doubling your 1.6 GB/s throughput is total overkill. This all presumes, of course, that you only have one warehouse (thus no opportunity for consolidation) and a powerful HP Oracle Database Machine would be too much kit.

No, He’s Not Off His Rocker Now
We had to be hush-hush for a bit on this, but I see that Jean-Pierre Dijcks over at The Data Warehouse Insider finally got to let the cat out of the bag. Oracle is now offering a “half-rack” HP Oracle Database Machine.

This configuration offers a 4-node ProLiant DL360 database grid and 7 HP Oracle Exadata Storage Servers. This is, therefore, a 7 GB/s-capable system. To handle the flow of data, there are 88 Xeon “Harpertown” processor cores performing query processing that starts right at the disks, where filtration and projection functions are executed by Exadata Storage Server software.

So, as far as the option list goes, I’d now say Option 3 is perfect for the hypothetical scenario I offered above. Just order, “Glass half empty, please.”


Webcast Announcement: Oracle Exadata Storage Server Technical Deep Dive. Part I.

Wednesday, March 25, 2009 12:00 PM – 1:00 PM CDT

Kevin Closson will offer an in-depth presentation on Exadata Storage Server and HP Oracle Database Machine internals and performance characteristics. Topics planned for this installment in the series include:

  • Brief Technical Architecture Overview
  • Understanding Producer/Consumer Data Flow Dynamics
  • A “How” and “Why” Comparison of Exadata versus Conventional Storage
  • Storage Join Filters

Link to the Registration Page for the Webcast.

Where’s the Proof? Poof, It’s a Spoof! Exadata Lags Competitor Bulk Data Loading Capability. Are You Sure?

I’ve received a good deal of email following my recent blog entry entitled Winter Corporation Assessment of Exadata Performance: Lopsided! Test it All, or Don’t Test at All? I’m not going to continue the drama that ensued from that blog post, but an email I received the other day on the matter warrants a blog entry. The reader stated:

[…text deleted…] that is why I sort of agree with Dan. It makes no sense to load a huge test database without showing how long it took to load it. Now I see Greenplum has very fast data loading Oracle should wake up or […text deleted…]

That is a good question and it warrants this blog entry. I was quite clear in my post about why the Winter Corporation report didn’t cover every imaginable test, but I want to go into the topic of data loading a bit.

The reader asking this question was referring to a blogger who took Richard Winter’s Exadata performance assessment to task citing three areas he deemed suspiciously missing from the assessment. The first of these perceived shortcomings is what the reader was referring to:

High Performance Batch Load – where are the performance numbers of high performance batch load, or of parallel loads executing against the device?  How many parallel BIG batch loads can execute at once before the upper limits of the machine and Oracle are reached?

So the blogger and the reader who submitted this question/comment are in agreement.

Nothing Hidden
The Winter Corporation Exadata performance assessment is quite clear in two areas related to the reader’s question. First, the report shows that the aggregate size of the tables was 14 terabytes. With Automatic Storage Management, this equates to 28 terabytes of physical disk. Second, the report is clear that Exadata Storage Server offers 1 GB/s (read) disk bandwidth. If Exadata happened to be unlike every other storage architecture by offering parity between read and write bandwidth, loading the 28 TB of mirrored user data would have taken only 2000 seconds which is a load rate of about 50 TB/h. Believe me, it wasn’t loaded at a 50 TB/h rate. See, Exadata-related literature is very forthcoming about the sustained read I/O rate of 1 GB/s per Exadata Storage Server. It turns out that the sustained write bandwidth of Exadata is roughly 500 MB/s or 7 GB/s for a full-height HP Oracle Database Machine.  Is that broken?

Writes are more expensive for all storage architectures. I personally don’t think a write bandwidth equivalent to 50% of the demonstrated read bandwidth is all that bad.  But, now the cat is out of the bag. The current maximum theoretical write throughput of a HP Oracle Database Machine is a paltry 7 GB/s (I’m being facetious because, uh, 7GB/s write bandwidth in a single 42U configuration is nothing to shake a stick at). At 7 GB/s, the 28 TB of mirrored user data would have been loaded in 4000 seconds (a load rate of about 25 TB/h). But, no, the test data was not loaded at a rate of 25 TB/h either. So what’s my point?

My point is that even if I told you the practical load rates for Exadata Storage Server, I wouldn’t expect you to believe me. After all, it didn’t make the cut for inclusion in the Winter Corporation report so what credence would you give a blog entry that claims something between 0 TB/h and 25 TB/h? Well, I hope you give at least a wee bit of credence because the real number did in fact fall within those bounds. It had to. As an aside, I have occasionally covered large streaming I/O topics specific to Exadata but remember that a CTAS operation is much lighter than ingesting ASCII flat file data and loading into a database. I digress.
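Those upper bounds are nothing more than payload divided by bandwidth; a quick sketch using the figures from the report (helper name is mine):

```python
PAYLOAD_GB = 28 * 1024  # 14 TB of user data, doubled by ASM normal redundancy

def load_bound(write_gb_per_s: float):
    """Elapsed seconds and TB/h rate to write the payload at a given bandwidth."""
    seconds = PAYLOAD_GB / write_gb_per_s
    tb_per_hour = (PAYLOAD_GB / 1024) / (seconds / 3600)
    return seconds, tb_per_hour

# Hypothetical parity with the demonstrated read bandwidth (14 cells x 1 GB/s):
print(load_bound(14.0))  # roughly 2000 seconds, about 50 TB/h

# The actual aggregate write bandwidth (14 cells x ~500 MB/s = 7 GB/s):
print(load_bound(7.0))   # roughly 4000 seconds, about 25 TB/h
```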

This is an Absurd Blog Entry
Is it? Here is how I feel about this. I know the practical data loading rates of Exadata but I can’t go blurting it out without substantiation. Nonetheless, put all that aside in your minds for a moment.

Until quite recently, bulk data loading claims by DW/BI solution providers such as Netezza and Vertica were not exactly phenomenal. For instance, Netezza advertises 500 GB/hour load rates in this collateral, and Vertica’s claims of 300 MB/minute and 5 GB/minute seemed interesting enough to them to mention in their collateral. As you can tell, data loading rates are all over the map. So, one of two things must be true: either a) Oracle is about to get more vocal about Exadata bulk loading capabilities, or b) Exadata is a solution that offers the best realizable physical disk read rates but an embarrassing bulk data loading rate.

I know the competition is betting on the latter because they have to.

What About the Reader’s Question?
Sorry, I nearly forgot. The reader was very concerned about bulk data loading and mentioned Greenplum. According to this NETWORKWORLD article, Greenplum customer Fox Interactive Media is quoted as having said:

We’re loading at rates of four terabytes an hour, consistently.

Quoting a customer is a respectable proof point in my opinion. The only problem I see is that there is no detail about the claim. For instance, the quote says “rates of four terabytes an hour […]” There is a big difference between stating a loading rate and a loading result. For example, this Greenplum customer cites a rate of 4TB/h (about 1.1GB/s) without mention of the configuration. Let’s suppose for a moment that the configuration is a DWAPP-DW40, which is the largest single-rack configuration available (to my knowledge) from Greenplum. I admit I don’t know enough about Greenplum architecture to know, but depending on how RAM is used by the four Sun Fire X4500 servers in the configuration (aggregate 64 GB), it is conceivable that the customer could “load” data at a 4 TB/h rate for an entire minute without touching magnetic media.

I presume Greenplum doesn’t cache bulk-inserted data, but the point I’m trying to make is the difference between a loading rate and a loading result. It is a significant difference. Am I off my rocker? Well, let’s think about this. The Greenplum customer stated that they are “loading at rates of four terabytes an hour, consistently.” Who amongst you thinks the customer stages up exactly 1TB of data in production and then plops it into the database with stopwatch in hand? Is there a chance the customer loads, say, 0.42 or 1.42 terabytes as a timed procedure and normalizes the result to a per-hour rate? Of course that is possible and totally reasonable. So why would it be totally absurd for me to suggest that perhaps the case is an occasional load of, say, 100 gigabytes taking about 90 seconds (also an approximate 4TB/h rate)? What if it is 20 gigabytes loading in roughly 18 seconds, repeated every few minutes? That too is a 4TB/h rate. And, honestly, both would be truthful and valid depending on the site requirements. The point is we don’t know. But I’m not going to totally discount the information just because it is missing something I deem essential.
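To put a finer point on it, here is a quick sketch (the volume/time pairs are hypothetical examples of my own) showing how very different loading results all normalize to the same advertised rate:

```python
def tb_per_hour(gb_loaded: float, seconds: float) -> float:
    """Normalize any (volume, elapsed-time) pair to a TB/h rate (1 TB = 1000 GB here)."""
    return (gb_loaded / 1000) / (seconds / 3600)

# Very different loading *results*, all truthfully quotable as a "4 TB/h" *rate*:
print(tb_per_hour(2000, 1800))  # 2 TB in half an hour   -> 4.0
print(tb_per_hour(100, 90))     # 100 GB in 90 seconds   -> 4.0
print(tb_per_hour(20, 18))      # a 20 GB burst in 18 s  -> 4.0
```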

I’m willing to take the information reported by the Greenplum customer and say, “Yes, 4 TB/h is a good load rate!” Thus, I have answered the blog reader’s original question about bulk loading vis-à-vis the recent Greenplum news on the topic.

Under-Exaggeration
The NETWORKWORLD article quotes Ben Werther, director of product marketing at Greenplum, as having said:

This is definitely the fastest in the industry, [ … ]  Netezza for example quotes 500GB an hour, and we have not seen anyone doing more than 1TB an hour.

Well, I think Ben has it wrong. Vertica has a proof point of a loading result of 5.4 terabytes in 57 minutes 21.5 seconds, which is a rate of 5.65 TB/h. This result was independently validated (no doubt a paid engagement, which is no less valid) and is to be trusted. So, Ben’s statement represents a 5X under-exaggeration, which is better than the 100X-and-beyond exaggerations I occasionally rant about.

A Word About the Vertica Bulk-Loading Result
It used TPC-H DBGEN to load data into a 3rd normal form schema. While I’m not going to totally discount a test because it doesn’t embody a tutorial on schema design, many readers may not know this fact about that particular proof point. The proof point is about data loading, not about schema design. The blocks of data were splat upon the round, brown spinning thingies at the reported, validated rate. No questioning that. Schemas have nothing to do with that though and thus it is just fine to use a 3rd normal form schema for such a proof point.

Summary
Greenplum and Vertica have good proof points out there for bulk data loading, and a single-rack HP Oracle Database Machine cannot bulk-load data faster than roughly 25 terabytes per hour. Oracle’s competitors rest their hopes on the actual bulk-loading capability of the HP Oracle Database Machine being a small fraction of that. For the time being, it seems, that will remain the status quo.

I’ve lived prior lives wishing the competition was pigeonholed one way or the other.

What Does Snapple Have To Do With Information Technology? Cisco Makes Blade Servers?

This is just a short blog entry about Cisco’s Unified Computing initiative. It seems Cisco has been quite busy readying blade server technology to bring to market. According to this NETWORKWORLD article, analyst Zeus Kerravala of the Yankee Group was quoted as follows:

If Cisco builds their own server, it will forever change the relationship they have with server manufacturers, for the negative.

I don’t pretend to understand these things. I’ll be watching to learn what the value proposition is for these systems offerings. What with HP, IBM, Dell, Sun and Verari coming to mind (in order of volume, it seems), it looks like a crowded field. I could be wrong about that order vis-à-vis volume, come to think of it. It looks about right, though.

As I was saying, I’ll be eager to learn more about this offering. I don’t imagine the original business plan of Snapple included becoming the 3rd largest refreshment beverage business in North America. I’m sure they don’t mind having that spot in the marketplace, though, because that is actually quite an astounding position to be in considering the barriers to entry in that industry. I wonder if Cisco is making a sort of Snapple move? Would 3rd in volume be sufficient?

Time to Buy a PC. Come On Now, Everyone Knows Why an Intel Q8200 is Better Than a Q6600, Right?

…this one is off-topic…please forgive…

I’ve been shopping for a home deskside system and realized quickly that I was very out of tune with the branding Intel has for consumer CPU offerings. I’m versed in the server CPU nomenclature, but when it comes to the processors going into PCs I’m lost. For instance, think quick: what is an Intel Q6600 and why should you like it so much less than an Intel Q8200? Wrapped up in this consumer nomenclature are core count, clock speed, socket type and processor cache size.

I’ve been relying on the convenient search interface at hardware.info. It works like a magic decoder ring.

Cached Ext3 File Access is Faster Than Infiniband. Infiniband is Just a Marketing Ploy. Who Needs Exadata?

Infiniband: Just an Exadata Storage Server Marketing Ploy
One phenomenon I’ve observed about Oracle Exadata Storage Server technology is the propensity of some folks to chop it up from the sum of its parts into a pile of Erector Set parts so that each bit can be the topic for discussion.

Readers of this blog know that I generally do not scrutinize any one element of Exadata architecture, opting instead to handle it as a whole product. Some of my conversations at Rocky Mountain Oracle User Group Training Days 2009, earlier this month, reminded me of this topic. I heard one person at the conference say words to the effect of, “…Exadata is so fast because of Infiniband.” Actually, it isn’t. Exadata is not powerful because of any one of its parts. It is powerful because of the sum of its parts. That doesn’t mean I would side with EMC’s Chuck Hollis and his opinions regarding the rightful place of Infiniband in Exadata architecture. Readers might recall Chuck’s words in his post entitled I annoy Kevin Closson at Oracle:

The “storage nodes” are interconnected to “database nodes” via Infiniband, and I questioned (based on our work in this environment) whether this was actually a bottleneck being addressed, or whether it was a bit of marketing flash in a world where multiple 1Gb ethernet ports seem to do as well.

Multiple GbE ports? Oh boy. Sure, if the storage array is saturated at the storage processor level (common) or back-end loop level (more common yet), I know for a fact that offering up-wind Infiniband connectivity to hosts is a waste of time. Maybe that is what he is referring to? I don’t know, but none of that has anything to do with Exadata because Exadata suffers no head saturation.

Dazzle Them With the Speeds and Feeds
Oracle database administrators sometimes get annoyed when people throw speed and feed specifications at them without following up with real-world examples demonstrating the benefit to a meaningful Oracle DW/BI workload. I truly hope there are no DBAs who care about wire latencies for Reliable Datagram Sockets requests any more than, say, what CPUs or Operating System software is embedded in the conventional storage array they use for Oracle today. Some technology details do not stand on their own merit.

Infiniband, however, offers performance benefits in the Exadata Storage Server architecture that are both tangible and critical. Allow me to offer an example of what I mean.

Read From Memory, Write to Redundant ASM Storage
I was recently analyzing some INSERT /*+ APPEND */ performance attributes on a system using Exadata Storage Server for the database. During one of the tests I decided that I wanted to create a scenario where the “ingest” side suffered no physical I/O. To do this I created a tablespace in an Ext3 filesystem file and set filesystemio_options=asynch to ensure I would not be served by direct I/O. I wanted the tables in the datafile cached in the Linux page cache.

The test looped INSERT /*+ APPEND */ SELECT * FROM commands several times since I did not have enough physical memory for a large cached tablespace in the Ext3 filesystem. The cached table was 18.5GB and the workload consisted of 10 executions of the INSERT command using Parallel Query Option. To that end, the target table grew to roughly 185GB during the test. Since the target table resided in Exadata Storage Server ASM disks, with normal redundancy, the downwind write payload was 370GB. That is, the test read 185GB from memory and wrote 370GB to Exadata storage.

The entire test was isolated to a single database host in the database grid, but all 14 Exadata Storage Servers of the HP Oracle Database Machine were being written to.

After running the test once, to prep the cache, I dropped and recreated the target table and ran the test again. The completion time was 307 seconds for a write throughput of 1.21GB/s.
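The throughput figure falls straight out of the test parameters (a quick sketch; the variable names are mine):

```python
source_gb = 18.5   # cached table sitting in the Ext3 file system
passes = 10        # executions of the INSERT /*+ APPEND */ command
redundancy = 2     # ASM normal redundancy doubles the write payload
elapsed_s = 307    # measured completion time

read_gb = source_gb * passes       # 185 GB read from the Linux page cache
write_gb = read_gb * redundancy    # 370 GB written to the Exadata cells
print(f"{write_gb / elapsed_s:.2f} GB/s")  # 1.21 GB/s write throughput
```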

During the run AWR tracked 328,726 direct path writes:

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                          Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
direct path write                   328,726       3,078      9   64.1 User I/O
DB CPU                                            1,682          35.0
row cache lock                        5,528          17      3     .4 Concurrenc
control file sequential read          2,985           7      2     .1 System I/O
DFS lock handle                         964           4      4     .1 Other
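
The 9 ms average in the box is simply total wait time over wait count, which is worth verifying when reading AWR output:

```python
direct_path_writes = 328_726   # Waits column, from the AWR top-5 list above
total_wait_s = 3_078           # Time(s) column

avg_wait_ms = total_wait_s / direct_path_writes * 1000
print(round(avg_wait_ms))      # → 9 (ms), matching the report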

…and 12,143,325 reads from the SEED tablespace which, of course, resided fully cached in the Ext3 filesystem:

Segments by Direct Physical Reads        DB/Inst: TEST/test1  Snaps: 1891-1892
-> Total Direct Physical Reads:      12,143,445
-> Captured Segments account for  100.0% of Total

           Tablespace                      Subobject  Obj.        Direct
Owner         Name    Object Name            Name     Type         Reads  %Total
---------- ---------- -------------------- ---------- ----- ------------ -------
TEST       SEED       SEED                            TABLE   12,143,325  100.00
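
As an aside, the read count squares with the data volume if one assumes, hypothetically, a 16KB database block size (the report does not state db_block_size, so treat this as a sketch):

```python
direct_reads = 12_143_325   # from the AWR segment statistics above
block_bytes = 16 * 1024     # assumed db_block_size (not stated in the report)

total_gib = direct_reads * block_bytes / 2**30
print(round(total_gib, 1))  # ≈ 185.3, i.e. the 18.5GB table read 10 times
```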

AWR also showed that the same number of blocks read were also written:

Segments by Direct Physical Writes       DB/Inst: TEST/test1  Snaps: 1891-1892
-> Total Direct Physical Writes:      12,143,453
-> Captured Segments account for  100.0% of Total

           Tablespace                      Subobject  Obj.        Direct
Owner         Name    Object Name            Name     Type        Writes  %Total
---------- ---------- -------------------- ---------- ----- ------------ -------
TEST       CARDX      ALL_CARD_TRANS                  TABLE   12,139,301   99.97
SYS        SYSAUX     WRH$_ACTIVE_SESSION_ 70510_1848 TABLE            8     .00
          -------------------------------------------------------------

During the test I picked up a snippet of vmstat(1) from the database host. Since this is Exadata storage you’ll see essentially nothing under the bi or bo columns, as those track conventional block I/O; Exadata I/O is sent via the Reliable Datagram Sockets protocol over Infiniband (iDB) to the storage servers. And, of course, the source table was fully cached. Although processor utilization was a bit erratic, the peaks in user mode were on the order of 70% and kernel mode roughly 17%. I could have driven the rate up a little with a higher DOP, but all told this is a great throughput rate at a reasonable CPU cost. I did not want processor saturation during this test.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0 126848 3874888 413376 22810008    0    0    31   128   18   16 31  3 66  0  0
 0  0 126848 3845352 413376 22810428    0    0     1    60 1648 5361 41  5 54  0  0
13  0 126848 3836192 413380 22810460    0    0     2    77 1443 6217  8  2 90  0  0
 0  0 126848 3855868 413388 22810456    0    0     1   173 1459 4870 15  2 82  0  0
27  0 126848 3797260 413392 22810840    0    0    10    47 7750 5769 53 12 36  0  0
 6  0 126848 3814064 413392 22810936    0    0     1    44 11810 10695 59 16 25  0  0
 7  0 126848 3804336 413392 22810956    0    0     2    72 12128 8498 76 18  6  0  0
16  0 126848 3810312 413392 22810988    0    0     1    83 11921 11085 59 18 23  0  0
11  0 126848 3819564 413392 22811872    0    0     2   134 12118 7760 74 16 10  0  0
10  0 126848 3804008 413392 22819644    0    0     1    19 11873 10544 59 17 24  0  0
 2  0 126848 3789324 413392 22827004    0    0     2    92 11933 7753 74 17 10  0  0
39  0 126848 3759416 413396 22839640    0    0     9    39 10625 8992 54 14 32  0  0
 6  0 126848 3766288 413400 22845200    0    0     2    40 12061 9121 70 17 13  0  0
35  0 126848 3739296 413400 22853452    0    0     2   161 12124 8447 66 16 18  0  0
11  0 126848 3761772 413400 22860092    0    0     1    38 12011 9040 67 16 17  0  0
26  0 126848 3724364 413400 22868624    0    0     2    60 12489 9105 69 17 14  0  0
11  0 126848 3734572 413400 22877056    0    0     1    85 12067 9760 64 17 19  0  0
21  0 126848 3713224 413404 22883044    0    0     2   102 11794 8454 72 17 11  0  0
14  0 126848 3726168 413408 22895800    0    0     9     2 10025 9538 51 14 35  0  0
13  0 126848 3706280 413408 22902448    0    0     2    83 12133 7879 75 17  8  0  0
 7  0 126848 3707200 413408 22909060    0    0     1    61 11945 10135 59 17 24  0  0
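
For anyone who would rather reduce a capture like this to averages than eyeball it, a small parsing sketch (the sample rows are copied from the box above; the column indexes assume the 17-column vmstat layout shown):

```python
# Average the user (us) and kernel (sy) CPU columns from a vmstat capture.
# Sample rows are copied verbatim from the snippet above.
sample = """\
 0  0 126848 3845352 413376 22810428    0    0     1    60 1648 5361 41  5 54  0  0
13  0 126848 3836192 413380 22810460    0    0     2    77 1443 6217  8  2 90  0  0
 6  0 126848 3814064 413392 22810936    0    0     1    44 11810 10695 59 16 25  0  0
"""
rows = [line.split() for line in sample.splitlines()]
us = [int(r[12]) for r in rows]   # user-mode CPU %
sy = [int(r[13]) for r in rows]   # kernel-mode CPU %
print(sum(us) / len(us), sum(sy) / len(sy))
```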

That was an interesting test. I can’t talk about the specifics of why I was studying this stuff, but I liked driving a single DL360 G5 server to 1.2GB/s write throughput. That got me thinking. What would happen if I put the SEED table in Exadata?

Read from Disk, Write to Disk.
I performed a CTAS (Create Table As Select) to create a copy of the SEED table (same storage clause, etc.) in the Exadata Storage Servers. The model I wanted to test next was: read the 185GB from Exadata while writing the 370GB right back to the same Exadata Storage Server disks. This test increased physical disk I/O by 50% while introducing latency (I/O service time) on the ingest side. Remember, the SEED table I/O in the Ext3 model enjoyed read service times at RAM speed, 3 orders of magnitude faster than disk.

So, I ran the test and of course it was slower now that I had increased the physical I/O payload by 50% and introduced I/O latency on the ingest side. It was, in fact, only 4% slower, for a job completion time of 320 seconds! Yes, 4%.
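
That delta, spelled out:

```python
ext3_seconds = 307      # cached Ext3 SEED run
exadata_seconds = 320   # all-Exadata run (50% more physical I/O)

slowdown_pct = (exadata_seconds - ext3_seconds) / ext3_seconds * 100
print(round(slowdown_pct, 1))   # → 4.2 (percent)
```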

The AWR report showed the same direct path write cost to the target table:

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
direct path write                   365,950       3,296      9   66.2 User I/O
DB CPU                                            1,462          29.4
cell smart table scan               238,027         210      1    4.2 User I/O
row cache lock                        5,473           6      1     .1 Concurrenc
control file sequential read          3,087           6      2     .1 System I/O

…and reads from the tablespace called EX_SEED (which resides in Exadata storage) were on par with the volume read from the cached Ext3 tablespace:

Segments by Direct Physical Reads        DB/Inst: TEST/test1  Snaps: 1923-1924
-> Total Direct Physical Reads:      12,143,445
-> Captured Segments account for  100.0% of Total

           Tablespace                      Subobject  Obj.        Direct
Owner         Name    Object Name            Name     Type         Reads  %Total
---------- ---------- -------------------- ---------- ----- ------------ -------
TEST       EX_SEED    EX_SEED                         TABLE   12,142,160   99.99

…and the writes to the ALL_CARD_TRANS table were on par with the test conducted using the cached Ext3 SEED table:

Segments by Physical Writes              DB/Inst: TEST/test1  Snaps: 1923-1924
-> Total Physical Writes:      12,146,714
-> Captured Segments account for  100.0% of Total

           Tablespace                      Subobject  Obj.      Physical
Owner         Name    Object Name            Name     Type        Writes  %Total
---------- ---------- -------------------- ---------- ----- ------------ -------
TEST       CARDX      ALL_CARD_TRANS                  TABLE   12,144,831   99.98

Most interesting, but not surprising, was the processor utilization profile (see the following box). In spite of performing 50% more physical disk I/O, kernel-mode cycles were reduced by nearly half. It is, of course, more processor intensive (in kernel mode) to perform reads from an Oracle tablespace cached in an Ext3 file than to read data from Exadata, because inbound I/O from Exadata storage is DMAed directly into the process address space without any copies. I/O from an Ext3 cached file, on the other hand, requires kernel-mode memcpys from the page cache into the address space of the Oracle process reading the file. However, since the workload was not bound by CPU saturation, the memcpy overhead was not a limiting factor. But that’s not all.

I’d like to draw your attention to the user-mode cycles. Notice how they never peak above the low-fifties percent mark? The cached Ext3 test, on the other hand, exhibited high-sixties and low-seventies percent user-mode processor utilization. That is a crucial differentiator. Even when I/O is being serviced at RAM speed, the code required to deal with conventional I/O (issuing and reaping, etc.) is significantly more expensive (CPU-wise) than interfacing with Exadata Storage.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 374788 2214348  62904 24535308    0    0    26   103   18    3 31  3 66  0  0
 0  0 374788 2214868  62904 24535320    0    0     2    79 1259 4232 19  2 79  0  0
 0  0 374788 2215756  62904 24535328    0    0     1   148 1148 4894  0  1 99  0  0
 0  0 374788 2215788  62904 24535328    0    0     2     1 1054 3983  0  0 100  0  0
13  0 374788 2203004  62904 24535540    0    0    17   142 8044 9675 31  6 63  0  0
 8  0 374788 2187292  62904 24535544    0    0     2     1 12856 13948 53  8 40  0  0
 6  0 374788 2191432  62904 24535584    0    0     1    69 12864 13904 53  7 40  0  0
 6  0 374788 2194812  62904 24535584    0    0     2    35 12948 13954 54  8 39  0  0
 7  0 374788 2194284  62904 24535592    0    0     1     2 12940 13605 53  7 40  0  0
 3  0 374788 2193988  62904 24535596    0    0     2    22 12849 13329 52  7 41  0  0
 9  0 374788 2192988  62904 24535604    0    0     1     2 12797 13555 52  7 41  0  0
10  0 374788 2193788  62908 24537648    0    0    10    60 11497 13184 46  8 46  0  0
 7  0 374788 2191976  62912 24537656    0    0     1   146 12990 14248 52  8 40  0  0
 5  0 374788 2190668  62912 24537668    0    0     2     1 12974 13401 52  7 41  0  0
 3  0 374788 2191492  62912 24537676    0    0     1    16 12886 13495 52  7 41  0  0
 6  0 374788 2191292  62912 24537676    0    0     2     1 12825 13736 52  8 40  0  0
 4  0 374788 2190768  62912 24537688    0    0     1    89 12832 14500 53  8 40  0  0
 7  0 374788 2189928  62912 24537688    0    0     2     2 12849 14015 52  8 40  0  0
11  0 374788 2190320  62916 24539712    0    0     9    36 11588 12411 46  7 47  0  0
17  0 374788 2189496  62916 24539744    0    0     2    40 12948 13373 52  7 41  0  0
 6  1 374788 2189032  62916 24539748    0    0     1     2 12932 13899 53  7 39  0  0

Ten Pounds of Rocks in a 5-Pound Bag?
For those who did the math and determined that the 100% Exadata case was pushing a combined read+write throughput of 1730MB/s through the Infiniband card, you did the math correctly.

The IB cards in the database tier of the HP Oracle Database Machine each support a maximum theoretical throughput of 1850MB/s in full-duplex mode. This workload has an “optimal” (serendipitous really) blend of read traffic mixed with write traffic so it drives the card to within 6% of its maximum theoretical throughput rating.

We don’t suggest that this sort of throughput is achievable with an actual DW/BI workload. We prefer the more realistic, conservative 1.5GB/s (consistently measured with complex, concurrent queries) when setting expectations. That said, even if there were a DW/BI workload that demanded this sort of blend of reads and writes (i.e., one with a heavy component of sort-spill writes), the balance between the database tier and the storage tier is not out of whack.

The HP Oracle Database Machine sports a storage grid bandwidth of 14GB/s, so even this workload scaled out to 8 nodes in the database grid would still fit within that envelope, since 1730MB/s * 8 RAC nodes == 13.8GB/s.
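
The headroom math, using the figures quoted above:

```python
combined_mb_s = 1730       # read + write through one database-host IB card
hca_max_mb_s = 1850        # theoretical full-duplex maximum per card
rac_nodes = 8              # database grid nodes
storage_grid_gb_s = 14     # HP Oracle Database Machine storage bandwidth

headroom_pct = (1 - combined_mb_s / hca_max_mb_s) * 100
aggregate_gb_s = combined_mb_s * rac_nodes / 1000
print(round(headroom_pct, 1), aggregate_gb_s)  # ≈ 6.5% shy of max, 13.84 GB/s
assert aggregate_gb_s < storage_grid_gb_s      # the configuration stays balanced
```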

I like balanced configurations.

Summary
This test goes a long way toward showing the tremendous efficiencies in the Exadata architecture. Indeed, both tests had to “lift” the same amount of SEED data and produce the same amount of downwind write I/O. Everything about the SQL layer remains constant. For that matter, most everything remains constant between the two models with the exception of the lower-level read-side I/O from the cached file in the Ext3 case.

The Exadata test case performed 50% more physical I/O and did so with roughly 28% less user-mode processor utilization and a clean 50% reduction in kernel-mode cycles while coming within 4% of the job completion time achieved by the cached Ext3 SEED case.
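
Those percentages follow from the approximate peak utilization figures read off the two vmstat boxes. Since the peaks are approximations, a quick sketch lands in the neighborhood of, rather than exactly on, the quoted numbers:

```python
ext3_us, ext3_sy = 70, 17       # cached Ext3 run: approx. peak user / kernel %
exadata_us, exadata_sy = 52, 8  # all-Exadata run: approx. peak user / kernel %

user_cut = (1 - exadata_us / ext3_us) * 100
kernel_cut = (1 - exadata_sy / ext3_sy) * 100
print(round(user_cut), round(kernel_cut))  # near the quoted ~28% and ~50%
```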

Infiniband is not just a marketing ploy. Infiniband is an important asset in the Exadata Storage Server architecture.

Disk Drives: They’re Not as Slow as You Think! Got Junk Science?

I was just taking a look at Curt Monash’s TDWI slide set entitled How to Select an Analytic DBMS when I got to slide 5 and noticed something peculiar. Consider the following quote:

Transistors/chip:  >100,000 since 1971

Disk density: >100,000,000 since 1956

Disk speed: 12.5 since 1956

Disk Speed == Rotational Speed?
The slide was offering a comparison of “disk speed” from 1956 and CPU transistor count from 1971 to the present. I accept the notion that processors have outpaced disk capabilities in that time period, no doubt! However, I think too much emphasis is placed on disk rotational speed and not enough on the 100 million-fold increase in density. The topic at hand is DW/BI, and I don’t think that much attention should be given to rotational delay. I’m not trying to read into Curt’s message here because I wasn’t in the presentation, but it offers food for thought. Are disks really that slow?

Are Disks Really That Slow?
Instead of comparing modern drives to the prehistoric “winchester” drive of 1956, I think a better comparison would be to the ST-506, the father of modern disks. The ST-506 of 1984 would have found itself paired with an Intel 80286 in the PC of the day. Comparing transistor count from the 80286 to a “Harpertown” Xeon yields an increase of 3280-fold, and a clock speed improvement of 212-fold. The ST-506 (circa 1984) had a throughput capability of 625KB/s. Modern 450GB SAS drives can scan at 150MB/s, an improvement of 245-fold. When considered in these terms, hard drive throughput and CPU clock speed have seen a surprisingly similar increase in capability. Of course Intel is cramming 3280x more transistors into a processor these days, but read on.
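
The disk side of that comparison, worked through:

```python
st506_kb_s = 625        # ST-506 sequential throughput, circa 1984
sas_kb_s = 150 * 1024   # modern 450GB SAS scan rate: 150MB/s

disk_fold = sas_kb_s / st506_kb_s
print(round(disk_fold))  # i.e. the roughly 245-fold improvement cited above
```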

The point I’m trying to make is that disks haven’t lagged as far behind CPU as I feel is sometimes portrayed. In fact, I think the refrigerator-cabinet array manufacturers disingenuously draw attention to things like rotational delay in order to detract from the real bottleneck, which is the flow of data from the platters through the storage processors to the host. This bottleneck is built into modern storage arrays and felt all the way through the host bus adaptors. Let’s not punish ourselves by mentioning the plumbing complexities of storage networking models like Fibre Channel.

Focus on Flow of Data, Not Spinning Speed.
Oracle Exadata Storage Server, in the HP Oracle Database Machine offering, configures 1.05 processor cores per hard drive (176:168). Even if I clump Flash SSD into the mix (about a 60% increase in scan throughput over round, brown spinning disks), it doesn’t really change that much (i.e., not by orders of magnitude).

Junk Science? Maybe.
So, am I just throwing out the 3280x increase in transistor count I mentioned? No, but I think when we compare the richness of processing that occurs on data coming off disk in today’s world (e.g., DW/BI) to the 80286->ST506 days (e.g., VisiCalc, a 26KB executable), the transistor count gets factored out. So we are left with 245-fold disk performance gains and 212-fold CPU clock gains. So, is it a total coincidence that a good DW/BI CPU-to-disk ratio is about 1:1? Maybe not. Maybe this is all just junk science. If so, we should all continue connecting as many disks to the back of our conventional storage arrays as they will support.

Summary
Stop bottlenecking your disk drives. Then, and only then, will you be able to see just how fast they are and whether you have a reasonable ratio of CPU to disk for your DW/BI workload.


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.