Archive Page 19

What Does Snapple Have To Do With Information Technology? Cisco Makes Blade Servers?

This is just a short blog entry about Cisco’s Unified Computing initiative. It seems Cisco has been quite busy readying blade server technology to bring to market. According to this NETWORKWORLD article, analyst Zeus Kerravala of the Yankee Group was quoted as follows:

If Cisco builds their own server, it will forever change the relationship they have with server manufacturers, for the negative.

I don’t pretend to understand these things. I’ll be watching to learn what the value proposition is for these systems offerings. With HP, IBM, DELL, Sun and Verari coming to mind (in order of volume, it seems), this looks like a crowded field. I could be wrong about that order vis-à-vis volume, come to think of it, but it looks about right.

As I was saying, I’ll be eager to learn more about this offering. I don’t imagine the original business plan of Snapple included becoming the 3rd largest refreshment beverage business in North America. I’m sure they don’t mind holding that spot in the marketplace, though, because it is quite an astounding position to be in considering the barriers to entry in that industry. I wonder if Cisco is making a sort of Snapple move? Would 3rd in volume be sufficient?

Time to Buy a PC. Come On Now, Everyone Knows Why an Intel Q8200 is Better Than a Q6600, Right?

…this one is off-topic…please forgive…

I’ve been shopping for a home deskside system and realized quickly that I was very out of tune with the branding Intel has for consumer CPU offerings. I’m versed in the server CPU nomenclature, but when it comes to the processors going into PCs I’m lost. For instance, think quick: what is an Intel Q6600 and why should you like it so much less than an Intel Q8200? Wrapped up in this consumer nomenclature are core count, clock speed, socket type and processor cache size.

I’ve been relying on the convenient search interface at hardware.info. It works like a magic decoder ring.

Cached Ext3 File Access is Faster Than Infiniband. Infiniband is Just a Marketing Ploy. Who Needs Exadata?

Infiniband: Just an Exadata Storage Server Marketing Ploy
One phenomenon I’ve observed about Oracle Exadata Storage Server technology is the propensity of some folks to chop it up into a pile of Erector Set parts so that each bit can be discussed in isolation.

Readers of this blog know that I generally do not scrutinize any one element of Exadata architecture, opting instead to treat it as a whole product. Some of my conversations at Rocky Mountain Oracle User Group Training Days 2009, earlier this month, reminded me of this topic. I heard one person at the conference say words to the effect of, “…Exadata is so fast because of Infiniband.” Actually, it isn’t. Exadata is not powerful because of any one of its parts. It is powerful because of the sum of its parts. That doesn’t mean I would side with EMC’s Chuck Hollis and his opinions regarding the rightful place of Infiniband in Exadata architecture. Readers might recall Chuck’s words in his post entitled I annoy Kevin Closson at Oracle:

The “storage nodes” are interconnected to “database nodes” via Infiniband, and I questioned (based on our work in this environment) whether this was actually a bottleneck being addressed, or whether it was a bit of marketing flash in a world where multiple 1Gb ethernet ports seem to do as well.

Multiple GbE ports? Oh boy. Sure, if the storage array is saturated at the storage processor level (common) or at the back-end loop level (more common yet), I know for a fact that offering up-wind Infiniband connectivity to hosts is a waste of time. Maybe that is what he is referring to? I don’t know, but none of that has anything to do with Exadata because Exadata suffers no such head saturation.

Dazzle Them With the Speeds and Feeds
Oracle database administrators sometimes get annoyed when people throw speed and feed specifications at them without following up with real-world examples demonstrating the benefit to a meaningful Oracle DW/BI workload. I truly hope there are no DBAs who care about wire latencies for Reliable Datagram Sockets requests any more than, say, what CPUs or Operating System software is embedded in the conventional storage array they use for Oracle today. Some technology details do not stand on their own merit.

Infiniband, however, offers performance benefits in the Exadata Storage Server architecture that are both tangible and critical. Allow me to offer an example of what I mean.

Read From Memory, Write to Redundant ASM Storage
I was recently analyzing some INSERT /*+ APPEND */ performance attributes on a system using Exadata Storage Server for the database. During one of the tests I decided that I wanted to create a scenario where the “ingest” side suffered no physical I/O. To do this I created a tablespace in an Ext3 filesystem file and set filesystemio_options=asynch to ensure I would not be served by direct I/O. I wanted the tables in the datafile cached in the Linux page cache.
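For anyone who wants to reproduce the general setup, here is a minimal sketch. The datafile path, tablespace name and size are invented for illustration, and the filesystemio_options change requires an instance restart to take effect.

-- Sketch only: path, tablespace name and size are hypothetical
ALTER SYSTEM SET filesystemio_options = 'ASYNCH' SCOPE=SPFILE;
-- ...restart the instance, then place a tablespace on the Ext3 filesystem:
CREATE TABLESPACE seed_ts
  DATAFILE '/u01/ext3/seed_ts.dbf' SIZE 20G
  EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO;

With asynch (as opposed to setall or directio) the datafile remains eligible for buffered reads, which is exactly what keeps the SEED table in the Linux page cache.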

The test looped an INSERT /*+ APPEND */ SELECT * FROM command several times since I did not have enough physical memory to cache a large tablespace in the Ext3 filesystem. The cached table was 18.5GB and the workload consisted of 10 executions of the INSERT command using Parallel Query Option, so the target table grew to roughly 185GB during the test. Since the target table resided on Exadata Storage Server ASM disks with normal redundancy, the downwind write payload was 370GB. That is, the test read 185GB from memory and wrote 370GB to Exadata storage.
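The ingest loop itself was simple. A hedged sketch follows; the object names come from the AWR output below, but the degree of parallelism and the loop mechanics are assumptions, not the actual test harness.

-- Sketch only: the DOP and the unrolled loop are assumptions
ALTER SESSION ENABLE PARALLEL DML;
-- executed 10 times to grow the target to roughly 185GB:
INSERT /*+ APPEND PARALLEL(t, 16) */ INTO all_card_trans t
SELECT /*+ PARALLEL(s, 16) */ * FROM seed s;
COMMIT;   -- a direct-path insert must be committed before the table is touched again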

The entire test was isolated to a single database host in the database grid, but all 14 Exadata Storage Servers of the HP Oracle Database Machine were being written to.

After running the test once, to prep the cache, I dropped and recreated the target table and ran the test again. The completion time was 307 seconds for a write throughput of 1.21GB/s.

During the run AWR tracked 328,726 direct path writes:

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                          Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
direct path write                   328,726       3,078      9   64.1 User I/O
DB CPU                                            1,682          35.0
row cache lock                        5,528          17      3     .4 Concurrenc
control file sequential read          2,985           7      2     .1 System I/O
DFS lock handle                         964           4      4     .1 Other

…and 12,143,325 direct physical reads from the SEED tablespace which, of course, resided fully cached in the Ext3 filesystem:

Segments by Direct Physical Reads        DB/Inst: TEST/test1  Snaps: 1891-1892
-> Total Direct Physical Reads:      12,143,445
-> Captured Segments account for  100.0% of Total

           Tablespace                      Subobject  Obj.        Direct
Owner         Name    Object Name            Name     Type         Reads  %Total
---------- ---------- -------------------- ---------- ----- ------------ -------
TEST       SEED       SEED                            TABLE   12,143,325  100.00

AWR also showed that the same number of blocks read were also written:

Segments by Direct Physical Writes       DB/Inst: TEST/test1  Snaps: 1891-1892
-> Total Direct Physical Writes:      12,143,453
-> Captured Segments account for  100.0% of Total

           Tablespace                      Subobject  Obj.        Direct
Owner         Name    Object Name            Name     Type        Writes  %Total
---------- ---------- -------------------- ---------- ----- ------------ -------
TEST       CARDX      ALL_CARD_TRANS                  TABLE   12,139,301   99.97
SYS        SYSAUX     WRH$_ACTIVE_SESSION_ 70510_1848 TABLE            8     .00
          -------------------------------------------------------------

During the test I picked up a snippet of vmstat(1) from the database host. Since this is Exadata storage you’ll see essentially nothing under the bi or bo columns, as those track local physical I/O. Exadata I/O is sent via the Reliable Datagram Sockets protocol over Infiniband (iDB) to the storage servers, and, of course, the source table was fully cached. Although processor utilization was a bit erratic, the peaks in user mode were on the order of 70% and kernel mode roughly 17%. I could have driven the rate up a little with a higher DOP, but all told this is a great throughput rate at a reasonable CPU cost. I did not want processor saturation during this test.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0 126848 3874888 413376 22810008    0    0    31   128   18   16 31  3 66  0  0
 0  0 126848 3845352 413376 22810428    0    0     1    60 1648 5361 41  5 54  0  0
13  0 126848 3836192 413380 22810460    0    0     2    77 1443 6217  8  2 90  0  0
 0  0 126848 3855868 413388 22810456    0    0     1   173 1459 4870 15  2 82  0  0
27  0 126848 3797260 413392 22810840    0    0    10    47 7750 5769 53 12 36  0  0
 6  0 126848 3814064 413392 22810936    0    0     1    44 11810 10695 59 16 25  0  0
 7  0 126848 3804336 413392 22810956    0    0     2    72 12128 8498 76 18  6  0  0
16  0 126848 3810312 413392 22810988    0    0     1    83 11921 11085 59 18 23  0  0
11  0 126848 3819564 413392 22811872    0    0     2   134 12118 7760 74 16 10  0  0
10  0 126848 3804008 413392 22819644    0    0     1    19 11873 10544 59 17 24  0  0
 2  0 126848 3789324 413392 22827004    0    0     2    92 11933 7753 74 17 10  0  0
39  0 126848 3759416 413396 22839640    0    0     9    39 10625 8992 54 14 32  0  0
 6  0 126848 3766288 413400 22845200    0    0     2    40 12061 9121 70 17 13  0  0
35  0 126848 3739296 413400 22853452    0    0     2   161 12124 8447 66 16 18  0  0
11  0 126848 3761772 413400 22860092    0    0     1    38 12011 9040 67 16 17  0  0
26  0 126848 3724364 413400 22868624    0    0     2    60 12489 9105 69 17 14  0  0
11  0 126848 3734572 413400 22877056    0    0     1    85 12067 9760 64 17 19  0  0
21  0 126848 3713224 413404 22883044    0    0     2   102 11794 8454 72 17 11  0  0
14  0 126848 3726168 413408 22895800    0    0     9     2 10025 9538 51 14 35  0  0
13  0 126848 3706280 413408 22902448    0    0     2    83 12133 7879 75 17  8  0  0
 7  0 126848 3707200 413408 22909060    0    0     1    61 11945 10135 59 17 24  0  0

That was an interesting test. I can’t talk about the specifics of why I was studying this stuff, but I liked driving a single DL360 G5 server to 1.2GB/s write throughput. That got me thinking: what would happen if I put the SEED table in Exadata?

Read from Disk, Write to Disk.
I performed a CTAS (Create Table As Select) to create a copy of the SEED table (same storage clause, etc.) in the Exadata Storage Servers. The model I wanted to test next was: read the 185GB from Exadata while writing the 370GB right back to the same Exadata Storage Server disks. This test increased physical disk I/O by 50% while introducing latency (I/O service time) on the ingest side. Remember, the SEED table I/O in the Ext3 model enjoyed read service times at RAM speed (3 orders of magnitude faster than disk).
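The CTAS was nothing exotic. A hedged sketch follows; the names match the AWR output below, but NOLOGGING and the parallel degree are assumptions.

-- Sketch only: parallel degree and NOLOGGING are assumptions
CREATE TABLE ex_seed
  TABLESPACE ex_seed
  PARALLEL 16 NOLOGGING
  AS SELECT * FROM seed;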

So, I ran the test and of course it was slower now that I had increased the physical I/O payload by 50% and introduced I/O latency on the ingest side. It was, in fact, 4% slower: a job completion time of 320 seconds! Yes, 4%.

The AWR report showed the same direct path write cost to the target table:

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
direct path write                   365,950       3,296      9   66.2 User I/O
DB CPU                                            1,462          29.4
cell smart table scan               238,027         210      1    4.2 User I/O
row cache lock                        5,473           6      1     .1 Concurrenc
control file sequential read          3,087           6      2     .1 System I/O

…and reads from the tablespace called EX_SEED (which resides in Exadata storage) were on par with the volume read from the cached Ext3 tablespace:

Segments by Direct Physical Reads        DB/Inst: TEST/test1  Snaps: 1923-1924
-> Total Direct Physical Reads:      12,143,445
-> Captured Segments account for  100.0% of Total

           Tablespace                      Subobject  Obj.        Direct
Owner         Name    Object Name            Name     Type         Reads  %Total
---------- ---------- -------------------- ---------- ----- ------------ -------
TEST       EX_SEED    EX_SEED                         TABLE   12,142,160   99.99

…and the writes to the all_card_trans table were on par with the test conducted using the cached Ext3 SEED table:

Segments by Physical Writes              DB/Inst: TEST/test1  Snaps: 1923-1924
-> Total Physical Writes:      12,146,714
-> Captured Segments account for  100.0% of Total

           Tablespace                      Subobject  Obj.      Physical
Owner         Name    Object Name            Name     Type        Writes  %Total
---------- ---------- -------------------- ---------- ----- ------------ -------
TEST       CARDX      ALL_CARD_TRANS                  TABLE   12,144,831   99.98

Most interesting, though not surprising, was the processor utilization profile (see the following box). In spite of performing 50% more physical disk I/O, kernel-mode cycles were reduced by nearly half. It is, of course, more processor intensive (in kernel mode) to perform reads from an Oracle tablespace cached in an Ext3 file than to read data from Exadata, because inbound I/O from Exadata storage is DMAed directly into the process address space without any copies. I/O from a cached Ext3 file, on the other hand, requires kernel-mode memcpys from the page cache into the address space of the Oracle process reading the file. However, since the workload was not bound by CPU saturation, the memcpy overhead was not a limiting factor. But that’s not all.

I’d like to draw your attention to the user-mode cycles. Notice how the user-mode cycles never peak above the low-fifties percent mark? The cached Ext3 test, on the other hand, exhibited high-sixties and low-seventies percent user-mode processor utilization. That is a crucial differentiator. Even when I/O is being serviced at RAM speed, the code required to deal with conventional I/O (issuing and reaping, etc.) is significantly more expensive (CPU-wise) than interfacing with Exadata Storage.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 374788 2214348  62904 24535308    0    0    26   103   18    3 31  3 66  0  0
 0  0 374788 2214868  62904 24535320    0    0     2    79 1259 4232 19  2 79  0  0
 0  0 374788 2215756  62904 24535328    0    0     1   148 1148 4894  0  1 99  0  0
 0  0 374788 2215788  62904 24535328    0    0     2     1 1054 3983  0  0 100  0  0
13  0 374788 2203004  62904 24535540    0    0    17   142 8044 9675 31  6 63  0  0
 8  0 374788 2187292  62904 24535544    0    0     2     1 12856 13948 53  8 40  0  0
 6  0 374788 2191432  62904 24535584    0    0     1    69 12864 13904 53  7 40  0  0
 6  0 374788 2194812  62904 24535584    0    0     2    35 12948 13954 54  8 39  0  0
 7  0 374788 2194284  62904 24535592    0    0     1     2 12940 13605 53  7 40  0  0
 3  0 374788 2193988  62904 24535596    0    0     2    22 12849 13329 52  7 41  0  0
 9  0 374788 2192988  62904 24535604    0    0     1     2 12797 13555 52  7 41  0  0
10  0 374788 2193788  62908 24537648    0    0    10    60 11497 13184 46  8 46  0  0
 7  0 374788 2191976  62912 24537656    0    0     1   146 12990 14248 52  8 40  0  0
 5  0 374788 2190668  62912 24537668    0    0     2     1 12974 13401 52  7 41  0  0
 3  0 374788 2191492  62912 24537676    0    0     1    16 12886 13495 52  7 41  0  0
 6  0 374788 2191292  62912 24537676    0    0     2     1 12825 13736 52  8 40  0  0
 4  0 374788 2190768  62912 24537688    0    0     1    89 12832 14500 53  8 40  0  0
 7  0 374788 2189928  62912 24537688    0    0     2     2 12849 14015 52  8 40  0  0
11  0 374788 2190320  62916 24539712    0    0     9    36 11588 12411 46  7 47  0  0
17  0 374788 2189496  62916 24539744    0    0     2    40 12948 13373 52  7 41  0  0
 6  1 374788 2189032  62916 24539748    0    0     1     2 12932 13899 53  7 39  0  0

Ten Pounds of Rocks in a 5-Pound Bag?
For those who did the math and determined that the 100% Exadata case was pushing a combined read+write throughput of 1730MB/s through the Infiniband card, you did the math correctly.

The IB cards in the database tier of the HP Oracle Database Machine each support a maximum theoretical throughput of 1850MB/s in full-duplex mode. This workload has an “optimal” (serendipitous really) blend of read traffic mixed with write traffic so it drives the card to within 6% of its maximum theoretical throughput rating.

We don’t suggest that this sort of throughput is achievable with an actual DW/BI workload. We prefer the more realistic, conservative 1.5GB/s (consistently measured with complex, concurrent queries) when setting expectations. That said, even if there were a DW/BI workload that demanded this sort of blend of reads and writes (i.e., a workload with a heavy blend of sort-spill writes), the balance between the database tier and storage tier is not out of whack.

The HP Oracle Database Machine sports a storage grid bandwidth of 14GB/s, so even this workload extrapolated to all 8 nodes in the database grid would still fit within that range, since 1730MB/s * 8 RAC nodes == 13.8GB/s.

I like balanced configurations.

Summary
This test goes a long way toward showing the tremendous efficiencies in the Exadata architecture. Indeed, both tests had to “lift” the same amount of SEED data and produce the same amount of downwind write I/O. Everything about the SQL layer remained constant. For that matter, most everything remained constant between the two models with the exception of the lower-level read-side I/O from the cached file in the Ext3 case.

The Exadata test case performed 50% more physical I/O and did so with roughly 28% less user-mode processor utilization and a clean 50% reduction in kernel-mode cycles while coming within 4% of the job completion time achieved by the cached Ext3 SEED case.

Infiniband is not just a marketing ploy. Infiniband is an important asset in the Exadata Storage Server architecture.

Disk Drives: They’re Not as Slow as You Think! Got Junk Science?

I was just taking a look at Curt Monash’s TDWI slide set entitled How to Select an Analytic DBMS when I got to slide 5 and noticed something peculiar. Consider the following quote:

Transistors/chip:  >100,000 since 1971

Disk density: >100,000,000 since 1956

Disk speed: 12.5 since 1956

Disk Speed == Rotational Speed?
The slide was offering a comparison of “disk speed” from 1956 and CPU transistor count from 1971 to the present. I accept the notion that processors have outpaced disk capabilities in that time period, no doubt! However, I think too much emphasis is placed on disk rotational speed and not enough on the 100 million-fold increase in density. The topic at hand is DW/BI, and I don’t think that much attention should be given to rotational delay. I’m not trying to read into Curt’s message here because I wasn’t in the presentation, but it provides food for thought. Are disks really that slow?

Are Disks Really That Slow?
Instead of comparing modern drives to the prehistoric “winchester” drive of 1956, I think a better comparison would be to the ST-506, which is the father of modern disks. The ST-506 of 1984 would have found itself paired to an Intel 80286 in the PC of the day. Comparing transistor count from the 80286 to a “Harpertown” Xeon yields an increase of 3280-fold and a clock speed improvement of 212-fold. The ST-506 (circa 1984) had a throughput capability of 625KB/s. Modern 450GB SAS drives can scan at 150MB/s, an improvement of roughly 245-fold. When considered in these terms, hard drive throughput and CPU clock speed have seen surprisingly similar increases in capability. Of course Intel is cramming 3280x more transistors into a processor these days, but read on.

The point I’m trying to make is that disks haven’t lagged as far behind CPUs as is sometimes portrayed. In fact, I think the refrigerator-cabinet array manufacturers disingenuously draw attention to things like rotational delay in order to detract from the real bottleneck, which is the flow of data from the platters through the storage processors to the host. This bottleneck is built into modern storage arrays and is felt all the way through the host bus adaptors. Let’s not punish ourselves by mentioning the plumbing complexities of storage networking models like Fibre Channel.

Focus on Flow of Data, Not Spinning Speed.
Oracle Exadata Storage Server, in the HP Oracle Database Machine offering, configures 1.05 processor cores per hard drive (176:168). Even if I lump Flash SSD into the mix (about a 60% increase in scan throughput over round, brown spinning disks) it doesn’t really change that much (i.e., not by orders of magnitude).

Junk Science? Maybe.
So, am I just throwing out the 3280x increase in transistor count I mentioned? No, but I think when we compare the richness of processing that occurs on data coming off disk in today’s world (e.g., DW/BI) to the 80286/ST-506 days (e.g., VisiCalc, a 26KB executable), the transistor count gets factored out. So we are left with 245-fold disk performance gains and 212-fold CPU clock gains. Is it a total coincidence that a good ratio of DW/BI CPU to disk is about 1:1? Maybe not. Maybe this is all just junk science. If so, we should all continue connecting as many disks to the back of our conventional storage arrays as they will support.

Summary
Stop bottlenecking your disk drives. Then, and only then, will you be able to see just how fast they are and whether you have a reasonable ratio of CPU to disk for your DW/BI workload.

Addition to my Blog Roll: Real World Technologies.

I added Real World Technologies to my blog roll. The site is loaded with good information for professionals with a commodity computing systems mindset. I love it!

I’m Ready! I’ve Read the Exadata Documentation. Join Me for the Web Seminar!

I’m so flattered! I just got a call from corporate, and it seems they want me to join the Winter Corporation Web Seminar tomorrow so I can help out with the question and answer segment. As the Performance Architect on the product, I guess I’d better study up on my notes right quick 🙂

Joking aside, please follow this link and sign up for the event. It should be informative. At the end, you can play “stump the host” too 🙂

Please feel free to read the whitepaper in advance.

Intel Hyperthreading Does Little for Oracle Workloads. Who Cares? I Want My Xeon 5500 CPU with Simultaneous Multi-Threading (SMT)!

Two years ago I was working on a series of blog threads about Oracle on AMD Opteron processors. I made it perfectly clear at that time that I was a total fanboi of AMD dating back to my first experience with Opterons in the HyperTransport 1.0 timeframe. I had a wide variety of hardware in those days. That was then, this is now.

I’ve not yet had personal experience with the Xeon 5500 (Nehalem-EP) processors, but I’m chomping at the bit to do so. I blogged about Nehalem with CSI interconnect technology nearly two years ago. I am a patient man.

I’m very excited about these processors as they represent the most significant technology leap in Intel processors since the jump from Pentium to Pentium Pro (the first Intel MCM CPU). But, all that aside, what does it mean for real workloads? From what I heard first hand from engineers on HP’s ISS team, it looks like this processor offers contentious workloads like Oracle a doubling of throughput on a core-for-core basis. After I heard that I started digging for independent measurements to back it up.

Although this Nehalem SAP Sales and Distribution benchmark result was achieved with Microsoft SQL Server, I know enough about the SD test to know that it is a very contentious workload that is difficult to scale. I can’t say the boost will map one-for-one to Oracle, but I wouldn’t be surprised if it did. I like this result because it is nearly apples-to-apples and shows a 100% performance increase over the Xeon 5400 “Harpertown.” And Harpertown CPUs are no slouches.

Systems based on Xeon 5500 processors are going to mow through Oracle workloads very nicely! As for packaging, I believe most servers are going to come in 2-socket/8-core and 4-socket/16-core configurations at first.

The other thing I like about these CPUs is the emergence of functional multithreading. Since these are NUMA systems it will be important to have processors that can get useful work done while a thread is stalled on remote memory. Not to be confused with Hyper-Threading of the NetBurst era, which didn’t do much if anything for Oracle workloads, the Nehalem (S)imultaneous (M)ulti-(T)hreading feature has proven helpful in a wide variety of workloads, as this comprehensive paper shows.

Exciting!

Little Things Doth Crabby Make Part VI. Oracle Database 11g Automatic Storage Management Doesn’t Work. Exadata Requires ASM So Exadata Doesn’t Work.

I met someone at Rocky Mountain Oracle User Group Training Days 2009 who mentioned that they enjoyed my Little Things Doth Crabby Make series (found here). I was reminded of that this morning as I suffered the following Oracle Database 11g Automatic Storage Management (ASM) issue:


$ sqlplus '/ as sysdba'
SQL*Plus: Release 11.1.0.7.0 - Production on Thu Feb 19 09:33:53 2009
Copyright (c) 1982, 2008, Oracle.  All rights reserved.
Connected to an idle instance.

SQL> startup

ASM instance started
Total System Global Area  283930624 bytes
Fixed Size                  2158992 bytes
Variable Size             256605808 bytes
ASM Cache                  25165824 bytes

ORA-15032: not all alterations performed
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "DATA2"
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "DATA1"

Ho hum. I know the disks are there. I’ve just freshly configured this system. After all, this is Exadata, and configuring ASM to use Exadata couldn’t be easier: you simply list the IP addresses of the Exadata Storage Servers in a text configuration file. No more ASMLib sort of stuff. Just point and go.
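If memory serves, the file in question is cellip.ora. A hedged example follows; the path and syntax are from memory, so treat this as a sketch rather than gospel, and the IP address is simply the single test cell visible in the kfod output further down.

# /etc/oracle/cell/network-config/cellip.ora -- sketch, one line per Exadata cell
cell="192.168.50.32"

ASM then discovers the grid disks through an asm_diskstring such as 'o/*/*'.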


SQL> select count(*) from v$asm_disk;

COUNT(*)
----------
24

See, even ASM agrees with me. I set up 12 disks for each diskgroup and voilà, there they are.

KFOD
There is even a nice little command line tool that ships with Oracle Database 11g 11.1.0.[67] that reports what Exadata disks are discovered. It shows that I have 12 Exadata “griddisks” (ASM disks really) of 20GB and another 12 of 200GB, all within a single Exadata Storage Server (for testing purposes). Note, it also reports a list of the other ASM instances in the database grid.

$ kfod -disk all
--------------------------------------------------------------------------------
 Disk          Size Path                                     User     Group
================================================================================
 1:      20480 Mb o/192.168.50.32/data1_CD_10_cell06       <unknown> <unknown>
 2:      20480 Mb o/192.168.50.32/data1_CD_11_cell06       <unknown> <unknown>
 3:      20480 Mb o/192.168.50.32/data1_CD_12_cell06       <unknown> <unknown>
 4:      20480 Mb o/192.168.50.32/data1_CD_1_cell06        <unknown> <unknown>
 5:      20480 Mb o/192.168.50.32/data1_CD_2_cell06        <unknown> <unknown>
 6:      20480 Mb o/192.168.50.32/data1_CD_3_cell06        <unknown> <unknown>
 7:      20480 Mb o/192.168.50.32/data1_CD_4_cell06        <unknown> <unknown>
 8:      20480 Mb o/192.168.50.32/data1_CD_5_cell06        <unknown> <unknown>
 9:      20480 Mb o/192.168.50.32/data1_CD_6_cell06        <unknown> <unknown>
 10:      20480 Mb o/192.168.50.32/data1_CD_7_cell06        <unknown> <unknown>
 11:      20480 Mb o/192.168.50.32/data1_CD_8_cell06        <unknown> <unknown>
 12:      20480 Mb o/192.168.50.32/data1_CD_9_cell06        <unknown> <unknown>
 13:     204800 Mb o/192.168.50.32/data2_CD_10_cell06       <unknown> <unknown>
 14:     204800 Mb o/192.168.50.32/data2_CD_11_cell06       <unknown> <unknown>
 15:     204800 Mb o/192.168.50.32/data2_CD_12_cell06       <unknown> <unknown>
 16:     204800 Mb o/192.168.50.32/data2_CD_1_cell06        <unknown> <unknown>
 17:     204800 Mb o/192.168.50.32/data2_CD_2_cell06        <unknown> <unknown>
 18:     204800 Mb o/192.168.50.32/data2_CD_3_cell06        <unknown> <unknown>
 19:     204800 Mb o/192.168.50.32/data2_CD_4_cell06        <unknown> <unknown>
 20:     204800 Mb o/192.168.50.32/data2_CD_5_cell06        <unknown> <unknown>
 21:     204800 Mb o/192.168.50.32/data2_CD_6_cell06        <unknown> <unknown>
 22:     204800 Mb o/192.168.50.32/data2_CD_7_cell06        <unknown> <unknown>
 23:     204800 Mb o/192.168.50.32/data2_CD_8_cell06        <unknown> <unknown>
 24:     204800 Mb o/192.168.50.32/data2_CD_9_cell06        <unknown> <unknown>
--------------------------------------------------------------------------------
ORACLE_SID ORACLE_HOME
================================================================================
 +ASM1 /u01/app/oracle/product/db
 +ASM2 /u01/app/oracle/product/db
 +ASM3 /u01/app/oracle/product/db

I know why ASM is trying to mount these diskgroups: I set the parameter file to direct it to do so.


SQL> show parameter asm_diskgroups;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskgroups                       string      DATA1, DATA2

I suppose I should get some information about the diskgroups. How about names first:


SQL> select name from v$asm_diskgroup;
no rows selected

SQL> host date
Thu Feb 19 09:37:25 PST 2009

Idiot! When you have several configurations “stewing” it is quite easy to miss a step. Today that step seems to be forgetting to actually create the diskgroups before asking ASM to mount them.


SQL> startup force

ASM instance started
Total System Global Area  283930624 bytes
Fixed Size                  2158992 bytes
Variable Size             256605808 bytes
ASM Cache                  25165824 bytes

ORA-15032: not all alterations performed
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "DATA2"

SQL>  select name from v$asm_diskgroup;

NAME
------------------------------
DATA1

SQL> host date
Thu Feb 19 09:44:21 PST 2009

Magic. I created the DATA1 diskgroup in a separate xterm and did a STARTUP FORCE.
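For the record, here is a hedged sketch of the step I skipped, using the grid disk names visible in the kfod output above. The redundancy choice and the wildcard pattern are assumptions for this single-cell test setup, not the exact command I typed that day.

-- Sketch only: redundancy and the discovery pattern are assumptions
CREATE DISKGROUP DATA1 EXTERNAL REDUNDANCY
  DISK 'o/192.168.50.32/data1_CD_*_cell06';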

Summary
Stupidity is one of those little things that doth crabby make. And, yes, the title of this blog post was a come-on. Who knows, though, someday a flustered googler may end up feeling crabby and stupid (like I do now) 🙂 after finding this worthless post.

Another Web Seminar About Exadata. This One Covers the Winter Corporation Report on Exadata Performance.

According to this post on blogs.oracle.com, Information Management is hosting a Web Seminar on February 25, 2009 covering the Winter Corporation findings in a recent Exadata proof of concept.

The signup page for the event is here.

I’ll be attending. I’m always curious about what people are saying when they do these things. Go ahead, sign up and join me.

Oracle Database 11g Versus Orion. Orion Gets More Throughput! Death To Oracle Database 11g!

Several readers sent in email questions after reading the Winter Corporation Paper about Exadata I announced in a recent blog entry. I thought I’d answer one in this quick blog entry.

The reader’s email read as follows (edited only to remove the SAN vendor’s name and fix a couple of typos):

I have read through the Winter Corporation paper regarding Exadata, but is it OK for me to ask a question? We have an existing data warehouse on a 4-node 10g RAC cluster attached to a [ brand name removed ] SAN array by 2 active 4Gb ports on each RAC node. When we test with Orion we see nearly 2.9 gigabytes per second throughput, but with RDBMS queries we never see more than about 2 gigabytes per second throughput except in the select count(*) situation. With select count(*) we see about 2.5GB/s. Why is this?

Think Plumbing First

It is always best to focus first on the plumbing, and then on the array controller itself. After all, if the supposed maximum theoretical throughput of an array is on the order of 3GB/s, but the servers are connected to the array with limited connectivity, that bandwidth is unrealizable. In this case, 2 active 4GFC HBAs per RAC node provide sufficient connectivity for this particular SAN array. Remember, I deleted the SAN brand. The particular brand cited by the reader is most certainly limited to 3GB/s (I know the brand and model well), but no matter, because the 2 active 4GFC paths to each of the 4 RAC nodes limit I/O to an aggregate of 3.2GB/s sustained read throughput no matter what kind of SAN it is. This is actually a case of a well-balanced server-to-storage configuration, and I pointed that out in a private email to the blog reader who sent me this question. But what about the reader’s question?

Orion is a Very Light Eater

The reason the reader is able to drive storage at approximately 2.9GB/s with Orion is that Orion does nothing with the data being read from disk. As I/Os complete it simply issues more. We sometimes call this lightweight I/O testing because the code doesn’t touch the data being read from disk. Indeed, even the dd(1) command can drive storage at maximum theoretical I/O rates with a command like dd if=/dev/sdN of=/dev/null bs=1024k. A dd(1) command like this does not touch the data being read from disk.
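For what it’s worth, the same lightweight pattern scales trivially across several LUNs. The device names below are placeholders; substitute your own.

# Lightweight I/O only: the data is read and immediately discarded, never touched.
# Device names are hypothetical -- one dd per LUN approximates aggregate plumbing bandwidth.
for dev in /dev/sdb /dev/sdc /dev/sdd /dev/sde
do
  dd if=$dev of=/dev/null bs=1024k &
done
wait   # this measures the plumbing limit, not realistic query throughput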

SELECT COUNT(*)

The reason the reader sees Oracle driving storage at 2.5GB/s with a SELECT COUNT(*) is that when such a query reads blocks from disk, only a few bytes of each disk block are loaded into the processor caches. Indeed, Oracle doesn’t have to touch every row piece in a block to know how many rows the block contains; there is summary information in the header of the block that speeds up row counting. When code references just one byte of data in an Oracle block after it is read from disk, the processor causes the memory controller to load 64 bytes (on x86_64 CPUs) into the processor cache. Anything in that 64-byte “line” can then be accessed for “free” (meaning additional loads from memory are not needed). Accessing any other 64-byte line in the Oracle block causes a subsequent memory line to be installed into the processor cache. While the CPU is waiting for a line to be loaded it is in a stalled state, which is accounted for as user-mode cycles charged to the Oracle process referencing the memory. The more work processes do with the blocks being read from disk, the higher processor utilization climbs and, eventually, I/O throughput drops. This is why the reader stated that they see about 2GB/s when Oracle is presumably doing “real queries”, such as those which perform filtration, projection, joins, sorting, aggregation and so forth. The reader didn’t state processor utilization for the queries seemingly limited to 2GB/s, but it stands to reason they were more complex than the SELECT COUNT(*) test.

Double Negatives for Fun and Learning Purposes

You too can see what I’m talking about by running a SELECT that ingests dispersed columns and all rows after performing zero-effect filtration, such as the following 16-column table example:

SELECT AVG(LENGTH(col1)), AVG(LENGTH(col8)), AVG(LENGTH(col16)) FROM TABX WHERE col1 NOT LIKE '%NEVER' AND col8 NOT LIKE '%NEVER' AND col16 NOT LIKE '%NEVER';

This test presumes columns 1, 8 and 16 never end with the string 'NEVER'. Observe the processor utilization when running this sort of query and compare it to a simple SELECT COUNT(*) of the same table, as sketched below.
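The lightweight counterpart for the comparison is simply a count of the same hypothetical TABX table; watch CPU while each statement runs (vmstat on the host, or a pair of AWR snapshots, will do).

-- Lightweight comparison: row counts come largely from block header summaries,
-- so very few bytes per block are ever touched by the CPU
SELECT COUNT(*) FROM TABX;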

Other Factors?

Sure, the reader’s throughput difference between the SELECT COUNT(*) and Orion could be related to tuning issues (e.g., Parallel Query Option degree of parallelism). However, in my experience achieving about 83% of maximum theoretical I/O with SELECT COUNT(*) is pretty good. Further, the reader’s complex queries achieved about 66% of maximum theoretical I/O throughput, which is also quite good when using conventional storage.

What Does This Have To Do With Exadata?

Exadata offloads predicate filtering and column projection (amongst many other value propositions). Even this silly example has processing that can be offloaded to Exadata, such as the filtration that filters out no rows and the cost of projecting columns 1, 8 and 16. The database host spends no cycles on the filtration or projection. It just performs the work of the AVG() and LENGTH() functions.

I didn’t have the heart to point out to the reader that 3GB/s is the least amount of throughput available when using Exadata with Real Application Clusters (RAC). That is, with RAC the fewest number of Exadata Storage Servers supported is 3, and there’s no doubt that 3 Exadata Storage Servers do indeed offer 3GB/s of query I/O throughput. In fact, as the Winter Corporation paper shows, Exadata is able to perform at maximum theoretical I/O throughput even with complex, concurrent queries because there is 2/3rds of a Xeon 54XX “Harpertown” processor core for each disk drive offloading processing from the database grid.

So, while Orion is indeed a “light eater”, Exadata is quite ravenous.

Kevin Closson Promotes Netezza? That’s Odd!

AdSense Nonsense

I don’t have a screen-shot for verification, but a blog reader sent me email notifying me that WordPress (the site that hosts my blog) is letting Google AdSense put advertisements for Netezza on my blog posts. The reader thought I was the one doing the AdSense, but I assured him that it is not I. My blogging is a non-profit effort. It is WordPress, and the AdSense nonsense is the “pay” for using a “free” site.

There’s No Such Thing as a Free Lunch

I knew WordPress did this sort of thing on occasion, but I had totally put it out of mind…until now. So, have no fear readers, I just used some of my very own bier money to pay WordPress so you don’t have to see any ads from Oracle’s competitors any more!

Announcing a Winter Corporation Paper About Oracle Exadata Storage Server

Winter Corporation has posted a paper covering a recent Exadata proof-of-concept testing exercise. Highlights of the paper include evidence of concurrent, moderately complex queries being serviced by Exadata at advertised disk throughput rates. The paper can be found at the following link:

Measuring the Performance of the Oracle Exadata Storage Server

There is also a copy of the paper on oracle.com at this link: Measuring the Performance of the Oracle Exadata Storage Server and a copy in the Wayback Machine in case it ages out of oracle.com.

Quotable Quotes
I’d like to draw attention to the following quote:

14 Gigabytes per second is a rate that can be achieved with some conventional storage arrays — only in dedicated large-scale enterprise arrays, that would require multiple full-height cabinets of hardware – and would therefore entail more space, power, and cooling than the HP Oracle Database Machine we tested here. Additionally, with established storage architectures, Oracle cannot offload any processing to the storage tier, therefore the database tier would require substantially more hardware to achieve a rate approaching 14 GB/second.

Yes, it may be possible to connect enough conventional storage to drive query disk throughput at 14GB/s, but the paper correctly points out that since there is no offload processing with conventional storage, the database grid would require substantially more hardware than is assembled in the HP Oracle Database Machine. One would have to start from the ground up, as it were. By that I mean a database grid capable of simply ingesting 14GB/s would need 35 active 4Gb FC host bus adaptors. That requires a huge database grid.

If I could meet the guys (that would be me) that worked on this proof of concept, I’d love to ask them what storage grid processor utilization was measured at the point where storage was at peak throughput and performing the highest degree of storage processing complexity. Now that would be a real golden nugget. One thing is for certain: there was enough storage processor bandwidth to perform the Smart Scans, which consist of applying WHERE predicates, performing column projection and executing bloom filtration in storage. Moreover, the test demonstrated ample storage processor bandwidth to execute Smart Scan processing while blocks of data were whizzing by at the rate of 1GB/s per Oracle Exadata Storage Server. Otherwise, the paper wouldn’t be there.

Maybe 1.0476 processor cores per hard drive (176:168) will become the new industry standard for the optimal processor-to-disk ratio in DW/BI solutions.

A Quick Tip About Orion

In the comment thread of one of my old posts about Orion a reader posted an example of a problem he was having with the tool.

# ./orion10.2_linux -run simple -testname mytest -num_disks 1
ORION: ORacle IO Numbers -- Version 10.2.0.1.0
Test will take approximately 9 minutes
Larger caches may take longer

storax_skgfr_openfiles: File identification failed: /mnt/1201879371/Orion, error 4
storax_skgfr_openfiles: File identification failed on /mnt/1201879371/Orion
OER 27054, Error Detail 0: please look up error in Oracle documentation
rwbase_lio_init_luns: lun_openvols failed
rwbase_rwluns: rwbase_lio_init_luns failed
orion_thread_main: rw_luns failed
Non test error occurred
Orion exiting

Carry on Wayward Googler

I didn’t work the problem with the poster of this issue, but I do know the answer, and for the sake of any future wayward googlers I’ll post the solution. The solution is to set fs.aio-max-nr to 4194304 in the sysctl.conf file. The value can be changed through the /proc interface as well:

# echo 4194304 > /proc/sys/fs/aio-max-nr
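To make the setting survive a reboot, the equivalent sysctl.conf entry (standard Linux administration, nothing Orion-specific) looks like this:

# append to /etc/sysctl.conf, then apply with: sysctl -p
fs.aio-max-nr = 4194304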

Announcement: An Exadata Webcast.

Juan Loaiza (SVP Oracle Systems Technologies Div) is offering a webcast this Wednesday. Here are the details:

Webcast: Extreme Performance for Your Data Warehouse
Wednesday, January 28th, 9:00 am PST

Data warehouses are tripling in size every two years, and supporting ever-larger databases with ever-increasing demands from business users to get “answers” faster requires a new approach to this challenge. Oracle Exadata overcomes the limitations of conventional storage by utilizing a massively parallel architecture to dramatically increase data bandwidth between database and storage servers. In this webcast, we’ll examine these limitations and demonstrate how Oracle Exadata delivers extremely fast and completely scalable, enterprise-ready systems.

The signup sheet is here:

January 28, 2009 Exadata Webcast


Counterpointing Beliefs? Not me! Nightmares of Gruesome Imaginary Coopetition!

My old friend Matt Zito of GridApp has made a blog entry entitled Where is Exadata. The post takes issue with some of the latest rants from Chuck Hollis of EMC. It seems Chuck thinks Oracle Exadata Storage Server is dead or dying based on his best guess of the field adoption rate over the approximately 2,880 hours Exadata has been in production.

I grew tired of Chuck’s musings on the topic weeks ago, so I didn’t want to call them out on my own. Matt makes a good point about the fact that Exadata requires Oracle Database 11g and that most sites take the sensibly cautious approach toward adopting software that is newer than what they currently have. I’m not saying that is any sort of mitigation for Chuck’s beliefs. I am saying that people like Matt are awake at the wheel, as it were.

Matt does correctly point out that nowhere in Exadata marketing literature is EMC called out by name as competition. To the contrary, Oracle and EMC share a tremendous install base; that doesn’t even need to be said. What Matt further points out is how seemingly paranoid Chuck is about Exadata. I too find it odd, but not sufficiently odd to make a blog entry specifically about the point. I, therefore, have not done so.


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.