Cached Ext3 File Access is Faster Than Infiniband. Infiniband is Just a Marketing Ploy. Who Needs Exadata?

Infiniband: Just an Exadata Storage Server Marketing Ploy
One phenomenon I’ve observed about Oracle Exadata Storage Server technology is the propensity of some folks to chop it up into a pile of Erector Set parts so that each bit can become its own topic for discussion.

Readers of this blog know that I generally do not scrutinize any one element of Exadata architecture, opting instead to treat it as a whole product. Some of my conversations at Rocky Mountain Oracle User Group Training Days 2009, earlier this month, reminded me of this topic. I heard one person at the conference say words to the effect of, “…Exadata is so fast because of Infiniband.” Actually, it isn’t. Exadata is not powerful because of any one of its parts. It is powerful because of the sum of its parts. That doesn’t mean I would side with EMC’s Chuck Hollis and his opinions regarding the rightful place of Infiniband in Exadata architecture. Readers might recall Chuck’s words in his post entitled I annoy Kevin Closson at Oracle:

The “storage nodes” are interconnected to “database nodes” via Infiniband, and I questioned (based on our work in this environment) whether this was actually a bottleneck being addressed, or whether it was a bit of marketing flash in a world where multiple 1Gb ethernet ports seem to do as well.

Multiple GbE ports? Oh boy. Sure, if the storage array is saturated at the storage processor level (common) or at the back-end loop level (more common yet), I know for a fact that offering up-wind Infiniband connectivity to hosts is a waste of time. Maybe that is what he is referring to? I don’t know, but none of that has anything to do with Exadata because Exadata suffers no head saturation.

Dazzle Them With the Speeds and Feeds
Oracle database administrators sometimes get annoyed when people throw speed and feed specifications at them without following up with real-world examples demonstrating the benefit to a meaningful Oracle DW/BI workload. I truly hope there are no DBAs who care about wire latencies for Reliable Datagram Sockets requests any more than, say, what CPUs or Operating System software is embedded in the conventional storage array they use for Oracle today. Some technology details do not stand on their own merit.

Infiniband, however,  offers performance benefits in the Exadata Storage Server architecture that are both tangible and critical. Allow me to offer an example of what I mean.

Read From Memory, Write to Redundant ASM Storage
I was recently analyzing some INSERT /*+ APPEND */ performance attributes on a system using Exadata Storage Server for the database. During one of the tests I decided that I wanted to create a scenario where the “ingest” side suffered no physical I/O. To do this I created a tablespace with its datafile in an Ext3 filesystem and set filesystemio_options=asynch to ensure I would not be served by direct I/O. I wanted the table in that datafile cached in the Linux page cache.
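
For readers who want to picture the setup, the following is a minimal sketch of what such a configuration could look like. The datafile path and size are hypothetical; only the SEED tablespace name and the filesystemio_options setting come from the test described above.

-- Minimal sketch only; the path and size are placeholders, not the actual test configuration.
-- Buffered (non-direct) asynchronous I/O so the Ext3 datafile is served from the Linux page cache:
ALTER SYSTEM SET filesystemio_options = 'ASYNCH' SCOPE=SPFILE;
-- (an instance restart is required for the new setting to take effect)

-- Tablespace backed by a file in an Ext3 filesystem rather than by Exadata/ASM storage:
CREATE TABLESPACE seed
  DATAFILE '/u01/ext3/seed01.dbf' SIZE 20G
  EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO;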

The test looped an INSERT /*+ APPEND */ … SELECT * FROM … statement several times, since I did not have enough physical memory to cache a sufficiently large tablespace in the Ext3 filesystem. The cached table was 18.5GB and the workload consisted of 10 executions of the INSERT command using Parallel Query Option. As a result, the target table grew to roughly 185GB during the test. Since the target table resided on Exadata Storage Server ASM disks, with normal redundancy, the downwind write payload was 370GB. That is, the test read 185GB from memory and wrote 370GB to Exadata storage.
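
The harness was nothing exotic; here is a hedged sketch of the ingest loop. The SEED and ALL_CARD_TRANS names come from the AWR excerpts below, but the degree of parallelism and the session settings shown are illustrative rather than the exact values used in the test.

-- Illustrative sketch; the DOP and session settings are placeholders, not the actual test harness.
ALTER SESSION ENABLE PARALLEL DML;

BEGIN
  FOR i IN 1 .. 10 LOOP
    INSERT /*+ APPEND PARALLEL(t, 16) */ INTO all_card_trans t
    SELECT /*+ PARALLEL(s, 16) */ * FROM seed s;
    -- a direct-path insert must be committed before the same session touches the table again
    COMMIT;
  END LOOP;
END;
/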

The entire test was isolated to a single database host in the database grid, but all 14 Exadata Storage Servers of the HP Oracle Database Machine were being written to.

After running the test once, to prep the cache, I dropped and recreated the target table and ran the test again. The completion time was 307 seconds for a write throughput of 1.21GB/s.

During the run AWR tracked 328,726 direct path writes:

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                          Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
direct path write                   328,726       3,078      9   64.1 User I/O
DB CPU                                            1,682          35.0
row cache lock                        5,528          17      3     .4 Concurrenc
control file sequential read          2,985           7      2     .1 System I/O
DFS lock handle                         964           4      4     .1 Other

…and 12,143,325 reads from the SEED tablespace which, of course, resided fully cached in the Ext3 filesystem:

Segments by Direct Physical Reads        DB/Inst: TEST/test1  Snaps: 1891-1892
-> Total Direct Physical Reads:      12,143,445
-> Captured Segments account for  100.0% of Total

           Tablespace                      Subobject  Obj.        Direct
Owner         Name    Object Name            Name     Type         Reads  %Total
---------- ---------- -------------------- ---------- ----- ------------ -------
TEST       SEED       SEED                            TABLE   12,143,325  100.00

AWR also showed that essentially the same number of blocks read were also written:

Segments by Direct Physical Writes       DB/Inst: TEST/test1  Snaps: 1891-1892
-> Total Direct Physical Writes:      12,143,453
-> Captured Segments account for  100.0% of Total

           Tablespace                      Subobject  Obj.        Direct
Owner         Name    Object Name            Name     Type        Writes  %Total
---------- ---------- -------------------- ---------- ----- ------------ -------
TEST       CARDX      ALL_CARD_TRANS                  TABLE   12,139,301   99.97
SYS        SYSAUX     WRH$_ACTIVE_SESSION_ 70510_1848 TABLE            8     .00
          -------------------------------------------------------------

During the test I picked up a snippet of vmstat(1) from the database host. Since this is Exadata storage, you’ll see essentially nothing under the bi or bo columns, which track physical I/O through the local block devices. Exadata I/O is sent via the Reliable Datagram Sockets protocol over Infiniband (iDB) to the storage servers (a query that exposes this interconnect traffic from inside the database follows the vmstat listing). And, of course, the source table was fully cached. Although processor utilization was a bit erratic, the peaks in user mode were on the order of 70% and kernel mode roughly 17%. I could have driven up the rate a little with higher DOP, but all told this is a great throughput rate at a reasonable CPU cost. I did not want processor saturation during this test.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0 126848 3874888 413376 22810008    0    0    31   128   18   16 31  3 66  0  0
 0  0 126848 3845352 413376 22810428    0    0     1    60 1648 5361 41  5 54  0  0
13  0 126848 3836192 413380 22810460    0    0     2    77 1443 6217  8  2 90  0  0
 0  0 126848 3855868 413388 22810456    0    0     1   173 1459 4870 15  2 82  0  0
27  0 126848 3797260 413392 22810840    0    0    10    47 7750 5769 53 12 36  0  0
 6  0 126848 3814064 413392 22810936    0    0     1    44 11810 10695 59 16 25  0  0
 7  0 126848 3804336 413392 22810956    0    0     2    72 12128 8498 76 18  6  0  0
16  0 126848 3810312 413392 22810988    0    0     1    83 11921 11085 59 18 23  0  0
11  0 126848 3819564 413392 22811872    0    0     2   134 12118 7760 74 16 10  0  0
10  0 126848 3804008 413392 22819644    0    0     1    19 11873 10544 59 17 24  0  0
 2  0 126848 3789324 413392 22827004    0    0     2    92 11933 7753 74 17 10  0  0
39  0 126848 3759416 413396 22839640    0    0     9    39 10625 8992 54 14 32  0  0
 6  0 126848 3766288 413400 22845200    0    0     2    40 12061 9121 70 17 13  0  0
35  0 126848 3739296 413400 22853452    0    0     2   161 12124 8447 66 16 18  0  0
11  0 126848 3761772 413400 22860092    0    0     1    38 12011 9040 67 16 17  0  0
26  0 126848 3724364 413400 22868624    0    0     2    60 12489 9105 69 17 14  0  0
11  0 126848 3734572 413400 22877056    0    0     1    85 12067 9760 64 17 19  0  0
21  0 126848 3713224 413404 22883044    0    0     2   102 11794 8454 72 17 11  0  0
14  0 126848 3726168 413408 22895800    0    0     9     2 10025 9538 51 14 35  0  0
13  0 126848 3706280 413408 22902448    0    0     2    83 12133 7879 75 17  8  0  0
 7  0 126848 3707200 413408 22909060    0    0     1    61 11945 10135 59 17 24  0  0
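
As promised above, the iDB traffic that vmstat cannot see can be observed from inside the database. The following is only a sketch; the cell interconnect statistic names are those I associate with the Exadata-aware 11g releases, so verify the exact names on your own system.

-- Hedged sketch: statistic names assumed from the Exadata-aware 11g releases; verify on your release.
SELECT name, value/1024/1024 AS mb
FROM   v$sysstat
WHERE  name LIKE 'cell physical IO interconnect%'
ORDER  BY name;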

That was an interesting test. I can’t talk about the specifics of why I was studying this stuff, but I liked driving a single DL360 G5 server to 1.2GB/s write throughput. That got me thinking. What would happen if I put the SEED table in Exadata?

Read From Disk, Write to Disk
I performed a CTAS (Create Table As Select) to create a copy of the SEED table (same storage clause, etc.) in the Exadata Storage Servers. The model I wanted to test next was: read the 185GB from Exadata while writing the 370GB right back to the same Exadata Storage Server disks. This test increased physical disk I/O by 50% while introducing latency (I/O service time) on the ingest side. Remember, the SEED table I/O in the Ext3 model enjoyed read service times at RAM speed (three orders of magnitude faster than disk).
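
The copy itself was no more elaborate than the sketch below. The EX_SEED names come from the AWR excerpt that follows, while the NOLOGGING and PARALLEL clauses are assumptions for illustration, not the exact storage clause used.

-- Illustrative sketch; NOLOGGING and PARALLEL are assumptions, not the exact clauses used.
CREATE TABLE ex_seed
  TABLESPACE ex_seed
  NOLOGGING
  PARALLEL
AS SELECT * FROM seed;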

So, I ran the test and of course it was slower now that I had increased the physical I/O payload by 50% and introduced I/O latency on the ingest side. It was, in fact, 4% slower: a job completion time of 320 seconds! Yes, 4%.

The AWR report showed the same direct path write cost to the target table:

Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
direct path write                   365,950       3,296      9   66.2 User I/O
DB CPU                                            1,462          29.4
cell smart table scan               238,027         210      1    4.2 User I/O
row cache lock                        5,473           6      1     .1 Concurrenc
control file sequential read          3,087           6      2     .1 System I/O

…and reads from the tablespace called EX_SEED (which resides in Exadata storage) were on par with the volume read from the cached Ext3 tablespace:

Segments by Direct Physical Reads        DB/Inst: TEST/test1  Snaps: 1923-1924
-> Total Direct Physical Reads:      12,143,445
-> Captured Segments account for  100.0% of Total

           Tablespace                      Subobject  Obj.        Direct
Owner         Name    Object Name            Name     Type         Reads  %Total
---------- ---------- -------------------- ---------- ----- ------------ -------
TEST       EX_SEED    EX_SEED                         TABLE   12,142,160   99.99

…and the writes to the ALL_CARD_TRANS table were on par with the test conducted using the cached Ext3 SEED table:

Segments by Physical Writes              DB/Inst: TEST/test1  Snaps: 1923-1924
-> Total Physical Writes:      12,146,714
-> Captured Segments account for  100.0% of Total

           Tablespace                      Subobject  Obj.      Physical
Owner         Name    Object Name            Name     Type        Writes  %Total
---------- ---------- -------------------- ---------- ----- ------------ -------
TEST       CARDX      ALL_CARD_TRANS                  TABLE   12,144,831   99.98

Most interesting, but not surprising, was the processor utilization profile (see the following box). In spite of performing 50% more physical disk I/O, kernel-mode cycles were reduced by nearly half. It is, of course, more processor intensive (in kernel mode) to read from an Oracle tablespace cached in an Ext3 file than to read data from Exadata, because inbound I/O from Exadata storage is DMAed directly into the process address space without any intermediate copies. I/O from a cached Ext3 file, on the other hand, requires kernel-mode memcpy operations from the page cache into the address space of the Oracle process reading the file. However, since the workload was not bound by CPU saturation, the memcpy overhead was not a limiting factor. But that’s not all.

I’d like to draw your attention to the user-mode cycles. Notice how the user-mode cycles never peak above the low-fifties percent mark? The cached Ext3 test, on the other hand, exhibited high-sixties and low-seventies percent user-mode processor utilization. That is a crucial differentiator. Even when I/O is being serviced at RAM speed, the code required to deal with conventional I/O (issuing and reaping, etc.) is significantly more expensive (CPU-wise) than interfacing with Exadata Storage.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 374788 2214348  62904 24535308    0    0    26   103   18    3 31  3 66  0  0
 0  0 374788 2214868  62904 24535320    0    0     2    79 1259 4232 19  2 79  0  0
 0  0 374788 2215756  62904 24535328    0    0     1   148 1148 4894  0  1 99  0  0
 0  0 374788 2215788  62904 24535328    0    0     2     1 1054 3983  0  0 100  0  0
13  0 374788 2203004  62904 24535540    0    0    17   142 8044 9675 31  6 63  0  0
 8  0 374788 2187292  62904 24535544    0    0     2     1 12856 13948 53  8 40  0  0
 6  0 374788 2191432  62904 24535584    0    0     1    69 12864 13904 53  7 40  0  0
 6  0 374788 2194812  62904 24535584    0    0     2    35 12948 13954 54  8 39  0  0
 7  0 374788 2194284  62904 24535592    0    0     1     2 12940 13605 53  7 40  0  0
 3  0 374788 2193988  62904 24535596    0    0     2    22 12849 13329 52  7 41  0  0
 9  0 374788 2192988  62904 24535604    0    0     1     2 12797 13555 52  7 41  0  0
10  0 374788 2193788  62908 24537648    0    0    10    60 11497 13184 46  8 46  0  0
 7  0 374788 2191976  62912 24537656    0    0     1   146 12990 14248 52  8 40  0  0
 5  0 374788 2190668  62912 24537668    0    0     2     1 12974 13401 52  7 41  0  0
 3  0 374788 2191492  62912 24537676    0    0     1    16 12886 13495 52  7 41  0  0
 6  0 374788 2191292  62912 24537676    0    0     2     1 12825 13736 52  8 40  0  0
 4  0 374788 2190768  62912 24537688    0    0     1    89 12832 14500 53  8 40  0  0
 7  0 374788 2189928  62912 24537688    0    0     2     2 12849 14015 52  8 40  0  0
11  0 374788 2190320  62916 24539712    0    0     9    36 11588 12411 46  7 47  0  0
17  0 374788 2189496  62916 24539744    0    0     2    40 12948 13373 52  7 41  0  0
 6  1 374788 2189032  62916 24539748    0    0     1     2 12932 13899 53  7 39  0  0

Ten Pounds of Rocks in a 5-Pound Bag?
For those who did the math and determined that the 100% Exadata case was pushing a combined read+write throughput of 1730MB/s through the Infiniband card, you did the math correctly.

The IB cards in the database tier of the HP Oracle Database Machine each support a maximum theoretical throughput of 1850MB/s in full-duplex mode. This workload has an “optimal” (serendipitous really) blend of read traffic mixed with write traffic so it drives the card to within 6% of its maximum theoretical throughput rating.

We don’t suggest that this sort of throughput is achievable with an actual DW/BI workload. We prefer the more realistic, conservative 1.5GB/s (consistently measured with complex, concurrent queries) when setting expectations. That said, however, even if there were a DW/BI workload that demanded this sort of blend of reads and writes (i.e., a workload with a heavy blend of sort-spill writes), the balance between the database tier and the storage tier is not out of whack.

The HP Oracle Database Machine sports a storage grid bandwidth of 14GB/s, so even this workload extrapolated to all 8 nodes in the database grid would still fit within that range, since 1730MB/s * 8 RAC nodes == 13.8GB/s.
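
For anyone who wants to reproduce the arithmetic, here is the back-of-the-envelope version (decimal MB/GB assumed), expressed as a trivial query against DUAL:

-- Back-of-the-envelope check of the figures above; decimal MB/GB assumed.
SELECT ROUND((185 + 370) * 1000 / 320) AS per_node_mb_per_sec,  -- ~1734, quoted above as 1730MB/s
       ROUND(1730 * 8 / 1000, 1)       AS eight_node_gb_per_sec -- 13.8GB/s against the 14GB/s storage grid
FROM dual;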

I like balanced configurations.

Summary
This test goes a long way toward showing the tremendous efficiencies in the Exadata architecture. Indeed, both tests had to “lift” the same amount of SEED data and produce the same amount of downwind write I/O. Everything about the SQL layer remained constant. For that matter, most everything remained constant between the two models, with the exception of the lower-level read-side I/O from the cached file in the Ext3 case.

The Exadata test case performed 50% more physical I/O and did so with roughly 28% less user-mode processor utilization and a clean 50% reduction in kernel-mode cycles while coming within 4% of the job completion time achieved by the cached Ext3 SEED case.

Infiniband is not just a marketing ploy. Infiniband is an important asset in the Exadata Storage Server architecture.

5 Responses to “Cached Ext3 File Access is Faster Than Infiniband. Infiniband is Just a Marketing Ploy. Who Needs Exadata?”


  1. Timur Akhmadeev March 3, 2009 at 8:53 am

    Thank you, Kevin.

    >> i.e., a workload with a heavy blend of sort-spill writes
    I want to see the demo! 🙂

  2. Connor March 3, 2009 at 12:46 pm

    I’ve said it before, I’ll say it again…

    Love ya work Kevin….

  3. Kevin Leach March 26, 2009 at 3:55 pm

    The base of your argument for infiniband seems to avoid the real reason CPU isn’t used “issuing and reaping”. The real reason is that you aren’t doing file system i/o, right?

    Isn’t this “Exadata I/O is sent via Reliable Datagram Sockets protocol” also applicable to other hardware choices like fibre?

  4. kevinclosson March 26, 2009 at 7:46 pm

    “The base of your argument for infiniband seems to avoid the real reason CPU isn’t used “issuing and reaping”. The real reason is that you aren’t doing file system i/o, right?”

    …I don’t exactly understand the question, but I’ll give it a shot. When I was discussing the cost of marshalling I/O (issuing/reaping) I was focusing on the ~50% more user-mode cycles exhibited in the cached Ext3 case. I wasn’t necessarily trying to account entirely for the difference in user-mode cycles because the Oracle routines that interface with libaio are certainly not sufficiently expensive to account for the difference. It is, however, a significant portion of the difference.

    …Filesystem I/O is kernel-mode so, no, your assertion that filesystem I/O is the culprit is not on target. The cycles Oracle spends in OSD I/O (skgf*), through libaio, stop at syscall(). If the main culprit were filesystem I/O, the CPU usage would have skewed to kernel mode, not user mode.

    “Isn’t this “Exadata I/O is sent via Reliable Datagram Sockets protocol” also applicable to other hardware choices like fibre?”

    …RDS is just a protocol. It is currently available for Infiniband and Ethernet. With some heavy lifting (akin to SRP from IB->FC with a gateway) I’m sure RDS could be shoehorned into such a technology stack. Whether Oracle would ever see fit to take advantage of some futuristic RDSoFC is well beyond me to say.

    …I wonder if folks get confused over the fact that I discuss RDMA as if it is the only DMA. Indeed, FC I/O with the right drivers and HBAs (um, pretty much all of them) is DMAed from storage into user virtual address space when using direct I/O libraries. This differs from Linux processes RDMAing data between their address spaces. That sort of IPC is key for RAC inter-instance communication and for host-to-storage transfer in the case of Exadata.

