A question came in about whether it is possible to measure how much data is filtered out when running a query serviced by a Smart Scan in the Exadata Storage Server grid. The following is the long answer.
An Example of Offload Processing Effectiveness Accounting
I’d like to answer this question with real numbers taken from a test configuration consisting of 4 Real Application Clusters nodes attached to 6 Exadata Storage Server cells with the SAS disk option (12 x 300GB 3.5″ 15K RPM).
Test Workload – The Affinity Card Program Test Database
The Affinity Card Program Test Database (ACPTD) is a test schema and set of queries that mimics the type of queries performed by a marketing group responsible for promoting the use of a retail club card and affinity credit card for a fictitious company. In this example deployment of the ACPTD, the group responsible for promoting club card and affinity credit card activity for Acme Inc has built a data mart with data from several sources, including the main corporate ERP and CRM systems; partner merchant ERP systems; and, outside syndicated data.
For this blog post I’ll focus on the main syndicated card transaction table, called all_card_trans. However, there are other tables in the schema, and for reference’s sake the following text box shows the table sizes in my current deployment of the test kit. As the query shows, the all_card_trans table is 613 GB. Yes, I know that’s small, but this is a test system and I don’t like watching 6 Exadata cells hammering 1 GB/s each for tens of minutes when a couple of minutes will do. I hope you’ll understand.
SQL> set echo on
SQL> @tab_sizes
SQL> col segment_name for a32
SQL> select segment_name, sum(bytes)/1024/1024 mb from user_segments
  2  where segment_name not like 'BIN%' and segment_name not like 'SYS%'
  3  group by segment_name;

SEGMENT_NAME                             MB
-------------------------------- ----------
PARTNER_MERCHANTS                         8
PARTNER_MERCHANT_SALES                67780
ZIPCODES                                  8
OUR_SALES                             21956
ALL_CARD_TRANS                       629032
MCC                                       8
OUR_STORES                                8
CUST_SERVICE                             76
TM                                   .03125
CUST_FACT                              4708

10 rows selected.
The table definition for the all_card_trans table is shown in the following text box:
SQL> desc all_card_trans
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CARD_NO                                   NOT NULL VARCHAR2(20)
 CARD_TYPE                                          CHAR(20)
 MCC                                       NOT NULL NUMBER(6)
 PURCHASE_AMT                              NOT NULL NUMBER(6)
 PURCHASE_DT                               NOT NULL DATE
 MERCHANT_CODE                             NOT NULL NUMBER(7)
 MERCHANT_CITY                             NOT NULL VARCHAR2(40)
 MERCHANT_STATE                            NOT NULL CHAR(3)
 MERCHANT_ZIP                              NOT NULL NUMBER(6)
The following queries show the total row count in the all_card_trans table as well as the number of rows representing transactions on credit cards with leading 4 digits of 3333 and 4777, which represent the card block granted to our fictitious company, Acme Inc, for their affinity card program. This is a needle-in-a-haystack query, since only .3% of the data matches the WHERE predicate. Now, before you look at the text box, please note that this is a single Real Application Clusters instance driving this query. And, yes, the RAC node is a small 8-core DL360. Both scans are driving storage at 6 GB/s (613GB/102s). Folks, remember that to do what I just did in the following text box with Fibre Channel I’d have to have a system with 15 FC HBAs attached to several huge SAN arrays. Oh, and I should point out that it would also require about 85 Netezza Snippet Processing Units to match this throughput.
SQL> select count(*) from all_card_trans where card_no like '4777%' or card_no like '3333%';

  COUNT(*)
----------
  22465830

Elapsed: 00:01:42.71

SQL> select count(*) from all_card_trans;

  COUNT(*)
----------
7897611641

Elapsed: 00:01:42.47
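If you’d like to check my arithmetic, the selectivity and scan-rate claims fall straight out of the counts and elapsed times above. The following is just back-of-envelope SQL, not part of the test kit:

-- Back-of-envelope checks (illustrative only, not part of the test kit):
-- selectivity of the card_no predicate and the effective scan rate.
select round(22465830 / 7897611641 * 100, 2) pct_rows_matched,  -- ~.28%
       round(613 / 102.71, 2) scan_gb_per_sec                   -- ~5.97 GB/s
  from dual;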
I Can’t Believe I Ate the Whole Thing
The following query uses a method for measuring cell activity by forcing the database to ingest all rows and all columns. Don’t try to make sense of the query because it isn’t supposed to make sense. It is only supposed to drive Exadata to return all rows and all columns, which it does.
$ more demo_offload_bytes.sql

drop view ingest;
create view ingest as
select avg(length(CARD_NO)) c1,
       avg(length(CARD_TYPE)) c2,
       max(MCC + PURCHASE_AMT) c3,
       max(length(PURCHASE_DT)) c4,
       max(MERCHANT_CODE) c5,
       avg(length(MERCHANT_CITY)) c6,
       avg(length(MERCHANT_STATE)) c7,
       max(MERCHANT_ZIP) c8
from all_card_trans;

col MB format 99999999.9

select NAME, VALUE/(1024*1024*1024) GB from gv$sysstat
where STATISTIC# in (196, 44, 43);

select c1+c2+c3+c4+c5+c6+c7 from ingest;

select NAME, VALUE/(1024*1024*1024) GB from gv$sysstat
where STATISTIC# in (196, 44, 43);
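One caveat if you adapt this script: the STATISTIC# values (196, 44, 43) are release-specific and can change between versions, so selecting by statistic name is the safer variation. The following is a sketch of that variation, not what I actually ran; the statistic names are the ones you’ll see in the output below.

-- Variation on the gv$sysstat queries above: select by statistic name
-- rather than STATISTIC#, since the numbers can change between releases.
select inst_id, name, value/(1024*1024*1024) gb
  from gv$sysstat
 where name in ('physical IO disk bytes',
                'cell physical IO interconnect bytes',
                'cell physical IO bytes eligible for predicate offload');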
The following text box shows that when I ran the query, Exadata (in aggregate) scanned 613 GB of disk and returned all but 6% to the RDBMS instance (see the delta in cell physical IO interconnect bytes). Also recorded by v$sysstat (the bytes-eligible statistic) is the fact that there was nothing peculiar about the data being scanned, nothing (chained rows, for instance) that would have interfered with offload processing. Since I asked for all the data, that is what I got. It is nice to know, however, that the entirety of the data was a candidate for offload processing.
Notice I didn’t time the query. I’ll offer more on that later in this blog post.
SQL> @demo_offload_bytes

View dropped.

View created.

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                            1227.80814
cell physical IO interconnect bytes                               1151.52453
cell physical IO bytes eligible for predicate offload             1227.31451

C1+C2+C3+C4+C5+C6+C7
--------------------
          9010106.59

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                             1841.5774
cell physical IO interconnect bytes                               1727.15227
cell physical IO bytes eligible for predicate offload             1840.97177
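If you want the 6% figure spelled out, it falls straight out of the before/after deltas in the text box. The following is illustrative arithmetic only, not part of the test kit:

-- Illustrative arithmetic from the snapshots above: what fraction of the
-- bytes eligible for offload actually crossed the interconnect?
select round( (1727.15227 - 1151.52453)             -- interconnect bytes delta (GB)
            / (1840.97177 - 1227.31451) * 100, 1 )  -- eligible bytes delta (GB)
       pct_returned
  from dual;
-- ~93.8%: all but roughly 6% of the scanned bytes came back to the database
-- tier, which is what you'd expect when every row and every column is demanded.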
So let’s take a look at some offload processing. The following change to the “ingest” query reduces the amount of data ingested by the database tier by selecting only rows where the credit card number starts with 4777 or 3333. We know from the scan of the table that this whittles the dataset down by 99.7%.
$ more demo_offload_bytes1.sql

create view ingest_less as
select avg(length(CARD_NO)) c1,
       avg(length(CARD_TYPE)) c2,
       max(MCC + PURCHASE_AMT) c3,
       max(length(PURCHASE_DT)) c4,
       max(MERCHANT_CODE) c5,
       avg(length(MERCHANT_CITY)) c6,
       avg(length(MERCHANT_STATE)) c7,
       max(MERCHANT_ZIP) c8
from all_card_trans act
where card_no like '4777%' or card_no like '3333%';

col MB format 99999999.9

select NAME, VALUE/(1024*1024*1024) GB from gv$sysstat
where STATISTIC# in (196, 44, 43);

set timing on

select c1+c2+c3+c4+c5+c6+c7 from ingest_less;

select NAME, VALUE/(1024*1024*1024) GB from gv$sysstat
where STATISTIC# in (196, 44, 43);
The following text box shows that Exadata scanned the table and filtered out the uninteresting rows. The projected columns accounted for only 1.7 GB across the iDB intelligent I/O fabric. That is, the database tier only had to ingest 1.7 GB, or roughly 17 MB/s given that the query completed in 103 seconds.
SQL> @demo_offload_bytes_1

View dropped.

Elapsed: 00:00:00.21

View created.

Elapsed: 00:00:00.00

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                            1841.13743
cell physical IO interconnect bytes                               43.1782733
cell physical IO bytes eligible for predicate offload             1840.97177

Elapsed: 00:00:00.01

C1+C2+C3+C4+C5+C6+C7
--------------------
           9010106.9

Elapsed: 00:01:43.31

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                            2454.80203
cell physical IO interconnect bytes                               44.8776852
cell physical IO bytes eligible for predicate offload             2454.62903

Elapsed: 00:00:00.00

SQL>
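The same delta arithmetic applied to this run makes the 1.7 GB and 99.7% figures explicit (again, an illustrative check rather than part of the test kit):

-- Illustrative arithmetic from the snapshots above:
select round(44.8776852 - 43.1782733, 2) interconnect_delta_gb,    -- ~1.70 GB
       round( (44.8776852 - 43.1782733)
            / (2454.62903 - 1840.97177) * 100, 2 ) pct_returned    -- ~0.28%
  from dual;
-- The cells filtered out roughly 99.7% of the data before it ever
-- reached the database tier.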
So, yes, Exadata does let you see how effectively Smart Scans are working. In this example we saw a single-table scan on a small Oracle Exadata Storage Server grid weeding out 99.7% of the data.
Sneaky, Sneaky.
You’ll notice that when I executed demo_offload_bytes.sql (the non-filtering example) I was not tracking query completion time. That’s because up to this point I’ve only been showing examples driven by a single Real Application Clusters node. Now, let’s think about this. I have 8 Intel processor cores and a single inbound InfiniBand path on the database server. From an inbound I/O bandwidth perspective the host can “only” ingest 1.6 GB/s, but the CPUs may throttle that even further since the database is doing its own projection in this case.
I’ve shown, in this post, that this test Exadata Storage grid configuration can scan disk at 6 GB/s. The question you might ask (and I’m about to answer) is: how much does this single database host throttle storage, and why does that matter? Well, it matters because having ample storage bandwidth with limited database server bandwidth is the classic imbalance addressed by the HP Oracle Database Machine. So, let me run it again, this time with timing turned on. As you’ll see, bottlenecks are bottlenecks. This single database host cannot keep up with storage, so storage is throttled, and the result is an 18X increase in query completion time compared to the heavily filtered case. Both queries had to read the same amount of data, but in this case there was an imbalance in the upstream ability to ingest the data, from both an I/O bandwidth and a CPU perspective.
SQL> @demo_offload_bytes

View dropped.

Elapsed: 00:00:00.02

View created.

Elapsed: 00:00:00.01

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                            2454.81084
cell physical IO interconnect bytes                                44.886501
cell physical IO bytes eligible for predicate offload             2454.62903

Elapsed: 00:00:00.00

C1+C2+C3+C4+C5+C6+C7
--------------------
          9010106.59

Elapsed: 00:29:26.85

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                            3068.55792
cell physical IO interconnect bytes                               620.492259
cell physical IO bytes eligible for predicate offload             3068.28629

Elapsed: 00:00:00.00
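To put numbers on that throttling, divide the interconnect and disk-byte deltas by the 29:26.85 (1,766.85-second) elapsed time. This is back-of-envelope arithmetic only:

-- Illustrative arithmetic from the run above (deltas in GB over 1766.85s):
select round( (620.492259 - 44.886501) / 1766.85 * 1024 ) ingest_mb_per_sec,   -- ~334
       round( (3068.55792 - 2454.81084) / 1766.85 * 1024 ) scan_mb_per_sec     -- ~356
  from dual;
-- Both figures sit far below the ~1.6 GB/s inbound InfiniBand path, which
-- suggests the host CPUs (busy doing projection) were the limiter, not just the link.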
Need More RDBMS Bandwidth? OK, Use It.
So, let’s see what the full-ingestion case looks like with 4 Real Application Clusters nodes. And please forgive that I forgot to aggregate the gv$sysstat output. I’ve added 4-fold database grid resources, so I should complete the full-ingestion query in 75% less time. The text box starts out showing that the number of RAC instances was increased from 1 to 4.
SQL> host date
Fri Oct  3 16:27:52 PDT 2008

SQL> select instance_name from gv$instance;

INSTANCE_NAME
----------------
test1

Elapsed: 00:00:00.02

SQL> select instance_name from gv$instance;

INSTANCE_NAME
----------------
test1
test4
test3
test2

Elapsed: 00:00:00.08

SQL> host date
Fri Oct  3 16:32:06 PDT 2008

SQL>
SQL> @demo_offload_bytes

View dropped.

Elapsed: 00:00:00.03

View created.

Elapsed: 00:00:00.01

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                            3068.69843
cell physical IO interconnect bytes                               620.632769
cell physical IO bytes eligible for predicate offload             3068.28629
physical IO disk bytes                                            .013385296
cell physical IO interconnect bytes                               .013385296
cell physical IO bytes eligible for predicate offload                      0
physical IO disk bytes                                            .010355473
cell physical IO interconnect bytes                               .010355473
cell physical IO bytes eligible for predicate offload                      0
physical IO disk bytes                                            .031275272
cell physical IO interconnect bytes                               .031275272

NAME                                                                     GB
---------------------------------------------------------------- ----------
cell physical IO bytes eligible for predicate offload                      0

12 rows selected.

Elapsed: 00:00:00.02

C1+C2+C3+C4+C5+C6+C7
--------------------
          9010106.59

Elapsed: 00:07:25.01

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                            3222.13893
cell physical IO interconnect bytes                               764.535856
cell physical IO bytes eligible for predicate offload             3221.70425
physical IO disk bytes                                             154.31714
cell physical IO interconnect bytes                               144.732708
cell physical IO bytes eligible for predicate offload             154.280945
physical IO disk bytes                                            152.860648
cell physical IO interconnect bytes                               143.367385
cell physical IO bytes eligible for predicate offload             152.828033
physical IO disk bytes                                            153.195674
cell physical IO interconnect bytes                                 143.6831

NAME                                                                     GB
---------------------------------------------------------------- ----------
cell physical IO bytes eligible for predicate offload              153.13031

12 rows selected.

Elapsed: 00:00:00.02

SQL>
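The prediction is easy to verify from the two elapsed times (29:26.85 single-node versus 7:25.01 with four nodes). Illustrative arithmetic:

-- Illustrative check: single-node vs. 4-node elapsed time, full ingestion.
select round(1766.85 / 445.01, 2) speedup   -- ~3.97x, i.e. ~75% less time
  from dual;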
Well, that was predictable, but cool nonetheless. What about the filtered ingest query? Should there be much of a speedup, given that storage bandwidth remains constant and the ingest rate of the filtered query was only roughly 17 MB/s with a single Real Application Clusters node? I could save the cut-and-paste effort and just tell you that adding Real Application Clusters nodes in this case will, of course, not reduce the query time, but since I’ve bored you this long, here it is:
SQL> @demo_offload_bytes_1

View dropped.

Elapsed: 00:00:00.01

View created.

Elapsed: 00:00:00.01

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                            3222.15758
cell physical IO interconnect bytes                               764.554509
cell physical IO bytes eligible for predicate offload             3221.70425
physical IO disk bytes                                            154.334342
cell physical IO interconnect bytes                                144.74991
cell physical IO bytes eligible for predicate offload             154.280945
physical IO disk bytes                                            153.213285
cell physical IO interconnect bytes                                143.70071
cell physical IO bytes eligible for predicate offload              153.13031
physical IO disk bytes                                            152.878347
cell physical IO interconnect bytes                               143.385084

NAME                                                                     GB
---------------------------------------------------------------- ----------
cell physical IO bytes eligible for predicate offload             152.828033

12 rows selected.

Elapsed: 00:00:00.01

C1+C2+C3+C4+C5+C6+C7
--------------------
           9010106.9

Elapsed: 00:01:42.79

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                            3375.27835
cell physical IO interconnect bytes                               764.974159
cell physical IO bytes eligible for predicate offload             3374.81995
physical IO disk bytes                                              308.0746
cell physical IO interconnect bytes                               144.133529
cell physical IO bytes eligible for predicate offload             307.986572
physical IO disk bytes                                            310.634206
cell physical IO interconnect bytes                               145.189768
cell physical IO bytes eligible for predicate offload               310.5755
physical IO disk bytes                                            302.274069
cell physical IO interconnect bytes                               143.805239

NAME                                                                     GB
---------------------------------------------------------------- ----------
cell physical IO bytes eligible for predicate offload             302.218781

12 rows selected.

Elapsed: 00:00:00.01

SQL> host date
Fri Oct  3 16:49:32 PDT 2008
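And since I apologized above for not aggregating the gv$sysstat output, here is what that aggregation looks like for this filtered run. Instance ordering in gv$sysstat output is not guaranteed to be stable between snapshots, so summing before subtracting is the safe approach. Illustrative arithmetic only:

-- Illustrative aggregation of the per-instance 'physical IO disk bytes'
-- deltas from the snapshots above (sums are order-independent):
select round( (3375.27835 + 308.0746 + 310.634206 + 302.274069)
            - (3222.15758 + 154.334342 + 153.213285 + 152.878347), 1 ) scanned_gb
  from dual;
-- ~613.7 GB scanned in 102.79 seconds: storage still ran flat out at ~6 GB/s,
-- and the elapsed time matches the single-node filtered run (1:42.79 vs. 1:43.31).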
Summary
Exadata works, and it has statistics so that the solution doesn’t feel so much like a black box. In the end, however, none of the statistics stuff really matters. What matters is whether your queries complete in the desired service times.
Kevin,
Couldn’t you get the stats from v$sql? Did they change the SQL*Plus ‘trace mode’ to report these, the way it reports logical and physical reads?
rgds
Hi Salem,
Yes, IO_CELL_OFFLOAD_ELIGIBLE_BYTES, IO_INTERCONNECT_BYTES and IO_DISK_BYTES are accounted for in v$sql.
It is the same data as gv$sysstat in this regard, though.
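For example, here is a minimal sketch of pulling the same accounting per-statement from v$sql; the LIKE predicate below is purely illustrative, so match on whatever identifies your statement:

-- Per-statement view of the offload accounting (predicate is illustrative only):
select sql_id,
       io_disk_bytes/(1024*1024*1024)                  disk_gb,
       io_interconnect_bytes/(1024*1024*1024)          interconnect_gb,
       io_cell_offload_eligible_bytes/(1024*1024*1024) eligible_gb
  from v$sql
 where sql_text like 'select c1+c2+c3+c4+c5+c6+c7%';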
No such thing as too many statistics: very good stuff.
The example you use raises another important point, I think, which is that storage-based filtering and projection is not a cure for every problem, and that disk I/O is not the only component in a balanced system. It’s a reminder that Oracle has long had a set of features specifically to mitigate the inefficiency of repeating calculations (materialised views, query rewrite, and in 11g various result caching features) and that even when you couple the database to a fast storage system there is still a role to be played by these.
David,
Let me see if I get your point. I think you are saying that all the optimizations Oracle has implemented in the past (to mitigate I/O pain) are likely still helpful. Yes, you are right. The full arsenal is still there. The main thing about Exadata is that it can do I/O, and in the absence of Exadata the *only* tools you have are those which mitigate the impact of painful I/O.
Exactly.
As well as precalculating aggregates, in the past I have created “skinny” versions of very wide fact tables in order to improve full scan time, and used query rewrite to transparently redirect queries to them. Works great. I’ve even made completely redundant copies of large tables with different partitioning schemes to provide more effective partition pruning for targeted sets of queries.
So one appeal of 11g+Exadata is that nothing _necessarily_ invalidates those approaches; it just changes the associated cost-benefit analysis so that the number of cases where you resort to them is greatly reduced, or even eliminated.
David,
Don’t forget the results cache 🙂
Like we say, Exadata is brute force computing …. but it doesn’t negate the “brainy” features of Oracle Database. You can use brains, brawn or both with Exadata…your choice 🙂
It would be interesting to compare 11g+Exadata with the other vendors to see how these brainy features compare. Query rewrite, results caching, partitioning, in-database OLAP, in-database Data Mining, um … flashback is also a good feature for ETL support, for rolling back a complex failed load without needing any lines of code.
Oh, and Apex as well
Er, ummm, I’m trying to avoid a ‘me too’ fan-boy comment, but I can’t help myself.
You’ve both hit the nail square on the head. Exadata gives you all the DW appliance style stuff, *but* on top of a top notch, mature feature-rich DBMS. (Crikey, maybe I should work for Oracle marketing. Perhaps not. 😉 )
That’s a very attractive combination from a technical point of view, regardless of industry analysis.
Yeah, ok, I admit it. (Educated) Fan-boy comment.
Look, the deal is that a high percentage of Netezza installs are sitting on the same raised floor as the large Oracle ERP/OLTP system from which they extract the data. I consider this madness that no longer has merit.
Kevin,
You made a typo in: “Both scans are driving storage at 6 GB/s (613MB/102s).”
Should probably be 613GB/102s
Freek,
Ah, if only that was the only typo…I just walked through the post again and cleaned up a few more…whew, that feels better…
Hi Kevin,
I was wondering if there is a way to simulate working on an Exadata storage machine.
Something like interaction with:
dcli
ipmitool
cellcli
(I mean something besides R.A.Testing.)
I believe this could be implemented in the same manner as the Oracle Unified Storage 7000 Simulator.
That would allow for a pseudo-realistic experience.
Thanks.
Waseem.
(Please do allow for it to work on AMD too)
Hi Waseem,
I’m sorry. What you’re looking for is not available outside the walls of Oracle. BTW, Exadata ran on AMD before it ran on Intel.