This is installment number four in my series on Oracle Exadata Storage Server and HP Oracle Database Machine frequently asked questions. I recommend you also visit:
Oracle Exadata Storage Server Frequently Asked Questions Part I.
Oracle Exadata Storage Server Frequently Asked Questions Part II.
Oracle Exadata Storage Server Frequently Asked Questions Part III.
I’m mostly cutting and pasting questions from the comment threads of my blog posts about Exadata and mixing in some assertions I’ve seen on the web, re-phrased as questions. If they already read as questions when I see them, I cut and paste them without modification.
Q. What is meant in the Exadata white paper’s statement about Smart Scan Predicate Filtering that “…only rows where the employee’s hire date is after the specified date are sent from Exadata to the database instance…”? Does it really return only the rows matching the predicate, or does it return all blocks containing rows that match the predicate? If the former is correct, how is this handled in the db block buffer?
A. Actually, that statement is a bit too liberal. Oracle Exadata Storage Server Smart Scans return only the cited columns from only the filtered rows, not entire rows as that statement may suggest. Results from a Smart Scan are not “real” database blocks, so the results do not get cached in the SGA. Consider the following business question:
List our Club Card members who have spent more than $1,000.00 at non-partner retail merchants in the last 180 days with our affinity credit card. Consider only non-partner merchants within a 10-mile radius of one of our stores.
The SQL statement in the following text box answers this question (assuming 10-mile-wide US postal code zones):
select cf.custid, sum(act.purchase_amt) sales
from all_card_trans act, cust_fact cf
where ( act.card_no like '4777%' or act.card_no like '3333%' )
and act.card_no = cf.aff_cc_num
and cf.club_card_num not like '0%'
and act.purchase_dt > to_date('07-MAR-2008','dd-mon-yyyy')
and act.merchant_zip in ( select distinct(zip) from our_stores )
and act.merchant_code not in ( select merchant_code from partner_merchants )
group by custid
having sum(act.purchase_amt) > 1000 ;
Intelligent Storage. Filtering and Projecting. Oh, and a Bloom Filter Too.
Now, let’s look at the predicate handling in the plan. The next text box shows that storage is filtering on club card numbers, purchase dates and credit card numbers. Given the way this query functions, all_card_trans is essentially a fact table and cust_fact is more of a dimension table, since all_card_trans has 1,300-fold more rows than cust_fact. You’ll also see that a Bloom filter was pushed to the Exadata Storage Server cells and used to join the filtered cust_fact rows to the filtered all_card_trans rows.
Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter(SUM("ACT"."PURCHASE_AMT")>1000)
   7 - access("ACT"."MERCHANT_CODE"="MERCHANT_CODE")
  10 - access("ACT"."MERCHANT_ZIP"="ZIP")
  15 - access("ACT"."CARD_NO"="CF"."AFF_CC_NUM")
  20 - storage("CF"."CLUB_CARD_NUM" NOT LIKE '0%')
       filter("CF"."CLUB_CARD_NUM" NOT LIKE '0%')
  25 - storage("ACT"."PURCHASE_DT">TO_DATE(' 2008-03-07 00:00:00', 'syyyy-mm-dd hh24:mi:ss')
       AND ("ACT"."CARD_NO" LIKE '4777%' OR "ACT"."CARD_NO" LIKE '3333%'))
       filter("ACT"."PURCHASE_DT">TO_DATE(' 2008-03-07 00:00:00', 'syyyy-mm-dd hh24:mi:ss')
       AND ("ACT"."CARD_NO" LIKE '4777%' OR "ACT"."CARD_NO" LIKE '3333%')
       AND SYS_OP_BLOOM_FILTER(:BF0000,"ACT"."CARD_NO"))
This example should make the answer a bit clearer. Storage is ripping through the rows, yes, but only returning named columns and heavily filtered data.
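For those wondering how to see that effect for themselves, the cell-related session statistics are one place to look: comparing the bytes eligible for offload to the bytes that actually crossed the interconnect gives a feel for how much filtration and projection saved. The statistic names below are as I recall them and can differ between releases, so treat this as an illustrative sketch rather than a definitive recipe:

-- Compare what storage scanned on our behalf to what it shipped back.
-- A large gap between the two values means filtration and projection
-- did their job in the cells.
select sn.name, ms.value
  from v$mystat ms, v$statname sn
 where ms.statistic# = sn.statistic#
   and sn.name in (
       'cell physical IO bytes eligible for predicate offload',
       'cell physical IO interconnect bytes returned by smart scan' );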
The following text box shows frozen output of a tool we used to monitor I/O at a high level across cells. The Exadata Storage cells in this case are named sgbeta1s01 through sgbeta1s06 (I’m like you, I have to scrounge for hardware). These are two 3-second averages showing I/O in excess of 1,000,000 KB/s per cell, or an aggregate of 6 GB/s (see the data under the “bi” column). This is, of course, what we’ve been referring to as a “half-rack,” in spite of the fact that it has only 6 Exadata Storage Server cells instead of the 7 that would make a true half-rack. Nonetheless, this is a 4-table query doing real work and, as you can see in the last text box, I’m connected to a single instance of RAC, on a dual-socket/quad-core server! Without Exadata, a server would need some 15 fibre channel host bus adaptors to drive this query at this I/O rate (a 4Gb FC HBA moves roughly 400 MB/s, and 15 of those is about 6 GB/s), plus a great deal of CPU power. That would be a very large server, and it would still be slower.
Kids, don’t try this at home without supervision
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
15:29:43:    r  b swpd    free    buff   cache si so      bi bo   in    cs us sy id wa st
sgbeta1s01:  4  0  204   63292  134928 1523732  0  0 1033237 24 5062 37992 25  3 72  0  0
sgbeta1s02:  8  0  204   65708  141800 1596124  0  0 1038949  9 5238 37869 25  3 72  0  0
sgbeta1s03:  4  0  208 1183780   57832  584988  0  0 1049712 35 5862 40258 26  3 69  2  0
sgbeta1s04:  2  0  204 1254456   79084  568052  0  0 1053093 13 5329 38462 25  3 72  0  0
sgbeta1s05:  4  0    0  144924  122252 1453324  0  0 1022032 29 4991 38164 25  3 72  0  0
sgbeta1s06: 11  0    0   95808  124200 1418236  0  0 1000635 24 4767 38249 25  2 73  0  0
Minimum:     2  0    0   63292   57832  568052  0  0 1000635  9 4767 37869 25  2 69  0  0
Maximum:    11  0  208 1254456  141800 1596124  0  0 1053093 35 5862 40258 26  3 73  2  0
Average:     5  0  136  467994  110016 1190742  0  0 1032943 22 5208 38499 25  2 71  0  0

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
15:29:46:    r  b swpd    free    buff   cache si so      bi bo   in    cs us sy id wa st
sgbeta1s01:  3  0  204   63012  134928 1523732  0  0 1005643 24 5099 38475 25  2 72  0  0
sgbeta1s02:  2  0  204   65188  141800 1596124  0  0 1011285  3 4867 37671 25  2 73  0  0
sgbeta1s03:  4  0  208 1184408   57832  584988  0  0 1005371 20 5775 39378 25  3 72  0  0
sgbeta1s04:  3  0  204 1255200   79084  568052  0  0  993264 13 4891 38240 24  3 73  0  0
sgbeta1s05:  5  0    0  144320  122256 1453324  0  0 1016859 29 4947 38073 25  3 72  0  0
sgbeta1s06:  6  0    0   95824  124200 1418236  0  0 1019323 24 4961 38224 25  2 73  0  0
Minimum:     2  0    0   63012   57832  568052  0  0  993264  3 4867 37671 24  2 72  0  0
Maximum:     6  0  208 1255200  141800 1596124  0  0 1019323 29 5775 39378 25  3 73  0  0
Average:     3  0  136  467992  110016 1190742  0  0 1008624 18 5090 38343 24  2 72  0  0
[oracle@sgbeta1c01]$ sqlplus '/ as sysdba'

SQL*Plus: Release 11.1.0.7.0 - Production on Thu Oct 2 15:35:55 2008

Copyright (c) 1982, 2008, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options

SQL> select instance_name from gv$instance ;

INSTANCE_NAME
----------------
test1

SQL>
OK, enough … this is starting to get depressing for those of us with I/O-bound systems still on 9i 😀
Seriously though, very interesting. Is there more you can say about the instrumentation of the storage layer? Is it possible to measure the efficiency with which column projection and predicate filtering have been carried out, either for an individual query or at an aggregate level? Maybe this next part is outside your specific area of expertise on this project, but how well is the expected efficiency of these operations integrated into the optimizer’s cost calculations?
In my Firefox browser the right-hand side of the box giving the predicate filtering from the explain plan does not show up… I can view the source, though, to see it all!
I was going to ask a silly question, but the filter information I think answers it, unless of course I am wrong (and not for the first time). So am I getting this right?
The Bloom filter is a probabilistic filter: it returns at least all of the rows that satisfy the filter condition, but it also returns ‘false positive’ rows that do not contain matching data. I was going to ask whether the Bloom-filtered set is returned to the “database” for further processing, or whether it is further filtered on the actual predicates before being sent down the wire. But from filter 25 in the plan I think I see the answer: just the rows of interest are projected.
Two questions about this post:
1. Were there any indexes on these tables in the example query? I am assuming not, and that everything was a full table scan for the data accesses.
2. How long did the query take to execute? And how big was the fact table (all_card_trans)? Both in terms of data row count, and in terms of bytes on disk.
I’m just wondering how well this query might have executed if all the join and constraint columns had been indexed on a ‘normal’ Oracle system.
As ever, you can always optimise a system for one particular query via a combination of indexes and so on, but that does not mean it will handle all queries as fast. So I can see that the Exadata Storage Server gives you more consistent elapsed times, as it always processes all of the data in the table, if you see what I mean. But I am just curious whether you can share the numbers for how this particular test query executed on the Exadata Storage Server, compared to a normal Oracle database with indexes?
John
Hi John,
There were no indexes on the tables. If there had been one on all_card_trans.card_no, I’m quite certain the CBO would have used it. Nonetheless, if there is an index and the CBO chooses to scan it, a Smart Scan can still be used. There is even value in Smart Scan for indexes (e.g., filtration and projection from composite indexes); see the sketch below.
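A minimal, hypothetical sketch of what I mean (the index name and hint are mine, for illustration only, not from the test system): a fast full scan of a composite index is also eligible for Smart Scan, so the cells filter and project index rows just as they do table rows.

-- Hypothetical composite index covering the hot columns:
create index act_cc_dt_ix
  on all_card_trans ( card_no, purchase_dt, purchase_amt );

-- If the CBO scans the index (hinted here purely for illustration),
-- the plan would show an INDEX STORAGE FAST FULL SCAN with a
-- storage(...) predicate, analogous to the table-scan plan above.
select /*+ index_ffs(act act_cc_dt_ix) */ sum(act.purchase_amt)
  from all_card_trans act
 where act.card_no like '4777%';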
The query was not a performance PoC. It was only run to generate an educational piece about Smart Scan plan visibility. I’ve loaded more data and I’ll give a throughput number on it in a bit.
As for non-Exadata Oracle versus Exadata, well, if you did manage to trick out the index scheme, I’ll just do the same thing and leapfrog your result, because indexes mean less I/O but there is still I/O nonetheless, and Exadata will beat non-Exadata at servicing that I/O. What I’m trying to say is that indexes and Exadata are not mutually exclusive. The point is that with Exadata, indexes are a matter of choice, not a necessity.
PS. Are you the same John Brady from the old Sequent days?
“The Bloom filter is a probabilistic filter: it returns at least all of the rows that satisfy the filter condition, but it also returns ‘false positive’ rows that do not contain matching data. I was going to ask whether the Bloom-filtered set is returned to the “database” for further processing, or whether it is further filtered on the actual predicates before being sent down the wire. But from filter 25 in the plan I think I see the answer: just the rows of interest are projected.”
Pete,
Bloom Filters are pre-Exadata functionality, and just as when a Bloom Filter executes server-side, there can be false positives. The consumer of a Bloom Filter is responsible for chewing the meat and spitting out the bones. So, yes, false positives can make it up the wire, but the consumer will deal with them. The main goal of pushing a Bloom Filter to storage is politeness on the wire and reduced RDBMS host memory impact, thanks to reduced RDMA pressure.
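If you want to watch a server-side Bloom filter chew the meat, one place to look (from memory; treat the view and column names as illustrative, as availability varies by release) is v$sql_join_filter, which reports how many rows a join filter tested versus how many it rejected:

-- PROBED = rows tested against the Bloom filter; FILTERED = rows it
-- rejected. The difference gives a feel for the false positives the
-- join itself must still discard.
select qc_session_id, length, bits_set, probed, filtered
  from v$sql_join_filter;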
Kevin,
Yes, it’s me. I won’t deny it 😉 Good to see that you remember me. I’ve stuck with the Oracle database thing since then, in the UK, at a few different companies. I’m still interested in performance and tuning, and in big systems. I’ve got into reading blogs over the past few years (there was nothing like them in 1995), came across yours, and have been tracking it. As ever, I see you remain buried in the internals of the Oracle database, getting your head around where it spends its time when doing work.
Keep up the posting. It’s detailed, but it always gives good information about what Oracle actually does inside. I even referred back to a post of yours on what happens during a log sync for a question on the Oracle forums: they had poor log sync performance but good I/O times, and I remembered reading something from you on that before. So I just pointed them at your post if they wanted to understand what Oracle does on a log sync, and why it might be slow.
Like you, I think, I miss the big old iron, when we had 32 CPUs in a box and could keep them all busy doing something. When Sun released the T2000 a couple of years ago with the T1 (Niagara) multi-core, multi-threaded CPU in it, I did the maths and reckoned that the one CPU was equivalent to an E10000 StarFire with 32 UltraSPARC-II CPUs at 200 MHz each. So there is an amazing amount of horsepower nowadays in very small footprint systems: 1 rack unit for a T2000 system with something like 8 GB of memory.
It was a pity it did not work out at Sequent. NUMA-Q had a number of issues and was probably trying to do too much, ahead of its time. By comparison, the SunFire 15000 system, released when Sun had just got the new UltraSPARC-III CPUs, is actually a NUMA system in a box. The CPU boards are 4-way with local cache coherency, and then there is full directory-based cache coherency between the boards, i.e. not a full address broadcast like the E10000 did. And this is what Sequent was trying to do in the NUMA-Q systems around 1998, for scalability to large processor counts.
Keep up the interesting posts, and keep getting your hands dirty inside Oracle. I’m attacking Oracle from the outside in, so I have to work with the standard user-land tools.
John