This is installment number four in my series on Oracle Exadata Storage Server and HP Oracle Database Machine frequently asked questions. I recommend you also visit:
Oracle Exadata Storage Server Frequently Asked Questions Part I.
Oracle Exadata Storage Server Frequently Asked Questions Part II.
Oracle Exadata Storage Server Frequently Asked Questions Part III.
I’m mostly cutting and pasting questions from the comment threads of my blog posts about Exadata and mixing in some assertions I’ve seen on the web, re-phrased as questions. If they already read as questions when I see them, I cut and paste them without modification.
Q. What is meant in the Exadata white paper’s statement about Smart Scan Predicate Filtering that “…only rows where the employee’s hire date is after the specified date are sent from Exadata to the database instance…”? Does it really return only the rows matching the predicate, or does it return all blocks containing rows that match the predicate? If the former is correct, how is this handled in the db block buffer?
A. Actually, that statement is a bit too liberal. Oracle Exadata Storage Server Smart Scans return only the cited columns from only the filtered rows, not entire rows as that statement may suggest. Results from a Smart Scan are not “real” database blocks, so the results do not get cached in the SGA. Consider the following business question:
List our Club Card members who have spent more than $1,000.00 at non-partner retail merchants in the last 180 days with our affinity credit card. Consider only non-partner merchants within a 10-mile radius of one of our stores.
The SQL statement in the following text box answers this question (assuming 10-mile-wide US postal code zones):
select cf.custid, sum(act.purchase_amt) sales
from all_card_trans act, cust_fact cf
where ( act.card_no like '4777%' or act.card_no like '3333%' )
and act.card_no = cf.aff_cc_num
and cf.club_card_num not like '0%'
and act.purchase_dt > to_date('07-MAR-2008','dd-mon-yyyy')
and act.merchant_zip in ( select distinct(zip) from our_stores )
and act.merchant_code not in ( select merchant_code from partner_merchants )
group by custid
having sum(act.purchase_amt) > 1000 ;
Intelligent Storage. Filtering and Projecting. Oh, and a Bloom Filter Too.
Now, let’s look at the predicate handling in the plan. The next text box shows that storage is filtering on club card numbers, purchase dates and credit card numbers. Given the way this query functions, all_card_trans is essentially a fact table and cust_fact is more of a dimension table, since all_card_trans has 1,300-fold more rows than cust_fact. You’ll also see that a Bloom filter was pushed to the Exadata Storage Server cells and used to join the filtered cust_fact rows to the filtered all_card_trans rows.
Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter(SUM("ACT"."PURCHASE_AMT")>1000)
   7 - access("ACT"."MERCHANT_CODE"="MERCHANT_CODE")
  10 - access("ACT"."MERCHANT_ZIP"="ZIP")
  15 - access("ACT"."CARD_NO"="CF"."AFF_CC_NUM")
  20 - storage("CF"."CLUB_CARD_NUM" NOT LIKE '0%')
       filter("CF"."CLUB_CARD_NUM" NOT LIKE '0%')
  25 - storage("ACT"."PURCHASE_DT">TO_DATE(' 2008-03-07 00:00:00', 'syyyy-mm-dd hh24:mi:ss')
       AND ("ACT"."CARD_NO" LIKE '4777%' OR "ACT"."CARD_NO" LIKE '3333%'))
       filter("ACT"."PURCHASE_DT">TO_DATE(' 2008-03-07 00:00:00', 'syyyy-mm-dd hh24:mi:ss')
       AND ("ACT"."CARD_NO" LIKE '4777%' OR "ACT"."CARD_NO" LIKE '3333%')
       AND SYS_OP_BLOOM_FILTER(:BF0000,"ACT"."CARD_NO"))
This example should make the answer a bit clearer. Storage is ripping through the rows, yes, but only returning named columns and heavily filtered data.
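For those wondering how to see that effect for themselves, the cell-related session statistics are one place to look: comparing the bytes eligible for offload to the bytes that actually crossed the interconnect gives a feel for how much filtration and projection saved. The statistic names below are as I recall them and can differ between releases, so treat this as an illustrative sketch rather than a definitive recipe:

-- Compare what storage scanned on our behalf to what it shipped back.
-- A large gap between the two values means filtration and projection
-- did their job in the cells.
select sn.name, ms.value
  from v$mystat ms, v$statname sn
 where ms.statistic# = sn.statistic#
   and sn.name in (
       'cell physical IO bytes eligible for predicate offload',
       'cell physical IO interconnect bytes returned by smart scan' );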
The following text box shows frozen output of a tool we used to monitor I/O at a high level across cells. The Exadata Storage cells in this case are named sgbeta1s01 through sgbeta1s06 (I’m like you, I have to scrounge for hardware). These are two 3-second averages showing I/O in excess of 1,000,000 KB/s per cell, or an aggregate of 6 GB/s (see the data under the “bi” column). This is, of course, what we’ve been referring to as a “half-rack,” in spite of the fact that it has only 6 Exadata Storage Server cells instead of the 7 that would make a true half-rack. Nonetheless, this is a 4-table query doing real work and, as you can see in the last text box, I’m connected to a single instance of RAC, on a dual-socket/quad-core server! Without Exadata, a server would need some 15 fibre channel host bus adaptors to drive this query at this I/O rate (a 4Gb FC HBA moves roughly 400 MB/s, and 15 of those is about 6 GB/s), plus a great deal of CPU power. That would be a very large server, and it would still be slower.
Kids, don’t try this at home without supervision
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
15:29:43:    r  b swpd    free    buff   cache si so      bi bo   in    cs us sy id wa st
sgbeta1s01:  4  0  204   63292  134928 1523732  0  0 1033237 24 5062 37992 25  3 72  0  0
sgbeta1s02:  8  0  204   65708  141800 1596124  0  0 1038949  9 5238 37869 25  3 72  0  0
sgbeta1s03:  4  0  208 1183780   57832  584988  0  0 1049712 35 5862 40258 26  3 69  2  0
sgbeta1s04:  2  0  204 1254456   79084  568052  0  0 1053093 13 5329 38462 25  3 72  0  0
sgbeta1s05:  4  0    0  144924  122252 1453324  0  0 1022032 29 4991 38164 25  3 72  0  0
sgbeta1s06: 11  0    0   95808  124200 1418236  0  0 1000635 24 4767 38249 25  2 73  0  0
Minimum:     2  0    0   63292   57832  568052  0  0 1000635  9 4767 37869 25  2 69  0  0
Maximum:    11  0  208 1254456  141800 1596124  0  0 1053093 35 5862 40258 26  3 73  2  0
Average:     5  0  136  467994  110016 1190742  0  0 1032943 22 5208 38499 25  2 71  0  0

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
15:29:46:    r  b swpd    free    buff   cache si so      bi bo   in    cs us sy id wa st
sgbeta1s01:  3  0  204   63012  134928 1523732  0  0 1005643 24 5099 38475 25  2 72  0  0
sgbeta1s02:  2  0  204   65188  141800 1596124  0  0 1011285  3 4867 37671 25  2 73  0  0
sgbeta1s03:  4  0  208 1184408   57832  584988  0  0 1005371 20 5775 39378 25  3 72  0  0
sgbeta1s04:  3  0  204 1255200   79084  568052  0  0  993264 13 4891 38240 24  3 73  0  0
sgbeta1s05:  5  0    0  144320  122256 1453324  0  0 1016859 29 4947 38073 25  3 72  0  0
sgbeta1s06:  6  0    0   95824  124200 1418236  0  0 1019323 24 4961 38224 25  2 73  0  0
Minimum:     2  0    0   63012   57832  568052  0  0  993264  3 4867 37671 24  2 72  0  0
Maximum:     6  0  208 1255200  141800 1596124  0  0 1019323 29 5775 39378 25  3 73  0  0
Average:     3  0  136  467992  110016 1190742  0  0 1008624 18 5090 38343 24  2 72  0  0
[oracle@sgbeta1c01]$ sqlplus '/ as sysdba'

SQL*Plus: Release 11.1.0.7.0 - Production on Thu Oct 2 15:35:55 2008

Copyright (c) 1982, 2008, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options

SQL> select instance_name from gv$instance ;

INSTANCE_NAME
----------------
test1

SQL>
OK, enough … this is starting to get depressing for those of us with I/O-bound systems still on 9i 😀
Seriously though, very interesting. Is there more you can say about the instrumentation of the storage layer? Is it possible to measure the efficiency with which column projection and predicate filtering have been carried out, either for an individual query or at an aggregate level? Maybe this next part is outside your specific area of expertise on this project, but how well is the expected efficiency of these operations integrated into the optimizer’s cost calculations?
In my Firefox browser the right-hand side of the box giving the predicate filtering from the explain plan does not show up… I can view the source, though, to see it all!
I was going to ask a silly question, but the filter information I think answers it, unless of course I am wrong (and not for the first time). So am I getting this right?
The Bloom filter is a probabilistic filter: it returns at least all of the rows that satisfy the filter condition, but it also returns ‘false positive’ rows that do not contain matching data. I was going to ask whether the Bloom-filtered set is returned to the “database” for further processing, or whether it is further filtered on the actual predicates before being sent down the wire. But from filter 25 in the plan I think I see the answer: just the rows of interest are projected.
Two questions about this post:
1. Were there any indexes on these tables in the example query? I am assuming not, and that everything was a full table scan for the data accesses.
2. How long did the query take to execute? And how big was the fact table (all_card_trans)? Both in terms of data row count, and in terms of bytes on disk.
I’m just wondering how well this query might have executed if all the join and constraint columns had been indexed on a ‘normal’ Oracle system.
As ever, you can always optimise a system for one particular query via a combination of indexes and so on, but that does not mean it will handle all queries as fast. So I can see that the Exadata Storage Server gives you more consistent elapsed times, as it always processes all of the data in the table, if you see what I mean. But I am just curious whether you can share the numbers for how this particular test query executed on the Exadata Storage Server, compared to a normal Oracle database with indexes?
John
Hi John,
There were no indexes on the tables. If there had been one on all_card_trans.card_no, I’m quite certain the CBO would have used it. Nonetheless, if there is an index and the CBO chooses to scan it, a Smart Scan can still be used. There is even value in Smart Scan for indexes (e.g., filtration and projection from composite indexes); see the sketch below.
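A minimal, hypothetical sketch of what I mean (the index name and hint are mine, for illustration only, not from the test system): a fast full scan of a composite index is also eligible for Smart Scan, so the cells filter and project index rows just as they do table rows.

-- Hypothetical composite index covering the hot columns:
create index act_cc_dt_ix
  on all_card_trans ( card_no, purchase_dt, purchase_amt );

-- If the CBO scans the index (hinted here purely for illustration),
-- the plan would show an INDEX STORAGE FAST FULL SCAN with a
-- storage(...) predicate, analogous to the table-scan plan above.
select /*+ index_ffs(act act_cc_dt_ix) */ sum(act.purchase_amt)
  from all_card_trans act
 where act.card_no like '4777%';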
The query was not a performance PoC. It was only run to generate an educational piece about Smart Scan plan visibility. I’ve loaded more data and I’ll give a throughput number on it in a bit.
As for non-Exadata Oracle versus Exadata, well, if you did manage to trick out the index scheme, I’ll just do the same thing and leapfrog your result, because indexes mean less I/O but there is still I/O nonetheless, and Exadata will beat non-Exadata at servicing that I/O. What I’m trying to say is that indexes and Exadata are not mutually exclusive. The point is that with Exadata, indexes are a matter of choice, not a necessity.
PS. Are you the same John Brady from the old Sequent days?
“The Bloom filter is a probabilistic filter: it returns at least all of the rows that satisfy the filter condition, but it also returns ‘false positive’ rows that do not contain matching data. I was going to ask whether the Bloom-filtered set is returned to the “database” for further processing, or whether it is further filtered on the actual predicates before being sent down the wire. But from filter 25 in the plan I think I see the answer: just the rows of interest are projected.”
Pete,
Bloom Filters are pre-Exadata functionality, and just as when a Bloom Filter executes server-side, there can be false positives. The consumer of a Bloom Filter is responsible for chewing the meat and spitting out the bones. So, yes, false positives can make it up the wire, but the consumer will deal with them. The main goal of pushing a Bloom Filter to storage is politeness on the wire and reduced RDBMS host memory impact, thanks to reduced RDMA pressure.
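If you want to watch a server-side Bloom filter chew the meat, one place to look (from memory; treat the view and column names as illustrative, as availability varies by release) is v$sql_join_filter, which reports how many rows a join filter tested versus how many it rejected:

-- PROBED = rows tested against the Bloom filter; FILTERED = rows it
-- rejected. The difference gives a feel for the false positives the
-- join itself must still discard.
select qc_session_id, length, bits_set, probed, filtered
  from v$sql_join_filter;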
Kevin,
Yes, it’s me. I won’t deny it 😉 Good to see that you remember me. I’ve stuck with the Oracle database thing since then, in the UK, at a few different companies. I’m still interested in performance and tuning, and in big systems. I’ve got into reading blogs over the past few years (there was nothing like them in 1995), came across yours, and have been tracking it. As ever, I see you remain buried in the internals of the Oracle database, getting your head around where it spends its time when doing work.
Keep up the posting. It’s detailed, but it always gives good information about what Oracle actually does inside. I even referred back to a post of yours on what happens during a log sync for a question on the Oracle forums: they had poor log sync performance but good I/O times, and I remembered reading something from you on that before. So I just pointed them at your post if they wanted to understand what Oracle does on a log sync, and why it might be slow.
Like you, I think, I miss the big old iron, when we had 32 CPUs in a box and could keep them all busy doing something. When Sun released the T2000 a couple of years ago with the T1 (Niagara) multi-core, multi-threaded CPU in it, I did the maths and reckoned that the one CPU was equivalent to an E10000 StarFire with 32 UltraSPARC-II CPUs at 200 MHz each. So there is an amazing amount of horsepower nowadays in very small footprint systems: 1 rack unit for a T2000 system with something like 8 GB of memory.
It was a pity it did not work out at Sequent. NUMA-Q had a number of issues and was probably trying to do too much, ahead of its time. By comparison, the SunFire 15000 system, released when Sun had just got the new UltraSPARC-III CPUs, is actually a NUMA system in a box. The CPU boards are 4-way with local cache coherency, and then there is full directory-based cache coherency between the boards, i.e. not a full address broadcast like the E10000 did. And this is what Sequent was trying to do in the NUMA-Q systems around 1998, for scalability to large processor counts.
Keep up the interesting posts, and keep getting your hands dirty inside Oracle. I’m attacking Oracle from the outside in, so I have to work with the standard user-land tools.
John