Archive for the 'I/O Topics' Category

Yes, Storage Arrays Can Deduplicate Oracle Database. Here Is Exactly Why It Doesn’t Matter!

I recently had some cycles on a freshly installed Dell EMC XtremIO Storage Array. I took this opportunity to prepare a blog entry about the never-ending topic of whether or not storage arrays are able to reduce physical data capacity through deduplication of blocks in Oracle Database.

Of Course There Is Duplicate Data In Oracle Datafiles

Before I continue, let me say something that may come as a surprise to you. Yes, Oracle Database has duplicate blocks in tablespaces! Yes, modern storage arrays can achieve astonishing data reduction rates through deduplication–even when the only data in the array is Oracle Database (whether ASM or file systems)!

XtremIO computes and displays global data reduction rate. This makes it a bit more difficult to show the effect of deduplication on Oracle Database because averages across diverse data makes pin-point focus impossible. However, as I was saying, I took some time on a freshly-installed XtremIO array and collected what I hope will be interesting information on the topic of deduplication.

Please take a look at Figure 1. To start the testing I created a 4TB XtremIO volume, attached it as a LUN to a test host and then created an XFS file system on it. Please be aware that the contents of an Oracle datafile is precisely the same whether stored in ASM or in a file system file. After the file system was created I used the SLOB database creation kit (SLOB/misc/create_database_kit) to create a small database with Oracle Database 12c. As Figure 1 shows, the small database consumed 11.83GB of logical space in the 4TB volume. However, since the data enjoyed a slight deduplication ratio of 1.1:1 and a healthy compression ratio of 3.3:1 for a 3.6:1 data reduction ratio, only 3.27GB physical space was consumed in the array.

markup-gui-after-slob-database-created

Figure 1

The next step in the testing was to consume the majority of the 4TB file system with a BIGFILE tablespace. Figure 2 shows the DDL I used to create the tablespace.

shell-create-3900g-ts

Figure 2

Figure 3 shows the file system file that corresponds to the tablespace created with DDL in Figure 2.

sqlplus-bigfile-info-and-host-to-ls

Figure 3

After creating the 3.9TB BIGFILE tablespace I took a screenshot of the XtremIO GUI Dashboard. As Figure 4 shows, there was no deduplication! Instead, the data was compressed 4.0:1 resulting in only 977.66GB physical space being consumed in the array. So why in the world would I blog the opposite of what I said above? Why show the array did not, in fact, deduplicate the 3.9TB datafile? The answer is in the fact that I said there are duplicate data block in tablespaces. I didn’t say there are duplicate blocks in the same datafile!

gui-after-bigfile-cr

Figure 4

To return the array to the state prior to the BIGFILE tablespace creation, I dropped the tablespace (including contents and datafiles thus unlinking the file) and then used the Linux fstrim(8) command to return the space to the array as shown in Figure 5.

shell-fstrim-after-drop-bigts

Figure 5

Once the fstrim command completed I took another screenshot of the XtremIO GUI Dashboard as shown in Figure 6. Figure 6 shows that the array space utilization and data reduction had returned to that of what was seen before the BIGFILE tablespace creation.

gui-after-bigfile-drop-and-fstrim

Figure 6

OK, Now For The Duplicate Data

The next step in the testing was to fill up the majority of the 4TB file system with SMALLFILE tablespaces. To do so I created 121 tablespaces each consisting of a single SMALLFILE datafile of 32GB. The output shown in Figure 7 is from a data dictionary query to display the size of each of the 121 datafiles and how the sum of these datafiles consumed 3.87TB of the 4TB file system.

shell-after-creating-smalfile-tablespaces

Figure 7

That’s Duplicate Data

Once the file system was filled with SMALLFILE datafiles I took another screenshot of the XtremIO GUI Dashboard. Figure 8 shows that the SMALLFILE datafiles enjoyed a deduplication ratio 81.8:1 combined with a compression ratio of 3.8:1 resulting in a global data reduction rate of 306.9:1. Because of the significant data reduction rate only 12.68GB of physical space was consumed in the array in spite of the 3.79TB logical space (the sum of the SMALLFILE datafiles) being allocated.

gui-after-smallfile-cr

Figure 8

So here we have it! I had a database created with Oracle Database 12c that consisted of 121 32GB files for roughly 3.8TB database size yet XtremIO deduplicated the data down by a factor of 82:1!

So arrays can deduplicate Oracle Database contents! Right? Well, yes, but it matters none whatsoever. Allow me to explain.

Oracle datafiles consist of initialized blocks but vast portions of that initialized content is the same from file to file. This fact can be seen with simple md5sum(1) output. Consider Figure 9 where you can see the output of the md5sum command used to compute Oracle datafile checksums but only after skipping the first 8,692 blocks (8KB blocks). It’s the first approximate 68MB of each datafile that is unique when a datafile is freshly initialized. Beyond that threshold we can see (Figure 9) that the rest of the file content is identical.

markup-shell-with-md5sum

Figure 9

Thus far this blog post has proven that initialized, but empty, Oracle Database datafiles have duplicate data. As the title of this post says, however, it does not matter.

Introduce Application Data To The Mix

Figure 10 shows the commands I used to populate each of the 121 tablespaces with a single table. The table has the sparse characteristic we are all accustomed to with SLOB. That is, I am only creating a single row in each block. Moreover, I’m populating each of these 121 tables with the same application data! This is precisely why I say deduplication of Oracle Database doesn’t matter because it only holds true until any application data is loaded into the data blocks. Figure 10 shows this set of DDL commands.

shell-load_tablespaces-script

Figure 10

After populating the blocks in each of the 121 tables (each residing in a dedicated SMALLFILE tablespace) with blocks containing just a single row of application data I took another screenshot of the XtremIO GUI Dashboard. Figure 11 shows how putting any data into the data blocks reverts the deduplication. Why? Well, remember that the block header of every block has the SCN of the last change made to the block. For this reason I can put the same application data in blocks and still have 100% unique blocks–at least at the 8KB level.

Please note that the application table I used to populate the 121 tables does not consume 100% of the data blocks in each of the SMALLFILE tablespaces. There were a few blocks remaining in each tablespace and thus there remained a scant amount of deduplication as seen in Figure 11. Most XtremIO customers see some insignificant deduplication in their Oracle Database environments. Some even see significant deduplication–at least until they insert data into the database.

gui-after-filling-smallfiles

Figure 11

In a follow-up post I’ll say a few words about the deduplication granularity and how it affects the ability to achieve small amounts of deduplication of unused space in initialized data blocks. However, bear in mind that the net result of any deduplication of Oracle Database data files is that the only space that can be deduplicated is space that has never had application data in it. After all, a SQL DELETE command doesn’t remove data–it only marks it as free in the block.

Summary

I don’t think there are that many Oracle shops that have an urgent need for data reduction of space that’s never been used to store application data. I could be wrong. Until I find out either way, I say that yes you can see deduplication of Oracle Database datafiles but it doesn’t matter one bit.

 

 

 

 

 

 

 

EMC XtremIO – The Full-Featured All-Flash Array. Interested In Oracle Performance? See The Whitepaper.

NOTE: There’s a link to the full article at the end of this post.

I recently submitted a manuscript to the EMC XtremIO Business Unit covering some compelling lab results from testing I concluded earlier this year. I hope you’ll find the paper interesting.

There is a link to the full paper at the bottom of this block post. I’ve pasted the executive summary here:

Executive Summary

Physical I/O patterns generated by Oracle Database workloads are well understood. The predictable nature of these I/O characteristics have historically enabled platform vendors to implement widely varying I/O acceleration technologies including prefetching, coalescing transfers, tiering, caching and even I/O elimination. However, the key presumption central to all of these acceleration technologies is that there is an identifiable active data set. While it is true that Oracle Database workloads generally settle on an active data set, the active data set for a workload is seldom static—it tends to move based on easily understood factors such as data aging or business workflow (e.g., “month-end processing”) and even the data source itself. Identifying the current active data set and keeping up with movement of the active data set is complex and time consuming due to variability in workloads, workload types, and number of workloads. Storage administrators constantly chase the performance hotspots caused by the active dataset.

All-Flash Arrays (AFAs) can completely eliminate the need to identify the active dataset because of the ability of flash to service any part of a larger data set equally. But not all AFAs are created equal.

Even though numerous AFAs have come to market, obtaining the best performance required by databases is challenging. The challenge isn’t just limited to performance. Modern storage arrays offer a wide variety of features such as deduplication, snapshots, clones, thin provisioning, and replication. These features are built on top of the underlying disk management engine, and are based on the same rules and limitations favoring sequential I/O. Simply substituting flash for hard drives won’t break these features, but neither will it enhance them.

EMC has developed a new class of enterprise data storage system, XtremIO flash array, which is based entirely on flash media. XtremIO’s approach was not simply to substitute flash in an existing storage controller design or software stack, but rather to engineer an entirely new array from the ground-up to unlock flash’s full performance potential and deliver array-based capabilities that are unprecedented in the context of current storage systems.

This paper will help the reader understand Oracle Database performance bottlenecks and how XtremIO AFAs can help address such bottlenecks with its unique capability to deal with constant variance in the I/O profile and load levels. We demonstrate that it takes a highly flash-optimized architecture to ensure the best Oracle Database user experience. Please read more:  Link to full paper from emc.com.

Oracle Exadata Storage Server: A Black Box with No Statistics.

A question came in about whether it is possible to measure how much data is filtered out when running a query serviced by a Smart Scan in the Exadata Storage Server grid. The following is the long answer.

An Example of Offload Processing Effectiveness Accounting

I’d like to answer this question by taking real information from a test configuration consisting of 4 Real Application Clusters nodes attached to 6 Exadata Storage Server cells with the SAS disk option (12 x 300GB 3.5″ 15K RPM).

Test Workload – The Affinity Card Program Test Database

The Affinity Card Program Test Database (ACPTD) is a test schema and set of queries that mimics the type of queries performed by a marketing group responsible for promoting the use of a retail club card and affinity credit card for a fictitious company. In this example deployment of the ACPTD, the group responsible for promoting club card and affinity credit card activity for Acme Inc has built a data mart with data from several sources, including the main corporate ERP and CRM systems; partner merchant ERP systems; and, outside syndicated data.

For this blog post I’ll focus on the main syndicated card transaction table called all_card_trans. However, there are other tables in the schema and for reference sake the following text box shows the table sizes in my current deployment of the test kit. As the query shows, the all_card_trans table is 613 GB. Yes, I know that’s small, but this is a test system and I don’t like watching 6 Exadata cells hammering 1 GB/s each for tens of minutes when a couple of minutes will do. I hope you’ll understand.

SQL> set echo on
SQL> @tab_sizes
SQL> col segment_name for a32
SQL> select segment_name, sum(bytes)/1024/1024 mb from user_segments
  2  where segment_name not like 'BIN%' and segment_name not like 'SYS%'
  3  group by segment_name;

SEGMENT_NAME                             MB
-------------------------------- ----------
PARTNER_MERCHANTS                         8
PARTNER_MERCHANT_SALES                67780
ZIPCODES                                  8
OUR_SALES                             21956
ALL_CARD_TRANS                       629032
MCC                                       8
OUR_STORES                                8
CUST_SERVICE                             76
TM                                   .03125
CUST_FACT                              4708

10 rows selected.

The table definition for the all_card_trans table is shown in the following text box:

SQL> desc all_card_trans
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CARD_NO                                   NOT NULL VARCHAR2(20)
 CARD_TYPE                                          CHAR(20)
 MCC                                       NOT NULL NUMBER(6)
 PURCHASE_AMT                              NOT NULL NUMBER(6)
 PURCHASE_DT                               NOT NULL DATE
 MERCHANT_CODE                             NOT NULL NUMBER(7)
 MERCHANT_CITY                             NOT NULL VARCHAR2(40)
 MERCHANT_STATE                            NOT NULL CHAR(3)
 MERCHANT_ZIP                              NOT NULL NUMBER(6)

The following queries show the total row count in the all_card_trans table as well as the number of rows representing transactions with credit cards with leading 4 digits of 3333 and 4447, which represent the card block granted to our fictitious company, Acme Inc, for their affinity card program. This is a needle in a haystack query since only .3% of the data matches the where predicate. Now, before you look at the text box please note that this is a single instance of Real Application Clusters driving this query. And, yes, the RAC node is a small 8-core DL360. Both scans are driving storage at 6 GB/s (613GB/102s). Folks, remember that to do what I just did in the following text box with Fibre Channel I’d have to have a system with 15 FC HBAs attached to several huge SAN arrays. Oh, and I should point out that it would also require about 85 Netezza Snippet Processing Units to match this throughput.

SQL> select count(*) from all_card_trans where  card_no like '4777%' or card_no like '3333%’;

  COUNT(*)
----------
  22465830

Elapsed: 00:01:42.71
SQL> select count(*) from all_card_trans;

  COUNT(*)
----------
7897611641

Elapsed: 00:01:42.47

I Can’t Believe I Ate the Whole Thing

The following query uses a method for measuring cell activity by forcing the database to ingest all rows and all columns. Don’t try to make sense of the query because it isn’t supposed to make sense. It is only supposed to drive Exadata to return all rows and all columns, which it does.

$ more demo_offload_bytes.sql

drop view ingest;
create view ingest as
select avg(length(CARD_NO)) c1,
avg(length(CARD_TYPE)) c2,
max(MCC + PURCHASE_AMT) c3,
max(length(PURCHASE_DT)) c4,
max(MERCHANT_CODE) c5,
avg(length(MERCHANT_CITY)) c6,
avg(length(MERCHANT_STATE)) c7,
max(MERCHANT_ZIP) c8 from all_card_trans;

col MB format 99999999.9
select NAME,VALUE/(1024*1024*1024) GB from gv$sysstat where STATISTIC# in (196, 44, 43 );

select c1+c2+c3+c4+c5+c6+c7 from ingest;

select NAME,VALUE/(1024*1024*1024) GB from gv$sysstat where STATISTIC# in (196, 44, 43);

The following text box shows that when I ran the query Exadata (in aggregate) scanned 613 GB of disk and returned all but 6% to the RDBMS instance (delta cell physical IO interconnect bytes). Also recorded by v$sysstat (bytes eligible) is the fact that there was nothing peculiar about the data being scanned-peculiar in any such fashion that would have interfered with offload processing (e.g., chained rows, etc). Since I asked for all the data, that is what I got. It is nice to know, however, that the entirety of the data was a candidate for offload processing.

Notice I didn’t time the query. I’ll offer more on that later in this blog post.

SQL> @demo_offload_bytes

View dropped.

View created.

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           1227.80814
cell physical IO interconnect bytes                              1151.52453
cell physical IO bytes eligible for predicate offload            1227.31451

C1+C2+C3+C4+C5+C6+C7
--------------------
          9010106.59

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                            1841.5774
cell physical IO interconnect bytes                              1727.15227
cell physical IO bytes eligible for predicate offload            1840.97177

So let’s take a look at some offload processing. The following change to the “ingest” query reduces the amount of data ingested by the database tier by only selecting rows where the credit card number started with 4777 or 3333. We know from the scan of the table that this whittles down the dataset by 99.7%.

$ more demo_offload_bytes1.sql

create view ingest_less as
select avg(length(CARD_NO)) c1,
avg(length(CARD_TYPE)) c2,
max(MCC + PURCHASE_AMT) c3,
max(length(PURCHASE_DT)) c4,
max(MERCHANT_CODE) c5,
avg(length(MERCHANT_CITY)) c6,
avg(length(MERCHANT_STATE)) c7,
max(MERCHANT_ZIP) c8 from all_card_trans act
where card_no like '4777%' or card_no like '3333%';

col MB format 99999999.9
select NAME,VALUE/(1024*1024*1024) GB from gv$sysstat where STATISTIC# in (196, 44, 43 );

set timing on
select c1+c2+c3+c4+c5+c6+c7 from ingest_less;

select NAME,VALUE/(1024*1024*1024) GB from gv$sysstat where STATISTIC# in (196, 44, 43);

The following text box shows that that Exadata scanned and filtered out the uninteresting rows. The projected columns accounted for only 1.7 GB across the iDB intelligent I/O fabric. That is, the database tier only had to ingest 1.7 GB, or 17 MB/s since the query completed in 103 seconds.

SQL> @demo_offload_bytes_1

View dropped.

Elapsed: 00:00:00.21

View created.

Elapsed: 00:00:00.00

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           1841.13743
cell physical IO interconnect bytes                              43.1782733
cell physical IO bytes eligible for predicate offload            1840.97177

Elapsed: 00:00:00.01

C1+C2+C3+C4+C5+C6+C7
--------------------
           9010106.9

Elapsed: 00:01:43.31

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           2454.80203
cell physical IO interconnect bytes                              44.8776852
cell physical IO bytes eligible for predicate offload            2454.62903

Elapsed: 00:00:00.00
SQL>

So, yes, Exadata does let you see how effectively Smart Scans are working. In this example we saw a single-table scan on a small Oracle Exadata Storage Server grid weeding out 99.7% of the data.

Sneaky, Sneaky.

You’ll notice when I executed demo_offload_bytes.sql (the non-filtering example) I was not tracking query completion time. That’s because up to this point I’ve only been showing examples driven by a single Real Application Clusters node. Now, let’s think about this. I have 8 Intel processor cores and a single inbound Infiniband path on the database server. From an inbound I/O bandwidth perspective the host can “only” ingest 1.6GB/s, but the CPUs may also further throttle that since the database is doing its own projection in this case.

I’ve shown, in this post, that this test Exadata Storage grid configuration can scan disk at 6 GB/s. The question you might ask-and I’m about to answer-is how much does this single database host throttle storage and why does that matter? Well, it matters because having ample storage bandwidth with limited database server bandwidth is the classic imbalance addressed by the HP Oracle Database Machine. So, let me run it again-with timing turned on. As you’ll see, bottlenecks are bottlenecks. This single database host cannot keep up with storage, so storage is throttled and the result is 18X increase in query completion time when compared to the heavily filtered case. Both queries had to read the same amount of data but in this case there was an imbalance in the upstream ability to ingest the data both from an I/O bandwidth and CPU perspective.

SQL> @demo_offload_bytes

View dropped.

Elapsed: 00:00:00.02

View created.

Elapsed: 00:00:00.01

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           2454.81084
cell physical IO interconnect bytes                               44.886501
cell physical IO bytes eligible for predicate offload            2454.62903

Elapsed: 00:00:00.00

C1+C2+C3+C4+C5+C6+C7
--------------------
          9010106.59

Elapsed: 00:29:26.85

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           3068.55792
cell physical IO interconnect bytes                              620.492259
cell physical IO bytes eligible for predicate offload            3068.28629

Elapsed: 00:00:00.00

Need More RDBMS Bandwidth? OK, Use It.

So, let’s see what the full-ingestion case looks like with 4 Real Application Clusters nodes. And please forgive that I forgot to aggregate the v$sysstat output. I’ve added 4-fold database grid resources so I should complete the full-ingestion query in 75% less time. The text box starts out showing that the number of RAC instances was increased from 1 to 4.

SQL> host date
Fri Oct  3 16:27:52 PDT 2008

SQL> select instance_name from gv$instance;

INSTANCE_NAME
----------------
test1

Elapsed: 00:00:00.02
SQL> select instance_name from gv$instance;

INSTANCE_NAME
----------------
test1
test4
test3
test2

Elapsed: 00:00:00.08
SQL> host date
Fri Oct  3 16:32:06 PDT 2008
SQL>
SQL> @demo_offload_bytes

View dropped.

Elapsed: 00:00:00.03

View created.

Elapsed: 00:00:00.01

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           3068.69843
cell physical IO interconnect bytes                              620.632769
cell physical IO bytes eligible for predicate offload            3068.28629
physical IO disk bytes                                           .013385296
cell physical IO interconnect bytes                              .013385296
cell physical IO bytes eligible for predicate offload                     0
physical IO disk bytes                                           .010355473
cell physical IO interconnect bytes                              .010355473
cell physical IO bytes eligible for predicate offload                     0
physical IO disk bytes                                           .031275272
cell physical IO interconnect bytes                              .031275272

NAME                                                                     GB
---------------------------------------------------------------- ----------
cell physical IO bytes eligible for predicate offload                     0

12 rows selected.

Elapsed: 00:00:00.02

C1+C2+C3+C4+C5+C6+C7
--------------------
          9010106.59
Elapsed: 00:07:25.01

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           3222.13893
cell physical IO interconnect bytes                              764.535856
cell physical IO bytes eligible for predicate offload            3221.70425
physical IO disk bytes                                            154.31714
cell physical IO interconnect bytes                              144.732708
cell physical IO bytes eligible for predicate offload            154.280945
physical IO disk bytes                                           152.860648
cell physical IO interconnect bytes                              143.367385
cell physical IO bytes eligible for predicate offload            152.828033
physical IO disk bytes                                           153.195674
cell physical IO interconnect bytes                                143.6831

NAME                                                                     GB
---------------------------------------------------------------- ----------
cell physical IO bytes eligible for predicate offload             153.13031

12 rows selected.

Elapsed: 00:00:00.02
SQL>

Well, that was predictable, but cool nonetheless. What about the filtered ingest-query? Should there be much of a speedup given the fact that storage bandwidth remains constant and the ingest rate of the filtered query was only roughly 17 MB/s with a single Real Application Clusters node? I could save the cut and paste effort and just tell you that adding Real Application Clusters nodes in this case will, of course, not reduce the query time, but since I’ve bored you this long, here it is:

SQL> @demo_offload_bytes_1

View dropped.

Elapsed: 00:00:00.01

View created.

Elapsed: 00:00:00.01

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           3222.15758
cell physical IO interconnect bytes                              764.554509
cell physical IO bytes eligible for predicate offload            3221.70425
physical IO disk bytes                                           154.334342
cell physical IO interconnect bytes                               144.74991
cell physical IO bytes eligible for predicate offload            154.280945
physical IO disk bytes                                           153.213285
cell physical IO interconnect bytes                               143.70071
cell physical IO bytes eligible for predicate offload             153.13031
physical IO disk bytes                                           152.878347
cell physical IO interconnect bytes                              143.385084

NAME                                                                     GB
---------------------------------------------------------------- ----------
cell physical IO bytes eligible for predicate offload            152.828033

12 rows selected.

Elapsed: 00:00:00.01

C1+C2+C3+C4+C5+C6+C7
--------------------
           9010106.9

Elapsed: 00:01:42.79

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           3375.27835
cell physical IO interconnect bytes                              764.974159
cell physical IO bytes eligible for predicate offload            3374.81995
physical IO disk bytes                                             308.0746
cell physical IO interconnect bytes                              144.133529
cell physical IO bytes eligible for predicate offload            307.986572
physical IO disk bytes                                           310.634206
cell physical IO interconnect bytes                              145.189768
cell physical IO bytes eligible for predicate offload              310.5755
physical IO disk bytes                                           302.274069
cell physical IO interconnect bytes                              143.805239

NAME                                                                     GB
---------------------------------------------------------------- ----------
cell physical IO bytes eligible for predicate offload            302.218781

12 rows selected.

Elapsed: 00:00:00.01
SQL> host date
Fri Oct  3 16:49:32 PDT 2008

Summary

Exadata works and has statistics so that the solution doesn’t feel so much like a black box. In the end, however, none of the statistcs stuff really matters. What matters is whether your queries complete in the desired service times.

Oracle Exadata Storage Server Version 1. A “FAQ” is Born. Part I.

BLOG UPDATE (22-MAR-10): Readers, please be aware that this blog entry is about the HP Oracle Database Machine (V1).

BLOG UPDATE (01-JUN-09). According to my blog statistics, a good number of new readers find my blog by being referred to this page by google. I’d like to draw new readers’ attention to the sidebar at the right where there are pages dedicated to indexing my Exadata-related posts.  The original blog post follows:

I expected Oracle Exadata Storage Server to make an instant splash, but the blogosphere has really taken off like a rocket with the topic. Unfortunately there is already quite a bit of misinformation out there. I’d like to approach this with routine quasi-frequently asked question posts. When I find misinformation, I’ll make a blog update. So consider this installment number one.

Q. What does the word programmable mean in the product name Exadata Programmable Storage Server?

A. I don’t know, but it certainly has nothing to do with Oracle Exadata Storage Server. I have seen this moniker misapplied to the product. An Exadata Storage Server “Cell”-as we call them-is no more programmable than a Fibre Channel SAN or NAS Filer. Well, it is of course to the Exadata product development organization, but there is nothing programmable for the field. I think, perhaps, someone may have thought that Exadata is a field programmable gate array (FPGA) approach to solving the problem of offloading query intelligence to storage. Exadata is not field-“programmable” and it doesn’t use or need FPGA technology.

Q. How can Exadata be so powerful if there is only a single 1gb path from the storage cells to the switch?

A. I saw this on a blog post today and it is an incorrect assertion. In fact, I saw a blogger state, “1gb/s???? that’s not that good.” I couldn’t agree more. This is just a common notation blunder. There is, in fact, 20 Gb bandwidth between each Cell and each host in the database grid, which is close to 2 gigabytes of bandwidth (maximum theoretical 1850MB/s due to the IB cards though). I should point out that none of the physical plumbing is “secret-sauce.” Exadata leverages commodity components and open standards (e.g., OFED ).

Q. How does Exadata change the SGA caching dynamic?

A. It doesn’t. Everything that is cached today in the SGA will still be cached. Most Exadata reads are buffered in the PGA since the plan is generally a full scan. That is not to say that there is no Exadata value for indexes, because there can be. Exadata scans indexes and tables with the same I/O dynamic.

Q. This Exadata stuff must be based on NAND FLASH Solid State Disk

A. No, it isn’t and I won’t talk about futures. Exadata doesn’t really need Solid State Disk. Let’s think this one through. Large sequential read and write  speed is about the same on FLASH SSD as rotating media, but random I/O is very fast. 12 Hard Disk Drives can saturate the I/O controller so plugging SSD in where the 3.5″ HDDs are would be a waste.

Q. Why mention sequential disk I/O performance since sequential accesses will only occur in rare circumstances (e.g., non-concurrent scans).

A. Yes, and the question is what? No, honestly. I’ll touch on this. Of course concurrent queries attacking the same physical disks will introduce seek times and rotational delays. And the “competition” can somehow magically scan different table extents on the same disks without causing the same drive dynamic? Of course not. If Exadata is servicing concurrent queries that attack different regions of the same drives then, yes, by all means there will be seeks. Those seek, by the way, are followed by 4 sequential 1MB I/O operations so the seek time is essentailly amortized out.

Q. Is Exadata I/O really sequential, ever?

A. I get this one a lot and it generally comes from folks that know Automatic Storage Management (ASM). Exadata leverages ASM normal redundancy mirroring which mirrors and stripes the data. Oh my, doesn’t that entail textbook random I/O? No, not really. ASM will “fill” a disk from the “outside-in. ” This does not create a totally random I/O pattern since this placement doesn’t randomize from the outer edge of the platters to the spindle and back. In general, the “next” read on any given disk involved in a scan will be at a greater offset in the physical device and not that “far” from the previous sectors read. This does not create the pathological seek times that would be associated with a true random I/O profile.

When Exadata is scanning a disk that is part of an ASM normal redundancy disk group and needs to “advance forward” to get the next portion of the table, Exadata directs the drive mechanics to position at the specific offset where it will read an ASM allocation unit of data, and on and on it goes. Head movements of this variety are considered “short seeks.” I know what the competition says about this topic in their positioning papers. Misinformation will be propagated.

Let me see if I can handle this topic in a different manner. If HP Oracle Exadata Storage Server was a totally random I/O train wreck then it wouldn’t likely be able to drive all the disks in the system at ~85MB/s. In the end, I personally think the demonstrated throughput is more interesting than an academic argument one might stumble upon in an anti-Exadata positioning paper.

Well, I think I’ll wrap this up as installment one of an on-going thread of Q&A on HP Oracle Exadata Storage Server and the HP Oracle Database Machine.

Don’t forget to read Ron Weiss’ Oracle Exadata Storage Server Technical Product Whitepaper. Ron is a good guy and it is a very informative piece. Consider it required reading-especially if you are trolling my site in the role of competitive technical marketing. <smiley>

Manly Men Only Deploy Oracle with Fibre Channel – Part V. What About Oracle9i on RHAS 2.1? Yippie!

Due to my Manly Man Fibre Channel Series Part I , Part II , Part III and Part IV, my email box is getting loaded with a lot of questions about various Oracle over NFS combinations. The questions run the gamut from how to best tune Oracle9i on Red Hat AS 2.1 to Oracle10g on Red Hat RHEL 3 (all on NAS/NFS of course). And then it dawned on me. When I say I’m a fan of Oracle over NFS, that is just entirely too generic.

It Ain’t Linux Unless It Is a 2.6 Kernel
Honestly folks, Red Hat 3.0-or worse yet, RHAS 2.1? Sheer madness. I’m more than convinced that there are a lot of solid RHEL 3.0 systems out there running Oracle. To those folks I’d say, “If it isn’t broken, don’t fix it.” But RHAS 2.1? That wasn’t even an operating system and to be hyper-critically honest, the “franken-kernel” that was RHEL 3.0 wasn’t really that much better, what with that hugemem 4×4 split garbage and all. SuSE SLES8 was vastly more stable than RHEL 3.0. But I digress. Look, if you are running on a pre-2.6 Kernel Linux distribution you’ve simply got to do yourself a favor and plan an upgrade! Now, back to NAS.

What Oracle on NFS?
I’ll be brief, I wouldn’t even think about using Oracle9i on NAS. I know there are a ton of databases out there doing it, but that is just me. The Oracle Server code specific to NFS (Operating System Dependent code) has gone through some serious evolution/maturation. I’ve watched the changes specifically handling NFS mature from 9i through 10g and now into 11g. Simply put, I didn’t like what I say in Oracle9i-specific to NFS that is. Oracle9i is a perfectly fine release-albeit the port to 64bit Linux was pretty scary. I guess I wasn’t that brief. So I’ll continue.

So, Oracle9i on NAS is a no-go (in my book), what about Oracle10g? There again, I’ll be brief. In my opinion, Oracle10gR1 on NAS was about as elegant as a fish flopping around on a hot sidewalk-not a pretty picture. Yes, I have my reasons why for all this stuff, but this blog entry is purely an assertion of my opinion.

Thus far, I discussed 9i and 10gR1 Linux ports. I cannot speak authoritatively about the Solaris ports of either vis a vis fitness for NFS. If I was a betting man and had two dimes to rub together I would wager them that even the Solaris releases of 9i and 10g were probably pretty shaky on NAS. That leads us to 10gR2.

Solid
Oracle10gR2 on NAS is solid-at least for Linux clients. I have seen Metalink stories about Legacy Unix ports that have RMAN problems with NFS as a near-line backup target. Again, I cannot speak for all these sundry platforms. They are good platforms, but I don’t deal with them day to day.

11g
Don’t jump the gun…tomorrow AM…

Examples
In this May 5, 2007 post on toasters, a list participant posted the following:

We are about to start testing Oracle 9i (single instance) with NetApp NAS (6070) filers. We currently have Oracle running on Solaris 9 with SAN storage attached and VERITAS.

I wouldn’t touch that project with a 10 foot pole. If that database is stable, I wouldn’t switch out the storage architecture-especially on that old of an Oracle release.

I’ve also had a thread going with Chen Shapira who has blogged about Oracle troubles on NAS. Her point throughout that blog entry, and the comments to follow, was that they’ve suffered uptime impact that never really solidly indicts to the storage, but there seems to be a lot of fingers pointed that way. Having read of the types of instability his systems have suffered, I suspected old stuff. It came out in the comment section that they are on RHEL 3.0 64-bit. Now, like I’ve said, RHEL 3.0 is carrying a lot of Oracle databases out there I know, but I wonder how many on NAS? When I say Oracle on NFS, I’m mostly saying Linux Oracle10gR2 releases on Linux 2.6 Kernels—and beyond.

I made a blog entry on this topic back in October of last year as well.

Old Operating System Releases
I take criticism (by true believers mostly) when I point out that running Oracle on a Legacy Unix release that is, say, four years old is not a reason for concern. I wish I could say the same thing about the current state of the art in the Linux world. Dating back to my first high-end Linux project (The Tens–A 10 Node, 10TB, 10,000 User Oracle9i Linux Cluster Project in 2002), I’ve been routinely reminded that Linux stands for:

(L)inux (i)s (n)ot (u)ni(x)

Now, that said, you’ll find much less dissatisfaction with Oracle in general on 2.6 Linux Kernel based systems, but in my opinion, that goes extra for NAS deployments

SAN Admins: Please Give Me As Much Capacity From As Few Spindles As Possible!

I was catching up on my mojo reading when I caught a little snippet I’d like to blog about. Oh, by the way, have I mentioned recently that StorageMojo is one of my favorite blogs?

In Robin Harris’ latest installment about ZFS action at Apple, he let out a glimpse of one of his other apparent morbid curiosities—flash. Just joking, I don’t think ZFS on Mac or flash technology are morbid, it just sounded catchy. Anyway, he says:

I’ve been delving deep into flash disks. Can you say “weird”? My take now is that flash drives are to disk drives what quantum mechanics is to Newtonian physics. I’m planning to have something out next week.

I look forward to what he has to say. I too have a great interest in flash.

Now, folks, just because we are Oracle-types and Jim Grey was/is a Microsoft researcher, you cannot overlook what sorts of things Jim was/is interesting in. Jim’s work has had a huge impact on technology over the years and it turns out that Jim took/takes an interest in flash technology with servers in mind. Just the abstract of that paper makes it a natural must-read for Oracle performance minded individuals. Why? Because it states (with emphasis added by me):

Executive summary: Future flash-based disks could provide breakthroughs in IOps, power, reliability, and volumetric capacity when compared to conventional disks.

 

Yes, IOps! Nothing else really matter where Oracle database is concerned. How can I say that? Folks, round-brown spinning things do sequential I/O just fine—naturally. What they don’t do is random I/O. To make it worse, most SAN array controllers (you know, that late 1990’s technology) pile on overhead that further choke off random I/O performance. Combine all that with the standard IT blunder of allocating space for Oracle on a pure capacity basis and you get the classic OakTable Network response:

Attention DBAs, it’s time for some déjà vu. I’ll state with belligerent repetition, redundantly, over and over, monotonously reiterating this one very important recurrent bit of advice: Do everything you can to get spindles from your storage group—not just capacity.

 

Flash
Yes that’s right, it wont be long (in relative terms) until you see flash memory storage fit for Oracle databases. The aspect of this likely future trend that I can’t predict, however, is what impact such technology would have on the entrenched SAN array providers. Will it make it more difficult to keep the margins at the levels they demand, or will flash be the final straw that commoditizes enterprise storage? Then again, and Jim Grey points out in that paper, flash density isn’t even being driven by the PC—and most certainly not enterprise storage—ecosystem. The density is being driven by consumer and mobile applications. Hey, I want my MTV. Um, like all of it, crammed into my credit-card sized mpeg player too.

When?
When it gets cheaper and higher capacity of course. Well, its not exactly that simple. I went spelunking for that Samsung 1.8” 32GB SSD and found two providers with street price of roughly USD $700.00 for 32GB here and here. In fact, upon further investigation, Ritek may soon offer a 32GB device at some $8 per GB. But let’s stick with current product for the moment. At $22 per GB, we’re not exactly talking SATA which runs more on the order of $.35 per GB. But then we are talking enterprise applications here, so a better comparison would be to Fibre drives which go for about $3-$4 per GB.

Now that is interesting since Jim Grey pointed out that in-spite of some industry predictions setting the stage for NAND to double every year, NAND had in fact gained 16 fold in 4 years–off by year. If that pace continues, could we really expect 512GB 1.8″ SSD devices in the next 4 years? And would the price stay relatively constant yielding a cost of something like $1.35 per GB? Remember, even the current state of the art (e.g., the Samsung 1.8″ 32GB SSD) delivers on the order of 130,000 random single-sector IOps–that’s approximately 7usec latency for a random I/O. At least that is what Samsung’s literature claims. Jim’s paper, on the other hand reports grim current art performance when measured with DskSpd.exe:

The story for random IOs is more complex – and disappointing. For the typical 4-deep 8KB random request, read performance is a spectacular 2,800 requests/second but write performance is a disappointing 27 requests/second.

The technology is young and technically superior, but there is work to do in getting the most out of NSSD as the paper reports. Jim suspects that short term quick fixes could be made to bring the random I/O performance for 8KB transfers on today’s NSSD technology up to about 1,700 IOps split evenly between read and write. Consider, however, that real world applications seldom exhibit a read:write ratio of 50:50. Jim generalized on the TPC-C workload as a case in point. It seems with “some re-engineering” (Jim’s words) even today’s SSD would be a great replacement for hard drives for typical Oracle OLTP workloads since you’ll see more 70:30 read:write ratios in the real world. And what about sequential writes? Well, there again, even today’s technology can handle some 35MB/s of sequential writes so direct path writes (e.g., sort spills) and redo log writes would be well taken care of. But alas, the $$/GB is still off. Time will fix that problem and when it does, NSSD will be a great fit for databases.

Don’t think for a moment Oracle Corporation is going to pass up on enabling customers to exploit that sort of performance–with or without the major SAN vendors.

But flash burns out, right? Well, yes and no. The thing that matters is how long the device lasts-the sum of its parts. MTBF numbers are crude, but Samsung sticks a 1,000,000hr MTBF on this little jewel-how cool.
Well, I’ve got the cart well ahead of the horse here for sure because it is still too expensive, but put it on the back burner, because we aren’t using Betamax now and I expect we’ll be using fewer round-brown spinning things in the span of our careers.

Apple OS/X with ZFS Shall Rule the World.

The problem with blogging is there is no way to clearly write something tongue-in-cheek. No facial expressions, no smirking, no hand waving—just plain old black and white. That aside, I can’t resist posting a follow-up to a post on StorageMojo about ZFS performance. I’m not taking a swipe at StorageMojo because it is one of my favorite blogs. However, every now and again third party perspective is a healthy thing. The topic at hand is this post about ZFS performance which is a digest of this post on digitalbadger.net. The original post starts with:

I have seen many benchmarks indicating that for general usage, ZFS should be at least as fast if not faster than UFS (directio not withstanding)[…]

Right off the bat I was scratching my head. I have experience with direct I/O dating back to 1991 and have some direct I/O related posts here. I was trying to figure out what the phrase “directio not withstanding” was supposed to mean. The only performance boost direct I/O can possibly yield is due to the elimination of the write-ordering locks generally imposed by POSIX and the double-buffering/memcopys overhead associated with the DMA into the page cache—and subsequent copy into the user address space buffer. So for a read-intensive benchmark, about the only thing one should expect is less processor overhead, not increased throughput.

Anyway, the post continues:

To give a little background : I have been experiencing really bad throughput on our 3510-based SAN. The hosts are X4100s, 12Gb RAM, 2x dual core 2.6Ghz opterons and Solaris 10 11/06. They are each connected to a 3510FC dual-controller array via a dual-port HBA and 2 Brocade SW200e switches, using MxPIO. All fabric is at 2Gb/s […]

OK, the 4100 is a 2-way Opteron 2000 server which, being a lot like an HP Proliant DL385, should have no problem consuming all the data the 2 x 2Gb FCP paths can deliver—roughly 400MB/s. If the post is about a UFS versus ZFS apples-apples comparison, both results should top out at the maximum theoretical throughput of 2x2Gb FCP. The post continues:

On average, I was seeing some pretty average to poor rates on the non-ZFS volumes depending on how they were configured, on average I was seeing 65MB/s write performance and around 450MB/s read on the SAN using a single drive LUN.

I’d be plenty happy with that 450MB/s given that configuration. The post insinuates this is direct I/O so getting 450MB/s out of 2x2Gb FCP should be the end of the contest. Further, getting 65MB/s written to a single drive LUN is quite snappy! What is left to benchmark? It seems UFS cranks the SAN at full bandwidth.

The post continues:

I then tried ZFS, and immediately started seeing a crazy rate of around 1GB/s read AND write, peaking at close to 2GB/s. Given that this was round 2 to 4 times the capacity of the fabric, it was clear something was going awry.

Hmmm…Opterons with HT 2.0 attaining peaks of 2GB/s write throughput via 2x2Gb FCP? There is nothing awry about this, it is simply not writing the data to the storage.

This is a memory test. ZFS is caching, UFS is not.

The post continues:

Many thanks to Greg Menke and Tim Bradshaw on comp.unix.solaris for their help in unravelling this mystery!

What mystery? I’d recommend testing something that runs longer than 1 second (reading 512MB from memory on any Opteron system should take about .6 seconds. I’d recommend a dataset that is about 2 fold larger than physical memory.

How Does ZFS and UFS Compare to Other Filesystems?
Let’s take a look at an equivalent test running on the HP Cluster Gateway internal filesystem (the former PolyServe PSFS). The following is a screen shot of a dd(1) test using real Oracle datafiles on a Proliant DL585. The first test executes a single thread of dd(1) reading the first 512MB of one of the files using 1M read requests. The throughput is 815MB/s. Next, I ran 2 concurrent dd(1) processes each chomping the first 512MB out of two different Oracle datafiles using 1MB read requests. In the concurrent case, the throughput was 1.46GB/s.

NOTE: You may have to right click->view the image.

zfsthing1.jpg

But wait, in order to prove the vast superiority of the HP Cluster Gateway filesystem over ZFS using similar hardware, I fired off 6 concurrent dd(1) processes each reading the first 512MB out of six different files. Here I measured 2.9GB/s!

zfsthing2.jpg


DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 2,949 other followers

Oracle ACE Program Status

Click It

website metrics

Fond Memories

Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.

%d bloggers like this: