Archive for the 'Oracle I/O Performance' Category

SLOB Chewed Up All My File System Space and Spit It Out. But, Why?

This is a quick blog post in response to a recent interaction with a SLOB user. The user reached out to me to lament that all her file system space was consumed as the result of a SLOB execution ( I reminded her that will alert to possible derelict mpstat/iostat/vmstat processes from an aborted SLOB test. If these processes exist they will be spooling their output to unlinked files.

The following screen shot shows what to expect if a SLOB test detects potential “deadwood” processes. If you see this sort of output from, it’s best to investigate whether in fact they remain from an aborted test or whether there are other users on the system that left these processes behind.


The next screenshot shows how to take action on the output:


If SLOB is consuming all your file system space, it’s probably already telling you why–all you need to do is take action. I hope this is helpful for some wayward Googler someday.

SLOB Physical I/O Randomness. How Random Is Random? Random!

I recently read a blog post by Kyle Hailey regarding some lack of randomness he detected in the Orion I/O generator tool. Feel free to read Kyle’s post but in short he used dtrace to detect Orion was obliterating a very dense subset of the 96GB file Orion was accessing.

I’ve used Orion for many years and, in fact, wrote my first Orion related blog entry about 8 years ago. I find Orion to be useful for some things and of course DBAs must use Orion’s cousin CALIBRATE_IO as a part of their job. However, neither of these tools perform database I/O. If you want to see how database I/O behaves on a platform it’s best to use a database. So, SLOB it is. But wait! Is SLOB is just another storage cache-poking randomness-challenged distraction from your day job? No, it isn’t.

But SLOB Is So Very Difficult To Use

It’s quite simple actually. You can see how simple SLOB is to set up and test by visiting my picture tutorial.

How Random Is Random? Random!

SLOB is utterly random. However, there are some tips I’d like to offer in this post to show you how you can choose even higher levels of randomness in your I/O testing.

Kyle used dtrace and some shell commands to group block visits into buckets. Since I’m analyzing the randomness of SLOB I’ll use a 10046 trace on the database sessions. First I’ll run a 96 user SLOB test with slob.conf->UPDATE_PCT=0.

After the SLOB test was completed I scrambled around to find the trace files and worked out a simple set of sed(1) expressions to spit out the block numbers being visited by each I/O of type db file sequential read:


I then grouped the blocks being visited into buckets much the same way Kyle did in his post:


I’ll show some analysis of the those buckets later in the post.  Yes, SLOB is random as analysis of 96u.blocks.txt will show but it can be even more random if one configures a RECYCLE buffer pool. One of the lesser advertised features of SLOB is the fact that all working tables in the SLOB schemas are created with BUFFER_POOL RECYCLE in the storage clause. The idea behind this is to support the caching of index blocks in the SGA buffer pool. When no RECYCLE pool is allocated there is a battle for footprint in the SGA buffer pool causing even buffers with index blocks to be reused for buffering table/index blocks of the active transactions. Naturally when indexes are not cached there will be slight hot-spots for constant, physical, re-reads of the index blocks. The question becomes what percentage of the I/O do these hot blocks account for?

To determine how hot index blocks are I allocated a recycle buffer pool and ran another 2 minute SLOB test. As per the following screen shot I again grouped block visits into buckets:


After having both SLOB results (with and without RECYCLE buffer pool) I performed a bit of text processing to determine how different the access patterns were in both scenarios. The following shows:

  • The vast majority of blocks are visited 10 or less times in both models
  • The RECYCLE pool model clearly flattens out the re-visit rates as the hotest block is visited only 12 times compared to the 112 visits for the hottest block in the default case
  • If 12 is the golden standard for sparsity (as per the RECYCLE pool test case) then even the default is quite sparse because dense buckets accounted for only 84,583 physical reads compared to the nearly 14 million reads of blocks in the sparse buckets


The following table presents the data including the total I/O operations traced. The number of sparse visits are those blocks that were accessed less than or equal to 10 times during the SLOB test. I should point out that there will naturally be more I/O during a SLOB test when index accesses are forced physical as is the case with the default buffer pools. That is, the RECYCLE buffer pool case will have a slightly higher logical I/O rate (cache hits) due to index buffer accesses.



If you want to know how database I/O performs on a platform use a database. If using a database to test I/O on a platform then by all means drive it with SLOB ( a database I/O tool).

Regarding randomness, even in the default case SLOB proves itself to be very random. If you want to push for even more randomness then the way forward is to configure db_recycle_cache_size.

Enjoy SLOB! The best place to start with SLOB is the SLOB Resources Page.







EMC XtremIO – The Full-Featured All-Flash Array. Interested In Oracle Performance? See The Whitepaper.

NOTE: There’s a link to the full article at the end of this post.

I recently submitted a manuscript to the EMC XtremIO Business Unit covering some compelling lab results from testing I concluded earlier this year. I hope you’ll find the paper interesting.

There is a link to the full paper at the bottom of this block post. I’ve pasted the executive summary here:

Executive Summary

Physical I/O patterns generated by Oracle Database workloads are well understood. The predictable nature of these I/O characteristics have historically enabled platform vendors to implement widely varying I/O acceleration technologies including prefetching, coalescing transfers, tiering, caching and even I/O elimination. However, the key presumption central to all of these acceleration technologies is that there is an identifiable active data set. While it is true that Oracle Database workloads generally settle on an active data set, the active data set for a workload is seldom static—it tends to move based on easily understood factors such as data aging or business workflow (e.g., “month-end processing”) and even the data source itself. Identifying the current active data set and keeping up with movement of the active data set is complex and time consuming due to variability in workloads, workload types, and number of workloads. Storage administrators constantly chase the performance hotspots caused by the active dataset.

All-Flash Arrays (AFAs) can completely eliminate the need to identify the active dataset because of the ability of flash to service any part of a larger data set equally. But not all AFAs are created equal.

Even though numerous AFAs have come to market, obtaining the best performance required by databases is challenging. The challenge isn’t just limited to performance. Modern storage arrays offer a wide variety of features such as deduplication, snapshots, clones, thin provisioning, and replication. These features are built on top of the underlying disk management engine, and are based on the same rules and limitations favoring sequential I/O. Simply substituting flash for hard drives won’t break these features, but neither will it enhance them.

EMC has developed a new class of enterprise data storage system, XtremIO flash array, which is based entirely on flash media. XtremIO’s approach was not simply to substitute flash in an existing storage controller design or software stack, but rather to engineer an entirely new array from the ground-up to unlock flash’s full performance potential and deliver array-based capabilities that are unprecedented in the context of current storage systems.

This paper will help the reader understand Oracle Database performance bottlenecks and how XtremIO AFAs can help address such bottlenecks with its unique capability to deal with constant variance in the I/O profile and load levels. We demonstrate that it takes a highly flash-optimized architecture to ensure the best Oracle Database user experience. Please read more:  Link to full paper from

Oracle’s Timeline, Copious Benchmarks And Internal Deployments Prove Exadata Is The Worlds First (Best?) OLTP Machine – Part I

I recently took a peek at this online, interactive history of Oracle Corporation. When I got to the year 2008, I was surprised to see no mention of the production release of Exadata–the HP Oracle Database Machine. The first release of Exadata occurred in September 2008.

Once I advanced to 2009, however, I found mention of Exadata but I also found a couple of errors:

  • The text says “Sun Oracle Database Machine” yet the photograph is that of an HP Oracle Database Machine (minor, if not comical, error)
  • The text says Exadata “wins benchmarks against key competitors” (major error, unbelievably so)

What’s First, Bad News or Good News?

Bad News

The only benchmark Exadata has ever been used in was this 1TB scale TPC-H in 2009 with HP Blades.  Be aware, as I pointed out in this blog post, that particular TPC-H was an in-memory Parallel Query benchmark. Exadata features were not used. Exadata was a simple block storage device. The table and index scans were conducted against cached blocks in the Oracle SGAs configured in the 64 nodes of the cluster. Exadata served as nothing more than persistent storage for the benchmark. Don’t get me wrong. I’m not saying there was no physical I/O. The database was loaded as a timed test (per TPC-H spec) which took 142 minutes and the first few moments of query processing required physical I/O so the data could be pulled up into the aggregate SGAs. The benchmark also requires updates. However, these ancillary I/O operations did not lean on Exadata feature nor are they comparable to a TPC-H that centers on physical I/O.  So could using Exadata in an in-memory Parallel Query benchmark be classified as winning “benchmarks against key competitors?” Surely not but I’m willing to hear from dissenters on that. Now that the bad news is out of the way I’ll get to what I’m actually blogging about. I’m blogging about the good news.

Good News

The good news I’d like to point out from screenshot (below) of Oracle’s interactive history is that it spares us the torment of referring to the Sun Oracle Exadata Machine as the First Database Machine for OLTP as touted in this press release from that time frame.  A system that offers 60-fold more capacity for random reads than random writes cannot possibly be mistaken as a special-built OLTP machine.  I’m delighted that the screen shot below honestly represents the design center for Exadata which is DW/BI. For that reason, Exadata features have nothing at all to do with OLTP. That’s a good readon the term OLTP is not seen in that screen shot. That is good news.

OLTP does not trigger Smart Scans thus no offload processing (filtration,projection, storage index, etc). Moreover, Hybrid Columnar Compression has nothing to do with OLTP, except perhaps, in an information life cycle management hierarchy. So, there’s the good news. Exadata wasn’t an OLTP machine in Oracle’s timeline and it still is not an OLTP machine. No, Oracle was quite right for not putting the “First OLTP Machine” lark into that timeline. After all, 2009 is quite close to 40 years after the true first OLTP Machine which was CICS/IMS.  I don’t understand the compulsion to make outlandish claims.

Bad News

Yes, more bad news. Oracle has never published an Exadata benchmark result even with their own benchmarks. That’s right. Oracle has a long history of publishing Oracle Application Standard benchmarks–but no Exadata results.

I’ve gone on the record as siding with Oracle for not publishing TPC benchmarks with Exadata for many reasons.  However, I cannot think of any acceptable excuse for why Oracle would pitch Exadata to you as best for Oracle Applications when a) there are no OLTP features in Exadata*, b) Oracle Corporation does not use Exadata for their ERP and c) there is no benchmark proof for Exadata OLTP/ERP capabilities.

Closing Thoughts

Given all I’ve just said, why is it that (common knowledge) the majority of Exadata units shipping to customers are quarter-rack for non-DW/BI use cases? Has Exadata just become the modern replacement for “[…] nobody ever got fired for buying […]?” Is that how platforms are chosen these days? How did we get to that point of lowered-expectations?

Enjoy the screen shot of memory lane, wrong photo, bad, good and all:

* I am aware of the Exadata Smart Flash Log feature.

Modern Servers Are Better Than You Think For Oracle Database – Part I. What Problems Actually Need To Be Fixed?

Blog update 2012.02.28: I’ve received countless inquiries about the storage used in the proof points I’m making in this post. I’d like to state clearly that the storage is  not a production product, not a glimpse of something that may eventually become product or any such thing. This is a post about CPU, not about storage. That point will be clear as you read the words in the post.

In my recent article entitled How Many Non-Exadata RAC Licenses Do You Need to Match Exadata Performance I brought up the topic of processor requirements for Oracle with and without Exadata. I find the topic intriguing. It is my opinion that anyone influencing how their company’s Oracle-related IT budget is used needs to find this topic intriguing.

Before I can address the poll in the above-mentioned post I have to lay some groundwork. The groundwork I need to lay will come in this and an unknown number of installments in a series.

Exadata for OLTP

There is no value add for Oracle Database on Exadata in the OLTP/ERP use case. Full stop. OLTP/ERP does not offload processing to storage. Your full-rack Exadata configuration has 168 Xeon 5600 cores in the storage grid doing practically nothing in this use case. Or, I should say, the processing that does occur in the Exadata storage cells (in the OLTP/ERP use case) would be better handled in the database host. There simply is no value in introducing off-host I/O handling (and all the associated communication overhead) for random single-block accesses. Additionally, since Exadata cannot scale random writes, it is actually a very weak platform for these use cases. Allow me to explain.

Exadata Random Write I/O
While it is true Exadata offers the bandwidth for upwards of 1.5 million read IOPS (with low latency) in a full rack X2 configuration, the data sheet specification for random writes is a paltry 50,000 gross IOPS—or 25,000 with Automatic Storage Management normal redundancy. Applications do not exhibit 60:1 read to write ratios. Exadata bottlenecks on random writes long before an application can realize the Exadata Smart Flash Cache datasheet random read rates.

Exadata for DW/BI/Analytics

Oracle positions Exadata against products like EMC Greenplum for DW/BI/Analytics workloads. I fully understand this positioning because DW/BI is the primary use case for Exadata. In its inception Exadata addressed very important problems related to data flow. The situation as it stands today, however, is that Exadata addresses problems that no longer exist. Once again, allow me to explain.

The Scourge Of The Front-Side Bus Is Ancient History. That’s Important!
It was not long ago that provisioning ample bandwidth to Real Application Clusters for high-bandwidth scans was very difficult. I understand that. I also understand that, back in those days, commodity servers suffered from internal bandwidth problems limiting a server’s data-ingest capability from storage (PCI->CPU core). I speak of servers in the pre-Quick Path Interconnect (Nehalem EP) days.  In those days it made little sense to connect more than, say, two active 4GFC fibre channel paths (~800 MB/s) to a server because the data would not flow unimpeded from storage to the processors. The bottleneck was the front-side bus choking off the flow of data from storage to processor cores. This fact essentially forced Oracle’s customers to create larger, more complex clusters for their RAC deployments just to accommodate the needed flow of data (throughput).  That is, while some customers toiled with the most basic problems (e.g., storage connectivity), others solved that problem but still required larger clusters to get more front-side buses involved.

It wasn’t really about the processor cores. It was about the bus. Enter Exadata and storage offload processing.

Because the servers of yesteryear had bottlenecks between the storage adapters and the CPU cores (the front-side bus) it was necessary for Oracle to devise a means for reducing payload between storage and RAC host CPUs. Oracle chose to offload the I/O handling (calls to the Kernel for physical I/O), filtration and column projection to storage. This functionality is known as a Smart Scan. Let’s just forget for a moment that the majority of CPU-intensive processing, in a DW/BI query,  occurs after filtration and projection (e.g., table joins, sort, aggregation, etc). Shame on me, I digress.

All right, so imagine for a moment that modern servers don’t really need the offload-processing “help” offered by Exadata? What if modern servers can actually handle data at extreme rates of throughput from storage, over PCI and into the processor cores without offloading the lower level I/O and filtration? Well, the answer to that comes down to how many processor cores are involved with the functionality that is offloaded to Exadata. That is a sophisticated topic, but I don’t think we are ready to tackle it yet because the majority of datacenter folks I interact with suffer from a bit of EarthStillFlat(tm) syndrome. That is, most folks don’t know their servers. They still think it takes lots and lots of processor cores to handle data flow like it did when processor cores were held hostage by front-side bus bottlenecks. In short, we can’t investigate how necessary offload processing is if we don’t know anything about the servers we intend to benefit with said offload. After all, Oracle database is the same software whether running on a Xeon 5600-based server in an Exadata rack or a Xeon 5600-based server not in an Exadata rack.

Know Your Servers

It is possible to know your servers. You just have to measure.

You might be surprised at how capable they are. Why presume modern servers need the help of offloading I/O (handling) and filtration. You license Oracle by the processor core so it is worthwhile knowing what those cores are capable of. I know my server and what it is capable of. Allow me to share a few things I know about my server’s capabilities.

My server is a very common platform as the following screenshot will show. It is a simple 2s12c24t Xeon 5600 (a.k.a. Westmere EP) server:

My server is attached to very high-performance storage which is presented to an Oracle database via Oracle Managed Files residing in an XFS file system in a md(4) software RAID volume. The following screenshot shows this association/hierarchy as well as the fact that the files are accessed with direct, asynchronous I/O. The screenshot also shows that the database is able to scan a table with 1 billion rows (206 GB) in 45 seconds (4.7 GB/s table scan throughput):

The io.sql script accounts for the volume of data that must be ingested to count the billion rows:

$ cat io.sql
set timing off
col physical_reads_GB format 999,999,999;      
select VALUE /1024 /1024 /1024 physical_reads_GB from v$sysstat where STATISTIC# =
(select statistic# from v$statname where name like '%physical read bytes%');
set timing on

So this simple test shows that a 2s12c24t server is able to process 392 MB/s per processor core. When Exadata was introduced most data centers used 4GFC fibre channel for storage connectivity. The servers of the day were bandwidth limited. If only I could teleport my 2-socket Xeon 5600 server back in time and put it next to an Exadata V1 box. Once there, I’d be able to demonstrate a 2-socket server capable of handling the flow of data from 12 active 4GFC FC HBA ports! I’d be the talk of the town because similar servers of that era could neither connect as many active FC HBAs nor ingest the data flowing over the wires—the front-side bus was the bottleneck. But, the earth does not remain flat.

The following screenshot shows the results of five SQL statements explained as:

  1. This SQL scans all 206 GB, locates the 4 char columns (projection) in each row and nibbles the first char of each. The rate of throughput is 2,812 MB/s. There is no filtration
  2. This SQL ingests all the date columns from all rows and maintains 2,481 MB/s. There is no filtration.
  3. This SQL combines the efforts of the previous two queries which brings the throughput down to 1,278 MB/s. There is no filtration.
  4. This SQL processes the entire data mass of all columns in each row and maintains 1,528 MB/s. There is no filtration.
  5. The last SQL statement introduces filtration. Here we see that the platform is able to scan and selectively discard all rows (based on a date predicate) at the rate of 4,882 MB/s. This would be akin to a fully offloaded scan in Exadata that returns no rows.


This blog series aims to embark on finding good answers to the question I raised in my recent article entitled How Many Non-Exadata RAC Licenses Do You Need to Match Exadata Performance. I’ve explained that offload to Exadata storage consists of payload reduction. I also offered a technical, historical perspective as why that was so important. I’ve also showed that a small, modern QPI-based server can flow data through processor cores at rates ranging from 407 MBPS/core down to 107 MBPS/core depending on what the SQL is doing (SQL with no predicates mind you).

Since payload reduction is the primary value add of Exadata I finished this installment in the series with an example of a simple 2s12c24t Xeon 5600 server filtering out all rows at a rate of 4,882 MB/s—essentially the same throughput as a simple count(*) of all rows as I showed earlier in this post. That is to say that, thus far, I’ve shown that my little lab system can sustain nearly 5GB/s disk throughput whether performing a simple count of rows or filtering out all rows (based on a date predicate). What’s missing here is the processor cost associated with the filtration and I’ll get to that soon enough.

We can’t accurately estimate the benefit of offload until we can accurately associate CPU cost to filtration.  I’ll take this blog series to that point over the next few installments—so long as this topic isn’t too boring for my blog readers.

This is part I in the series. At this point I hope you are beginning to realize that modern servers are better than you probably thought. Moreover, I hope my words about the history of front-side bus impact on sizing systems for Real Application Clusters is starting to make sense. If not, by all means please comment.

As this blog series progresses I aim to help folks better appreciate the costs of performing certain aspects of Oracle query processing on modern hardware. The more we know about modern servers the closer we can get to answer the poll more accurately. You license Oracle by the processor core so it behooves you to know such things…doesn’t it?

By the way, modern storage networking has advanced far beyond 4GFC (400 MB/s).

Finally, as you can tell by my glee in scanning Oracle data from an XFS file system at nearly 5GB/s (direct I/O), I’m quite pleased at the demise of the front-side bus! Unless I’m mistaken, a cluster of such servers, with really fast storage, would be quite a configuration.

Oracle Exadata Storage Server Version 1. A “FAQ” is Born. Part I.

BLOG UPDATE (22-MAR-10): Readers, please be aware that this blog entry is about the HP Oracle Database Machine (V1).

BLOG UPDATE (01-JUN-09). According to my blog statistics, a good number of new readers find my blog by being referred to this page by google. I’d like to draw new readers’ attention to the sidebar at the right where there are pages dedicated to indexing my Exadata-related posts.  The original blog post follows:

I expected Oracle Exadata Storage Server to make an instant splash, but the blogosphere has really taken off like a rocket with the topic. Unfortunately there is already quite a bit of misinformation out there. I’d like to approach this with routine quasi-frequently asked question posts. When I find misinformation, I’ll make a blog update. So consider this installment number one.

Q. What does the word programmable mean in the product name Exadata Programmable Storage Server?

A. I don’t know, but it certainly has nothing to do with Oracle Exadata Storage Server. I have seen this moniker misapplied to the product. An Exadata Storage Server “Cell”-as we call them-is no more programmable than a Fibre Channel SAN or NAS Filer. Well, it is of course to the Exadata product development organization, but there is nothing programmable for the field. I think, perhaps, someone may have thought that Exadata is a field programmable gate array (FPGA) approach to solving the problem of offloading query intelligence to storage. Exadata is not field-“programmable” and it doesn’t use or need FPGA technology.

Q. How can Exadata be so powerful if there is only a single 1gb path from the storage cells to the switch?

A. I saw this on a blog post today and it is an incorrect assertion. In fact, I saw a blogger state, “1gb/s???? that’s not that good.” I couldn’t agree more. This is just a common notation blunder. There is, in fact, 20 Gb bandwidth between each Cell and each host in the database grid, which is close to 2 gigabytes of bandwidth (maximum theoretical 1850MB/s due to the IB cards though). I should point out that none of the physical plumbing is “secret-sauce.” Exadata leverages commodity components and open standards (e.g., OFED ).

Q. How does Exadata change the SGA caching dynamic?

A. It doesn’t. Everything that is cached today in the SGA will still be cached. Most Exadata reads are buffered in the PGA since the plan is generally a full scan. That is not to say that there is no Exadata value for indexes, because there can be. Exadata scans indexes and tables with the same I/O dynamic.

Q. This Exadata stuff must be based on NAND FLASH Solid State Disk

A. No, it isn’t and I won’t talk about futures. Exadata doesn’t really need Solid State Disk. Let’s think this one through. Large sequential read and write  speed is about the same on FLASH SSD as rotating media, but random I/O is very fast. 12 Hard Disk Drives can saturate the I/O controller so plugging SSD in where the 3.5″ HDDs are would be a waste.

Q. Why mention sequential disk I/O performance since sequential accesses will only occur in rare circumstances (e.g., non-concurrent scans).

A. Yes, and the question is what? No, honestly. I’ll touch on this. Of course concurrent queries attacking the same physical disks will introduce seek times and rotational delays. And the “competition” can somehow magically scan different table extents on the same disks without causing the same drive dynamic? Of course not. If Exadata is servicing concurrent queries that attack different regions of the same drives then, yes, by all means there will be seeks. Those seek, by the way, are followed by 4 sequential 1MB I/O operations so the seek time is essentailly amortized out.

Q. Is Exadata I/O really sequential, ever?

A. I get this one a lot and it generally comes from folks that know Automatic Storage Management (ASM). Exadata leverages ASM normal redundancy mirroring which mirrors and stripes the data. Oh my, doesn’t that entail textbook random I/O? No, not really. ASM will “fill” a disk from the “outside-in. ” This does not create a totally random I/O pattern since this placement doesn’t randomize from the outer edge of the platters to the spindle and back. In general, the “next” read on any given disk involved in a scan will be at a greater offset in the physical device and not that “far” from the previous sectors read. This does not create the pathological seek times that would be associated with a true random I/O profile.

When Exadata is scanning a disk that is part of an ASM normal redundancy disk group and needs to “advance forward” to get the next portion of the table, Exadata directs the drive mechanics to position at the specific offset where it will read an ASM allocation unit of data, and on and on it goes. Head movements of this variety are considered “short seeks.” I know what the competition says about this topic in their positioning papers. Misinformation will be propagated.

Let me see if I can handle this topic in a different manner. If HP Oracle Exadata Storage Server was a totally random I/O train wreck then it wouldn’t likely be able to drive all the disks in the system at ~85MB/s. In the end, I personally think the demonstrated throughput is more interesting than an academic argument one might stumble upon in an anti-Exadata positioning paper.

Well, I think I’ll wrap this up as installment one of an on-going thread of Q&A on HP Oracle Exadata Storage Server and the HP Oracle Database Machine.

Don’t forget to read Ron Weiss’ Oracle Exadata Storage Server Technical Product Whitepaper. Ron is a good guy and it is a very informative piece. Consider it required reading-especially if you are trolling my site in the role of competitive technical marketing. <smiley>

Databases are the Contents of Storage. Future Oracle DBAs Can Administer More. Why Would They Want To?

I’ve taken the following quote from this Oracle whitepaper about low cost storage:

A Database Storage Grid does not depend on flawless execution from its component storage arrays. Instead, it is designed to tolerate the failure of individual storage arrays.

In spite of the fact that the Resilient Low-Cost Storage Initiative program was decommissioned along with the Oracle Storage Compatability Program, the concepts discussed in that paper should be treated as a barometer of the future of storage for Oracle databases-with two exceptions: 1) Fibre Channel is not the future and 2) there’s more to “the database” than just the database. What do I mean by point 2? Well, with features like SecureFiles, we aren’t just talking rows and columns any more and I doubt (but I don’t know) that SecureFiles is the end of that trend.

Future Oracle DBAs
Oracle DBAs of the future become even more critical to the enterprise since the current “stove-pipe” style IT organization will invariably change. In today’s IT shop, the application team talks to the DBA team who talks to the Sys Admin team who tlks to the Storage Admin team. All this to get an application to store data on disk through a Oracle database. I think that will be the model that remains for lightly-featured products like MySQL and SQL Server, but Oracle aims for more. Yes, I’m only whetting your appetite but I will flesh out this topic over time. Here’s food for thought: Oracle DBAs should stop thinking their role in the model stops at the contents of the storage.

So while Chen Shapira may be worried that DBAs will get obviated, I’d predict instead that Oracle technology will become more full-featured at the storage level. Unlike the stock market where past performance is no indicator of future performance, Oracle has consistently brought to market features that were once considered too “low-level” to be in the domain of a Database vendor.

The IT industry is going through consolidation. I think we’ll see Enterprise-level IT roles go through some consolidation over time as well. DBAs who can wear more than “one hat” will be more valuable to the enterprise. Instead of thinking about “encroachment” from the low-end database products, think about your increased value proposition with Oracle features that enable this consolidation of IT roles-that is, if I’m reading the tea leaves correctly.

How to Win Friends and Influence People
Believe me, my positions on Fibre Channel have prompted some fairly vile emails in my inbox-especially the posts in my Manly Man SAN series. Folks, I don’t “have it out”, as they say, for the role of Storage Administrators. I just believe that the Oracle DBAs of today are on the cusp of being in control of more of the stack. Like I said, it seems today’s DBA responsibilities stop at the contents of the storage-a role that fits the Fibre Channel paradigm quite well, but a role that makes little sense to me. I think Oracle DBAs are capable of more and will have more success when they have more control. Having said that, I encourage any of you DBAs who would love to be in more control of the storage to look at my my post about the recent SAN-free Oracle Data Warehouse. Read that post and give considerable thought to the model it discusses. And give even more consideration to the cost savings it yields.

The Voices in My Head
Now my alter ego (who is a DBA, whereas I’m not) is asking, “Why would I want more control at the storage level?” I’ll try to answer him in blog posts, but perhaps some of you DBAs can share experiences where performance or availability problems were further exacerbated by finger pointing between you and the Storage Administration group.

Note to Storage Administrators
Please, please, do not fill my email box with vitriolic messages about the harmony today’s typical stove-pipe IT organization creates. I’m not here to start battles.

Let me share a thought that might help this whole thread make more sense. Let’s recall the days when an Oracle DBA and a System Administrator together (yet alone) were able to provide Oracle Database connectivity and processing for thousands of users without ever talking to a “Storage Group.” Do you folks remember when that was? I do. It was the days of Direct Attach Storage (DAS). The problem with that model was that it only took until about the late 1990s to run out of connectivity-enter the Fibre Channel SAN. And since SANs are spokes attached to hubs of storage systems (SAN arrays), we wound up with a level of indirection between the Oracle server and its blocks on disk. Perhaps there are still some power DBAs that remember how life was with large numbers of DAS drives (hundreds). Perhaps they’ll recall the level of control they had back then. On the other hand, perhaps I’m going insane, but riddle me this (and feel free to quote me elsewhere):

Why is it that the industry needed SANs to get more than a few hundred disks attached to a high-end Oracle system in the late 1990s and yet today’s Oracle databases often reside on LUNs comprised of a handful of drives in a SAN?

The very thought of that twist of fate makes me feel like a fish flopping around on a hot sidewalk. Do you remember my post about capacity versus spindles? Oh, right, SAN cache makes that all better. Uh huh.

Am I saying the future is DAS? No. Can I tell you now exactly what model I’m alluding to? Not yet, but I enjoy putting out a little food for thought.

What Oracle Topic Is On Your Coffee Table?

I’ve received a couple of emails over the last few weeks in which readers asked me what my favorite blog posts are on my own blog. I’d have to sift through old posts to answer that definitively, but one recent post comes to mind. I think my recent post about why Oracle shops generally aren’t realizing the full potential of their hard drives is one of my favorites.

Over-Configuring DBWR Processes – Part III

Sometimes the comment thread on a blog entry are important enough to warrant yet another blog entry. This is one of those times. In my blog entry called Over-Configuring DBWR Processes Part II I offered some test results showing that it really is best to keep a single writer until such time as a single writer becomes CPU-bound. But first a disclaimer.

DISCLAIMER: There are ports of Oracle where specialized multiple DBWR code is implemented for specific hardware architecture (e.g., NUMA). I ought to know, I worked on one of the very first (if not the first) to do so with the ports to Sequent DYNIX/ptx.

There, disclaimer aside, I am talking about the Linux ports of 10g and 11g which have no such hardware specialization.

The first question in that comment thread was:

From what’s written here, it looks like you only tested 1 and 8. Is it possible that the global maximum is actually somewhere between? Maybe 2 processes is slightly faster than 1, for example?

Great question with a simple answer: use 1 writer as long as that single writer has ample bandwidth. That is a truth, but I figured I’d do some runs to milk out a fact or two. I configured 2 DBWR processes and re-ran the workload I describe in Part II of this series. I followed that up with 4 DBWR processes. Hey, it’s all auto-pilot stuff with my setup so doing this was no big deal. Let’s compare the results. Just to summarize, the previous configurations performed as follows:


OLTP Throughput (TPM)

Aggregate DBWR CPU Usage







And now with the 2 and 4 DBWR cases:


OLTP Throughput (TPM)

Aggregate DBWR CPU Usage













The way this goes is that more DBWR processes impact throughput. On the other hand, more throughput will require more DBWR work so the optimal case of 1 DBWR process will take more CPU, but remember that it is also getting more throughput. I only provide these CPU numbers to show what percentage of a single CPU any given configuration utilizes. I don’t think it is necessary to run the 6 DBWR case. But what about a wildcat configuration such as multiple DBWR processes that have hard processor affinity?

Processor Affinity and Oracle Background Processes
Well, I’d never bind an Oracle background process to a singe processor (core) for obvious reasons, but with the Linux taskset(1) command it is possible to bind a running process to a socket. For example, the following command binds pid 13276 to the seconds socket of a mulit-socket system:

# taskset –pc 2-3 13276

Building on that idea, you can grab DBWR processes (or any process for that matter) once they are running and pin them to a CPU core or a set of CPU cores. The following stupid bash shell function can be used to do this sort of thing. Note, since there are tens of thousands of skilled computer programmers out there these days, all shell scripts are considered stupid of course. The function takes an argument that it then uses in its grep regular expression for globbing ps(1) output. An example would be:

aff ora_dbw
local GLOB="$1"
local CORE=0
local TWIN=0

ps -ef | grep $GLOB | grep -v grep | awk '{ print $2 }'  | while read PID
        (( TWIN = $CORE + 1 ))

        set -x
        sudo taskset -pc ${CORE}-${TWIN} $PID
        set -    

        (( CORE = CORE + 2 ))
        [[ $CORE = 8 ]] && CORE=0 && TWIN=0

I used this script to affinity 2 DBWR processes to a single socket and the numbers didn’t change. Neither did throwing 2 DBWR processes one each to two different sockets. The fact simply remains that DBWR doesn’t benefit from having peers unless it is CPU-bottlenecked.

Manly Men Only Deploy Oracle with Fibre Channel – Part VIII. After All, Oracle Doesn’t Support Async I/O on NFS

In the comment section of my recent post about Tim Hall’s excellent NFS step-by-step Linux RAC install Guide, Tim came full circle to ask a question about asynchronous I/O on NFS. He wrote:

What do you set your filesystemio_options init.ora parameter to when using Oracle over NFS?

Based on what you’ve written before I know NFS supports direct I/O, but I’m struggling to find a specific statement about NFS and asynchronous I/O. So should I use:




My reply to that was going to remind you folks about my recent rant about old Linux distributions combined with Oracle over NFS.  That is, the answer is, “it depends.” It depends on whether you are running a reasonable Linux distribution. But, Tim quickly followed up his query with:

I found my answer. Asynchronous I/O is not supported on NFS:

Bummer, I didn’t get to answer it.

Word To The Wise
Don’t use old Linux stuff with NAS if you want to do Oracle over NFS. Metalink 279069.1 provides a clear picture as to why I say that. It points out a couple of important things:

1. RHEL 4 U4 and EL4 both support asynchronous I/O on NFS mounts. That makes me so happy because I’ve been doing asynchronous I/O on NFS mounts with Oracle10gR2 for about 16 months. Unfortunately, ML 279069.1 incorrectly states that the critical fix for Oracle async I/O on NFS is U4, when in fact the specific bug (Bugzilla 161362 ) was fixed in RHEL4 U3 as seen in this Red Hat Advisory from March 2006.

2. Asynchronous I/O on NFS was not supported on any release prior to RHEL4. That’s fine with me because I wouldn’t use any Linux release prior to the 2.6 kernels to support Oracle over NFS!

The Oracle documentation on the matter was correct since it was produced long before there was OS support for asynchronous I/O on Linux for Oracle over NFS. Metalink 279069.1 is partly correct in that it states support for asynchronous I/O on systems that have the fix for Bugzilla 161363 but it incorrectly suggests that U4 is the requisite release for that fix, but it isn’t—the bug was fixed in U3. And yes, I get really good performance with the following initialization parameter set and have for about 16 months:

filesystemio_options = setall

Manly Man Post Script
Always remember, the Manly Man series is tongue-in-cheek.  Oracle over NFS with Async I/O on the other hand isn’t.

SAN Admins: Please Give Me As Much Capacity From As Few Spindles As Possible!

I was catching up on my mojo reading when I caught a little snippet I’d like to blog about. Oh, by the way, have I mentioned recently that StorageMojo is one of my favorite blogs?

In Robin Harris’ latest installment about ZFS action at Apple, he let out a glimpse of one of his other apparent morbid curiosities—flash. Just joking, I don’t think ZFS on Mac or flash technology are morbid, it just sounded catchy. Anyway, he says:

I’ve been delving deep into flash disks. Can you say “weird”? My take now is that flash drives are to disk drives what quantum mechanics is to Newtonian physics. I’m planning to have something out next week.

I look forward to what he has to say. I too have a great interest in flash.

Now, folks, just because we are Oracle-types and Jim Grey was/is a Microsoft researcher, you cannot overlook what sorts of things Jim was/is interesting in. Jim’s work has had a huge impact on technology over the years and it turns out that Jim took/takes an interest in flash technology with servers in mind. Just the abstract of that paper makes it a natural must-read for Oracle performance minded individuals. Why? Because it states (with emphasis added by me):

Executive summary: Future flash-based disks could provide breakthroughs in IOps, power, reliability, and volumetric capacity when compared to conventional disks.


Yes, IOps! Nothing else really matter where Oracle database is concerned. How can I say that? Folks, round-brown spinning things do sequential I/O just fine—naturally. What they don’t do is random I/O. To make it worse, most SAN array controllers (you know, that late 1990’s technology) pile on overhead that further choke off random I/O performance. Combine all that with the standard IT blunder of allocating space for Oracle on a pure capacity basis and you get the classic OakTable Network response:

Attention DBAs, it’s time for some déjà vu. I’ll state with belligerent repetition, redundantly, over and over, monotonously reiterating this one very important recurrent bit of advice: Do everything you can to get spindles from your storage group—not just capacity.


Yes that’s right, it wont be long (in relative terms) until you see flash memory storage fit for Oracle databases. The aspect of this likely future trend that I can’t predict, however, is what impact such technology would have on the entrenched SAN array providers. Will it make it more difficult to keep the margins at the levels they demand, or will flash be the final straw that commoditizes enterprise storage? Then again, and Jim Grey points out in that paper, flash density isn’t even being driven by the PC—and most certainly not enterprise storage—ecosystem. The density is being driven by consumer and mobile applications. Hey, I want my MTV. Um, like all of it, crammed into my credit-card sized mpeg player too.

When it gets cheaper and higher capacity of course. Well, its not exactly that simple. I went spelunking for that Samsung 1.8” 32GB SSD and found two providers with street price of roughly USD $700.00 for 32GB here and here. In fact, upon further investigation, Ritek may soon offer a 32GB device at some $8 per GB. But let’s stick with current product for the moment. At $22 per GB, we’re not exactly talking SATA which runs more on the order of $.35 per GB. But then we are talking enterprise applications here, so a better comparison would be to Fibre drives which go for about $3-$4 per GB.

Now that is interesting since Jim Grey pointed out that in-spite of some industry predictions setting the stage for NAND to double every year, NAND had in fact gained 16 fold in 4 years–off by year. If that pace continues, could we really expect 512GB 1.8″ SSD devices in the next 4 years? And would the price stay relatively constant yielding a cost of something like $1.35 per GB? Remember, even the current state of the art (e.g., the Samsung 1.8″ 32GB SSD) delivers on the order of 130,000 random single-sector IOps–that’s approximately 7usec latency for a random I/O. At least that is what Samsung’s literature claims. Jim’s paper, on the other hand reports grim current art performance when measured with DskSpd.exe:

The story for random IOs is more complex – and disappointing. For the typical 4-deep 8KB random request, read performance is a spectacular 2,800 requests/second but write performance is a disappointing 27 requests/second.

The technology is young and technically superior, but there is work to do in getting the most out of NSSD as the paper reports. Jim suspects that short term quick fixes could be made to bring the random I/O performance for 8KB transfers on today’s NSSD technology up to about 1,700 IOps split evenly between read and write. Consider, however, that real world applications seldom exhibit a read:write ratio of 50:50. Jim generalized on the TPC-C workload as a case in point. It seems with “some re-engineering” (Jim’s words) even today’s SSD would be a great replacement for hard drives for typical Oracle OLTP workloads since you’ll see more 70:30 read:write ratios in the real world. And what about sequential writes? Well, there again, even today’s technology can handle some 35MB/s of sequential writes so direct path writes (e.g., sort spills) and redo log writes would be well taken care of. But alas, the $$/GB is still off. Time will fix that problem and when it does, NSSD will be a great fit for databases.

Don’t think for a moment Oracle Corporation is going to pass up on enabling customers to exploit that sort of performance–with or without the major SAN vendors.

But flash burns out, right? Well, yes and no. The thing that matters is how long the device lasts-the sum of its parts. MTBF numbers are crude, but Samsung sticks a 1,000,000hr MTBF on this little jewel-how cool.
Well, I’ve got the cart well ahead of the horse here for sure because it is still too expensive, but put it on the back burner, because we aren’t using Betamax now and I expect we’ll be using fewer round-brown spinning things in the span of our careers.

Oracle Over NFS. I Need More Monitoring Tools? A Bonded NIC Roller Coaster.

As you can tell by my NFS-related blog entries, I am an advocate of Oracle over NFS. Forget those expensive FC switches and HBAs in every Oracle Database server. That is just a waste. Oracle11g will make that point even more clearly soon enough. I’ll start sharing how and why as soon as I am permitted. In the meantime…

Oracle over NFS requires bonded NICs for redundant data paths and performance. That is an unfortunate requirement that Oracle10g is saddled with. And, no, I’m not going to blog about such terms as IEEE 802.3ad, PAgP, LACP, balance-tlb, balance-alb or even balance-rr. The days are numbered for those terms-at least in the Oracle database world. I’m not going to hint any futher about that though.

Monitoring Oracle over NFS
If you are using Oracle over NFS, there are a few network monitoring tools out there. I don’t like any of them. Let’s see, there’s akk@da and Andrisoft WanGuard. But don’t forget Anue and Aurora, Aware, BasicState, CommandCenter NOC, David, Dummmynet, GFI LANguard, Gomez, GroundWork, Hyperic HQ, IMMonitor, Jiploo, Monolith, moods, Network Weathermap, OidView, Pandetix, Pingdom, Pingwy, skipole-monitor, SMARTHawk, Smarts, WAPT, WFilter, XRate1, arping, Axence NetVision, BBMonitor, Cacti, CSchmidt collection, Cymphonix Network Composer, Darkstat, Etherape, EZ-NOC, Eye-on Bandwidth, Gigamon University, IPTraf, Jnettop, LITHIUM, mrtg-ping-probe, NetMRG, NetworkActiv Scanner, NimTech, NPAD, Nsauditor, Nuttcp, OpenSMART, Pandora FMS, PIAFCTM, Plab, PolyMon, PSentry, Rider, RSP, Pktstat, SecureMyCompany, SftpDrive, SNM, SpeedTest, SpiceWorks, Sysmon, TruePath, Unbrowse, Unsniff, WatchMouse, Webalizer, Web Server Stress Tool, Zenoss, Advanced HostMonitor, Alvias, Airwave, AppMonitor, BitTorrent, bulk, BWCTL, Caligare Flow Inspector, Cittio, ClearSight, Distinct Network Monitor, EM7, EZMgt, GigaMon, Host Grapher II, HPN-SSH, Javvin Packet Analyzer, Just-ping, LinkRank, MoSSHe, mturoute, N-able OnDemand, Netcool, netdisco, Netflow Monitor, NetQoS, Pathneck, OWAMP, PingER, RANCID, Scamper, SCAMPI, Simple Infrastructure Capacity Monitor, Spirent, SiteMonitor, STC, SwitchMonitor, SysUpTime, TansuTCP, thrulay, Torrus, Tstat, VSS Monitoring, WebWatchBot, WildPackets, WWW Resources for Communications & Networking Technologies, ZoneRanger, ABwE, ActivXpets, AdventNet Web NMS, Analyse It, Argus, Big Sister, CyberGauge, eGInnovations, Internet Detective, Intellipool Network Monitor, JFF Network Management System, LANsurveyor, LANWatch, LoriotPro, MonitorIT, Nagios, NetIntercept, NetMon, NetStatus, Network Diagnostic Tool, Network Performance Advisor, NimBUS, NPS, Network Probe, NetworksA-OK, NetStat Live, Open NerveCenter, OPENXTRA, Packeteer, PacketStorm, Packetyzer, PathChirp, Integrien, Sniff’em, Spong, StableNet PME, TBIT, Tcptraceroute, Tping, Trafd, Trafshow, TrapBlaster, Traceroute-nanog, Ultra Network Sniffer, Vivere Networks, ANL Web100 Network Configuration Tester, Anritsu, aslookup, AlertCenter, Alertra, AlertSite, Analyse-it, bbcp, BestFit, Bro, Chariot, CommView, Crypto-PAn, elkMonitor, DotCom, Easy Service Monitor, Etherpeek, Fidelia, Finisar, Fpinger, GDChart, HipLinkXS, ipMonitor, LANExplorer, LinkFerret, LogisoftAR, MGEN, Netarx, NetCrunch, NetDetector, NetGeo, NEPM, NetReality, NIST Net, NLANR AAD, NMIS, OpenNMS PageREnterprise, PastMon, Pathprobe, remstats, RIPmon, RFT, ROMmon, RUDE, Silverback, SmokePing, Snuffle, SysOrb, Telchemy, TCPTune, TCPurify, UDPmon, WebAttack, Zabbix, AdventNet SNMP API, Alchemy Network Monitor, Anasil analyzer, Argent, Autobuf, Bing, Clink, DSLReports, Firehose, GeoBoy, PacketBoy, Internet Control Portal, Internet Periscope, ISDNwatch, Metrica/NPR, Mon, NetPredict, NetTest, Nettimer, Net-One-1, Pathrate, RouteView, sFlow, Shunra, Third Watch, Traceping, Trellian, HighTower, WCAT, What’s Up Gold, WS_FTP, Zinger, Analyzer, bbftp, Big Brother, Bronc, Cricket, EdgeScape, Ethereal (now renamed Wireshark), gen_send/gen_recv, GSIFTP, Gtrace, Holistix, InMon, NcFTP, Natas, NetAlly, NetScout, Network Simulator, Ntop, PingGraph, PingPlotter, Pipechar, RRD, Sniffer, Snoop, StatScope, Synack, View2000, VisualPulse, WinPcap, WU-FTPD, WWW performance monitoring, Xplot, Cheops, Ganymede, hping2, Iperf, JetMon, MeasureNet, MatLab, MTR, NeoTrace, Netflow, NetLogger, Network health, NextPoint, Nmap, Pchar, Qcheck, SAA, SafeTP, Sniffit, SNMP from UCSD, Sting, ResponseNetworks, Tcpshow, Tcptrace WinTDS, INS Net Perf Mgmt survey, tcpspray, Mapnet, Keynote, prtraceroute clflowd flstats, fping, tcpdpriv, NetMedic Pathchar, CAIDA Measurement Tool Taxonomy, bprobe & cprobe, mrtg, NetNow, NetraMet, Network Probe Daemon, InterMapper, Lachesis, Optimal Networks and last but not least, Digex.

Simplicity Please
The networking aspect of Oracle over NFS is the simplest type of networking imaginable. The database server issues I/O to NFS filesystems being exported over simple, age old Class C private networks (192.168.N.N). We have Oracle statspack to monitor what Oracle is asking of the filesystem. However, if the NFS traffic is being sent over a bonded NIC, monitoring the flow of data is important as well. That is also a simple feat on Linux since /proc tracks all that on a per-NIC basis.

I hacked out a very simple little script to monitor eth2 and eth3 on my system. It isn’t anything special, but it shows some interesting behavior with bonded NICS. The following screen shot shows the simple script executing. A few seconds after starting the script, I executed an Oracle full table scan with Parallel Query in another window. Notice how /proc data shows that the throughput has peaks and valleys on a per-second basis. The values being reported are Megabytes so it is apparent that the bursts of I/O are achieving full bandwidth of the GbE network storage paths, but what’s up with the pulsating action? Is that Oracle, or the network? I can’t tell you just yet. Here is the screen shot nonetheless:


For what it is worth, here is a listing of that silly little script. Its accuracy compensates for its lack of elegance. Cut and paste this on a Linux server and tailor eth[23] to whatever you happen to have.

$ cat
function get_data() {
cat /proc/net/dev | egrep “${token1}|${token2}” \
| sed ‘s/^.*://g’ | awk ‘{ print $1 }’ | xargs echo


while true

set – `get_data`

sleep $INTVL

set – `get_data`

echo $a_if1 $b_if1 | awk -v intvl=$INTVL ‘{
printf(“%7.3f\t”, (($1 – $2) / 1048576) / intvl) }’
echo $a_if2 $b_if2 | awk -v intvl=$INTVL ‘{
printf(“%7.3f\n”, (($1 – $2) / 1048576) / intvl) }’


Oracle RDBMS Technology is Old. It Must Be Obsolete! Disk Drives Forever!

I ran across an interesting quote on that challenged my appreciation of relational database technology—and most particularly Oracle. The article states:

Relational databases started to get to be a big deal in the 1970’s, and they’re still a big deal today, which is a little peculiar, because they’re a 1960’s technology.

Forget the “started to get to be” bit for a moment. Uh, yes, RDBMS technology became important in the 70s (ISI, PRTV and Honeywell MRDS). However, since E.F. Codd didn’t write the defining paper until 1970 it is a stretch to call it “1960’s technology.” Oh well, that article was written in 1999 after all.

What a Great Idea: I Want to Force Some New, “Cool” Technology into My IT Shop
The bit in the quote that got me thinking was how astonished the author was that 1960’s—well, 1970’s actually—technology was “still a big deal” way back in 1999. You see, I think one thing that actually hurts IT shops is the nearly absurd rate of new technology injection. It seems to me that the datacenters with the highest level of success are those that take new technology as slowly as possible. Am I talking about newer versions of, say, the Oracle database server? No, absolutely not. Adopting a newer revision of Oracle is not radical. Those of us who revere the code rest soundly at night knowing that deep down in kernel are bits and pieces that haven’t really changed much in the last 20+ years—perhaps even 30 years given the KCB role in multi-block read consistency (if in fact MBRC was there in the first version of Oracle).

Why is there a mindset in IT that all old technology must go? Folks, we still use little round, brown spinning things (hard drives). Now there’s a bit of information technology that has been around longer than relational databases and ripe for a fond farewell. DBAs are asking for “disk space” from their storage administrators and that is exactly what they are getting. Forget for a moment that roughly 30 drives worth of large, sequential read throughput can saturate the bandwidth of most high-end FC SAN array controllers. Yes, there are exceptions, but I’m trying to make a point. The point is, here we are 27 years after the introduction of the ST506 5.25” and we are not getting full utilization of our drives—at least not when they are provisioned the way most space is provisioned these days. That is, you ask for 2TB of “space” for your Oracle database and you get it—allocated from something like 6 disks in the same enclosure as dozens of other disks. You are getting space, not bandwidth.


What’s This Have to Do with Oracle?
Someone should ask the author of that article if old technology deserves the rubbish heap simply because it is old. Ask him if he has tires on his car (powered with an internal-combustion engine no less). Yep, pneumatic tires for cars date back to P. Strauss circa 1911. No, really, what does this have to do with Oracle? You see, it is software from companies like Oracle—with their “old” RDBMS front and center—that will first help us tackle this problem we have with untapped hard-drive bandwidth and eventually move us along to whatever replaces those little, round brown spinning things in the future. Yes I am hinting, but I’m not saying anything more.

That’s right, that old crusty Oracle RDBMS technology—created to efficiently manage data stored on hard drives—will outlive hard drives and, quite likely, whatever replaces hard drives. That isn’t so hard to accept. After all, P. Strauss’ pneumatic tire will certainly be getting us to and fro long after we move beyond the internal combustion engine.

Dry, Techno-Geek Humor
The article also contained some humor. The following definition was given for what an RDBMS is:

A relational database is a bunch of rectangular tables. Each row of a table is a record about one person or thing; the record contains several pieces of information called fields.

Couldn’t a table be square? Somebody please tell that guy that tables have columns (attributes). Files, being hierarchical, have fields.

SAN Array Cache and Filers Hate Sequential Writes

Or at least they should.

Jonathan Lewis has taken on a recent Oracle-l thread about Thinking Big. It’s a good blog entry and I recommend giving it a read. The original Oracle-l post read like this:

We need to store apporx 300 GB of data a month. It will be OLTP system. We want to use 
commodity hardware and open source database. we are willing to sacrifice performance 
for cost. E.g. a single row search from 2 billion rows table should be returned in 2 sec.

I replied to that Oracle-l post with:

Try loading a free Linux distro and typing :
man dbopen
man hash
man btree
man mppo
man recno

Yes, I was being sarcastic, but on the other hand I have been involved with application projects where we actually used these time-tested “database” primitives…and primitive they are! Anyway, Jonathan’s blog entry actually took on the topic and covers some interesting aspects. He ends with some of the physical storage concepts that would likely be involved. He writes:


SANs can move large amounts of data around very quickly – but don’t work well with very large numbers of extremely random I/Os. Any cache benefit you might have got from the SAN has already been used by Oracle in caching the branch blocks of the indexes. What the SAN can give you is a write-cache benefit for the log writer, database writer, and any direct path writes and re-reads from sorting and hashing.

Love Your Cache, Hate Large Sequential Writes
OK, this is the part about which I’d like to make a short comment—specifically about Log Writer. It turns out that most SAN arrays don’t handle sequential writes well either. All told, arrays shouldn’t be in the business of caching sequential writes (yes, there needs to be a cut-off there somewhere). I’ve had experiences with some that don’t cache sequential writes and that is generally good. I’ve had experiences with a lot that do and when you have a workload that generates a lot of redo, LGWR I/O can literally swamp an array cache. Sure, the blocks should be cached long enough for the write back to disk, but allowing those blocks to push into the array cache any further than the least of the LRU end makes little sense. Marketing words for arrays that handle these subtleties usually sound like, “Adaptive Array Cache”, or words to that effect.

One trick that can be used to see such potential damage is to run your test workload with concurrent sequential write “noise.” If you create a couple of files the same size as your redo logs and loop a couple of dd(1) processes performing 128K writes—without truncating the files on open—you can drive up this sort of I/O to see what it does to the array performance. If the array handles the caching of sequential writes, without polluting the cache, you shouldn’t get very much damage. An example of such a dd(1) command is:

$ dd if=/dev/zero of=pseudo_redo_file_db1 bs=128k count=8192 conv=notrunc &

$ dd if=/dev/zero of=pseudo_redo_file_db2 bs=128k count=8192 conv=notrunc &

$ wait

Looping this sort of “noise workload” will simulate a lot of LGWR I/O for two databases. Considering the typical revisit rate of the other array cache contents, this sort of dd(1) I/O shouldn’t completely obliterate your cache. If it does, you have an array that is too fond of sequential writes.

What Does This Have To Do With NAS?
This sort of workload can kill a filer. That doesn’t mean I’m any less excited about Oracle over NFS—I just don’t like filers. I recommend my collection of NFS related posts and my Scalable NAS for Oracle paper for background on what sequential writes can do to certain NAS implementations.

I’ll be talking about this topic and more at Utah Oracle User Group on March 21st.


DBWR with CIO on JFS2. Resource Starvation?

In a recent post to the oracle-l email list, a participant posted the following:

We are running Oracle & on Aix 5.2.  I am seeing the following warnings in the DB Writer trace file:

Warning: lio_listio returned EAGAIN
Performance degradation may be seen.

There have been several very interesting threads on the list lately about Oracle on AIX with JFS2 filesystems. I’d like to blog quickly about this thread.

The error message in the DBWR trace file indicates that lio_listio() calls are being starved for resources, but what resources?

A lio_listio() Backgrounder
The error message in the DBWR trace file indicates that lio_listio() calls are being starved for resources. For those who don’t know, lio_listio() is the traditional POSIX gathered write system call. Most Oracle ports use it. It supports DBWR’s need to flush multiple buffers to multiple files in one call, but completion processing is a little rough. The way the call works is your program (e.g., DBWR) either issues a long list of requests and blocks until they are all done (mode=LIO_WAIT), or the mode argument is set to LIO_NWAIT which means your process can continue to do other work. When all the LIO_NWAIT requests are completed, your process gets delivered a signal also specified at call time. So, if for some reason, DBWR is doing a lot of small burst writes, it will likely be handling a lot of signals. No big deal really, but that is how it works. The way I/O completion handling works with POSIX AIO is one of the main reasons Oracle was motivated to define the Oracle Disk Manager library specification(ODM).  The differences between ODM and POSIX AIO is a topic for a different blog entry.

About lio_listio() on Filesystem Files
Details, details. The way AIO is implemented from one filesystem to the next varies greatly—unfortunately. I prefer simplicity and that is why the cluster filesystem coupled with PolyServe Matrix Server supports the same Linux kernel internals for AIO as are exercised when using raw devices. If you mount a PSFS in DBoptimized mode, calls to lio_listio(3) are met with the same kernel handling as if the call was targeting raw partitions. But, this blog entry is not about PolyServe since the oracle-l post was about Oracle9i and Oracle10g on AIX.

So, I sent an email to the poster to make sure that this error was being hit on JFS2 and he confirmed it was. You see, AIO on JFS2 is handled by kernel processes of which there are only 10 per processor by default. The number of system wide AIO in flight is limited by the number of these AIO “handlers” as specified in IBM’s AIO documentation (see maxervers). To get around this lio_listio() error, more AIO kernel processes are needed, but how many? That IBM documenation suggests:

The value should be about the same as the expected number of concurrent AIO requests.

How Many is Enough. How Many is Too Many?
Hmmm, that is going to be a very difficult number to guess for the original poster who had both 9i and 10g running on the same pSeries server. Remember, there is no way to control DBWR. If you have a really dirty SGA, DBWR will do really large write batches. That is a good thing.  

In cases like this I tend to generally work out what safe ceilings are. I know Bret Olszewski and if he gives a reasonable value for a tunable, I tend to take his word. In this paper about database performance on JFS, Bret and his co-authors state:

[…] the number of aioservers in the system limits the number of asynchronous I/O operations that can be in progress simultaneously. The maximum number of aioservers that can be created is controlled by the maxservers attribute, which has a default value of 10 per processor. For our experiments, we used a maxservers value of 400 per processor.

The paper also points out that there are 4 CPUs in the System p 680 test system. So while the default is 10 per CPU, it looks like if the collective DBWR in-flight I/O are bursting at high rates—such as as 1600 concurrent I/Os—setting aioservers to 400 per processor is not unreasonable.

The paper also covers a technique used to determine if you have too many aioservers. Basically, if you have a bunch of these kernel processes that do not get CPU time then the I/O load is sufficiently being handled by the processes that are burning CPU. You can then reduce the number if it seems important to you. Makes sense, but I can’t imagine having sleeping kernel processes around costs much.

A Word About Filesystem Performance
OK, if Bret specifies a tunable, I take his advice. On the other hand, I have to call out a problem in that JFS2 paper. In the analysis section, the conclusion is that Concurrent I/O performs 200% better than plain Direct I/O, and comes within 8% of raw logical volumes.That makes sense since CIO implicitly invokes DIO and DIO should be as fast as raw disk with the exception of write serialization (which CIO addresses). The test was 100% processor bound and I/O intensive so any processor overhead in the I/O code path will account for that 8%–there is, after all, more software in the filesystem case than the raw logical volume case but is it really code path that costs the 8 percent? The paper states:

The Concurrent I/O run has a higher %system and lower %user than the raw LV run because of the additional context switches due to traversing down the file system path for I/O, and due to the use of AIO server threads for servicing AIO requests […]

OK, the bit about “additional context switches due to traversing down the file system path” is really weird. There is no such thing unless you are opening files for every I/O—which, of course, Oracle does not do. Oracle keeps files open and I/O is nothing more than a position and read/write  (e.g., pread(), pwrite()) or an asynchronous write (e.g., lio_listio()).  It is true that converting a pathname to an inode costs (e.g., the namei()/lookup() kernel routines), but that is paid in the open(2) call. Once a file is open there is no cost associated with the pathname because I/O routines do not even use a pathname—they use the file descriptor. In short,  there is no “traversing down the file system path” associated with I/O on an open file. If I had $50 to bet, I would wager the 1600 AIO handlers jumping on and off those 4 CPUs would account for the 8%, but what do I know?



I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 744 other subscribers
Oracle ACE Program Status

Click It

website metrics

Fond Memories


All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.

%d bloggers like this: