Density is Increasing, But Certainly Not That Cheap
Netapp’s SEC 10-Q form for their quarter ending in October 2006 has a very interesting prediction. I was reading this post on StorageMojo about Isilon and saw this quote from the SEC form (emphasis added by me):
According to International Data Corporation’s (IDC’s) Worldwide Disk Storage Systems 2006-2010 Forecast and Analysis, May 2006, IDC predicts that the average dollar per petabyte (PB) will drop from $8.53/PB in 2006 to $1.85/PB in 2010.
Yes, Netapp is telling us that IDC thinks we’ll be getting storage at $8.53 per Petabyte within the next three years. Yippie! Here is the SEC filing if you want to see for yourself.
We Need Disks, Not Capacity
Yes, drive density is on the way up so regardless of how off the mark Netapp’s IDC quote is, we are going to continue to get more capacity from fewer little round brown spinning things. That doesn’t bode well for OLTP performance. I blogged recently on the topic of choosing the correct real estate from disks when laying out your storage for Oracle databases. I’m afraid it won’t be long until IT shops are going to force DBAs to make bricks without straw by assigning, say, 3 disks for a fairly large database. Array cache to the rescue! Or not.
Array Cache and NetApp NVRAM Cache Obliterated With Sequential Writes
The easiest way to completely trash an most array caches is to perform sequential writes. Well, for that matter, sequential writes happen to be the bane of NVRAM cache on Filers too. No, Filers don’t handle sequential writes well. A lot of shops get a Filer and dedicate it to transaction logging. But wait, that is a single point of failure. What to do? Get a cluster of Filers just for logging? What about Solid State Disk?
Solid State Disk (SSD) price/capacity is starting to come down to the point where it is becoming attractive to deploy them for the sole purpose of offloading the sequential write overhead generated from Oracle redo logging (and to a lesser degree TEMP writes too). The problem is they are SAN devices so how do you provision them so that several databases are logging on the SSD? For example, say you have 10 databases that, on average, are each thumping a large, SAN array cache with 4MB/s for a total sequential write load of 40MB/s. Sure, that doesn’t sound like much, but to a 4GB array cache, that means a complete recycle every 100 seconds or so. Also, rememeber that buffers in the array cache are pinned while being flushed to back to disk. That pain is certainly not being helped by the fact that the writes are happening to fewer and fewer drives these days as storage is configured for capacity instead of IOPS. Remember, most logging writes are 128KB or less so a 40MB logging payload is derived from some 320, or more, writes per second. Realistically though, redo flushing on real workloads doesn’t tend to benefit from the maximum theoretical piggy-back commit Oracle supports, so you can probably count on the average redo write being 64KB or less—or a write payload of 640 IOPS. Yes a single modern drive can satisfy well over 200 small sequential writes per second, but remember, LUNS are generally carved up such that there are other I/Os happening to the same spindles. I could go on and on, but I’ll keep it short—redo logging is tough on these big “intelligent” arrays. So offload it. Back to the provisioning aspect.
Carving Luns. Lovely.
So if you decide to offload just the logging aspect of 10 databases to SSD, you have to carve out a minimum of 20 LUNS (2 redo logs per database) zone the Fibre Channel switch so that you have discrete paths from servers to their raw chunks of disk. Then you have to fiddle with raw partitions on 10 different servers. Yuck. There is a better way.
SSD Provisioning Via NFS
Don’t laugh—read on. More and more problems ranging from software provisioning to the widely varying unstructured data requirements today’s applications are dealing with keep pointing to NFS as a solution. Provisioning very fast redo logging—and offloading the array cache while you are at it—can easily be done by fronting the SSD with a really small File Serving Cluster. With this model you can provision those same 10 servers with highly available NFS because if a NAS head in the File Serving Utility crashes, 100% of the NFS context is failed over to a surviving node transparently—and within 20 seconds. That means LGWR file descriptors for redo logs remain completely valid after a failover. It is 100% transparent to Oracle. Moreover, since the File Serving Utility is symmetric clustered storage—unlike clustered Filers like OnTap GX—the entire capacity of the SSD can be provisioned to the NAS cluster as a single, simple LUN. From there, the redo logging space for all those databases are just files in a single NFS exported filesystem—fully symmetric, scalable NFS. The whole thing can be done with one vender too since Texas Memory Systems is a PolyServe reseller. But what about NFS overhead and 1GbE bandwidth?
NFS With Direct I/O (filesystemio_options=directIO|setall)
When the Oracle database—running on Solaris, HP-UX or Linux—opens redo logs on an NFS mount, it does so with Direct I/O. The call overhead is very insignificant for sequential small writes when using Direct I/O on an NFS client. The expected surge in kernel mode cycles due to the NFS overhead really doesn’t happen with simple positioning and read/write calls—especially when the files are open O_DIRECT (or directio(3C) for Solaris). What about latency? That one is easy. LGWR will see 1ms service times 100% of the time, no matter how much load is placed on the down-wind SSD. And bandwidth? Even without bonding, 1GbE is sufficient for logging and these SSDs (I’ve got them in my lab) handle requests in 1ms all the way up to full payload which (depending on model) goes up to 8 X 4Gb FC—outrageous!
Now that is a solution to a problem using real, genuine clustered storage. And, no I don’t think NetApp really believes a Petabyte of disk will be under $9 in the next three years. That must be a typo. I know all about typos as you blog readers can attest.