Or at least they should.
Jonathan Lewis has taken on a recent Oracle-l thread about Thinking Big. It’s a good blog entry and I recommend giving it a read. The original Oracle-l post read like this:
We need to store apporx 300 GB of data a month. It will be OLTP system. We want to use
commodity hardware and open source database. we are willing to sacrifice performance
for cost. E.g. a single row search from 2 billion rows table should be returned in 2 sec.
I replied to that Oracle-l post with:
Try loading a free Linux distro and typing :
man dbopen
man hash
man btree
man mpool
man recno
Yes, I was being sarcastic, but on the other hand I have been involved with application projects where we actually used these time-tested “database” primitives…and primitive they are! Anyway, Jonathan’s blog entry actually takes on the topic and covers some interesting aspects. He ends with some of the physical storage concepts that would likely be involved. He writes:
SANs can move large amounts of data around very quickly – but don’t work well with very large numbers of extremely random I/Os. Any cache benefit you might have got from the SAN has already been used by Oracle in caching the branch blocks of the indexes. What the SAN can give you is a write-cache benefit for the log writer, database writer, and any direct path writes and re-reads from sorting and hashing.
Love Your Cache, Hate Large Sequential Writes
OK, this is the part about which I’d like to make a short comment, specifically about Log Writer. It turns out that most SAN arrays don’t handle sequential writes well either. All told, arrays shouldn’t be in the business of caching large sequential writes (yes, there needs to be a cut-off somewhere). I’ve had experiences with some arrays that don’t cache sequential writes at all, and that is generally a good thing. I’ve had experiences with a lot more that do, and when a workload generates a lot of redo, LGWR I/O can literally swamp the array cache. Sure, the blocks should be cached long enough for the write back to disk, but letting them push any deeper into the array cache than the LRU end makes little sense. Marketing terms for arrays that handle these subtleties usually sound like “Adaptive Array Cache”, or words to that effect.
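Before running any noise tests, it can be worth getting a rough sense of how much sequential write traffic LGWR is actually pushing at the array. A minimal sketch, assuming the redo logs sit on dedicated devices (sdc and sdd are placeholder names, not from any particular configuration), is to watch those devices with iostat(1) while the test workload runs and note the write throughput:
$ # Watch write throughput on the redo log devices every 5 seconds.
$ # sdc and sdd are placeholder device names -- substitute your own.
$ iostat -xk 5 sdc sdd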
One trick for seeing such potential damage is to run your test workload with concurrent sequential write “noise.” If you create a couple of files the same size as your redo logs and loop a couple of dd(1) processes performing 128K writes to them, without truncating the files on open, you can drive up this sort of I/O and see what it does to array performance. If the array handles the caching of sequential writes sensibly, without polluting the cache, you shouldn’t see very much damage. An example of such a dd(1) command is:
$ dd if=/dev/zero of=pseudo_redo_file_db1 bs=128k count=8192 conv=notrunc &
$ dd if=/dev/zero of=pseudo_redo_file_db2 bs=128k count=8192 conv=notrunc &
$ wait
Looping this sort of “noise workload” will simulate a lot of LGWR I/O for two databases. Considering the typical revisit rate of the other array cache contents, this sort of dd(1) I/O shouldn’t completely obliterate your cache. If it does, you have an array that is too fond of sequential writes.
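For reference, looping the noise can be as simple as wrapping the two dd(1) commands above in a while loop. This is just a sketch that reuses the placeholder file names and the size (8192 x 128K writes) from the example above; adjust both to match your redo logs:
$ # Loop the sequential-write noise until interrupted (Ctrl-C to stop).
$ while true
> do
>     dd if=/dev/zero of=pseudo_redo_file_db1 bs=128k count=8192 conv=notrunc &
>     dd if=/dev/zero of=pseudo_redo_file_db2 bs=128k count=8192 conv=notrunc &
>     wait
> done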
What Does This Have To Do With NAS?
This sort of workload can kill a filer. That doesn’t mean I’m any less excited about Oracle over NFS—I just don’t like filers. I recommend my collection of NFS related posts and my Scalable NAS for Oracle paper for background on what sequential writes can do to certain NAS implementations.
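If you want to see what the same sort of noise does to a NAS device, one hedged variant is to aim the dd(1) writes at an NFS mount and force the I/O through to the filer rather than letting the client-side cache absorb it. The mount point below is a placeholder, and oflag=direct assumes a dd(1) build and NFS mount that support direct I/O; conv=fsync is a coarser alternative if direct I/O isn’t available:
$ # Same noise workload aimed at an NFS mount (/mnt/nfs_test is a placeholder).
$ # oflag=direct pushes the writes through the client cache to the filer.
$ dd if=/dev/zero of=/mnt/nfs_test/pseudo_redo_file_db1 bs=128k count=8192 conv=notrunc oflag=direct &
$ dd if=/dev/zero of=/mnt/nfs_test/pseudo_redo_file_db2 bs=128k count=8192 conv=notrunc oflag=direct &
$ wait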
I’ll be talking about this topic and more at Utah Oracle User Group on March 21st.
Yeah. One way to handle it is to allocate separate spindles, controller, and cache to the redo logs. I remember we’ve done exactly that – 4+4 spindles in RAID 10, a separate controller and, if I’m not mistaken, cache (EMC DMX 2000). That did the trick: 1 ms for relatively small sequential writes, just a bit higher than what I could achieve with solid state storage (that was 0.5-0.6 ms). Still, write time jumped up slightly from time to time… I guess that was due to cache saturation on the box.
I guess the same can be done with a decent NAS. Right? OK, OK, I see the link to the paper 😉
Good luck tomorrow!
Your “white noise” won’t work on most modern arrays. Many disk arrays have zero-detection engines: they acknowledge the write but don’t actually write it to disk, so you will not get results that are anywhere near a real workload. You might use /dev/urandom instead.
Scott,
I don’t understand your comment. The whole point of the post focused on the array controller, not the disks. Can you clarify?