Don’t Forget ARCH When Pondering Redo and Throwing Partners Under The Bus For Fun, Not Profit.

In my recent post about LGWR processing, I used a title that fits my recurring Manly Men motif just for fun. The title of the post was Manly Men Only Use Solid State Disk for Redo Logging. The post then set out to show that LGWR performance issues are not always I/O related by digging into what a log file sync wait event really encompasses. Well, I’m learning to choose my words better, the hard way. Let me explain.

The purpose for building the test case for that post using solid state disk (SSD) for redo log files was to establish the fact that even when redo logging I/O is essentially free, it is quite possible to see long duration log file sync wait events. I also mentioned that I did one pass of the test with the initialization parameter _disable_logging set to true and achieved the same results as the SSD run. That establishes the fact that SSD for redo logging is as fast as fast can be. The point of the exercise was to show that when the I/O component is not the root cause of long duration log file sync wait events it would be foolish to “throw hardware at the problem.” What I failed to point out, in that post only, was that SSD for redo is most certainly a huge win if your system is doing any reasonable amount of redo. That test case only generates about 500 redo writes per second so it is not a showcase example of when to apply SSD technology for redo. On the other hand, I would not have been able to make such a rich example of LGWR processing without solid state disk.
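To put rough numbers on that idea, here is a minimal Python sketch of the arithmetic. The function name and the figures are hypothetical and do not come from the test; it simply assumes you have an average log file sync time and an average log file parallel write time for the same interval. Subtracting one average from the other is only an approximation, since a single LGWR write can satisfy several waiting commits.

```python
# Hypothetical illustration: how much of an average log file sync wait
# is NOT explained by LGWR's own write (log file parallel write).
# The input figures below are invented for the example.

def non_io_share(avg_log_file_sync_ms, avg_log_file_parallel_write_ms):
    """Return the fraction of the average log file sync wait that is not redo I/O."""
    if avg_log_file_sync_ms <= 0:
        raise ValueError("average log file sync must be positive")
    # Whatever is left after the write is scheduling, posting and other overhead.
    non_io_ms = max(avg_log_file_sync_ms - avg_log_file_parallel_write_ms, 0.0)
    return non_io_ms / avg_log_file_sync_ms


if __name__ == "__main__":
    share = non_io_share(avg_log_file_sync_ms=8.0,
                         avg_log_file_parallel_write_ms=0.5)
    print(f"{share:.0%} of the average wait is not redo I/O")
```

When the share is large, as it is with these invented numbers, faster storage alone cannot shrink the wait by much.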

Biting The Hand That Feeds You
The good folks at Texas Memory Systems were nice enough to loan me the solid state disk that I happened to use in the LGWR processing post. I’m sad to report that, as per an email I received on the topic, their reading of the LGWR processing post left the impression that I do not promote the use of solid state disk for redo. Nothing could be further from the truth. The problem is that I am still working on the tests for my latest proof case for solid state using a much higher-end configuration. Regular readers of this blog know that I have consistently promoted solid state disk for high-end redo logging situations. I have been doing so since about 2000 when I got my first solid state disk, an Imperial MegaRam. That technology could not even hold a candle to the Texas Memory Systems device I have in my lab now though.

Forget SSD for Redo Logging, Really?
Regular readers of this blog know my regard for Nuno Souto, but he left a comment on the LGWR processing thread that I have to disagree with. He wrote:

Folks should forget SSDs for logs: they are good to reduce seek and rotational delays, that is NOT the problem in 99% of the cases with redo logs. They might be much better off putting the indexes in the SSDs!

Noons is correct to point out that rotation and seek are generally not a problem with LGWR writes to traditional round-brown spinning thingies, but I’d like to take it a bit further. While it is true that given a great deal of tuning and knowledge of the workload it is possible to do a tremendous amount of redo logging to traditional disks, that is generally only possible in isolated test situations. The perfect case in point is the high-end TPC-C results Oracle achieves on SMP hardware. The TPC-C workload generates a lot of redo and yet Oracle manages to achieve results in the millions of TpmC without solid state disk. That is because there are not as many variables with the TPC workload as there are in production ERP environments. And there is one other dirty little secret: log switches.

Real Life Logging. Don’t Forget Log Switches.
Neither the test case I used for the LGWR processing post nor any audited Oracle TPC-C run is conducted with the database in archive log mode. What? That’s right. The TPC-C specification stipulates that the total system price include disk capacity for a specific number of days’ worth of transaction logging, but it is not required that logs actually be kept, so the databases are never set up in archivelog mode.

One of the most costly aspects of redo logging is not LGWR’s log file parallel write, but instead the cost of ARCH spooling the inactive redo log to the archive log destination. When Oracle performs a log switch, LGWR and ARCH battle for bandwidth to the redo log files.

Noons points out in his comment that rotation and seek are not a problem for LGWR writes, which is generally true. However, all too often folks set up their redo logs on a single LUN. And although the single LUN may have many platters under it, LGWR and ARCH going head to head performing sequential writes and sequential large reads of blocks on the same spindles can introduce performance bottlenecks. Significant performance bottlenecks. To make matters worse, it is very common to have many databases stored under one SAN array. Who is paying attention to religiously carve out LUNs for one database or the other based on logging requirements (both LGWR and ARCH)? Usually nobody.
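A back-of-the-envelope model helps show why the two streams collide. The Python sketch below is only that: the function name, the inputs and the numbers are all hypothetical, and it ignores multiplexed log members, storage caches and other databases on the same array. It assumes a steady redo rate, a fixed online log size, and an ARCH process that reads the just-filled log at a fixed rate from the same LUN.

```python
# Hypothetical model of LGWR and ARCH sharing one LUN around log switches.
# All figures are invented; real systems have many more variables.

def log_switch_load(redo_mb_per_sec, log_size_mb, arch_read_mb_per_sec):
    """Estimate switch interval, archive time, overlap and peak LUN throughput."""
    # A log fills, and a switch happens, once per log_size / redo_rate seconds.
    seconds_between_switches = log_size_mb / redo_mb_per_sec
    # ARCH needs this long to read the log that was just filled.
    seconds_to_archive = log_size_mb / arch_read_mb_per_sec
    # While ARCH is reading, the LUN serves LGWR writes and ARCH reads together.
    peak_mb_per_sec = redo_mb_per_sec + arch_read_mb_per_sec
    # Fraction of each switch interval during which both streams are active.
    overlap_fraction = min(seconds_to_archive / seconds_between_switches, 1.0)
    return {
        "seconds_between_switches": seconds_between_switches,
        "seconds_to_archive": seconds_to_archive,
        "peak_mb_per_sec": peak_mb_per_sec,
        "overlap_fraction": overlap_fraction,
    }


if __name__ == "__main__":
    for name, value in log_switch_load(20.0, 1024.0, 80.0).items():
        print(f"{name}: {value:.2f}")
```

An overlap fraction that reaches 1.0 in this toy model means ARCH never finishes one log before the next switch, which is the point at which it falls behind.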

The thing I like the most about solid state for redo logging is that it neutralizes concern for both sides of the redo equation (LGWR and ARCH) and does so regardless of how many databases you have. Moreover, if you log in solid state to start with, you don’t have to worry about which databases are becoming logging-critical because all the databases get zero cost LGWR writes and log switches.

Storage Level Caches
Simply put, redo logging wreaks havoc on SAN and NAS caches. Think about it this way: a cache is important for revisiting data. Although DBAs care for redo logs and archived redo logs with religious fervor, very few actually want to revisit that data, at least not for the sake of rolling a database forward (after a database restore). So if redo logs are just sort of a necessary evil, why allow redo I/O to punish your storage cache? I realize that some storage cache technology out there supports tuning to omit the caching of sequential reads or writes and other such tricks, but folks, how much effort can you put into this? After all, we surely don’t want to eliminate all sequential read and write caching just because the ones LGWR and ARCH perform are trashing the storage cache. And if such tuning could be done on a per-LUN basis, does that really make things that much simpler? I don’t think so. The point I’m trying to make is that if you want to eliminate all facets of redo logging as a bottleneck, while offloading the storage caches, solid state redo is the way to go.
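The cache-trashing effect is easy to see in a toy simulation. The Python sketch below assumes a plain LRU cache, which is a simplification: real array caches use their own, often smarter, policies, and every name and number here is made up. It cycles through a small set of hot data blocks and, optionally, interleaves redo blocks that are written once and never revisited.

```python
from collections import OrderedDict

# Toy LRU cache simulation (an assumption, not any vendor's actual policy).
# Hot data blocks are re-read in a loop; redo blocks are touched exactly once.

def hot_hit_ratio(cache_blocks, hot_blocks, redo_blocks_per_hot_read, rounds):
    """Return the cache hit ratio seen by reads of the hot data blocks."""
    cache = OrderedDict()

    def touch(key):
        hit = key in cache
        if hit:
            cache.move_to_end(key)          # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)   # evict the least recently used block
        return hit

    hits = 0
    reads = 0
    next_redo = 0
    for _ in range(rounds):
        for block in range(hot_blocks):
            reads += 1
            hits += touch(("data", block))
            for _ in range(redo_blocks_per_hot_read):
                touch(("redo", next_redo))  # sequential, never revisited
                next_redo += 1
    return hits / reads


if __name__ == "__main__":
    print("no redo in cache:  ", hot_hit_ratio(100, 80, 0, 10))
    print("redo sharing cache:", hot_hit_ratio(100, 80, 1, 10))
```

With these invented sizes the hot set fits in the cache on its own, but one redo block per data read is enough to push every hot block out before it is read again.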

Summary
No, I don’t throw partners under the bus. I just can’t do total brain dumps in every blog entry and yes, I like catchy blog titles. I just hope that people read the blog entry too.

Hey, I’m new to this blogging stuff after all.

6 Responses to “Don’t Forget ARCH When Pondering Redo and Throwing Partners Under The Bus For Fun, Not Profit.”


1 Noons August 7, 2007 at 8:18 am

“Who is paying attention to religiously carve out LUNs for one database or the other based on logging requirements (both LGWR and ARCH)? Usually nobody.”

“nobody” is my middle name, Kevin! 🙂

I agree entirely with your points. Like I said: 99%. The other 1% are the folks for whom this stuff becomes critical. We might have different percentual experiences and I can certainly relate to that.
In the 1% cases, it pays to be very careful with the LUN allocation.

One of the things I’ve noticed for example is that once we separate the redo to its own individual LUN and assuming we have a decent amount of cache in the SAN, we get the logs since the last switch – very important, like you said, the log switches! – nicely lined up in the cache, ready for ARCH to come in and get them. It is therefore highly desirable to time the redo log switch so that the SAN doesn’t have to flush the LUN cache before the next switch comes around. It can be done.

That usually obviates the need to have SSDs lined up until you are well into the <1% territory. Not to say by any means that SSDs don’t have a use in logs.

But in most cases I’d rather get them lined up for the hot index blocks. And this is where partitioning large hot tables and indexes becomes so important when we’re looking at the last ounce of performance off a system: slap the hot data/indexes into its own partitions in a SSD, stick less busy portions into other partitions in the SAN and enjoy trouble free performance.

Or rather: very low levels of trouble. There is no such thing as trouble free in this field…

2 kevinclosson August 7, 2007 at 3:11 pm

NNNS? Nuno Noons Nobody Souto? That’s two middle names 🙂

3 Noons August 8, 2007 at 3:34 am

LOL!
yeah, it’s getting bad…

another thing I completely forgot to mention:

when “playing tricks” with the SAN/lun cache, it’s all very well if the application load does not cause sustained periods of I/O. Caches work by delaying the inevitable: when it happens, we better have disks around that can cope with the volume being read/written.

In this specific case of the redo logs, if the amount of writes on the database is very large and sustained – typically monster 1Mtps benchmarks or large dw loads or online google stuff – then inevitably at some time in the future we may run out of cache and start doing real writes. That is crunch time for SANs, but more than likely not for SSDs.

So I guess that is another gotcha to add to yours above.

Like in all things, I like to think of it as a trade off: use SAN technology – or NAS/NFS! – up to a point. Once past that point – for whatever reasons – then other technologies kick in. SSD is the most obvious one.

4 Krishna Manoharan July 24, 2008 at 3:52 pm

Hi Kevin,

Have you tried using Solid State Devices for Temporary Tablespaces?

Thanks
Krishna

5 SQL Sasquatch January 17, 2014 at 8:09 am

Great blog post! Important considerations for SQL Server, too. Can you get good – even great – SAN spinning disk performance for txlog? Yeah – for writing to one txlog, on dedicated disks (and maybe also requiring controller CPU/cache partition).
Add in SQL Server log backup, Always On replication, transaction rollbacks, or heaven forbid writes to another txlog for another database on the same spinning disks and things can get dicey. Too many sustained sequential threads to an inadequate number of spinning disks (or saturated CPU/cache in front of them), and sequential performance will drop below expected random performance.
So… need sustained high performance for one SQL Server database txlog? It can be done with spinning disk. Want to get sustained high performance for more than one txlog and/or add some log readers like log backups or replication to the mix? Good SSDs will let you do it with a small number of SSD devices (and adequate controller CPU & write cache if SAN vs onboard), while HDDs would require lots of devices and lots of attention (for balancing, striping & spreading, etc). In the end some systems have enough HDD devices but suffer from lack of attention 🙂
Too many words, sorry. But trying to get my thoughts sorted out for a similar blog post aimed at SQL Server community. 🙂

1 Recommending a few good articles that explain basic Oracle concepts | a db thinker's home Trackback on March 16, 2010 at 7:46 am


Kevin Closson's Blog: Platforms, Databases and Storage