“Feel” Your Processor Cache. Oracle Does. Part I.

At about the same time I was reading Curt Monash’s mention of yet another in-memory database offering, my friend Greg Rahn started hammering his Nehalem-based Mac using the Silly Little Benchmark (SLB). Oh, before I forget, there is an updated copy of SLB here.

This blog entry is a quick and dirty two-birds-one-stone piece.

Sure About In-Memory Database? Then Be Sure About Memory Hierarchy
Curt’s post had a reference to this blog entry about levels of cache on the Csqlcache blog. I took a gander at it and immediately started gasping for air. According to the post, level-2 processor cache is:

Level 2 cache (also referred to as secondary cache) uses the same control logic as Level 1 cache and is also implemented in SRAM and soldered onto the motherboard, which is normally located close to the processor.

No, it isn’t. The last of the volume microprocessors to use off-die level-2 cache was the Pentium II, and that was 11 years ago. So, no, processors don’t jump off-die to access static RAMs glued to the motherboard. Processor L2 caches are on the processor silicon and, in fact, visible to other cores within a multi-core package.

That matters for cache-to-cache transfers, which occur at blisteringly high frequencies with an Oracle Database workload: spinlocks (latches) sustaining a high acquire/release rate will usually have another process spinning on the latch on one of the other cores. Once the latch is released, it is much more efficient to shuttle the protected memory lines via a cache-to-cache transfer than in the olden days, when reaching the L2 cache meant a trip across the bus. These shared caches dramatically accelerate Oracle concurrency. That’s scalability. But that isn’t what I’m blogging about.
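As an aside, here is a minimal sketch in C11 of the kind of test-and-set spinlock that generates that cache-to-cache traffic. To be clear, this is an illustration of the general technique, not Oracle’s latch implementation.

/* Minimal test-and-set spinlock sketch (illustrative only).
 * Every failed test_and_set pulls the lock's cache line into the
 * spinning core; the holder's release ships it back. A hot latch is
 * therefore a cache-line ping-pong exercise between cores. */
#include <stdatomic.h>

typedef struct {
    atomic_flag flag;
} spinlock_t;

static spinlock_t lock = { ATOMIC_FLAG_INIT };

static void spin_lock(spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
        ;  /* spin until the holder releases */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}

When the holder and the waiter share an on-die cache, that line migration is a cache-to-cache transfer rather than a round trip over the bus, which is exactly why shared caches help latch-intensive workloads.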

Get Your Stopwatch. How Fast is That L2 Cache?
In the Csqlcache blog it was stated matter-of-factly that L2 cache has a latency of 20 ns. Well, ok, sure, there are or have been L2 processor caches with 20 ns latency, but that is neither cast in stone nor the common nomenclature for expressing such a measurement. It also happens to be a very poor L2 latency number, but I digress. Modern microprocessor L2 cache accesses are in phase with the processor clock rate, so by convention, access times to L2 cache are expressed in CPU clock cycles. For example, consider a processor clocked at 2.93 GHz. At that rate, each cycle is 0.34 nanoseconds. Let’s say further that a read of a clean line in our 2.93 GHz processor requires 11 clock cycles. That would be 3.75 ns. However, expressing it in wall clock terms is not the way to go, especially on modern systems that can throttle the clock rate according to the load placed on the processor. Let’s say, for example, that our 2.93 GHz processor is temporarily clocked down to 2 GHz. Loading that same memory line would then require 5.5 ns.
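To make the cycles-to-nanoseconds arithmetic concrete, here is a trivial C snippet that reproduces the numbers above. The 11-cycle clean-line load is the hypothetical figure from the example, not a measurement.

#include <stdio.h>

int main(void)
{
    double ghz[] = { 2.93, 2.0 };  /* nominal and throttled clock rates */
    int latency_cycles = 11;       /* hypothetical clean-line L2 load */

    for (int i = 0; i < 2; i++) {
        double ns_per_cycle = 1.0 / ghz[i];  /* 1/GHz == ns per cycle */
        printf("%.2f GHz: %d cycles = %.2f ns\n",
               ghz[i], latency_cycles, latency_cycles * ns_per_cycle);
    }
    return 0;
}

Running it prints 3.75 ns at 2.93 GHz and 5.50 ns at 2 GHz: same cache, same cycle count, different wall-clock latency.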

We can use SLB to investigate this topic further. In the following session excerpt I ran SLB on the first core of the second socket of a Xeon 5400-based server. I had SLB (memhammer) allocate 4 MB of memory from which memhammer loops, picking random 64-byte offsets in which to write. It turns out that SLB is the most pathological of workloads because it requires a processor L2 line load prior to every write, except, that is, in the case where I allocate a sufficiently small chunk of memory to fit in the L2 cache. As the session snapshot shows, memhammer was able to write at random locations within the 4 MB chunk at the rate of 68.86 million times per second, or 14.5 ns per L2 cache access.


# cat r
./create_sem
taskset -pc 4 $$
./memhammer $1 $2 &
sleep 1
./trigger

wait

#  ./r 1024 3000000
pid 23384's current affinity list: 0-7
pid 23384's new affinity list: 4
Total ops 3072000000  Avg nsec/op    14.5  gettimeofday usec 44614106 TPUT ops/sec 68857145.8

When I increased the chunk of memory SLB allocated to 64 MB, the rate fell to roughly 9.3 million writes per second (107.8 ns per write) since the test blew out the L2 cache and was writing to memory.


#  ./r 16384 30000
pid 22919's current affinity list: 0-7
pid 22919's new affinity list: 4
Total ops 491520000  Avg nsec/op   107.8  gettimeofday usec 52970954 TPUT ops/sec 9279047.5
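For the curious, the heart of an SLB-style hammer loop might look like the following sketch. This is my illustration of the technique, not the actual memhammer source; the buffer sizes mirror the two runs above.

/* SLB-style memory hammer sketch (not the real memhammer code).
 * Writes to random 64-byte-aligned offsets in a buffer. At 4 MB the
 * working set fits in a large L2 cache; at 64 MB nearly every store
 * misses and must load the line from memory first. */
#include <stdlib.h>

#define LINE 64  /* cache line size in bytes */

static void hammer(char *buf, size_t bytes, long ops)
{
    size_t lines = bytes / LINE;
    for (long i = 0; i < ops; i++) {
        size_t off = ((size_t)rand() % lines) * LINE;
        buf[off] = (char)i;  /* the store forces a line load on a miss */
    }
}

int main(void)
{
    size_t bytes = 4UL << 20;  /* 4 MB; use 64UL << 20 to blow out L2 */
    char *buf = malloc(bytes);
    if (buf != NULL)
        hammer(buf, bytes, 3000000L);
    free(buf);
    return 0;
}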

I don’t know anything about Csqlcache. I do know that since they are focused on in-memory databases they ought to know what memory really “looks” like. So, put away your soldering iron and that bag full of SRAM chips. You can’t make your modern microprocessor system faster that way.

2 Responses to ““Feel” Your Processor Cache. Oracle Does. Part I.”


  1. Brett Schroeder, April 15, 2009 at 4:00 am

    Good heavens! The author of this article has either never seen and/or thought about all those die photographs Intel uses so extensively in its marketing these days.

    The L2 cache is the large, regular area that consumes almost half the die (on Core 2 Duo the L2 is about 4 mm x 4 mm). The SRAM memory cells are laid out in millions of adjacent columns and rows, whereas the logic circuitry has no regular repeating patterns (at least on this large scale of about 1 mm). This makes the L2 cache the easiest area to identify on the die.

    No need to feel your cache, you can see it 🙂

