Fun With Intel Xeon 5500 Nehalem and Linux cpuspeed(8). Part II.

In my recent blog entry entitled Fun With Intel Xeon 5500 Nehalem and Linux cpuspeed(8) Part I, I shared a peek into how Nehalem processors respond to load by automatically increasing the processor clock frequency. I know both Intel and AMD processors have supported this functionality for ages, but this series is focused on certain edge cases that might be of interest to regular readers, or perhaps even the wayward googler…

Part I was more focused on getting the CPUs to “heat up”; this installment in the series has more to do with how the processors “cool down” in response to reduced load. But first…

Busy Only The CPU, Not Memory.
I decided to change the method I use to stress the processors to an approach that is purely CPU-bound. The following box shows the new, simple program (dumb.c) that I use to stress the processors, along with a listing of the busy.sh script that drives it.


# cat dumb.c
/* Spin a long, empty loop: purely CPU-bound, no memory traffic. */
int main(void)
{
        unsigned long long i = 0ULL;  /* 64-bit so the bound fits even on 32-bit builds */

        for (; i < 9999999999ULL; i++)
                ;
        return 0;
}
# cat busy.sh
#!/bin/bash

function busy() {
        local WHICH_CPU=$1

        # Pin one instance of the spin loop to the requested OS CPU
        taskset -c $WHICH_CPU ./a.out
}
#--------------
CPU_STRING="$1"

for CPU in $CPU_STRING
do
        ( busy "$CPU" ) &
done
wait


That’s really simple stuff, but it will do nicely to see what it takes to get cpuspeed(8) to crank up the clock rates. The following box shows that a single instance of the dumb.c program completes in just short of 22 seconds. I’ll also verify that the Xeon 5500-based system I’m testing is booted with NUMA disabled in the BIOS.


# time ./a.out

real    0m21.632s
user    0m21.621s
sys     0m0.001s

# numactl --hardware
available: 1 nodes (0-0)
node 0 size: 16132 MB
node 0 free: 9395 MB
node distances:
node   0
 0:  10 
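Throughout this post I report clock rates with a little howfast.sh script that isn’t listed anywhere in the series. The original isn’t shown, so this is only a plausible sketch of what it does (assumption: it walks the cpufreq sysfs files and prints each OS CPU number followed by its current clock in MHz, matching the “0 2934.000 1 1600.000 …” output format below):

```shell
#!/bin/bash
# Hypothetical reconstruction of howfast.sh: print "<cpu> <MHz>" pairs.
# scaling_cur_freq reports kHz, so divide by 1000 for MHz.
for FREQ_FILE in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq
do
    [ -r "$FREQ_FILE" ] || continue   # skip if cpufreq is not exposed here
    CPU=${FREQ_FILE#/sys/devices/system/cpu/cpu}
    CPU=${CPU%%/*}
    KHZ=$(cat "$FREQ_FILE")
    printf '%s %d.000 ' "$CPU" $((KHZ / 1000))
done
echo
```

On a box without the cpufreq driver loaded (e.g., a virtual machine), the loop simply prints nothing.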

In the next box I’ll show how running dumb.c on the primary thread of either core 0 or core 1 of socket 0 produces odd speed-up. Running dumb.c on core 0 speeds up OS CPU 0 and every even-numbered processor thread in the box. Conversely, stressing core 1 causes the clock rate on all odd-numbered processor threads to increase. Yes, that seems weird to me too. I don’t know why it does this, but I’ll try to find out.


# ./busy.sh 0;./howfast.sh
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
#
# ./busy.sh 1;./howfast.sh
0 1600.000 1 2934.000 2 1600.000 3 2934.000 4 1600.000 5 2934.000 6 1600.000 7 2934.000 8 1600.000 9 2934.000 10 1600.000 11 2934.000 12 1600.000 13 2934.000 14 1600.000 15 2934.000

After seeing the effect of stressing core 0 and core 1 on socket 0, I thought I’d try all primary threads in both sockets:

# ./busy.sh '0 1 2 3 4 5 6 7' ;./howfast.sh
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000

Interesting. I get the same speedup results that I get when I stress only OS CPU 0. That made me curious, so I tried all secondary threads in both sockets:

# ./busy.sh '8 9 10 11 12 13 14 15' ;./howfast.sh
0 2934.000 1 2934.000 2 2934.000 3 2934.000 4 2934.000 5 2934.000 6 2934.000 7 2934.000 8 2934.000 9 2934.000 10 2934.000 11 2934.000 12 2934.000 13 2934.000 14 2934.000 15 2934.000

Yes, that too is an oddity. Hammering all the primary threads heats up only OS CPUs 0,2,4,6,8,10,12 and 14, but hammering on all the secondary threads causes all processor threads to clock up. Interesting. That got me thinking about just how consistent that behavior is. As the following text box shows, it isn’t that consistent. I looped 10 iterations of busy.sh hammering all the secondary processor threads and found that only 50% of the time did it cause all the processor threads to speed up:

# for t in 1 2 3 4 5 6 7 8 9 10; do ./busy.sh '8 9 10 11 12 13 14 15' ;./howfast.sh;sleep 30; done
0 2934.000 1 2934.000 2 2934.000 3 2934.000 4 2934.000 5 2934.000 6 2934.000 7 2934.000 8 2934.000 9 2934.000 10 2934.000 11 2934.000 12 2934.000 13 2934.000 14 2934.000 15 2934.000
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
0 2934.000 1 2934.000 2 2934.000 3 2934.000 4 2934.000 5 2934.000 6 2934.000 7 2934.000 8 2934.000 9 2934.000 10 2934.000 11 2934.000 12 2934.000 13 2934.000 14 2934.000 15 2934.000
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
0 2934.000 1 2934.000 2 2934.000 3 2934.000 4 2934.000 5 2934.000 6 2934.000 7 2934.000 8 2934.000 9 2934.000 10 2934.000 11 2934.000 12 2934.000 13 2934.000 14 2934.000 15 2934.000
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
0 2934.000 1 2934.000 2 2934.000 3 2934.000 4 2934.000 5 2934.000 6 2934.000 7 2934.000 8 2934.000 9 2934.000 10 2934.000 11 2934.000 12 2934.000 13 2934.000 14 2934.000 15 2934.000
0 2934.000 1 2934.000 2 2934.000 3 2934.000 4 2934.000 5 2934.000 6 2934.000 7 2934.000 8 2934.000 9 2934.000 10 2934.000 11 2934.000 12 2934.000 13 2934.000 14 2934.000 15 2934.000
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
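A sanity check worth running on a box like this (my addition, not from the original tests) is to confirm the primary/secondary thread pairing directly, by asking sysfs which OS CPUs share a physical core:

```shell
#!/bin/bash
# Print each OS CPU's hyper-thread sibling list. On the box above, cpu0
# and cpu8 should report the same pairing (e.g. "0,8") if CPUs 8-15 are
# the secondary threads of cores 0-7.
for TOPO in /sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list
do
    [ -r "$TOPO" ] || continue
    CPU=${TOPO#/sys/devices/system/cpu/}
    echo "${CPU%%/*}: $(cat "$TOPO")"
done
true
```

That would rule out any confusion over which OS CPU numbers are primary versus secondary threads before interpreting the clock-rate patterns.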

How Long Do They Stay Hot?
Not long. In the following text box I’ll show that after heating up OS CPUs 0,2,4,6,8…14, it took only about 10 seconds for all the processor threads to throttle back down to 1600 MHz:

# ./howfast.sh;./busy.sh '0' ;./howfast.sh;sleep 5;./howfast.sh;sleep 5;./howfast.sh
0 1600.000 1 1600.000 2 1600.000 3 1600.000 4 1600.000 5 1600.000 6 1600.000 7 1600.000 8 1600.000 9 1600.000 10 1600.000 11 1600.000 12 1600.000 13 1600.000 14 1600.000 15 1600.000
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
0 1600.000 1 1600.000 2 1600.000 3 1600.000 4 1600.000 5 1600.000 6 1600.000 7 1600.000 8 1600.000 9 1600.000 10 1600.000 11 1600.000 12 1600.000 13 1600.000 14 1600.000 15 1600.000
0 1600.000 1 1600.000 2 1600.000 3 1600.000 4 1600.000 5 1600.000 6 1600.000 7 1600.000 8 1600.000 9 1600.000 10 1600.000 11 1600.000 12 1600.000 13 1600.000 14 1600.000 15 1600.000

So, this was a really long blog entry that will likely raise more questions than it answers. But it is Part II in a series, and once I know more, I’ll post it. The “more” I’ll know will have to do with the “ondemand” governor for CPU scaling:


# /etc/rc.d/init.d/cpuspeed status
Frequency scaling enabled using ondemand governor

8 Responses to “Fun With Intel Xeon 5500 Nehalem and Linux cpuspeed(8). Part II.”


  1. 1 Brett Schroeder May 12, 2009 at 8:35 pm

    Are you (implicitly) assuming that each core has its own independent clock (that can be manipulated by software)? Clock circuitry utilizes a phase locked loop (PLL) which has several analog components. This analog circuitry does not scale (in size) at the same rate as digital circuitry (~50% area reduction per technology generation). Thus clocks are expensive in terms of silicon real estate (and becoming more so with each technology generation). A reasonable engineering trade-off would be to have some cores share clock circuitry in order to reduce die size (this in turn reduces the cost of each die).

    A quick lunch time scan through the Intel Systems Programming Guide did not yield anything conclusive, only vague statements like “With multiple processor cores residing in the same physical package, hardware dependencies may exist for a subset of logical processors on a platform. These dependencies may impose requirements that impact coordination of P-state transitions.” (from volume 3A, section 13.2).

    Maybe Steve Shaw’s arm can be twisted into talking to the Nehalem architects to verify/refute the independent-clock-per-core assumption 😉

    • 2 kevinclosson May 12, 2009 at 10:30 pm

      “Are you (implicitly) assuming that each core has its own independent clock”

      Brett, I’m not making any assumptions, except perhaps that applying pressure to the various cores in the manner I’ve demonstrated should have sped up the cores/threads differently. This is a learning curve.

    • 3 Brett Schroeder May 12, 2009 at 10:33 pm

      D’oh! Just realized that this is happening across sockets and not only within a single socket i.e. run it on core 0 and *all* even-numbered threads speed up, as in “all” threads in the entire box. This somewhat negates the argument for a shared on-die clock being responsible for the observations and/or at least points to another mechanism at higher level that is forcing common clock speeds…..although I have no idea what that could be.

      • 4 kevinclosson May 12, 2009 at 10:56 pm

        No problem, Brett. I didn’t read your comment as an argument. It looked like a hypothesis more than anything…which is just fine with me…but, yes, the point I was making with these test results is what happens to the other socket.

  2. 5 Chris Slattery May 13, 2009 at 10:12 am

    Is the NUMA disable important because of synchronized access to the memory and all need to be operating at the same speed for that?

  3. 7 Kermit the sensible June 17, 2009 at 9:41 pm

    just a note to say I tested this on a new dell running Ubuntu 9 and it does not show this behaviour…. probably different kernel/bios settings?

    • 8 kevinclosson June 17, 2009 at 11:44 pm

      Hi Kermit,

      Thanks for testing it. There could be so many things different…but that is sort of the point of this microbenchmark. It lets people explore this stuff. I have no idea what Dell you have, but since I’m using an HP Proliant server the diffs could be significant, and those diffs could easily change the cpuspeed behaviour…and, indeed, Ubuntu too can change all that.
      Is your Dell a NUMA or SUMA box?







Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.
