I recently received email from a reader who wondered why Parts I and II of my series on Intel Xeon 5500 “Nehalem” and Linux cpuspeed(8) were based on NUMA-disabled mode (SUMA/SUMO) testing. The series the reader referred to can be found at the following links:
Fun With Intel Xeon 5500 Nehalem and Linux cpuspeed(8) Part I.
Fun With Intel Xeon 5500 Nehalem and Linux cpuspeed(8) Part II.
The reader is correct. Thus far in the series I’ve been sharing some findings (trivia?) from a test system with NUMA disabled at the BIOS level. For reference, you can read more about the concept of disabling NUMA on commodity NUMA systems in this post. As an aside, running a Commodity NUMA Implementation (CNI) system (e.g., Xeon 5500 “Nehalem”) with NUMA disabled in the BIOS is also referred to as a SUMA or SUMO configuration.
A Look at cpuspeed(8) and NUMA
In this blog entry I’ll show some findings based on the busy.sh script (which stresses a chosen set of processor threads) and an analysis, via the howfast.sh script, of how cpuspeed(8) reacts. But first, recall from Part II of this series where I said:
Hammering all the primary threads heats up only OS cpus 0,2,4,6,8,10,12 and 14 but hammering all the secondary threads causes all processor threads to clock up.
That was indeed an odd thing to observe and I have not yet started to investigate why it is that way since I’m still in somewhat of a discovery phase. Let’s see how the processors respond under the same conditions with NUMA enabled in the BIOS. But first, I’ll do a quick check to make sure this is a NUMA system, not a SUMA/SUMO system. I’ll use numactl(8) to make sure I have two NUMA nodes in this HP ProLiant server with Intel Xeon 5500 “Nehalem” processors:
# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 8052 MB
node 0 free: 3683 MB
node 1 size: 8080 MB
node 1 free: 3664 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
Good, it is a NUMA system. In the following box I’ll show how the processors respond in two different experiments. Before I show any test results, though, I need to point out that I’ve changed the howfast.sh script so that it takes an argument and compares the current processor speeds against the supplied value. If no argument is provided, the script just lists a single line of output with all the processors’ current clock rates. This change was necessary to avoid having to peruse the output of the script to validate the speeds prior to an experiment.
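For anyone who wants to play along at home, here is a minimal sketch of how such a howfast.sh could be written. To be clear, this is an approximation rather than the exact script I’m running; it assumes the cpufreq sysfs interface is present and that bc(1) is installed:

#!/bin/bash
# howfast.sh (sketch) -- with no argument, print every OS cpu and its
# current clock rate on a single line; with an argument (in MHz), fail
# as soon as any cpu is not clocked at that speed.
expected=$1
ncpus=$(grep -c ^processor /proc/cpuinfo)
line=""
cpu=0
while [ $cpu -lt $ncpus ]
do
        khz=$(cat /sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_cur_freq)
        mhz=$(echo "scale=3; $khz / 1000" | bc)
        if [ -n "$expected" ] && [ "${mhz%.*}" != "$expected" ]
        then
                echo "Check Failed: CPU $cpu is $mhz"
                exit 1
        fi
        line="$line$cpu $mhz "
        cpu=$((cpu + 1))
done
[ -z "$expected" ] && echo "$line"
exit 0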
The following box shows the new script behavior. I first use the script with an argument of 1600; so long as all the cpus are currently clocked at 1600 MHz, the script returns success and the shell moves on to execute busy.sh. As expected, after busy.sh executed, howfast.sh stumbles on a cpu that is not clocked at 1600 MHz and fails:
# ./howfast.sh 1600 && ./busy.sh 1;./howfast.sh 1600
Check Failed: CPU 0 is 2934.00
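For completeness, here is a rough sketch of the sort of thing busy.sh does. The real script drives load in the spirit of the dumb.c program from Part II; this stand-in just uses a shell spin loop and assumes taskset(1) is available:

#!/bin/bash
# busy.sh (sketch) -- pin a cpu-bound spinner to each OS cpu listed in
# the quoted argument (e.g., ./busy.sh '8 9 10 11'), let them run long
# enough for cpuspeed(8) to react, then kill them off.
pids=""
for cpu in $1
do
        taskset -c $cpu /bin/bash -c 'while :; do :; done' &
        pids="$pids $!"
done
sleep 10        # give cpuspeed(8) time to clock the processors up
kill $pids
wait 2>/dev/null
exit 0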
NUMA Experiments
First, I’ll stress the primary thread of core 0. Next, I’ll stress the primary thread of core 1. Both cores are in socket 0:
# ./howfast.sh 1600 && ./busy.sh 0; ./howfast.sh
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
#
# ./howfast.sh 1600 && ./busy.sh 1; ./howfast.sh
0 1600.000 1 2934.000 2 1600.000 3 2934.000 4 1600.000 5 2934.000 6 1600.000 7 2934.000 8 1600.000 9 2934.000 10 1600.000 11 2934.000 12 1600.000 13 2934.000 14 1600.000 15 2934.000
That output should look familiar to the six or so folks following this series because it is exactly how the processors behave when the system is booted as a SUMA/SUMO system. In Part II of this series I made the following observation:
Running dumb.c on core 0 speeds up OS CPU 0 and every even-numbered processor thread in the box. Conversely, stressing core 1 causes the clock rate on all odd-numbered processor threads to increase.
Let’s see what happens when I hammer multiple processor threads as I did in Part II.
# ./howfast.sh 1600 && ./busy.sh '0 1 2 3 4 5 6 7';./howfast.sh
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
# ./howfast.sh 1600 && ./busy.sh '8 9 10 11 12 13 14 15';./howfast.sh
0 2934.000 1 2934.000 2 2934.000 3 2934.000 4 2934.000 5 2934.000 6 2934.000 7 2934.000 8 2934.000 9 2934.000 10 2934.000 11 2934.000 12 2934.000 13 2934.000 14 2934.000 15 2934.000
Déjà Vu
Here, as in the SUMA case, stressing the primary processor threads in both sockets causes only certain processor threads to clock up. On the other hand, as was also the case with SUMA, stressing the secondary processor threads of both sockets speeds up all processor threads. So, at least this much is consistent between the NUMA and SUMA tests. But what about a series of these tests with a cool-down period in the loop?
In the following box I’ll show the effect of looping the busy.sh script in the same fashion as I did in Part II (SUMA). In each iteration, I’ll stress the secondary processor threads of both sockets. As you’ll see, the results are similar to the SUMA behavior except for the frequency of tests that resulted in all processors speeding up. In the SUMA case it was 50% but in the NUMA case it is only 40%:
#
# for t in 1 2 3 4 5 6 7 8 9 10; do ./howfast.sh 1600 && ./busy.sh '8 9 10 11 12 13 14 15' ;./howfast.sh;sleep 30; done
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
0 2934.000 1 2934.000 2 2934.000 3 2934.000 4 2934.000 5 2934.000 6 2934.000 7 2934.000 8 2934.000 9 2934.000 10 2934.000 11 2934.000 12 2934.000 13 2934.000 14 2934.000 15 2934.000
0 2934.000 1 2934.000 2 2934.000 3 2934.000 4 2934.000 5 2934.000 6 2934.000 7 2934.000 8 2934.000 9 2934.000 10 2934.000 11 2934.000 12 2934.000 13 2934.000 14 2934.000 15 2934.000
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
0 2934.000 1 2934.000 2 2934.000 3 2934.000 4 2934.000 5 2934.000 6 2934.000 7 2934.000 8 2934.000 9 2934.000 10 2934.000 11 2934.000 12 2934.000 13 2934.000 14 2934.000 15 2934.000
0 2934.000 1 2934.000 2 2934.000 3 2934.000 4 2934.000 5 2934.000 6 2934.000 7 2934.000 8 2934.000 9 2934.000 10 2934.000 11 2934.000 12 2934.000 13 2934.000 14 2934.000 15 2934.000
0 2934.000 1 1600.000 2 2934.000 3 1600.000 4 2934.000 5 1600.000 6 2934.000 7 1600.000 8 2934.000 9 1600.000 10 2934.000 11 1600.000 12 2934.000 13 1600.000 14 2934.000 15 1600.000
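As an aside, tallying those runs by eye gets old fast. If the loop output is captured to a file (runs.out is just a hypothetical name here), a quick awk pass can do the counting, since any output line containing no 1600 reading is an all-threads-up run:

# count the iterations in which every processor thread clocked up
awk '/^0 / { total++; if ($0 !~ /1600/) up++ }
     END { printf "%d of %d runs clocked up all threads\n", up, total }' runs.out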
So here we are at Part III, and thus far the sum of all this information is:
- cpuspeed(8) acts unpredictably on Xeon 5500 “Nehalem” processors.
- cpuspeed(8) acts differently on Xeon 5500 “Nehalem” processors in NUMA mode compared to SUMA mode.
- Processors cool down quickly after being clocked up.
Someone, someday, will likely be scratching their head and googling to see if anyone else is seeing odd processor frequency issues with Xeon 5500 “Nehalem” processors. If nothing else, this series of blog posts will at least let said googler know that they are not alone in what they are seeing.
A reader left the following comment, which sheds some light on these results:
Some of your results can be explained by the fact that Nehalem has Hyper-Threading (and your BIOS has it enabled). Hyper-Threading effectively turns one core into two logical CPUs by having two sets of registers to store the architectural state of two threads. However, not all components of the core are duplicated for each thread. The threads are thus forced to share the “central” parts of each core, i.e., the execution engine, system bus interface, and clock circuitry.
Since the clock is common to both threads on any given core, when you load up the primary thread and its frequency increases, so does the reported frequency of the secondary, and vice versa when you load up the secondary. These reported speeds are in reality the speed of the same clock circuit in each core.
In general, for your two-socket quad-core system, the speed of OS CPU #N will *always* equal the speed of OS CPU #(N+8), irrespective of how you load it, e.g., CPU #3 (s0_c3_t0) = CPU #11 (s0_c3_t1).
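You can verify that sibling mapping on your box, assuming the kernel exposes the cpu topology files in sysfs, with something like the following:

# list which OS cpus share a physical core (i.e., are HT siblings)
for d in /sys/devices/system/cpu/cpu[0-9]*
do
        echo "${d##*/}: $(cat $d/topology/thread_siblings_list)"
done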
See sections 2.2.7 and 2.2.8 of the Intel Software Developer’s Manual, Vol. 1 (there is a nice schematic of Nehalem/Core i7 at the end of 2.2.8). Here’s the link: http://download.intel.com/design/processor/manuals/253665.pdf