I received an interesting email recently from a reader who takes offense at my daring to discuss the differences between Intel Xeon 5500 (Nehalem) systems operating in NUMA versus SUMA/SUMO mode. One excerpt of the email read:
…and I think you are just creating confusion and chaos to gain popularity with your NUMA versus non-NUMA stuff. We tested everything we can think of and see no difference when booted with NUMA or non-NUMA…
I don’t doubt for one moment that the testing performed by this reader showed no performance differences between NUMA and SUMA because I have no idea whatsoever what his testing consisted of. And, besides, Xeon 5500 Nehalem EP is one extremely nice NUMA package. That is, when running non-NUMA aware software on this particular NUMA offering you can rest assured that you won’t likely fall over dead from NUMA pathologies. That’s good, but does that mean there really is no difference when booted in NUMA versus SUMA mode? Hardly!
Please allow me to explain something. Intel Xeon 5500 (Nehalem) is a very tightly coupled NUMA system. Remote memory references are only about 20% more costly than local. If you measure a workload that does not saturate the processors you are very unlikely to detect any difference in throughput. If you have a program that only drives a processor core to, say, 80% utilization you will most likely not see any throughput difference whether the process performs all its I/O into remote memory or local memory. When using only remote memory the process would consume moderately more processor cycles; however, unless the code is so synthetic that it forces a high rate of L2 misses, the result would likely be equivalent throughput in both the local and remote cases.
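For anyone who wants to try the local-versus-remote comparison on their own workload, here is a minimal sketch. It only prints the two numactl command lines to run; bound_cmd and WORKLOAD are names I made up for illustration, and ./my_workload is a placeholder for whatever program you actually want to measure. The --cpunodebind and --membind options are standard numactl(8) options.

```shell
#!/bin/sh
# bound_cmd CPU_NODE MEM_NODE COMMAND...
# Prints a numactl command line that runs COMMAND on the cores of CPU_NODE
# with all of its memory allocated from MEM_NODE.
# bound_cmd and WORKLOAD are hypothetical names invented for this sketch.
bound_cmd() {
    cpu_node=$1
    mem_node=$2
    shift 2
    echo "numactl --cpunodebind=$cpu_node --membind=$mem_node $*"
}

# Placeholder workload: substitute the program you want to measure.
WORKLOAD=${WORKLOAD:-./my_workload}

# Same cores in both cases; only the memory node changes.
echo "local:  $(bound_cmd 0 0 "$WORKLOAD")"
echo "remote: $(bound_cmd 0 1 "$WORKLOAD")"
```

Run each printed command under time(1) and compare both throughput and CPU time. If the reasoning above holds, a workload that leaves the core partly idle should show the difference in CPU time well before it shows up in throughput.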
NUMA/SUMA: The Ever-Hypothetical Topic
Let’s stop talking in the hypothetical. How about something that, gasp, real Oracle Database Administrators have to do more than just occasionally? Consider for a moment transferring a sizable zipped ASCII file in preparation for loading into an Oracle Data Warehouse. When booted in the default NUMA mode and running Linux, memory is presented to processes as multiple nodes rather than one flat pool. For example, the following box shows a freshly booted Intel Xeon 5500 (Nehalem EP) box with 16 GB total RAM split across two memory nodes. Notice how, just 7 minutes after booting, memory has been consumed in a non-symmetrical fashion. The numactl command shows that node 1 has roughly 40% more free memory than node 0 (7955 MB versus 5773 MB); nearly everything allocated so far has come from node 0. That’s because not every memory usage in the Linux kernel (including drivers) is NUMA aware. But that is not what I’m blogging about.
    # uptime;numactl --hardware
     13:28:30 up 7 min, 1 user, load average: 0.00, 0.09, 0.07
    available: 2 nodes (0-1)
    node 0 size: 8052 MB
    node 0 free: 5773 MB
    node 1 size: 8080 MB
    node 1 free: 7955 MB
    node distances:
    node   0   1
      0:  10  20
      1:  20  10
    # cat /proc/meminfo
    MemTotal:     16427752 kB
    MemFree:      14059424 kB
    Buffers:         19588 kB
    Cached:         239480 kB
    SwapCached:          0 kB
    Active:          66308 kB
    Inactive:       217152 kB
    HighTotal:           0 kB
    HighFree:            0 kB
    LowTotal:     16427752 kB
    LowFree:      14059424 kB
    SwapTotal:     2097016 kB
    SwapFree:      2097016 kB
    Dirty:            1848 kB
    Writeback:           0 kB
    AnonPages:       24408 kB
    Mapped:          15024 kB
    Slab:           170920 kB
    PageTables:       3512 kB
    NFS_Unstable:        0 kB
    Bounce:              0 kB
    CommitLimit:  10310892 kB
    Committed_AS:   382752 kB
    VmallocTotal: 34359738367 kB
    VmallocUsed:    381716 kB
    VmallocChunk: 34359356623 kB
    HugePages_Total:     0
    HugePages_Free:      0
    HugePages_Rsvd:      0
    Hugepagesize:     2048 kB
    # free
                 total       used       free     shared    buffers     cached
    Mem:      16427752    2370064   14057688          0      19872     239852
    -/+ buffers/cache:    2110340   14317412
    Swap:      2097016          0    2097016
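To see the imbalance as "used" rather than "free" numbers, a small filter over the numactl --hardware output is enough. This is a sketch only: node_used_mb is a helper name invented here, and the sample input is the size and free lines copied from the box above.

```shell
#!/bin/sh
# node_used_mb: read `numactl --hardware` output on stdin and print
# "node <n> used <MB> MB" for each node (size minus free).
# The function name is hypothetical, made up for this sketch.
node_used_mb() {
    awk '
        $1 == "node" && $3 == "size:" { size[$2] = $4 }
        $1 == "node" && $3 == "free:" { free[$2] = $4 }
        END {
            for (n in size)
                printf "node %s used %d MB\n", n, size[n] - free[n]
        }' | sort
}

# Sample input copied from the listing above. On a live system use:
#   numactl --hardware | node_used_mb
node_used_mb <<'EOF'
available: 2 nodes (0-1)
node 0 size: 8052 MB
node 0 free: 5773 MB
node 1 size: 8080 MB
node 1 free: 7955 MB
node distances:
node 0 1
0: 10 20
1: 20 10
EOF
```

For the sample this prints 2279 MB used on node 0 against 125 MB on node 1, seven minutes after boot.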
Here I’d like to show a practical example of honest-to-goodness, real-world work that doesn’t exhibit totally benign NUMA characteristics. Within a VNC session I opened two xterm sessions. I’ll call them “left” and “right.” In the left xterm I’ll list a zipped ASCII file to capture the inode so as to prove my testing is happening against the same file. The file is inode 1701506. You’ll also see a stupid little script called henny_penny.sh, named appropriately as I apparently come off as Henny Penny to folks like the reader who emailed me. The henny_penny.sh script executed in the left xterm (its parent shell is process id 23283) was able to sling the contents of all_card_trans.ul.gz into /dev/null at the rate of 4.9 GB/s. That is very fast indeed. It is that fast, in fact, because the file was moved into the current directory with FTP, so the contents of the approximately 1.5 GB file are cached in memory. Ah, but the question is, what memory?
    # ls -li all* henny_penny*
    1701506 -rw-r--r-- 1 root root 1472114768 Aug 14 11:31 all_card_trans.ul.gz
    1701513 -rwxr-xr-x 1 root root         90 Aug 14 12:17 henny_penny.sh
    # cat henny_penny.sh
    ps -f
    ls -li all_card_trans.ul.gz
    date
    dd if=all_card_trans.ul.gz of=/dev/null bs=1M
    date
    # sh ./henny_penny.sh
    UID        PID  PPID  C STIME TTY          TIME CMD
    root     23283 23280  0 12:13 pts/0    00:00:00 -bash
    root     23849 23283  0 12:18 pts/0    00:00:00 sh ./henny_penny.sh
    root     23850 23849  0 12:18 pts/0    00:00:00 ps -f
    1701506 -rw-r--r-- 1 root root 1472114768 Aug 14 11:31 all_card_trans.ul.gz
    Fri Aug 14 12:18:12 PDT 2009
    1403+1 records in
    1403+1 records out
    1472114768 bytes (1.5 GB) copied, 0.30021 seconds, 4.9 GB/s
    Fri Aug 14 12:18:12 PDT 2009
In the following box you’ll see how things behaved in the right xterm. I invoked henny_penny.sh (parent PID 23422) and voilà, dd(1) was able to shovel the contents of all_card_trans.ul.gz into /dev/null at a rate of 6.0 GB/s. Now, that’s only 22% faster for a totally memory-bound, CPU-saturated task, so why would anyone other than Henny Penny care? Notice how the henny_penny.sh script included the output of the date(1) command. Just three seconds after “left” was muddling through at 4.9 GB/s, “right” proceeded to rip through at 6.0 GB/s. Yes, memory hierarchy matters.
    # sh ./henny_penny.sh
    UID        PID  PPID  C STIME TTY          TIME CMD
    root     23422 23420  0 12:14 pts/3    00:00:00 -bash
    root     23856 23422  0 12:18 pts/3    00:00:00 sh ./henny_penny.sh
    root     23857 23856  0 12:18 pts/3    00:00:00 ps -f
    1701506 -rw-r--r-- 1 root root 1472114768 Aug 14 11:31 all_card_trans.ul.gz
    Fri Aug 14 12:18:15 PDT 2009
    1403+1 records in
    1403+1 records out
    1472114768 bytes (1.5 GB) copied, 0.244703 seconds, 6.0 GB/s
    Fri Aug 14 12:18:15 PDT 2009
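The gap between the two runs can also be worked out from the elapsed times dd reported, rather than from the rounded GB/s figures. A quick sketch follows; speedup_pct is a helper name I invented, and awk is only there to do the floating-point division.

```shell
#!/bin/sh
# speedup_pct SLOW_SECONDS FAST_SECONDS
# Prints how much faster the second run was than the first, in percent.
# (Hypothetical helper name; awk does the floating-point math.)
speedup_pct() {
    awk -v slow="$1" -v fast="$2" \
        'BEGIN { printf "%.1f\n", (slow / fast - 1) * 100 }'
}

# Elapsed times reported by dd in the left and right xterms.
speedup_pct 0.30021 0.244703
```

This prints 22.7, in line with the roughly 22% you get from dividing 6.0 GB/s by 4.9 GB/s.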
How, What, Why?
The left xterm and its children happen to be executing on cores 0-3 (SMT is disabled at the moment, but no matter) and the right xterm on cores 4-7. The FTP process executed on one (or more) of cores 4-7, and since Linux prefers to allocate buffers for a process such as this from local memory, the file’s cached pages landed in the memory closest to cores 4-7. That is why henny_penny.sh in the right xterm achieved the throughput it did.
So who cares? Likely nobody, until the Xeon 5500 Linux production uptake actually starts! In the meantime there is me (Henny Penny) and a few curiously morbid (er, uh, morbidly curious) Googlers who might stumble upon this trivia.
What’s This Have To Do With Nehalem EX?
Well, even the 4-socket Nehalem EX packaging implements single-hop remote memory. That’s a significant difference from the way 4-sockets were done with HyperTransport. So, I actually don’t expect NUMAisms such as this to be any more painful than with EP (2 socket).
I Still Think He’s Henny Penny
So, let’s take another look at this topic. I’ve already mentioned that Linux likes to allocate memory close to processes when running on Nehalem systems. That’s good, isn’t it? Well, the answer is yes, of course, it depends.
In the text box below you’ll see how I depleted free memory on node 0 (down to 40 MB free) by writing zeros to a file. Consider yet another hypothetical with me for one moment. What happens when I execute, say, 100 processes that each allocate a moderate 16 MB of memory with malloc(3)? Do you think Linux will yank these processes from me, their parent, and place them on node 1 or will they be homed on node 0 with their heaps allocated from node 1? Will it matter? What if they are producers and I am their consumer? Where should they execute? What if they each work on 1/100th of the dumb_test.out file, reading into their respective heaps? Well, at this point there is no way for 100 processes on node 0 (socket 0) to attack 1/100th segments of that file (buffering in their heaps) without 100% remote memory overhead, because node 0 has no free memory left for those heaps. Could such a “bizarre” hypothetical happen in production? Sure. Is there any way to properly deal with such an issue? Well, yes and no.
If the hypothetical “1/100th program” were coded to libnuma, it could ensure process placement and therefore a local heap. However, what about the fact that my work file is buffered entirely in node 0 memory? Wouldn’t that guarantee 100% local access to node 0 users of that file but 100% remote for node 1 users? Yes. That’s great for the node 0 users you might say. However, those node 0 users had better not malloc(3) any memory because you know where that memory is going to come from. ‘Round and ’round we go…
    # numactl --hardware
    available: 2 nodes (0-1)
    node 0 size: 8052 MB
    node 0 free: 5946 MB
    node 1 size: 8080 MB
    node 1 free: 7987 MB
    node distances:
    node   0   1
      0:  10  20
      1:  20  10
    # time dd if=/dev/zero of=dumb_test.out bs=1M count=5946;numactl --hardware
    5946+0 records in
    5946+0 records out
    6234832896 bytes (6.2 GB) copied, 6.07315 seconds, 1.0 GB/s

    real    0m6.091s
    user    0m0.003s
    sys     0m6.069s
    available: 2 nodes (0-1)
    node 0 size: 8052 MB
    node 0 free: 40 MB
    node 1 size: 8080 MB
    node 1 free: 7652 MB
    node distances:
    node   0   1
      0:  10  20
      1:  20  10
So, what if I cloak my test with libnuma attributes (inherited by dd from numactl(8))? In the following text box you’ll see that instead of a Cyclops, memory was allocated nice and evenly for the page cache when I filled out the dumb_test.out file. So in this model, processes homed on either node 0 or node 1 are guaranteed a 50% local access rate when accessing dumb_test.out and I am protected from memory imbalances. In fact, if it were my system and I had to stay with NUMA, I’d consider invoking shells under numactl --interleave. As such, any non-NUMA-aware programs (like FTP) will be granted memory in a round-robin fashion but any NUMA-aware program (coded to libnuma calls) will execute as it would without being wrapped with numactl. It’s just a thought. It isn’t any official recommendation and, as my email in-box suggests, it doesn’t matter anyway…nonetheless, I think the following looks better than a Cyclops:
    # numactl --interleave=0,1 /bin/bash
    # numactl -s
    policy: interleave
    preferred node: 0 (interleave next)
    interleavemask: 0 1
    interleavenode: 0
    physcpubind: 0 1 2 3 4 5 6 7
    cpubind: 0 1
    nodebind: 0 1
    membind: 0 1
    # numactl --hardware
    available: 2 nodes (0-1)
    node 0 size: 8052 MB
    node 0 free: 5957 MB
    node 1 size: 8080 MB
    node 1 free: 7988 MB
    node distances:
    node   0   1
      0:  10  20
      1:  20  10
    # dd if=/dev/zero of=dumb_test.out bs=1M count=5957
    5957+0 records in
    5957+0 records out
    6246367232 bytes (6.2 GB) copied, 6.24962 seconds, 999 MB/s
    # numactl --hardware
    available: 2 nodes (0-1)
    node 0 size: 8052 MB
    node 0 free: 2825 MB
    node 1 size: 8080 MB
    node 1 free: 4854 MB
    node distances:
    node   0   1
      0:  10  20
      1:  20  10
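To put the two dd runs side by side, here is a rough sketch that subtracts the "free" figures in a before and an after snapshot of numactl --hardware. node_delta_mb is a name I made up, and the snapshot files contain only the free lines copied from the two listings above; on a live system you would save the full numactl --hardware output to the files instead.

```shell
#!/bin/sh
# node_delta_mb BEFORE_FILE AFTER_FILE
# Each file holds saved `numactl --hardware` output. Prints how many MB of
# free memory each node lost between the two snapshots.
# The function name is hypothetical, made up for this sketch.
node_delta_mb() {
    awk '
        $1 == "node" && $3 == "free:" {
            if (FNR == NR)
                before[$2] = $4        # still reading the first file
            else
                printf "node %s consumed %d MB\n", $2, before[$2] - $4
        }' "$1" "$2"
}

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Plain dd run (default policy): free memory before and after.
printf 'node 0 free: 5946 MB\nnode 1 free: 7987 MB\n' > "$tmp/plain.before"
printf 'node 0 free: 40 MB\nnode 1 free: 7652 MB\n'   > "$tmp/plain.after"

# dd run from the shell started with numactl --interleave=0,1.
printf 'node 0 free: 5957 MB\nnode 1 free: 7988 MB\n' > "$tmp/il.before"
printf 'node 0 free: 2825 MB\nnode 1 free: 4854 MB\n' > "$tmp/il.after"

echo "default policy:"
node_delta_mb "$tmp/plain.before" "$tmp/plain.after"
echo "interleave=0,1:"
node_delta_mb "$tmp/il.before" "$tmp/il.after"
```

With the numbers from the listings, the default-policy run consumed 5906 MB from node 0 and 335 MB from node 1, while the interleaved run consumed 3132 MB and 3134 MB respectively.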