I’ve read through the full disclosure report (FDR) from Oracle’s January 2012 TPC-C. I found that the result was obtained without using any NUMA init.ora parameters (e.g., _enable_NUMA_support). The storage was a collection of Sun x64 servers running COMSTAR to serve up F5100 flash storage, connected via 8GFC Fibre Channel. This was a non-RAC result on an 8-socket, 80-core, 160-thread (8s80c160t) Xeon E7 server. The only things that stand out to me are:
- The setting of disk_asynch_io=TRUE. This was ASM on raw disk, so I should think async I/O would be the default. Interesting.
- Overriding the default number of DBWR processes by setting db_writer_processes. The default number of DBWR processes would be 20, so the benchmark team increased that by 60%. Since sockets are NUMA “nodes” on this architecture, the default of 20 would render 2.5 DBWR processes per “node.” In my experience it is beneficial to have the number of DBWR processes be an even multiple of the number of sockets (NUMA nodes), so if the benchmark team was thinking the way I think, they went with 4x the socket count (a quick sketch of that arithmetic follows this list).
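A quick sketch of that arithmetic (the socket, core, and thread counts come from the system description above; the 4-writers-per-socket multiplier is my own rule of thumb, not something stated in the FDR):

#!/bin/bash
# Back-of-the-envelope DBWR arithmetic for the 8-socket, 160-thread x4800 (assumed counts).
sockets=8
threads=160
default_dbwr=$(( threads / 8 ))            # default db_writer_processes, i.e. 20, or 2.5 per socket
per_socket=4                               # assumed target of 4 DBWR per socket (NUMA node)
tuned_dbwr=$(( sockets * per_socket ))     # 32, a 60% increase over the default of 20
echo "default=${default_dbwr} tuned=${tuned_dbwr}"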
The FDR is here: http://c970058.r58.cf2.rackcdn.com/fdr/tpcc/Oracle_X4800-M2_TPCC_OL-UEK-FDR_011712.pdf
For more information about the missing _enable_NUMA_support parameter see: Meet _enable_NUMA_support: The if-then-else Oracle Database 11g Release 2 Initialization Parameter.
For a lot more about NUMA as it pertains to Oracle, please visit: QPI-Based Systems Related Topics (e.g., Nehalem EP/EX, Westmere EP, etc)
On the topic of increasing DBWR processes I’d like to point out that doing so isn’t one of those “some is good so more must be better” situations. For more reading on that matter I recommend:
Over-Configuring DBWR Processes Part I
Over-Configuring DBWR Processes Part II
Over-Configuring DBWR Processes Part III
Over-Configuring DBWR Processes Part IV
The parameters:
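The FDR contains the complete init.ora listing. What follows is only a minimal sketch of the two settings called out above, reconstructed from the discussion, not copied from the FDR page:

# Illustrative init.ora fragment only -- see the FDR for the actual, complete listing.
disk_asynch_io=TRUE        # explicitly set, although it should already be the default for ASM on raw devices
db_writer_processes=32     # 4 DBWR per socket on the 8-socket server, versus the default of 20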
Got A Big NUMA Box For Running Oracle? Take Care To Get Interrupt Handling Spread Across The Sockets Evenly
Page 310 of the FDR shows the following script used to arrange good affinity between the FC HBA device drivers and the sockets. I had to do the same sort of thing with the x4800 (aka Exadata X2-8) back before I left Oracle’s Exadata development organization. This sort of thing is standard but I wanted to bring the concept to your attention:
#!/bin/bash
service irqbalance stop
last_node=-1
declare -i count=0
declare -i cpu cpu1 cpu2 cpu3 cpu4
for dir in /sys/bus/pci/drivers/qla2xxx/0000*
do
        node=`cat $dir/numa_node`
        irqs=`cat $dir/msi_irqs`
        if [ "`echo $irqs | wc -w`" != "2" ] ; then
                echo >&2 "script expects 2 interrupts per device"
                exit 1
        fi
        first_cpu=`sed 's/-.*//' < $dir/local_cpulist`
        echo $node $irqs $first_cpu $dir
done | sort | while read node irq1 irq2 cpu1 dir
do
        cpu2=$cpu1+10
        cpu3=$cpu1+80
        cpu4=$cpu1+90
        if [ "$node" != "$last_node" ]
        then
                count=1
                cpu=$cpu1
        else
                count=$count+1
                case $count in
                2) cpu=$cpu2;;
                3) cpu=$cpu3;;
                4) cpu=$cpu4;;
                *) echo "more devices than expected on node $node"
                   count=1
                   cpu=$cpu1;;
                esac
        fi
        last_node=$node
        echo "#$dir"
        echo "echo $cpu > /proc/irq/$irq1/smp_affinity_list"
        echo "echo $cpu > /proc/irq/$irq2/smp_affinity_list"
        echo
        echo $cpu > /proc/irq/$irq1/smp_affinity_list
        echo $cpu > /proc/irq/$irq2/smp_affinity_list
done
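Not part of the FDR, but a quick way to spot-check what a script like the one above did is to walk the same qla2xxx sysfs entries and print the NUMA node and resulting IRQ affinity for each HBA function:

#!/bin/bash
# Hypothetical verification helper (not from the FDR): show where each FC HBA's
# MSI interrupts ended up relative to the device's NUMA node.
for dir in /sys/bus/pci/drivers/qla2xxx/0000*
do
        node=`cat $dir/numa_node`
        for irq in `cat $dir/msi_irqs`
        do
                echo "$dir node=$node irq=$irq cpus=`cat /proc/irq/$irq/smp_affinity_list`"
        done
done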
That’s strange, I see pga_aggregate_target=0 and parallel_max_servers=0. Why is that?
It’s TPC-C, so none of the Parallel Query Option related settings are needed. It’s a common benchmark technique to shut off all the lights that aren’t needed, so to speak 🙂
OK, at the risk of being stupid: why is it interesting that the disk_asynch_io parameter is set to true?
Isn’t this parameter always true by default, and here just being listed as one of the parameters (just like undo_management)?
Hi Freek,
I’m not trying to raise red flags. I just pointed out the only settings that looked odd. There are hundreds of settings in the instance that were left to default (not specifically set). It just seemed odd to me that this one would be necessary to set. Full disclosure: I haven’t taken the time to see what the default for that setting is. I should think TRUE is the default.
Could you have the script at the bottom of your post either in a downloadable form, or maybe embedded in a syntax highlighting plugin (like wp-syntax) so it can be read more easily? The long line is really goofing up the rendering.
–Jason
No NUMA support because of a glueless architecture vs., say, HP’s PREMA architecture.
The cost of a miss might not be that expensive on an x4800 with evenly distributed IRQs and the use of an OS that is much more stringent with I/O/processor affinity.
Still seems like Oracle would want to run a TPC-C with Exadata, though.
Maybe they don’t want to because it will be replaced with the SUN SuperCluster platform…..
It is certainly NUMA, Matt. The lack of “glue” (a la PREMA or X5) doesn’t disqualify the x4800 from being able to exploit NUMA (by enabling built-in Oracle NUMA support via _enable_NUMA_support). It’s quite the opposite, in fact: the tighter the NUMA architecture, the less need for application software NUMA awareness. I’m quite surprised the x4800 was able to do a decent TPC-C without getting the age-old _enable_NUMA_support underpinnings functioning well in the database code and enabled.
Now, having said all that, it may just be the case that the Oracle database bits they used enable the NUMA awareness code by default based on socket-count discovery. I don’t know, because I no longer have access to the source for Oracle Database.
Finally, while the 4,803,718 TpmC is a huge number, on a per-core basis it still shows that big machines are usually not fast machines: 4.8 million on this 80-core platform works out to roughly 60,000 TpmC/core, whereas the recent Cisco UCS+Oracle result of 1,053,100 on 12 WSM-EP cores comes out to roughly 88,000 TpmC/core. The 4.8 million result also requires a database roughly 4.6 times larger than the Cisco UCS result, as per the TPC-C rules of scale (throughput, and thus minimum database scale, differs by 4,803,718/1,053,100, or about 4.56x). That’s a shame, because it would be really interesting to see what the x4800 could do running the scale used on the 12-core Cisco UCS. When you buy software on a per-core basis it’s nice to compare apples to apples on a per-core basis.
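For the curious, here is the per-core and scale arithmetic behind that comparison, using only the core counts and TpmC figures quoted above (bc just does the division; nothing here comes from the FDRs beyond those numbers):

#!/bin/bash
# Per-core throughput and relative database scale for the two results discussed above.
echo "x4800 per core:   $(echo 'scale=0; 4803718 / 80' | bc) TpmC/core"     # roughly 60,000
echo "UCS per core:     $(echo 'scale=0; 1053100 / 12' | bc) TpmC/core"     # roughly 88,000
echo "Throughput ratio: $(echo 'scale=2; 4803718 / 1053100' | bc)x"         # ~4.56x, hence ~4.6x the database scale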
Thanks for pointing this report out. One thing that jumps out at me is this, from page 47:
“Product Availability Date:
Oracle Database 11g Release 2 with Partitioning for OEL 6.1 – June 26, 2012”
I’m new to the Oracle on Linux world, as I’m prepping to move off AIX. I may have missed something, but I’ve heard no dates for 11gR2 certification on OEL 6. Is this the first word we’ve heard from Oracle that they’re finally adding RHEL/OEL 6 to the certification matrix?
Interesting find, Ryan. The old trick is to pull the result before the six-month availability window runs out if things don’t meet dates. That’s fair play. We’ll see.