Archive for the 'oracle' Category



Something to Ponder? What Sort of Powerful Offering Could a Filesystem in Userspace Be?

I Jest Not, But I Speak In Fragments Today
Imagine, just for a moment, the possibilities. Ponder, with me, the richness of features a Filesystem in User Space could offer when said user-space code has the powerful backing of a real database.

Just imagine.

Yes, this is a Teaser Post™

Little Things Doth Crabby Make – Part VII. Oracle Database 11g Index Fragmentation?

I just checked my list of “miscellaneous” posts and see that it has been quite a while since I blogged a rant. I think it is time for another installment in the Little Things Doth Crabby Make series. Unusually, however, this time it isn’t I whom the little thing hath made crabby. No, it’s got to be Microsoft on the crabby end, although I contend it is no “little thing” in this case.

According to Ed Bott’s blog post on the matter, Microsoft has fingered SQL Server as the culprit behind a recent service outage on MSDN and TechNet. What’s that adage? Eating one’s own dog-food? Anyway, the supposed SQL Server problem was database fragmentation. Huh? The tonic? According to Ed:

I’m told that Microsoft engineers are now monitoring the status of this database every 30 minutes and plan to rebuild the indexes every evening to avoid a recurrence of the problem.

How fun…playing with indexes—nightly!

And, yes, the title was a come-on. Oracle Database 11g fragmentation? Puh-leeeeze.

Temporary Link to Edited Webcast Video: Oracle Exadata Storage Server Technical Deep Dive – Part I.

As some of you found out, the original Part I archived webcast in this series suffered technical failures on playback about 8 minutes into the video.

I have sent an edited version that cleans that up (special thanks to the Exadata PM team for that effort). The improved version supports play-bar dragging so you can fast-forward into the webcast. That comes in handy because this version still has some dead air through the first 4 minutes and 30 seconds or so. With the newly edited video you can simply start playback at 4m30s.

The IOUG Exadata SIG folks are all at Collaborate 2009 so they won’t be mending their website to vend this edited version of Part I until next week. In the interim, there are a limited number of available downloads at the link earmarked as TEMPORARY at the URL supplied below.

Note, Part II included a section that recapped some of the material from Part I because, starting at about slide 34, I was rushed for time and mixed up some MB/s versus total MB citations. That is, my rushed words at some points didn’t match the values on the slides. The slides were right and I was wrong…it’s usually the other way around, but as I say, “Sometimes man bites dog.” 🙂

Archived Webcasts: Oracle Exadata Storage Server Technical Deep Dive Part I and Part II.

Fun With Intel Xeon 5500 Nehalem and Linux cpuspeed(8) Part I.

Intel Xeon 5500 (Nehalem) CPUs–Fast, Slow, Fast, Slow. CPU Throttling Is A Good Thing. Really, It Is!
I’m still blogging about Xeon 5500 (Nehalem) SMT (Simultaneous Multi-Threading), but this installment is a slight variation from Part I and Part II. This installment has to do with cpuspeed(8). Please be aware that this post is part one in a series.

One of the systems I’m working on enables cpu throttling via cpuspeed:


# chkconfig --list cpuspeed
cpuspeed        0:off   1:on    2:on    3:on    4:on    5:on    6:off

So I hacked out this quick and dirty script to give me single-line output showing each CPU (see this post about Linux OS CPU to CPU thread mapping with Xeon 5500) and its current clock rate. The script is called howfast.sh and its listing follows:

#!/bin/bash
egrep 'processor|MHz' /proc/cpuinfo | sed 's/^.*: //g' | xargs echo

The following is an example of the output. It shows that currently all 16 processor threads are clocked at 1600 MHz. That’s ok with me because nothing is executing that requires “the heat.”

# ./howfast.sh
0 1600.000 1 1600.000 2 1600.000 3 1600.000 4 1600.000 5 1600.000 6 1600.000 7 1600.000 8 1600.000 9 1600.000 10 1600.000 11 1600.000 12 1600.000 13 1600.000 14 1600.000 15 1600.000
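As an aside (my addition, not part of the original scripts): the cpufreq driver behind cpuspeed exposes the same per-CPU clock data in sysfs, so a variant of howfast.sh can read scaling_cur_freq directly. The paths are the standard cpufreq entries, but whether they are populated depends on the driver; the tree root is a parameter so the function can be tried against a copy of the tree:

```shell
#!/bin/bash
# sketch: read each CPU's current clock from the cpufreq sysfs files
# (scaling_cur_freq reports KHz, hence the divide by 1000)
howfast_sysfs() {
    local root=${1:-/sys/devices/system/cpu} c
    for c in "$root"/cpu[0-9]*; do
        echo "${c##*cpu} $(( $(cat "$c/cpufreq/scaling_cur_freq") / 1000 ))"
    done | sort -n | xargs echo
}
```

On a live system, `howfast_sysfs` with no argument prints the same one-line CPU/clock pairing as howfast.sh, in whole MHz.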

So the question becomes just what does it take to heat these processors up? Let’s take a peek…

Earth To CPU: Hello, Time To Get a Move On
The following script, called busy.sh, is simple. It runs a sub-shell on named processors looping the shell “:” built-in. But don’t confuse “:” with “#.” I’ve seen people use “:” as a comment marker…bad, bad (or at least it used to be when people cared about systems). Anyway, back to the train of thought. Here is the busy.sh script:

#!/bin/bash

function busy() {
local SECS=$1
local WHICH_CPU=$2
local brickwall=0
local x=0

# NB: inside a ( ... ) subshell, bash expands $$ to the parent script's PID,
# so this call actually pins the parent; on bash 4+, $BASHPID pins the
# subshell itself
taskset -pc $WHICH_CPU $$ > /dev/null 2>&1
x=$SECONDS
(( brickwall = $x + $SECS ))

until [ $SECONDS -ge $brickwall ]
do
    :
done
}
#--------------
SECS=$1
CPU_STRING="$2"
#(mpstat -P ALL $SECS 1 > mpstat.out 2>&1 )&

for CPU in `echo $CPU_STRING`
do
    ( busy $SECS "$CPU" ) &
done
wait
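Since I brought up the “:” built-in, here is a quick demonstration (my addition) of why it is not a comment marker: the line is fully parsed, expansions run, and the command simply succeeds.

```shell
#!/bin/bash
# ':' is the shell's null utility: arguments are expanded, then ignored,
# and the command returns success
: any words here are evaluated as arguments and discarded
echo $?                          # prints 0
# because the line is parsed, side effects in expansions take place:
: ${CRABBY_DEMO:=set-by-colon}
echo "$CRABBY_DEMO"              # prints set-by-colon
# a true '#' comment, by contrast, never reaches the parser at all
```

That `${VAR:=default}` expansion firing under “:” is exactly why it is dangerous as a pretend comment.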

Let’s see what happens when I execute busy.sh to hammer all 16 processor threads. I’ll first use howfast.sh to get a current reading. I’ll then set busy.sh in motion to whack on all 16 processors after which I immediately check what howfast.sh has to say about them.

#  howfast.sh;sh ./busy.sh 30 '0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15';howfast.sh
0 1600.000 1 1600.000 2 1600.000 3 1600.000 4 1600.000 5 1600.000 6 1600.000 7 1600.000 8 1600.000 9 1600.000 10 1600.000 11 1600.000 12 1600.000 13 1600.000 14 1600.000 15 1600.000
0 2934.000 1 2934.000 2 2934.000 3 2934.000 4 2934.000 5 2934.000 6 2934.000 7 2934.000 8 2934.000 9 2934.000 10 2934.000 11 2934.000 12 2934.000 13 2934.000 14 2934.000 15 2934.000

Boring Your Readers To Death For Fun, Not Profit
Wow, this is such an interesting blog post, isn’t it? You’re wondering why I’ve wasted your time, right?

Let’s allow the processors to cool down again and take a slightly different look. In fact, perhaps I should run a multiple command sequence where I start with 120 seconds of sleep followed by howfast.sh and then busy.sh. But, instead of busy.sh targeting all processors, I’ll run it only on OS processor 0. I’ll follow that up immediately with a check of the clock rates using howfast.sh:

# sleep 120; howfast.sh;sh ./busy.sh 30 0;howfast.sh
0 1600.000 1 1600.000 2 1600.000 3 1600.000 4 1600.000 5 1600.000 6 1600.000 7 1600.000 8 1600.000 9 1600.000 10 1600.000 11 1600.000 12 1600.000 13 1600.000 14 1600.000 15 1600.000
0 1600.000 1 2934.000 2 1600.000 3 2934.000 4 1600.000 5 2934.000 6 1600.000 7 2934.000 8 1600.000 9 2934.000 10 1600.000 11 2934.000 12 1600.000 13 2934.000 14 1600.000 15 2934.000

Huh? I stress processor 0 but processors 1, 3, 5, 7, 9, 11, 13 and 15 heat up? That’s weird, and I don’t understand it, but I’ll be investigating. I wonder what other interesting findings lurk. What happens if I stress processor 1? I think I should start putting date commands in there too. Let’s see what happens:

# ./howfast.sh;date;./busy.sh 30 1;date;./howfast.sh
0 1600.000 1 1600.000 2 1600.000 3 1600.000 4 1600.000 5 1600.000 6 1600.000 7 1600.000 8 1600.000 9 1600.000 10 1600.000 11 1600.000 12 1600.000 13 1600.000 14 1600.000 15 1600.000
Fri May  1 14:30:57 PDT 2009
Fri May  1 14:31:27 PDT 2009
0 1600.000 1 2934.000 2 1600.000 3 2934.000 4 1600.000 5 2934.000 6 1600.000 7 2934.000 8 1600.000 9 2934.000 10 1600.000 11 2934.000 12 1600.000 13 2934.000 14 1600.000 15 2934.000
#

Ok, that too is odd. I stress thread 0 in either core 0 or 1 of socket 0 and I get OS processors 1, 3, 5, 7, 9, 11, 13 and 15 heated up. I wonder what would happen if I hammer on the primary thread of all the cores of socket 0? Let’s see:

# ./howfast.sh;date;./busy.sh 30 '0 1 2 3';date;./howfast.sh
0 1600.000 1 1600.000 2 1600.000 3 1600.000 4 1600.000 5 1600.000 6 1600.000 7 1600.000 8 1600.000 9 1600.000 10 1600.000 11 1600.000 12 1600.000 13 1600.000 14 1600.000 15 1600.000
Fri May  1 14:40:51 PDT 2009
Fri May  1 14:41:21 PDT 2009
0 2934.000 1 2934.000 2 2934.000 3 2934.000 4 2934.000 5 2934.000 6 2934.000 7 2934.000 8 2934.000 9 2934.000 10 2934.000 11 2934.000 12 2934.000 13 2934.000 14 2934.000 15 2934.000

Hmmm. Hurting cores 0 and 1 wasn’t enough to unleash the dogs but hammering all the cores in that socket proved sufficient. Of course it seems odd to me that it would heat up all threads in all cores on both sockets. But this is a blog entry of observations only at this point. I’ll post more about this soon.

Would it surprise anyone if I got the same result from beating on the primary thread of all 4 cores in socket 1? It shouldn’t:

# sleep 120;./howfast.sh;date;./busy.sh 30 '4 5 6 7';date;./howfast.sh
0 1600.000 1 1600.000 2 1600.000 3 1600.000 4 1600.000 5 1600.000 6 1600.000 7 1600.000 8 1600.000 9 1600.000 10 1600.000 11 1600.000 12 1600.000 13 1600.000 14 1600.000 15 1600.000
Fri May  1 14:44:33 PDT 2009
Fri May  1 14:45:03 PDT 2009
0 2934.000 1 2934.000 2 2934.000 3 2934.000 4 2934.000 5 2934.000 6 2934.000 7 2934.000 8 2934.000 9 2934.000 10 2934.000 11 2934.000 12 2934.000 13 2934.000 14 2934.000 15 2934.000

Consider this installment number one in this series…

And, before I forget, I nearly made it through this post without mentioning NUMA. These tests were run with NUMA disabled. Can anyone guess why that matters?

Quick Update: New Page Added to the Blog. Intel Xeon 5500 (Nehalem) Related Posts.

Just a quick blog post to point out that I have added a page specifically to index Intel 5500 Xeon (Nehalem) related posts. The page is under the Index of My Posts page.  Here is a quick link: Intel Xeon 5500 (Nehalem) Related Topics.

How To Produce Raw, Spreadsheet-Ready Physical I/O Data With PL/SQL. Good For Exadata, Good For Traditional Storage.

Several folks who read the Winter Corporation Exadata Performance Assessment have asked me what method I used to produce the throughput timeline graphs. I apologize to them for taking so long to follow this up.

The method I used to produce that data is a simple PL/SQL loop that evaluates differences in gv$sysstat contents over time and appends its output to a file in the filesystem. Of course there are a lot of other ways to get this data, not the least of which are tools such as ASH. However, in my opinion, this is a nice technique for getting raw data in an uploadable format ready for Excel. Er, uh, I suppose I’m supposed to say OpenOfficeThisOrThatExcelLookAlikeThingy aren’t I? Oh well.

The following is a snippet of the output from the tool. This data was collected during a very lazy moment of SQL processing using a few Exadata Storage Server cells as the database storage. I simply tail(1) the output file to see the aggregate physical read and write rates in 5-second intervals. The columns are (from left to right) time of day, total physical I/O, physical read, physical write. Throughput values are in megabytes per second.


$ tail -f /tmp/mon.log
11:51:43|293|185|124|
11:51:49|312|190|102|
11:51:55|371|234|137|
11:52:00|360|104|257|
11:52:06|371|245|145|
11:52:11|378|174|217|
11:52:16|377|251|122|
11:52:21|431|382|83|
11:52:26|385|190|180|
11:52:32|244|127|140|
11:52:37|445|329|106|
11:52:42|425|301|101|
11:52:47|391|214|177|
11:52:53|260|60|200|
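To make the log literally spreadsheet-ready, a small awk wrapper (my addition, assuming the four-column pipe-delimited format shown above) converts it to CSV with a header row:

```shell
#!/bin/bash
# convert the monitor's pipe-delimited output to CSV; reads file
# arguments or stdin, just like awk itself
mon2csv() {
    awk -F'|' 'BEGIN { print "time,total_mb,read_mb,write_mb" }
               NF >= 4 { print $1 "," $2 "," $3 "," $4 }' "$@"
}
```

For example, `mon2csv /tmp/mon.log > mon.csv` produces a file that loads straight into a spreadsheet.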

The following is the PL/SQL script. This script should be cut-and-paste ready to go.


set serveroutput on format wrapped size 1000000

create or replace directory mytmp as '/tmp';

DECLARE
n number;
m number;

mb number := 1024 * 1024 ;

bpio number;  -- physical IO disk bytes (before/after the sleep)
apio number;
disp_pio number(8,0);

bptrb number; -- physical read total bytes
aptrb number;
disp_trb number(8,0);

bptwb number; -- physical write total bytes
aptwb number;
disp_twb number(8,0);

fd1 UTL_FILE.FILE_TYPE;
BEGIN
        fd1 := UTL_FILE.FOPEN('MYTMP', 'mon.log', 'w');

        -- loops until the session is killed
        LOOP
                -- select by statistic name rather than hard-coded
                -- statistic# values, which can change between releases
                select sum(value) into bpio  from gv$sysstat where name = 'physical IO disk bytes';
                select sum(value) into bptwb from gv$sysstat where name = 'physical write total bytes';
                select sum(value) into bptrb from gv$sysstat where name = 'physical read total bytes';

                n := DBMS_UTILITY.GET_TIME;
                DBMS_LOCK.SLEEP(5);

                select sum(value) into apio  from gv$sysstat where name = 'physical IO disk bytes';
                select sum(value) into aptwb from gv$sysstat where name = 'physical write total bytes';
                select sum(value) into aptrb from gv$sysstat where name = 'physical read total bytes';

                -- GET_TIME ticks in centiseconds, so m/100 is elapsed seconds
                m := DBMS_UTILITY.GET_TIME - n ;

                disp_pio := ( (apio - bpio)   / ( m / 100 )) / mb ;
                disp_trb := ( (aptrb - bptrb) / ( m / 100 )) / mb ;
                disp_twb := ( (aptwb - bptwb) / ( m / 100 )) / mb ;

                UTL_FILE.PUT_LINE(fd1, TO_CHAR(SYSDATE,'HH24:MI:SS') || '|' || disp_pio || '|' || disp_trb || '|' || disp_twb || '|');
                UTL_FILE.FFLUSH(fd1);
        END LOOP;
END;
/
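And once mon.log has accumulated a few samples, averages fall out of the same format with a short awk pass (again my addition, assuming the pipe-delimited columns shown in the sample output):

```shell
#!/bin/bash
# average the read and write throughput columns of mon.log;
# reads file arguments or stdin
avgmon() {
    awk -F'|' 'NF >= 4 { r += $3; w += $4; n++ }
               END { if (n) printf "samples=%d avg_read=%.1f avg_write=%.1f\n", n, r/n, w/n }' "$@"
}
```

Running `avgmon /tmp/mon.log` gives a one-line summary without leaving the shell.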

So, while it isn’t rocket-science, I hope it will be a helpful tool for at least a few readers and the occasional wayward googler who stops by…

Linux Thinks It’s a CPU, But What Is It Really – Part II. Trouble With the Intel CPU Topology Tool?

Yesterday I made a blog entry about the Intel CPU topology tool to help understand Xeon 5500 SMT and how it maps to Linux OS CPUs. I received a few emails about the tool. Some folks were having trouble figuring it out on their systems (the tool works on other Xeons too).

This is just a quick blog entry to explain the tool for those readers and the possible future wayward googler.

In the following session snapshot you’ll see that the CPU topology tool tar file is called topo03062009.tar. In the session I do the following:

  1. Extract the tarball
  2. Change directories into the directory created by the tar extraction
  3. Run the make for 64 bit Linux
  4. Ignore the warnings
  5. Run ls(1) to see what I picked up. Hmmm, there are no file names that appear to be executable.
  6. I look into the script that builds the tool. I see the binary is produced into cpu_topology64.out. (Uh, I think even a.out would have been more intuitive.)
  7. I use file(1) to make sure it is an executable
  8. I run it but throw away all but the last 40 lines of output.

# ls -l topo*
-rw-r--r-- 1 root root 163840 Apr 13 21:16 topo03062009.tar
# tar xvf topo03062009.tar
topo/cpu_topo.c
topo/cputopology.h
topo/get_cpuid.asm
topo/get_cpuid_lix32.s
topo/get_cpuid_lix64.s
topo/Intel Source Code License Agreement.doc
topo/mk_32.bat
topo/mk_32.sh
topo/mk_64.bat
topo/mk_64.sh
topo/util_os.c
# cd topo
# sh ./mk_64.sh
cpu_topo.c: In function 'DumpCPUIDArray':
cpu_topo.c:1857: warning: comparison is always false due to limited range of data type
# ls
cpu_topo.c          get_cpuid_lix32.s                        mk_32.bat  util_os.c
cpu_topology64.out  get_cpuid_lix64.o                        mk_32.sh   util_os.o
cputopology.h       get_cpuid_lix64.s                        mk_64.bat
get_cpuid.asm       Intel Source Code License Agreement.doc  mk_64.sh
#
# more mk*64*sh
#!/bin/sh

gcc -g -c get_cpuid_lix64.s -o get_cpuid_lix64.o
gcc -g -c util_os.c
gcc -g -DBUILD_MAIN cpu_topo.c -o cpu_topology64.out get_cpuid_lix64.o util_os.o
#
#
# file cpu_topology64.out
cpu_topology64.out: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.6.9,
 dynamically linked (uses shared libs), for GNU/Linux 2.6.9, not stripped

# ./cpu_topology64.out  | tail -40
      +-----------------------------------------------+

Combined socket AffinityMask= 0xf0f

Package 1 Cache and Thread details

Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
CmbMsk is Combined AffinityMask(extended hex) for hw threads sharing cache
       CmbMsk will differ from AffMsk if > 1 hw_thread/cache
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------+-----------+-----------+-----------+
Cache |  L1D      |  L1D      |  L1D      |  L1D      |
Size  |  32K      |  32K      |  32K      |  32K      |
OScpu#|    4    12|    5    13|    6    14|    7    15|
Core  |c0_t0 c0_t1|c1_t0 c1_t1|c2_t0 c2_t1|c3_t0 c3_t1|
AffMsk|   10   1z3|   20   2z3|   40   4z3|   80   8z3|
CmbMsk| 1010      | 2020      | 4040      | 8080      |
      +-----------+-----------+-----------+-----------+

Cache |  L1I      |  L1I      |  L1I      |  L1I      |
Size  |  32K      |  32K      |  32K      |  32K      |
      +-----------+-----------+-----------+-----------+

Cache |   L2      |   L2      |   L2      |   L2      |
Size  | 256K      | 256K      | 256K      | 256K      |
      +-----------+-----------+-----------+-----------+

Cache |   L3                                          |
Size  |   8M                                          |
CmbMsk| f0f0                                          |
      +-----------------------------------------------+
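One more aside on reading the output (my addition): the AffMsk/CmbMsk values use the tool’s “extended hex” shorthand, where a trailing z<n> stands for n zeroes. A tiny helper expands the shorthand back to plain hex:

```shell
#!/bin/bash
# expand the topology tool's "extended hex" shorthand:
# a trailing z<n> means n zeroes, so 8z5 becomes 800000
expand_z() {
    awk -v s="$1" 'BEGIN {
        if (match(s, /z[0-9]+$/)) {
            head = substr(s, 1, RSTART - 1)
            n = substr(s, RSTART + 1) + 0
            while (n-- > 0) head = head "0"
            print head
        } else print s
    }'
}
```

So `expand_z 8z5` prints 800000, `expand_z 1z3` prints 1000, and a mask without the shorthand (such as f0f) passes through unchanged.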

Linux Thinks It’s a CPU, But What Is It Really – Part I. Mapping Xeon 5500 (Nehalem) Processor Threads to Linux OS CPUs.

Thanks to Steve Shaw, Database Technology Manager, Intel for pointing me to the magic decoder ring for associating Xeon 5500 (Nehalem) processor threads with Linux OS CPUs. Steve is an old acquaintance who I would gladly refer to as a friend but I’m not sure how Steve views the relationship. See, I was the technical reviewer of his book (Pro Oracle Database 10g RAC on Linux), which is a role that can make friends or frenemies I suppose. I don’t have any bad memories of the project, and Steve is still talking to me, so I think things are hunky dory.  OK, joking aside…but first, a bit more about Steve.

Steve writes the following on his website intro page (emphasis added by me):

I’m Steve Shaw and for over 10 years have specialised in working with the Oracle database. I have a background with Oracle on various flavours of UNIX including HP-UX, Sun Solaris and my own personal favourite Dynix/ptx on Sequent.

Sequent? I’ve emerged from my ex-Sequent 12-step program! Indeed, that is a really good personal favorite to have. But, I’m sentimental, and I digress as well.

The Magic Decoder Ring
The web resource Steve provided is this Intel webpage containing information about processor topology. There is an Intel processor topology tool that really helps make sense of the mappings between processor cores and threads on Nehalem processors  and Linux OS CPUs.

What’s in the “Package?”
As we can see from that Intel webpage, and the processor topology tool itself, Intel often uses the term “package” when referring to what goes in a socket these days. Considering there are both cores and threads, I suppose there is justification for a more descriptive term. I still use socket/core/thread nomenclature though. It works for me. Nonetheless, let’s see what my Nehalem 2s8c16t system shows when I run the topology tool. First, let’s see the output from “package” number 0 (socket 0). There is a lot of output from the command. I recommend focusing on lines 20 and 21 (the OScpu# and Core rows) in the following text box:


Package 0 Cache and Thread details

Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
CmbMsk is Combined AffinityMask(extended hex) for hw threads sharing cache
       CmbMsk will differ from AffMsk if > 1 hw_thread/cache
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
L1D is Level 1 Data cache, size(KBytes)= 32,  Cores/cache= 2, Caches/package= 4
L1I is Level 1 Instruction cache, size(KBytes)= 32,  Cores/cache= 2, Caches/package= 4
L2 is Level 2 Unified cache, size(KBytes)= 256,  Cores/cache= 2, Caches/package= 4
L3 is Level 3 Unified cache, size(KBytes)= 8192,  Cores/cache= 8, Caches/package= 1
      +-----------+-----------+-----------+-----------+
Cache |  L1D      |  L1D      |  L1D      |  L1D      |
Size  |  32K      |  32K      |  32K      |  32K      |
OScpu#|    0     8|    1     9|    2    10|    3    11|
Core  |c0_t0 c0_t1|c1_t0 c1_t1|c2_t0 c2_t1|c3_t0 c3_t1|
AffMsk|    1   100|    2   200|    4   400|    8   800|
CmbMsk|  101      |  202      |  404      |  808      |
      +-----------+-----------+-----------+-----------+

Cache |  L1I      |  L1I      |  L1I      |  L1I      |
Size  |  32K      |  32K      |  32K      |  32K      |
      +-----------+-----------+-----------+-----------+

Cache |   L2      |   L2      |   L2      |   L2      |
Size  | 256K      | 256K      | 256K      | 256K      |
      +-----------+-----------+-----------+-----------+

Cache |   L3                                          |
Size  |   8M                                          |
CmbMsk|  f0f                                          |
      +-----------------------------------------------+

From the output we can decipher that Linux OS CPU 0 resides in socket 0, core 0, thread 0. That much is straightforward. On the other hand, the tool adds value by showing us that Linux OS CPU 8 is actually the second processor thread in socket 0, core 0. And, of course, “package” 1 follows suit:


Package 1 Cache and Thread details

Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
CmbMsk is Combined AffinityMask(extended hex) for hw threads sharing cache
       CmbMsk will differ from AffMsk if > 1 hw_thread/cache
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------+-----------+-----------+-----------+
Cache |  L1D      |  L1D      |  L1D      |  L1D      |
Size  |  32K      |  32K      |  32K      |  32K      |
OScpu#|    4    12|    5    13|    6    14|    7    15|
Core  |c0_t0 c0_t1|c1_t0 c1_t1|c2_t0 c2_t1|c3_t0 c3_t1|
AffMsk|   10   1z3|   20   2z3|   40   4z3|   80   8z3|
CmbMsk| 1010      | 2020      | 4040      | 8080      |
      +-----------+-----------+-----------+-----------+

Cache |  L1I      |  L1I      |  L1I      |  L1I      |
Size  |  32K      |  32K      |  32K      |  32K      |
      +-----------+-----------+-----------+-----------+

Cache |   L2      |   L2      |   L2      |   L2      |
Size  | 256K      | 256K      | 256K      | 256K      |
      +-----------+-----------+-----------+-----------+

Cache |   L3                                          |
Size  |   8M                                          |
CmbMsk| f0f0                                          |
      +-----------------------------------------------+

So, it goes like this:

Linux OS CPU Package Locale
0 S0_c0_t0
1 S0_c1_t0
2 S0_c2_t0
3 S0_c3_t0
4 S1_c0_t0
5 S1_c1_t0
6 S1_c2_t0
7 S1_c3_t0
8 S0_c0_t1
9 S0_c1_t1
10 S0_c2_t1
11 S0_c3_t1
12 S1_c0_t1
13 S1_c1_t1
14 S1_c2_t1
15 S1_c3_t1
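As a cross-check that doesn’t require the Intel tool (my addition; the topology files are standard sysfs entries, though their availability varies by kernel), the same placement can be derived from /sys. Two OS CPUs reporting the same S#_c# pair are the two SMT threads of that core:

```shell
#!/bin/bash
# map each Linux OS CPU to its socket (package) and core via sysfs;
# the tree root is a parameter so the function can be tried on a copy
cpumap() {
    local root=${1:-/sys/devices/system/cpu} c n
    for c in "$root"/cpu[0-9]*; do
        n=${c##*cpu}
        echo "$n S$(cat "$c/topology/physical_package_id")_c$(cat "$c/topology/core_id")"
    done | sort -n
}
```

On the 2s8c16t box above, `cpumap` should print each S#_c# pair twice, once for each hardware thread.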

By the way, the CPU topology tool works on other processors in the Xeon family.

Archived Webcasts: Oracle Exadata Storage Server Technical Deep Dive Part I and Part II.

BLOG UPDATE 08-MAR-2011: Please visit The Papers, Webcasts, Files, Etc section of my blog for the content referenced below. Thank you. The Original post follows:

This is just a quick blog entry to offer pointers to the IOUG recorded webcasts I did on March 25 and April 16 2009. These are parts 1 and 2 in a series I’m calling “Technical Deep Dive.” I hope at least some of the material seems technical and/or deep for folks who choose to view them. I don’t aim to waste people’s time.

As for audio/video quality, these things can sometimes be a bit hit and miss since there are several moving parts, what with the merge of my dial-in voice stream and the GotoMeeting.com video side of the webcast. I haven’t played them back in their entirety so I can’t vouch. The format is Windows Media.

The idea behind offering this material is to aid the IOUG Exadata Special Interest Group in gaining Exadata-related knowledge that goes a bit further than the more commonly available collateral. As I see it, the less “spooky” this technology appears to Oracle’s install base, the more likely they are to steer their next DW/BI deployment over to Exadata. Or, perhaps, even a migration of any currently long-in-the-tooth Oracle DW/BI deployment in need of a hardware refresh!

Note: There was some AV trouble in the first 4 minutes, 30 seconds of Part I. I recommend you right-click, save it, then when you view it drag the progress bar to roughly 4m30s and let it play from there.

(TEMPORARY LINK) 25-MAR-09 – Oracle Exadata Storage Server Technical Deep Dive – Part I

(PERMANENT LINK. Do Not Use Until Further Notice) 25-MAR-09  – Oracle Exadata Storage Server Technical Deep Dive – Part I

16-APR-09   – Oracle Exadata Storage Server Technical Deep Dive – Part II

Off Topic: A Couple of Photographs Added to the Blog

Recently Added Photographs

Enjoy!

Last-Minute Webcast Reminder. Oracle Exadata Storage Server Technical Deep Dive – Part II.

Just a quick, last-minute reminder:

Webcast Announcement: Oracle Exadata Storage Server Technical Deep Dive – Part II.

“Feel” Your Processor Cache. Oracle Does. Part II.

That’s funny, but I had a sneaking suspicion it was going to happen, so….

In yesterday’s post entitled “Feel” Your Processor Cache. Oracle Does. Part I., I pointed out that the newest arrival in the ever-growing crowd of in-memory open source database products and accelerators got it a bit wrong when they described what a level-two processor cache is. However, before I made that blog post I took a screen shot of the Csqlcache blog. Notice the description of L2:

csqlcache_before

"Before" Description: L2 Cache Soldered to the Motherboard

I just checked and it looks like they took the hint, though in what some would consider poor blogging style: they simply changed the text rather than marking the edit to draw attention to the change. But that’s not why I’m blogging. The fact that someone made a blog correction is not interesting to me. Please see the “after” rendition in the next screen shot:

csclecache_after

"After" Description: L2 On-die.

So, yes, they took the hint that expressing cache latency in wall clock time is messy, but they now cite fixed latencies for L1, L2 and memory. First off, a 5-cycle L1 would be disastrous! And the 10-cycle L2 citation is truly a number pulled out of a hat. But that is not what I’m blogging about.

The new page cites memory latency at 5-50ns. Oh, how I’d love to have a system that chewed through memory at 50ns! But what about that low bound of 5ns? Wow, memory latency at modern L2 cache speed. That would be so cool! I wonder where these Csqlcache folks get their hardware? It is definitely otherworldly.

It’s All About Cache Lines
I don’t get this bit about “granularity” in that page either. Folks, modern microprocessors map memory to cache in units known as cache lines. All processors that matter (no names, but their initials are x64) use a cache line size of 64 bytes (8 words). In order to access any bits within a 64-byte line of memory, the entire line must be installed in the processor cache. So I think it would be more concise to specify granularity at the base operational chunk the processor deals with, which is a cache line. That’s the point of the Silly Little Benchmark, by the way.

The workhorse of SLB (memhammer) randomly picks a line and writes a word in the line. The control loop is tight and the work loop is otherwise light, so this test creates maximum processor stalls with minimum extraneous cycles. That is, it exhibits a miserably high CPI (cycles per instruction) cost. That’s why it is called memhammer.

I’ve got the “before” screen shot. Let’s see if it silently changes. I hate to sound critical, but these Csqlcache folks are hanging their hat on producing a database accelerator. You have to know a lot about memory and how it works to do that sort of thing well. And, my oh my, the field of in-memory and in-line database accelerators is saturated. That reminds me of the company my SQL Server-focused counterparts at PolyServe were all excited about back in about 2005 called Xprime. It looked like a holy grail back then. I recall it even took a best-new-product sort of award at a large SQL Server convention.

It didn’t work very well.

“Feel” Your Processor Cache. Oracle Does. Part I.

At about the same time I was reading Curt Monash’s mention of yet another in-memory database offering, my friend Greg Rahn started hammering his Nehalem-based Mac using the Silly Little Benchmark (SLB). Oh, before I forget, there is an updated copy of SLB here.

This blog entry is a quick and dirty two-birds-one-stone piece.

Sure About In-Memory Database? Then Be Sure About Memory Hierarchy
Curt’s post had a reference to this blog entry about levels of cache on the Csqlcache blog. I took a gander at it and immediately started gasping for air. According to the post, level-2 processor cache is:

Level 2 cache – also referred to as secondary cache) uses the same control logic as Level 1 cache and is also implemented in SRAM and soldered onto the motherboard, which is normally located close to the processor.

No, it isn’t. The last of the volume microprocessors to use off-die level-2 cache was the Pentium II and that was 11 years ago. So, no, processors don’t jump off-die to access static RAMs glued to the motherboard. Processor L2 caches are in the processor (silicon) and, in fact, visible to other cores within a multi-core package. That’s helpful for cache-to-cache transfers, which occur at blisteringly high frequencies with an Oracle Database workload since spinlocks (latches) sustaining a high acquire/release rate will usually have another process trying on the latch on one of the other cores. Once the latch is released, it is much more efficient to shuttle the protected memory lines via a cache-to-cache transfer than in the olden days where L2 cache required a bus access. These shared caches dramatically accelerate Oracle concurrency. That’s scalability. But that isn’t what I’m blogging about.

Get Your Stopwatch. How Fast is That L2 Cache?
In the Csqlcache blog it was stated matter-of-factly that L2 cache has a latency of 20ns. Well, ok, sure, there are or have been L2 processor caches with 20ns latency, but that is neither cast in stone, nor the common nomenclature for expressing such a measurement. It also happens to be a very poor L2 latency number, but I digress. Modern microprocessor L2 cache accesses are in phase with the processor clock rate. So, by convention, access times to L2 cache are expressed in CPU clock cycles. For example, consider a processor clocked at 2.93 GHz. At that rate, each cycle is 0.34 nanoseconds. Let’s say further that a read of a clean line in our 2.93 GHz processor requires 11 clock cycles. That would be 3.75 ns. However, expressing it in wall clock terms is not the way to go, especially on modern systems that can throttle the clock rate as per load placed on the processor. Let’s say, for example, that our 2.93 GHz processor might temporarily be clocked down to 2 GHz. Loading that same memory line would therefore require 5.5 ns.
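The conversion above is just cycles divided by the clock rate in GHz, which yields nanoseconds directly; a one-liner (my addition) checks both figures:

```shell
# 11 cycles at 2.93 GHz, and the same 11 cycles after throttling to 2 GHz
awk 'BEGIN { printf "%.2f ns  %.1f ns\n", 11 / 2.93, 11 / 2.0 }'
# prints: 3.75 ns  5.5 ns
```

Same cache, same cycle count, different wall clock time, which is exactly why cycles are the right unit.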

We can use SLB to investigate this topic further. In the following session excerpt I ran SLB on the first core of the second socket of a Xeon 5400-based server. I had SLB (memhammer) allocate 4 MB of memory from which memhammer loops, picking random 64-byte offsets in which to write. It turns out that SLB is the most pathological of workloads because it requires a processor L2 line load prior to every write, except, that is, in the case where I allocate a sufficiently small chunk of memory to fit in the L2 cache. As the session snapshot shows, memhammer was able to write at random locations within the 4 MB chunk at the rate of 68.86 million times per second, or 14.5 ns per L2 cache access.


# cat r
./create_sem
taskset -pc 4 $$
./memhammer $1 $2 &
sleep 1
./trigger

wait

#  ./r 1024 3000000
pid 23384's current affinity list: 0-7
pid 23384's new affinity list: 4
Total ops 3072000000  Avg nsec/op    14.5  gettimeofday usec 44614106 TPUT ops/sec 68857145.8

When I increased the chunk of memory SLB allocated to 64 MB, the rate fell to roughly 9.3 million writes per second (107.8 ns per write), since the test blew out the L2 cache and was writing to memory.


#  ./r 16384 30000
pid 22919's current affinity list: 0-7
pid 22919's new affinity list: 4
Total ops 491520000  Avg nsec/op   107.8  gettimeofday usec 52970954 TPUT ops/sec 9279047.5
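As a sanity check, the Avg nsec/op and TPUT figures in both memhammer reports follow directly from Total ops and the gettimeofday elapsed microseconds. A quick Python recomputation using the exact numbers printed in the two sessions:

```python
def recompute(total_ops: int, elapsed_usec: int):
    """Derive ns-per-op and ops-per-second from memhammer's raw totals."""
    ns_per_op = elapsed_usec * 1000.0 / total_ops   # usec -> ns, then per op
    ops_per_sec = total_ops / (elapsed_usec / 1e6)  # usec -> seconds
    return ns_per_op, ops_per_sec

# 4 MB working set: fits in L2 cache
print(recompute(3072000000, 44614106))   # ~ (14.52 ns, 68857145.8 ops/s)
# 64 MB working set: blows out L2, so writes go to memory
print(recompute(491520000, 52970954))    # ~ (107.77 ns, 9279047.5 ops/s)
```

The derived values match the tool's own output, which is reassuring: roughly a 7.4x latency penalty for missing the L2 cache on this box.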

I don’t know anything about Csqlcache, but I do know that, since they are focused on in-memory databases, they ought to know what memory really “looks” like. So, put away your soldering iron and that bag full of SRAM chips. You can’t make your modern microprocessor system faster that way.

“I Still Want My Fibre Channel.” Thus Sayeth Manly Man!

Just a quick alert that one of my installments in the “Manly Man” series has just come alive once again through its comment thread. I find it interesting that, 20 months after I wrote it, the post still gets read quite frequently (approximately 150 times per month following the first month after the original posting), as per the WordPress analytics on the post. But the visit rate isn’t what I’m blogging about.

Let’s face it, that post, and the majority of the Manly Man series, are basically an indictment of the storage presentation weaknesses specific to Oracle Real Application Clusters in an FC SAN environment. A lot has transpired since then that makes me feel a bit prescient on the matter. Consider:

  1. Oracle released Direct NFS in Oracle Database 11g:
    1. Manly Men Only Deploy Oracle with Fibre Channel – Part VI. Introducing Oracle11g Direct NFS!
  2. SCSI RDMA (SRP) became a viable option:
    1. Oracle on Opteron with Linux-The NUMA Angle (Part III). Introducing The Silly Little Benchmark.
    2. HP’s Optimized Warehouse (Blades) Reference Platform ( http://h18004.www1.hp.com/products/blades/oow/index.html )
  3. Switched Serial Attached SCSI became increasingly more popular

…and, of course, Oracle Exadata Storage Server and the HP Oracle Database Machine!

Just How Well Do You Know Your Oracle Home Directory Tree?

How Deep is Your…Oracle Home?
This is likely the most trivial-pursuit sort of post I’ve made in a long while. Please don’t ask me why, but I had to take a few minutes to inventory directory depth in an Oracle Database 11g Enterprise Edition (with Real Application Clusters) Oracle Home directory tree on Linux. The data in the following box shows each directory depth that exists under my Oracle Home and a tally of how many directories are nested that deeply.

Maybe we should all breathe a sigh of relief that there are only 3 directories lying 13 levels deep? That is, after all, only .07% of all directories (4208) under a typical 11g Oracle Home! I’m not losing sleep.

Hey, like I said, trivial pursuit. Ho hum.

SQL> select d,count(*) from oh_dirs
  2  group by d
  3  order by 2 desc ;

         D   COUNT(*)
---------- ----------
         8        796
         5        729
         6        663
         4        651
         9        437
         7        369
         3        306
        10        145
         2         70
        11         31
        12          6
        13          3
        14          1
         1          1

14 rows selected.

SQL> select count(*) from oh_dirs;

  COUNT(*)
----------
      4208
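For the curious, a per-depth tally like the one loaded into oh_dirs can be gathered without the database at all. Here is a hedged Python sketch (function name and the example path are mine, purely illustrative):

```python
import os
from collections import Counter

def depth_histogram(root: str) -> Counter:
    """Count directories under root, bucketed by nesting depth.

    Depth 1 means a directory immediately under root, matching the
    D column in the oh_dirs query above.
    """
    root = root.rstrip(os.sep)
    hist = Counter()
    for dirpath, dirnames, _ in os.walk(root):
        for d in dirnames:
            rel = os.path.relpath(os.path.join(dirpath, d), root)
            hist[rel.count(os.sep) + 1] += 1
    return hist

# Example (path is illustrative, not the one from the post):
# for depth, count in sorted(depth_histogram("/u01/app/oracle/product/11.1.0/db_1").items()):
#     print(depth, count)
```

A one-liner with find and awk would do the same job; the Python version is just easier to feed into a table for the SQL above.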

DISCLAIMER

I work for Amazon Web Services. The opinions I share in this blog are my own. I'm *not* communicating as a spokesperson for Amazon. In other words, I work at Amazon, but this is my own opinion.


Copyright

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.