Seven Fundamentals Everyone Should Know About Exadata

I speak to a lot of customers, prospects, and co-workers about Exadata. Even though Exadata has been in production for two years, I still do not presume everyone has a grasp of some of the more important fundamentals. I’ll routinely get asked about how very large SGA buffering can enhance Exadata Smart Scan, or how Storage Indexes might improve OLTP workloads, and other such non sequiturs.

There are a lot of sessions about Exadata being offered at Oracle OpenWorld 2010 and for good reason.  Exadata is exciting technology! It dawns on me, however, that a few words explaining some of the more fundamental aspects of Exadata might help folks absorb more of what they are hearing in the sessions they attend next week.

I consider the following seven terms and definitions utterly important for folks to know before sitting through an Exadata presentation. In fact, there may even be some sessions offered by presenters who could also benefit from the following 242 words?

  • Cell Offload Processing.
    • Work performed by the Storage Servers that would otherwise have to be executed in the database grid. Includes functionality like Smart Scan, datafile initialization, RMAN offload, Hybrid Columnar Compression (HCC) decompression.
  • Smart Scan.
    • Most relevant Cell Offload Processing for improving Data Warehouse / Business Intelligence query performance. Smart Scan is the agent for offloading filtration, projection, Storage Index exploitation and HCC decompression (see the SQL sketch just after this list).
  • Full Scan or Index Fast Full Scan.
    • The required access method chosen by the query optimizer in order to trigger a Smart Scan.
  • Direct Path Reads.
    • Required buffering model for a Smart Scan. The flow of data from a Smart Scan cannot be buffered in the SGA buffer pool. Direct path reads can be performed for both serial and parallel queries. Direct path reads are buffered in process PGA (heap).
  • Result Set.
    • Data returned by the SQL processing layer. The SQL processing layer is in the Oracle Database. The data flowing from a Smart Scan is not a result set.
  • Exadata Smart Flash Cache.
    • Flash Cache in each of the Storage Servers. Not to be confused with Database Flash Cache, which is Flash in the database grid and not compatible with Exadata. Smart Scan aggressively scans both HDD and Flash media concurrently. When data is present in the flash cache, scan rates of 50 GB/s on Exadata Version 2 hardware are the norm for full rack configurations. Maximum theoretical scan rates (a.k.a., datasheet scan rates) for Exadata are *only* possible for fully offloaded scans. A fully offloaded scan is generated by a SQL query that finds no rows. Blog Update: Please consider viewing the following 2-minute YouTube video with a demonstration of how complex SQL processing throttles Exadata Smart Scan to roughly 10% of maximum theoretical scan rates: http://www.youtube.com/watch?v=JuWVjSp42yM
  • Storage Index.
    • Dynamic, in-memory indexes. The role of Storage Index technology is not to aid in locating data faster but instead to eliminate I/O. With Storage Indexes the Exadata Storage Server software can determine whether or not a given storage region contains rows relevant to the query and decide to not read the storage region. Storage Indexes are only examined during a Smart Scan.
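To make a few of these definitions concrete, here is a minimal SQL sketch of how one might confirm that a scan was serviced by Smart Scan and that Storage Indexes eliminated I/O. The statistic names are as they appear in Oracle Database 11.2; the table name BIGTAB and the predicate are hypothetical:

SELECT COUNT(*) FROM bigtab WHERE amount > 42;

-- Bytes eligible for offload versus bytes actually returned by Smart Scan,
-- plus I/O eliminated by Storage Indexes, for this session:
SELECT n.name, ROUND(m.value/1048576) mb
FROM   v$mystat m, v$statname n
WHERE  m.statistic# = n.statistic#
AND    n.name IN ('cell physical IO bytes eligible for predicate offload',
                  'cell physical IO interconnect bytes returned by smart scan',
                  'cell physical IO bytes saved by storage index');

If the interconnect bytes are a small fraction of the eligible bytes, filtration and projection happened in the cells, not in the database grid.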

I hope you’ll find this helpful.

OpenWorld 2010 Session Update. Room Change Again.

The OOW folks informed me that they needed to move our session to a different room–again. So, if you are interested, here are the new details:

ID#: S315110
Title: Optimizing Servers for Oracle Database Performance
Track: Database
Date: 20-SEP-10
Time: 17:00 – 18:00
Venue: Moscone South
Room: Rm 102

Oracle Database 11g Release 2 Patchset 1 (11.2.0.2) Is Now Available, But This Is Not Just An Announcement Blog Entry.

BLOG UPDATE: I should have peeked at my blog aggregator before posting this. I just found that my friend Greg Rahn posted about the same content on his blog earlier today. Hmmm… plagiarism!

Oracle Database 11g Release 2 Patchset 1 (11.2.0.2 Part Number E15732-03) is available as of today for x86 and x86_64 Linux as per My Oracle Support. This is not a blog post with a simple announcement of the software availability. I’d like to point out something related to this Patchset that I did not know until quite recently. I don’t apply Patchsets very often since joining Oracle, so I learned a few new things about patch application, particularly as it pertains to 11.2.0.2.

Read This Before Touching 11.2.0.2!

I recommend reading MOS Note 1189783.1 – Important Changes to Oracle Database Patch Sets Starting With 11.2.0.2. There are two key topics this MOS note explains quite well:

  • The reason behind why the 11.2.0.2 Patchset download for x86_64 is 4.9 Gigabytes in size
  • More clarity on the concept of an “out-of-place upgrade” (sketched just after this list)
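In a nutshell, an out-of-place upgrade means 11.2.0.2 installs as a complete release into a new Oracle home rather than being applied on top of the existing 11.2.0.1 home—which also explains the download size. A sketch of the flow (the paths are hypothetical):

# 11.2.0.2 is a full install, so it goes into a NEW Oracle home:
$ export OLD_HOME=/u01/app/oracle/product/11.2.0.1/db_1   # existing home
$ export NEW_HOME=/u01/app/oracle/product/11.2.0.2/db_1   # new 11.2.0.2 home
$ # 1. Run the 11.2.0.2 installer and target $NEW_HOME
$ # 2. Upgrade each database from $OLD_HOME to $NEW_HOME (e.g., via DBUA)
$ # 3. Decommission $OLD_HOME once satisfied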

I wish I had read MOS note 1189783.1 before I trudged headlong into my first 11.2.0.1->11.2.0.2 upgrade effort!

OpenWorld 2010 Unconference Venue Is Now Open For OpenWorld Attendees Too!

In my post entitled OpenWorld 2010 Unconference Open For JavaOne And/Or Oracle Develop Registrants Only I quoted the Unconference policy which, at the time, stated Unconference attendance was only open to JavaOne and Oracle Develop folks.

I just received email stating that the policy has changed and that the new wording is as follows:

Now, Open to Oracle OpenWorld Attendees as well!
The unconference is a venue for any JavaOne, Oracle Develop or Oracle OpenWorld 2010 attendees to present their own session or workshop on a topic they’re passionate about, in an informal, interactive setting. It is a great opportunity for attendees to learn what’s on the minds of their peers in the community.

My Unconference sessions are:

Tuesday 11 AM: Lombard: Do-It-Yourself Exadata-like Performance? Kevin Closson, Performance Architect, Systems Technology Group, Oracle.

Tuesday 4PM:    Lombard: What Every Oracle Professional Wants To Ask About Exadata (Also Known as Q&A with Kevin Closson.) Kevin Closson, Performance Architect, Systems Technology Group, Oracle.

OpenWorld 2010 Unconference – Open for JavaOne And/Or Oracle Develop Registrants Only. A Poll.

It has come to my attention that the Unconference offered during this year’s OpenWorld can only be attended by registered JavaOne or OracleDevelop attendees as per the following quote:

Participation and attendance is reserved to JavaOne and Oracle Develop attendees. You have to be registered to JavaOne or Oracle Develop 2010 to attend any of those sessions.

I think it’s time for a poll. How many folks interested in Exadata plan to be paid JavaOne/OracleDevelop attendees and might, therefore, attend Unconference sessions?

Some Blog Errors Are Just Too Serious To Ignore. A Comparison of Intel Xeon 5400 (Harpertown) to Intel Xeon 5500 (Nehalem EP).

I’d like to direct readers to an important blog update/correction.

In my post entitled An Intel Xeon 5400 System That Outperforms An Intel 5500 (Nehalem EP) System? Believe It…Or Know It I blogged about an erroneous conclusion I had drawn about a test performed on these two processor models. I think the update does the blog post justice and it all serves as a good object lesson in how important Xeon topology is.  I must remember to practice what I preach (e.g., remain ever-aware of topology).

While on the topic, the following post remains as an example of the type of workload that exhibits near-parity between Xeon 5400 and Xeon 5500:

Intel Xeon 5500 Nehalem: Is It 17 Percent Or 2.75-Fold Faster Than Xeon 5400 Harpertown? Well, Yes Of Course It Is!

Oracle Exadata Database Machine I/O Bottleneck Revealed At… 157 MB/s! But At Least It Scales Linearly Within Datasheet-Specified Bounds!

It has been quite a while since my last Exadata-related post. Since I spend all my time, every working day, on Exadata performance work this blogging dry-spell should seem quite strange to readers of this blog. However, for a while it seemed to me as though I was saturating the websphere on the topic and Exadata is certainly more than a sort of  Kevin’s Dog and Pony Show. It was time to let other content filter up on the Google search results. Now, having said that, there have been times I’ve wished I had continued to saturate the namespace on the topic because of some of the totally erroneous content I’ve seen on the Web.

Most of the erroneous content is low-balling Exadata with FUD, but a surprisingly sad amount of content that over-hypes Exadata exists as well. Both types of erroneous content are disheartening to me given my profession. In actuality, the hype content is more disheartening to me than the FUD. I understand the motivation behind FUD; however, I cannot understand the need to make a good thing out to be better than it is with hype. Exadata is, after all, a machine with limits, folks. All machines have limits. That’s why Exadata comes in different size configurations, for heaven’s sake! OK, enough of that.

FUD or Hype? Neither, Thank You Very Much!
Both the FUD-slinging folks and the folks spewing the ueber-light-speed, anti-matter-powered warp-drive throughput claims have something in common—they don’t understand the technology.  That is quickly changing though. Web content is popping up from sources I know and trust. Sources outside the walls of Oracle as well. In fact, two newly accepted co-members of the OakTable Network have started blogging about their Exadata systems. Kerry Osborne and Frits Hoogland have been posting about Exadata lately (e.g., Kerry Osborne on Exadata Storage Indexes).

I’d like to draw attention to Frits Hoogland’s investigation into Exadata. Frits is embarking on a series that starts with baseline table scan performance on a half-rack Exadata configuration that employs none of the performance features of Exadata (e.g., storage offload processing disabled). His approach is to then enable Exadata features and show the benefit while giving credit to which specific aspect of Exadata is responsible for the improved throughput. The baseline test in Frits’ series is achieved by disabling both Exadata cell offload processing and Parallel Query Option! To that end, the scan is being driven by a single foreground process executing on one of the 32 Intel Xeon 5500 (Nehalem EP) cores in his half-rack Database Machine.
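For readers who want to picture what such a baseline looks like, here is a hedged sketch. CELL_OFFLOAD_PROCESSING is the session-level control for cell offload in 11.2; the table name and hints are illustrative, and I am not claiming this is Frits’ exact method:

SQL> alter session set cell_offload_processing = false;
SQL> select /*+ full(t) noparallel(t) */ count(*) from some_table t where some_col = 'X';

With offload disabled, every block flows back over iDB for filtration in the database grid—which is precisely the point of a baseline.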

Frits cited throughput numbers but left out what I believe is a critical detail about the baseline result—where was the bottleneck?

In Frits’ test, a single foreground process drives the non-offloaded scan at roughly 157MB/s. Why not 1,570MB/s (I’ve heard everything Exadata is supposed to be 10x)? A quick read of any Exadata datasheet will suggest that a half-rack Version 2 Exadata configuration offers up to 25GB/s scan throughput (when scanning both HDD and FLASH storage assets concurrently). So, why not 25 GB/s? The answer is that the flow of data has to go somewhere.

In Frits’ particular baseline case the data is flowing from cells via iDB (RDS IB) into heap-buffered PGA in a single foreground executing on a single core of a single Nehalem EP processor. Along with that data flow is the CPU cost paid by the foreground process in marshalling all the I/O (communicating with Exadata via the intelligent storage layer), which means interacting with cells to request the ASM extents as per its mapping of the table segments (the ASM extent map). Also, the particular query being tested by Frits performs a count(*) and applies a predicate to a column. To that end, a single core in that single Nehalem EP socket is touching every row in every block for predicate evaluation. With all that going on, one should not expect more than 157 MB/s to flow through a single Xeon 5500 core. That is a lot of code execution.

What Is My Point?
The point is that all systems have bottlenecks somewhere. In this case, Frits is creating a synthetic CPU bottleneck as a baseline in a series of tests. The only reason I’m blogging the point is that Frits didn’t identify the bottleneck in that particular test. I’d hate to see the FUD-slingers suggest that a half-rack Version 2 Exadata configuration bottlenecks at 157 MB/s for disk throughput related reasons about as badly as I’d hate to see the hype-spewing-light-speed-anti-matter-warp rah-rah folks suggest that this test could scale up without bounds. I mean to say that I would hate to see someone blindly project how Frits’ baseline test would scale with concurrent invocations. After all, there are 8 cores, 16 threads on each host in the Version 2 Database Machine and therefore 32/64 in a half rack (there are 4 hosts). Surely Frits could invoke 32 or 64 sessions each performing this query without exhibiting any bottlenecks, right? Indeed, 157 MB/s by 64 sessions is about 10 GB/s which fits within the datasheet claims. And, indeed, since the memory bandwidth in this configuration is about 19 GB/s into each Nehalem EP socket there must surely be no reason this query wouldn’t scale linearly, right? The answer is I don’t have the answer. I haven’t tested it. What I would not advise, however, is dividing maximum theoretical arbitrary bandwidth figures (e.g., the 25GB/s scan bandwidth offered by a half-rack) by a measured application throughput requirement  (e.g., Frits’ 157 MB/s) and claim victory just because the math happens to work out in your favor. That would be junk science.

Frits is not blogging junk science. I recommend following this fellow OakTable member to see where it goes.

Linux Thinks It’s a CPU, But What Is It Really – Part III. How Do Intel Xeon 7500 (Nehalem EX) Processors Map To Linux OS Processors?

Last year I posted a blog entry entitled Linux Thinks It’s a CPU, But What Is It Really – Part I. Mapping Xeon 5500 (Nehalem) Processor Threads to Linux OS CPUs where I discussed the Intel CPU Topology Tool. The topology tool is most helpful when trying to quickly map Linux OS processors to physical processor cores or threads. That post has been read, on average, close to 20 times per day since it was posted (10,000+ views), so I thought it deserved a follow-up pertaining to more recent Intel processors and, more importantly, more recent Linux releases.

I’m happy to point out that the tool still functions just fine for Intel Xeon 7500 series processors (a.k.a. Nehalem EX; see also Sun Oracle’s Sun Fire X4800); however, with recent Linux releases the tool is not quite as necessary. With both Enterprise Linux Enterprise Linux Server release 5.5 (Oracle Enterprise Linux 5.5) and Red Hat Enterprise Linux Server release 5.5, the numactl(8) command now renders output that makes it quite clear which sockets associate with which OS processors.

The following output was captured from an 8-socket Nehalem EX machine:

$ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
node 0 size: 131062 MB
node 0 free: 122879 MB
node 1 cpus: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
node 1 size: 131072 MB
node 1 free: 125546 MB
node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
node 2 size: 131072 MB
node 2 free: 125312 MB
node 3 cpus: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
node 3 size: 131072 MB
node 3 free: 126543 MB
node 4 cpus: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
node 4 size: 131072 MB
node 4 free: 125454 MB
node 5 cpus: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
node 5 size: 131072 MB
node 5 free: 124881 MB
node 6 cpus: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
node 6 size: 131072 MB
node 6 free: 123862 MB
node 7 cpus: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
node 7 size: 131072 MB
node 7 free: 126054 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  15  20  15  15  20  20  20 
  1:  15  10  15  20  20  15  20  20 
  2:  20  15  10  15  20  20  15  20 
  3:  15  20  15  10  20  20  20  15 
  4:  15  20  20  20  10  15  15  20 
  5:  20  15  20  20  15  10  20  15 
  6:  20  20  15  20  15  20  10  15 
  7:  20  20  20  15  20  15  15  10 

A node is synonymous with a socket in this case. So, as the output shows, socket 0 maps to OS processors 0-7 and 64-71, the latter range being processor threads. Let’s see how similar this output is to that of the Intel CPU Topology Tool:

Package 0 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
L1D is Level 1 Data cache, size(KBytes)= 32,  Cores/cache= 2, Caches/package= 8
L1I is Level 1 Instruction cache, size(KBytes)= 32,  Cores/cache= 2, Caches/package= 8
L2 is Level 2 Unified cache, size(KBytes)= 256,  Cores/cache= 2, Caches/package= 8
L3 is Level 3 Unified cache, size(KBytes)= 24576,  Cores/cache= 16, Caches/package= 1
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|       0       64|       1       65|       2       66|       3       67|       4       68|       5       69|       6       70|       7       71|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|       1     1z16|       2     2z16|       4     4z16|       8     8z16|      10     1z17|      20     2z17|      40     4z17|      80     8z17|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ff


Package 1 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|       8       72|       9       73|      10       74|      11       75|      12       76|      13       77|      14       78|      15       79|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|     100     1z18|     200     2z18|     400     4z18|     800     8z18|     1z3     1z19|     2z3     2z19|     4z3     4z19|     8z3     8z19|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ff00


Package 2 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      16       80|      17       81|      18       82|      19       83|      20       84|      21       85|      22       86|      23       87|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|     1z4     1z20|     2z4     2z20|     4z4     4z20|     8z4     8z20|     1z5     1z21|     2z5     2z21|     4z5     4z21|     8z5     8z21|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ffz4


Package 3 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      24       88|      25       89|      26       90|      27       91|      28       92|      29       93|      30       94|      31       95|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|     1z6     1z22|     2z6     2z22|     4z6     4z22|     8z6     8z22|     1z7     1z23|     2z7     2z23|     4z7     4z23|     8z7     8z23|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ffz6


Package 4 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      32       96|      33       97|      34       98|      35       99|      36      100|      37      101|      38      102|      39      103|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|     1z8     1z24|     2z8     2z24|     4z8     4z24|     8z8     8z24|     1z9     1z25|     2z9     2z25|     4z9     4z25|     8z9     8z25|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ffz8


Package 5 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      40      104|      41      105|      42      106|      43      107|      44      108|      45      109|      46      110|      47      111|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|    1z10     1z26|    2z10     2z26|    4z10     4z26|    8z10     8z26|    1z11     1z27|    2z11     2z27|    4z11     4z27|    8z11     8z27|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ffz10


Package 6 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      48      112|      49      113|      50      114|      51      115|      52      116|      53      117|      54      118|      55      119|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|    1z12     1z28|    2z12     2z28|    4z12     4z28|    8z12     8z28|    1z13     1z29|    2z13     2z29|    4z13     4z29|    8z13     8z29|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

Combined socket AffinityMask= 0xff00000000000000ffz12


Package 7 Cache and Thread details


Box Description:
Cache  is cache level designator
Size   is cache size
OScpu# is cpu # as seen by OS
Core   is core#[_thread# if > 1 thread/core] inside socket
AffMsk is AffinityMask(extended hex) for core and thread
Extended Hex replaces trailing zeroes with 'z#'
       where # is number of zeroes (so '8z5' is '0x800000')
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+
Cache |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |     L1D         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
OScpu#|      56      120|      57      121|      58      122|      59      123|      60      124|      61      125|      62      126|      63      127|
Core  |   c0_t0    c0_t1|   c1_t0    c1_t1|   c2_t0    c2_t1|   c3_t0    c3_t1|   c4_t0    c4_t1|   c5_t0    c5_t1|   c6_t0    c6_t1|   c7_t0    c7_t1|
AffMsk|    1z14     1z30|    2z14     2z30|    4z14     4z30|    8z14     8z30|    1z15     1z31|    2z15     2z31|    4z15     4z31|    8z15     8z31|
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |     L1I         |
Size  |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |     32K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |      L2         |
Size  |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |    256K         |
      +-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+-----------------+

Cache |      L3                                                                                                                                       |
Size  |     24M                                                                                                                                       |
      +-----------------------------------------------------------------------------------------------------------------------------------------------+

I’m quite happy to see this enhancement to numactl(8). I’ll try to blog soon on why you should care about this topic.
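As a teaser, here is a sketch of why the mapping matters: with the node-to-CPU layout known, a process can be kept on one socket along with its memory. The program name is hypothetical:

$ # Bind a program to socket (node) 0 and its local memory:
$ numactl --cpunodebind=0 --membind=0 ./my_program
$ # Equivalent pinning by explicit OS CPU list (socket 0 = CPUs 0-7,64-71):
$ taskset -c 0-7,64-71 ./my_program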

What’s Really Happening at OpenWorld 2010? Part II.

BLOG UPDATE: Yet another room change for Optimizing Servers for Oracle Database Performance

The OOW folks informed me that they needed to move our session to a larger room. So, if you are interested, here are the new details:

ID#: S315110
Title: Optimizing Servers for Oracle Database Performance
Track: Database
Date: 20-SEP-10
Time: 17:00 – 18:00
Venue: Moscone South
Room: Rm 302

I’ll also be giving this following Unconference sessions:

Tuesday – Sept 21st

11AM
Lombard:  Do-It-Yourself Exadata-like Performance?

4PM
Lombard:  What Every DBA Wants To Ask About Exadata (also known as Q&A with Kevin Closson).


What’s Really Happening at OpenWorld 2010?

This is a quick blog entry to share a few of my plans for OOW. I’ll be co-presenting with Wallis Pereira, Sr. Technical Program Manager in the Mission Critical Segment of Intel’s Data Center Group. Wally is a very old friend of mine and we’ll be delivering the following session.

ID#: S315110
Title: Optimizing Servers for Oracle Database Performance
Track: Database
Date: 20-SEP-10
Time: 17:00 – 18:00
Venue: Moscone South
Room: Rm 270

“Unconference”
I’ll also be offering a couple of short presentations in the “Unconference” venue on Tuesday, September 21 at 11 AM and 2 PM:

Tuesday – Sept 21st

11AM
Lombard: Do-It-Yourself Exadata-like Performance? Kevin Closson, Performance Architect, Systems Technology Group, Oracle.

2PM
Mason: What Every DBA Wants To Ask About Exadata Also Known as Q&A with Kevin Closson. Kevin Closson, Performance Architect, Systems Technology Group, Oracle.

For more information on the Unconference venue please visit OOW 2010 Unconferences

I also recommend joining me as I attend the following presentation:

Realworld Performance Group Round Table Discussion

By the way, since my Monday session is at 5:00 PM I should be done for the day afterward, so any of the folks that owe me a drink can catch me after the presentation (vice versa on the drink debt of course) 🙂

Do-It-Yourself Exadata-Level Performance! Really? Part IV.

In my post entitled Do-It-Yourself Exadata-Level Performance? Really? I invited readers to visit the Oracle Mix page and vote for my suggest-a-session, where I aimed to present on DIY Exadata-level performance. As the following screenshot shows, I got a lot of good folks to vote on that. It must have been an interesting-sounding topic!

Yes, 105 votes. I’m not positive, but that may be the largest number of votes for any suggest-a-session. Thanks for the support. The screenshot also states that back in the week of July 5 the results and notifications would be posted. I waited a few weeks after July 5, without notice, and emailed some of the Mix folks. Here’s what I got:

Oracle employees were not eligible to be selected. The Mix process is meant to give external folks another opportunity to submit their sessions for review and possible inclusion.

I wish they had stipulated up front that Oracle employees need not participate in suggest-a-session. I would have saved those 105 folks the headache of voting.

So, I’m sorry to say that if the topic I suggested in my abstract was something you wanted to hear in a general session, your want is in vain. However, the syllabus for the show suggests to me that there will be plenty of content that you need to hear as per the powers that be. I think the old Stones’ lyric should change to:

You can’t always get what you want. We’ll give you what you need.

I’ll blog more about this seemingly seedy concept of DIY Exadata-level performance soon. I’ll also post about the sessions I am involved with at OOW 2010. I’m hoping my dry-spell on blogging is going to ease. I have a large amount of content to get out.

Little Things Doth Crabby Make – Part XIV. Verbose Linux Command Output Should Be Very Trite. Shouldn’t It?

Not all topics I blog about in my Little Things Doth Crabby Make series make me crabby. Oftentimes I’ll blog something that I presume would make at least one individual somewhere, sometime, crabby. This one actually did make me crabby.

Huh? Was That Verbose?
I’m blogging about the --verbose option to the Linux mdadm(8) command. Consider the command I issued in the following text box.


$ mdadm --create --verbose /dev/md11 --level=stripe -c 4096 --raid-devices=16 /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx /dev/sdy /dev/sdz /dev/sdaa /dev/sdab /dev/sdac
mdadm: failed to create /dev/md11

OK, that wasn’t very verbose. Indeed, it only reported to me that the command failed. I could have figured that out by the obvious missing RAID device after my command prompt returned to me. In my mind, verbose shouldn’t mean what but why. That is, if I ask for verbose output I want something to help me figure out why something just happened. The what is obvious—command failure results in no RAID device.

As you’ll see in the following text box I checked to make sure I was superuser and indeed I was not. So I picked up superuser credentials and the command succeeded nicely. However, even when the command succeeds the verbose option isn’t exactly chatting my ear off! That said, getting brief output from a successful execution of a command, when I stipulate verbosity, would certainly not make it as an installment in the Little Things Doth Crabby Make series.


$ id
uid=1002(oracle) gid=700(dba) groups=700(dba)
$ su
Password:
#  mdadm --create --verbose /dev/md11 --level=stripe -c 4096 --raid-devices=16 /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx /dev/sdy /dev/sdz /dev/sdaa /dev/sdab /dev/sdac
mdadm: array /dev/md11 started.

The moral of the story is if you want to do things that require superuser become superuser.

I still want why-based output when I opt for verbosity. In this case there was a clear permissions problem. The command could have at least let the errno.h goodies trickle up!
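Short of patching mdadm, here is a sketch of one way to recover the why on your own: trace the command and look for the system call that fails. The device list is elided here for brevity:

$ strace mdadm --create --verbose /dev/md11 --level=stripe -c 4096 --raid-devices=16 /dev/sdn /dev/sdo ... 2>&1 | grep ' = -1 E'
$ # Each matching line is a failed system call and its errno (e.g., EACCES),
$ # which is precisely the "why" that --verbose withholds.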

Will Oracle Ever Release Sun Servers Based On Westmere EP and Nehalem EX Processors? Yes.

Oracle has announced the release of several new x86 servers based on the Westmere EP and Nehalem EX processors. This is a really short blog entry, because the website is so loaded with information I haven’t much to add:

Oracle Real Application Clusters Does Not Scale?
I’d like to single out an interesting benchmark. This Oracle Sun x4470 SAP SD benchmark result highlights Real Application Clusters scalability and a result in 6U worth of rack space that beats the recent HP Proliant DL980 G7 result with Microsoft by 15%. The DL980 G7 is an 8U server.

Yes, It Runs IBM DB2, Too.
Lest we forget, people do deploy IBM DB2 on Solaris. To read more follow the link:

Updated Options for Deploying IBM DB2 on Solaris x86-based Servers

Little Things Doth Crabby Make – Part XIII. When Startup Means Shutdown And Stay That Way.

This seemed worthy of Little Things Doth Crabby Make status mostly because I was surprised to see that the Oracle Database 11g STARTUP command worked this way…

Consider the following text box. I purposefully tried to do a STARTUP FORCE specifying a PFILE that doesn’t exist. Well, it did exactly what I told it to do, but I was surprised to find that the abort happens before sqlplus checks for the existence of the specified PFILE. I ended up with a down database instance.


SQL> startup force pfile=./p4.ora
ORACLE instance started.

Total System Global Area 3525079040 bytes
Fixed Size                  2217912 bytes
Variable Size            1107298376 bytes
Database Buffers         2231369728 bytes
Redo Buffers              184193024 bytes
Database mounted.

Database opened.
SQL> SQL>
SQL> HOST ls foo.ora
ls: foo.ora: No such file or directory

SQL> startup force pfile=./foo.ora
LRM-00109: could not open parameter file './foo.ora'
ORA-01078: failure in processing system parameters
SQL> show sga
ORA-01034: ORACLE not available
Process ID: 0
Session ID: 737 Serial number: 5

This one goes in the don’t-do-stupid-stuff category I guess. Please don’t ask how I discovered this…
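A defensive habit that would have spared me (just a sketch): confirm the PFILE is readable before letting STARTUP FORCE abort a perfectly healthy instance.

SQL> HOST test -r ./p4.ora && echo READABLE || echo MISSING
READABLE

SQL> startup force pfile=./p4.ora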

Running Oracle Database On A System With 40% Kernel Mode Overhead? Are You “Normal?”

Fellow Oak Table Network member Charles Hooper has undertaken a critical reading of a recently published book on the topic of Oracle performance. Some folks have misconstrued his coverage as being merely hyper-critical, but as Charles points out, his motive is to bring the content alive. It has been an interesting series of blog entries. I’ve commented on a couple of these blog posts, but as I began to comment on his latest installment I realized I should just do my own blog entry on the matter and refer back. The topic at hand is how “system time” relates to Oracle performance.

The quote from the book that Charles is blogging about reads:

System time: This is when a core is spending time processing operating system kernel code. Virtual memory management, process scheduling, power management, or essentially any activity not directly related to a user task is classified as system time. From an Oracle-centric perspective, system time is pure overhead.

To say “[…] any activity not directly related to a user task is classified as system time” is too simplistic to be correct. System time is the time processors spend executing code in kernel mode. Period. But therein lies my point. The fact is the kernel doesn’t do much of anything that is not directly related to a user task. It isn’t as if the kernel is running interference for Oracle. It is only doing what Oracle (or any user mode code for that matter)  is driving it to do.

For instance, the quote lists virtual memory, process scheduling and so on. That list is really too short to make the point come alive. It is missing the key kernel internals that have to do with Oracle, such as process birth, process death, IPC (e.g., Sys V semaphores), timing (e.g., gettimeofday()), file and network I/O, heap allocations, stack growth, and page table internals (yes, virtual memory).
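To see that cause-and-effect for yourself, consider this sketch (the PID is hypothetical): attach a system call profiler to a busy Oracle foreground and observe that its kernel-mode time is spent servicing exactly these classes of calls.

$ # Summarize the system calls (and kernel time) driven by one Oracle process:
$ strace -c -p 12345   # attach to the foreground; Ctrl-C prints the summary
$ # Expect reads/writes (I/O), semtimedop (IPC), gettimeofday (timing),
$ # mmap/brk (heap growth), and so on -- all driven by the user task.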

In my opinion, anyone interested in the relationship between Oracle and an operating system kernel must read Section 8.1 of my friend James Morle’s book Scaling Oracle8i. In spite of the fact that the title sounds really out of date, it goes a long way toward making the topic at hand a lot easier to understand.

If this topic is of interest to you, feel free to open the following link and navigate down to Section 8.1 (page 417): Scaling Oracle8i (in PDF form).

How Normal Are You?
The quote on Charles’ blog entry continues:

From an Oracle-centric perspective, system time is pure overhead. It’s like paying taxes. It must be done, and there are good reasons (usually) for doing it, […]

True, processor cycles spent in kernel mode are a lot like tax. However, as James pointed out in his book, the VOS layer, and the associated OSD underpinnings, have historically allowed for platform-specific optimizations. That is, the exact same functionality on one platform may impose a larger tax than on others. That is the nature of porting. The section of James’ book starting at page 421 shows some of the types of things that ports have done historically to lower the “system time” tax.

Finally, Charles posts the following quote from the book he is reviewing:

Normally, Oracle database CPU subsystems spend about 5% to 40% of their active time in what is called system mode.

No, I don’t know what “CPU subsystems” is supposed to mean. That is clearly a nickname for something. But that is not what I’m blogging about.

If you are running Oracle Database (any version since about 8i) on a server dedicated to Oracle and running on the hardware natively (not a Virtual Machine), I simply cannot agree with that upper-bound figure of 40%. That is an outrageous amount of kernel-mode overhead. I should think the best way to get to that cost level would be to use file system files without direct I/O. Can anyone with a system losing 40% to kernel mode please post a comment with any specifics about what is driving that much overhead and whether you are happy with the performance of your server?
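If you would like to check where your own system falls before commenting, a quick sketch using standard tools:

$ vmstat 5 3          # the "sy" column reports kernel-mode CPU percentage
$ mpstat -P ALL 5 1   # per-processor %sys; sustained values near 40 would be
                      # the outrageous case discussed above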

