Oracle performance | Kevin Closson's Blog: Platforms, Databases and Storage

Archive for the 'Oracle performance' Category

SLOB Chewed Up All My File System Space and Spit It Out. But, Why?

This is a quick blog post in response to a recent interaction with a SLOB user. The user reached out to me to lament that all her file system space was consumed as the result of a SLOB execution (runit.sh). I reminded her that runit.sh will alert to possible derelict mpstat/iostat/vmstat processes from an aborted SLOB test. If these processes exist they will be spooling their output to unlinked files.

The following screen shot shows what to expect if a SLOB test detects potential “deadwood” processes. If you see this sort of output from runit.sh, it’s best to investigate whether in fact they remain from an aborted test or whether there are other users on the system that left these processes behind.

The next screenshot shows how to take action on the runit.sh output:

Summary

If SLOB is consuming all your file system space, it’s probably already telling you why–all you need to do is take action. I hope this is helpful for some wayward Googler someday.

EMC XtremIO – The Full-Featured All-Flash Array. Interested In Oracle Performance? See The Whitepaper.

Published July 11, 2014 All Flash Array , Flash Storage for Databases , oracle , Oracle I/O Performance , Oracle performance , Oracle Performnce Monitoring , Oracle SAN Topics , Oracle Storage Related Problems 9 Comments
Tags: whitepaper, XtremIO

NOTE: There’s a link to the full article at the end of this post.

I recently submitted a manuscript to the EMC XtremIO Business Unit covering some compelling lab results from testing I concluded earlier this year. I hope you’ll find the paper interesting.

There is a link to the full paper at the bottom of this block post. I’ve pasted the executive summary here:

Executive Summary

Physical I/O patterns generated by Oracle Database workloads are well understood. The predictable nature of these I/O characteristics have historically enabled platform vendors to implement widely varying I/O acceleration technologies including prefetching, coalescing transfers, tiering, caching and even I/O elimination. However, the key presumption central to all of these acceleration technologies is that there is an identifiable active data set. While it is true that Oracle Database workloads generally settle on an active data set, the active data set for a workload is seldom static—it tends to move based on easily understood factors such as data aging or business workflow (e.g., “month-end processing”) and even the data source itself. Identifying the current active data set and keeping up with movement of the active data set is complex and time consuming due to variability in workloads, workload types, and number of workloads. Storage administrators constantly chase the performance hotspots caused by the active dataset.

All-Flash Arrays (AFAs) can completely eliminate the need to identify the active dataset because of the ability of flash to service any part of a larger data set equally. But not all AFAs are created equal.

Even though numerous AFAs have come to market, obtaining the best performance required by databases is challenging. The challenge isn’t just limited to performance. Modern storage arrays offer a wide variety of features such as deduplication, snapshots, clones, thin provisioning, and replication. These features are built on top of the underlying disk management engine, and are based on the same rules and limitations favoring sequential I/O. Simply substituting flash for hard drives won’t break these features, but neither will it enhance them.

EMC has developed a new class of enterprise data storage system, XtremIO flash array, which is based entirely on flash media. XtremIO’s approach was not simply to substitute flash in an existing storage controller design or software stack, but rather to engineer an entirely new array from the ground-up to unlock flash’s full performance potential and deliver array-based capabilities that are unprecedented in the context of current storage systems.

This paper will help the reader understand Oracle Database performance bottlenecks and how XtremIO AFAs can help address such bottlenecks with its unique capability to deal with constant variance in the I/O profile and load levels. We demonstrate that it takes a highly flash-optimized architecture to ensure the best Oracle Database user experience. Please read more: Link to full paper from emc.com.

Oracle Exadata Storage Server: A Black Box with No Statistics.

Published October 7, 2008 I/O Topics , oracle , Oracle performance , Real Application Clusters 16 Comments
Tags: Oracle Exadata Storage Server Software

A question came in about whether it is possible to measure how much data is filtered out when running a query serviced by a Smart Scan in the Exadata Storage Server grid. The following is the long answer.

An Example of Offload Processing Effectiveness Accounting

I’d like to answer this question by taking real information from a test configuration consisting of 4 Real Application Clusters nodes attached to 6 Exadata Storage Server cells with the SAS disk option (12 x 300GB 3.5″ 15K RPM).

Test Workload – The Affinity Card Program Test Database

The Affinity Card Program Test Database (ACPTD) is a test schema and set of queries that mimics the type of queries performed by a marketing group responsible for promoting the use of a retail club card and affinity credit card for a fictitious company. In this example deployment of the ACPTD, the group responsible for promoting club card and affinity credit card activity for Acme Inc has built a data mart with data from several sources, including the main corporate ERP and CRM systems; partner merchant ERP systems; and, outside syndicated data.

For this blog post I’ll focus on the main syndicated card transaction table called all_card_trans. However, there are other tables in the schema and for reference sake the following text box shows the table sizes in my current deployment of the test kit. As the query shows, the all_card_trans table is 613 GB. Yes, I know that’s small, but this is a test system and I don’t like watching 6 Exadata cells hammering 1 GB/s each for tens of minutes when a couple of minutes will do. I hope you’ll understand.

SQL> set echo on
SQL> @tab_sizes
SQL> col segment_name for a32
SQL> select segment_name, sum(bytes)/1024/1024 mb from user_segments
  2  where segment_name not like 'BIN%' and segment_name not like 'SYS%'
  3  group by segment_name;

SEGMENT_NAME                             MB
-------------------------------- ----------
PARTNER_MERCHANTS                         8
PARTNER_MERCHANT_SALES                67780
ZIPCODES                                  8
OUR_SALES                             21956
ALL_CARD_TRANS                       629032
MCC                                       8
OUR_STORES                                8
CUST_SERVICE                             76
TM                                   .03125
CUST_FACT                              4708

10 rows selected.

The table definition for the all_card_trans table is shown in the following text box:

SQL> desc all_card_trans
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CARD_NO                                   NOT NULL VARCHAR2(20)
 CARD_TYPE                                          CHAR(20)
 MCC                                       NOT NULL NUMBER(6)
 PURCHASE_AMT                              NOT NULL NUMBER(6)
 PURCHASE_DT                               NOT NULL DATE
 MERCHANT_CODE                             NOT NULL NUMBER(7)
 MERCHANT_CITY                             NOT NULL VARCHAR2(40)
 MERCHANT_STATE                            NOT NULL CHAR(3)
 MERCHANT_ZIP                              NOT NULL NUMBER(6)

The following queries show the total row count in the all_card_trans table as well as the number of rows representing transactions with credit cards with leading 4 digits of 3333 and 4447, which represent the card block granted to our fictitious company, Acme Inc, for their affinity card program. This is a needle in a haystack query since only .3% of the data matches the where predicate. Now, before you look at the text box please note that this is a single instance of Real Application Clusters driving this query. And, yes, the RAC node is a small 8-core DL360. Both scans are driving storage at 6 GB/s (613GB/102s). Folks, remember that to do what I just did in the following text box with Fibre Channel I’d have to have a system with 15 FC HBAs attached to several huge SAN arrays. Oh, and I should point out that it would also require about 85 Netezza Snippet Processing Units to match this throughput.

SQL> select count(*) from all_card_trans where  card_no like '4777%' or card_no like '3333%’;

  COUNT(*)
----------
  22465830

Elapsed: 00:01:42.71
SQL> select count(*) from all_card_trans;

  COUNT(*)
----------
7897611641

Elapsed: 00:01:42.47

I Can’t Believe I Ate the Whole Thing

The following query uses a method for measuring cell activity by forcing the database to ingest all rows and all columns. Don’t try to make sense of the query because it isn’t supposed to make sense. It is only supposed to drive Exadata to return all rows and all columns, which it does.

$ more demo_offload_bytes.sql

drop view ingest;
create view ingest as
select avg(length(CARD_NO)) c1,
avg(length(CARD_TYPE)) c2,
max(MCC + PURCHASE_AMT) c3,
max(length(PURCHASE_DT)) c4,
max(MERCHANT_CODE) c5,
avg(length(MERCHANT_CITY)) c6,
avg(length(MERCHANT_STATE)) c7,
max(MERCHANT_ZIP) c8 from all_card_trans;

col MB format 99999999.9
select NAME,VALUE/(1024*1024*1024) GB from gv$sysstat where STATISTIC# in (196, 44, 43 );

select c1+c2+c3+c4+c5+c6+c7 from ingest;

select NAME,VALUE/(1024*1024*1024) GB from gv$sysstat where STATISTIC# in (196, 44, 43);

The following text box shows that when I ran the query Exadata (in aggregate) scanned 613 GB of disk and returned all but 6% to the RDBMS instance (delta cell physical IO interconnect bytes). Also recorded by v$sysstat (bytes eligible) is the fact that there was nothing peculiar about the data being scanned-peculiar in any such fashion that would have interfered with offload processing (e.g., chained rows, etc). Since I asked for all the data, that is what I got. It is nice to know, however, that the entirety of the data was a candidate for offload processing.

Notice I didn’t time the query. I’ll offer more on that later in this blog post.

SQL> @demo_offload_bytes

View dropped.

View created.

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           1227.80814
cell physical IO interconnect bytes                              1151.52453
cell physical IO bytes eligible for predicate offload            1227.31451

C1+C2+C3+C4+C5+C6+C7
--------------------
          9010106.59

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                            1841.5774
cell physical IO interconnect bytes                              1727.15227
cell physical IO bytes eligible for predicate offload            1840.97177

So let’s take a look at some offload processing. The following change to the “ingest” query reduces the amount of data ingested by the database tier by only selecting rows where the credit card number started with 4777 or 3333. We know from the scan of the table that this whittles down the dataset by 99.7%.

$ more demo_offload_bytes1.sql

create view ingest_less as
select avg(length(CARD_NO)) c1,
avg(length(CARD_TYPE)) c2,
max(MCC + PURCHASE_AMT) c3,
max(length(PURCHASE_DT)) c4,
max(MERCHANT_CODE) c5,
avg(length(MERCHANT_CITY)) c6,
avg(length(MERCHANT_STATE)) c7,
max(MERCHANT_ZIP) c8 from all_card_trans act
where card_no like '4777%' or card_no like '3333%';

col MB format 99999999.9
select NAME,VALUE/(1024*1024*1024) GB from gv$sysstat where STATISTIC# in (196, 44, 43 );

set timing on
select c1+c2+c3+c4+c5+c6+c7 from ingest_less;

select NAME,VALUE/(1024*1024*1024) GB from gv$sysstat where STATISTIC# in (196, 44, 43);

The following text box shows that that Exadata scanned and filtered out the uninteresting rows. The projected columns accounted for only 1.7 GB across the iDB intelligent I/O fabric. That is, the database tier only had to ingest 1.7 GB, or 17 MB/s since the query completed in 103 seconds.

SQL> @demo_offload_bytes_1

View dropped.

Elapsed: 00:00:00.21

View created.

Elapsed: 00:00:00.00

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           1841.13743
cell physical IO interconnect bytes                              43.1782733
cell physical IO bytes eligible for predicate offload            1840.97177

Elapsed: 00:00:00.01

C1+C2+C3+C4+C5+C6+C7
--------------------
           9010106.9

Elapsed: 00:01:43.31

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           2454.80203
cell physical IO interconnect bytes                              44.8776852
cell physical IO bytes eligible for predicate offload            2454.62903

Elapsed: 00:00:00.00
SQL>

So, yes, Exadata does let you see how effectively Smart Scans are working. In this example we saw a single-table scan on a small Oracle Exadata Storage Server grid weeding out 99.7% of the data.

Sneaky, Sneaky.

You’ll notice when I executed demo_offload_bytes.sql (the non-filtering example) I was not tracking query completion time. That’s because up to this point I’ve only been showing examples driven by a single Real Application Clusters node. Now, let’s think about this. I have 8 Intel processor cores and a single inbound Infiniband path on the database server. From an inbound I/O bandwidth perspective the host can “only” ingest 1.6GB/s, but the CPUs may also further throttle that since the database is doing its own projection in this case.

I’ve shown, in this post, that this test Exadata Storage grid configuration can scan disk at 6 GB/s. The question you might ask-and I’m about to answer-is how much does this single database host throttle storage and why does that matter? Well, it matters because having ample storage bandwidth with limited database server bandwidth is the classic imbalance addressed by the HP Oracle Database Machine. So, let me run it again-with timing turned on. As you’ll see, bottlenecks are bottlenecks. This single database host cannot keep up with storage, so storage is throttled and the result is 18X increase in query completion time when compared to the heavily filtered case. Both queries had to read the same amount of data but in this case there was an imbalance in the upstream ability to ingest the data both from an I/O bandwidth and CPU perspective.

SQL> @demo_offload_bytes

View dropped.

Elapsed: 00:00:00.02

View created.

Elapsed: 00:00:00.01

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           2454.81084
cell physical IO interconnect bytes                               44.886501
cell physical IO bytes eligible for predicate offload            2454.62903

Elapsed: 00:00:00.00

C1+C2+C3+C4+C5+C6+C7
--------------------
          9010106.59

Elapsed: 00:29:26.85

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           3068.55792
cell physical IO interconnect bytes                              620.492259
cell physical IO bytes eligible for predicate offload            3068.28629

Elapsed: 00:00:00.00

Need More RDBMS Bandwidth? OK, Use It.

So, let’s see what the full-ingestion case looks like with 4 Real Application Clusters nodes. And please forgive that I forgot to aggregate the v$sysstat output. I’ve added 4-fold database grid resources so I should complete the full-ingestion query in 75% less time. The text box starts out showing that the number of RAC instances was increased from 1 to 4.

SQL> host date
Fri Oct  3 16:27:52 PDT 2008

SQL> select instance_name from gv$instance;

INSTANCE_NAME
----------------
test1

Elapsed: 00:00:00.02
SQL> select instance_name from gv$instance;

INSTANCE_NAME
----------------
test1
test4
test3
test2

Elapsed: 00:00:00.08
SQL> host date
Fri Oct  3 16:32:06 PDT 2008
SQL>
SQL> @demo_offload_bytes

View dropped.

Elapsed: 00:00:00.03

View created.

Elapsed: 00:00:00.01

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           3068.69843
cell physical IO interconnect bytes                              620.632769
cell physical IO bytes eligible for predicate offload            3068.28629
physical IO disk bytes                                           .013385296
cell physical IO interconnect bytes                              .013385296
cell physical IO bytes eligible for predicate offload                     0
physical IO disk bytes                                           .010355473
cell physical IO interconnect bytes                              .010355473
cell physical IO bytes eligible for predicate offload                     0
physical IO disk bytes                                           .031275272
cell physical IO interconnect bytes                              .031275272

NAME                                                                     GB
---------------------------------------------------------------- ----------
cell physical IO bytes eligible for predicate offload                     0

12 rows selected.

Elapsed: 00:00:00.02

C1+C2+C3+C4+C5+C6+C7
--------------------
          9010106.59
Elapsed: 00:07:25.01

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           3222.13893
cell physical IO interconnect bytes                              764.535856
cell physical IO bytes eligible for predicate offload            3221.70425
physical IO disk bytes                                            154.31714
cell physical IO interconnect bytes                              144.732708
cell physical IO bytes eligible for predicate offload            154.280945
physical IO disk bytes                                           152.860648
cell physical IO interconnect bytes                              143.367385
cell physical IO bytes eligible for predicate offload            152.828033
physical IO disk bytes                                           153.195674
cell physical IO interconnect bytes                                143.6831

NAME                                                                     GB
---------------------------------------------------------------- ----------
cell physical IO bytes eligible for predicate offload             153.13031

12 rows selected.

Elapsed: 00:00:00.02
SQL>

Well, that was predictable, but cool nonetheless. What about the filtered ingest-query? Should there be much of a speedup given the fact that storage bandwidth remains constant and the ingest rate of the filtered query was only roughly 17 MB/s with a single Real Application Clusters node? I could save the cut and paste effort and just tell you that adding Real Application Clusters nodes in this case will, of course, not reduce the query time, but since I’ve bored you this long, here it is:

SQL> @demo_offload_bytes_1

View dropped.

Elapsed: 00:00:00.01

View created.

Elapsed: 00:00:00.01

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           3222.15758
cell physical IO interconnect bytes                              764.554509
cell physical IO bytes eligible for predicate offload            3221.70425
physical IO disk bytes                                           154.334342
cell physical IO interconnect bytes                               144.74991
cell physical IO bytes eligible for predicate offload            154.280945
physical IO disk bytes                                           153.213285
cell physical IO interconnect bytes                               143.70071
cell physical IO bytes eligible for predicate offload             153.13031
physical IO disk bytes                                           152.878347
cell physical IO interconnect bytes                              143.385084

NAME                                                                     GB
---------------------------------------------------------------- ----------
cell physical IO bytes eligible for predicate offload            152.828033

12 rows selected.

Elapsed: 00:00:00.01

C1+C2+C3+C4+C5+C6+C7
--------------------
           9010106.9

Elapsed: 00:01:42.79

NAME                                                                     GB
---------------------------------------------------------------- ----------
physical IO disk bytes                                           3375.27835
cell physical IO interconnect bytes                              764.974159
cell physical IO bytes eligible for predicate offload            3374.81995
physical IO disk bytes                                             308.0746
cell physical IO interconnect bytes                              144.133529
cell physical IO bytes eligible for predicate offload            307.986572
physical IO disk bytes                                           310.634206
cell physical IO interconnect bytes                              145.189768
cell physical IO bytes eligible for predicate offload              310.5755
physical IO disk bytes                                           302.274069
cell physical IO interconnect bytes                              143.805239

NAME                                                                     GB
---------------------------------------------------------------- ----------
cell physical IO bytes eligible for predicate offload            302.218781

12 rows selected.

Elapsed: 00:00:00.01
SQL> host date
Fri Oct  3 16:49:32 PDT 2008

Summary

Exadata works and has statistics so that the solution doesn’t feel so much like a black box. In the end, however, none of the statistcs stuff really matters. What matters is whether your queries complete in the desired service times.

Databases are the Contents of Storage. Future Oracle DBAs Can Administer More. Why Would They Want To?

Published September 6, 2007 NFS CFS ASM , oracle , Oracle 11g , Oracle 11g Real Application Clusters , Oracle ASM , Oracle Automatic Memory Management , Oracle I/O Performance , Oracle performance , Oracle SMB , Oracle TPC-H , Oracle Unstructured Data , Oracle11g Leave a Comment

I’ve taken the following quote from this Oracle whitepaper about low cost storage:

A Database Storage Grid does not depend on flawless execution from its component storage arrays. Instead, it is designed to tolerate the failure of individual storage arrays.

In spite of the fact that the Resilient Low-Cost Storage Initiative program was decommissioned along with the Oracle Storage Compatability Program, the concepts discussed in that paper should be treated as a barometer of the future of storage for Oracle databases-with two exceptions: 1) Fibre Channel is not the future and 2) there’s more to “the database” than just the database. What do I mean by point 2? Well, with features like SecureFiles, we aren’t just talking rows and columns any more and I doubt (but I don’t know) that SecureFiles is the end of that trend.

Future Oracle DBAs
Oracle DBAs of the future become even more critical to the enterprise since the current “stove-pipe” style IT organization will invariably change. In today’s IT shop, the application team talks to the DBA team who talks to the Sys Admin team who tlks to the Storage Admin team. All this to get an application to store data on disk through a Oracle database. I think that will be the model that remains for lightly-featured products like MySQL and SQL Server, but Oracle aims for more. Yes, I’m only whetting your appetite but I will flesh out this topic over time. Here’s food for thought: Oracle DBAs should stop thinking their role in the model stops at the contents of the storage.

So while Chen Shapira may be worried that DBAs will get obviated, I’d predict instead that Oracle technology will become more full-featured at the storage level. Unlike the stock market where past performance is no indicator of future performance, Oracle has consistently brought to market features that were once considered too “low-level” to be in the domain of a Database vendor.

The IT industry is going through consolidation. I think we’ll see Enterprise-level IT roles go through some consolidation over time as well. DBAs who can wear more than “one hat” will be more valuable to the enterprise. Instead of thinking about “encroachment” from the low-end database products, think about your increased value proposition with Oracle features that enable this consolidation of IT roles-that is, if I’m reading the tea leaves correctly.

How to Win Friends and Influence People
Believe me, my positions on Fibre Channel have prompted some fairly vile emails in my inbox-especially the posts in my Manly Man SAN series. Folks, I don’t “have it out”, as they say, for the role of Storage Administrators. I just believe that the Oracle DBAs of today are on the cusp of being in control of more of the stack. Like I said, it seems today’s DBA responsibilities stop at the contents of the storage-a role that fits the Fibre Channel paradigm quite well, but a role that makes little sense to me. I think Oracle DBAs are capable of more and will have more success when they have more control. Having said that, I encourage any of you DBAs who would love to be in more control of the storage to look at my my post about the recent SAN-free Oracle Data Warehouse. Read that post and give considerable thought to the model it discusses. And give even more consideration to the cost savings it yields.

The Voices in My Head
Now my alter ego (who is a DBA, whereas I’m not) is asking, “Why would I want more control at the storage level?” I’ll try to answer him in blog posts, but perhaps some of you DBAs can share experiences where performance or availability problems were further exacerbated by finger pointing between you and the Storage Administration group.

Note to Storage Administrators
Please, please, do not fill my email box with vitriolic messages about the harmony today’s typical stove-pipe IT organization creates. I’m not here to start battles.

Let me share a thought that might help this whole thread make more sense. Let’s recall the days when an Oracle DBA and a System Administrator together (yet alone) were able to provide Oracle Database connectivity and processing for thousands of users without ever talking to a “Storage Group.” Do you folks remember when that was? I do. It was the days of Direct Attach Storage (DAS). The problem with that model was that it only took until about the late 1990s to run out of connectivity-enter the Fibre Channel SAN. And since SANs are spokes attached to hubs of storage systems (SAN arrays), we wound up with a level of indirection between the Oracle server and its blocks on disk. Perhaps there are still some power DBAs that remember how life was with large numbers of DAS drives (hundreds). Perhaps they’ll recall the level of control they had back then. On the other hand, perhaps I’m going insane, but riddle me this (and feel free to quote me elsewhere):

Why is it that the industry needed SANs to get more than a few hundred disks attached to a high-end Oracle system in the late 1990s and yet today’s Oracle databases often reside on LUNs comprised of a handful of drives in a SAN?

The very thought of that twist of fate makes me feel like a fish flopping around on a hot sidewalk. Do you remember my post about capacity versus spindles? Oh, right, SAN cache makes that all better. Uh huh.

Am I saying the future is DAS? No. Can I tell you now exactly what model I’m alluding to? Not yet, but I enjoy putting out a little food for thought.

What Oracle Topic Is On Your Coffee Table?

Published September 6, 2007 Oracle I/O Performance , Oracle OTLP , Oracle performance Leave a Comment

I’ve received a couple of emails over the last few weeks in which readers asked me what my favorite blog posts are on my own blog. I’d have to sift through old posts to answer that definitively, but one recent post comes to mind. I think my recent post about why Oracle shops generally aren’t realizing the full potential of their hard drives is one of my favorites.

Oracle11g Automatic Memory Management – Part III. A NUMA Issue.

Published August 24, 2007 AMD Quad-core , AMD Quad-Core Performance , Automatic Memory Management , General Oracle Topics , Linux NUMA , NUMA Oracle , Opteron NUMA , Opteron Oracle , Oracle Barcelona , Oracle NUMA , Oracle performance , Oracle11g 3 Comments

Now I’m glad I did that series about Oracle on Linux, The NUMA Angle. In my post about the the difference between NUMA and SUMA and “Cyclops”, I shared a lot of information about the dynamics of Oracle running with all the SGA allocated from one memory bank on a NUMA system. Déjà vu.

Well, we’re at it again. As I point out in Part I and Part II of this series, Oracle implements Automatic Memory Management in Oracle Database 11g with memory mapped files in /dev/shm. That got me curious.

Since I exclusively install my Oracle bits on NFS mounts, I thought I’d sling my 11g ORACLE_HOME over to a DL385 I have available in my lab setup. Oh boy am I going to miss that lab when I take on my new job September 4^th. Sob, sob. See, when you install Oracle on NFS mounts, the installation is portable. I install 32bit Linux ports via 32bit server into an NFS mount and I can take it anywhere. In fact, since the database is on an NFS mount (HP EFS Clustered Gateway NAS) I can take ORACLE_HOME and the database mounts to any system with a RHEL4 OS running-and that includes RHEL4 x86_64 servers even though the ORACLE_HOME is 32bit. That works fine, except 32bit Oracle cannot use libaio on 64bit RHEL4 (unless you invokde everything under the linux32 command environment that is). I don’t care about that since I use either Oracle Disk Manager or, better yet, Oracle11g Direct NFS. Note, running 32bit Oracle on a 64bit Linux OS is not supported for production, but for my case it helps me check certain things out. That brings us back to /dev/shm on AMD Opteron (NUMA) systems. It turns out the only Opteron system I could test 11g AMM on happens to have x86_64 RHEL4 installed-but, again, no matter.

Quick Test

[root@tmr6s5 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3585 MB
node 1 size: 4095 MB
node 1 free: 3955 MB
[root@tmr6s5 ~]# dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024
1024+0 records in
1024+0 records out
[root@tmr6s5 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3585 MB
node 1 size: 4095 MB
node 1 free: 2927 MB

Uh, that’s not good. I dumped some zeros into a file on /dev/shm and all the memory was allocated from socket 1. Lest anyone forget from my NUMA series (you did read that didn’t you?), writing memory not connected to your processor is, uh, slower:

[root@tmr6s5 ~]# taskset -pc 0-1 $$
pid 9453's current affinity list: 0,1
pid 9453's new affinity list: 0,1
[root@tmr6s5 ~]# time dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
1024+0 records in
1024+0 records out

real    0m1.116s
user    0m0.005s
sys     0m1.111s
[root@tmr6s5 ~]# taskset -pc 1-2 $$
pid 9453's current affinity list: 0,1
pid 9453's new affinity list: 1
[root@tmr6s5 ~]# time dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
1024+0 records in
1024+0 records out

real    0m0.931s
user    0m0.006s
sys     0m0.923s

Yes, 20% slower.

What About Oracle?
So, like I said, I mounted that ORACLE_HOME on this Opteron server. What does an AMM instance look like? Here goes:

SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3587 MB
node 1 size: 4095 MB
node 1 free: 3956 MB

SQL> startup pfile=./amm.ora
ORACLE instance started.

Total System Global Area 2276634624 bytes
Fixed Size                  1300068 bytes
Variable Size             570427804 bytes
Database Buffers         1694498816 bytes
Redo Buffers               10407936 bytes
Database mounted.
Database opened.
SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 1331 MB
node 1 size: 4095 MB
node 1 free: 3951 MB

Ick. This means that Oracle11g AMM on Opteron servers is a Cyclops. Odd how this allocation came from memory attached to socket 0 when the file creation with dd(1) landed in socket 1’s memory. Hmm…

What to do? SUMA? Well, it seems as though I should be able to interleave tmpfs memory and use that for /dev/shm-at least according to the tmpfs documentation. And should is the operative word. I have been tweaking for a half hour to get the mpol=interleave mount option (with and without the -o remount technique) to no avail. Bummer!

Impact
If AMD can’t get the Barcelona and/or Budapest Quad-core off the ground (and into high-quality servers from HP/IBM/DELL/Verari), none of this will matter. Actually, come to think of it, unless Barcelona is really, really fast, you won’t be sticking it into your existing Socket F motherboards because that doubles your Oracle license fee (unless you are on standard edition which is priced on socket count). That leaves AMD Quad-core adopters waiting for HyperTransport 3.0 as a remedy. I blogged all this AMD Barcelona stuff already.

Given the NUMA characteristics of /dev/shm, I think I’ll test AMM versus MMM on NUMA, and them test again on SUMA-if I can find the time.

If anyone can get /dev/shm mounted with the mpol option, please let me know because, at times, I can be quite a dolt and I’d love this to be one of them.

Oracle11g Automatic Memory Management – Part I. Linux Hugepages Support.

Published August 23, 2007 oracle , Oracle 11g , oracle hugepages , Oracle performance 19 Comments

I spent the majority of my time in the Oracle Database 11g Beta program testing storage-related aspects of the new release. To be honest, I didn’t even take a short peek at the new Automatic Memory Management feature. As I pointed out the other day, Tanel Poder has started blogging about the feature.

If you read Tanel’s post you’ll see that he points out AMM-style shared memory does not use hugepages. This is because AMM memory segments are memory mapped files in /dev/shm. At this time, the major Linux distributions do not implement backing memory mapped files with hugepages as they do with System V-style IPC shared memory. The latter supports the SHM_HUGETLB flag passed to the shmget(P) call. It appears as though there was an effort to get hugepages support for memory mapped pages by adding MAP_HUGETLB flag support for the mmap(P) call as suggested in this kernel developer email thread from 2004. I haven’t been able to find just how far that proposed patch went however. Nonetheless, I’m sure Wim’s group is more than aware of that proposed mmap(P) support and if it is really important for Oracle Database 11g Automatic Memory Management, it seems likely there would be a 2.6 Kernel patch for it someday. But that begs the question: just how important are hugepages? Is it blasphemy to even ask the question?

Memory Mapped Files and Oracle Ports
The concept of large page tables is a bit of a porting nightmare. It will be interesting to see how the other ports deal with OS-level support for the dynamic nature of Automatic Memory Management. Will the other ports also use memory mapped files instead of IPC Shared Memory? If so, they too will have spotty large page table support for memory mapped files. For instance, Solaris 9 supported large page tables for mmap(2) pages, but only if it was an anonymous mmap (e.g., a map without a file) or a map of /dev/zero-neither of which would work for AMM. I understand that Solaris 10 supports large page tables for mmap(2) regions that are MAP_SHARED mmap(2)s of files-which is most likely how AMM will look on Solaris, but I’m only guessing. Other OSes, like Tru64-and I’m quite sure most others-don’t support large page tables for mmap(2)ed files. This will be interesting to watch.

Performance, Large Page Table, Etc
I remember back in the mid-90s when Sequent implemented shared large page tables for IPC Shared memory on our Unix variant-DYNIX/ptx. It was a very significant performance enhancement. For instance, 1024 shadow processes attached to a 1GB SGA required 1GB of physical memory-for the page tables alone! That was significant on systems that had very small L2 caches and only supported 4GB physical memory. Fast forwarding to today. I know people with Oracle 10g workloads that absolutely seize up their Linux (2.6. Kernel) system unless they use hugepages. Now I should point out that these sites I know of have a significant mix of structured and unstructured data. That is, they call out to LOBs in the filesystem (give me SecureFiles please). So the pathology they generally suffered without hugepages was memory thrashing between Oracle and the OS page cache (filesystem buffer cache). The salve for those wounds was hugepages since that essentially carves out and locks down the memory at boot time. Hugepages memory can never be nibbled up for page cache. To that end, benefiting from hugepages in this way is actually a by-product. The true point behind hugepages not the fact that it is reserved at boot time, but the fact that CPUs don’t have to thrash to maintain the physical to virtual translations (tlb). In general, hugepages are a lot more polite on processor caches and they reduce RAM overhead for page tables. Compared to the mid 1990s, however, RAM is about the least of our worries these days. Manageability is the most important and AMM aims to help on that front.

Confusion
Of all things Oracle and Linux, I think one of the topics that gets mangled the most is hugepages. The terms and nobs to twist run the gamut. There’s hugepages, hugetlb, hugetlbfs, hugetlbpool and so on. Then there are the differences from one Linux distribution and Linux kernel to the other. For instance, you can’t use hugepages on SuSE unless you turn off vm.disable_cap_mlock (need a few double negatives?). Then there is the question of boot-time versus /proc or sysctl(8) to reserve the pages. Finally, there is the fact that if you don’t have enough hugepages when you boot Oracle, Oracle will not complain-you just don’t get hugepages. I think Metalink 361323.1 does a decent job explaining hugepages with old and recent Linux in mind, but I never see it explained as succinctly as follows:

Use OEL 4 or RHEL 4 with Oracle Database 10g or 11g
Set oracle hard memlock N in /etc/security/limits.conf where N is a value large enough to cover your SGA needs
Set vm.nr_hugepages in /etc/sysctl.conf to a value large enough to cover your SGA.

Further Confusion
Audited TPC results don’t help. For instance, on page 125 of this Full disclosure report from a recent Oracle10g TPC-C, there are listings of sysctl.conf and lilo showing the setting of the hugetlbpool parameter. That would be just fine if this was a RHEL3 benchmark since vm.hugetlbpool doesn’t exist in RHEL4.

Performance
I admit I haven’t done a great deal of testing with AMM, but generally a quick I/O-intensive OLTP test on a system with 4 processor cores utilized at 100% speak volumes to me. So I did just such a test.

Using an order-entry workload accessing the schema detailed in this Oracle Whitepaper about Direct NFS, I tested two configurations:

Automatic Memory Management (AMM). Just like it says, I configured the simplest set of initialization parameters I could:

UNDO_TABLESPACE=rb1
UNDO_MANAGEMENT = AUTO
compatible = 10.1.0.0
control_files                  = ( /u01/app/oracle/product/11/db_1/rw/DATA/cntlbench_1 )
db_block_size                   = 4096
MEMORY_TARGET=1500M
db_files                        = 100
db_writer_processes = 1
db_name                         = bench
processes                       = 200
sessions                        = 400
cursor_space_for_time           = TRUE  # pin the sql in cache
filesystemio_options=setall

Manual Memory Management(MMM). I did my best to tailor the important SGA regions to match what AMM produced. In my mind, for an OLTP workload the most important SGA regions are the block buffers and the shared pool.

UNDO_TABLESPACE=rb1
UNDO_MANAGEMENT = AUTO
compatible = 10.1.0.0
control_files                  = ( /u01/app/oracle/product/11/db_1/rw/DATA/cntlbench_1 )
db_block_size                   = 4096
#MEMORY_TARGET=1500M
db_cache_size = 624M
shared_pool_size=224M
db_files                        = 100
db_writer_processes = 1
db_name                         = bench
processes                       = 200
sessions                        = 400
cursor_space_for_time           = TRUE  # pin the sql in cache
filesystemio_options=setall

The following v$sgainfo output justifies just how closely configured the AMM and MMM cases were.

AMM:

SQL> select * from v$sgainfo ;

NAME                                  BYTES RES
-------------------------------- ---------- ---
Fixed SGA Size                      1298916 No
Redo Buffers                       11943936 No
Buffer Cache Size                 654311424 Yes
Shared Pool Size                  234881024 Yes
Large Pool Size                    16777216 Yes
Java Pool Size                     16777216 Yes
Streams Pool Size                         0 Yes
Shared IO Pool Size                33554432 Yes
Granule Size                       16777216 No
Maximum SGA Size                 1573527552 No
Startup overhead in Shared Pool    83886080 No

NAME                                  BYTES RES
-------------------------------- ---------- ---
Free SGA Memory Available                 0

MMM:

SQL> select * from v$sgainfo ;

NAME                                  BYTES RES
-------------------------------- ---------- ---
Fixed SGA Size                      1302592 No
Redo Buffers                        4964352 No
Buffer Cache Size                 654311424 Yes
Shared Pool Size                  234881024 Yes
Large Pool Size                           0 Yes
Java Pool Size                     25165824 Yes
Streams Pool Size                         0 Yes
Shared IO Pool Size                29360128 Yes
Granule Size                        4194304 No
Maximum SGA Size                  949989376 No
Startup overhead in Shared Pool    75497472 No

NAME                                  BYTES RES
-------------------------------- ---------- ---
Free SGA Memory Available                 0

The server was a HP DL380 with 4 processor cores and the storage was an HP EFS Clustered Gateway NAS. Before each test I did the following:

Restore Database
Reboot Server
Mount NFS filesystems
Boot Oracle

Before the MMM case I set vm.nr_hugepages=600 and after the database was booted, hugepages utilization looked like this:

$ grep Huge /proc/meminfo
HugePages_Total:   600
HugePages_Free:    145
Hugepagesize:     2048 kB

So, given all these conditions, I believe I am making an apples-apples comparison of AMM to MMM where AMM does not get hugepages support but MMM does. I think this is a pretty stressful workload since I am maxing out the processors and performing a significant amount of I/O-given the size of the server.

Test Results
OK, so this is a very contained case and Oracle Database 11g is still only available on x86 Linux. I hope I can have the time to do a similar test with more substantial gear. For the time being, what I know is that losing hugepages support for the sake of gaining AMM should not make you lose sleep. The results measured in throughput (transactions per second) and server statistics are in:

Configuration	OLTP Transactions/sec	Logical IO/sec	Block Changes/sec	Physical Read/sec	Physical Write/sec
AMM	905	36,742	10,195	4,287	2,817
MMM	872	36,411	10,101	4,864	2,928

Looks like 4% in the favor of AMM to me and that is likely attributed to the 13% more physical I/O per transaction the MMM case had to perform. That part of the results has me baffled for the moment since they both have the same buffering as the v$sgainfo output above shows. Well, yes, there is a significant difference in the amount of Large Pool in the MMM case, but this workload really shouldn’t have any demand on Large Pool. I’m going to investigate that further. Perhaps an interesting test would be to reduce the amount buffering the AMM case gets to force more physical I/O. That could bring it more in line. We’ll see.

Summary
I’m not saying hugepages is no help across the board. What I am saying is that I would weigh heavily the benefits AMM offers because losing hugepages might not make any difference for you at all. If it is, in fact, a huge problem across the board then it looks like there has been work done in this area for the 2.6 Kernel and it seems reasonable that such a feature (hugepages support for mmap(P)) could be implemented. We’ll see.

Over-Configuring DBWR Processes – Part III

Published August 17, 2007 DBWR Performance , oracle , Oracle I/O Performance , Oracle performance 4 Comments

Sometimes the comment thread on a blog entry are important enough to warrant yet another blog entry. This is one of those times. In my blog entry called Over-Configuring DBWR Processes Part II I offered some test results showing that it really is best to keep a single writer until such time as a single writer becomes CPU-bound. But first a disclaimer.

DISCLAIMER: There are ports of Oracle where specialized multiple DBWR code is implemented for specific hardware architecture (e.g., NUMA). I ought to know, I worked on one of the very first (if not the first) to do so with the ports to Sequent DYNIX/ptx.

There, disclaimer aside, I am talking about the Linux ports of 10g and 11g which have no such hardware specialization.

The first question in that comment thread was:

From what’s written here, it looks like you only tested 1 and 8. Is it possible that the global maximum is actually somewhere between? Maybe 2 processes is slightly faster than 1, for example?

Great question with a simple answer: use 1 writer as long as that single writer has ample bandwidth. That is a truth, but I figured I’d do some runs to milk out a fact or two. I configured 2 DBWR processes and re-ran the workload I describe in Part II of this series. I followed that up with 4 DBWR processes. Hey, it’s all auto-pilot stuff with my setup so doing this was no big deal. Let’s compare the results. Just to summarize, the previous configurations performed as follows:

Configuration	OLTP Throughput (TPM)	Aggregate DBWR CPU Usage
1 DBWR	62.6	291
8 DBWR	57.9	256

And now with the 2 and 4 DBWR cases:

Configuration	OLTP Throughput (TPM)	Aggregate DBWR CPU Usage
1 DBWR	62.6	291
2 DBWR	58.7	273
4 DBWR	58.4	269
8 DBWR	57.9	256

The way this goes is that more DBWR processes impact throughput. On the other hand, more throughput will require more DBWR work so the optimal case of 1 DBWR process will take more CPU, but remember that it is also getting more throughput. I only provide these CPU numbers to show what percentage of a single CPU any given configuration utilizes. I don’t think it is necessary to run the 6 DBWR case. But what about a wildcat configuration such as multiple DBWR processes that have hard processor affinity?

Processor Affinity and Oracle Background Processes
Well, I’d never bind an Oracle background process to a singe processor (core) for obvious reasons, but with the Linux taskset(1) command it is possible to bind a running process to a socket. For example, the following command binds pid 13276 to the seconds socket of a mulit-socket system:

# taskset –pc 2-3 13276

Building on that idea, you can grab DBWR processes (or any process for that matter) once they are running and pin them to a CPU core or a set of CPU cores. The following stupid bash shell function can be used to do this sort of thing. Note, since there are tens of thousands of skilled computer programmers out there these days, all shell scripts are considered stupid of course. The function takes an argument that it then uses in its grep regular expression for globbing ps(1) output. An example would be:

aff ora_dbw

aff()

local GLOB="$1"
local CORE=0
local TWIN=0

ps -ef | grep $GLOB | grep -v grep | awk '{ print $2 }'  | while read PID
do
        (( TWIN = $CORE + 1 ))

        set -x
        sudo taskset -pc ${CORE}-${TWIN} $PID
        set -    

        (( CORE = CORE + 2 ))
        [[ $CORE = 8 ]] && CORE=0 && TWIN=0
done
}

I used this script to affinity 2 DBWR processes to a single socket and the numbers didn’t change. Neither did throwing 2 DBWR processes one each to two different sockets. The fact simply remains that DBWR doesn’t benefit from having peers unless it is CPU-bottlenecked.

Over-Configuring DBWR Processes – Part II.

Published August 17, 2007 DBWR Performance , Oracle performance 13 Comments

In my recent blog entry about over-configuring DBWR processes, I mentioned the processor cache effect of having too many DBWR processes doing work that a single DBWR process could handle. I’d like to provide more detail on the matter.

Backgrounder
Few people really understand what DBWR does for a living. Yes, everyone knows DBWR is responsible for flushing modified buffers from the SGA, but that is really high level. DBWR:

1. Builds “write batches” by examining the state of buffer headers on the LRU lists. It scans the LRU end considering the age of the buffer. At instance boot time, each DBWR process is assigned a set of LRUs to tend to.
2. Manipulates cache buffers chains elements for such reasons as marking the buffer busy when it is in flight for an I/O and marking it as written after the flush operation.
3. Performs I/O using the OSDs that are available to it. OSDs are routines in the Oracle kernel that are Operating System (port) specific. These are the routines through which Oracle gets support for asynchronous and/or direct I/O via calls to various libraries (e.g., libc, libaio, libodm, etc).
4. Posts waiters, such as foreground processes waiting in a free buffer wait event.

The problem with over-configuring DBWR processes is two-fold:

1. Processor Cache Trashing. DBWR processes have huge working sets since they deal with so many LRU and cache buffers chains structures. Having more of these processes than necessary migrating from CPU to CPU places a huge dent in the L2 caches with both page table and data footprint.
2. Scheduling anomalies. If you have more than 1 DBWR process, you run the increased risk of having DBWR processes that can’t get on CPU. Consider a checkpoint. All LRUs will have modified buffers that need flushed during a checkpoint. Any given LRU can only be cleaned by the boot-time assigned DBWR process. That means that a checkpoint requires action from all DBWR processes. It’s easier for 1 process to get scheduled on a CPU on, say, an 8 CPU system than 8 processes.

Point 2 above is the part that I suspect most folks aren’t aware of. I think folks envision the various multiple DBWR processes will only get busy if their peers aren’t keeping up. That is not the case. LRUs get evenly loaded by foreground processes because when a process needs a buffer (for file I/O, sort buffer, block cloning, etc) they first try the LRU latch for the last LRU they used (improves cache warmth) and if they can’t get that latch with a no-wait get, they jump to a different latch and do another no-wait get. This continues until they either obtain an LRU on a no-wait get or they loop around to the first attempted latch and do a wait-get. I recall in older releases of Oracle the next LRU to try was the very next LRU. That could have changed to something like a random selection but I don’t know ( I haven’t seen the code in years but that’s about to change). No matter, the LRUs are evenly loaded. Since LRUs are evenly loaded, that means that DBWR processes are equally tasked. I’m not very smart, but this I know:

On SMP systems it is not wise to break up a task that a single CPU can do and distribute the work evenly amongst multiple CPUs. You introduce processor scheduling overhead and you hammer the processor caches

How true, and even more so when the processes share resources as DBWR processes do-they “share” cache buffers chains and other goodies. Anyway, it might be smart to do such task decomposition on realtime or other specialized systems, but not Oracle systems. Let’s take a closer look.

Moderate OLTP Load, db_writer_processes=8
I just did an OLTP run on my DL585 8-core system with db_writer_processes set to 8. After the instance is booted, but before the Pro*C OLTP clients are triggered, I did a ps(1) command grepping for all the DBWR processes. As soon as the 900 second run was complete, the driver script does the same ps(1) command. All output goes to nohup.out so I can grep for DBWR processes and see how much CPU time they consumed during the run:

$ grep ora_dbw nohup.out
oracle   14507     1  2 13:46 ?        00:00:01 ora_dbw0_bench1
oracle   14509     1  2 13:46 ?        00:00:01 ora_dbw1_bench1
oracle   14511     1  2 13:46 ?        00:00:02 ora_dbw2_bench1
oracle   14513     1  1 13:46 ?        00:00:01 ora_dbw3_bench1
oracle   14515     1  1 13:46 ?        00:00:01 ora_dbw4_bench1
oracle   14517     1  2 13:46 ?        00:00:01 ora_dbw5_bench1
oracle   14519     1  1 13:46 ?        00:00:01 ora_dbw6_bench1
oracle   14521     1  2 13:46 ?        00:00:01 ora_dbw7_bench1
oracle   14507     1  2 13:46 ?        00:00:23 ora_dbw0_bench1
oracle   14509     1  2 13:46 ?        00:00:23 ora_dbw1_bench1
oracle   14511     1  2 13:46 ?        00:00:23 ora_dbw2_bench1
oracle   14513     1  2 13:46 ?        00:00:23 ora_dbw3_bench1
oracle   14515     1  2 13:46 ?        00:00:23 ora_dbw4_bench1
oracle   14517     1  2 13:46 ?        00:00:23 ora_dbw5_bench1
oracle   14519     1  2 13:46 ?        00:00:23 ora_dbw6_bench1
oracle   14521     1  2 13:46 ?        00:00:23 ora_dbw7_bench1

The ps(1) output shows us that each of the 8 DBWR processes accumulated 23 seconds of CPU. This clearly reflects the fact that all LRUs are loaded evenly and that each DBWR has an equal part in keeping them clear. So what’s the point? Well, 8 DBWR processes accumulating 184 CPU seconds (23*8) during a 7200 (8 * 900) CPU-second OLTP run means this DBWR workload could have been serviced by a single CPU. In fact 184 CPU seconds is only 20% of a single CPU core during a 900 second run, yet by configuring 8 DBWR processes we spread that workload over multiple CPUs. I know, you’re thinking this must be a pretty lightly loaded system, right? No, the processors hover around 50% utilization with peaks of 70%. The datafile I/O rate was some 6,980 IOPS of which 4,548/second were writes:

Load Profile                            Per Second       Per Transaction
~~~~~~~~~~~~                       ---------------       ---------------
                  Redo size:          3,222,269.99              4,570.66
              Logical reads:             66,770.61                 94.71
              Block changes:             18,097.81                 25.67
             Physical reads:              2,342.71                  3.32
            Physical writes:              4,548.36                  6.45
                 User calls:             11,795.67                 16.73
                     Parses:              3,597.63                  5.10
                Hard parses:                  0.27                  0.00
                      Sorts:                  1.00                  0.00
                     Logons:                  0.00                  0.00
                   Executes:             10,121.28                 14.36
               Transactions:                704.99

Moderate OLTP Load, db_writer_processes=1
Let’s take a look at the same workload after a shutdown, database restore and reboot of the instance:

$ grep ora_db noh*
oracle   15652     1  0 14:23 ?        00:00:00 ora_dbw0_bench1
oracle   15652     1 17 14:23 ?        00:02:55 ora_dbw0_bench1

What’s that? A single DBWR was able to service the same workload with only 175 CPU seconds? That’s the same work with 5% fewer processor cycles. But what did it do to throughput? Well, this is the medium model of the benchmark where the highest processor utilization peaks were roughly 70%. Reducing CPU utilization on a system that has ample CPU cycles to spare doesn’t get much. In fact, throughput only improved 2.4% from 46.3 Transactions Per Minute (TPM) to 47.4 TPM. Some would say that is within the margin of statistical error, but I bet those same people don’t have a setup were a database restore followed by an instance reboot will produce benchmark results within .5% like this setup does. Let’s just say it’s calibrated. Have you really read this far to hear about a paltry 2.4% difference in OLTP throughput though? No, that’s not all. The impact of too many DBWR processes is scheduling overhead and poor processor cache efficiency, so we need to look at a run with higher CPU utilization.

Heavy OLTP Load, db_writer_processes=8
The following ps(1) data shows us that with the increased OLTP load the 8 DBWR processes each consumed 32 CPU seconds for an aggregate of 256 CPU seconds. That’s still much less than the amount of cycles a single CPU had to offer-28% of a single CPU in fact. S

$ grep ora_db noh*
oracle   20763     1  0 15:59 ?        00:00:00 ora_dbw0_bench1
oracle   20765     1  0 15:59 ?        00:00:00 ora_dbw1_bench1
oracle   20767     1  0 15:59 ?        00:00:00 ora_dbw2_bench1
oracle   20769     1  0 15:59 ?        00:00:00 ora_dbw3_bench1
oracle   20771     1  0 15:59 ?        00:00:00 ora_dbw4_bench1
oracle   20773     1  0 15:59 ?        00:00:00 ora_dbw5_bench1
oracle   20775     1  0 15:59 ?        00:00:00 ora_dbw6_bench1
oracle   20777     1  0 15:59 ?        00:00:00 ora_dbw7_bench1
oracle   20763     1  3 15:59 ?        00:00:32 ora_dbw0_bench1
oracle   20765     1  3 15:59 ?        00:00:32 ora_dbw1_bench1
oracle   20767     1  3 15:59 ?        00:00:32 ora_dbw2_bench1
oracle   20769     1  3 15:59 ?        00:00:32 ora_dbw3_bench1
oracle   20771     1  3 15:59 ?        00:00:32 ora_dbw4_bench1
oracle   20773     1  3 15:59 ?        00:00:32 ora_dbw5_bench1
oracle   20775     1  3 15:59 ?        00:00:32 ora_dbw6_bench1
oracle   20777     1  3 15:59 ?        00:00:32 ora_dbw7_bench1

So here again the DBWR workload would “fit” within a single CPU. With this configuration, the OLTP throughput was 57.9 TPM. How does that compare to the same workload running against an instance with a single DBWR process?

Heavy OLTP Load, db_writer_processes=1
The following ps(1) output shows that a single DBWR process was able to service the same OLTP load with 291 CPU seconds-still much less (32%) than the bandwidth of a single CPU (there are 900 CPU seconds per core during the test).

$ grep ora_db noh*
oracle   21426     1  0 16:26 ?        00:00:00 ora_dbw0_bench1
oracle   21426     1 29 16:26 ?        00:04:51 ora_dbw0_bench1

But what about performance? When reducing DBWR count to 1, the OLTP throughput increased 8% from 57.9 to 62.6 TPM. What was the I/O load like though? Here is a snippet from statspack:

Load Profile                            Per Second       Per Transaction
~~~~~~~~~~~~                       ---------------       ---------------
                  Redo size:          5,918,344.96              4,355.71
              Logical reads:            125,684.76                 92.50
              Block changes:             33,622.21                 24.74
             Physical reads:              2,066.64                  1.52
            Physical writes:              8,483.06                  6.24
                 User calls:             21,996.51                 16.19
                     Parses:              6,737.41                  4.96
                Hard parses:                  0.27                  0.00
                      Sorts:                  1.03                  0.00
                     Logons:                  0.00                  0.00
                   Executes:             18,852.22                 13.87
               Transactions:              1,358.76

We see that with a single DBWR, the OLTP workload pushed through some 10,549 IOPS of which 8,483 per second were writes. That means a single DBWR on fairly outdated AMD Opteron processors (DL585) can service 8,483 writes per second with only 32% of a single CPU. So why configure more than that?

Where Did The Time Go?
Were there any significant waits on the 8 DBWR case? Yes, there were. Let’s see if anyone can guess which one is which:

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
CPU time                                                     3,148          35.8
log file sync                                1,207,863       2,905      2   33.1
log file switch (checkpoint incomplete)          4,627         828    179    9.4
db file scattered read                         172,724         670      4    7.6
log file parallel write                        532,373         464      1    5.3
          -------------------------------------------------------------

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
CPU time                                                     2,858          31.0
log file sync                                1,076,670       2,509      2   27.2
log file switch (checkpoint incomplete)          4,808       1,939    403   21.0
db file scattered read                         173,566         664      4    7.2
log file parallel write                        491,346         419      1    4.5
          -------------------------------------------------------------

Summary
Did he really make this long of a blog entry to discuss some 8% OLTP throughput increase? Well, yes and no. The affect of processor-cache thrashing will be more severe on larger SMP systems. This test system only has 4 sockets (8 cores). I’ve seen crazy DBWR configurations on 32 and 64 CPU systems account for more than 8%. Even if 8% was a constant, who wants to lose as much as 3 or 5 CPUs worth of bandwidth in today’s IT environment?

Manly Men Only Use Solid State Disk For Redo Logging. LGWR I/O is Simple, But Not LGWR Processing

Published July 21, 2007 LGWR , NFS Performance , oracle , Oracle OTLP , Oracle performance 64 Comments

Let’s look at some statspack data from an OLTP Order Entry benchmark run that achieved 133 transaction per second (TPS):

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
log file sync                                   68,965       1,117     16   64.6
db file sequential read                        159,712         241      2   14.0
CPU time                                                       163           9.4
log file switch (checkpoint incomplete)            164         146    890    8.4
db file parallel write                          19,651          31      2    1.8

And now let’s look at the stats from run number 2 of the same benchmark, same system, same disk setup, etc:

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
db file sequential read                        838,810       1,635      2   69.1
CPU time                                                       253          10.7
db file parallel write                         153,334         145      1    6.1
log file sync                                   93,100         143      2    6.0
log file parallel write                         93,417          73      1    3.1

Benchmark run number 2 achieved 200 TPS. That’s right, statspack number 1 and number 2 above are from the same system. The database files are stored in an an HP EFS Clustered Gateway NAS device and accesed via NFS and the redo logs are, get this, stored on a Texas Memory Systems Solid State Disk. Statspack 2 above was a very busy system (a lot of cached SGA work) and doing over 3,000 IOPS:

Load Profile                            Per Second       Per Transaction
~~~~~~~~~~~~                       ---------------       ---------------
                  Redo size:          1,138,491.64              7,166.65
              Logical reads:             14,640.16                 92.16
              Block changes:              5,538.83                 34.87
             Physical reads:              1,428.73                  8.99
            Physical writes:              1,742.27                 10.97
                 User calls:              2,352.99                 14.81

So if a) system is the same, and b) LGWR is writing redo to the fastest disk possible, what is the story? What changed from statspack 1 to statspack 2 to make log file sync wait events jump 8-fold-with the associated 34% drop in throughput? More users? No. Nothing that simple. I’ll explain.

LGWR I/O? Who Cares-Simple Stuff
I see myriads of information on blogs, forums and all those expert repositories about the I/O aspects of redo logging, but very little about LGWR processing. Yes, a portion of LGWR’s job is to flush the redo buffer to the online redo log, but there’s more to LGWR’s job than writing redo. In fact, I never could understand the fascination with LGWR’s I/O since it is about as simple as it gets: sequential writes to a single file of sizes ranging from 512 bytes to a port-defined maximum (usually 128KB but often 1MB). Yes, it gets a little more complex with multiplexed redo, but that is just concurrent operations of the exact same profile. The part that seems to be hard to grasp for folks is where asynchronous I/O comes in to play.

LGWR Async I/O
If LGWR is clearing more than the port-defined maximum from the buffer, asynchronous I/O is important. Let’s consider an example. Let’s say LGWR has been posted to clear 1.75MB of redo from the buffer on a platform that establishes the LGWR I/O maximum to be 1MB. In this case, LGWR will issue a 1MB write asynchronously immediately followed by a 750KB asynchronous write. These writes are sequential. After the second I/O is in flight, LGWR starts to poll for the completions of the I/Os. So the question often becomes what if LGWR is only flushing, say, 600KB instead of 1.75MB, isn’t asynchronous I/O crucial in that case as well? The answer is no-unless you are using multiplexed redo. If LGWR is flushing an amount less than the port-defined maximum, asynchronous I/O offers nothing-for that specific flushing operation. See, LGWR is a lot different than DBWR. When DBWR has I/O in flight, it also has other stuff it needs to do like building the next write batch and so forth. LGWR on the other hand has nothing to do when its I/Os are in flight. It polls for the completions and once the flushing operation is complete, it then takes action to post the foregrounds that are waiting in a log file sync wait event. So asyncnronous I/O has no play in a LGWR buffer flush operation of an amount smaller than the port-defined maximum I/O size.

Oh, I forgot to mention, LGWR will keep up to 4 I/Os in flight (I need to check if that is more in late releases) when clearing more than the maximum allowed in a single physical write. All in all, LGWR I/O is very simple. So why do session waits, such as log file sync, generate so much folklore? I don’t know, but I know the folklore isn’t helpful.

Throw A Solid State Disk At It
I love that one. Let’s see, writing to solid state disk is faster than round-brown spinning thingies, so if any aspect of LGWR’s duties seem out of sorts, let’s get some really, really fast disk. Don’t get me wrong, I’m a big fan of solid state disk, but throwing it at the wrong problem is not the right thing to do.

The most complex topic regarding LGWR is the log file sync wait event. Why? Because it has very little to do with I/O, that’s why. Sure I/O is a part of it, but there is more to it.

Backgrounder
LGWR spends its time in a work loop consisting of very little, really. Well, in contrast to DBWR that is. Remember, DBWR has to manipulate both cache buffers chains and lrus and flush modified SGA buffers and post processes. While DBWR works on a large amount of metadata and buffers, LGWR does not. A glance at ps(1) output makes that clear. Notice in the following ps(1) output, LGWR manages to perform its tasks with a resident set size one 30^th the size of DBWR-and of course that ratio would lean much more to the DBWR side on a large SGA system since DBWR has to map all the buffers in the SGA over the life of the instance (the exception to this is the indirect data buffers feature). LGWR on the other hand has very little working set because it only tends to the redo buffer and will only exercise a few latches (redo allocation, copy, etc).

$ ps -eF | egrep 'UID|lgwr|dbw' | grep -v grep
UID        PID  PPID  C    SZ  RSS PSR STIME TTY          TIME CMD
oracle   19062     1  1 377082 851808 0 16:11 ?       00:00:08 ora_dbw0_bench1
oracle   19064     1  2 379943 29816 0 16:11 ?        00:00:09 ora_lgwr_bench1

Log File Sync
When a foreground process commits changes to the database, it allocates space in the redo buffer and copies the redo information into the buffer-protected either by the redo allocation latch itself or a redo copy latch. The process then enqueues itself for service from LGWR and posts the LGWR (most ports use IPC semaphores for posting). The process then goes to sleep waiting for the return post from LGWR indicating its redo is flushed to disk. So, we now have a process waiting for log file sync event (LFS). Let’s call this process waiter1. How long waiter1 waits depends on a lot of factors.

The first factor that will affect waiter1’s LFS wait is what LGWR was doing when it posted him. Remember, Oracle is a multi-user database on a multi-processor system. There is 1 LGWR and a lot of consumers of his service. On a busy system, odds are quite good that any time LGWR get posted it is in the process of servicing a batch of prior waiters. That is, when waiter1 posted LGWR the odds are good it was already busy flushing a bunch of redo for some other group of processes who are themselves waiting in LFS. This is where it gets dicey.

Before LGWR can flush waiter1’s redo, he has to finish flushing the prior batch and post all of those waiters. Let’s say there was 256KB of redo to flush for 2 waiters on a platform with the port-defined maximum LGWR I/O size set at 128KB. On this platform, LGWR will have to issue 2 asynchronous 128KB writes and poll for those to complete. Once the I/O is complete, LGWR will post each of the two waiters using the post-wait mechanism for that port of Oracle-which is an IPC Semaphore semop() call for most platforms. So, LGWR has made system calls to write, and system calles to poll for I/O completion and finally 2 calls to semop(). LGWR is just a normal user-mode process so all this dipping into the kernel introduces processing delays. Yes, these system calls I listed are non-blocking, but when a processor transitions from kernel to user mode there is opportunity to switch to higher-priority runable processes. In fact, the very act of LGWR posting these 2 processes can cause it to lose CPU. Huh? Sure. What if instead of 2 there were 16 in the batch and there were only 4 CPUs on the system. We call that the thundering herd and it wreaks havoc on LGWR. These process scheduling storms can be quite problematic. Think about it. You have one process (LGWR) that just made 4,8 or 16 or more processes runable by semop()ing them from their log file sync wait which makes them runable in kernel mode which trumps a runable user-mode process. If that happens on a system with, say 8 cores, those processes will need to run somewhere. It is a scheduling storm. For this reason you should always set LGWR’s scheduling priority to the highest possible (with renice for instance).

Without elevated scheduling priority for LGWR, it can often be the case that the processes LGWR wakes up will take the CPU from LGWR. Some platforms (Legacy Unix) support the notion of making processes non-preemptable as well. On those platforms you can make LGWR non-preemptable. Without preemption protection, LGWR can lose CPU before it has exhausted its fair time slice.
Once LGWR loses his CPU it may be quite some time until he gets it back. For instance, if LGWR is preempted in the middle of trying to perform a redo buffer flush, there may be several time slices of execution for other processes before LGWR gets back on CPU. Why am I putting you to sleep about this? Because those time slices that LGWR was off CPU will be charged to foreground processes waiting in a log file sync wait event. Yes, if a foreground is in log file sync wait and LGWR is piled under other processes and can’t get CPU for, say, a few time slices (10ms each), that foreground process’ log file sync will have 10, 20 or more milliseconds of extra wait tacked onto it before LGWR even gets a chance to do I/O. LGWR’s I/O is merely a component of a foreground process’ LFS wait.

So what happened to the benchmark run behind statspack 2 above? CPU starvation specifically for LGWR! How did I do that? Easy. I have a simple program that uses the Linux sched_setaffinity() API call to assign hard processor affinity to processes. These tests were done on a 4-core x86 system and in both cases all the background processes were started up with hard CPU affinity to CPU 0. Once the database was started, I reassign LGWR’s hard affinity to CPU 3. The Pro*C benchmark processes and their associated shadow processes are all running split equally with hard affinity between CPUs 1 and 2. This is the model for both benchmark runs. Yes, LGWR is all alone on CPU 3.

In the case where the benchmark only got 133TPS, I have a “noise” process running on the same CPU as LGWR. This noise process also is pinned to the CPU with hard affinity. What does this “noise” process do that is so heinous as to affect Oracle’s OLTP throughput by 34%? After all, it is only a single process. Yes, single, but also running on the same CPU as LGWR. The noise program is:

double sqrt();
main () { int i; double z; for (;;) { for (i = 0; i < 1000000; i++) { z = sqrt (i); } poll (0, 0, 1); } }

Simple enough. That loop of sqrt() calls from libm takes about 3ms on the CPUs in this benchmark system. So the way this works is that about every 3ms, this process will go to sleep for 1ms using the poll() system call. If this process is running when LGWR needs to run, LGWR will have to wait for 3ms. If that process isn’t executing, LGWR has no impediment to peforming his tasks and LFS are a measely 2ms and OLTP throughput jumps 34%.

This little parasite noise process gets in LGWR’s way a lot too. Basically, any time LGWR gets posted to perform a flush, this noise process is in the way (chalk up 3ms), any time LGWR blocks on an I/O (e.g., a flush that is smaller than the port maximum and therefore a single I/O) this noise process is in the way once the I/O is complete (chalk up 3ms), any time LGWR unluckily loses his CPU to a time slice switch this process will run and take 3ms from LGWR. It all adds up. And in this test case, processes posted by LGWR can never take his CPU (since I’ve set up hard affinity) so this is a nicer test case actually.

Don’t Write At All
Just to be evil, I’ll take it a step further. Did you know that if you set the initialization parameter _disable_logging=TRUE LGWR does everything it normally does except it omits performing the I/O for a redo flush. Don’t do this if you like your database because anything other than a shutdown normal is going to leave your database unrecoverable. However, this is a great test because if performance doesn’t increase when you set this parameter in this fashion, then you have a LGWR I/O processing problem and not a LGWR I/O problem. So I did a run of the same benchmark like this:

SQL> show parameter _disable

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_disable_logging                     boolean     TRUE

Everything else was the same but now instead of really fast Solid State Disk I/Os, LGWR doesn’t even bother writing to disk at all. What did that do to the benchmark results? Nothing. I still get 133TPS.

Summary
LGWR I/O is very simple. LGWR processing on the other hand is a bit more complex.

Oracle Doesn’t Scale Without Multiple Log Writer Processes

Published July 17, 2007 oracle , Oracle performance 15 Comments

There has been a bit of a spat going on between Jonathan Lewis and Don Burleson. Don’s summary of the topic can be found here. I don’t care about the spat. I want to point out something clearly. Very clearly. But first, a quote from Don:

[…] Lewis failed to mention the hidden parameter _lgwr_io_slaves nor the Metalink note that clearly states that initially, only one LGWR process is started and that multiple LGWR processes will only appear under high activity.

The Oracle docs are very clear that multiple log writer processes exist:

“Prior to Oracle8i you could configure multiple log writers using the LGWR_IO_SLAVES parameter.”

Note: There is a big difference between LGWR I/O Slaves and LGWR. Don continues:

(Note: Starting in 8.1.7, lgwr_io_slaves became a hidden parameter _lgwr_io_slaves, and it remains a hidden parm in Oracle10g).

Further, Metalink note 109582.1 says that multiple log writer I/O slaves were first introduced almost a decade ago, way back in Oracle8:

Folks, there is only ever one LGWR whose job it is to clear the redo buffer. DBWR on the other hand is multi-stated (support for multiple true DBWR processes) and that is due entirely to the NUMA requirements of certain ports from the late 1990s. Although it is possible for a single DBWR to saturate a CPU, and therefore need multiple DBWR processes, that is rare (generally related to pathological latch contention on cache buffers chains).

As long as LGWR and DBWR have asynchronous I/O support—and you are not running on a NUMA-optimized port—there really should never be a reason to configure multiple DBWR processes, nor should LGWR saturate a CPU.

Let me put it this way, there are examples of large high-end systems doing OLTP that don’t need multiple LGWR processes. Here is a system with roughly 7,000 spindles, 40 FC HBAs, 128 cores and 2TB RAM that needed neither multiple LGWR nor DBWR processes. And if you think you have a workload that generates more redo on a per-CPU basis than TPC-C you’re probably wrong.

If I ever hear the term I/O slaves again, I’ll:

Summary
Oracle demonstrates more LGWR scalability than 99.99942% of all production sites would ever need when they run such high end TPC-C benchmarks as the one I cited above. And please, let’s stop the I/O slave madness!

Manly Men Only Deploy Oracle with Fibre Channel – Part VIII. After All, Oracle Doesn’t Support Async I/O on NFS

Published July 17, 2007 NFS CFS ASM , oracle , Oracle Async I/O NFS , Oracle DBWR , Oracle Direct I/O , Oracle I/O Performance , Oracle NAS , Oracle NFS , Oracle on NFS , Oracle over NFS , Oracle performance 9 Comments

In the comment section of my recent post about Tim Hall’s excellent NFS step-by-step Linux RAC install Guide, Tim came full circle to ask a question about asynchronous I/O on NFS. He wrote:

What do you set your filesystemio_options init.ora parameter to when using Oracle over NFS?

Based on what you’ve written before I know NFS supports direct I/O, but I’m struggling to find a specific statement about NFS and asynchronous I/O. So should I use:

filesystemio_options=directIO

or

filesystemio_options=setall

My reply to that was going to remind you folks about my recent rant about old Linux distributions combined with Oracle over NFS. That is, the answer is, “it depends.” It depends on whether you are running a reasonable Linux distribution. But, Tim quickly followed up his query with:

I found my answer. Asynchronous I/O is not supported on NFS:

http://download.oracle.com/docs/cd/B19306_01/server.102/b15658/appc_linux.htm#sthref892

Bummer, I didn’t get to answer it.

Word To The Wise
Don’t use old Linux stuff with NAS if you want to do Oracle over NFS. Metalink 279069.1 provides a clear picture as to why I say that. It points out a couple of important things:

1. RHEL 4 U4 and EL4 both support asynchronous I/O on NFS mounts. That makes me so happy because I’ve been doing asynchronous I/O on NFS mounts with Oracle10gR2 for about 16 months. Unfortunately, ML 279069.1 incorrectly states that the critical fix for Oracle async I/O on NFS is U4, when in fact the specific bug (Bugzilla 161362 ) was fixed in RHEL4 U3 as seen in this Red Hat Advisory from March 2006.

2. Asynchronous I/O on NFS was not supported on any release prior to RHEL4. That’s fine with me because I wouldn’t use any Linux release prior to the 2.6 kernels to support Oracle over NFS!

Summary
The Oracle documentation on the matter was correct since it was produced long before there was OS support for asynchronous I/O on Linux for Oracle over NFS. Metalink 279069.1 is partly correct in that it states support for asynchronous I/O on systems that have the fix for Bugzilla 161363 but it incorrectly suggests that U4 is the requisite release for that fix, but it isn’t—the bug was fixed in U3. And yes, I get really good performance with the following initialization parameter set and have for about 16 months:

filesystemio_options = setall

Manly Man Post Script
Always remember, the Manly Man series is tongue-in-cheek. Oracle over NFS with Async I/O on the other hand isn’t.

Manly Men Only Deploy Oracle with 64 Bit Linux – Part I. What About a x86 Port on EM64T/AMD64 Hardware?

Published July 13, 2007 oracle , ORacle 64 bit , Oracle latches , Oracle performance 32 Comments

In the comment thread of my of my latest installment in the “Manly Man” series, a reader posted a humorous comment that included a serious question:

[…] what are your thoughts about x86 vs. x86-64 Linux, in relation to Oracle and RAC? I’d appreciate a blog entry if you could.

Tim Hall followed up in suit:

I would have thought it was obvious. The number 64 is twice the size of 32, so it must be twice as good, making you twice as manly!

Yes, when it comes to bitness, some is good so more must be better, right? Tim then continued to point out the political incorrectness of the term Manly Man by suggesting the following:

PS. I think politically correct names for this series of blog entries are:

Personally Persons only deploy…
Humanly Humans only deploy…

Before I actually touch the 32 versus 64-bit question from the thread, I’ll submit the following Manly Man entertainment:

If you have enjoyed the Manly Man series at all, or simply need a little background, you must must see this YouTube Video about Manly Men and Irish Spring.

32 or 64 Bit Linux
When people talk about 32 versus 64 bit Oracle with Linux they are actually talking about 3 topics:

Running 32 bit Oracle on 32 bit Linux with native 32 bit hardware (e.g., Pentium IV (Willamette), Xeon MP (Foster MP) ).
Running 32 bit Oracle on 32 bit Linux with x86_64 hardware.
Running 64 bit Oracle on 64 bit Linux with x86_64 hardware.

The oddball combination folks don’t talk about is 32 bit Oracle running on 64 bit Linux because Oracle doesn’t support it. I do test that however by installing Oracle on a 32 bit server and then NFS mounting that ORACLE_HOME over on a 64 bit Linux system. However, discussing this combination would therefore be moot due to the support aspect.

The most interesting comparison would be between 1 and 2 above provided both systems have precisely the same core clock speed, L2 cache and I/O subsystem. As such, the comparison would come down to how well the 32-bit optimized code is treated by the larger cache line size. There would, of course, be other factors since there are several million more transistors in an EM64T processor than a Pentium IV and other fundamental improvements. I have wished I could make that comparison though. The workload of choice would be one that “fits” in a 32 bit environment (e.g., 1GB SGA, 1GB total PGA) and therefore doesn’t necessarily benefit from 64 bitness.

If anyone were to ask me, I’d say go with 3 above. Oracle on x86_64 Linux is not new.

Bitness
In my recent blog entry about old software configurations in an Oracle over NFS situation, I took my well-deserved swipes at pre-2.6 Linux kernels. Perhaps the most frightening one in my experience was 32-bit RHEL 3. That whole 4/4 split kernel thing was a nightmare—unless you like systems that routinely lock up. But all told I was taking swipes at pre-2.6 kernels without regard for bitness. So long as the 2.6 kernel is on the table, the question of bitness is not necessarily so cut and dried.

In my recent blog entry about Tim Hall’s excellent step-by-step guide for RAC on NFS, a reader shared a very interesting situation he has gone through:

I have a Manly Man question for you. This Manly Man Wanna Be (MMWB) runs a 2-node 10g RAC on Dell 2850s with 2 dual-core Xeon CPUs (total of 4 CPUs). Each server has 16 GB of memory. While MMWB was installing this last year, he struggled mightily with 64-bit RAC on 64-bit Red Hat Linux 4.0. MMWB finally got it working after learning a lot of things about RPMs and such.

However, Boss Of Manly Man Wanna Be (BOMMWB) was nervous about 64-bit being “new,” and all of the difficulties that MMWB had with it, so we reinstalled with 32-bit RAC running on 32-bit Red Hat Linux 4.0.

My naturally petulant reaction would have been to focus on the comment about 64-bit being “new.” I’m glad I didn’t fire off. This topic deserves better treatment.

While I disagree with Boss of Manly Man’s assertion that 64-bit is “new”, I can’t take arms against the fact that this site measured different levels of pain when installing the same release of Oracle on the same release of RHEL4—only varying the bitness. It is unfortunate that this site has committed themselves to a 32 bit database based solely upon the their experiences during the installation. Yes, the x86_64 install of 10gR2 requires a bit more massaging of the platform vis a vis RPMs. In fact, I made a blog entry about 32 bit libraries required on 64 bit RHEL4. While there may occasionally be more headaches during an x86_64 install than x86, I would not deploy Oracle on a 32 bit operating system today unless there was a gun held to my head. All is not lost for this site, however. The database they created with 32 bit Oracle is perfectly usable in-place with 64 bit Oracle after a simple dictionary upgrade procedure documented in the note Metalink note entitled Changing between 32-bit and 64-bit Word Sizes (ML62290.1).

Has Anyone Ever Tested This Stuff?
I have…a lot! But I really doubt we are talking about running 32 bit Oracle on 32 bit hardware. Nobody even makes a native 32 bit x86 server these days (that I know of). I think the question at hand is more about 32 bit Oracle on 32 bit Linux with x86_64 hardware.

There has always been the big question about what 64 bit software performance is like when the workload possesses no characteristics that would naturally benefit from the larger address space. For instance, what about a small number of users attached to an SGA of 1GB and the total PGA footprint is no more than 1GB. That’s a workload that doesn’t need 64 bit. Moreover, what if the comparison is between 32 bit and 64 bit software running on the same server (e.g., AMD Opteron). In this case, the question gets more interesting. After all, the processor caches are the same, the memory->processor bandwidth is constant, the drivers can all DMA just fine. The answer is an emphatic yes! But yes, what? Yes there are occasions where 64 bit code will dramatically outperform 32 bit code on dual-personality 64 bit processors (e.g., AMD Opteron). It is all about porting. Let me explain.

The problem with running sophisticated 32 bit code on 64 bit processors is that the 32 bit code was most likely ported with a different processor cache line size in mind. This is important:

Native 32 bit x86 processors use a 32 byte cache line size whereas 64 bit processors (e.g., AMD64, EM64T) use a 64 byte cache line.
That means, in the case of a native 32 bit processor, load/store and coherency operations are performed on a swath of 32 bytes. Yes, there were exceptions like the Sequent NUMA-Q 2000 which had two different cache line sizes-but that was a prior life for me. Understanding cache line size and how it affects coherency operations is key to database throughput. And unlike Microsoft who never had to do the hard work of porting (IA64 not withstanding), Oracle pays very close attention to this topic. In the case of x86 Linux Oracle, the porting teams presumed the code was going to run on native 32 bit processors-a reasonable presumption.

What Aspects of Bitness Really Matter?
The area of the server that this topic impacts the most (by far) is latching. Sure, you use the database server to manage your data and the accesses to your data seem quite frequent to you (thousands of accesses per second), but that pales in comparison to the rate at which system memory is accessed for latches. These operations occur on the order of millions of times per second. Moreover, accesses to latches are write-intensive and most-painfully contended across multiple CPUs which results in a tremendous amount of bandwidth used for cache coherency. Spinlocks (latches) require attention to detail-period. Just sum up the misses and gets on all the latches during a processor-saturated workload sometime and you’ll see what I mean. What’s this have to do with 32 versus 64 bit?

It’s All About The Port
At porting time, Oracle pays close attention to ensure that latch structures fit within cache lines in a manner that eliminates false sharing. Remember, processors don’t really read or write single words. They read/write or invalidate the entire line that a given word resides in-at least when processor-to-memory operations occur. Imagine, therefore, a latch structure that is, say, 120 bytes long and that the actual latch word is the first element of the structure. Next, imagine that there are only 2 latches in our imaginary server and we are running on a 32 bit OS on a native 32 bit system such as Pentium IV or Xeon MP (Foster 32 bit) and therefore a 32 byte cache line size. We allocate and initialize our 2 structures at instance startup. These structures will lay out in 240 bytes within a single memory page. Since were dutiful enough to align our two structures on a page boundry, what we have is the first structure resides in the first 120 bytes of the memory page-the first 4 32 byte cache lines. But wait, there are 12 extra bytes in the 4th cache line. Doesn’t that mean the first 12 bytes of the second latch structure are going to share space in the 4th cache line? Not if you are good at porting. And in our example, we are.

That’s right, we were aware of our cache line size (32 bytes) so we padded the structure by allocating an array of unsigned integers (4 bytes) three deep as the last element of our structure. Now our latch structure is precisely 132 bytes or 4 lines. Finally, we have our imaginary 32 bit code optimized for a real 32 bit system (and therefore a 32 byte cache line size). That is, we have optimized our 32 bit software for our presumed platform which is 32 bit hardware. Now, if half of the CPUs in the box are hammering the first latch, there is no false sharing with the second. What’s this got to do with hardware bitness?The answer is in the fact that Oracle ports the x86 Linux release with a 32 bit system in mind.

Running 32 bit Code on a 64 bit CPU
The devil is in the details. Thus far our imaginary 2-latch system is optimized for hardware that operates on a 32 byte line. Since our structures fit within 4 32 byte lines or 2 64 byte lines should we execute on a x86_64 system there would be no false sharing so we must also be safe for a system with a 64 byte line, no? Well, true, there will be no false sharing between the two structures since they are now 2 64 byte lines as opposed to 4 32byte lines, but there is more to it.
Do you think it’s possible that the actual latch word in the structure might be adjacent (same 32 bytes) to anything interesting? Remember, heavily contended latches are constantly being tried for by processes on other CPUs. So if the holder of the latch writes on any other word in the cache line that holds the latch word, the processor coherency subsystem invalidates that line. To the other CPUs with processes spinning on the latch, this invalidation “looks” like the latch itself has been freed (changed from on to off) when in fact the latch is still held but an adjacent word in the same line was modified. This sort of madness absolutely thrashes a system. So, the dutiful port engineer rearranges the elements of the latch structure so that there is nothing else ever to be written in the same cache line that has the actual latch word. But remember, we ported to a 32 bit system with a 32 byte line. On the other hand, if you run this code on a 64 bit system–and therefore 64 byte lines–all of your effort to isolate the latch word from other write-mostly words was for naught. That is, if the cache line is now 64 bytes, any write by the latch holder in the first 64 bytes of the structure will cause invalidations (cache thrashing) for other processes trying to acquire the latch (spinning) on other CPUs. This isn’t a false sharing issue between 2 structures, but it has about the same effect.

Difficult Choices – 32 bit Software Optimized for 64 bit Hardware.
What if the porting engineer of our imaginary 2-latch system were to somehow know that the majority of 32 bit Linux servers would some day end up being 64 bit servers compatible with 32 bit software? Well, then he’d surely pad out the structure so that there are no frequently written words in the same 64 bytes in which the latch word resides. If the latch structure we have to begin with is 120 bytes, odds are quite slim that the percentage of read-mostly words will facilitate our need to pack the first 64 bytes with read-mostly objects along side the latch word. It’s a latch folks, it is not a read-mostly object! So what to do? Vapor!

Let’s say our 120 byte latch structure is a simple set of 30 words each being 4 bytes (remember we are dealing with a 32 bit port here). Let’s say further that there are only 4 read-mostly words in the bunch. In our imaginary 2 latch example, we’d have to set up the structure so that the first word is the latch, and the next 16 bytes are the 4 read-mostly elements. Now we have 20 bytes that need protection. To optimize this 32 bit code for a 64 bit processor, we’ll have to pad out to 64 bytes-with junk. So we’ll put an array of unsigned integers 11 deep (44 bytes) immediately after the latch word and our 4 read-mostly words. That fits nicely in 64 bytes-at the cost of wasting 44 bytes of processor cache for every single latch that comes through our processor caches. Think cache buffers chains folks! We aren’t done though.

We started with 120 bytes (30 4 byte words) and have placed only 5 of those words into their own cache line. We have 25 words, or 100 bytes left to deal with. Remember, we are the poor porting engineer that is doing an imaginary 32 bit software port optimized for 64 bit servers since nobody makes 32 bit servers any more. So, we’ll let the first 64 bytes of the remaining 100 fall into their own line. That leaves 36 bytes that we’ll also have to pad out to 64 bytes-there goes another 28 bytes of vapor. All told, we started with 120 bytes and wound up allocating 192 bytes so that our code will perform optimally on a processor that uses a 64 byte cache line. That’s a 60% increase in the footprint we leave on our processor caches which aren’t very large to start with. That’s how Oracle would have to optimize 32 bit Oracle for a 64 bit processor (x86 code on x86_64 kit). But they don’t because that would have been crazy. After all, 32 bit Oracle was intended to run on 32 bit hardware.

Porting, sound easy? Let me throw this one in there. It just so happens that the Oracle latch structure was in fact 120 bytes in Oracle8i on certain ports. Oh, and lest I forget, remember that Oracle keeps track of latch misses. What’s that got to do with this? Uh, that means processes that do not hold the latch increment counters in the latch structure when they miss. Imagine having one of those miss count words in the same line as the latch word itself!

This is tricky stuff.
Who Uses 32 bit Linux for Oracle These Days?
Finally, bugler, sound Taps.

A thread on oracle-l the other day got me thinking. The thread was about the difficulties being endured at a particular Linux RAC site that prompted the DBA there to audit what RPMs he has on the system. It appears as though everything installed was a revision “as high or higher” based on Oracle’s documented requirements. In his request for information from the list, I noticed uname(1) output that suggested he is using a 32 bit RHEL 4 system.

One place I always check for configuration information is Oracle’s Validated Configurations web page. This page covers Linux recipes for installation success. I just looked there to see if there was any help I could give offer that DBA and found that there are no 32 bit validated configurations!

I know there is a lot of 32 bit x86 hardware out there, but I doubt it is even possible to buy one today. Except for training or testing purposes I just can’t muster a reason to even use 32 bit Linux servers for the database tier at this point and to be honest, running a 32 bit port of Oracle on an x86_64 processor makes very little sense to me as well.

Audited TPC-C Proves SQL Server is Better Than Oracle

Published June 29, 2007 Oracle performance 5 Comments

Some time back I had a blog thread going about Oracle on AMD’s upcoming quad-core “Barcelona” processor. The thread (found here ) took AMD’s published, promotional material that set expectations for TPC-C throughput. At first there were projections from AMD that Barcelona would deliver 70% better throughput than Opteron 2200 systems. Later, as I blogged in this blog entry, AMD suggested we should expect as much as a 40% improvement over the Intel Xeon 5355 “Cloverdale” processor. The whole series of Barcelona-related posts was based on my analysis of Oracle licensing cost vis a vis Barcelona since Oracle licenses per core. In that series I used TPC-C results to make my points and at one juncture I used a TPC-C result from an Intel Xeon 5355 with SQL Server in one of my calculations. Folks got up in arms about that.

It seems people aren’t aware that all three major database players get about the same TPC-C per core on commodity hardware because the TPC-C workload has been optimized to the maximum. It’s a hardware benchmark folks.

This blog entry is to draw attention to the error of my ways.

My analysis of the potential for Oracle performance on Barcelona was covered in detail in that series of posts, but the TPC-C result I used in some of my calculations–thus raising the ire of some readers–was this 2-socket Xeon 5355 result with SQL Server. Certain readers thought it was absurd to collude differing database vendors’ TPC-C numbers into my calculations in spite of how many times I’ve blogged about the fact that TPC-C is a hardware benchmark, not a software benchmark. Oh well, you can’t win them all.

Confession is Good for The Soul
But what does this have to do with the Barcelona thread. Well, the Barcelona thread is just how we got here. So now I admit the error of my ways. During my series on Barcelona, I used this TPC-C result showing that Xeon 5355 can do 30,092 TpmC per core—with SQL Server. I factored that TPC-C number into Oracle cost per core comparison to AMD’s prediction on what Barcelona might do. After all, AMD is predicting they’ll beat out Xeon 5355 by 40% so I needed a Xeon 5355 number. Egad! Using a TPC-C number, without regard for database product, to compare hardware platforms! Shame on me. It turns out I was wrong. How wrong?

A Lower TPC-C Result Proves Oracle is Better
Yes, it’s true. A lower result can be a better result. It turns out that Oracle did eventually publish a TPC-C result on Xeon 5355-based gear. The result came in at 100,924 TpmC or 25,231 TpmC per core—19% lower than the SQL Server result of 30,092 TpmC/core. SQL Server simply must be a better database!

Not even close. Yes, the SQL Server number is 19% better on a per-core basis, but the cost to produce that result was $1.85 per TpmC whereas Oracle’s result was only 42% of that cost or $.78 per TpmC and it’s easy to see why. The SQL Server result was obtained with 64GB main memory whereas Oracle’s result used only 24GB. But that isn’t all. The SQL Server result came from a configuration that included, get this, 552 disk drives compared to the Oracle result that only required 104 drives.

The Tale of the Tape
You heard it here first! SQL Server is a better database than Oracle. All you have to do is throw in 5.3 fold more disk drives and an additional 2.6 fold main memory, shake vigorously and out plops a 19% performance improvement. Not bad for 2.4 fold additional cost!

Summary
SQL Server is not better than Oracle. By the way, AMD Barcelona most likely won’t deliver 40% more throughput measured in TpmC than Intel Xeon 5355.

Manly Men Only Deploy Oracle with Fibre Channel – Part 1. Oracle Over NFS is Weird.

Published June 14, 2007 NFS CFS ASM , oracle , Oracle NAS , Oracle NFS , Oracle on RHEL5 , Oracle over NFS , Oracle performance , Oracle RDBMS , Oracle SAN Topics , Oracle Storage Related Problems 20 Comments

Beware, lot’s of tongue in cheek in this one. If you’re not the least bit interested in storage protocols, saving money or a helpful formula for safely configuring I/O bandwidth for Oracle, don’t read this.

I was reading Pawel Barut’s Log Buffer #48 when the following phrase caught my attention:

For many of Oracle DBAs it might be weird idea: Kevin Closson is proposing to install Oracle over NFS. He states that it’s cheaper, simpler and will be even better with upcoming Oracle 11g.

Yes, I have links to several of my blog entries about Oracle over NFS on my CFS, NFS, ASM page, but that is not what I want to blog about. I’m blogging specifically about Powet’s assertion that “it might be a weird idea”—referring to using NAS via NFS for Oracle database deployments.

Weird
I think the most common misconception people have is regarding the performance of such a configuration. True, NFS has a lot of overhead that would surely tax the Oracle server way too much—that is if Oracle didn’t take steps to alleviate the overhead. The primary overhead is in NFS client-side caching. Forget about it. Direct I/O and asynchronous I/O are available to the Oracle server for NFS files with just about every NFS client out there.

Manly Men™ Choose Fibre Channel
I hear it all the time when I’m out in the field or on the phone with prospects. First I see the wheels turning while math is being done in the head. Then, one of those cartoon thought bubbles pops up with the following:

Hold it, that Closson guy must not be a Manly Man™. Did he just say NFS over Gigabit Ethernet? Ugh, I am Manly Man and I must have 4Gb Fibre Channel or my Oracle database will surely starve for I/O!

Yep, I’ve been caught! Gasp, 4Gb has more bandwidth than 1Gb. I have never recommended running a single path to storage though.

Bonding Network Interfaces
Yes, it can be tricky to work out 802.3ad Link Aggregation, but it is more than possible to have double or triple bonded paths to the storage. And yes, scalability of bonded NICs varies, but there is a simplicity and cost savings (e.g., no FCP HBAs or expensive FC switches) with NFS that cannot be overlooked. And, come in closely and don’t tell a soul, you won’t have to think about bonding NICs for Oracle over NFS forever, wink, wink, nudge, nudge.

But, alas, Manly Man doesn’t need simplicity! Ok, ok, I’m just funning around.

No More Wild Guesses
A very safe rule of thumb to keep your Oracle database servers from starving for I/O is:
100Mb I/O per GHz CPU

So, for example, if you wanted to make sure an HP c-Class server blade with 2-socket 2.66 GHz “Cloverdale” Xeon processors had sufficient I/O for Oracle, the math would look like this:

12 * 2.66 * 4 * 2 == 255 MB/s

Since the Xeon 5355 is a quad-core processor and the 480c c-Class blade supports two of them there are 21.28 GHz for the formula. And, 100 Mb is about 12 MB. So if Manly Man configures, say, two 4Gb FC paths (for redundancy) to the same c-Class blade he is allocating about 1000 MB/s bandwidth. Simply put, that is expensive overkill. Why? Well, for starters, the blade would be 100% saturated at the bus level if it did anything with 1000 MB/s so it certainly couldn’t satisfy Oracle performing physical I/O and actually touching the blocks (e.g., filtering, sorting, grouping, etc). But what if Manly Man configured the two 4Gb FCP paths for failover with only 1 path active path (approximately 500 MB/s bandwidth)? That is still overkill.

Now don’t get me wrong. I am well aware that 2 “Cloverdale” Xeons running Parallel Query can scoop up 500MB/s from disk without saturating the server. It turns out that simple light weight scans (e.g., select count(*) ) are about the only Oracle functionality that breaks the rule of 100Mb I/O per GHz CPU. I’ve even proven that countless times such as in this dual processor, single core Opteron 2.8 Ghz proof point. In that test I had IBM LS20 blades configured with dual processor, single-core Opterons clocked at 2.8 GHz. So if I plug that into the formula I’d use 5.6 for the GHz figure which supposedly yields 67 MB/s as the throughput at which those processors should have been saturated. However, on page 16 of this paper I show those two little single-core Opterons scanning disk at the rate of approximately 380MB/s. How is that? The formula must be wrong!

No, it’s not wrong. When Oracle is doing a light weight scan it is doing very, very little with the blocks of data being returned from disk. On the other hand, if you read further in that paper, you’ll see on page 17 that a measly 21MB/s of data loading saturated both processors on a single node-due to the amount of data manipulation required by SQL*Loader. OLTP goes further. Generally, when Oracle is doing OLTP, as few as 3,000 IOps from each processor core will result in total saturation. There is a lot of CPU intensive stuff wrapped around those 3,000 IOps. Yes, it varies, but look at your OLTP workload and take note of the processor utilization when/if the cores are performing on the order of 3,000 IOps each. Yes, I know, most real-world Oracle databases don’t even do 3,000 IOps for an entire server which takes us right back to the point: 100Mb I/O per GHz CPU is a good, safe reference point.

What Does the 800 Pound Gorilla Have To Say?
When it comes to NFS, Network Appliance is the 800lb gorilla. They have worked very hard to get to where they are. See, Network Appliance likely doesn’t care if Manly Man would rather deploy FCP for Oracle instead of NFS since their products do both protocols-and iSCSI too. All told, they may stand to make more money if Manly Man does in fact go with FCP since they may have the opportunity to sell expensive switches too. But, no, Network Appliance dispels the notion that 4Gb (or even 2Gb) FCP for Oracle is a must.

In this NetApp paper about FCP vs iSCSI and NFS, measurements are offered that show equal performance with DSS-style workloads (Figure 4) and only about 21% deficit when comparing OLTP on FCP to NFS. How’s that? The paper points out that the FCP test was fitted with 2Gb Fibre Channel HBAs and the NFS case had two GbE paths to storage yet Manly Man only achieved 21% more OLTP throughput. If NFS was so inherently unfit for Oracle, this test case with bandwidth parity would have surely made the point clear. But that wasn’t the case.

If you look at Figure 2 in that paper, you’ll see that the NFS case (with jumbo frames) spent 31.5% of cycles in kernel mode compared to 22.4% in the FCP case. How interesting. The NFS case lost 28% more CPU to kernel mode overhead and delivered 21% less OLTP throughput. Manly Man must surely see that addressing that 28% extra kernel mode overhead associated with NFS will bring OLTP throughput right in line with FCP and:

– NFS is simpler to configure

– NFS can be used for RAC and non-RAC

– NFS is cheaper since GbE is cheaper (per throughout) than FCP

Now isn’t that weird?

The 28%.

I can’t tell you how and when the 28% additional kernel-mode overhead gets addressed, but, um, it does. So, Manly Man, time to invent the wheel.

	kevinclosson on Announcing SLOB 2.5.4
	Hell Dip on Announcing SLOB 2.5.4
	kevinclosson on Introducing SLOB – The S…
	Amey Bobade on Introducing SLOB – The S…
	Amey Bobade on Introducing SLOB – The S…

Kevin Closson's Blog: Platforms, Databases and Storage

Archive for the 'Oracle performance' Category

SLOB Chewed Up All My File System Space and Spit It Out. But, Why?

Summary

Oracle Exadata Storage Server: A Black Box with No Statistics.

Databases are the Contents of Storage. Future Oracle DBAs Can Administer More. Why Would They Want To?

What Oracle Topic Is On Your Coffee Table?

Oracle11g Automatic Memory Management – Part III. A NUMA Issue.

Oracle11g Automatic Memory Management – Part I. Linux Hugepages Support.

Over-Configuring DBWR Processes – Part III

Over-Configuring DBWR Processes – Part II.

Manly Men Only Use Solid State Disk For Redo Logging. LGWR I/O is Simple, But Not LGWR Processing

Oracle Doesn’t Scale Without Multiple Log Writer Processes

Manly Men Only Deploy Oracle with Fibre Channel – Part VIII. After All, Oracle Doesn’t Support Async I/O on NFS

Manly Men Only Deploy Oracle with 64 Bit Linux – Part I. What About a x86 Port on EM64T/AMD64 Hardware?

Audited TPC-C Proves SQL Server is Better Than Oracle

DISCLAIMER

Pages

Blogroll

Follow Blog via Email

Recent Posts

Recent Comments

Fond Memories

Copyright

Archive for the 'Oracle performance' Category

Summary

Share this:

Executive Summary

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

DISCLAIMER

Pages

Blogroll

Follow Blog via Email

Recent Posts

Recent Comments

Fond Memories

Copyright