Modern Servers Are Better Than You Think For Oracle Database – Part I. What Problems Actually Need To Be Fixed? | Kevin Closson's Blog: Platforms, Databases and Storage

Modern Servers Are Better Than You Think For Oracle Database – Part I. What Problems Actually Need To Be Fixed?

Blog update 2012.02.28: I’ve received countless inquiries about the storage used in the proof points I’m making in this post. I’d like to state clearly that the storage is not a production product, not a glimpse of something that may eventually become product or any such thing. This is a post about CPU, not about storage. That point will be clear as you read the words in the post.

In my recent article entitled “How Many Non-Exadata RAC Licenses Do You Need to Match Exadata Performance” I brought up the topic of processor requirements for Oracle with and without Exadata. I find the topic intriguing. It is my opinion that anyone influencing how their company’s Oracle-related IT budget is used should find it intriguing too.

Before I can address the poll in the above-mentioned post I have to lay some groundwork. That groundwork will come in this and an as-yet-unknown number of further installments in the series.

Exadata for OLTP

There is no value add for Oracle Database on Exadata in the OLTP/ERP use case. Full stop. OLTP/ERP does not offload processing to storage. Your full-rack Exadata configuration has 168 Xeon 5600 cores in the storage grid doing practically nothing in this use case. Or, I should say, the processing that does occur in the Exadata storage cells (in the OLTP/ERP use case) would be better handled in the database host. There simply is no value in introducing off-host I/O handling (and all the associated communication overhead) for random single-block accesses. Additionally, since Exadata cannot scale random writes, it is actually a very weak platform for these use cases. Allow me to explain.

Exadata Random Write I/O
While it is true Exadata offers the bandwidth for upwards of 1.5 million read IOPS (with low latency) in a full rack X2 configuration, the data sheet specification for random writes is a paltry 50,000 gross IOPS—or 25,000 with Automatic Storage Management normal redundancy. Applications do not exhibit 60:1 read to write ratios. Exadata bottlenecks on random writes long before an application can realize the Exadata Smart Flash Cache datasheet random read rates.
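
To make the 60:1 point concrete, here is a minimal Python sketch of the arithmetic. The datasheet figures are the ones quoted above; the function names and the 4:1 application mix are hypothetical, chosen only for illustration, and the halving assumes ASM normal redundancy writes each block twice.

```python
# Datasheet figures quoted in the text for a full-rack X2 configuration.
READ_IOPS_CEILING = 1_500_000
WRITE_IOPS_GROSS = 50_000
MIRROR_COPIES = 2  # assumption: ASM normal redundancy writes each block twice


def net_write_iops(gross_iops, copies):
    """Application-visible write IOPS once mirroring is paid for."""
    return gross_iops // copies


def balanced_ratio(read_ceiling, write_ceiling):
    """Read:write ratio needed to hit both ceilings at the same time."""
    return read_ceiling / write_ceiling


def reachable_read_iops(read_to_write_ratio, read_ceiling, write_ceiling):
    """Read IOPS an application can reach before writes become the limit."""
    return min(read_ceiling, read_to_write_ratio * write_ceiling)


write_ceiling = net_write_iops(WRITE_IOPS_GROSS, MIRROR_COPIES)
ratio = balanced_ratio(READ_IOPS_CEILING, write_ceiling)

# Hypothetical application doing 4 reads for every write.
reads_at_4_to_1 = reachable_read_iops(4, READ_IOPS_CEILING, write_ceiling)

print(f"net write ceiling: {write_ceiling} IOPS")
print(f"ratio needed to saturate reads: {ratio:.0f}:1")
print(f"read IOPS reachable at 4:1: {reads_at_4_to_1}")
```

Under these assumptions an application with a 4:1 mix would stall on writes at a small fraction of the datasheet read rate, which is the bottleneck described above.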

Exadata for DW/BI/Analytics

Oracle positions Exadata against products like EMC Greenplum for DW/BI/Analytics workloads. I fully understand this positioning because DW/BI is the primary use case for Exadata. At its inception Exadata addressed very important problems related to data flow. The situation as it stands today, however, is that Exadata addresses problems that no longer exist. Once again, allow me to explain.

The Scourge Of The Front-Side Bus Is Ancient History. That’s Important!
It was not long ago that provisioning ample bandwidth to Real Application Clusters for high-bandwidth scans was very difficult. I understand that. I also understand that, back in those days, commodity servers suffered from internal bandwidth problems limiting a server’s data-ingest capability from storage (PCI->CPU core). I speak of servers in the pre-Quick Path Interconnect (Nehalem EP) days. In those days it made little sense to connect more than, say, two active 4GFC fibre channel paths (~800 MB/s) to a server because the data would not flow unimpeded from storage to the processors. The bottleneck was the front-side bus choking off the flow of data from storage to processor cores. This fact essentially forced Oracle’s customers to create larger, more complex clusters for their RAC deployments just to accommodate the needed flow of data (throughput). That is, while some customers toiled with the most basic problems (e.g., storage connectivity), others solved that problem but still required larger clusters to get more front-side buses involved.

It wasn’t really about the processor cores. It was about the bus. Enter Exadata and storage offload processing.

Because the servers of yesteryear had bottlenecks between the storage adapters and the CPU cores (the front-side bus) it was necessary for Oracle to devise a means for reducing payload between storage and RAC host CPUs. Oracle chose to offload the I/O handling (calls to the kernel for physical I/O), filtration and column projection to storage. This functionality is known as a Smart Scan. Let’s just forget for a moment that the majority of CPU-intensive processing in a DW/BI query occurs after filtration and projection (e.g., table joins, sorts and aggregation). Shame on me, I digress.

All right, so imagine for a moment that modern servers don’t really need the offload-processing “help” offered by Exadata. What if modern servers can actually handle data at extreme rates of throughput from storage, over PCI and into the processor cores without offloading the lower level I/O and filtration? Well, the answer to that comes down to how many processor cores are involved with the functionality that is offloaded to Exadata. That is a sophisticated topic, but I don’t think we are ready to tackle it yet because the majority of datacenter folks I interact with suffer from a bit of EarthStillFlat(tm) syndrome. That is, most folks don’t know their servers. They still think it takes lots and lots of processor cores to handle data flow like it did when processor cores were held hostage by front-side bus bottlenecks. In short, we can’t investigate how necessary offload processing is if we don’t know anything about the servers we intend to benefit with said offload. After all, Oracle Database is the same software whether running on a Xeon 5600-based server in an Exadata rack or a Xeon 5600-based server not in an Exadata rack.

Know Your Servers

It is possible to know your servers. You just have to measure.

You might be surprised at how capable they are. Why presume modern servers need the help of offloading I/O (handling) and filtration? You license Oracle by the processor core so it is worthwhile knowing what those cores are capable of. I know my server and what it is capable of. Allow me to share a few things I know about my server’s capabilities.

My server is a very common platform as the following screenshot will show. It is a simple 2s12c24t Xeon 5600 (a.k.a. Westmere EP) server:

My server is attached to very high-performance storage which is presented to an Oracle database via Oracle Managed Files residing in an XFS file system in a md(4) software RAID volume. The following screenshot shows this association/hierarchy as well as the fact that the files are accessed with direct, asynchronous I/O. The screenshot also shows that the database is able to scan a table with 1 billion rows (206 GB) in 45 seconds (roughly 4.6 GB/s, or about 4,700 MB/s, of table scan throughput):

The io.sql script accounts for the volume of data that must be ingested to count the billion rows:

$ cat io.sql
set timing off
col physical_reads_GB format 999,999,999;      
select VALUE /1024 /1024 /1024 physical_reads_GB from v$sysstat where STATISTIC# =
(select statistic# from v$statname where name like '%physical read bytes%');
set timing on
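
A script like io.sql is typically run before and after a query, and the difference between the two readings is the volume of data ingested. The following Python sketch shows that arithmetic only. The function names and the starting counter value are hypothetical; the 206 GB scan size is from this post, and the 44.84-second elapsed time is the figure quoted in the comments below (rounded to 45 seconds above).

```python
# Sketch only: helper names and the "before" counter value are made up.
BYTES_PER_GB = 1024 ** 3  # io.sql divides the byte counter by 1024 three times


def gb_between(bytes_before, bytes_after):
    """GB physically read between two samples of the byte counter."""
    return (bytes_after - bytes_before) / BYTES_PER_GB


def scan_rate_gb_per_s(bytes_before, bytes_after, elapsed_seconds):
    """Average scan throughput over the sampled interval."""
    return gb_between(bytes_before, bytes_after) / elapsed_seconds


before = 10 * BYTES_PER_GB            # hypothetical reading before the query
after = before + 206 * BYTES_PER_GB   # reading after the 206 GB table scan

scanned_gb = gb_between(before, after)
rate = scan_rate_gb_per_s(before, after, 44.84)
print(f"{scanned_gb:.0f} GB scanned at {rate:.2f} GB/s")
```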

So this simple test shows that a 2s12c24t server is able to process 392 MB/s per processor core. When Exadata was introduced most data centers used 4GFC fibre channel for storage connectivity. The servers of the day were bandwidth limited. If only I could teleport my 2-socket Xeon 5600 server back in time and put it next to an Exadata V1 box. Once there, I’d be able to demonstrate a 2-socket server capable of handling the flow of data from 12 active 4GFC FC HBA ports! I’d be the talk of the town because similar servers of that era could neither connect as many active FC HBAs nor ingest the data flowing over the wires—the front-side bus was the bottleneck. But, the earth does not remain flat.
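
For anyone who wants to check the per-core figure, here is a small Python sketch of the arithmetic, assuming 1 GB = 1,024 MB and the 44.84-second elapsed time quoted in the comments below. The variable names are mine. Note that the divisor is the 12 cores, not the 24 hardware threads.

```python
import math

SCAN_GB = 206          # table scan volume from the post
ELAPSED_S = 44.84      # elapsed time quoted in the comments (post rounds to 45)
CORES = 12             # 2s12c24t: 2 sockets, 12 cores, 24 threads
THREADS = 24
FC4_MB_PER_S = 400     # one active 4GFC path, per the post

scan_mb_per_s = SCAN_GB * 1024 / ELAPSED_S
per_core = scan_mb_per_s / CORES
per_thread = scan_mb_per_s / THREADS   # dividing by threads halves the figure
fc4_ports = math.ceil(scan_mb_per_s / FC4_MB_PER_S)

print(f"scan rate:  {scan_mb_per_s:,.0f} MB/s")
print(f"per core:   {per_core:.0f} MB/s")
print(f"per thread: {per_thread:.0f} MB/s")
print(f"4GFC ports needed to carry it: {fc4_ports}")
```

Dividing by 24 instead of 12 is an easy mistake to make with the 2s12c24t notation, and it yields half the per-core rate.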

The following screenshot shows the results of five SQL statements explained as:

1. This SQL scans all 206 GB, locates the 4 char columns (projection) in each row and nibbles the first char of each. The rate of throughput is 2,812 MB/s. There is no filtration.
2. This SQL ingests all the date columns from all rows and maintains 2,481 MB/s. There is no filtration.
3. This SQL combines the efforts of the previous two queries, which brings the throughput down to 1,278 MB/s. There is no filtration.
4. This SQL processes the entire data mass of all columns in each row and maintains 1,528 MB/s. There is no filtration.
5. The last SQL statement introduces filtration. Here we see that the platform is able to scan and selectively discard all rows (based on a date predicate) at the rate of 4,882 MB/s. This would be akin to a fully offloaded scan in Exadata that returns no rows.
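
Dividing each of those rates by the 12 cores in the server gives the per-core view. The Python sketch below does just that; the throughput numbers are the five quoted above, while the short labels are my own shorthand for the queries.

```python
CORES = 12  # 2s12c24t Xeon 5600 server

# Throughput in MB/s for the five statements, in the order described above.
rates_mb_per_s = {
    "first char of 4 char columns": 2812,
    "all date columns": 2481,
    "char and date columns combined": 1278,
    "all columns": 1528,
    "date predicate, no rows returned": 4882,
}

per_core = {label: mb / CORES for label, mb in rates_mb_per_s.items()}

for label, mb in rates_mb_per_s.items():
    print(f"{label:<34} {mb:>5} MB/s  {per_core[label]:6.1f} MB/s per core")
```

The fastest and slowest cases work out to about 406.8 and 106.5 MB/s per core, which is where rounded figures of 407 and 107 MB/s per core come from.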

Summary

This blog series aims to find good answers to the question I raised in my recent article entitled “How Many Non-Exadata RAC Licenses Do You Need to Match Exadata Performance.” I’ve explained that offload to Exadata storage consists of payload reduction. I also offered a technical, historical perspective as to why that was so important. I’ve also shown that a small, modern QPI-based server can flow data through processor cores at rates ranging from 407 MB/s per core down to 107 MB/s per core depending on what the SQL is doing (SQL with no predicates mind you).

Since payload reduction is the primary value add of Exadata I finished this installment in the series with an example of a simple 2s12c24t Xeon 5600 server filtering out all rows at a rate of 4,882 MB/s—essentially the same throughput as a simple count(*) of all rows as I showed earlier in this post. That is to say that, thus far, I’ve shown that my little lab system can sustain nearly 5GB/s disk throughput whether performing a simple count of rows or filtering out all rows (based on a date predicate). What’s missing here is the processor cost associated with the filtration and I’ll get to that soon enough.

We can’t accurately estimate the benefit of offload until we can accurately associate CPU cost to filtration. I’ll take this blog series to that point over the next few installments—so long as this topic isn’t too boring for my blog readers.

This is part I in the series. At this point I hope you are beginning to realize that modern servers are better than you probably thought. Moreover, I hope my words about the history of front-side bus impact on sizing systems for Real Application Clusters are starting to make sense. If not, by all means please comment.

As this blog series progresses I aim to help folks better appreciate the costs of performing certain aspects of Oracle query processing on modern hardware. The more we know about modern servers, the closer we can get to answering the poll accurately. You license Oracle by the processor core so it behooves you to know such things…doesn’t it?

By the way, modern storage networking has advanced far beyond 4GFC (400 MB/s).

Finally, as you can tell by my glee in scanning Oracle data from an XFS file system at nearly 5GB/s (direct I/O), I’m quite pleased at the demise of the front-side bus! Unless I’m mistaken, a cluster of such servers, with really fast storage, would be quite a configuration.

27 Responses to “Modern Servers Are Better Than You Think For Oracle Database – Part I. What Problems Actually Need To Be Fixed?”


1 Freek February 28, 2012 at 1:14 am

Kevin,

Does your usage of a software RAID mean that the LUNs were residing on more than one storage box, or is there another reason why you used it?

- 2 kevinclosson February 28, 2012 at 5:37 am
  
  Freek,
  
  Yes there are multiple LUNs. The series is focused on CPU, not storage (yet). CPU is all that matters. Storage is a necessity, not *the* necessity.
  
3 oracledoug February 28, 2012 at 1:14 am

Terrific post, Kevin. Says in a much more elegant and informed way the things I’ve been rambling about in pubs for too many of the past 18 months or so 😉

4 Scott February 28, 2012 at 11:27 am

Thanks for the post. Could you provide the details on the plumbing from the server to the storage array and auto trace of the scripts?

- 5 kevinclosson February 28, 2012 at 11:35 am
  
  Hi Scott,
  
  I’m not using a production storage product. I can’t disclose what it is.
  
  I need to reiterate that storage is not the focus of this blog series. This is a series about CPU. Would you have believed that 12 cores of Xeon 5600 could sustain SQL query throughput of nearly 5GB/s if I hadn’t posted this? Most folks wouldn’t. Most folks would presume a CPU bottleneck.
  
6 Martin February 28, 2012 at 12:20 pm

Hi Kevin,

I would love to see the same test repeated on the Romley platform when it comes out. As processors get better and better this difference between Exadata and modern hardware will only become bigger. I recently asked the question in the pub whether starting an Exadata implementation now is still worth it. I got interesting replies.

- 7 oracledoug.com February 28, 2012 at 12:55 pm
  
  That’s because you go to interesting pubs with interesting company, Martin 😉 (Present company excepted)
  
- 8 kevinclosson February 28, 2012 at 12:58 pm
  
  Hi Martin,
  
  Um, would you be surprised if I told you I’ve got LGA-2011 (Patsburg) lit up with this (and other) workloads? 🙂
  
  We can be sure that Oracle will release a new model, something like “X3-2”, in about late summer, or maybe it will slip to OOW 2012. The problem is that the Exadata family gets messed with as far as positioning goes. Until I see proof otherwise the X2-8 model will still be based on the glue-less x4800 (aka G5) platform. That creates a problem for advancing the X2-8 to Sandy Bridge because (unless I come to understand otherwise) the Sandy Bridge EN family will not support 8S glue-less implementations. So that leaves IBM (X5) and HP (Prema) to deliver 8S Sandy Bridge.
  
  On a per-core basis we should find that Sandy Bridge EN is faster than Xeon E7. I don’t know by how much though. It will make things, um, uncomfortable for Exadata sales folks to push a full rack X2-8 with E7 stuff when a much smaller config (fewer cores, less RDBMS $Licensing) of Sandy Bridge X2-2 (or whatever it will be called) will outperform an E7-based X2-8.
  
  It is my opinion that it is smart for customers’ $$ to remain open for hardware as it comes from Intel. These “engineered systems” bundlings slow things down too much.
  
  I’ll be more than happy to hear that the Sun guys have figured out how to cobble an 8S Sandy Bridge EN into a glue-less QPI-midplane system like the x4800. I’m only human. I could be wrong.
  
  Your point is well taken: Sandy Bridge improves the equation I’m blogging about in this thread even more. Sandy Bridge goes even further to improve data flow into processor cores from a PCI standpoint. So I ask, what problems actually need to be fixed?
  
9 Dominic Delmolino March 1, 2012 at 11:08 am

Kevin — I thought Joe’s comments regarding the possibility of 8-way Sandy Bridge lined up with what you were saying about the x4800 as well.

http://sqlblog.com/blogs/joe_chang/archive/2011/11/29/intel-server-strategy-shift-with-sandy-bridge-en-ep.aspx

- 10 kevinclosson March 1, 2012 at 12:16 pm
  
  Hi Dominic,
  
  I actually know what I know 🙂 The part I had to shy away from is whether there are futures in the x4800 line that look more like the very good glue systems (HP PREMA, IBM X5). We’ll have to wait and see. If not, then I think the roadmap is lumpy for the X2-8.
  
  Thanks for stopping by.
  
11 Ted March 2, 2012 at 5:03 am

Great post; can’t wait to read the next installment.

12 Waseem March 8, 2012 at 10:24 am

Well, at the risk of sounding stupid, I have an audacious suggestion.
How about you compare HP ProLiant, Cisco UCS and an Exadata 1/4 rack? Yes, Exadata is a black box with no real-world benchmarks, but I guess something can be worked out here. Best to configure the UCS and ProLiant for 2-node RAC systems with comparable compute densities. Cost is altogether a different factor.

- 13 kevinclosson March 8, 2012 at 11:40 am
  
  Waseem,
  
  I don’t understand what you are getting at. If cost is not a factor what is the point?
  
14 Amir March 12, 2012 at 11:34 am

Hi Kevin,
Just curious to know how did you calculate 392MB/s:
“So this simple test shows that a 2s12c24t server is able to process 392 MB/s per processor core.”

206GB in 44.84 seconds = 4.6GB/s
With 24 cores = 4.6/24*1024 = 196MB/s per core

Thanks

- 15 kevinclosson March 12, 2012 at 12:09 pm
  
  Hi Amir,
  
  The platform is 2s12c24t. It’s 12 cores. Your divisor is 24.
  
  - 16 Amir March 12, 2012 at 12:31 pm
    
    Thanks Kevin for the clarification. I guess it exposed my lack of understanding of the x86 servers! A followup question, what kind of IO interface was used to pump 4.6GB/s data from the disk sub-system and to the FSB? I am assuming that it was IB based but just wanted to get confirmation.
    
    Thanks
    
    - 17 kevinclosson March 12, 2012 at 12:51 pm
      
      No problem, Amir.
      
      Actually, Amir, as odd as it may sound, this post is not about the I/O. It’s about CPU ability to handle the flow of data. This will become clear when I post Part II of this series.
      
      Let’s just think of the storage as a really fast black box. The libraries interfacing with the kernel are 100% standard (LibC and libaio) and the kernel I/O scheduler and block I/O layer are all standard so the relevant code is spot on.
      
      In short, this is not a storage thread. This is a host CPU thread.
      
18 jeffshukis April 25, 2012 at 12:30 pm

You are definitely on to something Kevin – current generation servers seem to have the IO problem largely solved. Here is another example:

I have a data warehouse development server assembled largely from used parts bought on eBay. It has four AMD 6172 CPUs, five LSI HBAs, and 25 cheap consumer-grade 120GB SATA3 SSD drives. All total it’s a $3K machine plus $4K in SSDs.

My little server didn’t cost much, but it runs databases like a dream. Oracle IO Calibration shows 7,200 MB/s and 357,000 IOPS. I replicated your billion-row table exactly and then ran the queries you demonstrated. Count(*) completes in 56 seconds, not quite as fast as your machine. Interestingly, however, all of the other queries complete even more quickly than in your demo. The query that took 2:44 on your machine completed in just 58 seconds on my machine, for example.

- 19 kevinclosson April 25, 2012 at 7:51 pm
  
  @jeffshukis : Cool stuff. I do like hearing that. Please get a copy of SLOB (google “oracle SLOB”) and throw it on there… tell us what it can do with real Oracle foreground IOPS. Thanks!
  
  - 20 jeffshukis April 26, 2012 at 6:27 am
    
    I’d love to try your SLOB benchmark but unfortunately it’s a Windows 2008R2 box right now. I intended to migrate it to Linux, but performance has been so far above my expectations that I don’t have a pressing reason to do so. It will happen, but probably not for four or five months while I focus on software features instead of raw performance.
    
    - 21 kevinclosson April 26, 2012 at 7:57 am
      
      @jeffshukis : Piece of cake. You only need the world’s smallest Linux host with client software installed so you can execute sqlplus. Use SQL*Net to hammer the server. There is no wire traffic while SLOB is running… just crack open runit.sh and modify the connect string to use a TNS service.
      
22 KD Mann May 30, 2012 at 4:48 pm

Hi Kevin,

Very enlightening post!

It occurs to me that the same server technology (QPI/Hypertransport replacing FSB) that kills the value prop for Exadata does the same thing to RAC itself.

This begs the question; in a world where all but the very largest databases can be serviced by a single fast-and-ultra-reliable Westmere EX machine, what is the value proposition for building hugely expensive RAC clusters vs a plain old failover model?

Especially where the latter can be spread over geographic distances and kill the DR and HA birds with a single stone?

Interested in your thoughts on this…

- 23 kevinclosson June 1, 2012 at 1:47 pm
  
  @KD Mann : RAC started as a way to aggregate processing power from multiple servers. Then it morphed in people’s minds into an availability solution. I disagree with the latter and agree with the former. The problem is that the former is no longer needed for the highest percentage of Oracle databases. In fact, most large servers are being chopped up into little servers through LPAR/VPAR/ZONES (etc). Just pick a right-sized server and an infrastructure (e.g., VMware) that enables you to shuttle a database system to larger hardware should it become needed.
  
  We need to all join the 21st century I feel.
  
  One thing I’ll point out is that with x86 larger systems mean slower systems. The processor efficiency (CPU cycles to execute any given code) is better (lower CPI) with 2S servers than with 4S. The cost of “being big” is paid in CPU efficiency. With 4 sockets only 25% of your memory is local, so 75% of your memory references carry a tax, and the tax varies by architecture in terms of how many hops memory transfers suffer on any given remote reference.
  
  With 2S it’s 50% and the tax is flat (~20%). This is why, for instance, Oracle TPC-C benchmarks get more TpmC/CPU on 2S servers than servers with more than 2 Sockets and the delta in terms of TpmC/CPU is *huge*.
  
24 josh September 30, 2012 at 10:57 am

Hi Kevin, I was reading through this post but could not find the 2nd part on your blog. I just wanted to check whether you managed to post the 2nd part of this series.

- 25 kevinclosson October 1, 2012 at 2:36 pm
  
  Hi Josh,
  
  Actually…I forgot to do it… I’ll have to loop back to it as soon as I can. Thanks for the reminder 🙂
  

1 State of Data #88 « Dr Data's Blog Trackback on March 1, 2012 at 8:45 pm
2 Chasing the Oracle Exadata Pot of Gold | Smarter Questions for a Smarter Planet Trackback on May 24, 2012 at 12:48 pm
