Last month I had the privilege of delivering the keynote session at the quarterly gathering of the Northern California Oracle User Group. My session was a set of vignettes on a theme of modern storage advancements. I misjudged how much time I had for the session, so I skipped over a section about how we sometimes still expect systems performance to add up to the sum of its parts. This blog post aims to dive into that topic.
To the best of my knowledge there is no marketing literature about the XtremIO Storage Array that suggests the array's performance is simply a function of the number of solid state disk (SSD) drives found in the device. Generally speaking, enterprise all-flash storage arrays are built to offer features and performance–otherwise they'd be more aptly named Just a Bunch of Flash (JBOF). The scope of this blog post is strictly enterprise storage.
Wild, And Crazy, Claims
Lately I’ve seen a particular slide–bearing Oracle’s logo and copyright notice–popping up to suggest that Exadata is vastly superior to EMC and Pure Storage arrays because of Exadata’s supposed unique ability to leverage aggregate flash bandwidth of all flash components in the Exadata X6 family. You might be able to guess by now that I aim to expose how invalid this claim is. To start things off I’ll show a screenshot of the slide as I’ve seen it. Throughout the post there will be references to materials I’m citing.
DISCLAIMER: The slide I am about to show was not obtained from oracle.com and it therefore may not, in fact, represent the official position of Oracle on the matter. That said, the slide does bear Oracle's logo and copyright notice! So, then, the slide:
I'll start by listing a few objections. My objections are always based on science and fact, so objecting to this content in particular is certainly appropriate.
- The slide (Figure 1) suggests an EMC XtremIO 4 X-Brick array is limited to 60 megabytes per second per “flash drive.”
- Objection: An XtremIO 4 X-Brick array has 100 Solid State Disks (SSDs)–25 per X-Brick. I don't know where the author got that data, but it is grossly mistaken. No, a 4 X-Brick array is not limited to 60 * 100 megabytes per second (6,000 MB/s). An XtremIO 4 X-Brick array is a 12 GB/s array: click here. In fact, even way back in 2014 I used Oracle Database 11g Real Application Clusters to scan at 10.5 GB/s with Parallel Query (click here). Remember, Parallel Query spends a non-trivial amount of time on IPC and work-brokering setup at the beginning of a scan involving multiple Real Application Clusters nodes. That query startup time impacts total scan elapsed time, thus 10.5 GB/s reflects the average scan rate including this "dead air" query startup time. Everyone who uses the Parallel Query Option is familiar with this overhead.
- The slide (Figure 1) suggests that 60 MB/s is “spinning disk level throughput.”
- Objection: Any 15K RPM SAS (12Gb) or FC hard disk drive easily delivers sequential scan throughput of more than 200 MB/s.
- The slide (Figure 1) suggests XtremIO cannot scale out.
- Objection: XtremIO architecture is 100% scale out, so this indictment is absurd. One can start with a single X-Brick and add up to 7 more. In the current generation, scaling out in this fashion adds 25 more SSDs, more storage controllers (CPU), and 4 more Fibre Channel ports per X-Brick.
- The slide (Figure 1) suggests “bottlenecks at server inputs” further retard throughput when using Fibre Channel.
- Objection: This is just silly. There are 4 x 8GFC host-side FC ports per XtremIO X-Brick (see the back-of-the-envelope sketch after this list). I routinely test Haswell-EP 2-socket hosts with 6 active 8GFC ports (3 cards) per host. Can a measly 2-socket host really drive 12 GB/s of Oracle scan bandwidth? Yes! No question. In fact, challenge me on that and I'll show AWR proof of a single 2-socket host sustaining Oracle table scan bandwidth of 18 GB/s. No, actually, I won't make anyone go to that much trouble. Instead, click the following link for AWR proof that a single host with two 6-core Haswell-EP (2s12c24t) processors can sustain Oracle Database 12c scan bandwidth of 18 GB/s: click here. I don't say it frequently enough, but it's true; you most likely do not know how powerful modern servers are!
- The slide (Figure 1) says Exadata achieves "full flash throughput."
- Objection: I’m laughing, but that claim is, in fact, the perfect segue to the next section.
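Before moving on, here is a quick back-of-the-envelope sketch of the host-side Fibre Channel math behind the 4 X-Brick objection above. This is only a sanity check, not vendor material; the ~800 MB/s of usable payload per 8GFC port is a common rule of thumb I'm assuming (8 Gb/s line rate with 8b/10b encoding), and the names below are just illustrative.

```python
# Back-of-the-envelope check of XtremIO host-side Fibre Channel bandwidth.
# Assumption: roughly 800 MB/s of usable payload per 8GFC port (8 Gb/s line
# rate, 8b/10b encoding). This is a rule of thumb, not a vendor figure.

PORTS_PER_XBRICK = 4          # host-side 8GFC ports per X-Brick
USABLE_MBPS_PER_8GFC = 800    # assumed usable throughput per port

def host_side_bandwidth_gbps(x_bricks: int) -> float:
    """Aggregate host-side FC bandwidth in GB/s for an XtremIO cluster."""
    return x_bricks * PORTS_PER_XBRICK * USABLE_MBPS_PER_8GFC / 1000

print(host_side_bandwidth_gbps(4))   # ~12.8 GB/s -- consistent with a 12 GB/s 4 X-Brick array
print(100 * 60 / 1000)               # 6.0 GB/s -- the slide's bogus 60 MB/s x 100 SSD arithmetic
```

The point of the sketch is simply that the host-side plumbing of a 4 X-Brick array lines up with the 12 GB/s figure cited above, not with the slide's 6 GB/s claim.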
Full Flash Throughput
Scan Bandwidth
The slide in Figure 1 accurately states that the NVMe flash cards in the Exadata X6 model are rated at 5.5 GB/s. This can be seen in the F320 datasheet. Click the following link for a screenshot of the F320 datasheet: click here. So the question becomes: can Exadata really achieve full utilization of all of the NVMe flash cards configured in the Exadata X6? The answer is no, but sort of. Please allow me to explain.
The following graph (Figure 2) shows data cited in the Exadata datasheet and depicts the reality of how close a full-rack Exadata X6 comes to realizing full flash potential.
As we know, a full-rack Exadata has 14 storage servers. The High Capacity (HC) model has 4 NVMe cards per storage server purposed as a flash cache. The HC model also comes with 12 7,200 RPM hard drives per storage server as per the datasheet.
The following graph shows that yes, indeed Exadata X6 does realize full flash potential when performing a fully-offloaded scan (Smart Scan). After all, 4 * 14 * 5.5 is 308 and the datasheet cites 301 GB/s scan performance for the HC model. This is fine and dandy but it means you have to put up with 168 (12 * 14) howling 7,200 RPM hard disks if you are really intent on harnessing the magic power of full-flash potential!
Why the sarcasm? It's simple really–just take a look at the graph and notice that the all-flash EF model realizes only slightly more than 50% of the full (aggregate) flash performance potential. Indeed, the EF model has 14 * 8 * 5.5 == 616 GB/s of potential available–but not realizable.
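For anyone who wants to reproduce the arithmetic behind Figure 2, here is a minimal sketch. The per-card rate (5.5 GB/s) and the HC scan figure (301 GB/s) are the numbers quoted above; the EF scan rate is deliberately left as a value for the reader to plug in from the Exadata X6 datasheet rather than a number I'm asserting here.

```python
# Sketch of the "full flash potential" arithmetic shown in Figure 2.
# Constants come from the figures quoted in the post; the EF datasheet scan
# rate is left for the reader to substitute.

CELLS_PER_RACK = 14
GBPS_PER_CARD = 5.5   # F320 datasheet rating

def flash_utilization(cards_per_cell: int, datasheet_scan_gbps: float) -> float:
    """Realized scan bandwidth as a fraction of aggregate per-card bandwidth."""
    potential = CELLS_PER_RACK * cards_per_cell * GBPS_PER_CARD
    return datasheet_scan_gbps / potential

print(flash_utilization(4, 301))   # HC: 301 / 308 ~= 0.98 -- essentially full potential
# EF: 14 * 8 * 5.5 = 616 GB/s of potential; substitute the EF datasheet scan
# rate to see the roughly-50% utilization depicted in Figure 2.
```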
No, Exadata X6 does not–as the above slide (Figure 1) suggests–harness the full potential of flash. Well, not unless you’re willing to put up with 168 round, brown, spinning thingies in the configuration. Ironically, it’s the HDD-Flash hybrid HC model that enjoys the “full flash potential.” I doubt the presenter points this bit out when slinging the slide shown in Figure 1.
IOPS
The slide in Figure 1 doesn't actually suggest that Exadata X6 achieves full flash potential for IOPS, but since these people made me crack open the datasheets and use my brain for a moment or two, I took it upon myself to do the calculations. The following graph (Figure 3) shows the delta between full flash IOPS potential and realized IOPS for the full-rack HC and EF Exadata X6 models, using data taken from the Exadata datasheet.
No…Exadata X6 doesn’t realize full flash potential in terms of IOPS either.
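The IOPS arithmetic follows the same pattern as the bandwidth arithmetic. Here is a self-contained sketch; the per-card and full-rack IOPS values are placeholders to be taken from the F320 and Exadata X6 datasheets, not figures I'm quoting here.

```python
# Same utilization arithmetic as Figure 2, applied to IOPS (Figure 3).
# All rate values are placeholders -- substitute the per-card IOPS from the
# F320 datasheet and the full-rack IOPS from the Exadata X6 datasheet.

CELLS_PER_RACK = 14

def iops_utilization(cards_per_cell: int, iops_per_card: float, datasheet_iops: float) -> float:
    """Realized IOPS as a fraction of aggregate per-card IOPS potential."""
    potential = CELLS_PER_RACK * cards_per_cell * iops_per_card
    return datasheet_iops / potential

# Example call (placeholder values only; take the real figures from the datasheets):
# print(iops_utilization(8, per_card_iops_from_f320_datasheet, ef_rack_iops_from_x6_datasheet))
```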
References
Here is a link to the full slide deck containing the slide (Figure 1) I focused on in this post: http://konferenciak.advalorem.hu/uploads/files/INFR_Sarecz_Lajos.pdf.
Just in case that copy of the deck disappears, I pushed a copy up to the Wayback Machine: click here.
Summary
XtremIO Storage Array literature does not suggest that the performance characteristics of the array are a simple product of how many component SSDs the array is configured with. To the best of my knowledge neither does Pure Storage suggest such a thing.
Oracle shouldn’t either. I have now made that point crystal clear.
Hi
Yes, that Oracle slide is kind of silly, but it has a point: Exadata may deliver much faster scanning throughput than other general-purpose storage systems on the market. And yes, it may show much better flash utilization. Does the latter have any practical application? I guess not.
“single 2-socket host sustaining Oracle table scan bandwidth at 18 GB/s”
Wow. This would require 22 8Gbps ports. How did you fit them into a 2-socket system (which is often 2U)?
I didn’t say it was Fibre Channel? 🙂 It’s NVMe plumbing. The point of the post is about the fact that *hosts* can handle the dataflow and Oracle processing.
In the scan bandwidth graphic you mention "partity" – do you mean parity? I assume so because the second graphic has it as parity. Is the difference in the EF version of Exadata because it has to write two copies of the data, whereas on HC the secondary copy is written to disk by default?
You found a typo. I don’t understand your question about writes. The post is about scans.
I assumed that the reason EF does badly is that half the flash is holding mirror copies and so is effectively useless from a bandwidth perspective, whereas for HC most mirror copies are being written to disk and not occupying space in flash. I.e., the data in flash will be more unique in the case of HC vs EF. Or is that assumption incorrect?
@robinsc : Badly?
As in does not live up to its full potential by at least half 🙂
@robinsc: You are missing the point. It has nothing to do with where multiplexed write blocks (ASM redundancy) reside. The comparison in Figure 2 shows the actual scan bandwidth for fully offloaded scans in the HC versus EF model. The EF model has double the F320 flash cards (8 versus the 4 in the HC model). Scan throughput is the same whether scanning 4 NVMe cards or 8 NVMe cards. Unless Oracle proves otherwise, this means there is a bottleneck not directly related to I/O. My first guess would be cell CPU, because scanning 300+ GB/s with 14 storage servers is roughly 10 GB/s per socket! And that's a lot even if only peeking into the block header for the count of row slots (these are COUNT(*) scans).
Hi Kevin,
Thanks for pointing out this mistake in my presentation. It looks like I used an early version of the presentation from just after Exadata X6-2 was announced. Soon after that, this was corrected by the product managers, but my presentation was uploaded there with the old slide. I have already asked the agency to replace it with the correct version.
Sorry for the mistake and thanks again for pointing it out.
Regards,
Lajos
@lsarecz: Hi. I think the same incorrect material is seen in other slide decks, though. I think you got the wrong information from someone else, no? Either way, it's no big deal. I just needed to make the general truth known that modern platforms' performance is really never the simple sum of individual components.