I Know Nothing About Data Warehouse Appliances and Now, So Won’t You – Part II. DATAllegro Supercharges Fibre Channel Performance.

BLOG CORRECTION: The next-to-last paragraph has been edited to clarify which components impose limits on I/O transfer sizes.

I’m going to tell you something nobody else knows. You heard it here first. Ready? Here’s the deal: no more than 800 MB/s can pass through two 4 Gb Fibre Channel HBAs into any host system’s memory. It’s that simple. If you want more than 800 MB/s available to your CPUs, you have to add more 4 Gb HBAs, move to 8 Gb Fibre Channel, or drop FCP altogether in favor of something that can deliver at that level. But this isn’t a plug for the Manly Man Series on Fibre Channel Technology; I’m blogging about Data Warehouse Appliance technology, specifically DATAllegro.
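Before going further, the 800 MB/s figure itself is easy to derive from public 4GFC parameters. Here is a quick back-of-envelope check (the 4.25 Gbaud line rate and 8b/10b encoding are standard Fibre Channel numbers):

```python
# Back-of-envelope check of the two-HBA ceiling, using standard 4GFC
# parameters: 4.25 Gbaud line rate, 8b/10b encoding (8 data bits per 10
# line bits).
line_rate_baud = 4.25e9
raw_payload_mb = line_rate_baud * (8 / 10) / 8 / 1e6   # line bits -> data bytes
print(round(raw_payload_mb))     # 425 MB/s of encoded payload per 4 Gb HBA

# FC framing (headers, CRC, inter-frame gaps) trims that to the ~400 MB/s
# per direction that 4GFC is conventionally rated at.
effective_per_hba_mb = 400
print(2 * effective_per_hba_mb)  # 800 MB/s: the dual-HBA ceiling cited above
```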

Exit Conventional Wisdom, and Electronics!

Here is a graphic of the V3 DATAllegro building block. It’s two Dell 2950s (a.k.a. Compute Nodes), each plumbed with two 4 Gb Fibre Channel HBAs to a small EMC CX3 array. According to this piece on DATAllegro’s website, they are the only people on the planet able to push more than is electronically possible through two 4 Gb HBAs. I quote:

Data for each compute node is partitioned into six files on dedicated disks with a shared storage node. Multi-core allows each of these six partitions to be read in parallel. Data is streamed off these partitions using DATAllegro Direct Data Streaming™ (DDS) technology that maximizes sequential reads from each disk in the array. DDS ensures the appliance architecture is not I/O bound and therefore pegged by the rate of improvement of storage technology. As a result, read rates of over 1.2 GBps per compute node are possible.

That’s right. I wasn’t going to point out that each compute node is fed by six disks, because if I did I’d also have to tell you they are 7200 RPM SATA drives, mirrored. Supposedly we are to believe that the pixie dust known as Direct Data Streaming™ can, uh, pull data at what rate per spindle? Yes, that’s right, they say 200 MB/s per drive! Folks, I’ve got 7200 RPM LFF SATA drives all over the place and you can’t get more than 80 MB/s per drive from these things (and even that is fairly tough to do). Even EMC’s own specification sheet for the CX3 spells out the limit as 31-64 MB/s. I’ll attest that if your code stays out on the outer, say, 10% of the drive, you can stream as much as 75-80 MB/s from these things. So with the DATAllegro system, and using my best numbers (not EMC’s published numbers), you’d only expect to get some 480 MB/s from six 7200 RPM SATA drives (6 × 80). Wow, that Direct Data Streaming™ technology must be really cool, albeit totally cloak and dagger. Let’s not stop there.
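The spindle arithmetic is trivial. A quick sanity check using the numbers above (my best-case 80 MB/s per drive versus the 200 MB/s per drive that DATAllegro’s claim implies):

```python
# Sanity check on the per-spindle claim, using the figures from the post.
drives = 6
best_case_per_drive_mb = 80    # MB/s, observed best case for 7200 RPM LFF SATA
implied_per_drive_mb = 200     # MB/s, what the 1.2 GB/s claim implies per drive

print(drives * best_case_per_drive_mb)   # 480 MB/s realistic aggregate
print(drives * implied_per_drive_mb)     # 1200 MB/s claimed aggregate
```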

What about this 1.2 GB/s per compute node claim? How do you pump that through two 4 Gb FC HBAs? You don’t. Not even DATAllegro with all those Cool Sounding™ technologies. What’s really being said in that DATAllegro overview piece is that their effective ingestion rate is some 1.2 GB/s. I quote:

Compression expands throughput: Within each node, two of the multi-core processors are reserved for software compression. This increases I/O throughput from 800MBps from the shared storage node to over 1.2 GBps for each compute node.
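Taken at face value, the arithmetic in that quoted claim requires only a modest compression ratio; a quick check:

```python
# The quoted claim, reduced to arithmetic: 800 MB/s of physical I/O
# inflated to 1.2 GB/s of logical throughput by software compression.
physical_mb = 800      # MB/s through two 4 Gb FCP HBAs
effective_mb = 1200    # MB/s of uncompressed ("logical") data claimed

print(effective_mb / physical_mb)   # 1.5 -> a 1.5:1 compression ratio suffices
```

In other words, 1.2 GB/s is an effective ingestion rate, not wire bandwidth, which is exactly the point being made here.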

They could just come out and say it, but they expect you to believe in magic. I’ll quote Stuart Frost (CEO, DATAllegro) on more of this magic, secret sauce:

Another very important aspect of performance is ensuring sequential reads under a complex workload. Traditional databases do not do a good job in this area – even though some of the management tools might tell you that they are! What we typically see is that the combination of RAID arrays and intervening storage infrastructure conspires to break even large reads by the database into very small reads against each disk.

Traditional databases are only victims of what storage arrays do with I/O requests by way of slicing and dicing. Further, the OS and the FC HBA impose limits on the size of large I/O requests; it is not a characteristic of a traditional database system. Even a Totally Rad Non-Traditional RDBMS™ like the one DATAllegro embeds in its compute nodes (spoiler: it’s Ingres, nothing new) will fall prey to what the array controller does with large I/O requests. But more to the point, FC HBAs and the Linux (CentOS, in DATAllegro’s case) block I/O layer impose limits on transfer size, and that limit is generally 1 MB.
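To illustrate what that 1 MB ceiling means downstream, here is a sketch using assumed-typical values (512-byte sectors and a 256-sector array RAID op unit, a common limit) of how a single “large” host read gets carved up by the array:

```python
# Illustration with assumed-typical values, not DATAllegro-specific measurements.
sector_bytes = 512
host_io_bytes = 1 * 1024 * 1024        # 1 MiB: common Linux block-layer max transfer
raid_unit_sectors = 256                # a common array RAID op unit (128 KiB)
raid_unit_bytes = raid_unit_sectors * sector_bytes

print(host_io_bytes // raid_unit_bytes)   # 8 back-end operations per host read
```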

If I’m wrong, I expect DATAllegro to educate us, with proof, not more implied Awesomely Fabulicious CoolFlips Technology™. In the end, whether or not they managed to code custom FC HBA drivers and somehow obtain custom CX3 firmware to achieve larger transfer sizes than anyone else, I’ll bet dollars to donuts they can’t push more than 800 MB/s through dual 4 Gb FCP HBAs, and certainly not from six 7200 RPM SATA drives.

13 Responses to “I Know Nothing About Data Warehouse Appliances and Now, So Won’t You – Part II. DATAllegro Supercharges Fibre Channel Performance.”

  1. David Aldridge July 7, 2008 at 8:45 pm

    Hmmm, so this means that if Oracle can achieve a compression rate of 5:1 on data warehouse data, its “effective ingestion rate” is … wow!


  2. Noons July 9, 2008 at 6:30 am

    “Awesomely Fabulicious CoolFlips Technology TM”
    LOL! Man, haven’t laughed like this for a looooong time!

    Please, Kevin: let me use this one on the next EMC meeting.
    The CLARiiON guy is trying to convince us that sharing a CX between my DW nodes and all the other file servers, print servers and SQL servers around the place makes a lot of sense…

  3. kevinclosson July 9, 2008 at 4:14 pm


    You’ve always got carte blanche around here…go for it. Which CX to share by the way?

  4. billy bathgates July 9, 2008 at 8:18 pm

    That is a somewhat low upper limit on stripe chunk size for that array, if it’s really 256 sectors (128 KB, I assume?), but I wonder how much it’s really hurting physical drive performance. I think not that much: after about 32-64 sectors the sequential transfer rate of most drives starts leveling off a lot. Correct me if I’m wrong. If all segments of a transfer are initiated in parallel this should generally be a win, until the number of outstanding host I/Os gets high enough for all the seeking on all those drives to become a problem.

  5. kevinclosson July 9, 2008 at 10:04 pm

    Actually, I just re-read my post and realized I made a mistake regarding the CX3. Only 4-way mirrors and above invoke the stripe size. Nonetheless, DATAllegro uses CentOS, and the odds they have manipulated the HBA driver (e.g., QLogic, Emulex) to push through larger than 1 MB I/Os are ever so slim. Further, I have found no evidence that a CX3 supports a transfer larger than 1 MB (either a singleton or a striped transfer). As for max stripe size…

    They (EMC CX3) actually use uncommon 520-byte sectors in the CX3. There is nothing strange about bounding stripe width (or RAID op units, generally speaking) to 256 sectors; there are many, many arrays that way, some smaller. I know 256 is a very common Engenio limit, as well as on the HP StorageWorks arrays I’ve had experience with. Chaparral arrays are that way. LSI ROC (RAID-on-a-chip), e.g., the HP SmartArray P400, is that way. It only becomes an issue when you read 1 MB at a time (the common, if not universal, Linux 2.6.18 kernel max I/O size for FC, SAS, etc. disks).

    The point is that, with the ingredients I see (CentOS, FC HBAs, CX3), DATAllegro is hitting physical disks with 1 MB transfers at most…until they point out otherwise (with proof).

  6. Noons July 10, 2008 at 3:13 am

    CX3-40. Apparently they are Fabulicious and can roast and brew 105 different varieties of coffee while servicing a multi-TB DW…

  7. Greg Rahn July 18, 2008 at 6:49 pm

    Digging deeper into this I believe that DATAllegro has never actually observed the numbers they claim (at least not all of them). Not only that, they claim different numbers for what appears to be the exact same metric.

    Let’s first look at the claim on this page: “[3:1 compression] increases I/O throughput from 800MBps from the shared storage node to over 1.2GBps for each compute node.” The wording here seems a bit misleading to me: why are they comparing “from the [single CX3-10] storage node” to “each [of two] compute node[s]”? Since there are two compute nodes per storage node, with 3:1 compression the math at least adds up: 800 MBps × 3:1 compression = 2.4 GBps = 1.2 GBps × 2 nodes. But this claim (800 MBps from the storage) is a farce (more on that later).

    Now let’s look at DATAllegro and Teradata: A Node-to-Node Comparison. Here the claim for “max I/O rate per node” is 900 MBps.

    Note that the two I/O throughput claims do not even match up!!! The first claims 1.2 GBps per node and the second claims 900 MBps per node. So which is it?!?!? I would believe the latter (900 MBps of logical throughput) to actually be physically possible (the first is not!), because an EMC CX3-10 storage processor can only output about 600 MBps total (physical) regardless of the number of drives or workload. This is obviously much less than the throughput capacity of the 4 x 4Gbps FCP. So if one assumes the 3:1 compression ratio, then the storage is capable of 1800 MBps of logical I/O (3 times the physical 600 MBps). Since two nodes share this 1800 MBps, each would be capable of 900 MBps. This equates to about 50 MBps per HDD (600 MBps / 12 HDDs) and thus does not exceed the laws of physics or the spec sheet.

    Obviously if the data compression ratio is less than 3:1, this rate will drop and approach the 600MBps physical max I/O throughput for the CX3-10.
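    The figures above reduce to a few lines of arithmetic (assuming, as the comment does, a ~600 MBps CX3-10 storage-processor ceiling and the claimed 3:1 compression):

```python
# Restating the comment's arithmetic; the 600 MBps SP ceiling and the 3:1
# ratio are the comment's assumptions, not measured values.
sp_physical_mb = 600          # MBps, approximate CX3-10 SP throughput ceiling
compression = 3               # claimed compression ratio
nodes = 2
hdds = 12

logical_total_mb = sp_physical_mb * compression
print(logical_total_mb / nodes)       # 900.0 MBps logical per compute node
print(sp_physical_mb / hdds)          # 50.0 MBps physical per drive
```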

  8. kevinclosson July 18, 2008 at 7:21 pm


    OK, I didn’t want to touch the fact that the CX3-10 SP array head is a simple Xeon box, but it is. Nothing magical. It is quite unlikely that there would be enough bandwidth in that thing to shuffle from back end to front end sufficiently to saturate the 4 x 4 Gb FCP outbound plumbing anyway. Nonetheless, I hold fast that the maximum theoretical ingest rate of a DATAllegro V3 node is 800 MB/s. That is a fact (2 x 4 Gb FCP HBAs).

    So, thanks for playing devil’s advocate, Greg. I hadn’t seen that 900 MB/s figure before, but it is as absurd as the 1.2 GB/s figure because DATAllegro cites it as a bandwidth number. They refer to the 900 MB/s as, and I quote, “Max I/O Rate per Node.” That is totally dishonest since their compute nodes are plumbed with 2 x 4 Gb FCP HBAs.

    They need to get honest and call it what it is, “Effective I/O Rate per Node.” And, they need to do that soon because voodoo doesn’t stand up to much scrutiny around here.

    I know I’m sending decent traffic to DATAllegro’s site with this thread. So whoever is over there monitoring this ought to take note that we don’t cotton to such tomfoolery round these parts. Fix your verbiage!

  9. Alex Gorbachev July 29, 2008 at 3:21 pm

    Oh this is funny… Following the trackback from here covering M$ acquisition of DATAllegro, we learn that it’s bad news for Oracle folks:

    …it’s bad news for Ingres, bad news for Oracle, bad news for IBM, bad news for Teradata and bad news for HP, all for obvious reasons.

  10. kevinclosson July 29, 2008 at 6:59 pm


    Funny is not the word. I’d say pathetic is more fitting.

  11. L8on August 6, 2008 at 12:49 pm

    Folks, please, let’s not cloud the marketing with facts!

    We all know that no one ( at least those who write the checks{$} ) checks the numbers. We technical people are supposed to just use the technology and if/when it doesn’t perform as advertised, then it’s obviously our incompetence that has configured it incorrectly. 😉

    This is how over 50% of the poorly performing systems I’ve inherited came to the shops I’ve worked with.

    Just my $0.02.

    Thanks for the due diligence and a dedication to real math.

  1. Other early coverage of Microsoft/DATAllegro | DBMS2 -- DataBase Management System Services Trackback on July 24, 2008 at 7:20 pm
  2. Database Customer Benchmarketing Reports | Structured Data Trackback on December 12, 2008 at 5:51 pm

All content is © Kevin Closson and "Kevin Closson's Blog: Platforms, Databases, and Storage", 2006-2015. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Kevin Closson and Kevin Closson's Blog: Platforms, Databases, and Storage with appropriate and specific direction to the original content.
