Enrico Signoretti

Large Capacities and Failure Domains

A few weeks ago, Samsung launched a new 128TB SSD (!), and last week SanDisk followed with a 400GB SD card. These products are totally different, but both show that flash vendors can now achieve remarkable die density, and it is highly likely that this trend will continue.
 
Even though $/GB is still in favor of the HDD – roughly in the 1:10 range – when it comes to capacity per drive, the SSD beats the HDD hands down. It will soon also become an option for some capacity-driven applications… but there is a catch.

The Relationship Between $/GB and Capacity

When you look at large capacities, no matter the application, the first characteristic you look for is cost… or, more appropriately, $/GB.
 
$/GB is a fundamental parameter for planning the long-term sustainability of an infrastructure, while other characteristics of a large scale-out system, such as throughput, are taken for granted because of the massive parallelization and the number of devices and nodes involved. In fact, for some use cases, especially where large files are involved, the difference in performance between SSD- and HDD-based clusters is mitigated by many factors. I’m not saying that an HDD can be compared to an SSD as far as performance is concerned, but there are some compromises you have to tolerate to keep costs down… and compromises are what create bottlenecks.
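
As a back-of-envelope illustration, the sketch below compares the cost of an HDD-based and an SSD-based capacity tier. The absolute prices and the protection overhead are assumptions picked only for illustration; what matters is the roughly 1:10 ratio mentioned above.

```python
# Back-of-envelope $/GB comparison for an HDD vs an SSD capacity tier.
# The absolute prices below are illustrative assumptions, not quotes;
# only the roughly 1:10 HDD-to-SSD ratio matters here.

USABLE_PB = 5                # usable capacity we want to plan for
HDD_COST_PER_GB = 0.03       # assumed $/GB for capacity HDDs
SSD_COST_PER_GB = 0.30       # assumed $/GB for capacity flash (~10x)
OVERHEAD = 1.4               # assumed data-protection overhead (e.g. erasure coding)

raw_gb = USABLE_PB * 1_000_000 * OVERHEAD

for name, cost_per_gb in (("HDD", HDD_COST_PER_GB), ("SSD", SSD_COST_PER_GB)):
    print(f"{name}: ${raw_gb * cost_per_gb:,.0f} for {USABLE_PB}PB usable")
```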
 
Unfortunately, to get the best density and $/GB you need to squeeze as many disks as possible into a single node. This is why many server vendors offer 48-, 60-, 72-, and even 90-slot servers. CPU power and network connectivity are sufficient if you use these servers just for storage, but there are other issues to consider.
 
A 90-disk server using 10TB drives provides 900TB of raw capacity, and if this server goes down, you have a huge problem. No matter how large the cluster is, the rebuild time and the performance impact while this operation takes place won’t go unnoticed.
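
To put a rough number on it, here is a quick estimate of how long re-protecting 900TB could take. The rebuild bandwidth is an assumption: it depends on how much network and disk throughput the cluster can dedicate to the rebuild without hurting foreground traffic.

```python
# Rough rebuild-time estimate for a failed 90 x 10TB node (900TB raw).
# The reconstruction bandwidth is an assumption; it depends on how much
# throughput the cluster can spare without hurting foreground traffic.

FAILED_CAPACITY_TB = 90 * 10        # 900TB raw in the failed node
REBUILD_RATE_GBPS = 20              # assumed aggregate rebuild rate, gigabytes per second

seconds = FAILED_CAPACITY_TB * 1000 / REBUILD_RATE_GBPS
print(f"~{seconds / 3600:.1f} hours to re-protect 900TB at {REBUILD_RATE_GBPS} GB/s")
# ~12.5 hours even at an optimistic 20 GB/s of sustained rebuild throughput
```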

Flash Memory Will Make Things Worse

A 128TB SSD is ten times bigger than a hard disk with the same form factor. It is also 10 or 20 times faster. If CPU power and network bandwidth weren’t a problem before, they are now. And don’t forget that you can squeeze many of these disks into a single server.
 
By adopting the same servers I mentioned above, you could obtain very high densities, but the failure domain would be massive. Even losing a small 12-disk server would be a tragedy (we are talking about roughly 1.5PB in 1 or 2 RU here). And don’t forget the bandwidth, RAM, and CPU power needed to support basic features like erasure coding, deduplication, compression, encryption, and so on.

Nano-nodes to the Rescue!

We think we already solved this problem once with the nano-node, a small interposer with enough CPU, RAM, SSD, and Ethernet to manage a single hard drive. This offers the same density you can get with fat nodes, but the failure domain, which remains the node, now equals a single disk (one node = one disk). The benefits are the smallest possible failure domain, the highest parallelism in the cluster, and much better power consumption.
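
To make the failure-domain comparison concrete, the small sketch below shows the raw capacity at risk when a single node fails, using the drive sizes from the examples above. The configurations are illustrative, not specific products.

```python
# Raw capacity exposed by a single node failure for three node designs,
# using the drive sizes from the examples above. The configurations are
# illustrative assumptions, not specific products.

node_designs = {
    "dense HDD server (90 x 10TB)":    90 * 10,   # TB raw per node
    "small flash server (12 x 128TB)": 12 * 128,
    "flash nano-node (1 x 128TB)":      1 * 128,
}

for name, raw_tb in node_designs.items():
    print(f"{name:34s} -> {raw_tb / 1000:.2f}PB at risk per node failure")
```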
 
The same approach could work with larger disks or SSDs, just by adding more resources to the nano-node; in this case, the result could be even better. In fact, while the original nano-node was designed for high-capacity but low-performance workloads, such as active archiving, SSDs allow you to think bigger.
 
For example, a 128TB SSD nano-node could carry 8-10 CPU cores, 16-32GB of RAM, and two 10/25Gbit/s ports. The form factor would be slightly different from the original one, but node density could remain remarkable. And contrary to what happened with the original nano-nodes, the additional SSD throughput, CPU power, RAM, and network bandwidth could be used for many more scale-out applications and workloads (non-SQL databases, data analytics, and so on). In a hypothetical design, 10 or 12 of these devices could be packed into a 2RU chassis with a 100Gb/s backend, delivering enough throughput to support the SSDs.
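
As a rough sanity check of this hypothetical design, the sketch below lines up the per-node and chassis-level bandwidth figures. The sustained throughput per SSD is an assumed round number, not a spec; changing it shows which component would saturate first.

```python
# Bandwidth sanity check for the hypothetical 2RU chassis sketched above:
# 12 nano-nodes, one 128TB SSD each, two 25Gbit/s ports per node, and a
# 100Gbit/s backend. The per-SSD sustained throughput is an assumption.

NODES = 12
SSD_GBPS = 1.0                      # assumed sustained GB/s per capacity SSD
NIC_GBPS_PER_NODE = 2 * 25 / 8      # two 25Gbit/s ports ≈ 6.25 GB/s
BACKEND_GBPS = 100 / 8              # 100Gbit/s backend ≈ 12.5 GB/s

per_node = min(SSD_GBPS, NIC_GBPS_PER_NODE)   # what one node can push
aggregate = per_node * NODES                  # what the chassis wants to push

print(f"per node : SSD {SSD_GBPS} GB/s, NICs {NIC_GBPS_PER_NODE:.2f} GB/s")
print(f"chassis  : {aggregate:.1f} GB/s offered vs {BACKEND_GBPS:.1f} GB/s backend")
```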

Takeaways

Building balanced architectures is not easy, but it is fundamental to creating sustainable infrastructures.
 
I’m just speculating here; no actual products like this are available yet. However, I really love the concept and all the possible applications for this type of “small” node, which could expand the range of use cases for flash memory in large-capacity scenarios. Nano-nodes, or micro-blades, were not very powerful in the past, but this was mainly because of the lack of storage density. Now that local storage is no longer an issue, I’m sure that more and more vendors will start thinking this way.
 
Here at OpenIO, we are ready to support this type of architecture. We already did so in the past, with the original nano-node, and OpenIO SDS already runs in production on all-flash clusters. This is also the scenario where Grid for Apps, our serverless computing framework, shows its strengths.

Want to know more?

OpenIO SDS is available for testing in four different flavors: Linux packages, a Docker image, a simple ready-to-go virtualized 3-node cluster, and Raspberry Pi.

Stay in touch with us and our community through Twitter, our Slack community channel, GitHub, our blog RSS feed, and our web forum to receive the latest info, get support, and chat with other users.

Reserve your seat for one of our webinars! Check our Events page for upcoming sessions or our YouTube channel for recorded videos.

 

Comments (2)

  1. Chris Evans

    Enrico, nano-nodes seem like a good idea. As you say, performance is good with SSD, and the failure domain from a drive perspective is lower (but not from the chassis – although that could be built with passive components). My only concern is how I/O is managed across so many components, with no central “controller”. I know that OpenIO can manage a distributed model, but what about block and file applications, or aren’t you including those? With potentially thousands or tens of thousands of nodes in an architecture, the support model would be very different.

    1. Enrico Signoretti

      Hi Chris,
      thanks for chiming in.
      I agree with you that performance for block and file, served through this kind of architecture, could be an issue, especially from the latency-consistency point of view. At the moment, my goal is to find the best solution for object storage here. We are also working on performance improvements for our file connector, and I hope to give you an answer about it later this year. 😉

      Supporting this kind of architecture is totally different. With EC, for example, you can lose 30+% of the drives before getting into trouble, as the quick sketch below shows. That means you can be lazy and let drives die, go to the DC to replace the failed ones just once a month, and do all the work in a couple of hours. (The only contract we provided for SLS was “NBD part replacement”.) 😉
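
      To put numbers behind that 30+% figure, here is a minimal sketch with a few example k+m erasure-coding layouts (just examples, not our default policies). When each fragment sits on its own nano-node, the loss tolerance is simply m / (k + m).

      ```python
      # How many nano-nodes can fail before data becomes unreadable with a
      # k+m erasure-coding layout (one fragment per node, one node = one disk).
      # The layouts listed are examples for illustration, not default policies.

      layouts = [(6, 3), (12, 6), (14, 4)]   # (data fragments k, parity fragments m)

      for k, m in layouts:
          tolerance = m / (k + m)
          print(f"{k}+{m}: survives {m} node losses out of {k + m} ({tolerance:.0%})")
      ```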
