Last night I observed a small Twitter fight between Scality and Cloudian around an aggressive marketing campaign in which Cloudian wants to show its superiority over its competitor. I didn't get into the fight, but what struck me most was a tweet from Giorgio Regni (CTO of Scality) talking about "data protection granularity." As far as I understand, from their point of view, data protection granularity means that you can manually choose the level of data protection per domain, user, bucket, and maybe per object. This is cool, but it is just too simplistic to cope with reality.
Let's talk about dynamic data protection
As usual, at OpenIO, we think differently.
Almost all object stores support at least two types of data protection. The first is multiple copies, where an object is duplicated N times on different nodes. The second is erasure coding, a technique that involves a lot of math and looks like RAID 6 on steroids. In practice, instead of organizing data in chunks and adding one or two parity chunks, erasure coding splits the data into several fragments and adds redundant information to each of them. With this approach, you need only a fraction of the total number of fragments to retrieve the original data.
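To make the two schemes concrete, here is a toy sketch (not OpenIO's actual implementation): replication stores N full copies, while a drastically simplified erasure code splits an object into k data chunks plus one XOR parity chunk, so any single lost chunk can be rebuilt from the survivors. Production systems use Reed-Solomon-style codes, such as the 14+4 layout discussed below, which survive several simultaneous failures.

```python
import functools

def replicate(data: bytes, n: int = 3) -> list[bytes]:
    """Multiple copies: store n identical replicas, one per node."""
    return [data for _ in range(n)]

def xor_encode(data: bytes, k: int = 4):
    """Split data into k equal chunks (zero-padded) plus one XOR parity chunk."""
    size = -(-len(data) // k)  # ceiling division
    chunks = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    parity = bytes(functools.reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))
    return chunks, parity

def xor_rebuild(chunks, parity):
    """Recover a single missing chunk (marked None) by XOR-ing the survivors."""
    missing = chunks.index(None)
    survivors = [c for c in chunks if c is not None] + [parity]
    chunks[missing] = bytes(
        functools.reduce(lambda a, b: a ^ b, col) for col in zip(*survivors)
    )
    return chunks
```

Note the trade-off already visible in the toy version: replication writes N full-size objects, while the erasure code writes k + 1 smaller fragments but has to do extra computation on every read and write.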
Erasure coding is very efficient from a data footprint point of view, and it can sustain several failures at the same time. But it doesn't make sense for small files, and it is very compute intensive: when files are too small, you end up with high CPU usage and tons of I/O operations.
How do we solve this? We select the appropriate data protection mechanism on the fly and we call it dynamic data protection.
How does it work?
Dynamic data protection, or DDP, is a very simple mechanism. When you create a new user or bucket, you can choose a data protection policy (this is what our competitor calls "granularity"). But with OpenIO, you can select multiple data protection policies and assign a rule that picks the right one depending, for example, on the size of the object you are going to store.
This feature is not very useful if you already know the type of files you are going to store (e.g., if you want to store large videos, erasure coding is perfect). But what happens if you don't know what you are going to store? Think about an ISP, for example: if they want to start a new cloud storage service, they need to be ready for any sort of workload and type of data, reacting quickly when necessary.
With OpenIO, you can have multiple storage policies for each domain/user/bucket. This means that if you have a mixed workload, with mixed file sizes, you don't need to think about storage efficiency. We do it for you.
Here’s an example. You could select 3-replica + EC 14+4 for a new bucket. And then you could have a rule where files smaller than 128KB are stored with three copies, while larger files get erasure coded.
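As a sketch, such a rule could look like the function below. The names and the string policy labels are illustrative, not OpenIO's actual API; only the 128KB threshold and the two policies come from the example above.

```python
# Hypothetical size-based policy rule: small objects get three replicas,
# large objects get erasure coding (EC 14+4).
SMALL_OBJECT_THRESHOLD = 128 * 1024  # 128KB cutoff from the example

def pick_policy(object_size: int) -> str:
    """Return the data protection policy for an object of the given size."""
    if object_size < SMALL_OBJECT_THRESHOLD:
        return "3-replicas"   # cheap in IOPS and CPU for small objects
    return "EC-14+4"          # space-efficient for large objects
```

The point is that the rule is evaluated per object at write time, so a single bucket can transparently mix both policies.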
I'm going to oversimplify a bit, but I want to explain with an example what happens when you save an 8KB or an 8MB file in the same bucket. For the 8KB file:
- 3 copies: a total capacity footprint of 8 x 3 = 24KB, and just 3 IOPS.
- EC 14+4 (considering a minimum chunk size of 8KB): 8 x 18 = 144KB, and 18 IOPS.
In this case the multiple-copies policy is better, not only from a performance point of view but also in terms of capacity consumption. For the 8MB file:
- 3 copies: a total capacity footprint of 8 x 3 = 24MB.
- EC 14+4: the total capacity is around 10.2MB.
In this case, no matter the IOPS, it is clear that you can save a lot of space with erasure coding.
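A quick back-of-the-envelope script makes the trade-off explicit. It assumes, as in the example above, an 8KB minimum chunk size for EC 14+4; real chunking is more involved, so treat this as a simplification.

```python
import math

CHUNK = 8 * 1024  # assumed minimum chunk size (simplification)

def replica_footprint(size: int, copies: int = 3) -> int:
    """Capacity used when storing `copies` full replicas of an object."""
    return size * copies

def ec_footprint(size: int, data: int = 14, parity: int = 4) -> int:
    """Capacity used by an EC data+parity layout with a minimum chunk size."""
    # Each data fragment is at least one chunk; parity fragments match them.
    fragment = max(CHUNK, math.ceil(size / data))
    return fragment * (data + parity)
```

Running it for the two object sizes reproduces the numbers above: for 8KB, replication uses 24KB while EC balloons to 144KB; for 8MB, replication uses 24MB while EC stays around 10.3MB.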
You don't want to store multiple copies of large files, but you want to stay efficient (especially performance-wise) when small files are in the mix, right?
Thanks to our integrated serverless framework, we can do more. We can easily perform operations like compression or changing the data protection policy over time to improve capacity utilization without impacting performance too much... but that's a story for another day.
Dynamic data protection is just one of the many features that make OpenIO a different kind of object storage. Our customers are not forced to choose one data protection mechanism or another; the system does it on the fly.
Contrary to the way traditional object stores work, OpenIO's lightweight design, flexibility, and sheer efficiency enable dynamic load balancing (ConsciousGrid), consistent performance on heterogeneous hardware, and cluster expansion without rebalancing. All these features are designed to support evolving business needs, not the other way around.