Big Data, Object Storage, and OpenIO
A few years ago, when I was an analyst, I wrote many articles about architectures that can take advantage of object storage in large Big Data infrastructures (here is an example), and one of my mantras in recent years has been to suggest adopting a two-tier storage architecture. Now, here at OpenIO, this isn't changing.
Every day I meet customers and end users, most of them running large-capacity storage systems with Big Data needs. One of our goals for the next few years is to become more and more involved in these kinds of projects, not just as a secondary repository for archiving, but as a flexible and active system that can also provide compute capabilities through our Grid for Apps technology.
Radicalization of the tiers
Large Big Data repositories are getting larger year after year. Capacities are measured in tens or hundreds of Petabytes, and the Exabyte scale is around the corner for more infrastructures than you may think.
Scale-out file systems and object stores are the only way to cope with that kind of size and growth. Usually, the former are better for performance, while the latter are cheaper but less flexible and slower.
The problem of $/GB has become more relevant over time, and researchers are trying everything to get more performance out of object stores (or to make file systems cheaper).
At the same time, flexibility is the key to effective scalability. It is not only the ability to scale that matters; what counts now is scaling with ease, without limits or constraints. Large multi-petabyte systems are continuously evolving infrastructures that include several generations of storage nodes with different media, sometimes serving different applications and workloads. From this point of view, OpenIO can be a game changer: the flexibility of our internal architecture, which allows us to support complex configurations made of heterogeneous resources, is coupled with advanced dynamic data placement mechanisms that provide continuous load balancing for the best performance.
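To make the idea of dynamic placement across heterogeneous nodes concrete, here is a minimal sketch of score-based placement: each node advertises a score derived from its free capacity and current load, and new data lands on the best-scoring nodes. All names and the scoring formula are illustrative assumptions, not the actual OpenIO SDS implementation.

```python
# Hypothetical score-based data placement across heterogeneous storage nodes.
# A node's score combines free space and CPU load; chunks go to the nodes
# with the highest scores, which continuously rebalances new writes.
from dataclasses import dataclass


@dataclass
class StorageNode:
    name: str
    free_bytes: int
    total_bytes: int
    cpu_load: float  # 0.0 (idle) .. 1.0 (saturated)

    def score(self) -> float:
        # Favor nodes with free space and low load; 0 disables the node.
        free_ratio = self.free_bytes / self.total_bytes
        return max(0.0, free_ratio * (1.0 - self.cpu_load)) * 100


def place_chunks(nodes, copies):
    """Pick the `copies` best-scoring nodes for a new chunk."""
    ranked = sorted((n for n in nodes if n.score() > 0),
                    key=lambda n: n.score(), reverse=True)
    return [n.name for n in ranked[:copies]]


nodes = [
    StorageNode("node-1", 8_000, 10_000, 0.2),  # new node, mostly empty
    StorageNode("node-2", 2_000, 10_000, 0.5),  # older, fuller node
    StorageNode("node-3", 5_000, 10_000, 0.9),  # busy node
]
print(place_chunks(nodes, copies=2))  # → ['node-1', 'node-2']
```

Because scores are recomputed continuously, a freshly added generation of nodes naturally absorbs more of the incoming data until the cluster rebalances.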
But there is more. Capacity is just one element of the equation; the other is latency. Even though active data sets are not growing at the same pace as the data lake, we want better results, faster. Most of the time the bottleneck is not the CPU itself; the real issue is the time the CPU spends waiting for data to compute. All-flash storage systems have partially mitigated the problem, but they can't be deployed at massive scale because of their cost.
In-memory storage (and databases) is what makes a difference now, and this is why solutions like Apache Spark are growing so quickly in popularity. An object store has very high latency (mostly because of the kind of media and the data protection techniques used to maximize durability and minimize data footprint), but it can show very interesting throughput. Good for streaming data continuously to the compute cluster!
We think that the trick to creating the ultimate Big Data infrastructure is to transfer data as quickly as possible from the object store to memory, where it can be processed quickly, and vice versa.
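The "high latency, high throughput" trade-off above suggests a simple pattern: read objects in large sequential chunks (where object stores shine) and hand each chunk to the compute layer as soon as it arrives, instead of waiting for the whole object. A minimal sketch, with an in-memory buffer standing in for an S3/OpenIO GET response body; any file-like stream works the same way.

```python
# Stream an object from the store into memory in large sequential chunks,
# so compute can start before the whole object has been transferred.
import io


def stream_chunks(body, chunk_size=8 * 1024 * 1024):
    """Yield sequential chunks from a file-like object body."""
    while True:
        chunk = body.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Stand-in for an object fetched over HTTP from the store.
body = io.BytesIO(b"x" * 100)
processed = sum(len(c) for c in stream_chunks(body, chunk_size=32))
print(processed)  # → 100 (bytes processed, 32 at a time)
```

With a real object store, `body` would be the streaming response of a GET request; the per-chunk latency is amortized as long as the chunks are large enough to keep the pipe full.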
OpenIO loves IOStack!
We came across the IOStack EU research project a while ago, and we started collaborating with the team behind it because we believe they are going in the right direction.
Our goal is to adopt and contribute to Crystal (an open and extensible software-defined storage layer for OpenStack) and Stocator (an analytics data access layer), two technologies leveraged by the IOStack consortium. By doing so, we think we'll enable high-performance analytics on OpenIO SDS.
Moreover, we are already thinking about how to take advantage of our event-driven compute framework (Grid for Apps) to build an even tighter integration between storage, the access layer, and compute, to minimize latency and improve overall infrastructure efficiency.
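The core of event-driven compute is simple: the storage layer emits an event when something happens to an object, and a registered handler runs close to where the data lives. Here is a minimal sketch of that dispatch pattern; the event fields and the registration decorator are hypothetical illustrations, not the actual Grid for Apps interface.

```python
# Hypothetical event-driven dispatch: handlers register for storage event
# types and run when the matching event arrives (e.g. on object creation).
HANDLERS = {}


def on_event(event_type):
    """Register a handler function for a given storage event type."""
    def register(fn):
        HANDLERS[event_type] = fn
        return fn
    return register


@on_event("object.created")
def index_object(event):
    # e.g. extract metadata, build a thumbnail, update an analytics index...
    return f"indexed {event['container']}/{event['object']}"


def dispatch(event):
    """Route an incoming storage event to its registered handler, if any."""
    handler = HANDLERS.get(event["type"])
    return handler(event) if handler else None


result = dispatch({"type": "object.created",
                   "container": "logs", "object": "2017/01/app.log"})
print(result)  # → indexed logs/2017/01/app.log
```

Running this kind of handler inside the storage grid itself is what removes a whole round trip: the data never has to leave the cluster before it is processed.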
You’ll hear more and more about OpenIO associated with Big Data and Analytics. We think OpenIO SDS already has the right characteristics of flexibility and scalability for building data lakes of all sizes. But it is by simplifying the entire infrastructure and making it more efficient and integrated that Big Data will become available to a larger audience and easier to manage, no matter the growth rate.