$/GB Is No Longer Enough: Let’s Talk About $/Data
Before joining OpenIO, I was an independent blogger and analyst, and I worked with most of the object storage vendors on the market. I was fascinated by the potential of this technology, but, after many years, only a few vendors have delivered on its promise: a lower $/GB without the trade-offs imposed by design choices. Too often, those choices result in rigid systems that are hard to manage and don’t scale well over time. OpenIO has gone beyond just thinking about $/GB. We not only provide one of the best $/GB on the market, but we also help users create value from the data in their object stores.
$/GB and Sustainability
The first time I met Laurent Denel, CEO and co-founder of OpenIO, I was skeptical. OpenIO’s technology was founded on totally different concepts from those of traditional object stores. The advantages of their approach looked good in theory, but I wasn’t sure they would hold up in real-world scenarios. They proved me wrong by showing me real numbers from users in the field, working with the technology developed by their team, on systems ranging from 3 to 600+ nodes and several petabytes. (Yes, six hundred nodes.)
There are many features in OpenIO’s solution that drive down $/GB:
- The grid of nodes (as opposed to the classic ring topology)
- Conscience technology (as opposed to static load balancing based on a rigid distributed hash table)
- Neat, lightweight design (as opposed to systems based on layers of software that need many resources to run)
- Open source code (as opposed to closed source)
- The never-need-to-rebalance capability (as opposed to a performance impact for data and metadata reshuffling after any cluster alteration)
When calculating costs, one needs to consider not only TCA (total cost of acquisition), which is already very low even for the smallest installations, but also long-term TCO (total cost of ownership), which is much lower than for other solutions. OpenIO is one of the few platforms that allows users to build an evolutionary system capable of accepting new resources continuously, when needed, without having to worry about which resources are already in place or how to take advantage of them. OpenIO supports heterogeneous clusters, and new resources are immediately available when added.
But low $/GB is no longer enough to build sustainable storage infrastructures.
It’s time to talk about $/Data
One of the key advantages of object storage is the rich metadata attached to files and objects. In most cases, you already have the necessary information to make this feature useful. Some object stores have implemented a basic indexing/searching mechanism, but it is too limited to be of much use. We had a different idea.
OpenIO began researching serverless computing many years ago. OpenIO can run with a single CPU core and less than 512 MB of RAM, and customers started to ask what they could do with all those unused resources in big x86 boxes. We found the answer. The messaging system implemented for the OpenIO cluster can intercept storage events (such as PUT, GET, and DELETE), so it was just a matter of passing these events to the upper layers and doing something with them.
By intercepting events and triggering applications, it is possible to access data and metadata, and to perform operations that can be used for several tasks, such as increasing the value of raw data stored in the system.
Most of these operations could be done by external services or the application itself, but they wouldn’t be offloaded to storage, and they would put much more load on the application layer, limiting the overall efficiency of the entire stack.
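To make the idea concrete, here is a minimal sketch of event-triggered processing. It is only an illustration, not OpenIO’s actual API: `StorageEvent`, `EventBus`, `on`, `emit`, and `tag_on_put` are hypothetical names, and the in-process dispatcher simply stands in for the cluster’s messaging system.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StorageEvent:
    """Hypothetical storage event; not OpenIO's actual event format."""
    kind: str                                   # e.g. "PUT", "GET", "DELETE"
    object_name: str
    metadata: Dict[str, str] = field(default_factory=dict)


class EventBus:
    """Toy in-process dispatcher standing in for the cluster's messaging system."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[StorageEvent], None]]] = defaultdict(list)

    def on(self, kind: str, handler: Callable[[StorageEvent], None]) -> None:
        # Register a function to run whenever an event of this kind occurs.
        self._handlers[kind].append(handler)

    def emit(self, event: StorageEvent) -> int:
        # Pass the event to every registered handler; return how many ran.
        handlers = self._handlers.get(event.kind, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


def tag_on_put(event: StorageEvent) -> None:
    # Example handler: mark the object as processed in its metadata.
    event.metadata["processed"] = "true"


bus = EventBus()
bus.on("PUT", tag_on_put)
```

Emitting a `StorageEvent("PUT", ...)` through `bus.emit` runs `tag_on_put` next to the data, while event kinds with no registered handler pass through untouched.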
We call this capability Grid for Apps, and it has several use cases. Here are some examples:
- Data validation: think about storing data coming from sensors or remote devices. The storage system can check if data is valid, in a specific range, or it can be filtered, leaving only the parts that are necessary for subsequent tasks.
- Metadata enrichment: data stored in the system can be enriched with metadata automatically, derived from the content and other information included in the original files.
- Pattern matching: every file ingested by the system can be searched for specific patterns, and alarms can be raised or other actions taken if the pattern matches certain rules (an antivirus is a classic example for this use case).
- Video transcoding: storing a video can automatically trigger a transcoding process that creates additional files for different formats and/or bitrates. And the opposite is also true: when a video is read, the system can resize the video in real time for the device accessing it.
- Real-time analytics: trigger a specific analytics task if a file contains logs, for example.
- Advanced indexing and searching: indexing files when imported.
- Image recognition: recognizing content, or just faces, and performing actions accordingly. You could save pictures and protect privacy by blurring faces, for example.
- You name it: any other operation that simplifies complex workflows and can be triggered by an event is feasible. Sometimes it takes just a few lines of code and some open source tools, while in other cases you may need a full-fledged application (e.g. Apache Spark, Elastic), but anything can run on top of an OpenIO cluster.
All the above examples have something in common: they create more value from the data stored in the OpenIO object store.
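As a rough illustration of the first two use cases (data validation and metadata enrichment), the sketch below shows the kind of function that could be triggered when a sensor file is stored. Everything in it is an assumption made for the example: the JSON payload layout, the `temp_c` field, the `VALID_RANGE` limits, the metadata keys, and the `validate_and_enrich` function itself. None of it comes from OpenIO.

```python
import json
from typing import Dict, Tuple

# Assumed plausible range for a temperature sensor, in degrees Celsius.
VALID_RANGE = (-40.0, 85.0)


def validate_and_enrich(payload: bytes, metadata: Dict[str, str]) -> Tuple[bytes, Dict[str, str]]:
    """Drop out-of-range readings and record summary metadata.

    The payload is assumed to be a JSON list of {"temp_c": <float>} records.
    """
    readings = json.loads(payload)
    low, high = VALID_RANGE

    # Data validation: keep only readings inside the accepted range.
    kept = [r for r in readings if low <= r["temp_c"] <= high]

    # Metadata enrichment: derive new metadata from the content itself.
    enriched = dict(metadata)
    enriched["reading-count"] = str(len(kept))
    enriched["dropped-count"] = str(len(readings) - len(kept))
    if kept:
        enriched["max-temp-c"] = str(max(r["temp_c"] for r in kept))

    return json.dumps(kept).encode("utf-8"), enriched
```

The function returns the filtered payload together with a new metadata dictionary, so the stored object ends up both smaller and easier to search, without the application layer doing any of the work.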
How does it work?
Below are two videos showing how Grid for Apps works. The first, after a short introduction, gives an overview of the concepts, the components involved, and the overall architecture of the product. The second is a demo with an example using TensorFlow (the AI library from Google).
Associating object storage with serverless computing is a game changer. It is finally possible to store huge amounts of data, understand the content, and give it value by enriching it or removing unnecessary parts. We can easily move from the “dumb” data lake to a smart data lake. But the advantages don’t stop here; thanks to the augmented capabilities of the storage system, workloads can be improved and simplified by automating tasks at a lower layer that is common to all applications.