Why We Designed an Object Store with a Conscience
We usually define OpenIO as a flexible and lightweight object storage solution. This comes from decisions made long ago: OpenIO was founded in 2015, but the design of SDS was conceived in 2006. At that time, the engineering team wanted to overcome the scalability limits of traditional architectures, but they also wanted the product to be usable at any scale and, above all, easy to use on a day-to-day basis.
Why traditional object stores are rigid
The traditional layout of an object store is usually represented as a ring of nodes. A ring is based on a consistent hashing algorithm (such as MIT's Chord). It stores an object statically across a set of servers, using its key (usually a hash of the name) to determine a location. The key space is finite; each server is responsible for a range of keys. Two objects sharing the same key prefix will be stored on the same server, and load distribution is achieved statically through the distribution of the keys. When the cluster is expanded or reconfigured, the key space must be redistributed across all the nodes, and data has to move accordingly for the cluster to stay balanced.
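To make the static placement concrete, here is a minimal sketch of a consistent hashing ring. The node names and the 32-bit key space are hypothetical, and real rings add virtual nodes and replication; the point is only that a key's position alone determines its owner.

```python
import hashlib
from bisect import bisect_right

def ring_position(key: str, space_bits: int = 32) -> int:
    """Map a key to a position in the ring's finite key space."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % (2 ** space_bits)

class HashRing:
    """Each node owns the range of keys up to its own ring position."""
    def __init__(self, nodes):
        self.points = sorted((ring_position(n), n) for n in nodes)

    def locate(self, key: str) -> str:
        """Return the first node at or after the key's position (wrapping)."""
        pos = ring_position(key)
        idx = bisect_right([p for p, _ in self.points], pos)
        return self.points[idx % len(self.points)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.locate("photos/cat.jpg")  # fully determined by the hash
```

Note that constructing a ring with one extra node changes the ownership of some key ranges, which is exactly the rebalancing cost discussed next.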
This kind of operation has a huge impact on the infrastructure: adding new nodes requires a rebalancing that seriously degrades performance. At the same time, heterogeneous hardware is hard to exploit, because performance is limited by the slowest nodes.
In addition, data lookups based on consistent hashing get slower as the cluster grows: the cost of the lookup query to find an object is O(log n) in the number of nodes. With a few dozen nodes you don't notice this, but at large scale the latency increases substantially. As we were designing our platform to host as many as 2^16 nodes, O(log n) was not an option: that would mean up to 16 hops to locate an object.
Why did we design Conscience?
First, we wanted to achieve flexibility. Flexibility in terms of deployment means:
- Being able to re-use old hardware,
- Being able to mix it with new servers,
- Being able to expect it all to work with many generations of hardware to come.
While storage infrastructures are built to last at least five years, you can expect a new hardware generation every 18 months. Sooner or later, you will have to deal with heterogeneity of hardware. We wanted to have a mechanism that would work well in this context, and we knew an even distribution of data across all nodes wasn’t the solution.
Second, we wanted to distribute load across all nodes, not with classic load-balancing techniques that suit stateless workloads (random, round-robin, least connections, etc.), but by taking into account specific characteristics of data:
- Data is stateful; unlike a compute task, it does not disappear once the work is done.
- Data stays where it is placed: it will be written to, and later retrieved from, that same location.
So we decided to opt for dynamic data placement. Each server collects its own metrics and sends them to a distributed service called Conscience, which computes a quality score and publishes these scores on a bulletin available to all the servers of the infrastructure: the grid of nodes. This happens every few seconds, asynchronously, so each node always knows the state of all the other nodes. Each micro-decision on where to place data picks from the best available nodes, those with the highest scores.
Scores are calculated based on capacity, CPU power, and IOPS available on the node, and are computed as a geometric mean; this means a node's score drops to 0 if any single resource reaches 0 (no capacity left, no CPU left, etc.). New servers rise to the top of the list as soon as they are added, because they are empty, but they will not be hammered too hard right away: their measured IOPS will decrease under load, lowering their overall score.
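The scoring described above can be sketched in a few lines. This is a simplified model, not OpenIO's actual scoring code: the metric names, normalization to [0, 1], and node names are assumptions for illustration; the zero-annihilating property of the geometric mean is the real point.

```python
from math import prod

def node_score(capacity_free: float, cpu_idle: float, iops_avail: float) -> float:
    """Geometric mean of normalized metrics (each in [0, 1]).
    If any resource is exhausted (0), the whole score is 0."""
    metrics = [capacity_free, cpu_idle, iops_avail]
    return prod(metrics) ** (1 / len(metrics))

def pick_best(scores: dict, count: int) -> list:
    """Place data on the nodes publishing the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:count]

scores = {
    "node-a": node_score(0.9, 0.5, 0.7),  # healthy node
    "node-b": node_score(0.0, 0.8, 0.9),  # out of capacity -> score 0
    "node-c": node_score(0.4, 0.6, 0.5),
}
targets = pick_best(scores, count=2)  # node-b is never selected
```

Using a geometric rather than arithmetic mean is what guarantees that a node with plenty of CPU but no disk left is excluded outright instead of merely ranked lower.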
A directory of objects
As we opted for dynamic data placement based on the actual state of the infrastructure, the challenge was to keep track of all these micro-decisions. In OpenIO SDS, data is not placed in a predetermined way, so the same calculation cannot be rerun to find the data once it has been stored. This is why we needed a directory of objects to record their locations; and since we wanted to store trillions of objects, we had to build a data structure that would allow these records to be distributed over many nodes, even across all of them.
A 3-level tree with indirections was the best option for latency, given the number of nodes we wanted to reach (2^16): only 2 hops, even on very large platforms. Indirection tables are widely used in computing, for example to look up RAM pages or files across filesystems; this was the closest and most proven mechanism for keeping track of trillions of objects. With this design, metadata access stays consistently fast at any scale, because a lookup never costs more than 2 hops, and the directory can be spread across all the nodes of the cluster, limiting the resources needed on any single node.
As a positive side effect, having a directory of objects also enables something a ring cannot do. A ring is a pure BLOB store: you need to remember the key to access your object. At large scale – trillions of objects – keeping the list of all the keys needed to find every object is itself a challenge.
A modern object storage system allows a complete listing of objects (among other capabilities beyond GET, PUT, and DELETE calls). To build such functionality on top of a BLOB store, you need an extra layer to take care of the metadata and provide the listing capability to applications. The overall complexity of the solution increases with an additional layer, as does the induced latency to get through multiple layers before actually interacting with objects and their metadata.
With our directory of objects, we provide object listing at the core, which is also why it was so simple for us to implement S3 and Swift APIs (which also provide this type of functionality).
The difference between OpenIO SDS and other object stores is at the core. Its flexibility and lightweight design allowed us to build a unique platform which is much more efficient and smarter than ring-based object stores. And this is also why we have been able to develop our integrated serverless computing framework on top of it.
Conscience technology is not just a load-balancing mechanism; it is the foundation of the philosophy that enables us to deliver a superior object storage product, and it makes SDS, and Grid for Apps, suitable for a very broad set of use cases at any scale… from 1 TB up to thousands of PB.