We have achieved significant improvements in the speed of rebuilding data when volumes are damaged in the latest version of OpenIO. The key to this is the distribution of rebuilding jobs across the whole platform.
In this article, I’d like to explain:
- why it is necessary to do this as fast as possible
- how we managed to make this process three times as fast, what kind of tools we used
- what the future holds for data rebuilding
Rebuilding Missing Chunks
OpenIO is a software defined object storage system that doesn’t rely on static placement algorithms such as chords, rings, crushmaps, etc. Instead, the solution relies on its ConsciousGrid technology to optimize data placement in the most flexible manner possible, and on a heavily distributed directory to persist the locations polled by ConsciousGrid.
The data itself is chunked and stored on services called rawx. When losing a drive hosting a rawx, all the chunks of that rawx are lost. Thanks to the data protection mechanisms, it is possible to rebuild all the data of the failed drive, rebuilding the missing chunks to guarantee the redundancy expected by the owner of the data. The less time we wait to rebuild the data, the lower is the risk of losing another disk, which could compound the problem. Therefore, we need a tool to rebuild missing chunks as quickly as possible.
OpenIO 18.04 provided a tool to rebuild all the chunks of a damaged rawx. But as drive capacities increase, we store more data chunks on rawx services, and more data is impacted if a disk or node fails. After a profiling session, we discovered the bottleneck: the tool could only be started on one machine, as it was written in Python in a mono-process/multi-coroutine fashion. It did not distribute its load, not even on local CPU cores.
When rebuilding millions of chunks, we were limited due to the CPU and/or network capabilities of the machine. We spent a lot of effort in the 18.10 to speed up the rebuilding mechanism.
Our goal for OpenIO SDS 18.10 was to no longer be limited by the performance of a single CPU core, nor even by a single machine. The only limit we would accept was the total resources of all the machines hosting rawx services.
With a test platform, we managed to increase the rebuild speed by a factor N, where N is the number of workers. We achieved a 3x improvement in performance on the same platform:
How does it work?
Distributed rebuilding is based on the controller/worker communication pattern.
Each worker is connected to a queue where it can retrieve the chunks to rebuild (3).
The single controller receives all the chunks to rebuild (1) and sends them to workers via their queues (2).
When a worker rebuilds a chunk, it notifies the controller via a queue (4).
By distributing the chunks to rebuild to the different workers, we distribute the load of this maintenance operation across all the machines having a worker. So, with a high enough number of workers, the rebuild speed is limited only by the rawx's ability to read the chunks present and write the new chunks.
We never had such a quick rebuilding tool. To do this, we distributed the rebuilding jobs across the whole platform, and this is essential to make sure environments are resilient and run smoothly with no downtime or data loss.
In the future, we plan to integrate the rebuilding process into GridForApps to simplify the distribution of chunks to rebuild by automatically distributing them over the entire platform.
The documentation is available here.