
Behind the Scenes of the #TbpsChallenge


Everything you always wanted to know about the #TbpsChallenge.
(But were afraid to ask).

What is the #TbpsChallenge?

To demonstrate the performance and scalability of OpenIO Object Storage, we recently deployed our solution on a cluster of more than 350 physical servers provided by Criteo Labs. The benchmark crossed the symbolic threshold of 1 terabit of data written per second, and in fact exceeded it: the useful throughput observed was 1.372 Tbps. This is the equivalent of digitally transferring the 22 million printed books from the world’s largest library, the Library of the U.S. Congress, in under one minute. We achieved this performance level under real production conditions. This confirms the robustness of our object storage software, which we designed for new data uses, in particular the massive exploitation of data by AI algorithms on Big Data / HPC clusters.
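As a rough sanity check of that comparison, assuming the commonly cited estimate of about 10 TB of text for the Library of Congress's printed collection (an assumption, not a figure measured in the benchmark):

```python
# Back-of-envelope check of the Library of Congress comparison.
# Assumption: the ~22 million printed books represent roughly 10 TB of text,
# a commonly cited estimate (not a figure measured in the benchmark).
collection_bytes = 10e12                  # ~10 TB of text (assumption)
throughput_bps = 1.372e12                 # 1.372 Tbps useful write throughput

transfer_seconds = collection_bytes * 8 / throughput_bps
print(f"Transfer time: {transfer_seconds:.0f} s")  # ≈ 58 s, i.e. under a minute
```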


How was the cluster configured to achieve this performance?

OpenIO has a mesh architecture; it is a fully distributed system with weak coupling between components. There is no hierarchy between machines, nor a centralized transaction management system. No single machine intercepts all the traffic: each machine runs the three components of OpenIO and acts as an S3 gateway, storing part of the metadata and part of the data. The benchmark therefore consisted of aggregating the performance of all the cluster’s machines with as little loss as possible.
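Since every node exposes an S3 gateway, a client can point a standard S3 SDK at any node in the cluster. Here is a minimal sketch with boto3; the hostname, port and credentials are placeholders, not values from the benchmark:

```python
# Any OpenIO node can serve as the S3 entry point for a client.
# Hostname, port and credentials below are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://node-042.example.lan:6007",  # any of the ~350 nodes
    aws_access_key_id="demo-key",
    aws_secret_access_key="demo-secret",
)
s3.put_object(Bucket="benchmark", Key="object-0001", Body=b"payload")
```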


What load balancing system was put in place to achieve this level of bandwidth without a congestion point?

The copy tasks, from the Criteo production data lake to the OpenIO cluster, were started via DistCp (distributed copy). This tool, which is part of the Hadoop ecosystem, can read from and write to HDFS or S3, but it does not manage load balancing. So a two-level load-balancing scheme was implemented. First, a round-robin algorithm at the DNS resolution level, on the Criteo side, routed each request sent to the OpenIO cluster to one of the 350 S3 gateways.

And then, since round-robin doesn’t distribute the load perfectly uniformly, OpenIO implemented its own secondary load-balancing mechanism, based on intelligent redirections between nodes to send the load to the most available servers at any given moment. Each OpenIO S3 gateway was able to share the load with the others, which produced a very regular write stream, with each gateway handling 1/350 of the write requests.
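As an illustration of this second level (a toy sketch of the idea only, not OpenIO's actual implementation; node names, load metrics and the threshold are invented), a gateway that finds itself busier than its peers can hand a request off to a less loaded node:

```python
# Toy model of the second balancing level: a gateway keeps a request if it is
# not significantly busier than its peers, otherwise it redirects the client
# to the least loaded node. Loads, threshold and names are assumptions.
import random

def choose_gateway(local_node: str, load_by_node: dict) -> str:
    """Return the node that should handle an incoming write request."""
    least_loaded = min(load_by_node, key=load_by_node.get)
    # Keep the request locally unless a peer is significantly less busy.
    if load_by_node[local_node] <= 1.2 * load_by_node[least_loaded]:
        return local_node
    return least_loaded

loads = {f"node-{i:03d}": random.uniform(0.2, 0.9) for i in range(350)}
target = choose_gateway("node-042", loads)
if target == "node-042":
    print("handle the write locally")
else:
    print(f"redirect the client to http://{target}.example.lan")
```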


How was the injection cluster connected to the target cluster?

From the network point of view, the two platforms are intertwined: all the servers belong to the same full-mesh network. In this sense, the network is “unlimited”: each node can reach its theoretical bandwidth of 10 Gbps when talking to the other nodes.
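Taking the roughly 350 nodes and the single wired 10 Gbps port per node described later in this post, the raw aggregate capacity is easy to estimate; note that the erasure-coding overhead discussed at the end of this post keeps the achievable useful write rate well below this raw figure:

```python
# Raw aggregate network capacity of the target cluster, per direction,
# assuming ~350 nodes with one active 10 Gbps port each.
nodes = 350
link_gbps = 10
print(f"Raw aggregate capacity: {nodes * link_gbps / 1000:.1f} Tbps per direction")  # 3.5 Tbps
```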


What types of disks were used for this benchmark?

The servers involved in this benchmark were equipped with 3.5″ magnetic hard disks of 8 TB each (15 disks per server), which stored the data. The metadata (the directory containing the location of each object’s chunks on the cluster) was stored on each server’s system disk (a 240 GB SSD). Storage of this metadata usually requires less than 1% of the platform’s total capacity.
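A quick per-server sanity check, using only the figures above:

```python
# Per-server ratio of metadata SSD capacity to raw data capacity,
# using the hardware figures given in this post.
data_capacity_tb = 15 * 8        # 15 x 8 TB data disks per server
metadata_ssd_tb = 0.240          # 240 GB system SSD used for metadata
ratio = metadata_ssd_tb / data_capacity_tb
print(f"Metadata SSD / raw data capacity: {ratio:.2%}")  # ≈ 0.20%, consistent with "< 1%"
```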


What were the characteristics of the servers used?

As part of its annual capacity update, Criteo Labs had a batch of new physical machines installed in the racks but not yet integrated into its production cluster. These machines were powered up and kindly lent to OpenIO for a few days to carry out the benchmark.

These servers are designed to be Hadoop nodes, capable of both storing data in HDFS and running computations. As a result, they are generously equipped with CPU and RAM. They are standard 2U commodity servers, fitted with a dual-port 10 Gbps network interface (but only one port was enabled and wired).

Here is the detailed configuration of these standard HPE DL380 Gen10 2U servers:

  • CPU: 2 x Intel Xeon Gold 6140 (Skylake)
  • RAM: 384 GB (12 x 32 GB SK Hynix DDR4 RDIMM @ 2,666 MHz)
  • System drives: 2 x 240 GB Micron 5100 ECO SSD (only one used for metadata)
  • Data disks: 15 x 8 TB Seagate SATA, LFF, hot-swap
  • Network: HPE 562FLR-SFP+ adapter (Intel X710), 2 x 10 Gbps SFP+ ports (only one connected and functional)

From OpenIO’s point of view, these machines are sub-optimal in terms of density (CPU/RAM ratio versus storage capacity). Moreover, the useful capacity of the cluster after formatting is 38 PB, which is rather low for 352 servers. A typical OpenIO server would be equipped with 14 TB disks, and less CPU and 128 GB of RAM would have been sufficient.
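To put those capacity figures in perspective, here is a rough calculation from the numbers above (raw capacity only, before formatting and erasure coding; the 14 TB disk size is the one mentioned in the text):

```python
# Raw capacity of the cluster with the benchmark's 8 TB disks versus the
# 14 TB disks a typical OpenIO server would use (figures from this post).
servers, disks_per_server = 352, 15

raw_8tb_pb = servers * disks_per_server * 8 / 1000    # ≈ 42 PB raw
raw_14tb_pb = servers * disks_per_server * 14 / 1000  # ≈ 74 PB raw

print(f"8 TB disks:  {raw_8tb_pb:.1f} PB raw (≈ 38 PB after formatting)")
print(f"14 TB disks: {raw_14tb_pb:.1f} PB raw")
```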


With a write rate of 1.372 Tbps, has OpenIO reached its performance limits?

No. If anything, this benchmark demonstrated the opposite: with more nodes, more bandwidth and more disks on each server, the throughput reached would have been even higher, and it would have scaled linearly. By reaching 1.372 Tbps of useful write throughput, we approached the theoretical limits of the network capacity we had.

The difference between the theoretical network capacity and the observed useful bandwidth is explained by two factors. Any distributed system has a network cost of its own, because machines talk to each other to share their respective states and make decisions. Above all, the data protection mechanism (erasure coding 14+4 in our benchmark) implies a significant additional network cost: each written object is split into 14 chunks, and 4 parity chunks are created so that the data can be rebuilt in the event of the loss of a disk or server (here, up to 4 servers can be lost without putting the data at risk). Of the 18 chunks, 17 are re-dispatched across the cluster, resulting in bandwidth consumption higher than the useful throughput seen from the client application. It also follows that a read benchmark would have shown a higher bandwidth: we deliberately chose the most “difficult” use case, since reads generally involve no erasure-coding computation.
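To see how close 1.372 Tbps comes to that limit, here is a hedged back-of-envelope estimate. It assumes about 350 gateways with one full-duplex 10 Gbps link each, a perfectly uniform spread of client and chunk traffic, and it ignores control traffic; these are simplifications, not an official OpenIO model:

```python
# Back-of-envelope estimate of the network-limited write ceiling.
# Assumptions: ~350 gateways with one full-duplex 10 Gbps port each,
# erasure coding 14+4 with 17 of the 18 chunks sent to other nodes,
# uniform load, control/gossip traffic ignored.
nodes, link_gbps = 350, 10
data_chunks, parity_chunks = 14, 4
redispatched = data_chunks + parity_chunks - 1     # 17 of 18 chunks move

# For a cluster-wide useful write rate T, each node's inbound link carries
# its share of client data plus the chunks received from the other gateways:
#   inbound_per_node = (T / nodes) * (1 + redispatched / data_chunks)
inbound_factor = 1 + redispatched / data_chunks    # 31/14 ≈ 2.21
ceiling_gbps = nodes * link_gbps / inbound_factor

print(f"Estimated ceiling: {ceiling_gbps / 1000:.2f} Tbps")           # ≈ 1.58 Tbps
print(f"Observed 1.372 Tbps ≈ {1372 / ceiling_gbps:.0%} of ceiling")  # ≈ 87%
```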

For updates, follow #TbpsChallenge.
And for even more details, download the full benchmark report.