The exponential data growth that we have been seeing in recent years has posed many challenges to those who require information to make decisions. Raw data has to be crunched, compared, aggregated, and transformed to become valuable information, and traditional tools have proved to be inadequate for dealing with large amounts of data.
Big data analytics, in the form of Hadoop and MapReduce, has been widely considered a viable solution for some time. But then users wanted more, more quickly, and for a larger set of use cases. This brought several new tools and technologies into the space. Some of them are designed for a single use case, while others aim to be more general purpose. NoSQL databases, Elastic, and Spark are some that come to mind.
In-Memory and Big Data
Apache Spark, for example, is becoming very relevant, with a growing base of users and applications supporting it. It is easier to adopt than other tools, and its design allows it to run on several types of infrastructure while offering compatibility with many programming and query languages. Spark's architecture also supports in-memory computing, allowing applications to run much faster. This technique is proving to be beneficial for machine learning and real-time analytics, along with other tasks.
More generally, in-memory computing has been growing in popularity for years now. From a niche, high-end, and expensive data processing technique, it has become common in many fields. It has proven its worth by reducing response times by an order of magnitude (or more in some cases).
The entire industry is embracing in-memory, at all levels. You can now find support for in-memory in many applications (databases, file systems, and so on), as well as hardware (e.g. Intel Optane, Diablo Technologies Memory1, and NVMe). The cost of these memory devices is still high, though, and $/GB counts a lot when it comes to big data, and even more so in general data lakes.
In-Memory and Object Storage
Even though tools like Spark use memory extensively, they still need a persistent storage layer to retrieve and store data. Storage-class memory has a very high $/GB. It is good for extending RAM, for caching, or for small data sets, but when data is measured in hundreds of terabytes or in petabytes, its price cannot be justified. The same goes for all-flash storage (usually shared through block or file protocols).
Object storage is not usually associated with primary data or data processing but, in this particular case, it has some characteristics that make the difference. It is becoming the most viable storage solution for building data lakes, or even for large repositories for single applications.
I came across this article, "Why Choose S3 Over HDFS," and it is pretty much in line with what I had in mind several years ago when I started to write about object storage on Juku. Object storage is easier to manage, it's cheaper, it has better availability, and it offers superior scalability.
Object Storage, Serverless and Big Data
But there’s more.
Serverless computing associated with the object store can offload many tasks to the storage infrastructure, helping to simplify the entire process and provide results more quickly.
The list of tasks that can be processed directly at the storage level is long. For example, data can be checked as it is stored and pre-processed to remove unnecessary information. The opposite is true as well: data can be processed with the aim of extracting additional information and enriching its metadata, making it easier to search. Sensitive information (e.g. personal data or credit card numbers) can also be scrambled or masked.
These automated tasks, which are usually triggered by events, save a lot of time later in the data management process. Instead of having raw materials, you just feed your big data application with a sort of clean, "semi-finished" data set.
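As a rough illustration of the kind of event-triggered task described above, the following Python sketch masks credit card numbers and adds a couple of metadata fields when an object is written. The event shape, the `on_object_created` name, the card pattern, and the metadata keys are all assumptions made for this example; real serverless frameworks and object stores define their own event formats.

```python
import re

# Assumed pattern: 13 to 16 digits, optionally separated by spaces or dashes.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def mask_card(match):
    """Keep the last four digits and replace the rest with asterisks."""
    digits = re.sub(r"\D", "", match.group(0))
    return "*" * (len(digits) - 4) + digits[-4:]


def on_object_created(event):
    """Hypothetical handler invoked when an object is written.

    `event` is an assumed dict with "key", "body" (text) and "metadata".
    Returns a cleaned copy; the input event is not modified.
    """
    body = event["body"]
    # Mask sensitive values and count how many were found.
    cleaned, hits = CARD_PATTERN.subn(mask_card, body)
    # Enrich the metadata so the object is easier to search later.
    metadata = dict(event.get("metadata", {}))
    metadata["line_count"] = len(body.splitlines())
    metadata["masked_fields"] = hits
    return {"key": event["key"], "body": cleaned, "metadata": metadata}


event = {
    "key": "orders/sample.log",
    "body": "alice paid with 4111 1111 1111 1111\nbob paid cash",
    "metadata": {"source": "web"},
}
result = on_object_created(event)
print(result["body"])
print(result["metadata"])
```

The handler returns a new record instead of changing the original, so the raw object could still be kept elsewhere if a workflow needs it.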
And if this is not enough, applications like Apache Spark could be run on demand on the same infrastructure as the object store. A modern object store, associated with a serverless computing framework, can do exactly what I described above… and more.
Object storage is quickly becoming more and more compelling for big data projects, not only because of its price and TCO (which are at the base of infrastructure sustainability over time), but also because, thanks to the way in-memory processing works, the performance impact of the object store is limited when compared to other solutions.
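To make the point about in-memory processing more concrete, here is a minimal sketch of a read-through cache: the slower object store is hit once per object, and later reads are served from RAM. `SlowObjectStore` and `InMemoryCache` are hypothetical stand-ins written for this example, not the API of any real product, and this is not how Spark itself is implemented.

```python
class SlowObjectStore:
    """Hypothetical stand-in for an object store; counts backend reads."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.reads = 0

    def get(self, key):
        self.reads += 1  # each call represents one slow round trip
        return self._objects[key]


class InMemoryCache:
    """Read-through cache: fetch from the store once, then serve from RAM."""

    def __init__(self, store):
        self._store = store
        self._cache = {}

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._store.get(key)
        return self._cache[key]


store = SlowObjectStore({"dataset.csv": "a,b\n1,2\n3,4\n"})
cache = InMemoryCache(store)

# An iterative job reads the same data set many times.
for _ in range(10):
    data = cache.get("dataset.csv")

print(store.reads)  # 1
```

Ten reads by the application turn into a single read against the store, which is why the latency of the persistent layer matters less once the working set fits in memory.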
At the same time, serverless computing, when integrated with object storage, simplifies workflows and speeds up data processing to get more results, more quickly.