Enterprises need to start asking whether they should really continue building out more physical Hadoop compute and storage infrastructure. Until recently, the only way to perform production compute on big data was over clusters of physical nodes running some distribution of HDFS, mostly as YARN batch jobs. Data IT organizations spent a great deal of effort convincing corporate business units to bring their data into the “Data Lake.” However, ingestion is one thing; consumption is an altogether different consideration. For large enterprises that succeeded with data consolidation, that success meant infrastructure that kept growing. It’s time to start thinking about making it shrink instead, because Big Data is now just Data.
While business units typically have a few data experts who can write SQL queries and do useful joins within their domain expertise, most requests for reshaping, curating, or creating special views of data are handled as IT projects. The problem is that you can never hire enough data engineers to keep up with consumption demand from the business, and it’s not even the type of work most of them like to do. There are many data tools, some licensed and some open source, but most are made for power users or data engineers who can deliver real business value. The key is to stratify the architecture of your environment and match the best solution to each layer. By abstracting away the physical attributes of data and where it is stored, you can decouple data access from data storage, and decouple the consumption platform from access methods. That is how you make consumption more self-service, as the sketch below illustrates.
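As a rough illustration of that decoupling, here is a minimal sketch of a logical catalog that hides where data physically lives. The dataset names, storage accounts, and paths are hypothetical placeholders; a real deployment would more likely lean on a metastore or catalog service, but the idea is the same: consumers ask for a dataset by name, and the storage backend can move from HDFS to cloud object stores without changing consumer code.

```python
# Minimal sketch of decoupling data access from data storage.
# Dataset names and storage URIs below are hypothetical placeholders.

CATALOG = {
    # logical name        -> physical location (can change without touching consumers)
    "sales.orders":        "abfss://curated@examplelake.dfs.core.windows.net/sales/orders",
    "marketing.campaigns": "s3a://example-bucket/curated/marketing/campaigns",
    "legacy.inventory":    "hdfs://namenode:8020/warehouse/inventory",  # still on-prem, for now
}

def resolve(dataset_name: str) -> str:
    """Return the storage URI registered for a logical dataset name."""
    try:
        return CATALOG[dataset_name]
    except KeyError:
        raise ValueError(f"Unknown dataset: {dataset_name}")

def read_dataset(spark, dataset_name: str, fmt: str = "parquet"):
    """Consumers pass a name; they never see paths, buckets, or storage accounts."""
    return spark.read.format(fmt).load(resolve(dataset_name))
```

The point is not this particular catalog but the layering: the consumption layer depends only on logical names, so storage can be re-platformed underneath it.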
Much progress has already been made on data science use cases. For example, Spark can run without a Hadoop cluster: commercial or in-house applications can read and write data in Azure Blob Storage or AWS S3 for data wrangling or analysis, with no HDFS required. To replace the scaling and resource management capabilities that Hadoop provided, you can use Mesos or Kubernetes, along with a wider choice of orchestration tools, to accomplish the same thing. Spin up an Azure Databricks (Spark as a service) compute cluster in the cloud that reads and writes Azure Blob Storage, provision the connections, and you’re good to go. Programmable auto-scaling and auto-termination of Spark clusters based on workload patterns can drastically reduce costs, so business people can choose how fast they want results based on their budget.
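As a concrete example of that programmability, here is a minimal sketch that creates an auto-scaling, auto-terminating cluster through the Databricks Clusters REST API. It assumes you have a Databricks workspace and a personal access token; the workspace URL, token, runtime version, and VM size shown are placeholders to replace with values valid in your environment.

```python
# Sketch: create an auto-scaling, auto-terminating Databricks cluster
# via the Clusters REST API. All identifiers below are placeholders.
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace
TOKEN = "dapiXXXXXXXXXXXX"                                             # placeholder access token

cluster_spec = {
    "cluster_name": "self-service-analytics",
    "spark_version": "13.3.x-scala2.12",                 # example runtime; pick one your workspace offers
    "node_type_id": "Standard_DS3_v2",                   # example Azure VM size
    "autoscale": {"min_workers": 2, "max_workers": 8},   # grow and shrink with the workload
    "autotermination_minutes": 30,                       # shut down when idle to control cost
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])
```

Widening the autoscale range buys faster results at higher cost, while a shorter auto-termination window trims spend on idle clusters, which is exactly the budget-versus-speed trade-off business users can be given.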
There will be little innovation in Hadoop going forward; Spark and other Apache projects will make it obsolete in three years. Hadoop is 10 years old, and that’s old even in elephant years. I see Hadoop being relegated to the island of misfit legacy systems. Its strength as an inexpensive storage repository and ETL batch workload engine has lost its muscle. Most everyone agrees that legacy big data distributions are not suitable for interactive, client-facing applications. Business people want to join, reshape, and curate their structured or unstructured data, regardless of where it lives. The time is right to free the data, and to let the elephant go while you’re at it.