OLAP in the Big Data world

Over the past few months, I have been wondering: is OLAP a dead technology, or can we still find a use for it?

OLAP Over Hadoop

In the last few years, Hadoop has really come forward as a massively scalable distributed computing platform. Most of us are aware that it uses MapReduce jobs to perform computation over Big Data, which is mostly unstructured. Of course, such a platform cannot be compared with a relational database storing structured data with a defined schema. While Hadoop allows you to perform deep analytics with complex computations, it lags behind when it comes to multidimensional analytics over that data. You might argue that Hadoop was not even built for such uses. But when users start putting their historical data in Hadoop, they also start expecting multidimensional analytics over it in real time. Here “real time” is really important.

Some of you might think that you can define an OLAP-friendly warehousing star schema using Hive for your data in Hadoop and use a ROLAP tool. But there is a catch. Even on partially aggregated data, ROLAP queries will be too slow for real-time OLAP. Because Hive applies structure to the data at read time, and a fixed initial time is spent on every Hive query, Hadoop is practically unusable for real-time multidimensional analytics.

That leaves you with two options. The first is to aggregate the data in Hadoop and bring the partially aggregated data into an RDBMS. You can then connect any standard OLAP tool to your RDBMS and perform multidimensional analytics using ROLAP or MOLAP. While ROLAP fires its queries directly against the database, MOLAP further summarizes and aggregates the multidimensional data into the cuboids that make up a cube.
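To make the idea of cuboids concrete, here is a minimal sketch in plain Python. It assumes a tiny, made-up fact table (the `FACTS` rows, the dimension names and the `build_cube` helper are all hypothetical, not taken from any OLAP product) and precomputes one cuboid per subset of dimensions, which is roughly what a MOLAP engine does ahead of query time.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical, partially aggregated fact rows: (region, product, year, sales).
DIMENSIONS = ("region", "product", "year")
FACTS = [
    ("EU", "widget", 2012, 100),
    ("EU", "gadget", 2012, 50),
    ("US", "widget", 2012, 70),
    ("US", "widget", 2013, 30),
]


def build_cube(facts, dimensions):
    """Return {cuboid_dimensions: {key_tuple: total}} for every subset of dimensions."""
    cube = {}
    n = len(dimensions)
    for size in range(n + 1):
        for idx in combinations(range(n), size):
            cuboid = defaultdict(int)
            for row in facts:
                key = tuple(row[i] for i in idx)
                cuboid[key] += row[n]  # the measure is the last column
            cube[tuple(dimensions[i] for i in idx)] = dict(cuboid)
    return cube


cube = build_cube(FACTS, DIMENSIONS)
print(cube[("region",)])         # totals per region
print(cube[("region", "year")])  # totals per region and year
print(cube[()])                  # the grand total (apex cuboid)
```

With three dimensions this produces 2^3 = 8 cuboids. The count doubles with every added dimension, which is why real tools usually materialize only some of them.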

The other option is to use a MOLAP tool that computes the aggregates over the data in Hadoop and fetches the computed cube locally. This allows truly real-time OLAP. Moreover, if the aggregation can be performed in Hadoop itself, the cube computation becomes scalable and fast.
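The reason aggregation scales in a MapReduce setting is that sums can be computed per input split and merged afterwards. The sketch below imitates that pattern in plain Python rather than using the Hadoop API; the `SPLITS` data and the function names `map_split` and `reduce_partials` are assumptions made for illustration only.

```python
from collections import Counter
from functools import reduce

# Hypothetical input splits, as if each one were a block handled by one mapper.
SPLITS = [
    [("EU", 2012, 100), ("US", 2012, 70)],
    [("EU", 2012, 50), ("US", 2013, 30)],
]


def map_split(split):
    """Map + combine: partial sales totals per (region, year) within one split."""
    partial = Counter()
    for region, year, sales in split:
        partial[(region, year)] += sales
    return partial


def reduce_partials(left, right):
    """Reduce: merge two partial aggregates by adding totals key by key."""
    merged = Counter(left)
    merged.update(right)  # Counter.update adds values instead of replacing them
    return merged


aggregates = reduce(reduce_partials, map(map_split, SPLITS), Counter())
print(dict(aggregates))
```

Because addition is associative and commutative, the partial results can be merged in any order, so adding more splits (and more machines) does not change the answer. Only the small merged result needs to leave the cluster.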

One can argue at length that Hadoop is not a DBMS, but when Hadoop reaches users and organizations who adopt it just because it is a buzzword, they expect it to do almost anything a DBMS can do. You should see such solutions growing in the near future.

Just like data warehouses, analytics software has been around for some time and has been providing value to business users for many years in problem domains such as market basket analytics, sales analytics, predictive analytics, etc.
These days there is a lot of advertising and buzz around “Big Data Analytics”. So what makes your analytics “Big Data Analytics”?
Is it adding OLAP/MDX layers on top of Hadoop and NoSQL databases? Or can we call our analytics Big Data Analytics if we ETL data from HDFS with tools like Sqoop, SSIS or Kettle into a star schema in a traditional RDBMS? Based on feedback from my post called “Did Big Data Kill OLAP Cubes”, my guess would be that most of you do not think that is sufficient.
But what about scale and performance as part of the Big Data equation? You know: the volume, velocity, variety, etc. Does traditional OLAP on top of those sources provide the analytics that a data scientist requires?
A very important aspect of Big Data Analytics that differentiates it from traditional BI analytics (this is my PM opinion!) is the target persona. Big Data Analytics is primarily for data scientists rather than knowledge workers and business decision makers. Data scientists can subsequently work with IT on a process to “operationalize” their data discovery and the outputs of their models so that traditional BI solutions can consume the processed data.
So if you buy into this definition of Big Data Analytics, what this means is that you will need:
  1. Big Data scale with distributed analytics processed with data locality on cluster data nodes
  2. In-memory data caching for quick response times from interactive tools
  3. Columnar compression in order to fit large data sets in memory
  4. Data mining algorithms
  5. Data visualization tools that encourage data discovery, anomaly detection, and data blending
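As a rough illustration of point 3, here is a minimal sketch of two common columnar compression ideas, dictionary encoding and run-length encoding, written in plain Python. The helper names and the sample column are made up for this example and do not reflect how any particular in-memory engine stores its data.

```python
def dictionary_encode(column):
    """Replace each value with a small integer code plus a lookup table."""
    dictionary = []  # code -> original value
    index = {}       # original value -> code
    codes = []
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into [code, run_length] pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return runs


def decode(dictionary, runs):
    """Rebuild the original column from the dictionary and the runs."""
    return [dictionary[code] for code, count in runs for _ in range(count)]


# Hypothetical "region" column from a fact table stored column by column.
region_column = ["EU", "EU", "EU", "US", "US", "EU"]
dictionary, codes = dictionary_encode(region_column)
runs = run_length_encode(codes)
restored = decode(dictionary, runs)
print(dictionary, runs)
```

Columns with few distinct values and long runs compress well this way, which is one reason a column layout can help large data sets fit in memory.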
