Predictive analytics: the process of building it

What is Predictive Analytics?

I have talked with many people with different levels of technical knowledge, and I have often been asked questions like: So, can predictive analytics tell my future? The sad answer is no. Predictive analytics will not tell you for certain whether you are going to be rich. It will not guarantee 100% that your favorite team will win so you can put all your savings on a bet. And it won't tell you for sure where you will end up next year.

However, predictive analytics can definitely forecast and give hints about what might happen in the future with an acceptable level of reliability, and it can include risk assessment and what-if scenarios.

It is the process of extracting information from a dataset in order to identify patterns or predict future outcomes based on that data.

What does the process of implementing predictive analytics include?

From the initial idea to a working predictive model whose results you can interpret, there are a number of steps that need to be taken care of.

Know the business problem

It is really important to know the scope of the business that you are building a model for. Many people think that it is enough to apply statistical methods and some machine learning algorithm to a big chunk of data, and the model will give you an answer by itself. Unfortunately, this is not the case. Most of the time you will have to complement your dataset by deriving new metrics from the data you already have, and for that you will need to know at least the essentials of the business.

First and foremost, it is essential to identify your business problem. Only then can you determine which metrics are necessary to address it, and decide which analysis technique to use.

Extracting and transforming raw data

While building a predictive model you will spend a lot of time preparing the data in the best possible way. That includes handling different data sources: web APIs, unstructured data (usually collected from web logs, Facebook, Twitter, etc.), different database engines (MSSQL, MySQL, Sybase, Cassandra, etc.), and flat files (comma-separated values (CSV), tab-delimited files, etc.). Therefore, knowledge of database structures, ETL, and general computer science is really useful. In some cases you might be lucky enough to have a separate team that provides these services and delivers a nicely formatted file you can work on, but in most cases you will still have to do the data transformations yourself.

Tools that are usually used for this kind of transformation are SSIS from Microsoft, SAP Data Services, IBM InfoSphere Information Server, SAS Data Management, and Python. Many times I have seen ETL processes made purely in C# (when application developers are given BI tasks), purely in stored procedures (when database developers are given BI tasks), or in R (when statisticians try to perform ETL).
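
To make the transformation step concrete, here is a minimal sketch in Python with pandas; the file name, column names, and cleaning rules are all hypothetical placeholders, not a prescription:

```python
import pandas as pd

# Hypothetical example: load a raw CSV export and reshape it for modeling.
raw = pd.read_csv("sales_export.csv")

# Basic cleaning: normalize column names, parse dates, drop unusable rows.
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
raw = raw.dropna(subset=["order_date", "amount"])

# Derive a new metric out of the existing data, as discussed above.
monthly = (
    raw.set_index("order_date")
       .resample("M")["amount"]
       .sum()
       .rename("monthly_revenue")
)

monthly.to_csv("monthly_revenue.csv")  # hand-off to the modeling step
```
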
Exploratory data analysis

Exploratory data analysis is an approach to analyzing data sets in order to summarize their main characteristics, often using visual methods. A statistical model may or may not be used, but primarily, exploratory data analysis is for seeing what the data can tell us beyond the formal modeling or hypothesis-testing task.

Exploratory data analysis vs. summary data analysis

A summary analysis is simply a numeric reduction of a historical data set. It is quite passive: its focus is the past, and its intent is simply to arrive at a few key statistics, such as the mean and standard deviation, which may then either replace the data set or be appended to it in the form of a summary table.
The purpose of exploratory data analysis, by contrast, is to gain insight into the engineering or scientific process behind the data. Whereas summary statistics are passive and historical, exploratory data analysis is active and forward-looking: in an effort to understand the process and improve it in the future, it uses the data as a "window" to peer into the heart of the process that generated them.
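
As a minimal sketch of the difference, assuming the hypothetical monthly_revenue.csv from the ETL sketch above: the describe() call is the summary analysis, while the plots are the exploratory part.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Continuing the hypothetical dataset from the ETL sketch above.
df = pd.read_csv("monthly_revenue.csv", parse_dates=["order_date"])

# Summary analysis: a passive, numeric reduction of the data.
print(df["monthly_revenue"].describe())  # mean, std, quartiles, ...

# Exploratory analysis: actively look at what the data can tell us.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].plot(df["order_date"], df["monthly_revenue"])  # trend? seasonality?
axes[0].set_title("Revenue over time")
axes[1].hist(df["monthly_revenue"], bins=20)           # skew? outliers?
axes[1].set_title("Distribution")
plt.tight_layout()
plt.show()
```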

Building a predictive model

After successfully identifying the problems the model needs to solve, it is time to write some code and do some testing in order to build a model that will predict outcomes for the data we anticipate receiving in the near future, based on the data we already have.
This is the time to implement some machine learning. You will try algorithms like ANOVA, ARIMA, decision trees, KNN, etc., depending on the problem you are trying to solve and the performance each algorithm delivers on the specific data you have.
At this stage, we should evaluate algorithms by developing a test harness and a baseline accuracy from which to improve. The second step is to leverage those results to develop more accurate models. A minimal sketch of such a test harness with scikit-learn follows, using a bundled dataset as a stand-in for your own data:
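
```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in for your own data

# The baseline gives the accuracy any real model has to beat.
models = {
    "baseline (majority class)": DummyClassifier(strategy="most_frequent"),
    "KNN": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```
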
There are many ways to choose the right algorithm while building a model, depending on the scope of the problem. Most of the time the prediction model is improved by combining more than one algorithm, a technique known as blending. For example, the predictive model that won the $1 million Netflix Prize for movie recommendations combines more than 100 different models blended into one.
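
As a toy illustration of blending (not the Netflix solution itself), one simple form is to average the predicted class probabilities of two different models, again using a stand-in dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in for your own data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two different models trained on the same data...
lr = LogisticRegression(max_iter=5000).fit(X_train, y_train)
knn = KNeighborsClassifier().fit(X_train, y_train)

# ...blended by averaging their predicted class probabilities.
blend = (lr.predict_proba(X_test) + knn.predict_proba(X_test)) / 2
y_pred = blend.argmax(axis=1)
print("blended accuracy:", accuracy_score(y_test, y_pred))
```
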
Popular tools used nowadays are R, Python, Weka, RapidMiner, MATLAB, IBM SPSS, and Apache Mahout.
I will write more about choosing the right algorithm for a specific problem in another article.

Presenting the outcome

At this stage, we need to come up with a way of presenting the results that the predictive model has generated. This is where good data visualization practices come in handy. Most of the time the results are presented as a report or just an Excel spreadsheet, but lately I see increased demand for interactive dashboards, where the user can see the data from many perspectives instead of one.
We should also be careful how we present the data, since executives and people who need to make strategic decisions are not necessarily technical. We must make sure they gain a good understanding of the data. Asking for the help of a graphic designer, or reading up on how to use colors and shapes, will be really useful and will pay off in the end.
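
Even without a full dashboard platform, a simple, clearly labeled chart goes a long way. A minimal sketch in Python, with purely illustrative numbers:

```python
import matplotlib.pyplot as plt

# Hypothetical figures, purely for illustration.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
actual = [120, 135, 128, 150, 160, 155]
predicted = [118, 130, 133, 146, 158, 162]

plt.figure(figsize=(8, 4))
plt.plot(months, actual, marker="o", label="Actual revenue")
plt.plot(months, predicted, marker="s", linestyle="--", label="Predicted revenue")
plt.title("Monthly revenue: actual vs. predicted")
plt.ylabel("Revenue (k$)")
plt.legend()
plt.tight_layout()
plt.show()
```
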
Some popular visualization platforms that can provide interactive dashboards are MicroStrategy, PerformancePoint on SharePoint, Tableau, QlikView, Logi Analytics, SAP, SAS, and offerings from IBM ("Big Blue").

As you can see, the process of building a predictive model has a much bigger scope than just applying some fancy algorithms and mathematical formulas. Therefore, a successful data scientist should have an understanding of business problems and business analysis, so that he/she has a greater grasp of what the data is really saying; some computer science skills, to perform the extraction and transformation of the different data sources; and some statistical knowledge, to apply data sampling, understand the predictive models better, do hypothesis testing, etc.
You might think that a person with such broad knowledge does not exist, but in fact they do. That is why data scientists are so highly appreciated lately.
