Practical Predictive Analytics in everyday enterprises

Predictive analytics is now part of the analytics fabric of companies. Yet even as adoption continues, many companies are struggling to make it stick. Many organizations have not thought through how to actually put predictive analytics to work, given the organizational, technology, process, and deployment concerns they face.

These can be some of the biggest challenges organizations face today:

Skills development. Organizations are concerned about skills for predictive modeling. These skills include understanding how to train a model, interpret its output, and determine which algorithm to use in which situation. Skills are frequently the most significant barrier to adoption of predictive analytics, and in many cases the top difficulty.

Model deployment. Companies are using predictive analytics and machine learning across a range of use cases, and those still exploring the technology are likewise preparing for a diverse set of use cases. Yet many are not considering what it takes to build a valid predictive model and put it into production. Only a small number of data science teams have a DevOps group, or another group that puts machine learning models into production, maintains versioning, and monitors the models. From experience, operating with these team structures, it can take months to put models into production.

Infrastructure. On the infrastructure side, the vast majority of companies use the data warehouse, along with a variety of other technologies such as Hadoop, data lakes, or the cloud, for developing predictive models. The good news is that companies appear to be looking to broaden their data platforms to support predictive analytics and machine learning. The move to a modern data architecture that supports diverse types of data makes sense and is required to succeed with predictive analytics.

New Practices for Predictive Analytics and Machine Learning

Since predictive analytics and machine learning skills are in such high demand, vendors are offering tooling to help make predictive modeling easier, particularly for new users. Essential to ease of use are these features:

  • Collaboration features. Anyone from a business analyst to a data scientist building a model often wants to collaborate with others. A business analyst may want to get input from a data scientist to validate a model or help build a more sophisticated one. Vendors provide collaboration features in their software that enable users to share or comment on models. Collaboration among analysts is an important best practice to help democratize predictive analytics.
  • Workflows and versioning. Many products provide workflows that can be saved and reused, including data pipeline workflows for preparing the data as well as analytics workflows. If a data scientist or another model builder develops a model, others can reuse it. This often includes a point-and-click interface for model versioning, which is crucial for tracking the newest models and model history, and for analytics governance.
  • GUIs. Many users do not like to program or even write scripts; this spurred the move toward GUIs (graphical user interfaces) in analytics products decades ago. Today's GUIs typically offer a drag-and-drop and point-and-click interface that makes it easy to construct analytics workflows. Nodes can be selected, configured, dragged onto a canvas, and linked to form a predictive analytics workflow. Some vendor GUIs allow users to plug in open source code as a node in the workflow, which supports models built in R or Python, for example.
  • Persona-driven features. Different users want different interfaces. A data scientist may want a notebook-based interface, such as Jupyter notebooks (e.g., "live" web coding and collaboration interfaces), or simply a programming interface. A business analyst may prefer a GUI, or a natural language-based interface to ask questions quickly and discover insights (even predictive ones) in the data. New analytics platforms offer tailored environments to meet the requirements of these different personas while maintaining reliable data integrity beneath the platform. This makes building models more efficient.


Build Machine learning models using Time series data

Time series forecasting is an important area of machine learning that is often neglected. It is important because there are so many prediction problems that involve a time component. These problems are neglected because it is this time component that makes time series problems more difficult to handle.

Time series vs. normal machine learning dataset

A normal machine learning dataset is a collection of observations. For example:

observation #1
observation #2
observation #3

Predictions are made for new data when the actual outcome may not be known until some future date. The future is being predicted, but all prior observations are treated essentially equally, perhaps with some very minor temporal dynamics to address concept drift, such as using only the last year of observations rather than all the data available.

A time series dataset is different. Time series adds an explicit order dependence between
observations: a time dimension. This additional dimension is both a constraint and a structure that provides a source of additional information.
A time series is a sequence of observations taken sequentially in time.

Time #1, observation
Time #2, observation
Time #3, observation
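To make the contrast concrete, here is a minimal sketch in Python with pandas; the values and dates are made up for illustration.

```python
import pandas as pd

# A "normal" dataset: a bag of observations with no inherent order.
normal = pd.DataFrame({"feature": [4.1, 3.7, 5.0], "outcome": [0, 1, 1]})

# A time series: observations indexed by time, so row order carries information.
series = pd.Series(
    [112, 118, 132],
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
    name="observation",
)

print(normal)
print(series)
```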

Time Series Nomenclature

It is essential to quickly establish the standard terms used when describing time series data. The current time is defined as t, and the observation at the current time is defined as obs(t).

We are often interested in the observations made at prior times, called lag times or lags.

Times in the past are negative relative to the current time. For example, the previous time is t-1 and the time before that is t-2. The observations at these times are obs(t-1) and obs(t-2) respectively.

To summarize:

  • t-n: A prior or lag time (e.g. t-1 for the previous time).
  • t: A current time and point of reference.
  • t+n: A future or forecast time (e.g. t+1 for the next time).
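As a small illustration of this notation, here is a hedged Python sketch that builds lagged columns with pandas; the observation values are hypothetical.

```python
import pandas as pd

obs = pd.Series([10, 12, 15, 14, 18], name="obs")

frame = pd.DataFrame({
    "obs(t-2)": obs.shift(2),   # lag of two time steps
    "obs(t-1)": obs.shift(1),   # the previous time step
    "obs(t)":   obs,            # the current time, the point of reference
    "obs(t+1)": obs.shift(-1),  # the future value a forecast would target
})
print(frame)
```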

Time Series Analysis vs. Time Series Forecasting

We have different goals depending on whether we are interested in understanding a dataset or making predictions. Understanding a dataset, called time series analysis, can help to make better predictions, but is not required and can result in a large technical investment in time and expertise not directly aligned with the desired outcome, which is forecasting the future.

Time Series Analysis

When using classical statistics, the primary concern is the analysis of time series. Time series analysis involves developing models that best capture or describe an observed time series in order to understand the underlying causes. This field of study seeks the why behind a time series dataset. This often involves making assumptions about the form of the data and decomposing the time series into constituent components. The quality of a descriptive model is determined by how well it describes all available data and the interpretation it provides to better inform the problem domain.

The primary objective of time series analysis is to develop mathematical models that provide plausible descriptions from sample data.

Time Series Forecasting

Making predictions about the future is called extrapolation in the classical statistical handling of time series data. More modern fields focus on the topic and refer to it as time series forecasting.

Forecasting involves taking models fit on historical data and using them to predict future observations. Descriptive models can borrow from the future (i.e., to smooth or remove noise) because they only seek to best describe the data. An important distinction in forecasting is that the future is completely unavailable and must be estimated only from what has already happened.

The skill of a time series forecasting model is determined by its performance at predicting the future. This often comes at the expense of being able to explain why a specific prediction was made, of confidence intervals, and even of a better understanding of the underlying causes behind the problem.

Components of Time Series

Time series analysis provides a body of techniques to better understand a dataset. Perhaps the most useful of these is the decomposition of a time series into 4 constituent parts:

  • Level. The baseline value for the series if it were a straight line.
  • Trend. The optional and often linear increasing or decreasing behavior of the series over time.
  • Seasonality. The optional repeating patterns or cycles of behavior over time.
  • Noise. The optional variability in the observations that cannot be explained by the model.

All time series have a level, most have noise, and the trend and seasonality are optional.
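As an illustration, the following is a minimal decomposition sketch in Python, assuming the statsmodels library is available; the monthly series is synthetic and the yearly period (12) is just an example.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series built from the four parts described above.
rng = np.random.default_rng(0)
index = pd.date_range("2020-01-01", periods=48, freq="MS")
values = (
    100                                            # level
    + 0.5 * np.arange(48)                          # trend
    + 10 * np.sin(2 * np.pi * np.arange(48) / 12)  # seasonality
    + rng.normal(scale=2, size=48)                 # noise
)
series = pd.Series(values, index=index)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head())
print(result.resid.dropna().head())
```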


Concerns of forecasting time series

When forecasting, it is important to understand your goal. Use the Socratic method and ask lots of questions to help zoom in on the specifics of your predictive modeling problem. For example:

  1. How much data do you have available, and are you able to gather it all together? As in all machine learning problems, more data is often more helpful, offering greater opportunity for exploratory data analysis, model testing and tuning, and model fidelity.
  2. What is the time horizon of predictions that is required? Short, medium, or long term? Shorter time horizons are often easier to predict with higher confidence.
  3. Can forecasts be updated frequently over time, or must they be made once and remain static? Updating forecasts as new information becomes available often results in more accurate predictions.
  4. At what temporal frequency are forecasts required? Often forecasts can be made at a lower or higher frequency, allowing you to harness down-sampling and up-sampling of data, which in turn can offer benefits while modeling.

Time series data often requires cleaning, scaling, and even transformation. For example:

  • Frequency. Perhaps data is provided at a frequency that is too high to model or is unevenly spaced through time requiring resampling for use in some models.
  • Outliers. Perhaps there are corrupt or extreme outlier values that need to be identified and handled.
  • Missing. Perhaps there are gaps or missing data that need to be interpolated or imputed.
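As a rough illustration of these cleanup steps, here is a hedged pandas sketch; the readings, the 15-minute frequency, and the outlier cap are all made up for the example.

```python
import pandas as pd

# Unevenly spaced raw readings with one extreme spike (values are made up).
raw = pd.Series(
    [5.0, 5.2, 300.0, 5.4],
    index=pd.to_datetime(
        ["2023-01-01 00:00", "2023-01-01 00:07",
         "2023-01-01 00:16", "2023-01-01 00:50"]
    ),
)

resampled = raw.resample("15min").mean()      # enforce an even frequency
clipped = resampled.clip(upper=10.0)          # crude, illustrative outlier cap
clean = clipped.interpolate(method="time")    # fill the gap left by resampling
print(clean)
```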

Often time series problems are real-time, continually providing new opportunities for prediction. This adds an honesty to time series forecasting that quickly fleshes out bad assumptions, errors in modeling, and all the other ways that we may be able to fool ourselves.

Examples of Time Series Forecasting

  • Forecasting the yield of a commodity, such as corn or wheat, in tons by state each year.
  • Forecasting whether an EEG trace in seconds indicates a patient is having a seizure or not.
  • Forecasting the closing price of a stock each day.
  • Forecasting the birth rate at all hospitals in a city each year.
  • Forecasting product sales in units sold each day for a store.
  • Forecasting the number of passengers through a train station each day.
  • Forecasting unemployment for a state each quarter.
  • Forecasting utilization demand on a server each hour.
  • Forecasting the size of the rabbit population in a state each breeding season.
  • Forecasting the average price of gasoline in a city each day.

Predictive Analytics: from research and development to business decision making

The start of predictive analytics and machine learning

Predictive analytics started in the early 90s with pattern recognition algorithms—for example, finding similar objects. Over the years, things have evolved into machine learning. In the workflow of data analysis, you collect data, prepare data, and then perform the analysis. If you employ algorithms or functions to automate the data analysis, that’s machine learning.

Read more about the process of building predictive analytics later in this article.


Data scientists: How can I use my data?


I have been asked numerous times: "I got access to the database. Can you please tell me how to use the data?"

Since I have been asked this more and more, I took some time to answer it here and help people with this question.

What do you want to do with your data?

First and foremost, what are you trying to do with the data?
Ask yourself, your manager, a friend, or whoever is making you do something with the data: what do you want the data to show you?
Most of the time, data is only as powerful as your understanding of it. Here is how you can build that understanding:

Understand how your data is linked

Databases, whether relational or non-relational, have schemas. A schema shows where specific attributes are stored and in which tables or objects, and it also shows how tables or objects are connected to one another: the linking.

What is an attribute? An attribute is basically anything descriptive: a name, surname, address, profit, and so on.
What is a table or object? A table or object is the structure that holds or groups the attributes.
What are links or keys? Links, keys, and foreign keys are the information that allows you to link one table to another. For example, if you want to link profit to a sales person, or an address to a person, you will join the tables. Mostly this is done using foreign keys or by joining multiple attributes, creating a composite key.
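To make the idea of keys and joins concrete, here is an illustrative sketch using pandas instead of SQL; the table and column names are hypothetical.

```python
import pandas as pd

persons = pd.DataFrame({
    "person_id": [1, 2],
    "name": ["Ana", "Bo"],
})
sales = pd.DataFrame({
    "sale_id": [10, 11, 12],
    "person_id": [1, 1, 2],   # foreign key pointing back to persons
    "profit": [200.0, 150.0, 90.0],
})

# Join profit to the sales person via the shared key.
joined = sales.merge(persons, on="person_id", how="left")
print(joined.groupby("name")["profit"].sum())
```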

How to learn the links between the data

Some people try to learn the schema all at once, studying the tables, their attributes, and how they link to each other.
My suggestion is to learn by doing. In the world of big data, database systems are getting too complex to be learned all at once, and most of the time you won't need to know it all. Learning through practical use cases can help you understand not just the table structure but also the underlying data.

How do I get the data?

Usually we use SQL to query the databases. SQL is the fastest and best performing way to do it.

Other ways include using code: Java, .NET, R, Python, and so on.
You can also query data easily from Excel while creating a pivot table.
Lately, data scientists are using tools like KNIME and Alteryx to fetch the data. This approach does not require knowing any query language, but you might face the risk of downloading gigabytes of data into your memory or disk if the table you are trying to query is that big.
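To show the code-based approach, here is a minimal Python sketch using the standard-library sqlite3 driver and pandas; the table and data are created in memory purely for illustration, and for MSSQL, MySQL, etc. you would swap in the appropriate driver and connection string.

```python
import sqlite3
import pandas as pd

# An in-memory database stands in for your real MSSQL/MySQL/etc. connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (person_id INTEGER, profit REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 200.0), (2, 90.0)])

# Let the database filter the rows instead of downloading the whole table.
query = "SELECT person_id, profit FROM sales WHERE profit > 100"
df = pd.read_sql_query(query, conn)
conn.close()
print(df)
```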

Use your imagination.

Once you succeed in getting your data, start using your imagination and think of useful use case scenarios that will help your business.

Visualize your data

Plain data is boring, which is why we visualize it.

An easy way to do that is with Excel. Excel is really powerful by itself and can create pretty charts.
Some other popular tools are Tableau, QlikView, Jaspersoft, and Highcharts.
Of course, there are endless other solutions that can create pretty charts.
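If you prefer code over a BI tool, here is a small, hedged sketch of a first chart in Python with matplotlib; the sales figures are made up.

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 150]

plt.figure(figsize=(6, 3))
plt.bar(months, sales, color="steelblue")   # one color keeps the chart readable
plt.title("Monthly sales (illustrative data)")
plt.ylabel("Units sold")
plt.tight_layout()
plt.show()
```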
One word of caution about visualization: don't try to over-visualize things, because charts can become really confusing. Also, try to use a few colors instead of the whole color palette, so other people can follow you.

 

Create a story with your data

Now that you have your use case and cool visualization, try to create a story.
People will understand what you want to say and even get new ideas if you tell your analysis in a nice story.
Happy data mining, now that you know what to do with your data!

Predictive analytics, the process of building it

What is Predictive Analytics?

I have talked with many people with different levels of technical knowledge, and many times I have been asked questions like: so, can predictive analytics tell my future? The sad answer is no. Predictive analytics will not tell you for certain whether you are going to be rich or not. Nor will it guarantee 100% that your favorite team will win so you can put all your savings on a bet. It also won't tell you for sure where you will end up next year.

However, predictive analytics can definitely forecast and give hints about what might happen in the future with an acceptable level of reliability, and it can include risk assessment and what-if scenarios.

Predictive analytics is the process of extracting information from a dataset in order to find patterns in, or predict future outcomes from, that data.

What does the process of implementing Predictive Analytics include?

From having an idea to implementing a predictive model and being able to interpret its results, there are a number of steps that need to be taken care of.

Know the business problem

It is really important to know the scope of the business that you are building a model for. Many people think that it is enough to apply statistical methods and some machine learning algorithm to a big chunk of data and the model will give you an answer by itself. Unfortunately, this is not the case. Most of the time you will have to complement your dataset by producing new metrics out of the existing data, and for that you will have to know at least the essentials of the business.

First and foremost, it is essential to identify your business problem. After that, you can successfully determine what metrics are necessary to address your problem and then you can decide which analysis technique you will use.

Transforming and extracting raw data

While trying to build a predictive model you will spend a lot of time trying to prepare the data in the best possible way. That will include handling different data sources like web APIs, unstructured data (usually collected from weblogs, Facebook, Twitter, etc.), different database engines (MSSQL, MySQL, Sybase, Cassandra, etc.), and flat files (comma-separated value (CSV) files, tab-delimited files, etc.). Therefore, knowledge of database structures, ETL, and general computer science is really useful. In some cases you might be lucky enough to have a separate team that provides these services for you and delivers a nicely formatted file you can work on, but in most cases you will still have to do the data transformations yourself.

Tools that are usually used for these kinds of transformations are SSIS from Microsoft, SAP Data Services, IBM InfoSphere Information Server, SAS Data Management, and Python. Many times I have seen ETL processes made purely in C# (when application developers are given BI tasks), purely in stored procedures (when database developers are given BI tasks), or in R (when statisticians try to perform ETL).
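As a tiny, hedged example of such a transformation step in Python, the sketch below reads a flat file, drops unusable rows, and derives a new metric; the column names and values are hypothetical.

```python
import io
import pandas as pd

# Stand-in for a flat file you might receive (hypothetical columns and values).
raw_csv = io.StringIO(
    "order_date,customer_id,quantity,unit_price\n"
    "2023-01-05,C1,3,10.0\n"
    "2023-01-20,C2,1,25.0\n"
    "2023-02-02,,2,10.0\n"
)

orders = pd.read_csv(raw_csv, parse_dates=["order_date"])
orders = orders.dropna(subset=["customer_id"])                 # drop unusable rows
orders["revenue"] = orders["quantity"] * orders["unit_price"]  # derived metric
monthly = orders.resample("MS", on="order_date")["revenue"].sum()
print(monthly)
```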
Exploratory data analysis

Exploratory data analysis is an approach to analyzing data sets in order to summarize their main characteristics. Often, exploratory data analysis is done using visual methods. A statistical model may or may not be used, but primarily, exploratory data analysis is for seeing what the data can tell us beyond the formal modeling or hypothesis-testing task.

Exploratory data analysis vs. summary data analysis

A summary analysis is simply a numeric reduction of a historical data set. It is quite passive, and its focus is on the past. Its intent is simply to arrive at a few key statistics, like the mean and standard deviation, which may then either replace the data set or be appended to it in the form of a summary table.
The purpose of exploratory data analysis, by contrast, is to gain insight into the engineering or scientific process behind the data. Whereas summary statistics are passive and historical, exploratory data analysis is active and forward-looking. In an effort to understand the process and improve it in the future, it uses the data as a window to peer into the heart of the process that generated it.
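Here is a minimal exploratory-data-analysis sketch in Python with pandas, using a synthetic dataset; the point is the pattern (summarize, then explore), not the specific numbers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=200),
    "income": rng.normal(50_000, 12_000, size=200),
})
df["spend"] = 0.1 * df["income"] + rng.normal(0, 1_000, size=200)

print(df.describe())   # the passive, "summary analysis" part
print(df.corr())       # start asking what relates to what
df.hist(bins=20)       # visual methods (requires matplotlib to display)
```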

Building a predictive model

After successfully identifying the problems that the model needs to solve, it is time to write some code and do some testing in order to build a model that will predict outcomes for the data we anticipate receiving in the near future, based on the data we already have.
This is the time to implement some machine learning. You will basically try algorithms like ANOVA, ARIMA, decision trees, KNN, etc., depending on the problem you are trying to solve and the performance that the algorithm gives for the specific data you have.
In this phase, we should basically evaluate algorithms by developing a test harness and a baseline accuracy from which to improve, and then leverage the results to develop more accurate models.
There are many ways to choose the right algorithm while building a model, depending on the scope of the problem. Most of the time the prediction model is improved by combining more than one algorithm: blending. For example, the predictive model that won the $1 million prize from Netflix for movie recommendations contained more than 100 different models blended into one.
Popular tools used nowadays are R, Python, Weka, RapidMiner, Matlab, IBM SPSS, and Apache Mahout.
I will write more about choosing the right algorithm for a specific problem in another article.
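As a hedged sketch of the test harness idea, the Python example below compares a naive baseline against a random forest with cross-validation on synthetic data; your own problem, data, and metric would differ.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic classification data standing in for a real business problem.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

baseline = DummyClassifier(strategy="most_frequent")
model = RandomForestClassifier(n_estimators=100, random_state=42)

for name, estimator in [("baseline", baseline), ("random forest", model)]:
    scores = cross_val_score(estimator, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```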

Presenting the outcome

At this stage, we need to come up with a way of presenting the results that the predictive model has generated. This is where good data visualization practices come in handy. Most of the time the results are presented as a report or just an Excel spreadsheet, but lately I see increased demand for interactive dashboards where the user can see the data from many perspectives instead of one.
At this stage, we should be careful how we present the data, since executives and people who need to make strategic decisions are not necessarily very technical. We must make sure that they have a good understanding of the data. Asking for the help of a graphic designer, or reading more about how to work with colors and shapes, will be really useful and will pay off in the end.
Some popular visualization platforms that can provide interactive dashboards are MicroStrategy, PerformancePoint on SharePoint, Tableau, QlikView, Logi Analytics, SAP, SAS, and IBM (Big Blue).

As you can see, the process of building a predictive model has a much bigger scope than just applying some fancy algorithms and mathematical formulas. Therefore, a successful data scientist should have an understanding of business problems and business analysis, so he or she has a greater grasp of what the data is really saying; some computer science skills, to perform the extraction and transformation of the different data sources; and some statistical knowledge, to apply data sampling, better understand the predictive models, do hypothesis testing, and so on.
You might think that a person with such broad knowledge does not exist, but in fact they do. That is why data scientists are so appreciated lately.

Where does a Data Scientist sit among all that Big Data

In the past years we have all been witnesses to the growing demand for machine learning, predictive analytics, and data analysis. Why is that so?
Well, it is quite simple. Like E.O Wilson said,
“We are drowning in information, while starving for wisdom. The world henceforth will be run by synthesizers, people able to put together the right information at the right time, think critically about it, and make important choices wisely.”
 Every possible device I can think of is generating some data feed. But what to do with this data is always the big question.
Business people will surely have some sparks of ideas, marketing people will try to sell the perfect one-size-fits-all models for optimizing your marketing campaigns and bringing in all the customers you need, business controllers will try to predict what sales are going to look like in the next quarter, and so on, all by using the raw data generated by everything around us.
The big question is how to do all this smart analysis.
That is where all the fancy names and companies with shiny solutions come into the game; of course, that comes with a price, usually a big one.
But on the other side of the game, this is where all the smart people come in. Lately, these people are called Data Scientists, Data Analysts, BI professionals, and all possible variations of those names.

What does Data Scientist / Data Analyst really do?

Working as part of an already established Data Science team

Most of them do tasks like pulling data out of MySQL, MSSQL, Sybase, and all the other databases on the market, becoming a master at Excel pivot tables, and producing basic data visualizations (e.g., line and bar charts). You may on occasion analyze the results of an A/B test or take the lead on your company's Google Analytics account.
Here you will also have to build predictive models that forecast your clients' mood or possible new market openings, do product forecasting, or perform time series analysis.

Basic Statistics

This is where the basic statistics come really handy.
You should be familiar with statistical tests, distributions, maximum likelihood estimators, etc.
This will also be the case for machine learning, but one of the more important aspects of your statistics knowledge will be understanding when different techniques are (or aren’t) a valid approach. Statistics are important in all company types, but especially data-driven companies where the product is not data-focused and product stakeholders will depend on your help to make decisions and design / evaluate experiments.
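For instance, here is a small Python sketch of the kind of basic statistical test that comes up when evaluating an experiment; the A/B data is synthetic and the test's validity still depends on its assumptions holding.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=10.0, scale=2.0, size=200)   # control
group_b = rng.normal(loc=10.4, scale=2.0, size=200)   # variant

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the difference is unlikely to be chance alone,
# but only if the test's assumptions (independence, similar variance) hold.
```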

Machine Learning

If you're dealing with huge amounts of data, or working at a company where the product itself is especially data-driven, it may be the case that you'll want to be familiar with machine learning methods. At this point, classical statistical methods might not always work, and you might be facing a situation where you need to work with all the data instead of a sample of the whole data set, as you would if you followed a conventional statistical approach to analyzing your data.
This can mean things like ARIMA models, Neural Networks, SVM, VARs, k-nearest neighbors, decision trees, random forests, ensemble methods – all of the machine learning fancy words. It’s true that a lot of these techniques can be implemented using R or Python libraries – because of this, it’s not necessarily a deal breaker if you’re not the world’s leading expert on how the algorithms work. More important is to understand the broad strokes and really understand when it is appropriate to use different techniques. There is a lot of literature that can help you in getting up to speed with R and Python in real case scenarios.
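As one concrete example of those techniques in Python, here is a hedged sketch that fits an ARIMA model with statsmodels and produces a short forecast; the series is synthetic and order=(1, 1, 1) is just a placeholder choice.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic daily series with a gentle upward drift.
rng = np.random.default_rng(0)
index = pd.date_range("2022-01-01", periods=100, freq="D")
series = pd.Series(50 + np.cumsum(rng.normal(0.5, 1.0, size=100)), index=index)

fit = ARIMA(series, order=(1, 1, 1)).fit()
print(fit.forecast(steps=7))   # forecast the next seven days
```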

Establishing a Data Science team

Nowadays, a number of companies are getting to the point where they have an increasingly large amount of data, and they're looking for someone to set up a lot of the data infrastructure that the company will need moving forward. They're also looking for someone to provide analysis. You'll see job postings listed under both "Data Scientist" and "Data Engineer" for this type of position. Since you'd be (one of) the first data hires, there is likely a lot of low-hanging fruit, making it less important that you're a statistics or machine learning expert.
A data scientist with a software engineering background might excel at a company like this, where it’s more important that a data scientist make meaningful data-like contributions to the production code and provide basic insights and analyses.
At this point, be ready to bootstrap servers, install new virtual machines, set up networks, do plain DBA work, install Hadoop, and set up Oozie, Flume, Hive, and so on. Many times I have been asked to set up SharePoint or web servers so that I could set up PerformancePoint as part of the reporting solution.
When establishing a data team or BI team in a company, you should be ready for literally every other IT task you can imagine (especially if you work in a startup), so a broad range of skills is really welcome here.
Expect to spend at least the first year working mainly on infrastructure and legacy items instead of crunching data and producing shiny analyses and reports.

Keeping up to date

There is certainly a lot of potential ahead in this profession, and with that, the expectations placed on you in your company are growing exponentially.
As the industry becomes more inclined toward data analysis, you, working as a data analyst or data scientist, will be challenged to read all the time, whether it is business news or literature that will help you build or improve your models, reports, or work in general. There is a lot of research in this field and a lot of new books coming out with titles mentioning data.
Data scientists today are akin to Wall Street “quants” of the 1980s and 1990s. In those days people with backgrounds in physics and math streamed to investment banks and hedge funds, where they could devise entirely new algorithms and data strategies. Then a variety of universities developed master’s programs in financial engineering, which churned out a second generation of talent that was more accessible to mainstream firms. The pattern was repeated later in the 1990s with search engineers, whose rarefied skills soon came to be taught in computer science programs.
One question raised by this is whether some firms would be wise to wait until that second generation of data scientists emerges, and the candidates are more numerous, less expensive, and easier to vet and assimilate in a business setting. Why not leave the trouble of hunting down and domesticating exotic talent to the big data start-ups and to firms whose aggressive strategies require them to be at the forefront?
The problem with that reasoning is that the advance of big data shows no signs of slowing. If companies sit out this trend’s early days for lack of talent, they risk falling behind as competitors and channel partners gain nearly unassailable advantages. Think of big data as an epic wave gathering now, starting to crest. If you want to catch it, you need people who can surf.