Reasons Why Your Data Science Project is Likely to Fail

Businesses are charging ahead with digital transformation at an unmatched rate. A recent survey by Gartner Research found that 49 percent of CIOs report that their company has already changed its business model to scale its digital initiatives, or is in the process of doing so.

As companies push ahead with these changes, they are embedding data science and machine learning into various business functions. This is not a simple job. A typical enterprise data science project is highly complex and requires assembling an interdisciplinary team of data engineers, developers, data scientists, subject matter experts, and people with other specialized skills and knowledge.

Additionally, this talent is scarce and costly. In reality, only a small number of companies have succeeded in building a skilled data science practice. And while building such a team takes time and resources, many of these companies face an even more significant problem: more than 85 percent of big data projects fail.

A variety of factors contribute to these failures, including human factors and challenges with time, skills, and impact.

Lack of Resources to Execute Data Science Projects

Data science is an interdisciplinary practice that involves mathematicians, statisticians, data engineers, software engineers, and, importantly, subject matter experts. Depending on the size and scope of the project, a company may need to deploy several data engineers, a solution architect, a domain expert, one or more data scientists, business analysts, and possibly additional resources. Many companies do not have, or cannot afford, enough of these people, both because hiring such skills is becoming increasingly challenging and because the business often has many data science projects to execute, all of which take months to complete.

Heavy Dependence on the Skills and Experience of Particular People

Traditional data science relies heavily on the skills, experience, and intuition of experienced people. In particular, the data and feature engineering process today is mostly based on the manual effort and instincts of domain experts and data scientists. Although such talented individuals are valuable, practices that rely on a few key people are not sustainable for an enterprise, given how challenging it is to hire that talent. Companies therefore need to seek solutions that help democratize data science, allowing more people with different skill levels to execute projects effectively.

Misalignment of Technical and Business Expectations

Most data science projects are carried out to deliver crucial insights to the business team. However, a project often starts without precise alignment between the business and data science teams on expectations and goals. The result is that the data science team focuses primarily on model accuracy, while the business team is more interested in metrics such as financial benefit, business insight, or model interpretability. In the end, the business team does not accept the results of the data science team.

Long Turnaround Times and Heavy Upfront Effort Without Visibility into the Potential Value

One of the biggest obstacles in data science projects is the large upfront effort required, despite a lack of visibility into the eventual outcome and its business value. The traditional data science process takes months to complete before the result can even be evaluated. In particular, the data and feature engineering needed to transform business data into a machine-learning-ready format takes a huge amount of iterative effort. The long turnaround time and significant upfront investment associated with this approach often lead to project failure after months of work. As a result, business executives are reluctant to commit more resources.

Lack of Architectural Consideration for Production and Operationalization

Many data science projects begin without any consideration of how the resulting pipelines will be deployed in production. This happens because the production pipeline is often managed by the IT team, which has little insight into the data science process, while the data science team is focused on validating its hypotheses and lacks an architectural view of production and solution integration. As a result, instead of being integrated into the production pipeline, many data science projects end up as one-time proof-of-concept exercises that fail to deliver real business impact, or that incur substantial extra cost to productionize.

End-to-end Data Science Automation is a Solution

The pressure to achieve higher ROI from artificial intelligence (AI) and machine learning (ML) initiatives has pushed more business leaders to seek innovative solutions for their data science pipeline, such as machine learning automation. Choosing a solution that delivers end-to-end automation of the data science process, including automated data and feature engineering, is key to success for a data-driven business. Data science automation makes it possible to run data science processes faster, often in days instead of months, with more transparency, and to deliver minimum viable pipelines that can be improved continuously. As a result, companies can quickly scale their AI/ML initiatives to drive transformative business change.
However, data science and machine learning automation can bring new kinds of problems of their own, which is why I have written before that guided analytics are the future of Data Science and AI.

Practical Predictive Analytics in everyday enterprises

Predictive analytics is now part of the analytics fabric of organizations. Yet even as companies continue to adopt predictive analytics, many are struggling to make it stick. Many organizations have not thought through how to practically put predictive analytics to work, given the organizational, technology, process, and deployment issues they face.

These are some of the biggest challenges organizations face today:

Skills development. Organizations are concerned about the skills needed for predictive modeling: understanding how to train a model, interpret its output, and determine which algorithm to use in which situation. Skills are the most significant barrier to adoption of predictive analytics; more often than not, this is the top challenge.

Model deployment. Companies are using predictive analytics and machine learning across a range of use cases, and those still exploring the technology are planning for an equally diverse set. Yet many are not considering what it takes to build a valid predictive model and put it into production. Only a small number of data science teams have a DevOps group, or another team, that puts machine learning models into production, maintains versioning, and monitors the models. From experience, in this kind of team structure it can take months to put models into production.

Infrastructure. On the infrastructure side, the vast majority of companies use the data warehouse, along with a variety of other technologies such as Hadoop, data lakes, or the cloud, for developing predictive models. The good news is that businesses appear to be looking to broaden their data platforms to support predictive analytics and machine learning. The move to a modern data architecture that supports diverse types of data makes sense and is necessary to succeed in predictive analytics.

New Practices for Predictive Analytics and Machine Learning

Since predictive analytics and machine learning skills are in such high demand, vendors are offering tooling to make predictive modeling easier, particularly for new users. Essential to ease of use are these features:

  • Collaboration features. Anyone from a business analyst to a data scientist building a model often wants to collaborate with others. A business analyst may want to get input from a data scientist to validate a model or help build a more sophisticated one. Vendors provide collaboration features in their software that enable users to share or comment on models. Collaboration among analysts is an important best practice to help democratize predictive analytics.
  • Workflows and versioning. Many products provide workflows that can be saved and reused, including data pipeline workflows for preparing the data as well as analytics workflows. If a data scientist or another model builder develops a model, others can reuse it. This often includes a point-and-click interface for model versioning, crucial for keeping track of the latest models and the model history, and for analytics governance.
  • GUIs. Many users do not like to program or even write scripts; this spurred the move toward GUIs (graphical user interfaces) in analytics products decades ago. Today’s GUIs typically offer drag-and-drop and point-and-click interfaces that make it easy to build analytics workflows. Nodes can be selected, configured, dragged onto a canvas, and linked to form a predictive analytics workflow. Some vendor GUIs let users plug in open source code as a node in the workflow, which supports models built in R or Python, for example.
  • Persona-driven features. Different users want different interfaces. A data scientist may want a notebook-based interface, such as Jupyter notebooks (live web-based coding and collaboration environments), or simply a programming interface. A business analyst may prefer a GUI, or a natural-language interface that lets them ask questions quickly and discover insights (even predictive ones) in the data. Newer analytics platforms offer tailored environments that meet the needs of various personas while maintaining reliable data integrity beneath the platform. This makes building models more efficient.

Next to read is:

What should you consider when choosing the right machine learning and AI platforms?

Important things to consider before building your machine learning and AI project

Current State of the market

In order to go in-depth on what exactly data science and machine learning (ML) tools or platforms are, why companies small and large are moving toward them, and why they matter in the Enterprise AI journey, it’s essential to take a step back and understand where we are in the larger story of AI, ML, and data science in the context of businesses:

 1. Enterprise AI is at peak hype.

Of course, the media has been talking about consumer AI for years. However, since 2018, the spotlight has turned to the enterprise. The number and type of devices sending data are skyrocketing while the cost of storing data continues to decrease, which means most businesses are collecting more data in more types and formats than ever before. Moreover, to compete and stay relevant among digital startups and other competition, these companies need to be able to use this data not only to drive business decisions but drive the business itself. Now, everyone is talking about how to make it a reality.

2. AI has yet to change businesses.

Despite the hype, the reality is that most businesses are struggling to leverage data at all, much less build machine learning models or take it one step further into AI systems. For some, it’s because they find building just one model is far more expensive and time-consuming than they planned for. However, the great majority struggle with fundamental challenges, like even organizing controlled access to data or efficient data cleaning and wrangling.

3. Successful enterprises have democratized.

Those companies that have managed to make progress toward Enterprise AI have realized that it’s not one ML model that will make the difference; it’s hundreds or thousands. That means scaling up data efforts in a big way, which will require everyone at the company to be involved. Enter democratization. In August 2018, Gartner identified Democratized AI as one of the five emerging trends in their Hype Cycle for Emerging Technologies. Since then, we have seen the word “democratization” creep into the lexicon of AI-hopefuls everywhere, from the media to the board room. And, to be sure, it’s an essential piece of the puzzle when it comes to understanding data science and machine learning (ML) platforms.

Is hiring data scientists enough to fulfil your AI and machine learning goals?

Hiring for data roles is at an all-time high. As of 2019, according to career listing data, data scientist is the hottest job out there. Moreover, though statistics on Chief Data Officers (CDOs) vary, some put the growth of the role as high as 100-fold over the past 10 years.

Hiring data experts is a crucial element of a robust Enterprise AI strategy; however, hiring alone does not guarantee the expected outcomes, and it isn't a reason not to invest in data science and ML platforms. For one thing, data scientists are costly, often prohibitively so, and they are only getting more expensive as demand for them grows.

The truth is that when the goal is to go from producing one ML model a year to tens, hundreds, or even thousands, a data team isn't enough, because it still leaves a big swath of employees doing day-to-day work without the ability to take advantage of data. Without democratization, the output of a data team, even the very best one made up of leading data scientists, will be limited.

As a response to this, some companies have decided to leverage their data team as sort of an internal contractor, working for lines of business or internal groups to complete projects as needed. Even with this model, the data team will need tools that allow them to scale up, working faster, reusing parts of projects where they can, and (of course) ensuring that all work is properly documented and traceable. A central data team that is contracted out can be a good short-term solution, but it tends to be a first step or stage; the longer-term model of reference is to train masses of non-data people to be data people.

Choosing the right tools for Machine Learning and AI

Open source: critical, but not always giving what you need

Using open source keeps you on the bleeding edge of technological development, and it makes it easier to onboard and hire a team. Not only are data scientists interested in growing their skills with the technologies that will be most used in the future, but there is also less of a learning curve if they can continue working with tools they know and love instead of being forced to learn an entirely different system. It's important to remember that keeping up with that rapid pace of change is difficult for large corporations.
The latest innovations are usually highly technical, so without some packaging or abstraction layers that make them more accessible, it is challenging to keep everybody in the organization on board and working together.
A business might technically adopt an open source tool, but only a small number of people will be able to work with it. Not to mention that governance can be a considerable challenge if everyone is working with open source tools on their local machines without a way to make work centrally accessible and auditable.
Data science and ML platforms have the advantage of being usable right out of the box, so teams can start analyzing data from day one. With open source tools (mostly R and Python), you often need to assemble a lot of the parts by hand, and as anyone who has ever done a DIY project can attest, that is often much easier in theory than in practice. Choosing a data science and ML platform wisely (meaning one that is flexible and allows for the incorporation and continued use of open source) can give the enterprise the best of both worlds: cutting-edge open source technology and accessible, governable control over data projects.

What should Machine Learning and AI platforms provide?

Data science and ML platforms allow for the scalability, flexibility,
and control required to thrive in the era of Machine Learning and AI because they provide a framework for:

  • Data governance: Clear workflows and a way for team leaders to monitor those workflows and data projects.
  • Efficiency: Finding small ways to save time throughout the data-to-insights process gets the business to value much faster.
  • Automation: A specific kind of efficiency is the growing field of AutoML, which is broadening into automation across the data pipeline to ease inefficiencies and free up people's time.
  • Operationalization: Effective ways to deploy data projects into production quickly and safely.
  • Collaboration: A way for additional staff working with data, many of whom will be non-coders, to contribute to data projects alongside data scientists (or IT and data engineers).
  • Self-Service Analytics: A system by which non-data experts from various lines of business can access and work with data in a governed environment.

Some things to consider before choosing an AI and Machine Learning platform

Governance is becoming more challenging

With the quantity of data being collected today, data safety and security (particularly in sectors like finance) are crucial. Without a central place to access and collaborate on data, with proper user controls, data may end up stored across different individuals' laptops. And if an employee or contractor leaves the company, the risks increase, not just because they could still have access to sensitive data, but because they may take their work with them and leave the team to start from scratch, unsure of what that person was working on. On top of these concerns, today's enterprise is afflicted by shadow IT: for years, different departments have invested in all kinds of technologies and are accessing and using data in their own ways, so much so that even IT teams today lack a central view of who is using what, and how. It is a problem that becomes dangerously amplified as AI efforts scale, and it points to the need for governance at a larger and much more fundamental scale across all parts of the business.

AI Needs to Be Responsible

We learn from a young age that subjects like science and mathematics are entirely objective, which implies that, naturally, people assume data science is as well: that it is black and white, an exact discipline with just one way to reach a "correct" solution, independent of who builds it. We have known for a long time that this is not the case, and that it is possible to use data science techniques (and hence produce AI systems) that do things, well, wrong. As recently as last year, we witnessed the problems that giants like Google, Tesla, and Facebook face with their AI systems. These problems can set off a domino effect very quickly, whether it is private information leaking, photos being mislabeled, or a vision system failing to recognize a pedestrian crossing the road and hitting them.
This is where AI needs to be responsible. For that, you need to be able to discover at an early stage where your AI might fail, before deploying it in the real world.
The fact that even these companies have not fixed all of these problems shows how challenging it is to get AI right.

Reproducibility and Scaling of Machine Learning Projects

Nothing is more inefficient than needlessly repeating the same processes over and over. This applies both to repeating procedures within a project (like data preparation) and to repeating the same process across projects or, even worse, unintentionally duplicating entire projects when the team gets large and lacks insight into each other's roles. No business is immune to this risk; in fact, the problem can become exponentially worse in large enterprises with bigger teams and more separation between them. To scale efficiently, data teams need a tool that helps reduce duplicated work and makes sure that work hasn't already been done by another member of the team.

Use Data Analysts to Augment Data Scientists' Work

Today, data scientist is one of the most in-demand positions. This means that data scientists can be both (1) difficult to find and attract and (2) expensive to hire and retain. This combination means that to scale data initiatives toward Enterprise AI, the data team will inevitably need to be supplemented with business or data analysts. For the two types of team members to collaborate properly, they need a central environment to work from. Analysts also tend to work differently from data scientists: they are experienced with spreadsheets and possibly SQL, but generally not with coding. Having a tool that lets each profile use the tools with which he or she is most comfortable provides the efficiency to scale data efforts to any size.

Ad-Hoc Methodology is Unsustainable for Large Teams

Small teams can sustain themselves up to a point by tackling data, ML, or larger AI projects in an ad-hoc fashion, meaning team members save their work locally rather than centrally, have no reproducible processes or workflows, and figure things out along the way.
However, with more than just a couple of employees and more than one project, this becomes unruly quickly. Any business with any hope of doing Enterprise AI needs a central place where everybody involved with data can do all of their work, from accessing data to deploying a model into a production environment. Allowing workers, whether directly on the data team or not, to work ad hoc without a central tool is like a construction crew trying to build a high-rise without a shared set of blueprints.

Machine Learning models Need to be Monitored and Managed

The most significant difference between developing traditional software and developing machine learning models is maintenance. For the most part, software is written once and does not need to be continually maintained; it will typically continue to work over time. Machine learning models are developed, put in production, and then must be monitored and fine-tuned until performance is optimal. Even when performance is optimal, it can still drift gradually as the data (and the people producing it) changes. This is quite a different way of working, especially for companies that are used to shipping software to production.
Moreover, it is easy to see how issues with maintainability might eventually trigger, or intensify, problems with ML model bias. In reality, the two are deeply linked, and ignoring either can be devastating to a business's data science efforts, particularly when magnified by the scaling up of those efforts. All of these factors point to having a platform that can help with model monitoring and management.

The Need to Create Models that Work in Production

Investing in predictive analytics and data science means ensuring that data teams are productive and see projects through to completion (i.e., production), otherwise known as operationalization. Without an API-based tool that allows for single-step deployment, data teams will likely need to hand off models to an IT team, which then has to re-code them. This step can take a lot of time and resources and be a substantial barrier to executing data projects that genuinely affect the business in meaningful ways. With a tool that makes deployment seamless, data teams can easily deploy, monitor, fine-tune, and keep making improvements that positively impact the bottom line.

Having said all that, choosing the right platform is not always straightforward. You need to carefully weigh what you really need now against what you will need in the future.
You need to do so taking into account your budget, your employees' skills, and their willingness to learn new methodologies and technologies.

Please bear in mind that developing a challenging AI project takes time, sometimes a couple of years. That means your team can start building your prototype on an easy-to-use open source machine learning platform. Once you have proven your hypothesis, you can migrate to a more complex and more expensive platform.

Good luck with your new machine learning and AI project!

How not to learn a programming language like Python or R for machine learning

Here is what you should NOT do when you start studying machine learning in Python.

  1. Get really good at Python programming and Python syntax.
  2. Deeply study the underlying theory and parameters for machine learning algorithms
  3. Avoid or lightly touch on all of the other tasks needed to complete a real project.

I think that this approach can work for some people, but it is a really slow and roundabout way of reaching your goal. It teaches you that you need to spend all your time learning how to use individual machine learning algorithms. It also does not teach you the process of building predictive machine learning models in Python that you can actually use to make predictions.

Sadly, this is the approach used to teach machine learning that I see in almost all books and online courses on the topic.

Lessons: Learn how the sub-tasks of a machine learning project map onto Python and the best practice way of working through each task.

Projects: Tie together all of the knowledge from the lessons by working through case-study predictive modeling problems.

Recipes: Apply machine learning with a catalog of standalone recipes in Python that you can copy-and-paste as a starting point for new projects.

Lessons

You need to know how to complete the specific subtasks of a machine learning project using the Python ecosystem. Once you know how to complete a discrete task using the platform and get a result reliably, you can do it again and again on project after project. Let's start with an overview of the common tasks in a machine learning project. A predictive modeling machine learning project can be broken down into 6 top-level tasks (a minimal Python sketch of the whole workflow follows the list):

  1. Define Problem: Investigate and characterize the problem in order to better understand the goals of the project.
  2. Analyze Data: Use descriptive statistics and visualization to better understand the data you have available.
  3. Prepare Data: Use data transforms in order to better expose the structure of the prediction problem to modeling algorithms.
  4. Evaluate Algorithms: Design a test harness to evaluate a number of standard algorithms on the data and select the top few to investigate further.
  5. Improve Results: Use algorithm tuning and ensemble methods to get the most out of well-performing algorithms on your data.
  6. Present Results: Finalize the model, make predictions and present results.
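As a rough illustration, here is a minimal sketch of how these six tasks might map onto the Python ecosystem, assuming scikit-learn, a generic CSV file and a 'target' column; the file name, column name and model choices are placeholders rather than a prescription:

# Hypothetical end-to-end skeleton: file name, target column and models are placeholders.
from pandas import read_csv
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# 1. Define the problem: load the raw data.
data = read_csv('my_dataset.csv')  # placeholder path

# 2. Analyze the data: descriptive statistics.
print(data.describe())

# 3. Prepare the data: separate features and target, hold out a test set.
X = data.drop(columns=['target'])  # placeholder target column
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Evaluate algorithms: compare a few standard models with cross-validation.
candidates = {
    'logistic': Pipeline([('scale', StandardScaler()), ('clf', LogisticRegression(max_iter=1000))]),
    'forest': RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(name, scores.mean())

# 5. Improve results: simple tuning of the better candidate (sketch only).
best = RandomForestClassifier(n_estimators=300, max_depth=10, random_state=42)

# 6. Present results: finalize the model and report hold-out performance.
best.fit(X_train, y_train)
print('Test accuracy:', best.score(X_test, y_test))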

Build Machine learning models using Time series data

Time series forecasting is an important area of machine learning that is often neglected. It is important because there are so many prediction problems that involve a time component. These problems are neglected because it is this time component that makes time series problems more difficult to handle.

Time series vs. normal machine learning dataset

A normal machine learning dataset is a collection of observations. For example:

observation #1
observation #2
observation #3

Predictions are made for new data when the actual outcome may not be known until some future date. The future is being predicted, but all prior observations are treated equally, perhaps with some minor temporal handling to address concept drift, such as using only the last year of observations rather than all of the data available.

A time series dataset is different. Time series adds an explicit order dependence between
observations: a time dimension. This additional dimension is both a constraint and a structure that provides a source of additional information.
A time series is a sequence of observations taken sequentially in time.

Time #1, observation
Time #2, observation
Time #3, observation

Time Series Nomenclature

It is essential to quickly establish the standard terms used when describing time series data. The current time is defined as t, and the observation at the current time is defined as obs(t).

We are often interested in the observations made at prior times, called lag times or lags.

Times in the past are negative relative to the current time. For example, the previous time is t-1 and the time before that is t-2. The observations at these times are obs(t-1) and obs(t-2) respectively.

To summarize:

  • t-n: A prior or lag time (e.g. t-1 for the previous time).
  • t: A current time and point of reference.
  • t+n: A future or forecast time (e.g. t+1 for the next time).

Time Series Analysis vs. Time Series Forecasting

We have different goals depending on whether we are interested in understanding a dataset or making predictions. Understanding a dataset, called time series analysis, can help to make better predictions, but is not required and can result in a large technical investment in time and expertise not directly aligned with the desired outcome, which is forecasting the future.

Time Series Analysis

When using classical statistics, the primary concern is the analysis of time series. Time series analysis involves developing models that best capture or describe an observed time series in order to understand the underlying causes. This field of study seeks the why behind a time series dataset. It often involves making assumptions about the form of the data and decomposing the time series into its constituent components. The quality of a descriptive model is determined by how well it describes all available data and the interpretation it provides to better inform the problem domain.

The primary objective of time series analysis is to develop mathematical models that provide plausible descriptions from sample data.

Time Series Forecasting

Making predictions about the future is called extrapolation in the classical statistical handling of time series data. More modern fields focus on the topic and refer to it as time series forecasting.

Forecasting involves taking models fit on historical data and using them to predict future observations. Descriptive models can borrow from the future (i.e., to smooth or remove noise) because they only seek to best describe the data. An important distinction in forecasting is that the future is completely unavailable and must only be estimated from what has already happened.

The skill of a time series forecasting model is determined by its performance at predicting the future. This often comes at the expense of being able to explain why a specific prediction was made, of providing confidence intervals, and of better understanding the underlying causes behind the problem.

Components of Time Series

Time series analysis provides a body of techniques to better understand a dataset. Perhaps the most useful of these is the decomposition of a time series into 4 constituent parts:

  • Level. The baseline value for the series if it were a straight line.
  • Trend. The optional and often linear increasing or decreasing behavior of the series over time.
  • Seasonality. The optional repeating patterns or cycles of behavior over time.
  • Noise. The optional variability in the observations that cannot be explained by the model.

All time series have a level, most have noise, and the trend and seasonality are optional.

Time series components
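One way to make these components concrete is an additive decomposition with statsmodels. This is only a sketch: it assumes a reasonably recent statsmodels version and reuses the monthly airline-passengers CSV that appears later in this post.

# Sketch: split a series into trend, seasonal and residual (noise) components.
from pandas import read_csv
from matplotlib import pyplot
from statsmodels.tsa.seasonal import seasonal_decompose

series = read_csv('../Datasets/airline-passengers.csv', header=0, index_col=0,
                  parse_dates=True, squeeze=True)

# period=12 because the data is monthly; a multiplicative model often fits this
# particular series better, but additive is enough to show the idea.
result = seasonal_decompose(series, model='additive', period=12)
result.plot()
pyplot.show()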

Concerns of forecasting time series

When forecasting, it is important to understand your goal. Use the Socratic method and ask lots of questions to help zoom in on the specifics of your predictive modeling problem. For example:

  1. How much data do you have available and are you able to gather it all together? As with all machine learning models, more data is often more helpful, offering greater opportunity for exploratory data analysis, model testing and tuning, and model fidelity.
  2. What is the time horizon of predictions that is required? Short, medium or long term? Shorter time horizons are often easier to predict with higher confidence.
  3. Can forecasts be updated frequently over time or must they be made once and remain static? Updating forecasts as new information becomes available often results in more accurate predictions.
  4. At what temporal frequency are forecasts required? Often forecasts can be made at a lower or higher frequency, allowing you to harness down-sampling and up-sampling of data, which in turn can offer benefits while modeling.

Time series data often requires cleaning, scaling, and even transformation. For example (a short pandas sketch follows this list):

  • Frequency. Perhaps data is provided at a frequency that is too high to model or is unevenly spaced through time requiring resampling for use in some models.
  • Outliers. Perhaps there are corrupt or extreme outlier values that need to be identified and handled.
  • Missing. Perhaps there are gaps or missing data that need to be interpolated or imputed.
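As a small illustration of these cleaning steps with pandas, here is a sketch using the daily minimum temperatures CSV introduced later in this post; the resampling rule and the percentile thresholds are arbitrary choices:

# Sketch: resampling, outlier clipping and interpolation with pandas (illustrative only).
from pandas import read_csv

series = read_csv('../Datasets/daily-min-temperatures.csv', header=0, index_col=0,
                  parse_dates=True, squeeze=True)

# Frequency: down-sample the daily observations to monthly means.
monthly = series.resample('M').mean()

# Outliers: clip extreme values to the 1st and 99th percentiles.
low, high = series.quantile(0.01), series.quantile(0.99)
clipped = series.clip(lower=low, upper=high)

# Missing: fill any gaps by linear interpolation over time.
filled = series.interpolate(method='linear')

print(monthly.head())
print(clipped.describe())
print(filled.isna().sum())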

Often time series problems are real-time, continually providing new opportunities for prediction. This adds an honesty to time series forecasting that quickly flushes out bad assumptions, errors in modeling, and all the other ways that we may be able to fool ourselves.

Examples of Time Series Forecasting

  • Forecasting the yield of a commodity, such as corn or wheat, in tons by state each year.
  • Forecasting whether an EEG trace in seconds indicates a patient is having a seizure or not.
  • Forecasting the closing price of a stock each day.
  • Forecasting the birth rate at all hospitals in a city each year.
  • Forecasting product sales in units sold each day for a store.
  • Forecasting the number of passengers through a train station each day.
  • Forecasting unemployment for a state each quarter.
  • Forecasting utilization demand on a server each hour.
  • Forecasting the size of the rabbit population in a state each breeding season.
  • Forecasting the average price of gasoline in a city each day.

Time Series as Supervised Learning

Time series forecasting can be framed as a supervised learning problem. This re-framing of your time series data allows you access to the suite of standard linear and nonlinear machine learning algorithms on your problem.

Sliding Windows

Sliding windows technique in time series machine learning

Time series data can be phrased as supervised learning. Given a sequence of numbers for a time series dataset, we can restructure the data to look like a supervised learning problem. We can do this by using previous time steps as input variables and using the next time step as the output variable.

time, measure
1, 10
2, 20
3, 30
4, 40
5, 50

We can restructure this time series dataset as a supervised learning problem by using the value at the previous time step to predict the value at the next time-step.

X, y
?, 10
10, 20
20, 30
30, 40
40, 50
50, ?

 

Univariate Time Series vs. Multivariate Time Series

Univariate Time Series: These are datasets where only a single variable is observed at each time, such as temperature each hour. The example in the previous section is a univariate time series dataset.

Multivariate Time Series: These are datasets where two or more variables are observed at each time.

Most time series analysis methods, and even books on the topic, focus on univariate data. This is because it is the simplest to understand and work with. Multivariate data is often more difficult to work with.

 

Time series in practice

Let’s take the following data sample:

Minimum Daily Temperatures dataset. This dataset describes the minimum daily temperatures over 10 years (1981-1990) in the city of Melbourne, Australia.

# create date time features of a dataset
from pandas import read_csv
from pandas import DataFrame

series = read_csv('../Datasets/daily-min-temperatures.csv', header=0, index_col=0,parse_dates=True, squeeze=True)

dataframe = DataFrame()

Please note the index_col=0 in the read_csv.
This parameter turns our first column, Date, into an index. We will use this for further feature engineering.

Date
1981-01-01 20.7
1981-01-02 17.9
1981-01-03 18.8
1981-01-04 14.6

Feature engineering

We need to do some simple feature engineering to extract the day and month from the index itself:

#Get the month out of the Index
series.index[1].month

#Get the day out of the index
series.index[1].day 

#Applied on the whole dataset
dataframe['month'] = [series.index[i].month for i in range(len(series))]
dataframe['day'] = [series.index[i].day for i in range(len(series))]
dataframe['temperature'] = [series[i] for i in range(len(series))]
print(dataframe.head(10))

Once we have the days and months, we can see that we still don't have many features that describe our data in the best possible way. The month and day information alone will not give us much to predict temperature with, and will most likely result in a poor model.

That is why we need to think about extracting more information from the features we have available at the moment, in this example "Date". You may enumerate all the properties of a time-stamp and consider what might be useful for your problem (a short pandas sketch follows this list), such as:

  • Minutes elapsed for the day.
  • Hour of day.
  • Business hours or not.
  • Weekend or not.
  • Season of the year.
  • The business quarter of the year.
  • Daylight savings or not.
  • Public holiday or not.
  • Leap year or not.
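Continuing from the code above (so series and dataframe are reused), here is a sketch of how a few of these time-stamp properties might be derived with pandas; the weekend rule and the season mapping are simplifying assumptions:

# Sketch: extra calendar features from the DatetimeIndex (rules are illustrative).
dataframe['dayofweek'] = [series.index[i].dayofweek for i in range(len(series))]
dataframe['weekend'] = (dataframe['dayofweek'] >= 5).astype(int)  # Saturday=5, Sunday=6
dataframe['quarter'] = [series.index[i].quarter for i in range(len(series))]
dataframe['season'] = dataframe['month'] % 12 // 3 + 1  # simple month-based season index
dataframe['leap_year'] = [int(series.index[i].is_leap_year) for i in range(len(series))]
print(dataframe.head())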

How to transform a time series problem into a supervised learning problem

Lag features are the classical way that time series forecasting problems are transformed into supervised learning problems. The simplest approach is to predict the value at the next time (t+1) given the value at the current time (t). The supervised learning problem with shifted values looks as follows:

Value(t), Value(t+1)

The pandas library provides the shift() function to help create these shifted or lag features from a time series dataset. Shifting the dataset by 1 creates the t column, adding a NaN value in the first row. The time series dataset without a shift represents t+1.

from pandas import concat

temps = DataFrame(series.values)
dataframe = concat([temps.shift(1), temps], axis=1)
dataframe.columns = ['t', 't+1']

Printing the DataFrame now shows the shifted column (t) alongside the original column (t+1):

t t+1
0 NaN 20.7
1 20.7 17.9
2 17.9 18.8
3 18.8 14.6
4 14.6 15.8

The first row contains NaN because of the shift, which is why we have to discard it. Note that if you shift the data by up to N steps, for example to create N lag columns, you will create N rows containing NaN values that have to be discarded after performing the sliding window operation. For example:

dataframe = concat([temps.shift(3), temps.shift(2), temps.shift(1), temps], axis=1)
dataframe.columns = ['t-2', 't-1', 't', 't+1']
print(dataframe.head(5))
 t-2 t-1 t t+1
0 NaN NaN NaN 20.7
1 NaN NaN 20.7 17.9
2 NaN 20.7 17.9 18.8
3 20.7 17.9 18.8 14.6
4 17.9 18.8 14.6 15.8

Looking at this example, we can conclude that we can't expect usable data until the 4th row (index 3).

Removing noise and improving the signal in time series

Let's take, for example, one widely used dataset: the Airline Passengers dataset.

from pandas import read_csv
from matplotlib import pyplot
series = read_csv('../Datasets/airline-passengers.csv', header=0, index_col=0, parse_dates=True,squeeze=True)

#lets do some plotting:
pyplot.figure(1)

# line plot
pyplot.subplot(211)
pyplot.plot(series)

# histogram
pyplot.subplot(212)
pyplot.hist(series)
pyplot.show()

 

airline time series density plot

From the graph above, we can conclude that the data set is not stationary.
What does that mean? It means that the variance and the mean of the observations are changing over time.

This can happen in business problems where there is an increasing trend and/or seasonality.

What does it mean for us? Non-stationary data causes more problems when solving time series problems. It makes it difficult to fit a proper statistical model that gives any kind of reliable forecast. That is why we need to perform certain transformations on the data.

Square root transformation

The square root, x to x^(1/2) = sqrt(x), is a transformation with a moderate effect on distribution shape: it is weaker than the logarithm and the cube root. It is also used for reducing right skewness, and also has the advantage that it can be applied to zero values. Note that the square root of an area has the units of a length. It is commonly applied to counted data, especially if the values are mostly rather small.

A time series that has a quadratic growth trend, like the example above, can be made linear by taking the square root.

We need to apply a square root transformation to our Airline dataset to turn the growth trend from quadratic to linear and to make the distribution of observations closer to Gaussian.

First, let's import two more libraries:

from pandas import DataFrame
from numpy import sqrt

Perform the square root transformation:

#Visualize after the transformations
#Let's create a function to make our transformation code more elegant and short
def visualize(column):
    pyplot.figure(1)
    # line plot
    pyplot.subplot(211)
    pyplot.plot(dataframe[column])
    # histogram
    pyplot.subplot(212)
    pyplot.hist(dataframe[column])
    pyplot.show()

dataframe = DataFrame(series.values)

# It's very important to give a name to our column,
# so later we can call the column by its name
dataframe.columns = ['passengers']

#Perform Transformations
dataframe['passengers'] = sqrt(dataframe['passengers'])
visualize('passengers')

Square root transformation on a time series

Looking at the plot above, we can see that the trend was reduced, but not removed. The line plot still shows an increasing variance from cycle to cycle. The histogram still shows a long tail to the right of the distribution, suggesting an exponential or long-tailed distribution. That is why we need to look for another type of transformation.

 

Log Transformation on time series

A class of more extreme trends is exponential. A time series with an exponential growth trend can be made linear by taking the logarithm of the values.

The log transformation is, arguably, the most popular among the different types of transformations used to transform skewed data to approximately conform to normality. Maybe that is why I applied this transformation first back in my university days. If the original data follows a log-normal distribution, or approximately so, then the log-transformed data follows a normal or near-normal distribution.

To perform a log transformation in our Python script, we first need to import:

from numpy import log

Then we perform the log transformation on the time series dataset:

dataframe['passengers'] = log(dataframe['passengers'])
visualize('passengers')

Log transformation on a time series

Running the example results in a trend that does look a lot more linear than the square root transform above. The line plot shows a seemingly linear growth and variance. The histogram also shows a more uniform or Gaussian-like distribution of observations.

Log transforms are popular with time series data as they are effective at removing exponential variance. It is important to note that this operation assumes values are positive and non-zero.

Box-Cox Transformation on time series

The Box-Cox transformation is a family of power transformations indexed by a parameter lambda. Whenever you use it, the parameter needs to be set or estimated from the data.

Some common values for lambda:

  • lambda = -1. is a reciprocal transform.
  • lambda = -0.5 is a reciprocal square root transform.
  •  lambda = 0.0 is a log transform.
  •  lambda = 0.5 is a square root transform.
  •  lambda = 1.0 is no transform.

Implementing the Box-Cox transformation on a time series in Python: first, import the boxcox function:

from scipy.stats import boxcox

Perform the Box-Cox transformation on our time series:

dataframe['passengers'] = boxcox(dataframe['passengers'], lmbda=0.0)
visualize('passengers')

Box-Cox transformation on a time series

We can also leave lambda set to None and let the function find the best-fitting value for lambda:

dataframe['passengers'], lam = boxcox(dataframe['passengers'])
print('Lambda: %f' % lam)
visualize('passengers')
Lambda: 0.148023

Box-Cox transformation with lambda estimated automatically

You can see that with this approach the distribution looks more normal, since lambda is estimated automatically.

Even if the series does not appear to be stationary after applying the Box-Cox transformation, diagnostics from ARIMA modeling can be used to decide whether differencing or seasonal differencing might be useful to remove polynomial trends or seasonal trends, respectively. After that, the result might be an ARMA model that is stationary. If the diagnostics confirm the orders p and q for the ARMA model, the AR and MA parameters can then be estimated.

Regarding other possible uses of Box-Cox: in the case of a series of i.i.d. random variables that do not appear to be normally distributed, there may be a particular value of lambda that makes the data look approximately normal.

 

Moving Average Smoothing

Smoothing is a technique applied to time series to remove the fine-grained variation between time steps. The hope of smoothing is to remove noise and better expose the signal of the underlying causal processes. Moving averages are a simple and common type of smoothing used in time series analysis and time series forecasting. Calculating a moving average involves creating a new series where the values are comprised of the average of raw observations in the original time series.

A moving average requires that you specify a window size, called the window width. This defines the number of raw observations used to calculate the moving average value. The "moving" part in the moving average refers to the fact that the window defined by the window width is slid along the time series to calculate the average values in the new series. There are two main types of moving average: the centered and the trailing moving average.
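As a rough illustration of the two types, here is a small pandas sketch comparing a trailing window with a centered one; the window width of 3 and the toy values are arbitrary:

# Sketch: trailing vs. centered moving average with a window width of 3.
from pandas import Series

data = Series([1.0, 2.0, 4.0, 7.0, 11.0, 16.0])

trailing = data.rolling(window=3).mean()               # uses obs(t-2), obs(t-1), obs(t)
centered = data.rolling(window=3, center=True).mean()  # uses obs(t-1), obs(t), obs(t+1)

print(trailing)
print(centered)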

Calculating a moving average of a time series makes some assumptions about your data. It is assumed that both trend and seasonal components have been removed from your time series. This means that your time series is stationary, or does not show obvious trends (long-term increasing or decreasing movement) or seasonality (consistent periodic structure).

A moving average can be used as a data preparation technique to create a smoothed version of the original dataset. Smoothing is useful as a data preparation technique as it can reduce the random variation in the observations and better expose the structure of the underlying causal process.

Calculating a moving average over a 10-period window in Python:

#Moving Average
# tail-rolling (trailing) average transform over a 10-period window
rolling = series.rolling(window=10)
rolling_mean = rolling.mean()
print(rolling_mean.head(10))
# plot original and transformed dataset
series.plot()
rolling_mean.plot(color='red')
pyplot.show()

Moving average smoothing

The moving average is one of the most common sources of new information when modeling a time series forecast. In this case, the moving average is calculated and added as a new input feature used to predict the next time step.
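A sketch of that idea, continuing with the series variable from the code above: the rolling mean is shifted so that, at each row, the feature only summarizes observations that come before the value being predicted (the window width of 3 is arbitrary):

# Sketch: add a lagged rolling mean as an input feature for predicting the next step.
from pandas import concat, DataFrame

temps = DataFrame(series.values, columns=['t+1'])
# shift(1) ensures the mean uses only observations before the value being predicted
mean_feature = temps['t+1'].shift(1).rolling(window=3).mean()
supervised = concat([mean_feature.rename('mean(t-2..t)'), temps['t+1']], axis=1)
print(supervised.head(6))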

White Noise

If a time series is white noise, it is a sequence of random numbers and cannot be predicted. If the series of forecast errors is not white noise, it suggests that improvements could be made to the predictive model.

A time series is white noise if the variables are independent and identically distributed with a mean of zero.

White noise is an important concept in time series analysis and forecasting. It is important for two main reasons:

  • Time series predictability: If your time series is white noise then, by definition, it is random. You cannot model it and predict random occurrences.
  • Model evaluation: The series of errors from a time series forecast model should ideally be white noise. This means the errors are random.

Your time series is not white noise if:

  • Your series has a non-zero mean
  • The variance changes over time
  • Lagged values correlate with the current values of the series

To check whether your time series is white noise, it is a good idea to do some visualization and compute some summary statistics during the data inspection process (a sketch pulling these checks together on an artificial white noise series follows at the end of this section).

  • Like we did before, create a line plot and histogram. Check for gross features like a changing mean, variance, or obvious relationship between lagged variables.

    histogram from a white noise time series
  • Calculate summary statistics. Check the mean and variance of the whole series against the mean and variance of meaningful contiguous blocks of values in the series (e.g. days, months, or years).

Create an autocorrelation plot. Check for gross correlation between lagged variables.
First, import the autocorrelation plotting function:

from pandas.plotting import autocorrelation_plot

Then plot:

# autocorrelation
autocorrelation_plot(series)
pyplot.show()

You can easily spot the difference between the autocorrelation plot of the Airline Passengers dataset and that of a white noise time series.
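To make these checks concrete, here is a hedged sketch that generates an artificial Gaussian white noise series and runs the same inspections; the series length and seed are arbitrary choices:

# Sketch: generate Gaussian white noise and run the standard checks.
from random import gauss, seed
from pandas import Series
from pandas.plotting import autocorrelation_plot
from matplotlib import pyplot

seed(1)
noise = Series([gauss(0.0, 1.0) for _ in range(1000)])

# Summary statistics for the whole series and for two contiguous halves.
print(noise.describe())
print('first half mean/var :', noise.iloc[:500].mean(), noise.iloc[:500].var())
print('second half mean/var:', noise.iloc[500:].mean(), noise.iloc[500:].var())

# Line plot, histogram and autocorrelation plot.
noise.plot()
pyplot.show()
noise.hist()
pyplot.show()
autocorrelation_plot(noise)
pyplot.show()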

Random Walk and time series predictability

There is a tool called a random walk that can help you understand the predictability of your time series forecast problem.

A random walk is different from a list of random numbers because the next value in the sequence is a modification of the previous value in the sequence.

This dependence provides some consistency from step-to-step rather than the large jumps that a series of independent, random numbers provide.
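A minimal sketch of that definition, with unit steps and an arbitrary seed: each new value is simply the previous value plus a random step.

# Sketch: a random walk is the previous value plus a random step (-1 or +1 here).
from random import seed, random
from matplotlib import pyplot

seed(1)
walk = [0.0]
for _ in range(999):
    step = 1.0 if random() < 0.5 else -1.0
    walk.append(walk[-1] + step)

pyplot.plot(walk)
pyplot.show()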

We can check whether our time series dataset behaves like a random walk (i.e., is non-stationary) rather than white noise with a statistical test called the Augmented Dickey-Fuller (ADF) test.

To the code for our Airline time series, you need to add the adfuller import:

from statsmodels.tsa.stattools import adfuller

Then perform the statistical test:

# statistical test
result = adfuller(series)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

As a result, you will get something like:

ADF Statistic: 0.815369
p-value: 0.991880
Critical Values:
        1%: -3.482
        5%: -2.884
        10%: -2.579

The null hypothesis of the test is that the time series is non-stationary. Running the test, we see that the ADF statistic is 0.815369. This is larger than all of the critical values at the 1%, 5%, and 10% levels, so we cannot reject the null hypothesis: the time series does appear to be non-stationary.
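As a follow-up sketch (continuing from the code above, so the series variable is reused), one common next step is to difference the series once and re-run the same test. Whether the differenced series then looks stationary depends on the data; this only shows the mechanics.

# Sketch: first-difference the series and repeat the ADF test (mechanics only).
from statsmodels.tsa.stattools import adfuller

differenced = series.diff().dropna()

result = adfuller(differenced)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])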

Predictive Analytics from research and development to a business maker

The start of predictive analytics and machine learning

Predictive analytics started in the early 90s with pattern recognition algorithms—for example, finding similar objects. Over the years, things have evolved into machine learning. In the workflow of data analysis, you collect data, prepare data, and then perform the analysis. If you employ algorithms or functions to automate the data analysis, that’s machine learning.

Read more about the process of building data analysis.


How to boost your Machine learning model accuracy

boosting predictive machine learning algorithms

There are multiple ways to boost your predictive model accuracy. Most of these steps are really easy to implement, yet for many reasons data scientists fail to do proper data preparation and model tuning. In the end, they end up with average or below-average machine learning models.
Having domain knowledge gives you the best possible chance of improving your machine learning model's accuracy. However, if every data scientist follows these simple technical steps, they will end up with great model accuracy even without being an expert in a particular field.


Data scientists: How can I use my data?


I have been asked numerous times: "I got access to the database, can you please tell me how to use the data?"

Since I am asked this more and more, I took some time to answer it here and help people with this question.
 

What do you want to do with your data?

First and foremost, what are you trying to do with the data?
Ask yourself, your manager, a friend, or whoever is making you do something with the data: what do you want the data to show you?
Most of the time, data is only as powerful as your understanding of it. Here is how you can build that understanding:

Understand how your data is linked

Databases, whether relational or non-relational, have schemas. A schema shows where specific attributes are stored, in which tables or objects, and also how tables or objects are connected to each other: the linking.

What is an attribute? An attribute is basically anything descriptive: name, surname, address, profit, etc.
What is a table or object? A table or object is the structure that holds or groups the attributes.
What are links or keys? Links, keys, and foreign keys are the information that allows you to link one table to another. For example, if you want to link profit to a salesperson, or an address to a person, you join the tables. Mostly this is done using foreign keys or by joining on multiple attributes, creating a composite key. A small pandas sketch of this follows below.
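To make the idea of keys concrete, here is a small hypothetical pandas sketch (all table and column names are invented) that links profit to a salesperson through a foreign key:

# Sketch: joining two hypothetical tables on a foreign key with pandas.
from pandas import DataFrame

salespeople = DataFrame({
    'salesperson_id': [1, 2],
    'name': ['Ana', 'Boris'],
})
sales = DataFrame({
    'sale_id': [10, 11, 12],
    'salesperson_id': [1, 1, 2],  # foreign key pointing at salespeople
    'profit': [200.0, 150.0, 320.0],
})

# Link profit to the salesperson via the foreign key, then aggregate.
joined = sales.merge(salespeople, on='salesperson_id', how='left')
print(joined.groupby('name')['profit'].sum())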

How to learn the links between the data

Some people try to learn the schema all at once, seeing the tables, their attributes, and how they link to each other.
My suggestion is to learn by doing. Lately, in the world of big data, database systems are getting too complex to be learned all at once, and most of the time you won't need to know it all. Learning through practical use cases can help you understand not just the table structure but also the underlying data.
 

How do I get the data?

Usually we use SQL to query the databases. SQL is the fastest and best performing way to do it.

Other ways include using code: Java, .NET, R, Python, and so on.
In Excel, you can query data easily while creating a pivot table.
Lately, data scientists are using tools like KNIME and Alteryx to fetch the data. This approach does not require knowing any query language, but you risk downloading gigabytes of data into memory or onto disk if the table you are trying to query is that big.
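For the SQL route, here is a minimal hypothetical sketch using Python's built-in sqlite3 together with pandas; the database file, table and column names are placeholders:

# Sketch: fetching data with SQL into a pandas DataFrame (names are placeholders).
import sqlite3
from pandas import read_sql_query

connection = sqlite3.connect('company.db')  # placeholder database file
query = """
    SELECT s.sale_id, p.name, s.profit
    FROM sales AS s
    JOIN salespeople AS p ON p.salesperson_id = s.salesperson_id
"""
dataframe = read_sql_query(query, connection)
connection.close()
print(dataframe.head())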

Use your imagination.

Once you succeed in getting your data, start using your imagination and think of useful use-case scenarios that will help your business.
 

Visualize your data

Plain data is boring, which is why we visualize it.

An easy way to do that is Excel. Excel is really powerful by itself and can create pretty charts.
Some other popular tools are Tableau, QlikView, Jaspersoft, and Highcharts.
Of course, there are endless other solutions that can create pretty charts.
One word of caution about visualization: don't over-visualize things, because they can become really confusing. Also, try to use a few colors instead of the whole palette, so other people can follow you.

 

Create a story with your data

Now that you have your use case and cool visualization, try to create a story.
People will understand what you want to say and even get new ideas if you tell your analysis in a nice story.
Happy data mining, now that you know what to do with your data!

Modeling marketing multi-channel attribution in practice

Multi-channel attribution

What is the next step I need to take to close this deal? What will this customer ask for next, and how can I offer it to them? What is the shortest path to closing a deal?
How much do all my marketing and sales activities really cost? How much does a single action or marketing channel cost?

All these are common questions that marketing and sales teams face on a daily basis.

Luckily, there is an answer.

Here I present the advantages of using machine learning models to produce multi-channel attribution models. I strongly recommend you read it.

What skills does a data scientist need and how to get them?

Upgrading your skills constantly is the way to stay on top.
What skills do you need to become a Data Scientist?
I have written about this before, but I'll try to add some more information to help the people who really want to go down that path.

Free Tools can help a lot to start!

There are many tools that can help you get started easily, at least to some extent. KNIME is one great tool I use literally every day. It is really easy to learn and it covers 90% of the tasks you will be asked to do daily as a Data Scientist. Best of all, it is free.
Check it out here: https://www.knime.org/
Other similar tools: RapidMiner
The important thing is that you should know what to do with it.
I have given numerous courses on how to use the tool and how to start with super basic DS tasks.
Understanding basic terms can help you along the way:
What is regression and what is classification?
It is good to know how to approach a specific problem in order to solve it. Almost every problem we are trying to solve can fall into one of these two categories.

What algorithms can be used and should be used for each problem?

This is important but not a showstopper at the beginning. Decision trees can do just fine for a start.
How to do it:

Data Cleaning or Transformation

This is one of the most important things you'll come across working in Data Science. 90% of the time, you are not going to get well-formatted data. If you are skilled in one of the programming languages, Python or R, you should be a pro at packages like Pandas or Dplyr/Reshape.
Exploratory Data Analysis
I have written before on how you can start using the data. Check this link to get an idea.
Once again, this is the most important part: whether you are working to extract insights or you want to do predictive modeling, this step comes in. You must train your mind analytically to build a picture of the variables in your head, and you can build such a mind through practice. After that, you must be very good hands-on with packages like matplotlib or ggplot2, depending on the language you work with.

Machine Learning / Predictive Modelling

One of the most important aspects of today's data science is predictive modeling. This depends on your EDA and your knowledge of mathematics. I must tell you: invest your time in theory. The more theoretical knowledge you have, the better you are going to do. There is no easy way around it. There is a great course by Andrew Ng that goes deep into theory. Take it.

Programming Languages

If you want to go more advanced, it is important to have a grip on at least one programming language widely used in Data Science, and to know a little of another: either R very well and some Python, or Python very well and some R.
Take my case: I know R very well (at least I think so), but I can work with Python too (not at expert level), as well as Java, C#, and JavaScript. Anything works if you know how to use it when you need it.
An example of a complete data analysis done by a Data Scientist can be found here.
I use KNIME, R, and Python every day; I think if you are a total beginner, it is a good idea to start with KNIME.

Useful courses for learning Data Science

I really recommend spending some time on the following courses:
I have taken them myself and learned a lot from each of them.
Happy learning!
Image credit: House of bots

Machine learning in practice – Let the machine find the optimal number of clusters from your data

 

What is Clustering?

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

Where Clustering is used in real life: 

Clustering is used almost everywhere: search engines, marketing campaigns, biological analysis, cancer analysis. Your favorite phone provider runs cluster analysis to see which group of people you belong to before deciding whether to give you an additional discount or a special offer. The applications are countless.


How can I find the optimal number of clusters?

One fundamental question is: If the data is clusterable, then how to choose the right number of expected clusters (k)?
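The full post goes deeper, but as a hedged illustration, one common approach is to score a range of candidate values of k with a metric such as the silhouette score; the synthetic data and the range of k below are placeholders:

# Sketch: choose k by silhouette score on synthetic data (data and k range are placeholders).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

best_k, best_score = None, -1.0
for k in range(2, 9):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    score = silhouette_score(X, labels)
    print(k, round(score, 3))
    if score > best_score:
        best_k, best_score = k, score

print('Best k by silhouette score:', best_k)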

 
