How can Marketing use Data Science and AI?


Data Science is a field that extracts meaningful information from data and helps marketers discern the best insights. These insights can concern different marketing elements such as customer intent, experience, and habits, and they help marketers efficiently optimize their strategies and derive optimum revenue.

Over the past decade, online data consumption has soared dramatically with the growth of the World Wide Web. It is estimated that over 6 billion devices are connected to the web right now, and a couple of million terabytes of data are generated every single day.
For online marketers, this staggering amount of data is a gold mine. If this data is properly processed and analyzed, it can provide valuable insights that marketers can use to target customers. Nevertheless, decoding such large volumes of data is a mammoth job. This is where data science can profoundly help.


How can data science be implemented in Marketing?

Optimizing Marketing budget

Marketers constantly operate under a tight budget. The main objective of every online marketer is to derive optimum ROI from their allocated budget. Accomplishing this is always difficult and time-consuming. Things don't always go according to plan, and efficient budget usage is not always achieved.
By analyzing a marketer's spend and acquisition data, a data scientist can develop a spend model that helps put the budget to better use. The model can help marketers distribute their budget across areas, channels, mediums, and campaigns so as to optimize for their key metrics.
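As a minimal sketch of the idea, the toy spend model below (channel names and ROI figures are invented for illustration) reallocates a fixed budget in proportion to each channel's observed return per dollar:

```python
# Toy spend model: reallocate a fixed budget across channels in
# proportion to each channel's observed return per dollar spent.
# Channel names and ROI figures are invented for illustration.

def allocate_budget(total_budget, channel_roi):
    """Split total_budget across channels proportionally to ROI."""
    total_roi = sum(channel_roi.values())
    return {ch: total_budget * roi / total_roi
            for ch, roi in channel_roi.items()}

roi = {"search": 3.2, "social": 1.8, "email": 5.0}  # revenue per $1 spent
plan = allocate_budget(100_000, roi)
print(plan)
```

A production spend model would also account for diminishing returns per channel, rather than using a simple proportional split.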

A good example of optimizing a marketing budget using data science is Increase Marketing ROI with Multi-touch Attribution Modelling.

Identify the right channels for a specific Marketing campaign

Data science can be used to figure out which channels are delivering adequate lift for the marketer. Using a time series model, a data scientist can compare and measure the kind of lift seen across different channels. This is extremely advantageous, as it tells the marketer exactly which channel and medium are delivering adequate returns.
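As a minimal illustration of the idea, the snippet below compares each channel's average weekly conversions before and after a campaign launch to estimate its lift; all numbers are invented:

```python
# Toy lift estimate: compare average weekly conversions before and
# after a campaign launch, per channel. The weekly counts are invented.

def lift(series, launch_week):
    """Relative change in the mean after the launch week."""
    before = series[:launch_week]
    after = series[launch_week:]
    base = sum(before) / len(before)
    post = sum(after) / len(after)
    return (post - base) / base

channels = {
    "search": [100, 104, 98, 102, 140, 150, 145, 155],
    "display": [80, 82, 78, 81, 83, 79, 82, 80],
}
for name, weekly in channels.items():
    print(name, round(lift(weekly, 4), 2))
```

A real analysis would control for seasonality and trend, for example with a proper time series model, before attributing the change to the campaign.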

Increase Marketing ROI with Multi-touch Attribution Modelling and Modeling marketing multi-channel attribution in practice both discuss identifying the right marketing channels.

Marketing to the Right Audience

Typically, marketing campaigns are broadly distributed regardless of location and audience. As a result, there is a high chance that marketers overshoot their budget. They may also fail to achieve their objectives and revenue targets.
However, if they use data science to analyze their data correctly, they will be able to understand which areas and demographics are providing the greatest ROI.
Clustering is a good machine learning tool for identifying the right audience.

Matching the right Marketing Strategies with Customers

To obtain maximum value from their marketing strategies, marketers need to match them with the right customers. To do this, data scientists can create a customer lifetime value model that segments customers by their behavior. Marketers can use this model for a variety of use cases: they can send referral codes and cashback offers to their highest-value customers, apply retention strategies to users who are likely to churn, and so on.
Another even more powerful tool is Marketing Personalization. Marketing Personalization matches your offers with the right customers, ensuring the best ROI for your marketing campaigns.
Read more about Personalization – How much do you understand your customer? and about building marketing recommendation systems.

Customer Segmentation and Profiling

While marketing a product or service, marketers focus on building customer profiles. They are continually constructing specific lists of prospects to target. With data science, they can accurately decide which personas should be targeted. They can work out the number of personas and the kinds of traits they need in order to grow their customer base.
Clustering is the most widely used tool for creating marketing customer segments and profiles.
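As a sketch of how clustering segments customers, here is a minimal k-means implementation run on invented (monthly spend, visits per month) pairs; a real project would use a library implementation and far richer features:

```python
# Minimal k-means sketch for customer segmentation. Each customer is
# an invented (monthly_spend, visits_per_month) pair.
import random

def kmeans(points, k, iters=50, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers, clusters

customers = [(20, 1), (25, 2), (22, 1),      # low-spend segment
             (200, 8), (220, 10), (210, 9)]  # high-spend segment
centers, clusters = kmeans(customers, 2)
print(sorted(len(c) for c in clusters))
```

Each resulting cluster is a candidate segment; in practice you would profile the cluster centers (average spend, visit frequency) to name and target the segments.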

Email Campaigns

Data science can be used to figure out which emails appeal to which customers: how frequently these emails are read, when to send them, what kind of content resonates with the customer, and so on. Such insights enable marketers to send contextualized email campaigns and target customers with the right offers.
Creating personalized email campaigns is another example of how personalization can be used.

Sentiment Analysis

Marketers can use data science to perform sentiment analysis. This means they can gain better insight into their customers' sentiments, opinions, and attitudes. They can also keep track of how customers respond to marketing campaigns and whether they are engaging with the company.
With the recent advances in deep learning, the ability of algorithms to analyze text has improved considerably. Creative use of advanced machine learning techniques can make marketing offers far more effective.
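Production systems would use a trained deep learning model, but a minimal lexicon-based scorer is enough to illustrate the idea of sentiment scoring; the word lists below are invented for illustration:

```python
# Minimal lexicon-based sentiment sketch. Real systems would use a
# trained deep learning model; the word lists here are illustrative.

POSITIVE = {"great", "love", "excellent", "happy", "good"}
NEGATIVE = {"bad", "hate", "terrible", "angry", "poor"}

def sentiment(text):
    """Return a score in [-1, 1]: signed fraction of polar words."""
    words = text.lower().split()
    pos = sum(w.strip(".,!?") in POSITIVE for w in words)
    neg = sum(w.strip(".,!?") in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

print(sentiment("I love this brand, the service is great!"))
print(sentiment("Terrible campaign, bad targeting."))
```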

Recommender Systems – the start of marketing personalization


Recommender systems are tools for interacting with large and complex information spaces. They provide a personalized view of such spaces, prioritizing items likely to be of interest to the user. The field, christened in 1995, has grown enormously in the variety of problems addressed and techniques employed, as well as in its practical applications.

We encounter personalized offers every day, whether we are aware of it or not. I discuss this in Personalization – How much do you understand your customer?.
Recommender systems help companies give personalized offers and displays to their customers.

Research has incorporated a broad variety of artificial intelligence methods, including machine learning, data mining, user modeling, case-based reasoning, and constraint satisfaction, among others. Personalized recommendations are a vital part of many online e-commerce applications such as Netflix and Spotify. This wealth of practical application experience has motivated researchers to extend the reach of recommender systems into new and challenging areas. The purpose of this special issue is to take stock of the current landscape of recommender systems research and identify the directions the field is now taking. This article provides an overview of the current state of the field and introduces the various articles in the special issue.

The prototypical use case for a recommender system occurs in e-commerce settings. A user, Jane, visits her favorite online bookstore. The homepage lists current bestsellers and also a list of recommended items. This list might include, for instance, a new book published by one of Jane's favorite authors, a cookbook by a new author, and a supernatural thriller. Whether Jane will find these recommendations useful or distracting is a function of how well they match her tastes. Is the cookbook for a style of cuisine that she likes (and is it different enough from ones she already owns)? Is the thriller too violent? A crucial feature of a recommender system, therefore, is that it provides a personalized view of the data, in this case, the bookstore's inventory. If we eliminate the personalization, we are left with the list of bestsellers – a list that is independent of the user. The recommender system aims to reduce the user's search effort by listing the items of highest utility, those that Jane is most likely to purchase. This benefits both Jane and the e-commerce store owner.

Recommender systems research encompasses scenarios like this and various other information access environments in which a user and a site owner can benefit from the presentation of personalized alternatives. The field has seen tremendous growth of interest in the past decade, catalyzed in part by the Netflix Prize and evidenced by the rapid growth of the annual ACM Recommender Systems conference. At this point, it is worthwhile to take stock, to consider what differentiates recommender systems research from other related areas of artificial intelligence research, and to examine the field's successes and new challenges.

What is a Recommender System?

In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients. In some cases the primary transformation is in the aggregation; in others the system’s value lies in its ability to make good matches between the recommenders and those seeking recommendations.

Two basic concepts stand out that differentiate recommender systems research:

  • A recommender system is personalized. The recommendations it produces are meant to optimize the experience of one user, not to represent group consensus for all.
  • A recommender system is meant to help the user choose among discrete alternatives. Usually, the items are already known in advance and are not created in a bespoke fashion.

The personalization aspect of recommender systems distinguishes this line of research most strongly from what is typically understood as research in search engines and other information retrieval applications. In a search engine or other information retrieval system, we expect the set of results relevant to a particular query to be the same regardless of who issued it. Many recommender systems accomplish personalization by maintaining profiles of a user's activity (long-term or short-term) or stated preferences. Others achieve a personalized result through conversational interaction.

A Recommender System Typology

Recommender systems research is characterized by a common problem area rather than a shared technology or technique. A variety of research methods have been applied to the recommender systems problem, from statistical approaches to ontological reasoning, and a wide range of problems have been tackled, from selecting consumer products to finding friends and lovers. One lesson learned over the past years of recommender systems research is that the application domain exerts a strong influence over the types of methods that can be applied effectively.
Domain characteristics like the persistence of the user's utility function have a significant effect: for instance, a user's taste in music may change slowly, but his interest in celebrity news stories may change much more quickly. Hence, the reliability of preferences collected in the past may differ. Likewise, some items, such as books, are available for recommendation and consumption over a long period of time, often years.
On the other hand, in a technological domain, such as mobile phones or cameras, old items quickly become outdated and cannot be usefully recommended. The same is true of areas where timeliness matters, such as news and cultural events. It is not surprising, therefore, that there are many strands of research in recommender systems, as researchers tackle a range of recommendation domains. To unify these different approaches, it is helpful to consider the AI elements of recommendation, in particular, the knowledge sources underlying a recommender system.

Knowledge Sources

Every AI system makes use of one or more sources of knowledge to do its work. A supervised machine learning system, for instance, has a labeled collection of data as its primary knowledge source, but the algorithm and its parameters can be considered another, implicit form of knowledge brought to bear on the classification task. Recommendation algorithms can likewise be categorized according to the knowledge sources they use.

There are three basic types of knowledge:

  • social knowledge about the user base at large,
  • individual knowledge about the particular user for whom recommendations are sought (and possibly knowledge about the specific requirements those recommendations need to fulfill), and lastly
  • content knowledge about the items being recommended, ranging from simple feature lists to more complex ontological knowledge and means-ends knowledge that allows the system to reason about how an item can meet a user's needs.

Types of Recommender Systems

Collaborative Recommendation system


The most popular strategy in recommendation systems is collaborative recommendation.
The basic insight behind this strategy is a kind of consistency in the world of taste: if users Alice and Bob have the same utility for items 1 through k, then the chances are good that they will have the same utility for item k+1.
Typically, these utilities are based on ratings that users have supplied for items with which they are already familiar. The key advantage of collaborative recommendation is its simplicity. The problem of computing utility is transformed into the problem of extrapolating missing values in the ratings matrix: the sparse matrix where each user is a row, each item a column, and the values are the known ratings. This insight can be operationalized in several ways. Originally, clustering techniques like nearest-neighbor were applied to find communities of like-minded peers. However, matrix factorization and other dimensionality-reduction techniques are now recognized as superior in accuracy.
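The following toy sketch shows the matrix-factorization idea: learn low-dimensional user and item factors by gradient descent on the known entries of a small invented ratings matrix, then predict a missing entry (marked `None`):

```python
# Tiny matrix-factorization sketch: learn latent factors for users and
# items by stochastic gradient descent on the known entries of a
# ratings matrix. None marks a missing rating; the data is invented.
import random

R = [[5, 4, None, 1],
     [4, 5, 1, None],
     [1, None, 5, 4],
     [None, 1, 4, 5]]

def factorize(R, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    rng = random.Random(seed)
    n, m = len(R), len(R[0])
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(m)]
    for _ in range(steps):
        for i in range(n):
            for j in range(m):
                if R[i][j] is None:
                    continue  # only known ratings contribute to the loss
                err = R[i][j] - sum(U[i][f] * V[j][f] for f in range(k))
                for f in range(k):
                    U[i][f] += lr * (err * V[j][f] - reg * U[i][f])
                    V[j][f] += lr * (err * U[i][f] - reg * V[j][f])
    return U, V

U, V = factorize(R)
# Predict user 0's rating for the unseen item 2.
pred = sum(U[0][f] * V[2][f] for f in range(2))
print(round(pred, 1))
```

Because user 0's known ratings resemble user 1's, who rated item 2 low, the predicted rating comes out low as well; this is the taste-consistency insight in action.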
Some problems with collaborative recommendation are well established:

  • New items cannot be recommended without relying on some additional knowledge source. Extrapolation depends on having some values from which to project. Indeed, sparsely-rated items in general present a problem, because the system lacks information on which to base predictions, and users who have provided few ratings will get noisier recommendations than those with longer histories. The problems of new users and new items are jointly called the “cold start” problem of collaborative recommendation.
  • The distribution of ratings and user preferences in many consumer taste domains is highly concentrated: a small number of “blockbuster” items receive a great deal of attention, while many items are rarely rated.

Malicious users may be able to generate large numbers of pseudonymous profiles and use them to bias the system's recommendations one way or another. There is still a good deal of algorithmic research focused on the problems of collaborative recommendation: more accurate and efficient estimates of the ratings matrix, better handling of new users and new items, and the extension of the basic collaborative recommendation idea to new kinds of data, including multi-dimensional ratings and user-contributed tags, among others.

Content-based Recommendation system


Before the development of collaborative recommendation in the 1990s, earlier research in personalized information access had focused on combining knowledge about items with information about users' preferences to find appropriate items. This approach, because of its reliance on the content knowledge source, in particular item features, has come to be known as content-based recommendation. Content-based recommendation is closely related to supervised machine learning.
We can view the problem as one of learning a set of user-specific classifiers where the classes are “useful to user X” and “not useful to user X.” One of the important concerns in content-based recommendation is feature quality. The items to be recommended must be described in such a way that meaningful learning of user preferences can take place.
Ideally, every item would be described at the same level of detail, and the feature set would include descriptors that correspond to the discriminations made by users. Unfortunately, this is frequently not the case. Descriptions may be partial, or some parts of the item space may be described in greater detail than others. The match between the feature set and the user's utility function also needs to be good. One of the strengths of the popular Pandora streaming music service is that music experts manually assign the features it uses for musical selections. Automatic music processing is not yet capable enough to reliably extract features like “bop feel” from a Charlie Parker recording. In addition to the development and application of new intelligent algorithms for the recommendation task, research in content-based recommendation also examines the problem of feature extraction in various domains.
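As a sketch of the content-based idea, the snippet below scores candidate items by cosine similarity between each item's feature vector and a profile averaged from the items a user liked; the items and features are invented:

```python
# Content-based sketch: recommend the candidate whose feature vector
# is most similar to a profile built from the user's liked items.
# The item catalog and feature names are invented for illustration.
import math

ITEMS = {
    "thriller_a": {"violence": 0.9, "romance": 0.1, "humor": 0.0},
    "cookbook_b": {"violence": 0.0, "romance": 0.2, "humor": 0.5},
    "comedy_c":   {"violence": 0.1, "romance": 0.3, "humor": 0.9},
}

def cosine(a, b):
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)

def recommend(liked, candidates):
    # The user profile is the average of the liked items' vectors.
    profile = {}
    for name in liked:
        for k, v in ITEMS[name].items():
            profile[k] = profile.get(k, 0) + v / len(liked)
    return max(candidates, key=lambda c: cosine(profile, ITEMS[c]))

print(recommend(["comedy_c"], ["thriller_a", "cookbook_b"]))
```

A user who liked the humorous comedy is matched to the light-hearted cookbook rather than the violent thriller, purely on the strength of the item features.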
A further subtype of content-based recommendation is knowledge-based recommendation, in which the reliance on item features is extended to other kinds of knowledge about items and their possible utilities for users. An example of this type of system is an investment recommender that needs to know about the risk profiles and tax consequences of various investments and how these interact with the financial position of the investor. As with other knowledge-based systems, knowledge acquisition, maintenance, and validation are crucial problems. Also, since knowledge-based recommenders can solicit detailed requirements from the user, user interface research has been paramount in developing knowledge-based recommenders that do not place too great a burden on users.

Because of the difficulties of running large-scale user studies, recommender systems have conventionally been evaluated on one or both of the following measures:

  • Prediction accuracy. How well do the system’s predicted ratings compare with those that are known, but withheld?
  • Precision of recommendation lists. Given a short list of recommendations produced by the system (typically all a user would have the patience to examine), how many of the entries match known “liked” items?
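Both measures are straightforward to compute; a minimal sketch with invented ratings and lists:

```python
# Sketch of the two conventional metrics: RMSE on withheld ratings,
# and precision of a top-k recommendation list. Numbers are invented.
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and withheld ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))

def precision_at_k(recommended, liked, k):
    """Fraction of the top-k recommendations that are known liked items."""
    hits = sum(1 for item in recommended[:k] if item in liked)
    return hits / k

print(rmse([4.1, 2.9, 5.0], [4, 3, 4]))
print(precision_at_k(["a", "b", "c", "d"], {"a", "c", "z"}, 3))
```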

Both of these conventional measures are lacking in some essential respects, and many of the new areas of exploration in recommender systems have led to experimentation with new evaluation metrics to supplement them. One of the most significant issues arises from the long-tailed distribution of ratings in many datasets.

A recommendation strategy that optimizes for high accuracy over the entire data set therefore contains an implicit bias towards popular items, and may fail to capture aspects of utility associated with novelty. An accurate prediction for an item the user already knows about is inherently less useful than a prediction for an unknown item. To address this problem, some researchers are examining the balance between accuracy and diversity in a set of recommendations and working on algorithms that are sensitive to item distributions. Another issue with traditional recommender system evaluation is that it is essentially static.
A fixed database of ratings is divided into training and test sets and used to demonstrate the effectiveness of an algorithm. However, the user experience of recommendation is entirely different.
In an application like movie recommendation, the field of items is always expanding, a user's tastes are evolving, and new users come to the system. Some recommendation applications require that we take the dynamic nature of the recommendation environment into account and evaluate our algorithms accordingly. Another relatively under-examined area of evaluation is the interaction between the utility functions of the site owner and the user, which necessarily look somewhat different. Owners implement recommender systems to achieve business objectives, usually increased profit. The owner, therefore, might prefer an imperfect match with a high profit margin to a perfect match with a minimal one.
On the other hand, a user who is presented with low-utility recommendations may cease to trust the recommendation function or the whole site. Owners of high-volume sites can field algorithms side by side in randomized trials and observe sales and profit differentials. However, such results rarely filter out into the research community.

Before implementing a recommendation system in your organization, you must plan carefully what you will recommend, to whom, and in which way. This is the most important first step and the foundation of success for your personalized offers.
I strongly recommend that you continue reading Personalization – How much do you understand your customer? as well as Increase Marketing ROI with Multi-touch Attribution Modelling. They are both really important parts of the personalization system.


Increase Marketing ROI with Multi-touch Attribution Modelling

Using Multi-touch Attribution Modelling, advertisers typically realize a 15%–44% improvement in marketing ROI with an advanced multi-channel machine learning algorithm. Advanced machine learning techniques for marketing can give you true insight into performance, right down to the channel, campaign, and device level.

Implementing advanced machine learning models for marketing attribution will give you the following advantages:

  • Data-driven multi-touchpoint attribution.
  • Transparent and neutral: agency-independent and media-independent.
  • Connect the results of every touchpoint and campaign to your revenue stream.
  • Calculate the total cost of individual marketing campaigns to determine your ROI per campaign.
  • Channel-neutral: we evaluate both online and offline marketing channels and campaigns.
  • Budget optimiser to maximize your marketing ROI.

Even small adjustments can have a noticeable effect on your marketing effectiveness and marketing ROI. With our Attribution Modelling machine learning model, we make the seemingly-complicated task of tracking the monetary impact of every touchpoint, every channel and every campaign easy. We can combine online and offline customer journeys and touchpoints to give you the full picture.

Non AI attribution modeling approaches

The Last Action Click model attributes 100% of the conversion value to the most recent Action ad that the customer clicked before buying or converting.

When it’s useful: If you want to identify and credit the Action that closed the most conversions, use the Last Action Click model.

The First Interaction model attributes 100% of the conversion value to the first channel with which the customer interacted.

When it’s useful: This model is appropriate if you run ads or campaigns to create initial awareness. For example, if your brand is not well known, you may place a premium on the keywords or channels that first exposed customers to the brand.

The Linear model gives equal credit to each channel interaction on the way to conversion.

When it’s useful: This model is useful if your campaigns are designed to maintain contact and awareness with the customer throughout the entire sales cycle. In this case, each touchpoint is equally important during the consideration process.

If the sales cycle involves only a short consideration phase, the Time Decay model may be appropriate. This model is based on the concept of exponential decay and most heavily credits the touchpoints that occurred nearest to the time of conversion. The Time Decay model has a default half-life of 7 days, meaning that a touchpoint occurring 7 days prior to a conversion will receive 1/2 the credit of a touchpoint that occurs on the day of conversion. Similarly, a touchpoint occurring 14 days prior will receive 1/4 the credit of a day-of-conversion touchpoint. The exponential decay continues within your lookback window (default of 30 days).

When it’s useful: If you run one-day or two-day promotion campaigns, you may wish to give more credit to interactions during the days of the promotion. In this case, interactions that occurred one week before have only a small value as compared to touchpoints near the conversion.
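The half-life rule above can be sketched directly: each touchpoint's raw credit is 0.5 raised to (days before conversion / half-life), and the credits are then normalized to sum to 1.

```python
# Time Decay attribution sketch: a touchpoint's credit halves every
# `half_life` days before the conversion, then credits are normalized.

def time_decay_credit(days_before_conversion, half_life=7):
    raw = [0.5 ** (d / half_life) for d in days_before_conversion]
    total = sum(raw)
    return [w / total for w in raw]

# Touchpoints 14, 7 and 0 days before the conversion.
credits = time_decay_credit([14, 7, 0])
print([round(c, 2) for c in credits])
```

As the model promises, the day-of-conversion touchpoint receives exactly twice the credit of the touchpoint 7 days earlier.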

  The Position Based model allows you to create a hybrid of the Last Interaction and First Interaction models. Instead of giving all the credit to either the first or last interaction, you can split the credit between them. One common scenario is to assign 40% credit each to the first interaction and last interaction, and assign 20% credit to the interactions in the middle.

When it’s useful: If you most value touchpoints that introduced customers to your brand and final touchpoints that resulted in sales, use the Position Based model.
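A sketch of the Position Based split described above; the 40/40/20 weights, and the even split chosen for two-touch journeys, are assumptions for illustration:

```python
# Position Based attribution sketch: 40% to the first touch, 40% to
# the last touch, and the remaining 20% split evenly across middle
# touches. The two-touch handling is an assumption for illustration.

def position_based_credit(n_touches, first=0.4, last=0.4):
    if n_touches == 1:
        return [1.0]
    if n_touches == 2:
        # No middle touches: split the leftover evenly between the two.
        leftover = (1 - first - last) / 2
        return [first + leftover, last + leftover]
    middle = (1 - first - last) / (n_touches - 2)
    return [first] + [middle] * (n_touches - 2) + [last]

print(position_based_credit(5))  # a five-touch journey
```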

Stepping away from simplistic models like Last Click lets you take into account the full customer journey and properly account for all your marketing investments.

A Last Click attribution model leads to overvaluing and undervaluing certain channels, while Markov models, which take into account the full customer journey, are far more accurate. Using a good attribution model that takes the full customer journey into account lets you optimise marketing decisions based on the most valuable touchpoints. Our Attribution insights solutions are backed by a team of experienced data scientists who honed their craft at well-known multinational corporations. We extract valuable insights and recommendations so you can take action where it matters most to increase your MROI.

Algorithmic multi channel attribution

Here I describe in more detail how I use machine learning modeling to build multi-channel attribution models.


Using machine learning, we can develop many models for comparison, such as Last Click, First Click, Linear, and the algorithmic Markov model, to deliver the right solution for you, in the way that suits you best. This gives you the best possible understanding of your strongest and weakest customer touchpoints, so you can optimize them for maximum effect.
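As a sketch of the algorithmic Markov approach, the snippet below builds a first-order Markov chain from a handful of invented journeys and computes each channel's removal effect: the relative drop in overall conversion probability when that channel is taken out of the graph.

```python
# Markov attribution sketch: build a first-order Markov chain from
# observed journeys, then measure each channel's "removal effect".
# The journeys are invented for illustration.
from collections import defaultdict

JOURNEYS = [
    (["search", "social"], True),
    (["search"], True),
    (["social"], False),
    (["display", "search"], True),
    (["display"], False),
]

def conversion_prob(journeys, removed=None):
    # Build transition counts; a path touching the removed channel is
    # redirected to the absorbing "null" (no conversion) state there.
    counts = defaultdict(lambda: defaultdict(int))
    for path, converted in journeys:
        states = ["start"] + list(path) + \
                 (["conversion"] if converted else ["null"])
        if removed in states:
            states = states[:states.index(removed)] + ["null"]
        for a, b in zip(states, states[1:]):
            counts[a][b] += 1
    # Fixed-point iteration for P(reach "conversion" | state).
    p = defaultdict(float)
    p["conversion"] = 1.0
    for _ in range(100):
        for state, nexts in counts.items():
            total = sum(nexts.values())
            p[state] = sum(n / total * p[nxt] for nxt, n in nexts.items())
    return p["start"]

base = conversion_prob(JOURNEYS)
for channel in ["search", "social", "display"]:
    drop = (base - conversion_prob(JOURNEYS, removed=channel)) / base
    print(channel, round(drop, 2))
```

In a production model the removal effects would be normalized into attribution weights and estimated on far larger journey sets, but the principle is the same: a channel earns credit in proportion to how much conversions fall without it.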

Your Machine Learning project needs good Data. How to solve the problem of lack of data?

Machine learning applications are reliant on, and sensitive to, the data they train on. These best practices will help you ensure that your training data is of high quality.
To be effective, machine learning (ML) needs a significant amount of data.
We can expect a child to understand what a cat is and identify other cats after just a couple of encounters, or after being shown a few examples of cats, but machine learning algorithms need many, many more examples. Unlike humans, these algorithms can't easily develop inferences on their own. Consider, for instance, a machine learning algorithm analyzing an image of a cat.

The algorithms need a lot of data to separate the relevant “features” of the cat from the background noise. The same goes for other noise, such as lighting and weather conditions. Unfortunately, this data hunger does not stop at separating signal from noise. The algorithms also need to recognize the significant features that distinguish the cat itself. Variations that humans do not need additional data to understand, such as a cat's color or size, are challenging for machine learning.

Without an adequate number of samples, machine learning supplies no advantage.

Not all Machine Learning methods require loads of data

Many types of machine learning techniques exist, and some have been around for many years. Each has its strengths and weaknesses. These differences also extend to the nature and amount of data required to build effective models. For example, deep learning neural networks (DLNNs) are an exciting area of machine learning because they can deliver dramatic results. However, DLNNs require a larger quantity of data than more established machine learning algorithms, along with a large amount of computing horsepower. In fact, DLNNs were considered practical only after the advent of big data (which supplied the large data sets) and cloud computing (which offered the number-crunching capability).

Other factors affect the need for data. Generic machine learning algorithms do not include domain-specific information; they must overcome this limitation through large, representative data sets. Referring back to the cat example, these algorithms do not understand the fundamental features of cats, nor do they know that backgrounds are noise. So they need many instances of the data to learn such distinctions.

To decrease the data needed in these scenarios, machine learning algorithms can include a level of domain knowledge, so that the important features and characteristics of the target data are already known. The focus of learning can then be strictly on optimizing output. This need to “imbue” human knowledge into the machine learning system from the start is a direct consequence of the data-hungry nature of machine learning.

Training Data Sets Need Improvement

To truly drive innovation using machine learning, a significant amount of change needs to happen first around how input data is chosen.

Curating (that is, selecting the data for a training data set) is, at heart, about keeping an eye on data quality. “Garbage in, garbage out” is especially true with machine learning. Compounding this issue is the relative “black box” nature of machine learning, which prevents us from understanding why machine learning produces a particular output. When machine learning creates unexpected output, it is usually because the input data was not suitable, but identifying the particular nature of the problem data is a challenge.

Two typical problems caused by poor data curation are overfitting and bias. Overfitting is the result of a training data set that does not adequately represent the actual variation of production data; it therefore produces a model that can deal with only a portion of the full data stream.

Bias is a deeper issue that traces back to the same root cause as overfitting but is harder to identify and understand: biased data sets are not representative, have a skewed distribution, or do not include the right data in the first place. This incomplete training data results in biased output that draws incorrect conclusions, which may be difficult to recognize as inaccurate. Although there is much optimism about machine learning applications, data quality problems should be a significant concern as machine-learning-as-a-service offerings come online.

A related problem is having access to high-quality data sets. Big data has produced numerous data sets; however, these sets rarely contain the kind of detail needed for machine learning. Data used for machine learning needs both the data itself and the outcome associated with it. Using the cat example, images need to be tagged to indicate whether a cat is present.

Other machine learning tasks can require even more complex data. The need for large volumes of sample data, combined with the need to have this data sufficiently and accurately described, produces an environment of data haves and have-nots. Only large companies with access to the best data, and the deep pockets to curate it, will be able to benefit from machine learning quickly. Unless the playing field is leveled, progress will be muted.

How can innovation solve these data problems?

Just as machine learning can be applied to real problem solving, the very same technologies and techniques used to sort through countless pages of data to identify key insights can be applied to the problem of finding high-quality training data.

To improve data quality, some attractive options are available for automating problem detection and correction. For example, clustering or regression algorithms can be used to scan proposed input data sets to discover hidden anomalies. Alternatively, the process of determining whether data is representative can be automated. If not properly addressed, hidden anomalies and unrepresentative data can result in overfitting and bias.

If the input data stream is meant to be reasonably consistent, regression algorithms can identify outliers that might represent garbage data that could negatively affect a learning session. Clustering algorithms can help examine a data set expected to contain a specific number of categories to determine whether the data actually comprises more or fewer types, either of which can lead to poor results. Other ML techniques can be used to validate the accuracy of the tags on the sample data. We are still in the early stages of automated input data quality assurance, but it looks promising.
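As a concrete sketch of the outlier idea, here is a simple z-score filter over a supposedly consistent stream (data invented; real pipelines would typically use more robust statistics such as the median absolute deviation):

```python
from statistics import mean, stdev

def find_outliers(values, threshold=2.5):
    """Flag points whose z-score exceeds the threshold: candidates for garbage data.

    Threshold is 2.5 rather than the textbook 3 because a single extreme
    point inflates the standard deviation, capping achievable z-scores
    in a small sample.
    """
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# A sensor stream that should hover around 10; one reading is clearly garbage.
stream = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 55.0, 10.0, 9.7]
print(find_outliers(stream))  # [55.0]
```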

To increase access to useful data sets, one new strategy deals with synthetic data. Rather than attempting to collect real sample sets and then tag them, companies use generative adversarial networks (GANs) to generate and tag the data. In this setup, one neural network generates the data, and another neural network tries to determine whether the data is real. This process can be left unattended, with impressive results.

Reinforcement learning is also getting real traction as a way to address the absence of data. Systems that employ this technique learn from data gathered through interactions with their immediate environment. Over time, the system can develop new inferences without needing curated sample data.

Data Is Driving Innovation

Promising and ongoing work using machine learning technologies is solving a variety of problems and automating work that is expensive, time-consuming, and complex (or a mix of all three). Yet without the necessary source data, machine learning can go nowhere. Efforts to simplify and broaden access to large volumes of high-quality input data are essential to increase the use of ML in a much broader set of domains and continue to drive innovation.

Guided analytics is the future of Data Science and AI

Nowadays, people use, and take for granted, the added value in their lives from Siri, Google Assistant, or Alexa: answering odd trivia questions, checking the weather, ordering groceries, getting driving directions, turning on the lights, and even inspiring a dance party in the kitchen. These are wonderfully useful (often fun) AI-based devices that have enhanced people's lives. However, humans are not having deep, meaningful conversations with these devices. Instead, automated assistants address the specific requests made of them. If you're exploring AI and machine learning in your enterprise, you may have encountered the claim that, once fully automated, these technologies can replace data scientists entirely. It's time to rethink this assertion.

The Issue with Fully Automated Analytics

How do all the driverless, automated AI and machine learning systems fit into the enterprise? Their objective is either to encapsulate (and hide) existing data scientists' expertise or to apply advanced optimization schemes to the fine-tuning of data science tasks.

Automated systems can be useful if no in-house data science expertise is available, but they are also somewhat limiting. Business analysts who depend on data to do their jobs get locked into the prepackaged expertise and a limited set of hard-coded scenarios.

In my experience as a data scientist, automation tends to miss the most crucial and fascinating pieces, which can be very important in today's highly competitive marketplace. If data scientists are allowed to take a somewhat more active approach and guide the analytics process, however, the world opens up considerably.

Why a Guided Analytics Approach Makes Sense

For companies to get the most out of AI and data science, to effectively predict future outcomes and make better business decisions, fully automated data science sandboxes need to be left behind. Instead, enterprises need to set up interactive exchanges between data scientists, business analysts, and the machines doing the work in the middle. This requires a process known as "guided analytics," in which human feedback and guidance can be applied whenever required, even while an analysis is in progress.

The goal of guided analytics is to enable a team of data scientists with different preferences and skills to collaboratively build, maintain, and continuously refine a set of analytics applications that offer business users varying degrees of interaction. Put simply, all stakeholders work together to create a better analysis.

Companies that wish to create a system that facilitates this type of interaction, while still building a practical analytics application, face a huge but not insurmountable challenge.

Common Attributes

I have identified four common properties that help data scientists successfully create the right environment for the next type of smart applications: the ones that will help them derive real business value from AI and machine learning.

Applications that offer business users just the right amount of guidance and interaction allow teams of data scientists to pool their expertise collaboratively. When certain properties come together, data scientists can build interactive analytics applications that show adaptive potential.

The ideal environment for guided analytics shares these four characteristics:

Open: Applications shouldn't be burdened with restrictions on the kinds of tools used. With an open environment, collaboration can occur between scripting gurus and those who want to reuse their expertise without diving into their code. Besides, it's a plus to be able to connect to other tools for specific data types, as well as interfaces specialized for high-performance or big data algorithms (such as H2O or Spark), from within the very same environment.

Agile: Once the application is deployed, new demands will emerge rapidly: more automation here, more customer feedback there. The environment used to develop these analytics applications also needs to make it easy for other members of the data science team to quickly adjust existing analytics applications to new and changing requirements, so they continue to yield meaningful results over the long term.

Versatile: Under the application, the environment should also be able to run simple regression models or manage complex parameter optimization and ensemble models, ranging from one to thousands of models. It's worth noting that this piece (or at least some elements of it) can be hidden entirely from the business user.

Uniform: At the same time, the experts doing the data science should be able to perform all their operations in the same environment. They need to blend data, run the analysis, mix and match tools, and build the infrastructure to deploy the resulting analytics applications, all from that same intuitive and agile environment.

Some AI-based applications will simply provide an overview or forecast at the press of a button. Others will allow the end user to select the data sources to be used. Still others will ask the user for feedback that ends up refining the model(s) trained under the hood, factoring in the user's knowledge. Those models can be simple or arbitrarily complex ensembles or entire model families, and the end user may or may not be asked to help fine-tune that setup. Control over how much of this interaction is required rests in the hands of the data scientists who designed the underlying analytics process with their target audience, the actual business users' interests (and abilities), in mind.

The big question you may be asking is: how do I do this in my organization? You might think it is not realistic for your team to build this on its own; you are resource-constrained as it is. The good news is that you do not have to.

Software, particularly open source software, is available that makes it practical to implement guided analytics. Using it, teams of data scientists can collaborate through visual workflows. They can give their expert business colleagues access to those workflows through web interfaces. Additionally, there is no need to use another tool to develop a web application; the workflow itself models the interaction points that make up an analytics application. Workflows are the glue holding it all together: the various tools used by different members of the data science team, the data blended from numerous sources by the data engineering experts, and the interaction points modeling the UI parts visible to the end user. It is all easily within your grasp.

Guided analytics in the coming years

Interest in guided analytics is growing, allowing users not only to wrangle data but also to fine-tune their analyses. It is exciting to see how much collaboration this triggers. It will also be fascinating to watch data scientists build increasingly practical analytics applications that help users develop analyses with real business impact.

Instead of taking experts out of the driver's seat and trying to automate their wisdom, guided analytics aims to combine the best of both. This is good for data scientists, business analysts, and the practice of data analytics in general. Eventually, it will be necessary for innovation too. Although it might seem challenging now, the effort will be worth it to ensure a better future.

Data Engineering

Making Machine Learning more efficient with the cloud

In essence, machine learning is a productivity tool for data scientists. As the heart of systems that can learn from data, machine learning lets data scientists train a model on an example data set and then use algorithms that automatically generalize and learn both from that example and from new data feeds. With unsupervised methods, data scientists can do without training examples entirely and use machine learning to distill insights directly and continuously from the data.

I write more here about the advantages of using the cloud for building machine learning projects.

Machine learning can infuse every application with predictive power. Data scientists use these sophisticated algorithms to dissect, search, sort, infer, foretell, and otherwise understand the growing amounts of data in our world.

To achieve machine learning's full potential as a business resource, data scientists need to train it on the rich troves of data on the mainframes and other servers in your private cloud. For truly robust business analytics, you need machine learning platforms engineered to provide the following:

  • Automation and optimization: Your enterprise machine learning platform should allow data scientists to automate the creation, training, and deployment of algorithmic models against high-value corporate data. The platform should help them select the optimal algorithm for each data set. The way to do this is with a system that scores their data against the available algorithms and provisions the algorithm that best matches their requirements.
  • Efficiency and scalability: The platform needs to be able to continually develop, train, and deploy a high volume of machine learning models against data kept in large business databases. It should allow data scientists to deliver better, fresher, more frequent predictions, thereby speeding time to insight.
  • Security and governance: The system should enable data scientists to train models without moving the data off the mainframe or other business platform where it is protected and governed. In addition to minimizing latency and managing the cost of performing machine learning in your data center, this approach eliminates the risks of doing ETL on a platform separate from the node where machine learning execution occurs.
  • Versatility and programmability: The platform should allow data scientists to use any language (e.g., Scala, Java, Python), any popular framework (e.g., Apache SparkML, TensorFlow, H2O), and any transactional data type throughout the machine learning development lifecycle.
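The "scoring data against available algorithms" idea in the first bullet can be sketched in a few lines: fit each candidate model, score it on held-out data, and provision the winner. This is a toy illustration, not any vendor's API; the two model fitters and the data are invented.

```python
def fit_mean(xs, ys):
    """Constant model: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Least-squares line y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x, b=my - a * mx: a * x + b

def score(model, xs, ys):
    """Mean squared error on holdout data; lower is better."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = [1, 2, 3, 4], [2.1, 3.9, 6.0, 8.1]   # roughly y = 2x
hold_x, hold_y = [5, 6], [10.2, 11.8]                    # held-out data

candidates = {"mean": fit_mean, "line": fit_line}
scores = {name: score(fit(train_x, train_y), hold_x, hold_y)
          for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # the line model wins on this data
```

A real platform does the same loop over dozens of algorithm families with cross-validation, but the shape of the decision is identical.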

Taking the above points into account, developing your machine learning and AI project in the cloud can really make a difference.

What are the Benefits of Machine Learning in the Cloud?

  • The cloud’s pay-per-use model is good for bursty AI or machine learning workloads.
  • The cloud makes it easy for enterprises to experiment with machine learning capabilities and scale up as projects go into production and demand increases.
  • The cloud makes intelligent capabilities accessible without requiring advanced skills in artificial intelligence or data science.
  • AWS, Microsoft Azure, and Google Cloud Platform offer many machine learning options that don’t require deep knowledge of AI, machine learning theory, or a team of data scientists.

You don’t need to use a cloud provider to build a machine learning solution. After all, there are plenty of open source machine learning frameworks, such as TensorFlow, MXNet, and CNTK that companies can run on their own hardware. However, companies building sophisticated machine learning models in-house are likely to run into issues scaling their workloads, because training real-world models typically requires large compute clusters.

The leading cloud computing platforms are all betting big on democratizing AI and ML. Over the past three years, Amazon, Google, and Microsoft have made considerable investments in artificial intelligence (AI) and machine learning, from introducing new services to carrying out major reorganizations that position AI strategically in their organizational structures. Google CEO Sundar Pichai has even said that his company is moving to an "AI-first" world.

That said, as data science teams grow, cloud usage will become more prominent. Big teams will demand an undisturbed, high-performing platform where they can create and share machine learning projects, and on which they can compare and optimize model performance.
This is where the cloud comes in very handy, providing a centralized place to keep all the big data and all the ML models built on top of it.

Another argument to take into consideration is machine learning project reusability.
As teams change drastically and quickly nowadays, it is essential to have machine learning models deployed in the cloud. The advantage over models deployed on in-house servers is the ease of granting access to new team members without jeopardizing the company's security protocols. That means a new team member can be up and running within the first day on the team, see the machine learning models developed by their predecessors, and reuse some of them to build new projects. That already adds a lot of value.

Some great machine learning platforms in the cloud available today, examples of niche IaaS for ML, are:

  • IBM Machine Learning for z/OS
  • Amazon EC2 Deep Learning AMI backed by NVIDIA GPUs
  • Google Cloud TPU
  • Microsoft Azure Deep Learning VM based on NVIDIA GPUs
  • IBM GPU-based Bare Metal Servers

Read more:

Machine learning in the cloud

Machine Learning in the cloud

As machine learning (ML) and artificial intelligence (AI) become more prevalent, data logistics will be vital to your success.
When building machine learning projects, most of the effort required for success is not the algorithm, the model, the framework, or the learning itself. It's the data logistics. Perhaps less exciting than these other facets of ML, it is the data logistics that drive performance, continuous learning, and success. Without data logistics, your ability to continue to refine and scale is severely limited.

Data logistics is key for success in your Machine Learning and AI Projects

Great data logistics does more than drive efficiency. It is essential to reducing costs now and boosting agility in the future. As ML and AI continue to develop and expand into more business processes, businesses must not allow early successes to become limitations or problems in the long term. In a paper by Google researchers (Machine Learning: The High Interest Credit Card of Technical Debt), the authors point out that although it is easy to spin up ML-based applications, the effort can result in expensive data dependencies. Good data logistics can ease the difficulty of managing these intricate data dependencies and so avoid hindering agility in the future. Using an appropriate structure like this can also ease deployment and administration and allow these applications to evolve in ways that are difficult to predict precisely today.

When building Machine Learning and AI projects, Keep It Simple to Start

In the coming years, we'll see a shift from complex, data-science-heavy implementations to an expansion of efforts best described as KISS (Keep It Simple to Start). Domain experience and data will be the drivers of AI processes that evolve and improve as experience grows. This strategy offers an additional benefit: it also improves the productivity of existing personnel alongside costly, hard-to-find, -hire, and -retain data scientists.

This approach also removes the worry over choosing "just the right tools." It is a fact of life that we need several tools for AI. Building around AI the proper way allows continual adaptation to take advantage of new AI tools and algorithms as they appear. Don't stress over performance, either (including for applications that need to stream data in real time), because there are constant advances on that front. For instance, NVIDIA recently announced RAPIDS, an open source data science initiative that leverages GPU-based processing to make the development and training of models both easier and faster.

Multi-Cloud Deployments will become standard practice

To be fully agile for whatever the future may hold, data platforms will need to support the full range of diverse data types, including files, objects, tables, and events. The system must make input and output data available to any kind of application anywhere. Such agility will make it possible to fully utilize the global resources available in a multi-cloud setting, thereby empowering organizations to achieve the cloud's full potential to optimize for performance, cost, and compliance requirements.

Organizations will move to deploy a common data platform to synchronize and drive convergence of (and also preserve) all data across all deployments, and, through a global namespace, provide a view into all data, wherever it is. A common data platform across multiple clouds will also make it easier to explore different services for a range of ML and AI demands.

As companies broaden their use of ML and AI across multiple industries, they will need to access the full range of data sources, types, and structures on any cloud while avoiding the creation of data silos. Achieving this outcome will lead to deployments that go beyond a data lake, and this will mark the increased proliferation of global data platforms that can span data types and locations.

Analytics at the Edge Will Become Strategically Crucial

As the Internet of Things (IoT) continues to expand and evolve, the ability to unite edge, on-premises, and cloud processing atop a common, global data platform will become a strategic imperative.

A distributed ML/AI architecture capable of coordinating data collection and processing at the IoT edge removes the need to send large quantities of data over the WAN. This ability to filter, aggregate, and analyze data at the edge also promotes faster, more reliable processing and can lead to better local decision making.

Organizations will aim to have a common data platform, from the cloud core to the enterprise edge, with consistent data governance to ensure the integrity and security of all data. The data platform chosen for the cloud core will, therefore, need to be sufficiently extensible and scalable to handle the complexities associated with distributed processing at a scattered and dynamic edge. Enterprises will place a premium on a "lightweight" yet capable and compatible version appropriate for the compute power available at the edge, especially for applications that must deliver results in real time.

A Final Word

In the coming years we will see an increased focus on AI and ML development in the cloud. Enterprises will keep it simple to start, avoid dependencies with a multi-cloud global data platform, and empower the IoT edge so ML/AI initiatives deliver more value to the business now and well into the future.

More reads:

Where does a Data Scientist sit among all that Big Data

Predictive analytics, the process of building it

Advanced Data Science

Predictive Analytics from research and development to a business maker

The start of predictive analytics and machine learning

Predictive analytics started in the early 90s with pattern recognition algorithms—for example, finding similar objects. Over the years, things have evolved into machine learning. In the workflow of data analysis, you collect data, prepare data, and then perform the analysis. If you employ algorithms or functions to automate the data analysis, that’s machine learning.
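The collect, prepare, analyze workflow described above can be sketched as three chained functions (the data and field handling are invented for illustration; automating the last step is where machine learning comes in):

```python
def collect():
    """Stand-in for pulling raw records from a source system."""
    return [" 12 ", "15", None, "9", "18 "]

def prepare(raw):
    """Drop missing entries and coerce the rest to numbers."""
    return [int(r.strip()) for r in raw if r is not None]

def analyze(values):
    """The analysis step: here, just an average."""
    return sum(values) / len(values)

result = analyze(prepare(collect()))
print(result)  # (12 + 15 + 9 + 18) / 4 = 13.5
```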

Read more about the process of building data analysis.

Read More »

How to boost your Machine learning model accuracy

boosting predictive machine learning algorithms

There are multiple ways to boost your predictive model's accuracy. Most of these steps are really easy to implement, yet for many reasons data scientists fail to do proper data preparation and model tuning. In the end, they end up with average or below-average machine learning models.
Having domain knowledge gives you the best possible chance of improving your machine learning model's accuracy. However, if every data scientist follows these simple technical steps, they will end up with great model accuracy even without being an expert in a particular field.

Read More »

Data scientists: How can I use my data?

I have been asked numerous times: "I got access to the database; can you please tell me how to use the data?"

Since I have been asked this more and more, I took some time to answer it here and help people with this question.

What do you want to do with your data?

First and foremost, what are you trying to do with the data?
Ask yourself, your manager, a friend, or whoever is asking you to do something with the data: what do you want the data to show you?
Most of the time, the data is only as powerful as your understanding of it. Here is how you can build that understanding:

Understand how your data is linked

Databases, whether relational or non-relational, have schemas. A schema shows where specific attributes are stored, in which tables or objects, and also how tables or objects are connected to each other: the linking.

What is an attribute? An attribute is basically anything descriptive: name, surname, address, profit, etc.
What is a table or object? A table or object is the structure that holds or groups the attributes.
What are links or keys? Links, keys, and foreign keys are the information that allows you to link one table to another. For example, to link profit to a salesperson, or an address to a person, you join the tables. Mostly this is done using foreign keys or by joining multiple attributes, creating a composite key.
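To make the joining idea concrete, here is a minimal sketch using SQLite from Python. The schema, a `person` table and a `sale` table linked by a foreign key, is invented for illustration:

```python
import sqlite3

# Two hypothetical tables: sale rows point at person rows via a foreign key.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sale (person_id INTEGER REFERENCES person(id), profit REAL);
    INSERT INTO person VALUES (1, 'Ana'), (2, 'Bo');
    INSERT INTO sale VALUES (1, 100.0), (1, 50.0), (2, 70.0);
""")

# The JOIN links each profit figure to its salesperson through the key.
rows = con.execute("""
    SELECT p.name, SUM(s.profit)
    FROM sale s JOIN person p ON s.person_id = p.id
    GROUP BY p.name ORDER BY p.name
""").fetchall()
print(rows)  # [('Ana', 150.0), ('Bo', 70.0)]
```

The same pattern, follow the key, join, aggregate, underlies most day-to-day analytical queries.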

How to learn the links between the data

Some people try to learn the schema all at once, looking at the tables, their attributes, and how they link to each other.
My suggestion is to learn by doing. In today's world of big data, database systems are getting too complex to learn all at once, and most of the time you won't need to know it all. Learning through practical use cases helps you understand not just the table structure but also the underlying data.

How do I get the data?

Usually we use SQL to query databases. SQL is the fastest and best-performing way to do it.

Other ways include code: Java, .NET, R, Python, and so on.
You can also query data easily from Excel while creating a Pivot table.
Lately, data scientists use tools like KNIME and Alteryx to fetch data. This approach does not require knowing any query language, but you risk downloading gigabytes of data into memory or onto disk if the table you are querying is that big.

Use your imagination.

Once you succeed in getting your data, start using your imagination and think of useful use-case scenarios that will help your business.

Visualize your data

Plain data is boring, so we visualize it.

An easy way to do that is with Excel. Excel is really powerful by itself and can create pretty charts.
Some other popular tools are Tableau, QlikView, Jaspersoft, and Highcharts.
Of course, there are endless other solutions that can create pretty charts.
One word of caution about visualization: don't over-visualize things, because they can become really confusing. Also, try to use a few colors instead of the whole color palette, so other people can follow you.


Create a story with your data

Now that you have your use case and a cool visualization, try to create a story.
People will understand what you want to say, and may even get new ideas, if you present your analysis as a nice story.
Happy data mining, now that you know what to do with your data!

Modeling marketing multi-channel attribution in practice

multi channel attribution.png

What is the next step I need to take to close this deal? What will this customer ask for next, and how can I offer it to him? What is the shortest path to closing a deal?
How much do all my marketing and sales activities really cost? How much does a single action or marketing channel cost?

All these are common questions marketing and sales face on a daily basis.

Luckily, there is an answer.

Here I present the advantages of using machine learning models to produce multi-channel attribution models. I strongly recommend you read it. Read More »

Data Science Platforms

What is a good platform for Data science

When you think about becoming a Data Scientist, one of the first questions that will come to your mind is: What do I need to start? What tools do I need?

Well, today I’ll share my secret tools, and in multiple series of posts, I’ll try to make you proficient in it.

Basically, I don't use super fancy tools like the ones on CSI.

I use Excel, Notepad, R, and Python; the latter two are really popular nowadays. I have also used the full Microsoft BI stack (SSRS, SSIS, SSAS), Jena, Encog, RapidMiner, and more.
Starting at my latest company, I was introduced to a little gem called KNIME.

Surprisingly enough, KNIME was already rated really well by that time, but I hadn't had the time to explore it before.

In the latest Gartner report, KNIME sits on the far right among well-established industry leaders like IBM, SAS, and RapidMiner.

But being completely open source, unlike its peers, KNIME really outshines the others.

Gartner Magic Quarter.png

What Gartner Says about KNIME:

KNIME (the name stands for “Konstanz Information Miner”) is based in Zurich, Switzerland. It offers a free, open-source, desktop-based advanced analytics platform. It also provides a commercial, server-based solution providing additional enterprise functionality that can be deployed on-premises or in a private cloud. KNIME competes across a broad range of industries but has a large client base in the life sciences, government and services sectors.

  • Almost every KNIME customer mentions the platform’s flexibility, openness, and ease of integration with other tools. Similar to last year, KNIME continues to receive among the highest customer satisfaction ratings in this Magic Quadrant.
  • KNIME stands out in this market with its open-source-oriented go-to-market strategy, large user base and active community — given the small size of the company.
  • Many customers choose KNIME for its cost-benefit ratio, and its customer reference ratings are among the highest for good value.
  • The most common customer complaints are about the outdated UI (which was recently updated in version 3.0 in October 2015, so few customer references have seen it) and a desire for better-performing algorithms for a distributed big data environment.
  • Customers also expect a high level of interactive visualizations from their tools. KNIME lacks in this area, requiring its customers to obtain this from data visualization vendors such as Tableau, Qlik or TIBCO Spotfire.
  • Some customers are looking for better insight into and communication of the product roadmap, but they do give KNIME high scores on including customer requests into subsequent product releases.

Read more here.

I strongly encourage you to download and get familiar with KNIME.

Do I use only KNIME as a Data Science Platform?

No. The beauty of KNIME is that it can easily integrate with external solutions: Weka, R, Python. KNIME is very solid for building predictive models, but sometimes I build models in R because I find libraries there that I personally think are better than KNIME's native ones.
After that, I integrate the R models into KNIME using its R node. Works like a charm.

Use Database systems too. Please.

One thing I have to be careful about while using these platforms is their memory consumption. R drains the computer's memory because it loads everything into a buffer. KNIME is similar. Therefore, I use a database system to filter the data before I load it into R or KNIME.

Database systems are a must. If you want to build effective and fast models, you need to let the database system handle the vast amount of data first and then load the result into the analytical platform. The choice of which platform to use really depends on your company policy. It is terrific if you have a distributed database system like Hadoop, where you can run SQL operations on big data and then send limited data sets to KNIME and R.

It will be fast, and it will save you from painful experiences like filling up the memory buffer.

However, if you don't have a distributed database system, a conventional database system will do as well. A major task for a data scientist is making the standard database system work with your data too 🙂
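As a sketch of letting the database do the heavy lifting first (table and columns invented for illustration), filter and aggregate in SQL so only a small result set reaches the analytics tool:

```python
import sqlite3

# A hypothetical events table standing in for a big production table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (country TEXT, amount REAL)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [("DE", 10.0), ("DE", 20.0), ("US", 5.0), ("US", 15.0), ("FR", 7.0)])

# Instead of SELECT * (every raw row), ship one aggregated row per group:
# the database applies the filter and the GROUP BY before anything leaves it.
small = con.execute("""
    SELECT country, SUM(amount) FROM events
    WHERE amount >= 10 GROUP BY country ORDER BY country
""").fetchall()
print(small)  # [('DE', 30.0), ('US', 15.0)]
```

On a real table with millions of rows, the difference between shipping the raw rows and shipping this two-row summary is exactly the memory problem described above.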

What is next?

How do you become a data scientist?

What skills does a Data Scientist need?

How can you make great reports?

What skills does a data scientist need and how to get them?

Upgrading your skills constantly is the way to stay on top.
What skills do you need to have to become a Data Scientist?
I have written about this before, but I'll try to add some more information to help the people who really want to go down that path.

Free Tools can help a lot to start!

There are many tools that can help you get started easily, at least to some extent. KNIME is one great tool I use literally every day. It is really easy to learn, and it covers 90% of the tasks you will be asked to do daily as a Data Scientist. The best part: it's free.
Check it out here:
Other similar tools: RapidMiner
The important thing is knowing what to do with the tool.
I have given numerous courses on how to use the tool and how to start with super basic DS tasks.
Understanding Basic terms can help you along the way:
What is regression and what is classification?
It is good to know how to approach a specific problem in order to solve it. Almost every problem we try to solve in the world can fall into one of these two categories.

What algorithms can be used and should be used for each problem?

This is important, but not a showstopper at the beginning. Decision trees will do just fine for a start.
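As a hedged illustration of both ideas (not code from this article), here is a tiny scikit-learn decision tree on made-up classification data; the features, labels, and scenario are invented for the example.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: [hours studied, hours slept] -> passed the exam (1) or not (0).
X = [[1, 4], [2, 5], [8, 7], [9, 8], [7, 6], [1, 3]]
y = [0, 0, 1, 1, 1, 0]

# A shallow tree is enough for such a simple pattern.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Predict for a new student who studied 6 hours and slept 7.
print(clf.predict([[6, 7]]))
```

Swapping in `DecisionTreeRegressor` with numeric targets turns the same setup into a regression problem, which is exactly the distinction above.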
How to do:

Data Cleaning or Transformation

This is one of the most important things you'll come across working in Data Science. 90% of the time, you are not going to get well-formatted data. If you are skilled in one of the programming languages, Python or R, you should be a pro at packages like pandas or dplyr/reshape2.
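A minimal pandas sketch of what such cleaning typically looks like, with invented column names and values:

```python
import pandas as pd

# Messy, made-up raw data: inconsistent strings and a missing value.
raw = pd.DataFrame({
    "name": ["  Alice", "BOB ", "carol"],
    "age": ["34", None, "29"],
})

clean = raw.copy()
clean["name"] = clean["name"].str.strip().str.title()    # normalise text
clean["age"] = pd.to_numeric(clean["age"])               # strings -> numbers
clean["age"] = clean["age"].fillna(clean["age"].mean())  # impute missing value

print(clean)
```

The equivalent in R would lean on dplyr verbs like `mutate` and `coalesce`; the steps are the same regardless of language.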
Exploratory Data Analysis
I have written before about how you can start using the data. Check this link to get an idea.
Once again, this is one of the most important parts: whether you are extracting insights or doing predictive modeling, this step comes in. You must train your mind to think analytically and form an image of the variables in your head; you build such a mindset through practice. After that, you must be very good hands-on with packages like matplotlib or ggplot2, depending on the language you work with.
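For a flavor of what a first EDA pass looks like in Python, here is a minimal matplotlib sketch on synthetic data; the variable, its parameters, and the file name are made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # no display needed; write the plot to a file
import matplotlib.pyplot as plt
import numpy as np

# Made-up variable: daily sales figures drawn from a normal distribution.
rng = np.random.default_rng(0)
sales = rng.normal(loc=100, scale=15, size=500)

# First look at the numbers before any modelling.
print(f"mean={sales.mean():.1f} std={sales.std():.1f}")

# And a first look at the shape of the distribution.
plt.hist(sales, bins=30)
plt.xlabel("daily sales")
plt.ylabel("frequency")
plt.savefig("sales_hist.png")
```

In R, `ggplot2::geom_histogram` plays the same role; the point is to look at the data before modelling it.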

Machine Learning / Predictive Modelling

One of the most important aspects of today's data science is predictive modeling. This builds on your EDA and your knowledge of mathematics. I must urge you to invest your time in theory. The more theoretical knowledge you have, the better you are going to do. There is no easy way around it. There's a great course by Andrew Ng that goes deep into theory. Take it.

Programming Languages

If you want to go more advanced, it is important to have a grip on at least one programming language widely used in Data Science. But you should also know a little of another language: either know R very well and some Python, or Python very well and some R.
Take my case: I know R very well (at least I think so), but I can work my way around Python too (not at expert level), as well as Java, C#, and JavaScript. Anything works if you know how to use it when you need it.
An example of a complete data analysis done by a Data Scientist can be found here.
I use KNIME, R, and Python every day. If you are a total beginner, I think it's a good idea to start with KNIME.

Useful courses for learning Data Science

I really recommend spending some time on the following courses:
I have taken them myself and learned a lot from each of them.
Happy learning!
Image credit: House of bots

Machine learning in practice – Let the machine find the optimal number of clusters from your data


What is Clustering?

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

Where Clustering is used in real life: 

Clustering is used almost everywhere: search engines, marketing campaigns, biological analysis, cancer analysis. Your favorite phone provider runs cluster analysis to see which group of people you belong to before deciding whether to give you an additional discount or a special offer. The applications are countless.

How can I find the optimal number of clusters?

One fundamental question is: if the data is clusterable, how do you choose the right number of expected clusters (k)?
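One common, simple answer is to try several values of k and let a quality measure pick the winner. The sketch below (an illustrative scikit-learn example on synthetic data, not the article's own method) uses the silhouette score for that purpose:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with a known structure of 3 groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Try several candidate k values and score each clustering.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The machine picks the k with the best silhouette score.
best_k = max(scores, key=scores.get)
print(best_k)
```

The silhouette score ranges from -1 to 1, rewarding tight, well-separated clusters; the elbow method on within-cluster variance is a popular alternative.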



Predictive analytics, the process of building it

What is Predictive Analytics?

I have talked with many people with different levels of technical knowledge, and many times I have been asked questions like: So, can predictive analytics tell my future? The sad answer is NO. Predictive analytics will not tell you for certain whether you are going to be rich. It will not guarantee 100% that your favorite team will win so you can put all your savings on a bet. Nor will it tell you for sure where you will end up next year.

However, predictive analytics can definitely forecast and give hints about what might happen in the future with an acceptable level of reliability, and it can include risk assessment and what-if scenarios.

It is the process of extracting information from a dataset in order to find patterns or predict future outcomes from that data.

What does the process of implementing Predictive Analytics include?

From having an idea to implementing a predictive model and being able to interpret it, there are a number of steps that need to be taken care of.

Know the business problem

It is really important to know the scope of the business that you are building a model for. Many people think it is enough to apply statistical methods and some machine learning algorithm to a big chunk of data and the model will give you an answer by itself. Unfortunately, this is not the case. Most of the time you will have to complement your data set by producing new metrics out of the already existing data, and for that you will have to know at least the essentials of the business.

First and foremost, it is essential to identify your business problem. After that, you can successfully determine what metrics are necessary to address your problem and then you can decide which analysis technique you will use.

Transforming and extracting raw data

While trying to build a predictive model you will spend a lot of time preparing the data in the best possible way. That includes handling different data sources like web APIs, unstructured data (usually collected from weblogs, Facebook, Twitter, etc.), different database engines (MSSQL, MySQL, Sybase, Cassandra, etc.), and flat files (comma-separated values (CSV), tab-delimited files, etc.). Therefore, knowledge of database structures, ETL, and general computer science is really useful. In some cases you might be lucky enough to have a separate team that provides these services for you and delivers a nicely formatted file you can work on, but in most cases you still have to do the data transformations yourself.

Tools usually used for these kinds of transformations are SSIS from Microsoft, SAP Data Services, IBM InfoSphere Information Server, SAS Data Management, and Python. Many times I have seen ETL processes made purely in C# (when application developers are given BI tasks), purely in stored procedures (when database developers are given BI tasks), or in R (when statisticians try to perform ETL).

Exploratory data analysis

Exploratory data analysis is an approach to analyzing data sets in order to summarize their main characteristics, often using visual methods. A statistical model may or may not be used, but primarily, exploratory data analysis is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

Exploratory data analysis VS Summary data analysis

A summary analysis is simply a numeric reduction of a historical data set. It is quite passive; its focus is on the past. Its intent is simply to arrive at a few key statistics, like the mean and standard deviation, which may then either replace the data set or be appended to it in the form of a summary table.
The purpose of exploratory data analysis, in contrast, is to gain insight into the engineering/scientific process behind the data. Whereas summary statistics are passive and historical, exploratory data analysis is active and forward-looking: in an effort to understand the process and improve it in the future, it uses the data as a window to peer into the heart of the process that generated it.

Building a predictive model

After successfully identifying the problems this model needs to solve, it is time to write some code and do some testing in order to build a model that will predict outcomes for the data we anticipate receiving in the near future, based on the data we already have.
This is the time to apply some machine learning. You will basically implement algorithms like ANOVA, ARIMA, decision trees, KNN, etc., depending on the problem you are trying to solve and the performance each algorithm gives for the specific data you have.
At this stage, we should evaluate algorithms by developing a test harness and a baseline accuracy from which to improve. The second step is to leverage the results to develop more accurate models.
There are many ways to choose the right algorithm for a model, depending on the scope of the problem. Most of the time the prediction model is improved by combining more than one algorithm, known as blending. For example, the predictive model that won the $1 million prize from Netflix for movie recommendations blends more than 100 different models into one.
Popular tools used nowadays are R, Python, Weka, RapidMiner, Matlab, IBM SPSS, and Apache Mahout.
I will write more about choosing the right algorithm for a specific problem in another article.
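The test-harness-plus-baseline step described above can be sketched with scikit-learn. The dataset and the two models here are illustrative placeholders, not the article's own choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A bundled example dataset stands in for real business data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Baseline: always predict the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Candidate model, which must beat the baseline to be worth keeping.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print(f"baseline={baseline.score(X_test, y_test):.2f} "
      f"model={model.score(X_test, y_test):.2f}")
```

The held-out test set is the harness; any candidate algorithm, or a blend of several, is judged by how far it improves on the trivial baseline.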

Presenting the outcome

At this stage, we need to come up with a way of presenting the results that the predictive model has generated. This is where good data visualization practices come in handy. Most of the time the results are presented as a report or just an Excel spreadsheet, but lately I see increasing demand for interactive dashboards where the user can see the data from many perspectives instead of one.
At this stage, we should be careful about how we present the data, since executives and people who need to make strategic decisions are not necessarily very technical. We must make sure they have a good understanding of the data. Asking for the help of a graphic designer, or reading more about how to use colors and shapes, will be really useful and rewarding in the end.
Some popular visualization platforms that can provide interactive dashboards are MicroStrategy, PerformancePoint on SharePoint, Tableau, QlikView, Logi Analytics, SAP, SAS, and offerings from Big Blue, IBM.

As you can see, the process of building a predictive model has a much bigger scope than just applying some fancy algorithms and mathematical formulas. Therefore a successful data scientist should have an understanding of business problems and business analysis, to better grasp what the data is really saying; some computer science skills, to perform the extraction and transformation of the different data sources; and some statistical knowledge, to apply data sampling, better understand the predictive models, do hypothesis testing, etc.
You might think that a person with such broad knowledge does not exist, but in fact they do. That is why data scientists are so appreciated lately.

Where does a Data Scientist sit among all that Big Data

In the past years we have all been witnesses to the growing demand for machine learning, predictive analytics, and data analysis. Why is that so?
Well, it is quite simple. As E.O. Wilson said,
“We are drowning in information, while starving for wisdom. The world henceforth will be run by synthesizers, people able to put together the right information at the right time, think critically about it, and make important choices wisely.”
Every possible device I can think of is generating some data feed. But what to do with this data is always the big question.
Business people will surely have sparks of ideas, marketing people will try to sell the perfect one-size-fits-all models for optimizing your marketing campaigns and bringing in all the customers you need, business controllers will try to predict what sales are going to look like in the next quarter, and so on. All that by using the raw data generated by everything around us.
The big question is: how do you do all this smart analysis?
That is where all the fancy names and companies with shiny solutions come into the game. Of course, that comes with a price, usually a big one.
But on the other side of the game, this is where all the smart people come in. Lately, these people are called Data Scientists, Data Analysts, BI professionals, and all possible varieties of names.

What does Data Scientist / Data Analyst really do?

Working as a part of already established Data Science team

Most of them do tasks like pulling data out of MySQL, MSSQL, Sybase and all other databases on the market, becoming a master at Excel pivot tables, and producing basic data visualizations (e.g., line and bar charts). You may on occasion analyze the results of an A/B test or take the lead on your company’s Google Analytics account.
You will also have to build predictive models that forecast your clients' mood, possible new market openings, product demand, or time series.

Basic Statistics

This is where the basic statistics come really handy.
You should be familiar with statistical tests, distributions, maximum likelihood estimators, etc.
This will also be the case for machine learning, but one of the more important aspects of your statistics knowledge will be understanding when different techniques are (or aren’t) a valid approach. Statistics are important in all company types, but especially data-driven companies where the product is not data-focused and product stakeholders will depend on your help to make decisions and design / evaluate experiments.
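As a small illustration of why those basics matter, here is a hedged SciPy sketch of a two-sample t-test on invented A/B-test data; in practice, checking the test's assumptions (normality, equal variances, sample independence) is part of knowing when the technique is valid.

```python
import numpy as np
from scipy import stats

# Made-up experiment: some metric measured for two page variants.
rng = np.random.default_rng(1)
variant_a = rng.normal(loc=10.0, scale=2.0, size=200)
variant_b = rng.normal(loc=10.8, scale=2.0, size=200)

# Two-sample t-test: is the difference in means plausibly just noise?
t_stat, p_value = stats.ttest_ind(variant_a, variant_b)
print(f"t={t_stat:.2f} p={p_value:.4f}")
```

A small p-value suggests the difference between the variants is unlikely to be chance; the stakeholder-facing judgment is whether the effect size matters for the business.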

Machine Learning

If you're dealing with huge amounts of data, or working at a company where the product itself is especially data-driven, you'll want to be familiar with machine learning methods. Classical statistical methods might not always work here, and you might face situations where you need to work with all the data instead of a sample of the whole data set, as you would in a conventional statistical approach.
This can mean things like ARIMA models, neural networks, SVMs, VARs, k-nearest neighbors, decision trees, random forests, ensemble methods, all the fancy machine learning words. It's true that a lot of these techniques can be implemented using R or Python libraries, so it's not necessarily a deal breaker if you're not the world's leading expert on how the algorithms work. More important is to understand the broad strokes and really know when it is appropriate to use each technique. There is a lot of literature that can help you get up to speed with R and Python in real-world scenarios.

Establishing Data Science team

Nowadays, a number of companies are getting to the point where they have an increasingly large amount of data, and they're looking for someone to set up much of the data infrastructure the company will need moving forward. They're also looking for someone to provide analysis. You'll see job postings listed under both "Data Scientist" and "Data Engineer" for this type of position. Since you'd be (one of) the first data hires, there is likely a lot of low-hanging fruit, making it less important that you're a statistics or machine learning expert.
A data scientist with a software engineering background might excel at a company like this, where it’s more important that a data scientist make meaningful data-like contributions to the production code and provide basic insights and analyses.
At this point, be ready to bootstrap servers, install new virtual machines, set up networks, do plain DBA work, install Hadoop, and set up Oozie, Flume, Hive, etc. Many times I have been asked to set up SharePoint or web servers so I could set up PerformancePoint as part of the reporting solution.
When establishing a Data or BI team in a company, you should be ready for literally every other IT task you can imagine (especially if you work in a startup), so a broad range of skills is really welcome here.
Expect to spend at least the first year working mainly on infrastructure and legacy items instead of crunching data and producing shiny insights and reports.

Keeping up to date

There is certainly a lot of potential in this profession, and with that, the expectations your company has of you grow exponentially.
As the industry becomes more inclined toward data analysis, you, working as a Data Analyst/Scientist, will be challenged to read all the time, whether it is business news or literature that helps you build or improve your models, reports, or work in general. There is a lot of research in this field and a lot of new books coming out with "Data" in the title.
Data scientists today are akin to Wall Street “quants” of the 1980s and 1990s. In those days people with backgrounds in physics and math streamed to investment banks and hedge funds, where they could devise entirely new algorithms and data strategies. Then a variety of universities developed master’s programs in financial engineering, which churned out a second generation of talent that was more accessible to mainstream firms. The pattern was repeated later in the 1990s with search engineers, whose rarefied skills soon came to be taught in computer science programs.
One question raised by this is whether some firms would be wise to wait until that second generation of data scientists emerges, and the candidates are more numerous, less expensive, and easier to vet and assimilate in a business setting. Why not leave the trouble of hunting down and domesticating exotic talent to the big data start-ups and to firms whose aggressive strategies require them to be at the forefront?
The problem with that reasoning is that the advance of big data shows no signs of slowing. If companies sit out this trend’s early days for lack of talent, they risk falling behind as competitors and channel partners gain nearly unassailable advantages. Think of big data as an epic wave gathering now, starting to crest. If you want to catch it, you need people who can surf.