Guided analytics is the future of Data Science and AI

Nowadays, people take for granted the added value that Siri, Google Assistant, or Alexa bring to their lives for all sorts of things: answering odd trivia questions, checking the weather, ordering groceries, getting driving directions, turning on the lights, and even sparking a dance party in the kitchen. These are wonderfully useful (and often fun) AI-based assistants that have improved people's lives. Nevertheless, human beings are not having deep, meaningful conversations with these devices. Instead, automated assistants simply address the specific requests made of them. If you are exploring AI and machine learning in your enterprise, you may have come across the claim that, if fully automated, these technologies can replace data scientists entirely. It's time to rethink that assertion.

The Issue with Fully Automated Analytics

How do all these driverless, automated, self-running AI and machine learning systems fit into the enterprise? Their objective is either to encapsulate (and hide) existing data scientists' expertise or to apply advanced optimization schemes to the fine-tuning of data science tasks.

Automated systems can be useful if no in-house data science expertise is available, but they are also somewhat limiting. Business experts who depend on data to do their jobs get locked into the prepackaged expertise and a limited set of hard-coded scenarios.

In my experience as a data scientist, automation tends to miss the most crucial and fascinating pieces, which can be very important in today's extremely competitive marketplace. If data scientists are allowed to take a somewhat more active role and guide the analytics process, however, the possibilities open up considerably.

Why a Guided Analytics Approach Makes Sense

For companies to get the most out of AI and data science, to effectively anticipate future outcomes and make better business decisions, fully automated data science sandboxes need to be left behind. Instead, enterprises need to set up interactive exchanges between data scientists, business analysts, and the machines doing the work in the middle. This requires a process referred to as "guided analytics," in which human feedback and guidance can be applied whenever required, even while an analysis is in progress.

The objective of guided analytics is to enable a team of data scientists with different preferences and skills to collaboratively build, maintain, and continuously refine a set of analytics applications that offer business users varying degrees of interaction. Put simply, all stakeholders work together to create a better analysis.

Companies that wish to create a system that facilitates this type of interaction while still delivering a practical analytics application face a huge, but not insurmountable, obstacle.

Common Attributes

I have identified four common attributes that help data scientists successfully develop the right environment for the next type of smart applications, the ones that will help them obtain real business value from AI and machine learning.

Applications that offer business users just the right amount of guidance and interaction allow groups of data scientists to pool their expertise collaboratively. When certain properties come together, data scientists can build interactive analytics applications with real adaptive potential.

The ideal environment for guided analytics shares these four characteristics:

Open: Applications shouldn't be burdened with restrictions on the kinds of tools used. With an open environment, collaboration can occur between scripting gurus and those who want to reuse their expertise without diving into their code. It is also a plus to be able to connect to other tools for specific data types, as well as to interfaces specialized for high-performance or big data algorithms (such as H2O or Spark), from within the same environment.

Agile: Once the application is deployed, new demands will emerge rapidly: more automation here, more customer feedback there. The environment used to develop these analytics applications also needs to make it easy for other members of the data science team to quickly adapt existing analytics applications to new and changing requirements, so they continue to yield meaningful results over the long term.

Versatile: Under the hood, the environment should also be able to run simple regression models or manage complex parameter optimization and ensemble models, ranging from one to thousands of models. It's worth noting that this piece (or at least some elements of it) can be hidden entirely from the business user.

Uniform: At the same time, the experts doing the data science should be able to perform all their work in the same environment. They need to blend data, run the analysis, mix and match tools, and build the infrastructure to deploy the resulting analytics applications, all from that same intuitive and agile environment.

Some AI-based applications will merely provide an overview or forecast at the press of a button. Others will allow the end user to select the data sources to be used. Still others will ask the user for feedback that ends up improving the model(s) trained under the hood, factoring in the user's knowledge. Those models can be simple or arbitrarily complex ensembles or entire model families, and the end user may or may not be asked to help fine-tune that setup. Control over how much of such interaction is offered needs to rest in the hands of the data scientists who designed the underlying analytics process with their target audience, the actual business users' interests (and abilities), in mind.

Putting It into Practice

The big question you may be asking is: how do I do this in my organization? You might think it is not realistic for your team to build this on its own; you are resource-constrained as it is. The good news is that you do not have to.

Software, particularly open source software, is available that makes it practical to implement guided analytics. Using it, teams of data scientists can collaborate through visual workflows. They can give their business analyst colleagues access to those workflows through web interfaces. Moreover, there is no need to use another tool to develop a web application; the workflow itself models the interaction points that make up an analytics application. Workflows are the glue holding it all together: the different tools used by different members of the data science team, the data blended from numerous sources by the data engineering experts, and the interaction points modeling the UI components visible to the end user. It is all easily within your grasp.

Guided analytics in the coming years

Interest in guided analytics is growing, allowing users not only to wrangle data but also to fine-tune their analyses. It is exciting to see how much collaboration this triggers. It will also be fascinating to watch data scientists build increasingly practical analytics applications that help users develop analyses with real business impact.

Instead of taking experts out of the driver's seat and trying to automate their wisdom, guided analytics aims to combine the best of both worlds. This is good for data scientists, business analysts, and the practice of data analytics in general. Ultimately, it will be necessary for innovation too. Although it might appear challenging now, the effort will be worth it to ensure a better future.

Machine learning in the cloud


As machine learning (ML) and artificial intelligence (AI) become more prevalent, data logistics will be vital to your success.
When building machine learning projects, most of the effort required for success is not the algorithm, the model, the framework, or the learning itself. It's the data logistics. Perhaps less glamorous than these other facets of ML, it is the data logistics that drive performance, continuous learning, and success. Without good data logistics, your ability to keep refining and scaling is severely limited.

Data logistics is key to success in your Machine Learning and AI projects

Good data logistics does more than drive efficiency. It is essential to reducing costs now and boosting agility in the future. As ML and AI continue to develop and expand into more business processes, businesses must not allow early successes to become limitations or problems in the long term. In a paper by Google researchers (Machine Learning: The High Interest Credit Card of Technical Debt), the authors point out that although it is easy to spin up ML-based applications, the effort can result in expensive data dependencies. Good data logistics can mitigate the difficulty of managing these complex data dependencies so they do not hinder agility in the future. Using an appropriate framework can also ease deployment and administration and allow these applications to evolve in ways that are difficult to predict precisely today.

When building Machine Learning and AI projects, Keep It Simple to Start

In the coming years, we'll see a shift from complex, data-science-heavy implementations to an expansion of efforts best described as KISS (Keep It Simple to Start). Domain experience and data will be the drivers of AI processes that evolve and improve as experience grows. This strategy offers an additional benefit: it also improves the productivity of existing personnel as well as of costly, hard-to-find, hard-to-hire, and hard-to-retain data scientists.

This approach also removes the worry over choosing "just the right tools." It is a fact of life that we need multiple tools for AI. Building around AI the proper way allows continual adaptation to take advantage of new AI tools and algorithms as they appear. Don't stress over performance, either (including for applications that need to stream data in real time), because there is constant progress on that front. For instance, NVIDIA recently announced RAPIDS, an open source data science initiative that leverages GPU-based processing to make the development and training of models both easier and faster.
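
As a rough, hypothetical sketch of what GPU-accelerated, scikit-learn-style code looks like with RAPIDS (assuming the cudf and cuml packages and a compatible GPU are available; the file and column names are made up):

```python
# Minimal sketch of GPU-accelerated data science with RAPIDS (cudf + cuml).
# Assumes a CUDA-capable GPU and the RAPIDS libraries are installed;
# "transactions.csv" and its columns are hypothetical.
import cudf
from cuml.cluster import KMeans

# Load and prepare data on the GPU instead of in host memory
df = cudf.read_csv("transactions.csv").dropna()
features = df[["amount", "frequency"]]

# Train a clustering model directly on GPU-resident data; the API mirrors scikit-learn
model = KMeans(n_clusters=5, random_state=42)
df["segment"] = model.fit_predict(features)

print(df["segment"].value_counts())
```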

Multi-Cloud Deployments Will Become Standard Practice

To be fully agile for whatever the future may hold, data platforms will need to support the complete range of diverse data types, including files, objects, tables, and events. The platform must make input and output data available to any application anywhere. Such agility will make it possible to fully utilize the global resources available in a multi-cloud setting, thereby empowering organizations to achieve the cloud's full potential to optimize performance, cost, and compliance requirements.

Organizations will move to deploy a common data platform to synchronize and converge (and also preserve) all data across all deployments, and, through a global namespace, provide a view into all data wherever it resides. A common data platform across multiple clouds will also make it easier to explore different services for a range of ML and AI needs.

As companies broaden their use of ML and AI across industries, they will need to access the full range of data sources, types, and structures on any cloud while avoiding the creation of data silos. Achieving this outcome will lead to deployments that go beyond a data lake, and it will mark the increased proliferation of global data platforms that can span data types and locations.

Analytics at the Edge Will Become Strategically Important

As the Internet of Things (IoT) continues to expand and evolve, the ability to unite edge, on-premises, and cloud processing atop a common, global data platform will become a strategic imperative.

A distributed ML/AI architecture capable of coordinating data collection and processing at the IoT edge removes the need to send large quantities of data over the WAN. This ability to filter, aggregate, and analyze data at the edge also promotes faster, more efficient processing and can lead to better local decision making.

Organizations will aim to have a common data platform, from the cloud core to the enterprise edge, with consistent data governance to ensure the integrity and security of all data. The data platform chosen for the cloud core will therefore need to be sufficiently extensible and scalable to handle the complexities associated with distributed processing at a scattered and dynamic edge. Enterprises will place a premium on a "lightweight" yet capable and compatible version suitable for the compute power available at the edge, especially for applications that must deliver results in real time.

A Final Word

In the coming years we will see an increased focus on AI and ML development in the cloud. Enterprises will keep it simple to start, avoid dependencies with a multi-cloud global data platform, and empower the IoT edge so that ML/AI initiatives deliver more value to the business now and well into the future.

More reads:

Where does a Data Scientist sit among all that Big Data

Predictive analytics, the process of building it

Advanced Data Science

Predictive Analytics from research and development to a business maker

The start of predictive analytics and machine learning

Predictive analytics started in the early 90s with pattern recognition algorithms—for example, finding similar objects. Over the years, things have evolved into machine learning. In the workflow of data analysis, you collect data, prepare data, and then perform the analysis. If you employ algorithms or functions to automate the data analysis, that’s machine learning.

Read more about the process of building data analysis.


Personalization – How much do you understand your customer?

Every day we are trying to better understand our customers. Directly or indirectly. Consciously or unconsciously.

All the marketing activities organized around us, all the offers we receive in our email, apps, banner ads, billboards at bus stations, hidden messages in the last movie you watched, or prompts to buy a matching product. All these marketing activities try to show that they understand your preferences. Or they try to persuade you that the product they represent is the product you are looking for.


What skills does a data scientist need and how to get them?

Upgrading your skills constantly is the way to stay on top.
What skills do you need to become a Data Scientist?
I have written about this before, but I'll try to add some more information to help the people who really want to go down that path.

Free Tools can help a lot to start!

There are many free tools that can help you get started: KNIME is one great tool I use literally every day. It is really easy to learn and it covers 90% of the tasks you will be asked to do daily as a Data Scientist. Best of all, it's free.
Check it out here: https://www.knime.org/
Other similar tools: RapidMiner
The important thing is to know what to do with it.
I have given numerous courses on how to use the tool and how to start with the most basic DS tasks.
Understanding basic terms can help you along the way:
What is regression and what is classification?
It is good to know how to approach a specific problem in order to solve it. Almost every problem we are trying to solve falls into one of these two categories.

What algorithms can be used and should be used for each problem?

This is important, but not a showstopper at the beginning. Decision trees will do just fine for a start.
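
As a minimal illustration (using scikit-learn and its bundled iris data set), training a decision tree classifier takes only a few lines; for a regression problem you would swap in DecisionTreeRegressor and a numeric target:

```python
# Minimal example: a decision tree classifier on a toy dataset (scikit-learn).
# For regression problems, DecisionTreeRegressor works the same way with a numeric target.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```
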
You should know how to do the following:

Data Cleaning or Transformation

This is one of the most important things you'll come across working in Data Science. 90% of the time, you are not going to get well-formatted data. If you are skilled in one of the programming languages, Python or R, you should become a pro at packages like pandas or dplyr/reshape2.
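
For example, here is a minimal pandas sketch of the kind of routine cleaning this involves (the file and column names are hypothetical):

```python
# Typical small cleaning/transformation steps with pandas; file and columns are illustrative.
import pandas as pd

df = pd.read_csv("raw_customers.csv")

# Normalize column names and fix obvious type problems
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

# Handle duplicates and missing values
df = df.drop_duplicates(subset="customer_id")
df["revenue"] = df["revenue"].fillna(df["revenue"].median())

df.to_csv("customers_clean.csv", index=False)
```
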
Exploratory Data Analysis
I have written before about how you can start using the data; check this link to get an idea.
Once again, this is one of the most important parts: whether you are extracting insights or doing predictive modeling, this step comes in. You must train your mind to think analytically and build a picture of the variables in your head; you develop that ability with practice. After that, you must be good hands-on with packages like matplotlib or ggplot2, depending on the language you work with.
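
A few lines of pandas and matplotlib are usually enough to start forming that picture (again, the file and column names are hypothetical):

```python
# Quick exploratory look at a data set: summary statistics plus a couple of plots.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers_clean.csv")

print(df.describe())            # numeric summaries
print(df.isna().mean())         # share of missing values per column

df["revenue"].hist(bins=30)     # distribution of a single variable
plt.title("Revenue distribution")
plt.show()

df.plot.scatter(x="age", y="revenue")   # relationship between two variables
plt.show()
```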

Machine Learning / Predictive Modelling

One of the most important aspects of today's data science is predictive modeling. This builds on your EDA and your knowledge of mathematics. I must advise you to invest your time in theory: the more theoretical knowledge you have, the better you will do. There is no easy way around it. There is a great course by Andrew Ng that goes deep into the theory. Take it.
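
The mechanics themselves are simple; the theory tells you whether what you built is any good. Here is a minimal scikit-learn sketch of fitting a model and comparing it honestly against a naive baseline:

```python
# Minimal predictive-modelling loop: hold out data, fit, and evaluate against a baseline.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

print("Baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
print("Model accuracy:   ", accuracy_score(y_test, model.predict(X_test)))
```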

Programming Languages

If you want to go more advanced, it is important to have a grip on at least one programming language widely used in Data Science, and to know a little of another. Either know R very well and some Python, or Python very well and some R.
Take my case: I know R very well (at least I think so), but I can also work with Python (not at expert level), Java, C#, and JavaScript. Anything works if you know how to use it when you need it.
An example of a complete data analysis done by a data scientist can be found here.
I use KNIME, R, and Python every day. If you are a total beginner, I think it is a good idea to start with KNIME.

Useful courses for learning Data Science

I really recommend spending some time on the following courses:
I have taken them myself and learned a lot from each of them.
Happy learning!
Image credit: House of bots

Machine learning in practice – Let the machine find the optimal number of clusters from your data

 

What is Clustering?

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

Where Clustering is used in real life: 

Clustering is used almost everywhere: search engines, marketing campaigns, biological analysis, cancer research. Your favorite phone provider runs cluster analysis to see which group of people you belong to before deciding whether to give you an additional discount or special offer. The applications are countless.


How can I find the optimal number of clusters?

One fundamental question is: if the data is clusterable, how do you choose the right number of expected clusters (k)?
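
One common, lightweight approach is to let the machine score several candidate values of k and keep the best one, for example with the silhouette score. A sketch with scikit-learn on synthetic data:

```python
# Let the machine pick k: fit KMeans for several k values and keep the best silhouette score.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("Silhouette scores:", scores)
print("Suggested number of clusters:", best_k)
```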

 


Predictive analytics, the process of building it

What is Predictive Analytics?

I have talked with many people with different levels of technical knowledge, and many times I have been asked questions like: So, can predictive analytics tell my future? The sad answer is NO. Predictive analytics will not tell you for certain whether you are going to be rich. It will not guarantee 100% that your favorite team will win so you can put all your savings on a bet. And it won't tell you for sure where you will end up next year.

However, predictive analytics can definitely forecast and give hints about what might happen in the future with an acceptable level of reliability, and it can include risk assessment and what-if scenarios.

Predictive analytics is the process of extracting information from a dataset in order to find patterns in, or predict future outcomes from, that data.

What does the process of implementing Predictive Analytics include?

From having an idea to implementing a predictive model and being able to interpret it, there are several steps that need to be taken care of.

Know the business problem

It is really important to know the scope of the business that you are building a model for. Many people think that it is enough to apply statistical methods and some machine learning algorithm to a big chunk of data, and the model will give you an answer by itself. Unfortunately, this is not the case. Most of the time you will have to complement your data set by producing new metrics out of the existing data, and for that you will have to know at least the essentials of the business.

First and foremost, it is essential to identify your business problem. After that, you can determine which metrics are necessary to address it, and then decide which analysis technique you will use.

Transforming and extracting raw data

While building a predictive model you will spend a lot of time preparing the data in the best possible way. That includes handling different data sources such as web APIs, unstructured data (usually collected from web logs, Facebook, Twitter, etc.), different database engines (MSSQL, MySQL, Sybase, Cassandra, etc.), and flat files (comma-separated values (CSV), tab-delimited files, etc.). Therefore, knowledge of database structures, ETL, and general computer science is really useful. In some cases you might be lucky enough to have a separate team that provides these services for you and delivers a nicely formatted file that you can work on, but in most cases you still have to do the data transformations yourself.

Tools usually used for these kinds of transformations are SSIS from Microsoft, SAP Data Services, IBM InfoSphere Information Server, SAS Data Management, and Python. Many times I have seen ETL processes made purely in C# (when application developers are given BI tasks), purely in stored procedures (when database developers are given BI tasks), or in R (when statisticians try to perform ETL).
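
When Python is the ETL tool of choice, a stripped-down version of such an extract-transform-load step might look like this sketch (the connection string, table, and file names are placeholders):

```python
# Tiny illustrative ETL step in Python: extract from a database and a flat file,
# transform, and load a combined table. Connection details are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://user:password@sales_dsn")

orders = pd.read_sql("SELECT customer_id, amount, order_date FROM orders", engine)
customers = pd.read_csv("customers.csv")  # flat-file source

# Transform: aggregate orders per customer and join with customer attributes
monthly = (orders.groupby("customer_id", as_index=False)["amount"]
           .sum()
           .rename(columns={"amount": "total_amount"}))
combined = customers.merge(monthly, on="customer_id", how="left")
combined["total_amount"] = combined["total_amount"].fillna(0)

# Load the prepared table back for the modelling step
combined.to_sql("customer_features", engine, if_exists="replace", index=False)
```
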
Exploratory data analysis

Exploratory data analysis is an approach to analyzing data sets in order to summarize their main characteristics, often using visual methods. A statistical model may or may not be used, but primarily exploratory data analysis is for seeing what the data can tell us beyond the formal modeling or hypothesis-testing task.

Exploratory data analysis vs. summary data analysis

A summary analysis is simply a numeric reduction of a historical data set. It is quite passive. Its focus is on the past. Its intent is simply to arrive at a few key statistics, like the mean and standard deviation, which may then either replace the data set or be appended to it in the form of a summary table.
The purpose of exploratory data analysis, by contrast, is to gain insight into the engineering or scientific process behind the data. Whereas summary statistics are passive and historical, exploratory data analysis is active and forward-looking. In an effort to understand the process and improve it in the future, it uses the data as a window to peer into the heart of the process that generated it.

Building a predictive model

After successfully identifying the problem this model needs to solve, it is time to write some code and do some testing in order to build a model that will predict outcomes for the data we anticipate receiving in the near future, based on the data we already have.
This is the time to implement some machine learning. You will basically try algorithms like ANOVA, ARIMA, decision trees, KNN, etc., depending on the problem you are trying to solve and the performance the algorithm gives for the specific data you have.
At this stage, we should evaluate algorithms by developing a test harness and a baseline accuracy from which to improve, and then leverage the results to develop more accurate models.
There are many ways to choose the right algorithm while building a model, depending on the scope of the problem. Most of the time the prediction model is improved by combining more than one algorithm, i.e., blending (see the sketch below). For example, the predictive model that won the $1 million prize from Netflix for movie recommendations blended more than 100 different models into one.
Popular tools used nowadays are R, Python, Weka, RapidMiner, MATLAB, IBM SPSS, and Apache Mahout.
I will write more about choosing the right algorithm for a specific problem in another article.
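
As a toy illustration of the blending idea mentioned above, here is a hedged sketch using scikit-learn's VotingClassifier to combine several simple models into one (nowhere near the Netflix-scale ensemble, but the same principle):

```python
# Blending several simple models into one ensemble with a voting classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

blend = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=5000)),
        ("forest", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=7)),
    ],
    voting="soft",  # average predicted probabilities across models
)

print("Blended CV accuracy:", cross_val_score(blend, X, y, cv=5).mean())
```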

Presenting the outcome

At this stage, we need to come up with a way of presenting the results that the predictive model has generated. This is where good data visualization practices come in handy. Most of the time the results are presented as a report or just an Excel spreadsheet, but lately I see increased demand for interactive dashboards, where the user can see the data from many perspectives instead of one.
We should also be careful about how we present the data, since executives and people who need to make strategic decisions are not necessarily very technical. We must make sure that they get a good understanding of the data. Asking for the help of a graphic designer, or reading more about how to use colors and shapes, will be really useful and will pay off in the end.
Some popular visualization platforms that can provide interactive dashboards are MicroStrategy, PerformancePoint on SharePoint, Tableau, QlikView, Logi Analytics, SAP, SAS, and Big Blue from IBM.

As you can see, the process of building a predictive model has a much bigger scope than just applying some fancy algorithms and mathematical formulas. Therefore, a successful data scientist should have an understanding of business problems and business analysis, to grasp what the data is really saying; some computer science skills, to perform the extraction and transformation of the different data sources; and some statistical knowledge, to apply data sampling, better understand the predictive models, do hypothesis testing, and so on.
You might think that a person with such broad knowledge does not exist, but in fact they do. That is why data scientists are so appreciated lately.

Where does a Data Scientist sit among all that Big Data

In the past few years we have all witnessed the growing demand for machine learning, predictive analytics, and data analysis. Why is that so?
Well, it is quite simple. As E.O. Wilson said,
“We are drowning in information, while starving for wisdom. The world henceforth will be run by synthesizers, people able to put together the right information at the right time, think critically about it, and make important choices wisely.”
Every possible device I can think of is generating some data feed. But what to do with this data is always the big question.
Business people will surely have sparks of ideas, marketing people will try to sell the perfect one-size-fits-all model for optimizing your marketing campaigns and bringing in all the customers you need, business controllers will try to predict what sales are going to look like in the next quarter, and so on. All of that using the raw data generated by everything around us.
The big question is: how do you do all this smart analysis?
That is where all the fancy names and companies with shiny solutions come into the game. Of course, that comes with a price, usually a big one.
But on the other side of the game is where all the smart people come in. Lately, these people are called Data Scientists, Data Analysts, BI professionals, and all possible varieties of names.

What does a Data Scientist / Data Analyst really do?

Working as part of an already established Data Science team

Most of them do tasks like pulling data out of MySQL, MSSQL, Sybase, and all the other databases on the market, becoming masters of Excel pivot tables, and producing basic data visualizations (e.g., line and bar charts). You may on occasion analyze the results of an A/B test or take the lead on your company's Google Analytics account.
You will also have to build predictive models that forecast your clients' mood, possible new market openings, product demand, or time series behavior.

Basic Statistics

This is where basic statistics come in really handy.
You should be familiar with statistical tests, distributions, maximum likelihood estimators, etc.
This will also be the case for machine learning, but one of the more important aspects of your statistics knowledge is understanding when different techniques are (or aren't) a valid approach. Statistics is important at all types of companies, but especially at data-driven companies where the product is not data-focused and product stakeholders will depend on your help to make decisions and to design and evaluate experiments.
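
For instance, a basic two-sample comparison with SciPy, the kind of check you would run when evaluating an A/B test (the data here is simulated):

```python
# Example of a basic statistical test: comparing two groups (e.g., an A/B test) with SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=2.0, size=500)    # simulated control group metric
variant = rng.normal(loc=10.4, scale=2.0, size=500)    # simulated treatment group metric

t_stat, p_value = stats.ttest_ind(control, variant, equal_var=False)  # Welch's t-test
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A small p-value suggests the difference in means is unlikely to be pure chance,
# but whether this test is appropriate depends on the data and the experimental design.
```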

Machine Learning

If you're dealing with huge amounts of data, or working at a company where the product itself is especially data-driven, you'll want to be familiar with machine learning methods. At this point classical statistical methods might not always work, and you might face situations where you need to work with all the data instead of a sample of the whole data set, as you would in a conventional statistical approach to analyzing your data.
This can mean things like ARIMA models, neural networks, SVMs, VARs, k-nearest neighbors, decision trees, random forests, ensemble methods: all the fancy machine learning words. It's true that a lot of these techniques can be implemented using R or Python libraries; because of this, it's not necessarily a deal breaker if you're not the world's leading expert on how the algorithms work. More important is to understand the broad strokes and really understand when it is appropriate to use different techniques. There is a lot of literature that can help you get up to speed with R and Python in real-world scenarios.

Establishing a Data Science team

Nowadays, a number of companies are getting to the point where they have an increasingly large amount of data, and they're looking for someone to set up a lot of the data infrastructure that the company will need moving forward. They're also looking for someone to provide analysis. You'll see job postings listed under both "Data Scientist" and "Data Engineer" for this type of position. Since you'd be (one of) the first data hires, there is likely a lot of low-hanging fruit, making it less important that you're a statistics or machine learning expert.
A data scientist with a software engineering background might excel at a company like this, where it's more important that a data scientist make meaningful contributions to the production code and provide basic insights and analyses.
Be ready to bootstrap servers, install new virtual machines, set up networks, do plain DBA work, install Hadoop, set up Oozie, Flume, Hive, etc. Many times I have been asked to set up SharePoint or web servers so I could set up PerformancePoint as part of the reporting solution.
When establishing a Data or BI team in a company, you should be ready for literally every other IT task you can imagine (especially if you work at a startup), so a broad range of skills is really welcome here.
Expect to spend at least the first year working mainly on infrastructure and legacy items instead of crunching data and producing shiny insights and reports.

Keeping up to date

There is certainly a lot of potential in this profession. And with that, the expectations placed on you in your company grow exponentially.
As the industry becomes more inclined toward data analysis, you, working as a Data Analyst/Scientist, will be challenged to read all the time, whether it is business news or literature that will help you build or improve your models, reports, or work in general. There is a lot of research in this field and a lot of new books coming out with "Data" in the title.
Data scientists today are akin to Wall Street “quants” of the 1980s and 1990s. In those days people with backgrounds in physics and math streamed to investment banks and hedge funds, where they could devise entirely new algorithms and data strategies. Then a variety of universities developed master’s programs in financial engineering, which churned out a second generation of talent that was more accessible to mainstream firms. The pattern was repeated later in the 1990s with search engineers, whose rarefied skills soon came to be taught in computer science programs.
One question raised by this is whether some firms would be wise to wait until that second generation of data scientists emerges, and the candidates are more numerous, less expensive, and easier to vet and assimilate in a business setting. Why not leave the trouble of hunting down and domesticating exotic talent to the big data start-ups and to firms whose aggressive strategies require them to be at the forefront?
The problem with that reasoning is that the advance of big data shows no signs of slowing. If companies sit out this trend’s early days for lack of talent, they risk falling behind as competitors and channel partners gain nearly unassailable advantages. Think of big data as an epic wave gathering now, starting to crest. If you want to catch it, you need people who can surf.