Important things to consider before building your machine learning and AI project

Current State of the market

In order to go in-depth on what exactly data science and machine learning (ML) tools or platforms are, why companies small and large are moving toward them, and why they matter in the Enterprise AI journey, it’s essential to take a step back and understand where we are in the larger story of AI, ML, and data science in the context of businesses:

 1. Enterprise AI is at peak hype.

Of course, the media has been talking about consumer AI for years. However, since 2018, the spotlight has turned to the enterprise. The number and type of devices sending data are skyrocketing while the cost of storing data continues to decrease, which means most businesses are collecting more data in more types and formats than ever before. Moreover, to compete and stay relevant among digital startups and other competition, these companies need to be able to use this data not only to drive business decisions but drive the business itself. Now, everyone is talking about how to make it a reality.

2. AI has yet to change businesses.

Despite the hype, the reality is that most businesses are struggling to leverage data at all, much less build machine learning models or take it one step further into AI systems. For some, it’s because they find that building just one model is far more expensive and time-consuming than they planned for. However, the great majority struggle with fundamental challenges, like organizing controlled access to data or efficient data cleaning and wrangling.

3. Successful enterprises have democratized.

Those companies that have managed to make progress toward Enterprise AI have realized that it’s not one ML model that will make the difference; it’s hundreds or thousands. That means scaling up data efforts in a big way that will require everyone at the company to be involved. Enter democratization. In August 2018, Gartner identified Democratized AI as one of the five emerging trends in their Hype Cycle for Emerging Technologies. Since then, we have seen the word “democratization” creep into the lexicon of AI-hopefuls everywhere, from the media to the board room. And, to be sure, it’s an essential piece of the puzzle when it comes to understanding data science and machine learning (ML) platforms.

Is hiring data scientists enough to fulfil your AI and machine learning goals?

Hiring for data functions is at an all-time high. Currently, in 2019, according to career listing data, data scientist is the hottest career out there. Moreover, though statistics on Chief Data Officers (CDOs) vary, some put the figures as high as 100-fold growth in the function over the past 10 years.

Hiring data experts is a crucial element of a robust Enterprise AI strategy; however, hiring alone does not guarantee the expected outcomes, and it isn’t a reason not to invest in data science and ML platforms. For one thing, hiring data scientists is costly, often excessively so, and they’re only getting more expensive as demand for them grows.

The truth is that when the intention is to go from producing one ML model a year to tens, hundreds, or even thousands, a data team isn’t enough, because it still leaves a big swath of employees doing day-to-day work without the ability to take advantage of data. Without democratization, the output of a data team, even the very best one made up of leading data scientists, would be limited.

As a response to this, some companies have decided to leverage their data team as sort of an internal contractor, working for lines of business or internal groups to complete projects as needed. Even with this model, the data team will need tools that allow them to scale up, working faster, reusing parts of projects where they can, and (of course) ensuring that all work is properly documented and traceable. A central data team that is contracted out can be a good short-term solution, but it tends to be a first step or stage; the longer-term model of reference is to train masses of non-data people to be data people.

Choosing the right tools for Machine Learning and AI

Open source: critical, but not always giving you what you need

To stay on the bleeding edge of technological developments, using open source makes it easier to onboard a team and to hire. Not only are data scientists interested in growing their skills with the technologies that will be the most used in the future, but there is also less of a learning curve if they can continue to work with tools they know and love instead of being forced to learn an entirely different system. It’s important to remember that keeping up with that rapid pace of change is difficult for large corporations.
The latest innovations are usually highly technical, so without some packaging or abstraction layers that make the innovations more accessible, it’s challenging to keep everybody in the organization on board and working together.
A business might technically adopt the open source tool, but only a small number of people will be able to work with it. Not to mention that governance can be a considerable challenge if everyone is working with open source tools on their local machines without a way to have work centrally accessible and auditable.
Data science and ML platforms have the advantage of being usable right out of the box, so that teams can start analyzing data from the first day. With open source tools (mostly R and Python), you sometimes need to assemble a lot of the parts by hand, and as anyone who’s ever done a DIY project can attest, it’s often much easier in theory than in practice. Choosing a data science and ML platform wisely (meaning one that is flexible and allows for the incorporation and continued use of open source) can allow the best of both worlds in the enterprise: cutting-edge open source technology and accessible, governable control over data projects.

What should Machine Learning and AI platforms provide?

Data science and ML platforms allow for the scalability, flexibility, and control required to thrive in the era of Machine Learning and AI because they provide a framework for:

  • Data governance: Clear workflows and a way for team leaders to monitor those workflows and data projects.
  • Efficiency: Finding small ways to save time throughout the data-to-insights process gets the organization to business value much faster.
  • Automation: A specific type of efficiency is the growing field of AutoML, which is broadening to automation throughout the data pipeline to ease inefficiencies and free up staff time.
  • Operationalization: Effective ways to deploy data projects into production quickly and safely.
  • Collaboration: A way for additional staff working with data, many of whom will be non-coders, to contribute to data projects alongside data scientists (or IT and data engineers).
  • Self-Service Analytics: A system by which non-data experts from various lines of business can access and work with data in a governed environment.

Some things to consider before choosing an AI and Machine Learning platform

Governance is becoming more challenging

With the quantity of data being collected today, data safety and security (particularly in specific sectors like finance) are crucial. Without a central place to access and work with data that has proper user controls, data might be stored across different individuals’ laptops. And if an employee or contractor leaves the company, the risks increase not just because they could still have access to sensitive data, but because they might take their work with them and leave the team to start from scratch, unsure of what the person was working on. On top of these concerns, today’s enterprise is plagued by shadow IT; that is, the idea that for years, different departments have invested in all kinds of different technologies and are accessing and using data in their own ways. So much so that even IT teams today do not have a central view of who is using what, and how. It’s a problem that becomes dangerously amplified as AI efforts scale, and it points to the need for governance at a larger and more fundamental scale across all lines of business in the enterprise.

AI Needs to Be Responsible

We learn from a young age that subjects like science and mathematics are entirely objective, which means that, naturally, people think that data science is as well: that it’s black and white, an exact discipline with just one way to reach a “correct” solution, independent of who builds it. We’ve known for a long time that this is not the case and that it is possible to use data science techniques (and, hence, produce AI systems) that do things, well … wrong. As recently as last year, we witnessed the problems that giants like Google, Tesla, and Facebook face with their AI systems. These problems can cause a domino effect very fast. It can be a leak of private information, photo mislabelling, or a video recognition system failing to recognize a pedestrian crossing the road and hitting them.
This is where AI needs to be very responsible. For that, you need to be able to discover at an early stage where your AI might fail, before deploying it in the real world.
The fact that even these companies have not fixed all of these problems shows how challenging it is to get AI right.

Reproducibility of Machine Learning projects as well as scaling the same projects

Nothing is more inefficient than needlessly repeating the same processes over and over. This applies both to repeating processes within a project (like data preparation) and to repeating the same process across projects or, even worse, unintentionally duplicating entire projects if the team gets large but lacks insight into each other’s work. No business is immune to this risk; in fact, the problem can become exponentially worse in large enterprises with bigger teams and more separation between them. To scale efficiently, data teams need a tool that helps reduce duplicated work and makes sure that the work hasn’t already been done by other members of the team.
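
To make the idea of avoiding duplicated work a bit more concrete, here is a minimal Python sketch (not taken from any particular platform) that fingerprints a data preparation step by its name, parameters, and input, and reuses the stored result when the same step is requested again. The names `StepRegistry`, `fingerprint`, and `drop_below` are hypothetical, and a real shared registry would live in a central store rather than in memory.

```python
import hashlib
import json


def fingerprint(step_name, params, input_rows):
    """Build a stable hash from the step name, its parameters, and its input."""
    payload = json.dumps(
        {"step": step_name, "params": params, "rows": input_rows},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class StepRegistry:
    """Hypothetical in-memory registry of already-computed steps."""

    def __init__(self):
        self._results = {}
        self.runs = 0  # how many times a step was actually executed

    def run(self, step_name, func, params, input_rows):
        key = fingerprint(step_name, params, input_rows)
        if key not in self._results:
            # Only execute the step if nobody has run it on this input before.
            self._results[key] = func(input_rows, **params)
            self.runs += 1
        return self._results[key]


def drop_below(rows, threshold):
    """Toy data preparation step: keep values at or above the threshold."""
    return [r for r in rows if r >= threshold]


registry = StepRegistry()
first = registry.run("drop_below", drop_below, {"threshold": 3}, [1, 5, 2, 8])
second = registry.run("drop_below", drop_below, {"threshold": 3}, [1, 5, 2, 8])
print(first, registry.runs)
```

The second call returns the stored result instead of recomputing it, which is the behavior a team would want when two people unknowingly request the same preparation on the same data.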

Use Data Analysts to Augment Data Scientists’ Work

Today, data scientist is one of the most in-demand positions. This means that data scientists can be both (1) difficult to find and attract and (2) expensive to hire and retain. This combination means that to scale data efforts in pursuit of Enterprise AI, the team will inevitably need to be filled out with business or data analysts. For the two types of staff to collaborate properly, they need a central environment from which to work. Analysts also tend to work differently than data scientists: they are experienced in spreadsheets and possibly SQL, but usually not in coding. Having a tool that allows each profile to use the tools with which they are most comfortable provides the efficiency needed to scale data efforts to any size.

Ad-Hoc Methodology is Unsustainable for Large Teams

Small teams can sustain themselves up to a certain point by working on data, ML, or larger AI projects in an ad-hoc fashion, meaning team members save their work locally and not centrally and don’t have any reproducible processes or workflows, figuring things out along the way.
However, with more than just a few team members and more than one project, this becomes unruly quickly. Any business with any hope of doing Enterprise AI needs a central place where everybody involved with data can do all of their work, from accessing data to deploying a model into a production environment. Allowing employees, whether directly on the data team or not, to work ad hoc without a central tool is like a construction crew trying to build a skyscraper without a master set of blueprints.

Machine Learning models Need to be Monitored and Managed

The most significant difference between developing traditional software and developing machine learning models is maintenance. For the most part, software is written once and does not need to be continually maintained; it will typically continue to work over time. Machine learning models are developed, put in production, and then must be monitored and fine-tuned until performance is optimal. Even when performance is optimal, it can still drift over time as the data (and the people generating it) changes. This is quite a different approach, especially for companies that are used to putting software in production.
Moreover, it’s easy to see how issues with sustainability might eventually cause, or intensify, problems with ML model bias. In reality, the two are deeply linked, and ignoring both can be devastating to a business’s data science efforts, particularly when magnified by the scaling up of those efforts. All of these factors point to having a platform that can help with model monitoring and management.
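
As a rough illustration of what monitoring can mean in practice, the Python sketch below compares a model’s accuracy on recent predictions with the accuracy measured when it was deployed and flags it for retraining when the gap grows too large. The function names and the 0.05 tolerance are assumptions made for this example; real monitoring would track more metrics and more data.

```python
from statistics import mean


def recent_accuracy(predictions, actuals):
    """Fraction of recent predictions that matched the observed outcomes."""
    if len(predictions) != len(actuals) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    return mean(1.0 if p == a else 0.0 for p, a in zip(predictions, actuals))


def needs_retraining(baseline_accuracy, predictions, actuals, tolerance=0.05):
    """Flag the model when recent accuracy drops more than `tolerance` below baseline."""
    return recent_accuracy(predictions, actuals) < baseline_accuracy - tolerance


# Accuracy measured at deployment time (hypothetical value).
baseline = 0.90

# Recent predictions compared with what actually happened.
print(needs_retraining(baseline, [1, 1, 0, 0], [1, 0, 0, 1]))
print(needs_retraining(baseline, [1, 0, 1, 1], [1, 0, 1, 1]))
```

A check like this would typically run on a schedule, so that a slow drop in performance is noticed before it affects the business.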

The Need to Create Models that Work in Production

Investing in predictive analytics and data science means ensuring that data teams are productive and see projects through to completion (i.e., production), otherwise known as operationalization. Without an API-based tool that allows for single-step deployment, data teams will likely need to hand off models to an IT team, who will then have to re-code them. This step can take lots of time and resources and can be a substantial barrier to executing data projects that genuinely affect the business in meaningful ways. With a tool that makes this seamless, data teams can easily have an impact, monitor, fine-tune, and continue to make improvements that positively affect the bottom line.

Having said all that, choosing the right platform is not always straightforward. You need to carefully assess what you really need now and what you will need in the future.
You need to do so taking into account your budget, your employees’ skills, and their willingness to learn new methodologies and technologies.

Please bear in mind that developing a challenging AI project takes time, sometimes a couple of years. That means your team can start building your prototype on an easy-to-use open source machine learning platform. Once you have proven your hypothesis, you can migrate to a more complex and more expensive platform.

Good luck with your new machine learning and AI project!

Data Engineering

Making Machine Learning more efficient with the cloud

In essence, machine learning is a productivity tool for data scientists. As the heart of systems that can learn from data, machine learning allows data scientists to train a model on an example data set and then use algorithms that automatically generalize and learn both from that example and from new data feeds. With unsupervised methods, data scientists can do without training examples entirely and use machine learning to distill insights directly and continuously from the data.

Below I describe the advantages of using the cloud for building machine learning projects.

Machine learning can infuse every application with predictive power. Data scientists use these sophisticated algorithms to dissect, search, sort, infer, predict, and otherwise make sense of the growing amounts of data in our world.

To achieve machine learning’s full potential as a business resource, data scientists need to train it on the rich troves of data on the mainframes and other servers in your private cloud. For truly robust business analytics, you need machine learning platforms that are engineered to provide the following:

  • Automation and optimization: Your enterprise machine learning platform should allow data scientists to automate the creation, training, and deployment of algorithmic models against high-value corporate data. The platform should assist them in selecting the optimal algorithm for every data set. One way to do this is with a system that scores their data against the available algorithms and configurations and recommends the algorithm that best matches their requirements.
  • Efficiency and scalability: The platform needs to be able to continually build, train, and deploy a high volume of machine learning models against data stored in large business databases. It should allow data scientists to deliver better, fresher, more frequent predictions, thereby speeding time to insight.
  • Security and governance: The platform should enable data scientists to train models without moving the data off the mainframe or other business platform where it is protected and governed. In addition to minimizing latency and controlling the cost of performing machine learning in your data center, this approach removes the risks associated with doing ETL on a platform different from the node where machine learning execution occurs.
  • Versatility and programmability: The platform should allow data scientists to use any language (e.g., Scala, Java, Python), any popular framework (e.g., Apache SparkML, TensorFlow, H2O), and any transactional data type throughout the machine learning development lifecycle.
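
The “scoring data against available algorithms” idea from the first bullet can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the two candidate “algorithms” are deliberately naive forecasters, the score is the mean absolute error of one-step-ahead forecasts, and all names (`CANDIDATES`, `score_candidate`, `select_best`) are hypothetical rather than part of any platform.

```python
from statistics import mean


def forecast_mean(history):
    """Candidate 1: predict the average of everything seen so far."""
    return mean(history)


def forecast_last_value(history):
    """Candidate 2: predict that the next value equals the last one."""
    return history[-1]


CANDIDATES = {"mean": forecast_mean, "last_value": forecast_last_value}


def score_candidate(forecast, series, start):
    """Mean absolute error of one-step-ahead forecasts from index `start` on."""
    errors = [abs(series[t] - forecast(series[:t])) for t in range(start, len(series))]
    return mean(errors)


def select_best(series, start=3):
    """Score every candidate on the same data and return the lowest-error one."""
    scores = {name: score_candidate(f, series, start) for name, f in CANDIDATES.items()}
    best = min(scores, key=scores.get)
    return best, scores


monthly_sales = [10, 12, 14, 16, 18, 20]
best_name, scores = select_best(monthly_sales)
print(best_name, scores)
```

On this steadily growing series the last-value forecaster wins, because the average of the history always lags behind. A platform does the same kind of comparison with far richer algorithms and configurations.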

Taking into account the above points, developing your machine learning and AI project on the cloud can really make a difference.

What are the Benefits of Machine Learning in the Cloud?

  • The cloud’s pay-per-use model is good for bursty AI or machine learning workloads.
  • The cloud makes it easy for enterprises to experiment with machine learning capabilities and scale up as projects go into production and demand increases.
  • The cloud makes intelligent capabilities accessible without requiring advanced skills in artificial intelligence or data science.
  • AWS, Microsoft Azure, and Google Cloud Platform offer many machine learning options that don’t require deep knowledge of AI, machine learning theory, or a team of data scientists.

You don’t need to use a cloud provider to build a machine learning solution. After all, there are plenty of open source machine learning frameworks, such as TensorFlow, MXNet, and CNTK that companies can run on their own hardware. However, companies building sophisticated machine learning models in-house are likely to run into issues scaling their workloads, because training real-world models typically requires large compute clusters.

The leading cloud computing platforms are all betting big on democratizing artificial intelligence and ML. Over the past three years, Amazon, Google, and Microsoft have made considerable investments in artificial intelligence (AI) and machine learning, from introducing new services to carrying out major reorganizations that position AI strategically in their organizational structures. Google CEO Sundar Pichai has even said that his company is moving to an “AI-first” world.

That said, as data science teams grow, cloud usage will become more prominent. Big teams will ask for a stable, high-performing platform where they can create and share different machine learning projects, and on which they can compare and optimize the performance of their machine learning models.
This is where the cloud comes in very handy, by providing a centralized place to keep all big data and all ML models built on top of this data.

Another argument to take into consideration is the reusability of machine learning projects.
As teams change drastically and fast nowadays, it is essential to have the machine learning models deployed on the cloud. Compared with models deployed on in-house servers, the difference is the ease of granting access to new team members without jeopardizing the company’s security protocols. That means that a new team member can be up and running on their first day on the team. They can see the machine learning models developed by their predecessors and use some of them to build new projects. That already adds a lot of value.

Some great Machine learning platforms in the cloud available today are:
IBM Machine Learning for z/OS
Amazon EC2 Deep Learning AMI backed by NVIDIA GPU, Google Cloud TPU, Microsoft Azure Deep Learning VM based on NVIDIA GPU, and IBM GPU-based Bare Metal Servers are examples of niche IaaS for ML.

Machine Learning in the cloud

As machine learning (ML) and artificial intelligence (AI) become more prevalent, data logistics will be vital to your success.
While building machine learning projects, most of the effort required for success is not the algorithm, the model, the framework, or the learning itself. It’s the data logistics. Perhaps less exciting than these other aspects of ML, it’s the data logistics that drive efficiency, continuous learning, and success. Without data logistics, your ability to continue to refine and scale is significantly limited.

Data logistics is key for success in your Machine Learning and AI Projects

Good data logistics does more than drive efficiency. It is essential for reducing costs now and improving agility in the future. As ML and AI continue to evolve and expand into more business processes, businesses must not allow early successes to become limitations or problems in the long term. In a paper by Google researchers (Machine Learning: The High-Interest Credit Card of Technical Debt), the authors point out that although it is easy to spin up ML-based applications, the effort can result in costly data dependencies. Good data logistics can mitigate the difficulty of managing these complex data dependencies and avoid hindering agility in the future. Using an appropriate framework like this can also ease deployment and administration and allow these applications to evolve in ways that are difficult to predict precisely today.

When building Machine Learning and AI projects, Keep It Simple to Start

Nowadays, we’ll see a shift from complex, data-science-heavy implementations to a proliferation of efforts that can best be described as KISS (Keep It Simple to Start). Domain expertise and data will be the drivers of AI processes that will evolve and improve as experience grows. This approach offers an additional benefit: it improves the productivity of existing staff as well as that of costly, hard-to-find, hard-to-hire, and hard-to-retain data scientists.

This approach also removes the worry over choosing “just the right tools.” It is a fact of life that we need multiple tools for AI. Building around AI the right way allows continual adjustment to take advantage of new AI tools and algorithms as they appear. Don’t worry about performance, either (including that of applications that need to stream data in real time), because there are constant advances on that front. For instance, NVIDIA recently announced RAPIDS, an open source data science initiative that leverages GPU-based processing to make the development and training of models both easier and faster.

Multi-Cloud Deployments Will Become More Standard

To be fully agile for whatever the future may hold, data platforms will need to support the full range of diverse data types, including files, objects, tables, and events. The platform must make input and output data available to any application anywhere. Such agility will make it possible to fully utilize the global resources available in a multi-cloud environment, thereby empowering organizations to achieve the cloud’s full potential to optimize performance, cost, and compliance requirements.

Organizations will move to deploy a common data platform to synchronize and drive convergence of (and also preserve) all data across all deployments, and, through a global namespace, provide a view into all data, wherever it is. A common data platform across multiple clouds will also make it easier to explore different services for a range of ML and AI needs.

As companies expand their use of ML and AI across multiple lines of business, they will need to access the full range of data sources, types, and structures on any cloud while avoiding the creation of data silos. Achieving this outcome will lead to deployments that go beyond a data lake, and this will mark the increased proliferation of global data platforms that can span data types and locations.

Analytics at the Edge Will Become Strategically Important

As the Internet of Things (IoT) continues to expand and evolve, the ability to unite edge, on-premises, and cloud processing atop a common, global data platform will become a strategic imperative.

A distributed ML/AI architecture capable of coordinating data collection and processing at the IoT edge removes the need to send large volumes of data over the WAN. This ability to filter, aggregate, and analyze data at the edge also promotes faster, more efficient processing and can lead to better local decision making.
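
Here is a small Python sketch of what “filter and aggregate at the edge” could look like for a temperature sensor: implausible readings are dropped locally and only a compact summary is sent onward. The function name, the summary fields, and the valid range of -40 to 85 degrees are assumptions chosen for the example.

```python
from statistics import mean


def summarize_at_edge(readings, valid_range=(-40.0, 85.0)):
    """Filter out implausible readings and reduce the rest to a small summary."""
    low, high = valid_range
    valid = [r for r in readings if low <= r <= high]
    summary = {"count": len(valid), "dropped": len(readings) - len(valid)}
    if valid:
        summary.update({"min": min(valid), "max": max(valid), "mean": mean(valid)})
    return summary


# Raw readings collected on the device during one time window.
raw_window = [21.5, 22.0, 999.0, 21.0, -120.0, 23.5]

# Only this summary, not the raw window, would travel over the WAN.
summary = summarize_at_edge(raw_window)
print(summary)
```

Instead of six raw values, the device sends one small record per window, and obviously faulty readings never leave the edge.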

Organizations will aim to have a common data platform, from the cloud core to the enterprise edge, with consistent data governance to ensure the integrity and security of all data. The data platform chosen for the cloud core will, therefore, need to be sufficiently extensible and scalable to handle the complexities associated with distributed processing at a scattered and dynamic edge. Enterprises will place a premium on a “lightweight” yet capable and compatible version suited to the compute power available at the edge, especially for applications that must deliver results in real time.

A Final Word

In the coming years we will see an increased focus on AI and ML development in the cloud. Enterprises will keep it simple to start, avoid dependencies with a multi-cloud global data platform, and empower the IoT edge so that ML/AI initiatives deliver more value to the business in the near term and well into the future.

Advanced Data Science

What skills does a data scientist need and how to get them?

Upgrading your skills constantly is the way to stay on top.
What skills do you need to have to become a data scientist?
I have written about this before, but I’ll try to add some more information to help people who really want to go down that path.

Free Tools can help a lot to start!

There are many tools that can help you get started fairly easily. KNIME is one great tool that I use literally every day. It is really easy to learn, and it covers 90% of the tasks you will be asked to do daily as a data scientist. Best of all, it is free.
A similar tool is RapidMiner.
The important thing is that you know what to do with the tool.
I have given numerous courses on how to use the tool and how to start with very basic data science tasks.
Understanding basic terms can help you along the way:
What is regression and what is classification?
It is good to know how to approach a specific problem in order to solve it. Almost every problem we try to solve falls into one of these two categories.
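
To show the difference between the two, here is a minimal Python sketch on made-up apartment data. Regression predicts a number (a price), while classification predicts a label (“small” or “large”). The data, the function names, and the very simple methods used (a least-squares line and a nearest-class-average rule) are chosen only for illustration.

```python
from statistics import mean


def fit_regression(xs, ys):
    """Fit a least-squares line and return a function that predicts a number."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def fit_classifier(xs, labels):
    """Return a function that predicts the label whose class average is closest."""
    averages = {
        label: mean(x for x, l in zip(xs, labels) if l == label)
        for label in set(labels)
    }
    return lambda x: min(averages, key=lambda label: abs(x - averages[label]))


sizes_m2 = [50, 70, 90, 110]           # input: apartment size
prices_k = [100, 140, 180, 220]        # regression target: a number
size_class = ["small", "small", "large", "large"]  # classification target: a label

predict_price = fit_regression(sizes_m2, prices_k)
predict_class = fit_classifier(sizes_m2, size_class)

print(predict_price(80))   # a number
print(predict_class(65))   # a label
```

The same input (apartment size) is used in both cases; what changes is the kind of answer you ask for.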

What algorithms can be used and should be used for each problem?

This is important, but not a showstopper at the beginning. Decision trees will do just fine for a start.
Here is how to approach the main tasks:

Data Cleaning or Transformation

This is one of the most important things you will come across when working in data science. 90% of the time, you are not going to get well-formatted data. If you are skilled in one of the programming languages, Python or R, you should be proficient with packages like pandas or dplyr/reshape.
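
To give a feeling for what data cleaning involves, here is a small sketch in plain Python. In practice you would more likely use pandas or dplyr as mentioned above; the standard library is used here only to keep the example self-contained. The sample data and the cleaning rules (trim spaces, normalize capitalization, skip rows with a missing or non-numeric age, remove duplicates) are made up for illustration.

```python
import csv
import io

RAW = """name,age,city
 Alice ,34,  london
Bob,,Paris
alice,34,London
Carol,forty,Berlin
Dave,29,berlin
"""


def clean_rows(text):
    """Return tidy records, skipping unusable rows and exact duplicates."""
    cleaned, seen = [], set()
    for row in csv.DictReader(io.StringIO(text)):
        name = row["name"].strip().title()
        city = row["city"].strip().title()
        age_text = row["age"].strip()
        if not age_text.isdigit():
            continue  # missing or non-numeric age: skip the row
        record = (name, int(age_text), city)
        if record in seen:
            continue  # same person already seen after normalization
        seen.add(record)
        cleaned.append({"name": name, "age": record[1], "city": city})
    return cleaned


rows = clean_rows(RAW)
for row in rows:
    print(row)
```

Out of five raw rows, only two clean records remain, which is quite typical for real-world data.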

Exploratory Data Analysis

I have written before about how you can start using your data.
Once again, this is the most important part: whether you are working to extract insights or you want to do predictive modeling, this step comes in. You must train your mind analytically to form a picture of the variables in your head. You can build such a mindset through practice. After that, you must become very good, hands-on, with packages like matplotlib or ggplot2, depending on the language you work with.

Machine Learning / Predictive Modelling

One of the most important aspects of today’s data science is predictive modeling. This depends on your EDA and your knowledge of mathematics. I advise you to invest your time in theory. The more theoretical knowledge you have, the better you are going to do. There is no easy way around it. There’s a great course by Andrew Ng that goes deep into theory. Take it.

Programming Languages

If you want to go more advanced, it is important to have a grip on at least one programming language widely used in data science. But you should also know a little of another language. Either you should know R very well and some Python, or Python very well and some R.
Take my case: I know R very well (at least I think so), but I can also get by in Python (not at expert level), Java, C#, and JavaScript. Anything works if you know how to use it when you need it.
An example of a complete data analysis performed by a data scientist can be found here.
I use KNIME, R, and Python every day. I think that if you are a total beginner, it’s a good idea to start with KNIME.

Useful courses for aspiring data scientists

I really recommend spending some time on the following courses:
I have completed them myself and learned a lot from each of them.
Happy learning!

Predictive analytics, the process of building it

What is Predictive Analytics?

I have talked with many people with different levels of technical knowledge, and many times I have been asked questions like: So, can predictive analytics tell my future? The sad answer is no. Predictive analytics will not tell you for certain whether you are going to be rich. Nor will it guarantee 100% that your favorite team will win, so that you can put all your savings on a bet. Also, it won’t tell you for sure where you will end up next year.

However, predictive analytics can definitely forecast and give hints about what might happen in the future with an acceptable level of reliability, and it can include risk assessment and what-if scenarios.

In short, predictive analytics is the process of extracting information from a dataset in order to find patterns or predict future outcomes based on that data.

What does the process of implementing predictive analytics include?

From having an idea to implementing a predictive model and being able to interpret it, there are a number of steps that need to be taken care of.

Know the business problem

It is really important to know the scope of the business that you are building a model for. Many people think that it is enough to apply statistical methods and some machine learning algorithm to a big chunk of data, and the model will give you an answer by itself. Unfortunately, this is not the case. Most of the time you will have to complement your data set by deriving new metrics from the existing data, and for that, you will have to know at least the essentials of the business.

First and foremost, it is essential to identify your business problem. After that, you can successfully determine what metrics are necessary to address your problem and then you can decide which analysis technique you will use.

Transforming and extracting raw data

While trying to build a predictive model, you will spend a lot of time preparing the data in the best possible way. That will include handling different data sources like web APIs, unstructured data (usually collected from weblogs, Facebook, Twitter, etc.), different database engines (MSSQL, MySQL, Sybase, Cassandra, etc.), and flat files (comma-separated values (CSV), tab-delimited files, etc.). Therefore, knowledge of database structures, ETL, and general computer science is really useful. In some cases, you might be lucky enough to have a separate team that provides these services for you and delivers a nicely formatted file that you can work on, but in most cases, you still have to do the data transformations yourself.

Tools that are usually used for this kind of transformation are SSIS from Microsoft, SAP Data Services, IBM InfoSphere Information Server, SAS Data Management, and Python. Many times I have seen ETL processes written purely in C# (when application developers are given BI tasks), purely in stored procedures (when database developers are given BI tasks), or in R (when statisticians try to perform ETL).

Exploratory data analysis

Exploratory data analysis is an approach to analyzing data sets in order to summarize their main characteristics. It is often done using visual methods. A statistical model may or may not be used, but exploratory data analysis is primarily for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

Exploratory data analysis vs. summary data analysis

A summary analysis is simply a numeric reduction of a historical data set. It is quite passive. Its focus is on the past. Its intent is simply to arrive at a few key statistics, such as the mean and standard deviation, which may then either replace the data set or be appended to it in the form of a summary table.
The purpose of exploratory data analysis is to gain insight into the engineering or scientific process behind the data. Whereas summary statistics are passive and historical, exploratory data analysis is active and forward-looking. In an effort to “understand” the process and improve it in the future, it uses the data as a “window” to peer into the heart of the process that generated the data.
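
The difference between the two can be shown with a small sketch. The response times below are invented for illustration. The summary step reduces them to a mean and a standard deviation, while the exploratory step looks at the shape of the data with a crude text histogram (a stand-in for the visual methods mentioned above).

```python
import statistics

# Hypothetical response times in milliseconds; one value is unusual.
response_ms = [112, 108, 115, 109, 111, 107, 113, 110, 480, 114]

# Summary analysis: reduce the data set to a few key statistics.
mean = statistics.mean(response_ms)
stdev = statistics.stdev(response_ms)
print(f"mean={mean:.1f} stdev={stdev:.1f}")


# Exploratory analysis: look at how the values are distributed.
def text_histogram(values, bin_width):
    """Count values per bin and return {bin_start: count}, sorted by bin."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


histogram = text_histogram(response_ms, bin_width=50)
for start, count in histogram.items():
    print(f"{start:>4}-{start + 49:<4} {'#' * count}")

median = statistics.median(response_ms)
print(f"median={median}")
```

The mean alone suggests a typical response of roughly 148 ms, but the histogram shows nine values clustered just above 100 ms and a single extreme one. That is the kind of question about the underlying process that a summary table does not raise by itself.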

Building a predictive model

After successfully identifying the problems that the model needs to solve, it is time to write some code and do some testing in order to build a model that will predict outcomes for the data we expect to receive in the near future, based on the data we already have.
This is the time to apply some machine learning. You will try techniques like ANOVA, ARIMA, decision trees, KNN, etc., depending on the problem you are trying to solve and the performance each algorithm gives on the specific data you have.
In this phase, we should evaluate algorithms by developing a test harness and a baseline accuracy from which to improve. The next step is to leverage those results to develop more accurate models.
There are many ways to choose the right algorithm while building a model, depending on the scope of the problem. Most of the time the prediction model is improved by combining more than one algorithm, a technique known as blending. For example, the predictive model that won the $1 million Netflix Prize for movie recommendations blended more than 100 different models into one.
Popular tools used nowadays are R, Python, Weka, RapidMiner, MATLAB, IBM SPSS, and Apache Mahout.
I will write more about choosing the right algorithm for a specific problem in another article.
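
In the meantime, the test harness, baseline, and blending ideas from this section can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the data is synthetic, the two models (least squares and k-nearest neighbors) are arbitrary picks, and the blend is a plain average, which is far simpler than what a competition-winning ensemble would use.

```python
import math
import random


def make_data(n, seed=0):
    """Synthetic data (assumed for illustration): y = 3x + 5 plus noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 3 * x + 5 + rng.gauss(0, 1)) for x in xs]


def fit_baseline(train):
    """Baseline: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_linear(train):
    """Ordinary least squares for a single feature."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_knn(train, k=5):
    """k-nearest neighbors regression: average the k closest targets."""
    def predict(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict


def blend(models):
    """Blend models by averaging their predictions."""
    return lambda x: sum(m(x) for m in models) / len(models)


def rmse(model, test):
    """Root mean squared error on held-out data."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in test) / len(test))


data = make_data(200)
train, test = data[:150], data[150:]  # simple holdout split

linear, knn = fit_linear(train), fit_knn(train)
results = {
    "baseline": rmse(fit_baseline(train), test),
    "linear": rmse(linear, test),
    "knn": rmse(knn, test),
    "blend": rmse(blend([linear, knn]), test),
}
for name, score in results.items():
    print(f"{name:>8}: RMSE = {score:.3f}")
```

The baseline gives every other model a number to beat. Whether a blend actually improves on its best member depends on the data, which is exactly why it has to go through the same harness as everything else.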

Presenting the outcome

At this stage, we need to come up with a way of presenting the results that the predictive model has generated. This is where good data visualization practices come in handy. Most of the time the results are presented as a report or just an Excel spreadsheet, but lately I see an increased demand for interactive dashboards, where the user can see the data from many perspectives instead of one.
At this stage, we should be careful about how we present the data, since executives and the people who need to make strategic decisions are not necessarily very technical. We must make sure that they have a good understanding of the data. Asking a graphic designer for help or reading more about how to use colors and shapes will be really useful and rewarding in the end.
Some popular visualization platforms that can provide interactive dashboards are MicroStrategy, PerformancePoint on SharePoint, Tableau, QlikView, Logi Analytics, SAP, SAS, Big Blue from IBM.

As you can see, the process of building a predictive model has a much bigger scope than just applying some fancy algorithms and mathematical formulas. Therefore, a successful data scientist should have an understanding of business problems and business analysis, in order to better understand what the data is really saying; some computer science skills, in order to perform the extraction and transformation of the different data sources; and some statistical knowledge, in order to apply data sampling, better understand the predictive models, perform hypothesis testing, etc.
You might think that people with such broad knowledge do not exist, but in fact they do. That is why data scientists are so highly valued lately.

Where does a Data Scientist sit among all that Big Data?

In recent years we have all witnessed the growing demand for machine learning, predictive analytics, and data analysis. Why is that so?
Well, it is quite simple. As E. O. Wilson said,
“We are drowning in information, while starving for wisdom. The world henceforth will be run by synthesizers, people able to put together the right information at the right time, think critically about it, and make important choices wisely.”
Every possible device I can think of is generating some data feed. But what to do with this data is always the big question.
Business people will surely have some sparks of ideas, marketing people will try to sell one-size-fits-all models for optimizing your marketing campaigns and bringing in all the customers you need, business controllers will try to predict what sales are going to look like in the next quarter, and so on. All of that by using the raw data generated by everything around us.
The big question is how to do all this smart analysis.
That is where all the fancy names and companies with shiny solutions come into the game. Of course, that comes with a price, usually a big one.
But on the other side, this is where all the smart people come into the game. Lately, these people are called data scientists, data analysts, BI professionals, and every possible variety of names.

What does a Data Scientist / Data Analyst really do?

Working as part of an already established Data Science team

Most of the work consists of tasks like pulling data out of MySQL, MSSQL, Sybase, and any other database on the market, mastering Excel pivot tables, and producing basic data visualizations (e.g., line and bar charts). You may on occasion analyze the results of an A/B test or take the lead on your company’s Google Analytics account.
Here you will also have to build predictive models for things like your clients’ mood, possible new market openings, product forecasting, or time series analysis.

Basic Statistics

This is where basic statistics comes in really handy.
You should be familiar with statistical tests, distributions, maximum likelihood estimators, etc.
This will also be the case for machine learning, but one of the more important aspects of your statistics knowledge will be understanding when different techniques are (or aren’t) a valid approach. Statistics is important in all company types, but especially in data-driven companies where the product is not data-focused and product stakeholders will depend on your help to make decisions and to design and evaluate experiments.
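
Evaluating an A/B test is a typical example of where this knowledge is applied. The sketch below uses a two-proportion z-test, which is one common choice and not the only valid one; the visitor and conversion counts are invented. It relies on a normal approximation, so it is only a reasonable approach when both groups are fairly large, which is precisely the kind of validity question mentioned above.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: 1,000 visitors per variant.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value says the observed difference would be unlikely if both variants converted at the same rate. It does not say how large the real difference is, or whether it matters for the business.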

Machine Learning

If you’re dealing with huge amounts of data, or working at a company where the product itself is especially data-driven, it may be the case that you’ll want to be familiar with machine learning methods. At that point classical statistical methods might not always work, and you might need to work with all of the data instead of a sample of the data set, as you would with a conventional statistical approach to analyzing your data.
This can mean things like ARIMA models, neural networks, SVMs, VARs, k-nearest neighbors, decision trees, random forests, and ensemble methods: all of the fancy machine learning words. It’s true that a lot of these techniques can be implemented using R or Python libraries, so it’s not necessarily a deal breaker if you’re not the world’s leading expert on how the algorithms work. More important is to understand the broad strokes and really understand when it is appropriate to use different techniques. There is a lot of literature that can help you get up to speed with R and Python in real-world scenarios.

Establishing a Data Science team

Nowadays, a number of companies are getting to the point where they have an increasingly large amount of data, and they’re looking for someone to set up a lot of the data infrastructure that the company will need moving forward. They’re also looking for someone to provide analysis. You’ll see job postings listed under both “Data Scientist” and “Data Engineer” for this type of position. Since you’d be (one of) the first data hires, there is likely a lot of low-hanging fruit, making it less important that you’re a statistics or machine learning expert.
A data scientist with a software engineering background might excel at a company like this, where it’s more important that a data scientist make meaningful data-like contributions to the production code and provide basic insights and analyses.
At this stage, be ready to bootstrap servers, install new virtual machines, set up networks, do plain DBA work, install Hadoop, and set up Oozie, Flume, Hive, etc. Many times I have been asked to set up SharePoint or web servers so that I could set up PerformancePoint as part of the reporting solution.
When a data team or BI team is being established in a company, you should be ready for literally every other IT task you can imagine (especially if you work in a startup), so a broad range of skills is really welcome here.
Expect to work mainly on infrastructure and legacy items for at least the first year, instead of crunching data and making shiny assumptions and reports.

Keeping up to date

There is certainly a lot of upcoming potential in this profession. And with that, the expectations placed on you in your company are growing rapidly.
As the industry becomes more inclined toward data analysis, you, working as a data analyst or data scientist, will be challenged to read all the time, whether it is business news or literature that will help you build or improve your models, reports, or work in general. There is a lot of research in this field, and a lot of new books coming out with titles mentioning data.
Data scientists today are akin to Wall Street “quants” of the 1980s and 1990s. In those days people with backgrounds in physics and math streamed to investment banks and hedge funds, where they could devise entirely new algorithms and data strategies. Then a variety of universities developed master’s programs in financial engineering, which churned out a second generation of talent that was more accessible to mainstream firms. The pattern was repeated later in the 1990s with search engineers, whose rarefied skills soon came to be taught in computer science programs.
One question raised by this is whether some firms would be wise to wait until that second generation of data scientists emerges, and the candidates are more numerous, less expensive, and easier to vet and assimilate in a business setting. Why not leave the trouble of hunting down and domesticating exotic talent to the big data start-ups and to firms whose aggressive strategies require them to be at the forefront?
The problem with that reasoning is that the advance of big data shows no signs of slowing. If companies sit out this trend’s early days for lack of talent, they risk falling behind as competitors and channel partners gain nearly unassailable advantages. Think of big data as an epic wave gathering now, starting to crest. If you want to catch it, you need people who can surf.