In the past years we have been all witnesses of the growing demand on machine learning, predictive analytics, data analysis. Why is that so?
Well, it is quite simple. Like E.O Wilson said,
“We are drowning in information, while starving for wisdom. The world henceforth will be run by synthesizers, people able to put together the right information at the right time, think critically about it, and make important choices wisely.”
Every possible device I can think of is generating some data feed. But what to do with this data is always the big question.
Business people for sure will have some sparkles of ideas, marketing people will try to sell the perfect fit all models for optimizing your marketing campaigns and bringing all your needed customers, business controllers will try to predict what the sales going to look alike in the next quarter and so on.. All that by using the raw data generated by everything around us.
The big questions is how doing all this smart analysis?
That is where all the fancy names and companies with shiny solutions come in the game, of course that comes with a price, usually, big price.
But on the other side of the game, this is where all smart people come into the game. Lately, these people are called Data Scientist, Data Anlayst, BI professionals and all possible varieties of names.
What does Data Scientist / Data Analyst really do?
Working as a part of already established Data Science team
Most of them do tasks like pulling data out of MySQL, MSSQL, Sybase and all other databases on the market, becoming a master at Excel pivot tables, and producing basic data visualizations (e.g., line and bar charts). You may on occasion analyze the results of an A/B test or take the lead on your company’s Google Analytics account.
Also here you will have to build predictive models that will forecast your client’s mood, possible new market openings, product forecasting or time series analysis.
This is where the basic statistics come really handy.
You should be familiar with statistical tests, distributions, maximum likelihood estimators, etc.
This will also be the case for machine learning, but one of the more important aspects of your statistics knowledge will be understanding when different techniques are (or aren’t) a valid approach. Statistics are important in all company types, but especially data-driven companies where the product is not data-focused and product stakeholders will depend on your help to make decisions and design / evaluate experiments.
If you’re dealing huge amounts of data, or working at a company where the product itself is especially data-driven, it may be the case that you’ll want to be familiar with machine learning methods. At this time classical statistical methods might not always work and you might be facing a time when you need to work with all data, instead of a sample of the whole data set, like you would do if you follow a conventional statistical approach of analyzing your data.
This can mean things like ARIMA models, Neural Networks, SVM, VARs, k-nearest neighbors, decision trees, random forests, ensemble methods – all of the machine learning fancy words. It’s true that a lot of these techniques can be implemented using R or Python libraries – because of this, it’s not necessarily a deal breaker if you’re not the world’s leading expert on how the algorithms work. More important is to understand the broad strokes and really understand when it is appropriate to use different techniques. There is a lot of literature that can help you in getting up to speed with R and Python in real case scenarios.
Establishing Data Science team
Nowadays, number of companies are getting to the point where they have an increasingly large amount of data, and they’re looking for someone to set up a lot of the data infrastructure that the company will need moving forward. They’re also looking for someone to provide analysis. You’ll see job postings listed under both “Data Scientist” and “Data Engineer” for this type of position. Since you’d be (one of) the first data hires, there are likely many low-hanging fruit, making it less important that you’re a statistics or machine learning expert.
A data scientist with a software engineering background might excel at a company like this, where it’s more important that a data scientist make meaningful data-like contributions to the production code and provide basic insights and analyses.
At this time be ready to bootstrap servers, installations of new virtual machines, setting up networks, plain DBA work, Hadoop installation, setting up Oozie, Flume, Hive, etc. Many times I have been asked to set up Share Point or Web Servers, so I can set up Performance Point as part of the reporting solution.
In times of establishing Data Team or BI teams in a company, you should be ready literally for every other IT tasks you can imagine (especially if you work in a startup), so broad range of skills is really welcomed here.
Expect at least the first year to work mainly on infrastructure and legacy items instead of crunching data and making shiny assumptions and reports.
Keeping up to date
There is certainly a lot of upcoming potential in this profession. And with that the expectance from you in your company is growing exponentially.
As the industry is getting more inclined in data analysis, you working as Data Analyst/Scientist will be challenged to read all the time, whether is Business News or literature that will help you build or improve your models, reports or work in general. There is a lot of research in this field and a lot of new books going out with titles mentioning Data.
Data scientists today are akin to Wall Street “quants” of the 1980s and 1990s. In those days people with backgrounds in physics and math streamed to investment banks and hedge funds, where they could devise entirely new algorithms and data strategies. Then a variety of universities developed master’s programs in financial engineering, which churned out a second generation of talent that was more accessible to mainstream firms. The pattern was repeated later in the 1990s with search engineers, whose rarefied skills soon came to be taught in computer science programs.
One question raised by this is whether some firms would be wise to wait until that second generation of data scientists emerges, and the candidates are more numerous, less expensive, and easier to vet and assimilate in a business setting. Why not leave the trouble of hunting down and domesticating exotic talent to the big data start-ups and to firms whose aggressive strategies require them to be at the forefront?
The problem with that reasoning is that the advance of big data shows no signs of slowing. If companies sit out this trend’s early days for lack of talent, they risk falling behind as competitors and channel partners gain nearly unassailable advantages. Think of big data as an epic wave gathering now, starting to crest. If you want to catch it, you need people who can surf.