How to boost your machine learning model accuracy


There are multiple ways to boost your predictive model's accuracy. Most of these steps are easy to implement, yet for many reasons data scientists fail to do proper data preparation and model tuning. In the end, they end up with average or below-average machine learning models.
Having domain knowledge gives you the best possible chance of improving your machine learning model's accuracy. However, any data scientist who follows these simple technical steps can end up with great model accuracy, even without being an expert in a particular field.

Data Cleansing

When we talk about boosting machine learning model accuracy, data cleansing is the first and most important step. Data cleansing techniques are usually performed on data at rest rather than data in motion. The process attempts to find and remove or correct data that detracts from the quality, and thus the usability, of the dataset. The goal of data cleansing is to achieve consistent, complete, accurate, and uniform data.

Data cleansing uses statistical analysis tools to read and audit data based on a list of predefined constraints. Data that violates these constraints is put into a workflow for exception handling.

Data cleansing leads to high-quality data. When data is of excellent quality, it can be efficiently processed and analyzed, leading to insights that help the organization make better decisions. High-quality data is essential to business intelligence efforts and other types of data analytics, as well as better overall operational efficiency.

  • Missing values treatment (sketched below)
  • Outlier removal (sketched below)
  • Fixing dirty data
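
As a minimal sketch of the first two items, assuming a hypothetical pandas DataFrame with a numeric "age" column, missing values treatment and outlier removal could look like this:

import pandas as pd
import numpy as np

# hypothetical data with gaps and an obvious outlier
df = pd.DataFrame({"age": [23, 31, np.nan, 29, 250, 27, np.nan, 35]})

# missing values treatment: fill gaps with the column median
df["age"] = df["age"].fillna(df["age"].median())

# outlier removal: keep only rows within 1.5 * IQR of the quartiles
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

The column name and the median/IQR choices are just illustrations; the right treatment always depends on your data.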

Before you start building any machine learning model, make sure that your data is clean. That will make a huge difference later when you want to boost your machine learning model accuracy.

I talk more about data cleansing in this article.

Data Normalization

In the simplest cases, normalization means adjusting values measured on different scales to a notionally common scale, often prior to averaging.

By doing this you will standardize the range of independent variables or features of data.

Data normalization in Python

When we normalize, we rescale real-valued numeric attributes into the range 0 to 1. (scikit-learn's preprocessing.normalize does this by rescaling each sample to unit norm, which is why every row in the output below has values between 0 and 1.)

When we use a model that relies on the magnitude of values, for example the distance measures used in k-nearest neighbors (KNN) or the coefficients in linear regression, it is beneficial to scale the training variables.

If we take the simple Iris data set as an example, we can do data normalization using Python with the following code:

from sklearn.datasets import load_iris
from sklearn import preprocessing
# load the Iris dataset
iris = load_iris()
# separate the feature data from the target labels
train = iris.data
target = iris.target
# normalize the data attributes
normalized_train = preprocessing.normalize(train)

If we use the command:

print(train)

we will see a sample of the train data:
 [ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]

After normalization, we can see the normalized train data using:
print(normalized_train)

 [ 0.80377277  0.55160877  0.22064351  0.0315205 ]
 [ 0.82813287  0.50702013  0.23660939  0.03380134]
 [ 0.80533308  0.54831188  0.2227517   0.03426949]
 [ 0.80003025  0.53915082  0.26087943  0.03478392]

More info here.

Data Standardization


Another method to boost machine learning accuracy is standardization. When we standardize data, we shift the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance).

If you use a model that relies on the distribution of the attributes, such as a Gaussian process, then it is useful to perform standardization.

To perform data standardization in Python use the following command:

standardized_train = preprocessing.scale(train)
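
As a quick sanity check on the result, after preprocessing.scale each column of the standardized data should have a mean of roughly zero and a standard deviation of roughly one:

# each attribute should now have mean ~0 and standard deviation ~1
print(standardized_train.mean(axis=0))
print(standardized_train.std(axis=0))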

Turn categorical data into dummy variables
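
Most algorithms cannot consume strings directly, so categorical columns need to be encoded as numbers first. A minimal sketch, assuming a hypothetical "color" column and using pandas get_dummies: each category becomes its own 0/1 indicator variable.

import pandas as pd

# hypothetical categorical column
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# one-hot encode: one 0/1 indicator column per category
dummies = pd.get_dummies(df["color"], prefix="color")
print(dummies)

scikit-learn's OneHotEncoder achieves the same thing inside a preprocessing pipeline.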

Feature engineering

Feature engineering is really useful when it comes to improving your machine learning model's accuracy.

Estimate feature importance

To boost your machine learning model accuracy, it is really important to estimate your feature importance. Many times, models with fewer but highly informative (thus important) features perform much better than models with a lot of features.

Estimating the influence of a given feature on a model's predictions is challenging and time-consuming, but once done right it really pays off. The idea is to keep only the features that matter most to the model.

You can use the following commands in Python to determine feature importance:

from sklearn.ensemble import ExtraTreesClassifier
# fit a tree ensemble and read how much each feature contributes
model = ExtraTreesClassifier()
model.fit(train, target)
print(model.feature_importances_)
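
Once the importances are computed, one way to act on them is scikit-learn's SelectFromModel, which keeps only the features whose importance exceeds a threshold (by default the mean importance). A minimal sketch, reusing the fitted model from above:

from sklearn.feature_selection import SelectFromModel

# keep only features with above-average importance
selector = SelectFromModel(model, prefit=True)
reduced_train = selector.transform(train)
print(reduced_train.shape)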

Derive new features from one or more existing features

While doing feature engineering, try to really understand your data. For example, you can greatly improve your machine learning model's accuracy if you split or join certain columns in ways that make them easier for the machine learning algorithm to understand. In this way, you create new features, derived from the existing ones, that help the model better understand your data.
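
As a hedged illustration with hypothetical columns: a raw timestamp is hard for most algorithms to use, but the month and weekday derived from it often carry a clear signal, and a ratio of two columns can express their relationship directly.

import pandas as pd

# split: derive month and weekday from a hypothetical timestamp column
df = pd.DataFrame({"order_date": pd.to_datetime(["2017-01-15", "2017-06-03"])})
df["order_month"] = df["order_date"].dt.month
df["order_weekday"] = df["order_date"].dt.weekday

# join: combine two hypothetical columns into a single ratio feature
df2 = pd.DataFrame({"price": [100.0, 250.0], "area": [50.0, 100.0]})
df2["price_per_area"] = df2["price"] / df2["area"]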

Check for collinearity

(Figure: correlation matrix heatmap of the features)

The red diagonal is there because every value is perfectly correlated with itself. However, any other red or blue cells show a strong correlation/anti-correlation that requires more investigation. For example, Resource=1 and Resource=4 might be highly correlated in the sense that if people have 1, there is less chance they have 4. Regression assumes that the predictors used are independent of one another.
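
A simple way to produce such a check, sketched here on the Iris features loaded earlier: compute the pairwise correlation matrix with pandas and flag any pair above a chosen threshold (0.8 is just an illustrative cut-off).

import pandas as pd

# pairwise Pearson correlations between the features
df = pd.DataFrame(train, columns=iris.feature_names)
corr = df.corr()

# flag strongly correlated/anti-correlated feature pairs
threshold = 0.8
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if abs(corr.loc[a, b]) > threshold:
            print(a, "and", b, ":", corr.loc[a, b])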

Look at alternative models given the underlying features and the aim of the project

Trying new models is exciting. Besides, different models can interpret the data in different ways and thus give different results.

Don’t be stuck with your random forest or regression model just because you have always used them and they performed well. If you want to make a difference and really boost your model accuracy, you need to do some experimentation.

Please bear in mind that not all machine learning algorithms can be used for every problem you are trying to solve.
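
One low-effort way to experiment, sketched here with the Iris data from earlier: run a few candidate models through the same cross-validation and compare their mean scores before committing to one.

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# compare candidate models on identical 5-fold splits
candidates = [("random forest", RandomForestClassifier()),
              ("logistic regression", LogisticRegression(max_iter=1000)),
              ("SVM", SVC())]
for name, candidate in candidates:
    scores = cross_val_score(candidate, train, target, cv=5)
    print(name, scores.mean())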

Ensemble models

Ensemble modeling is a well-established means for improving prediction accuracy; it enables you to average out noise from diverse models and thereby enhance the generalizable signal. Basic stacked ensemble techniques combine predictions from multiple machine learning algorithms and use these predictions as inputs to second-level learning models.

Model stacking is an efficient ensemble method in which the predictions that are generated by using different learning algorithms are used as inputs in a second-level learning algorithm. This second-level algorithm is trained to optimally combine the model predictions to form a final set of predictions.
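
scikit-learn (0.22 and later) ships this exact pattern as StackingClassifier. A minimal sketch on the Iris data from earlier, with a random forest and an SVM as first-level learners and logistic regression as the second-level combiner:

from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# first-level learners whose predictions become second-level inputs
base_learners = [("rf", RandomForestClassifier()), ("svm", SVC())]

# logistic regression learns how to combine the first-level predictions
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))
stack.fit(train, target)
print(stack.score(train, target))  # training score, for illustration only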

When we want to improve the accuracy of our machine learning model, we have to be patient and prudent. We need to be open to a lot of experimentation and failure.
But we should also be mindful of how much time we spend on boosting our machine learning model accuracy, and carefully weigh whether all that hard work is really necessary or whether our model already solves the problem well enough with modest accuracy.
