There are multiple ways to boost your predictive model's accuracy. Most of these steps are easy to implement, yet for many reasons data scientists fail to do proper data preparation and model tuning. In the end, they end up with average or below-average machine learning models.
Domain knowledge gives you the best possible chance of improving your machine learning model's accuracy. However, if every data scientist follows these simple technical steps, they will end up with good model accuracy even without being an expert in a particular field.
When we talk about boosting machine learning model accuracy, data cleansing is the first and most important step. Data cleansing techniques are usually performed on data at rest rather than data in motion. Cleansing attempts to find and remove or correct data that detracts from the quality, and thus the usability, of a dataset. The goal of data cleansing is to achieve consistent, complete, accurate, and uniform data.
Data cleansing uses statistical analysis tools to read and audit data against a list of predefined constraints. Data that violates these constraints is routed into a workflow for exception handling.
Data cleansing leads to high-quality data. When data is of excellent quality, it can be efficiently processed and analyzed, leading to insights that help the organization make better decisions. High-quality data is essential to business intelligence efforts and other types of data analytics, as well as better overall operational efficiency.
Typical data cleansing steps include:

- missing values treatment
- outlier removal
- fixing dirty data
Before building any machine learning model, make sure your data is thoroughly clean. That will make a huge difference later when you want to boost your machine learning model's accuracy.
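The three cleansing steps above can be sketched with pandas. This is a minimal illustration on a hypothetical toy frame (the column names, values, and the 3-sigma outlier rule are all assumptions, not a prescription):

```python
import pandas as pd
import numpy as np

# A tiny hypothetical frame showing all three problems
df = pd.DataFrame({
    "age": [25, np.nan, 31, 29, 120],                    # missing value + outlier
    "city": ["NYC", "nyc ", "Boston", "NYC", "Boston"],  # dirty labels
})

# 1. Missing values treatment: impute with the column median
df["age"] = df["age"].fillna(df["age"].median())

# 2. Outlier removal: drop rows more than 3 standard deviations from the mean
zscores = (df["age"] - df["age"].mean()) / df["age"].std()
df = df[zscores.abs() <= 3]

# 3. Fixing dirty data: make inconsistent category labels uniform
df["city"] = df["city"].str.strip().str.upper()
```

Which imputation strategy and outlier threshold to use is a per-dataset judgment call; the point is that each step is a one- or two-liner once you know your data.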
Normalize your data, i.e., rescale it onto a common scale, typically the range [0, 1].
In the simplest cases, normalization means adjusting values measured on different scales to a notionally common scale, often prior to averaging.
By doing this you will standardize the range of independent variables or features of data.
Data normalization in Python
When we do normalization, we rescale real-valued numeric attributes into the range [0, 1].
When we use a model that relies on the magnitude of values, for example the distance measures used in KNN or the coefficients in linear regression, it is beneficial to scale the training variables.
If we take the simple Iris data set as an example, we can do data normalization using Python with the following code:
```python
from sklearn.datasets import load_iris
from sklearn import preprocessing

# loading Iris dataset
iris = load_iris()

# separate train data vs target data
train = iris.data
test = iris.target

# normalize the data attributes
normalized_train = preprocessing.normalize(train)
```
If we use the command `print(train)`, we will see a sample of the train data:

```
[ 5.1  3.5  1.4  0.2]
[ 4.9  3.   1.4  0.2]
[ 4.7  3.2  1.3  0.2]
[ 4.6  3.1  1.5  0.2]
```

After normalization, we can see the normalized train data using `print(normalized_train)`:

```
[ 0.80377277  0.55160877  0.22064351  0.0315205 ]
[ 0.82813287  0.50702013  0.23660939  0.03380134]
[ 0.80533308  0.54831188  0.2227517   0.03426949]
[ 0.80003025  0.53915082  0.26087943  0.03478392]
```
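One caveat worth knowing: `preprocessing.normalize` rescales each sample (row) to unit length, which is why the rows above no longer contain any exact 0 or 1. If you instead want each feature (column) mapped into the range [0, 1], `MinMaxScaler` is the usual scikit-learn tool. A minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

iris = load_iris()
train = iris.data

# MinMaxScaler rescales each feature (column) into [0, 1],
# whereas preprocessing.normalize rescales each sample (row) to unit norm
scaler = MinMaxScaler()
minmax_train = scaler.fit_transform(train)

print(minmax_train.min(axis=0))  # every feature's minimum is 0
print(minmax_train.max(axis=0))  # every feature's maximum is 1
```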
Another method to boost machine learning accuracy is standardization. When we standardize data, we shift the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance).
If you use a model that relies on the distribution of attributes such as Gaussian processes, then it is useful to perform standardization.
To perform data standardization in Python use the following command:
```python
standardized_train = preprocessing.scale(train)
```
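We can verify that `scale` really produces zero mean and unit variance per attribute. A self-contained sketch on the same Iris data:

```python
from sklearn.datasets import load_iris
from sklearn import preprocessing

iris = load_iris()
train = iris.data

# scale() shifts each attribute to mean 0 and standard deviation 1
standardized_train = preprocessing.scale(train)

print(standardized_train.mean(axis=0))  # approximately 0 for every feature
print(standardized_train.std(axis=0))   # approximately 1 for every feature
```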
Turn categorical data into numeric variables
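Most algorithms expect numeric input, so categorical columns are typically one-hot encoded into 0/1 indicator variables. A minimal sketch with pandas (the `color` column and its values are a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# get_dummies creates one indicator column per category
dummies = pd.get_dummies(df, columns=["color"])
print(dummies.columns.tolist())
# ['color_blue', 'color_green', 'color_red']
```

scikit-learn's `OneHotEncoder` does the same job and is the better choice inside a pipeline, since it remembers the categories seen during training.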
Feature engineering is really useful when it comes to improving your machine learning model's accuracy.
Estimate feature importance
To boost your machine learning model's accuracy, it is really important to estimate your feature importance. Models with fewer but highly informative (and thus important) features often perform much better than models with a large number of features.
Estimating the influence of a given feature on a model's predictions is challenging and time-consuming, but once done right it really pays off. The idea is to keep only the features with the highest importance.
You can use the following commands in Python to determine feature importance:

```python
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(train, test)  # `test` holds the Iris target labels from earlier
print(model.feature_importances_)
```
Derive new features from one or more existing features
While doing feature engineering, try to really understand your data. For example, you can greatly improve your model's accuracy by splitting or joining certain columns so that they are easier for the machine learning algorithm to learn from. In this way, you create new features, derived from existing ones, that help the model better understand your data.
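Two common patterns are splitting one column into simpler parts and combining several columns into one. A minimal sketch with pandas (the column names and values here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2021-01-15", "2021-06-03"]),
    "width": [2.0, 3.0],
    "height": [4.0, 5.0],
})

# Split: break one datetime column into simpler derived features
df["signup_month"] = df["signup_date"].dt.month
df["signup_weekday"] = df["signup_date"].dt.dayofweek

# Join: combine two columns into a single derived feature
df["area"] = df["width"] * df["height"]
```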
Check for collinearity
In a correlation heatmap, the red diagonal is there because every value is perfectly correlated with itself. Any other strongly red or blue cells show a strong correlation or anti-correlation that requires more investigation. For example, Resource=1 and Resource=4 might be highly (anti-)correlated in the sense that if people have 1 there is less chance they have 4. Regression assumes that the predictors used are independent of one another.
Look at alternative models given the underlying features and the aim of the project
Trying new models is exciting. Besides, different models interpret the data in different ways and can thus give different results.
Don’t get stuck with your random forest or regression model just because you have always used them and they performed well. If you want to make a difference and really boost your model accuracy, you need to do some experimentation.
Please bear in mind that not every machine learning algorithm is suited to every problem you are trying to solve.
Ensemble modeling is a well-established means of improving prediction accuracy; it averages out noise from diverse models and thereby enhances the generalizable signal. Model stacking is an efficient ensemble method in which the predictions generated by different learning algorithms are used as inputs to a second-level learning algorithm. This second-level algorithm is trained to optimally combine the model predictions to form a final set of predictions.
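scikit-learn ships this idea as `StackingClassifier`. A minimal sketch on Iris, with an arbitrarily chosen pair of first-level learners and a logistic regression as the second-level combiner:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# First-level learners each produce their own predictions...
estimators = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# ...which the second-level learner is trained to combine optimally
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```

The choice of base learners matters: stacking helps most when they make different kinds of errors, so diverse model families are preferable to several variants of the same algorithm.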
When we want to improve the accuracy of our machine learning model, we really have to be patient and prudent. We need to be open to a lot of experimentation and failure.
But we should also be mindful of how much time we spend on boosting our machine learning model's accuracy, and carefully weigh whether all that hard work is really necessary or whether a model with modest accuracy already solves our problem well enough.