Data Cleaning


Data cleaning is one of the most important procedures for improving your model’s performance.

Most of the time when we collect data, we must clean it to ensure that it is as accurate and complete as it can be.  Data cleaning can involve many assessments.  For example, let’s say a survey questionnaire was put online and data was collected via a website.

Remove unwanted observations

Not all data should be used for building your models.
Let’s say a data screening evaluation defines the inclusion criteria.  If a participant does not fit the specified inclusion criteria, for example a specific education level or country of origin, then their responses should not be taken into consideration, and their data should be filtered out of the dataset used for building your model.
Specific inclusion criteria depend on the goal of the research.  For example, a study may only want to explore the responses of male participants.  Likewise, if a study intends to examine a specific age group, then participants who do not fit into that age group should be removed from the dataset.
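
The filtering step described above can be sketched in plain Python. This is a minimal illustration, not a prescribed method: the field names (`gender`, `age`, `country`), the sample rows, and the helper `meets_criteria` are all assumptions made up for the example.

```python
# Hypothetical survey responses; the field names are assumptions for illustration.
responses = [
    {"id": 1, "gender": "male", "age": 34, "country": "US"},
    {"id": 2, "gender": "female", "age": 29, "country": "US"},
    {"id": 3, "gender": "male", "age": 17, "country": "DE"},
    {"id": 4, "gender": "male", "age": 52, "country": "US"},
]


def meets_criteria(row, gender=None, min_age=None, max_age=None):
    """Return True if the row satisfies every criterion that was specified."""
    if gender is not None and row["gender"] != gender:
        return False
    if min_age is not None and row["age"] < min_age:
        return False
    if max_age is not None and row["age"] > max_age:
        return False
    return True


# Example study: male participants aged 18 to 65 only.
included = [
    r for r in responses
    if meets_criteria(r, gender="male", min_age=18, max_age=65)
]
included_ids = [r["id"] for r in included]
```

Keeping the criteria in one function makes it easy to adjust them when the goal of the research changes.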

Check for outliers


This is how an outlier looks in a scatter plot

You should also check for outliers. When examining scores within the dataset, it is essential not to have values that skew a variable too much.  For example, if a study focused on test scores, and the scores averaged around 80, a participant with a test score of 12 would most likely be considered an outlier and should be removed from the dataset.
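
One common way to flag such values is the interquartile range (IQR) rule, which the text above does not name, so treat it as one possible choice rather than the only one. The sketch below uses Python's standard `statistics` module; the sample scores, the helper `iqr_outliers`, and the conventional multiplier of 1.5 are assumptions for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical test scores averaging around 80, with one score of 12.
scores = [78, 82, 80, 85, 79, 81, 77, 83, 12, 80]

outliers = iqr_outliers(scores)
cleaned = [s for s in scores if s not in outliers]
```

Here only the score of 12 falls outside the computed bounds. Whether a flagged value should really be dropped still depends on what you know about how the data was collected.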

Outliers seen in a bar plot

Find missing data

Missing data in time series

Another evaluation that often occurs is the examination of missing data.  While taking a survey, participants sometimes skip questions in the questionnaire and leave blank or missing data.  For example, if a study had 40 survey questions and a few participants chose to answer only three of them, then those participants do not contribute much and should be removed from the dataset.
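
A minimal sketch of that check in plain Python follows. It assumes skipped questions are stored as `None`; the participant IDs, the answers, and the 50% cutoff `MIN_ANSWERED` are made-up values, and the right cutoff depends on your study.

```python
# Hypothetical answers per participant; None marks a skipped question.
answers = {
    "p1": [5, 4, None, 3, 2],
    "p2": [None, None, None, 1, None],
    "p3": [4, 4, 5, None, 3],
}


def answered_fraction(row):
    """Fraction of questions the participant actually answered."""
    return sum(a is not None for a in row) / len(row)


MIN_ANSWERED = 0.5  # assumed threshold, not a universal rule

# Keep only participants who answered at least half of the questions.
kept = {
    pid: row for pid, row in answers.items()
    if answered_fraction(row) >= MIN_ANSWERED
}
```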

Techniques for fixing missing data:

  1. The most common approach is to fill missing data with a predefined value, such as N/A or NaN. This mostly works when the column doesn’t really benefit your project.

    Filling up missing data with a static value NaN
  2. Apply the next consecutive value. This applies when you are working with sequential data, for example filling in dates or consecutive numbers.
  3. Apply averages. This method is especially prevalent when working with financial data, like stock data. There are many cases when the data broker misses quotes for specific days. In these cases, data scientists use the average value between the previous and the next day. You need to be really careful if you opt for this technique, since averaged numbers don’t always make sense. Good business knowledge is required when working with this technique.
  4. Exclude columns with more than 90% missing data. In these cases, you can’t do much with the feature, and applying other techniques will be highly inefficient.
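
The four techniques above can be sketched with plain Python lists. This is an illustration under stated assumptions: missing values are represented as `None`, and the helper names (`fill_static`, `fill_next`, `fill_neighbor_average`, `drop_sparse_columns`) and the sample data are hypothetical, not part of any library.

```python
def fill_static(values, placeholder="N/A"):
    """Technique 1: replace every gap with a predefined value."""
    return [placeholder if v is None else v for v in values]


def fill_next(values):
    """Technique 2: replace each gap with the next observed value."""
    filled = list(values)
    next_seen = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = next_seen  # stays None if nothing follows
        else:
            next_seen = filled[i]
    return filled


def fill_neighbor_average(values):
    """Technique 3: fill a gap with the mean of the previous and next
    observations; gaps without both neighbors are left as None."""
    filled = list(values)
    for i in range(1, len(values) - 1):
        prev, nxt = values[i - 1], values[i + 1]
        if values[i] is None and prev is not None and nxt is not None:
            filled[i] = (prev + nxt) / 2
    return filled


def drop_sparse_columns(table, max_missing=0.9):
    """Technique 4: drop columns whose missing share exceeds max_missing."""
    kept = {}
    for name, column in table.items():
        missing = sum(v is None for v in column) / len(column)
        if missing <= max_missing:
            kept[name] = column
    return kept


# Hypothetical daily closing prices with missing quotes.
closes = [10.0, None, 12.0, None, None, 15.0]

static_filled = fill_static(closes)
next_filled = fill_next(closes)
average_filled = fill_neighbor_average(closes)

# Hypothetical table: "note" is 95% empty, so it is dropped.
table = {"price": [1.0] * 20, "note": [None] * 19 + ["x"]}
dense_table = drop_sparse_columns(table)
```

Note that the averaging helper deliberately leaves the two consecutive gaps untouched, because neither has an observed value on both sides. That is the kind of case where business knowledge has to decide what to do next.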

Clean duplicate data

While examining your data, you should always be mindful of duplicates.

Duplicate observations most frequently arise during data collection, such as when you combine datasets from multiple places, scrape data, or receive data from clients or other departments.
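
As a rough sketch of removing duplicates after combining two sources, the plain-Python example below keeps the first occurrence of each record. The sample records, the field names, and the helper `drop_duplicates` are assumptions for illustration; which fields identify a duplicate depends on your data.

```python
# Hypothetical records collected from two places; one record overlaps.
from_site_a = [
    {"email": "ann@example.com", "score": 80},
    {"email": "bob@example.com", "score": 75},
]
from_site_b = [
    {"email": "bob@example.com", "score": 75},
    {"email": "cy@example.com", "score": 90},
]


def drop_duplicates(rows, key_fields):
    """Keep the first row for each distinct combination of key_fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


combined = from_site_a + from_site_b
deduped = drop_duplicates(combined, key_fields=("email", "score"))
deduped_emails = [r["email"] for r in deduped]
```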
