Data cleansing is one of the most important procedures for improving your model's performance.
Whenever we collect data, we usually need to clean it to make it as reliable as possible. Data cleaning can involve many kinds of assessment. For example, suppose a survey questionnaire was put online and responses were collected via a website.
Remove unwanted observations
Not all collected data should be used for building your models.
One common screening step is checking the inclusion criteria. If a participant does not fit the specified inclusion criteria, such as a particular education level or country of origin, then their responses should not be taken into consideration, and their data should be filtered out of the dataset before building your model.
The specific inclusion criteria depend on the goal of the research. For example, a study may want to explore only the responses from male participants. Similarly, if a study intends to examine a specific age group, then participants outside that age group should be removed from the dataset.
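The filtering step above can be sketched in pandas. The column names and criteria here (gender, an 18–40 age range) are illustrative assumptions, not part of any particular study:

```python
import pandas as pd

# Hypothetical survey responses; column names are illustrative.
responses = pd.DataFrame({
    "participant": [1, 2, 3, 4],
    "gender": ["male", "female", "male", "male"],
    "age": [25, 31, 17, 44],
})

# Keep only rows that meet the inclusion criteria:
# male respondents aged 18-40, in this toy example.
included = responses[
    (responses["gender"] == "male")
    & responses["age"].between(18, 40)
]
```

Boolean masks like this keep the filtering logic declarative and easy to audit against the study's stated criteria.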
Check for outliers
You should also check for outliers. When examining scores within the dataset, it is essential to watch for values that skew a variable too much. For example, if a study focused on test scores and the scores averaged around 80, a participant with a test score of 12 would most likely be considered an outlier and should be removed from the dataset.
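One common way to operationalize this check is Tukey's IQR rule; it is a sketch of one convention, not the only valid cutoff, and the 1.5 multiplier should be tuned to your data:

```python
import pandas as pd

# Illustrative test scores averaging around 80, plus one suspect value of 12.
scores = pd.Series([78, 82, 85, 80, 12, 79, 83])

# Tukey's rule: values beyond 1.5 * IQR outside the quartiles
# are flagged as outliers (1.5 is a convention, not a law).
q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1
cleaned = scores[scores.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```

The IQR rule is often preferable to a z-score cutoff on small samples, where an extreme value inflates the standard deviation enough to mask itself.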
Find missing data
Another common evaluation is the examination of missing data. Participants sometimes skip questions in a survey, leaving blank or missing values. For example, if a study had 40 survey questions and a few participants chose to answer only three of them, those participants contribute very little and should be removed from the dataset.
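Counting answered questions per participant makes this check mechanical. The 80% completeness threshold below is an assumption for illustration; choose one that fits your study:

```python
import pandas as pd
import numpy as np

# Toy survey with 5 questions; NaN marks a skipped question.
survey = pd.DataFrame({
    "q1": [4, np.nan, 5],
    "q2": [3, np.nan, 4],
    "q3": [5, 2, np.nan],
    "q4": [4, np.nan, 3],
    "q5": [2, np.nan, 5],
})

# Count answers per participant and keep only those who
# answered at least 80% of the questions.
answered = survey.notna().sum(axis=1)
kept = survey[answered >= 0.8 * survey.shape[1]]
```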
Common techniques for handling missing data:
- The most common technique is filling in missing data with a predefined value such as N/A or NaN. This mostly works when the column does not contribute much to your project.
- Applying the next consecutive value. This applies when you are working with continuous data, for example filling in dates or the following numbers in a sequence.
- Applying averages. This method is especially prevalent when working with financial data, such as stock data. A data broker will often miss quotes for specific days; in these cases, data scientists use the average of the previous and next day's values. Be careful if you opt for this technique, since averaged numbers don't always make sense. Good business knowledge is required when working with this technique.
- Excluding columns with more than 90% missing data. In this case, you can't do much with the feature, and applying other techniques would be highly inefficient.
Clean duplicate data
While doing data examination, you should always be mindful of duplicate data.
Duplicate observations most frequently arise during data collection, such as when you:
- combine datasets from multiple places,
- scrape data, or
- receive data from clients or other departments.
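A minimal sketch of deduplicating after combining two sources, using invented sample frames that overlap on one record:

```python
import pandas as pd

# Two sources that overlap on one record (id 2).
a = pd.DataFrame({"id": [1, 2], "score": [80, 85]})
b = pd.DataFrame({"id": [2, 3], "score": [85, 90]})

# Stack the sources, then drop rows that are exact duplicates.
combined = pd.concat([a, b], ignore_index=True)
deduped = combined.drop_duplicates()
```

If two sources can disagree on the same record, prefer `drop_duplicates(subset=["id"])` with an explicit rule for which copy wins, rather than relying on exact row matches.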