STEPS IN DATA PREPROCESSING:-
In most cases, removing rows with null values is not preferred. However, if we have millions of records & only a small number of them contain missing values, we can safely remove those rows.
The best way of dealing with null values is by filling them in. Depending on the type of data we have, we can fill the null values using the mean, median, mode, ffill (forward fill) or bfill (backward fill).
--> Let's see some of the use cases to fill null values:-
--> After filling in the null values with any of these methods, plot a graph comparing the old values (before filling in the null values) with the new values (after filling them in). This helps us see how much the distribution of the data changed due to the imputation.
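As a minimal sketch, filling null values with the mean, median, mode, ffill & bfill in pandas might look like this (the column names and values are hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing values
df = pd.DataFrame({"age": [25, np.nan, 30, 22, np.nan, 28],
                   "city": ["A", "B", np.nan, "A", "B", "A"]})

# Numeric column: fill with mean (or median for skewed data)
df["age_mean"] = df["age"].fillna(df["age"].mean())
df["age_median"] = df["age"].fillna(df["age"].median())

# Categorical column: fill with mode (most frequent value)
df["city_mode"] = df["city"].fillna(df["city"].mode()[0])

# ffill/bfill propagate the previous/next observed value
df["age_ffill"] = df["age"].ffill()
df["age_bfill"] = df["age"].bfill()
```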
Advantages And Disadvantages of Mean/Median Imputation:-
We can also fill null values by using machine-learning-based methods such as the KNN Imputer & the Iterative Imputer.
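A minimal sketch of model-based imputation with scikit-learn (assuming scikit-learn is installed; the tiny array is made up for illustration):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan], [5.0, 6.0]])

# KNNImputer: fills each missing value from the k nearest rows
knn = KNNImputer(n_neighbors=2)
X_knn = knn.fit_transform(X)

# IterativeImputer: models each feature as a function of the others
it = IterativeImputer(random_state=0)
X_it = it.fit_transform(X)
```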
3. Handling categorical columns
Let's discuss types of Encoding. There are two types of encoding
1) Nominal Encoding
2) Ordinal Encoding
Nominal encoding is used for features whose categories have no order or rank.
-->The different types of Nominal Encoding are
Among all the Nominal Encoding techniques, One Hot Encoding is preferred.
If the number of categories is greater than 2 & less than 7, use a One Hot Encoder. If we use a Label Encoder on a feature with more than 2 categories, the model may treat categories assigned higher numbers as more important, which introduces bias; One Hot Encoding removes this bias of higher numbers.
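A minimal One Hot Encoding sketch with scikit-learn's OneHotEncoder (the colour column is a made-up example):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal feature with 3 categories (no inherent order)
df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

ohe = OneHotEncoder(handle_unknown="ignore")
# one binary column per category; .toarray() converts the sparse output
encoded = ohe.fit_transform(df[["colour"]]).toarray()
```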
2. Ordinal Encoder
Ordinal Encoding is used for features whose categories have some order or rank.
Label Encoder assigns a unique number (starting from 0) to each category.
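For example, with scikit-learn's LabelEncoder (the size values are hypothetical); note that the numbers are assigned in alphabetical order of the categories:

```python
from sklearn.preprocessing import LabelEncoder

sizes = ["small", "medium", "large", "medium"]
le = LabelEncoder()
codes = le.fit_transform(sizes)
# alphabetical order: large -> 0, medium -> 1, small -> 2
```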
-->If NaN values are present in the data, then after Label Encoding the NaN values will also be placed into a separate category.
Advantages:-
1) Straightforward to implement
2) Does not require hours of variable exploration
3) Does not massively expand the feature space (number of columns in the dataset)
Disadvantages:-
1) Does not add any information that may make the variables more predictive
2) Does not keep the information of the ignored labels
-->We can also handle categorical columns by using the get_dummies function present in pandas.
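A short pd.get_dummies sketch (the city column is hypothetical); drop_first=True drops one dummy column, since its value is implied by the others:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi", "Pune"]})

# One binary column per category
dummies = pd.get_dummies(df, columns=["city"])

# drop_first=True keeps n-1 columns to avoid redundant information
dummies_reduced = pd.get_dummies(df, columns=["city"], drop_first=True)
```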
4. Handling Outliers
a. Remove outlier (Not recommended)
c. Convert outliers to nulls, then fill them in the same way as missing values
Which Machine Learning Models Are Sensitive To Outliers?
How to find out the outliers:-
1)By Using Box plot:-
We can find outliers by using a box plot: values below the box plot's minimum (lower whisker) or above its maximum (upper whisker) are considered outliers.
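The box-plot rule can be sketched numerically: the whiskers sit at 1.5 × IQR beyond the quartiles, and anything outside them is flagged (the series below is made up):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # box-plot whisker limits
outliers = s[(s < lower) | (s > upper)]
```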
5. Feature Selection
a. Manual Analysis
b. Univariate Selection
c. Feature Importance
d. Correlation Matrix with Heatmap
e. PCA (Principle component analysis)
-->For selecting the features manually we take the help of domain experts.
Ex:- While solving banking-domain problem statements, we take the help of people from the banking domain to select the features.
In univariate selection, we use the SelectKBest class from sklearn. SelectKBest internally applies the chi-square test and outputs a chi-square score for each feature. Based on these scores, we select the top features.
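A minimal SelectKBest sketch using the built-in iris dataset (chi-square requires non-negative feature values):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)  # all iris features are non-negative

selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 best features
X_new = selector.fit_transform(X, y)
# selector.scores_ holds one chi-square score per original feature
```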
Correlation Matrix with Heatmap
In this, we construct the correlation matrix and plot it as a heatmap; from the heatmap, we can identify which features are most important for predicting the output.
-->From the heatmap described above, we can observe that price_range is the output column, and with respect to the output column, ram has the highest correlation value of 0.92, followed by battery power, etc.
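A sketch of building a correlation matrix with pandas; the ram/price_range data here is synthetic, chosen only to mimic a strong positive correlation like the one described (plotting the heatmap itself would use seaborn's sns.heatmap(corr, annot=True)):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ram = rng.normal(size=200)
df = pd.DataFrame({
    "ram": ram,
    "price_range": 0.9 * ram + rng.normal(scale=0.2, size=200),  # strongly correlated
})

corr = df.corr()  # Pearson correlation matrix
```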
6. Scale your data (normalize data in a certain range)
MinMax Scaler, Standard Scaler, Robust Scaler
Scaling helps bring all columns into a particular range.
1)MinMax Scaler: -
The MinMax scaler converts the data to values between 0 & 1 by using the min-max formula.
-->The formula of the min-max scaler is:
X_scaled = (X - X.min) / (X.max - X.min)
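A minimal MinMaxScaler sketch (the three values are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [5.0], [9.0]])
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)  # applies (X - X.min) / (X.max - X.min)
```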
The Robust Scaler scales features using the median and quantiles: it subtracts the median from all the observations and then divides by the interquartile range (IQR).
IQR = 75th quantile - 25th quantile
X_scaled = (X - X.median) / IQR
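A RobustScaler sketch showing the median/IQR formula in action; the 100.0 is a deliberate outlier among otherwise small values:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100.0 is an outlier
scaler = RobustScaler()  # applies (X - median) / IQR
X_scaled = scaler.fit_transform(X)
# the median maps to 0, and the outlier no longer dominates the scale
```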
Some machine learning algorithms, such as linear and logistic regression, assume that the features are normally distributed.
If the data is not normally distributed, apply one of the transformations below:
- logarithmic transformation
- reciprocal transformation
- square root transformation
- exponential transformation (more general, you can use any exponent)
- box cox transformation
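These transformations can be sketched with NumPy on a made-up right-skewed sample (Box-Cox itself lives in scipy.stats.boxcox, which picks the best power automatically):

```python
import numpy as np

# Right-skewed, strictly positive sample (hypothetical data)
x = np.random.default_rng(1).exponential(size=1000) + 1e-9

x_log = np.log(x)      # logarithmic transformation
x_recip = 1.0 / x      # reciprocal transformation
x_sqrt = np.sqrt(x)    # square root transformation
x_exp = x ** 0.3       # exponential transformation (any exponent can be tried)
```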
Which Models require Scaling of the data?
3)Decision Tree-->Not Require
4)Random Forest-->Not Require
5)XG Boost-->Not Require
i.e. distance-based models & the models which use the concept of Gradient Descent require Scaling.
-->fit_transform is applied only on the training dataset; on the testing dataset, only transform is used. This is done to avoid data leakage.
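A sketch of the fit_transform/transform split (the data is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # statistics learned from training data only
X_test_s = scaler.transform(X_test)        # reuses the training mean/std: no leakage
```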
Let's discuss some of the Automated EDA libraries.
There are different kinds of Automated EDA libraries.
-->Some of them are
6)Pandas Visual Analysis