Practical-1
Aim: Data Pre-processing tasks in Python using Scikit-learn.
Theory:
Data Pre-processing:
- Pre-processing refers to the transformations applied to our data before feeding it to the algorithm.
- Data pre-processing is a technique used to convert raw data into a clean data set. In other words, data gathered from different sources is collected in a raw format that is not feasible for analysis.
Need of Data Pre-processing:
- To achieve better results from the applied model in a Machine Learning project, the data must be in a proper format. Some Machine Learning models need information in a specific format; for example, the Random Forest algorithm does not support null values, so null values must be handled in the raw data set before the algorithm can be run.
- Another aspect is that the data set should be formatted so that more than one Machine Learning or Deep Learning algorithm can be run on it, and the best of them chosen.
Data Pre-processing Technique:
- Standardization.
- Normalization.
- Encoding.
- Discretization.
- Imputation.
1. Standardization.
The result of standardization (or Z-score normalization) is that the features are rescaled so that their mean and standard deviation become 0 and 1, respectively. The equation is:

z = (x - mean) / standard deviation
2. Normalization.
This technique rescales feature values to a distribution between 0 and 1. It is useful for optimization algorithms, such as gradient descent, that are used within machine learning algorithms that weight inputs (e.g., regression and neural networks). Rescaling is also used by algorithms that rely on distance measurements, for example K-Nearest Neighbours (KNN). The aim of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values.
3. Encoding.
Label Encoding refers to converting the labels into numeric form so as to make them machine-readable. Machine learning algorithms can then better decide how those labels should be operated on. It is an important pre-processing step for structured data sets in supervised learning.
4. Discretization.
Discretization refers to the process of converting or partitioning continuous attributes, features, or variables into discretized or nominal attributes/features/intervals.
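As a minimal sketch of discretization, scikit-learn's KBinsDiscretizer can partition a continuous feature into equal-width ordinal bins (the data values here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Illustrative continuous values to be partitioned into 3 bins
X = np.array([[0.1], [0.5], [1.2], [2.8], [3.9], [5.0]])

# 'uniform' makes equal-width bins; 'ordinal' encodes each bin as an integer
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_binned = disc.fit_transform(X)
# Each original value is now replaced by its bin index (0, 1, or 2)
```

Other strategies such as "quantile" produce equal-frequency bins instead of equal-width ones.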
5. Imputation.
Imputation is the process of replacing missing data with substituted values. Because missing data can create problems for analysis, imputation is seen as a way to avoid the pitfalls of listwise deletion of cases that have missing values.
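A small sketch of mean imputation with scikit-learn's SimpleImputer, on made-up data containing NaN values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data with missing entries (np.nan)
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each missing value with the mean of its column
imp = SimpleImputer(strategy="mean")
X_filled = imp.fit_transform(X)
# Column 0 mean is 4.0, column 1 mean is 2.5, so those fill the gaps
```

Strategies such as "median" or "most_frequent" can be substituted when the mean is a poor fit for the feature's distribution.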
Data set Description:
The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two.
Attribute Information:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class: Iris Setosa, Iris Versicolour, Iris Virginica
Implementation:
Reading the Data:
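A minimal way to read the iris data set is through scikit-learn's bundled copy, loaded into a pandas DataFrame (the column name "class" is our choice):

```python
from sklearn.datasets import load_iris
import pandas as pd

# Load the iris data set bundled with scikit-learn
iris = load_iris()

# Put the 4 numeric attributes into a DataFrame and attach the class column
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["class"] = iris.target

print(df.head())
```

If the data is stored as a CSV file instead, `pd.read_csv` with the attribute names as headers serves the same purpose.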
Standardization:
Data standardization is the process of rescaling one or more attributes so that they have a mean value of 0 and a standard deviation of 1. Standardization assumes that your data has a Gaussian (bell curve) distribution.
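This step can be sketched with scikit-learn's StandardScaler; the small array here stands in for the iris attributes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: two features on very different scales
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Rescale each column to mean 0 and standard deviation 1
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
```

On the iris DataFrame, the same call would be applied to the four numeric attribute columns.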
Normalization:
Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is
to change the values of numeric columns in the dataset to a common
scale, without distorting differences in the ranges of values.
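A sketch of this rescaling with scikit-learn's MinMaxScaler, which maps each column to the [0, 1] range (the data values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative data: columns with very different ranges
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [5.0, 500.0]])

# Rescale each column so its minimum becomes 0 and its maximum becomes 1,
# preserving the relative differences within each column
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)
```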
Encoding Categorical features
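Label encoding of the class column can be sketched with scikit-learn's LabelEncoder; the string labels below are the standard iris class names:

```python
from sklearn.preprocessing import LabelEncoder

# String class labels as they might appear in a raw CSV
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

# LabelEncoder assigns integers to classes in sorted (alphabetical) order
le = LabelEncoder()
encoded = le.fit_transform(labels)
# le.inverse_transform(encoded) recovers the original strings
```

For nominal input features (rather than the target), OneHotEncoder is usually preferred, since label encoding imposes an artificial ordering.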
Question and Answers:
1. How do you decide the variance threshold in data reduction?
- The variance threshold depends on the probability density function of each feature's distribution; features whose variance falls below the chosen threshold carry little information and can be removed.
2. Does the model produce the same output on encoded data vs. the original data?
- No. After applying these pre-processing methods, the output changes, even if only by a small percentage.

