Practical-2

 Aim: Data Pre-processing (Feature Selection/Elimination) tasks using Python


What is feature selection?

Feature Selection is a pre-processing step that chooses a subset of the original features according to an evaluation criterion to get better outcomes, such as removing redundant data, reducing dimensionality, and increasing learning accuracy.


Why feature selection?

- Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.

- Improves Accuracy: Less misleading data means modeling accuracy improves.

- Reduces Training Time: Fewer data points reduce algorithm complexity, so algorithms train faster.


Different methods of feature selection.

Variance Threshold

    - Variance Threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.
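A minimal sketch of this baseline using scikit-learn's `VarianceThreshold` on a toy matrix (the data here is illustrative, not the Pima dataset):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the last column is constant (zero variance)
X = np.array([
    [0, 2, 1],
    [1, 4, 1],
    [0, 6, 1],
    [1, 8, 1],
])

selector = VarianceThreshold()  # default threshold=0.0 drops constant features
X_reduced = selector.fit_transform(X)

print(X_reduced.shape)       # the constant third column is removed
print(selector.get_support())  # mask showing which columns survived
```

Raising the `threshold` argument removes low-variance (not just constant) features as well.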

Univariate feature selection

    - Univariate feature selection examines each feature individually to determine the strength of the relationship of the feature with the response variable.

Recursive Feature Elimination

    - Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached.

PCA:

    - Principal component analysis (PCA) transforms the original features into a smaller set of uncorrelated principal components that preserve as much of the variance in the data as possible. Strictly speaking, PCA performs feature extraction rather than selection, since each component is a combination of the original variables.


Correlation:

    - A good feature is highly correlated with the class but not redundant with any other relevant feature. Correlation-based feature selection consists of two stages: selecting features that are relevant to the class, and identifying redundant features and eliminating them from the original dataset.


Dataset Description

The dataset used here is the Pima Indians diabetes dataset, which has 768 entries and consists of 9 attributes, including one binary class variable (Outcome).

Task 1: Univariate feature selection
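A possible sketch using scikit-learn's `SelectKBest` with the ANOVA F-test. Synthetic stand-in data of the same shape as Pima is generated here to keep the snippet self-contained; in the practical, the Pima CSV would be loaded instead (the file name `diabetes.csv` is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Stand-in for the Pima data (768 rows, 8 features + binary outcome);
# in the practical: df = pd.read_csv("diabetes.csv") and split off Outcome
X, y = make_classification(n_samples=768, n_features=8, n_informative=4,
                           random_state=42)

# Score each feature independently against the class, keep the top 4
selector = SelectKBest(score_func=f_classif, k=4)
X_new = selector.fit_transform(X, y)

print(X_new.shape)             # (768, 4)
print(selector.get_support())  # boolean mask of the chosen columns
print(selector.scores_)        # per-feature F-scores
```

For non-negative count data, `chi2` is a common alternative to `f_classif` as the score function.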



Task 2: Recursive Feature Elimination
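A possible sketch of RFE wrapped around logistic regression, again on synthetic stand-in data rather than the Pima CSV (the choice of estimator and of keeping 3 features are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=768, n_features=8, n_informative=4,
                           random_state=42)

# RFE fits the estimator, removes the weakest feature, and repeats
# until only n_features_to_select remain
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator, n_features_to_select=3)
rfe.fit(X, y)

print(rfe.support_)  # mask of the 3 surviving features
print(rfe.ranking_)  # rank 1 = selected; higher ranks were eliminated earlier
```

Any estimator that exposes `coef_` or `feature_importances_` (e.g. a decision tree) can serve as the base model.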



Task 3: Principal component analysis
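A possible sketch of PCA with scikit-learn, standardising first since PCA is sensitive to feature scale. Synthetic stand-in data replaces the Pima CSV, and reducing to 3 components is an illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=768, n_features=8, random_state=42)

# Standardise so no feature dominates purely because of its units
X_std = StandardScaler().fit_transform(X)

# Project the 8 standardised features onto 3 principal components
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_std)

print(X_pca.shape)                    # (768, 3)
print(pca.explained_variance_ratio_)  # fraction of variance per component
```

Inspecting `explained_variance_ratio_` (or passing a float like `n_components=0.95`) helps decide how many components to keep.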






Comments