This tutorial explains how to use scikit-learn's univariate feature selection methods to select the top N features and the top P% of features using the mutual information statistic.
It works with an OpenML dataset on predicting who pays for internet access, with 10,108 observations and 69 columns.
This tutorial uses:
The data is from OpenML and is imported with the fetch_openml function from sklearn.datasets.
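A minimal sketch of the loading step. The tutorial does not name the OpenML dataset, so the dataset name below is an assumption to adjust for your own data:

```python
from sklearn.datasets import fetch_openml

def load_internet_usage():
    # "kdd_internet_usage" is a hypothetical dataset name; replace it
    # with the OpenML dataset you are working with.
    bunch = fetch_openml("kdd_internet_usage", as_frame=True)
    # bunch.frame is a pandas DataFrame holding features and target.
    return bunch.frame

# df = load_internet_usage()  # expected shape here: 10108 rows, 69 columns
```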
This should return a table resembling the following:
Split the data into target and features.
Drop the target leakage features, i.e. the columns recording the other payment options.
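The split and the leakage drop can be sketched on a toy DataFrame; all column names here are hypothetical stand-ins for the real dataset's columns:

```python
import pandas as pd

# Toy stand-in for the OpenML frame; column names are made up.
df = pd.DataFrame({
    "pays_self": [1, 0, 1, 0],    # target: does the user pay themselves?
    "pays_work": [0, 1, 0, 1],    # leakage: another "who pays" option
    "pays_school": [0, 0, 0, 0],  # leakage: another "who pays" option
    "age": [25, 40, 31, 52],
    "education": ["hs", "college", "college", "phd"],
})

# Target column, then drop it plus the leakage columns from the features.
y = df["pays_self"]
X = df.drop(columns=["pays_self", "pays_work", "pays_school"])
```

Keeping the other payment-option columns would let the model trivially infer the target, which is why they are removed before selection.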
Encode the categorical variables prior to feature selection.
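One-hot encoding with pandas is one common way to do this step; the columns below are illustrative:

```python
import pandas as pd

X = pd.DataFrame({
    "age": [25, 40, 31],
    "education": ["hs", "college", "phd"],  # categorical column
})

# get_dummies expands each categorical column into 0/1 indicator columns;
# numeric columns such as "age" pass through unchanged.
X_encoded = pd.get_dummies(X)
```

The mutual information scorers require numeric input, so encoding must happen before the selector is fit.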
This leaves 63 features after dropping the target leakage features.
Select the top 20 features.
Note that mutual_info_classif is used because this is a classification problem. For a regression problem, use mutual_info_regression instead.
The selector's get_support method can be used to generate the list of features that were kept.
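A self-contained sketch of SelectKBest with mutual information, run on synthetic data rather than the OpenML frame (the tutorial selects 20 features; k=5 here only because the toy data has 10 columns):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in: 200 rows, 10 numeric features.
X_arr, y = make_classification(n_samples=200, n_features=10, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])

# Score every feature with mutual information and keep the top k.
selector = SelectKBest(score_func=mutual_info_classif, k=5)
selector.fit(X, y)

# get_support() returns a boolean mask over the columns.
kept = X.columns[selector.get_support()]
```

On the tutorial's dataset, the same code with k=20 selects the top 20 of the 63 encoded features.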
Select the top 25% of features.
Again, the get_support method generates the list of features that were kept.
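The percentage-based variant swaps SelectKBest for SelectPercentile; this sketch again uses synthetic data in place of the OpenML frame:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

# Synthetic stand-in: 200 rows, 8 numeric features.
X_arr, y = make_classification(n_samples=200, n_features=8, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Keep the top 25% of features by mutual information score.
selector = SelectPercentile(score_func=mutual_info_classif, percentile=25)
selector.fit(X, y)

kept = X.columns[selector.get_support()]
```

With 8 features, percentile=25 keeps 2 of them; on the tutorial's 63 encoded features it would keep roughly 15.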