This tutorial explains how to use scikit-learn's univariate feature selection methods to select the top N features and the top P% of features using the mutual information statistic.
It works with an OpenML dataset on who pays for internet access, containing 10108 observations and 69 columns.
Packages
This tutorial uses:
import pandas as pd
from sklearn.datasets import fetch_openml
import category_encoders as ce
from sklearn.feature_selection import SelectKBest, SelectPercentile, mutual_info_classif
Reading the Data
The data comes from OpenML and is imported with fetch_openml from sklearn.datasets.
data = fetch_openml(name='kdd_internet_usage', as_frame=True)  # as_frame=True ensures a pandas DataFrame is returned
df = data.frame
df.info()
This prints a summary resembling the following:
RangeIndex: 10108 entries, 0 to 10107
Data columns (total 69 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Actual_Time 10108 non-null category
1 Age 10108 non-null category
2 Community_Building 10108 non-null category
3 Community_Membership_Family 10108 non-null category
4 Community_Membership_Hobbies 10108 non-null category
5 Community_Membership_None 10108 non-null category
6 Community_Membership_Other 10108 non-null category
7 Community_Membership_Political 10108 non-null category
8 Community_Membership_Professional 10108 non-null category
9 Community_Membership_Religious 10108 non-null category
10 Community_Membership_Support 10108 non-null category
11 Country 10108 non-null category
12 Disability_Cognitive 10108 non-null category
13 Disability_Hearing 10108 non-null category
14 Disability_Motor 10108 non-null category
15 Disability_Not_Impaired 10108 non-null category
16 Disability_Not_Say 10108 non-null category
17 Disability_Vision 10108 non-null category
18 Education_Attainment 10108 non-null category
19 Falsification_of_Information 10108 non-null category
20 Gender 10108 non-null category
21 Household_Income 10108 non-null category
22 How_You_Heard_About_Survey_Banner 10108 non-null category
23 How_You_Heard_About_Survey_Friend 10108 non-null category
24 How_You_Heard_About_Survey_Mailing_List 10108 non-null category
25 How_You_Heard_About_Survey_Others 10108 non-null category
26 How_You_Heard_About_Survey_Printed_Media 10108 non-null category
27 How_You_Heard_About_Survey_Remebered 10108 non-null category
28 How_You_Heard_About_Survey_Search_Engine 10108 non-null category
29 How_You_Heard_About_Survey_Usenet_News 10108 non-null category
30 How_You_Heard_About_Survey_WWW_Page 10108 non-null category
31 Major_Geographical_Location 10108 non-null category
32 Major_Occupation 10108 non-null category
33 Marital_Status 10108 non-null category
34 Most_Import_Issue_Facing_the_Internet 10108 non-null category
35 Opinions_on_Censorship 10108 non-null category
36 Primary_Computing_Platform 7409 non-null category
37 Primary_Language 10108 non-null category
38 Primary_Place_of_WWW_Access 10108 non-null category
39 Race 10108 non-null category
40 Not_Purchasing_Bad_experience 10108 non-null category
41 Not_Purchasing_Bad_press 10108 non-null category
42 Not_Purchasing_Cant_find 10108 non-null category
43 Not_Purchasing_Company_policy 10108 non-null category
44 Not_Purchasing_Easier_locally 10108 non-null category
45 Not_Purchasing_Enough_info 10108 non-null category
46 Not_Purchasing_Judge_quality 10108 non-null category
47 Not_Purchasing_Never_tried 10108 non-null category
48 Not_Purchasing_No_credit 10108 non-null category
49 Not_Purchasing_Not_applicable 10108 non-null category
50 Not_Purchasing_Not_option 10108 non-null category
51 Not_Purchasing_Other 10108 non-null category
52 Not_Purchasing_Prefer_people 10108 non-null category
53 Not_Purchasing_Privacy 10108 non-null category
54 Not_Purchasing_Receipt 10108 non-null category
55 Not_Purchasing_Security 10108 non-null category
56 Not_Purchasing_Too_complicated 10108 non-null category
57 Not_Purchasing_Uncomfortable 10108 non-null category
58 Not_Purchasing_Unfamiliar_vendor 10108 non-null category
59 Registered_to_Vote 10108 non-null category
60 Sexual_Preference 10108 non-null category
61 Web_Ordering 10108 non-null category
62 Web_Page_Creation 10108 non-null category
63 Who_Pays_for_Access_Dont_Know 10108 non-null category
64 Who_Pays_for_Access_Other 10108 non-null category
65 Who_Pays_for_Access_Parents 10108 non-null category
66 Who_Pays_for_Access_School 10108 non-null category
67 Who_Pays_for_Access_Self 10108 non-null category
68 Who_Pays_for_Access_Work 10108 non-null category
dtypes: category(69)
memory usage: 715.7 KB
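As a quick, optional sanity check, the class balance of the target can be inspected before splitting; a minimal sketch using the column name from the summary above:
# Distribution of the target classes
df['Who_Pays_for_Access_Work'].value_counts()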
Split the data into the target and the features.
Drop the other Who_Pays_for_Access_* columns, since these alternative payment options leak information about the target.
target = 'Who_Pays_for_Access_Work'
y = df[target]
X_cat = data.data.drop(columns=['Who_Pays_for_Access_Dont_Know',
                                'Who_Pays_for_Access_Other', 'Who_Pays_for_Access_Parents',
                                'Who_Pays_for_Access_School', 'Who_Pays_for_Access_Self'])
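An optional check confirms that no other Who_Pays_for_Access_* columns remain; this sketch assumes fetch_openml already treated Who_Pays_for_Access_Work as the dataset's default target and excluded it from data.data, which the 63-feature count below suggests:
# Hedged sanity check: no other payment columns should remain in the feature set
leaky = [col for col in X_cat.columns if col.startswith('Who_Pays_for_Access')]
print(leaky)        # expected: []
print(X_cat.shape)  # expected: (10108, 63)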
Encode the categorical variables prior to feature selection, since the mutual information estimators require numeric inputs.
encoder = ce.LeaveOneOutEncoder(return_df=True)
X = encoder.fit_transform(X_cat, y)
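Leave-one-out encoding uses the target during fit_transform, so any held-out or future rows should be encoded with the already fitted encoder via transform only. A minimal sketch, where X_new_cat stands in for hypothetical unseen rows:
# Encode unseen rows with the fitted encoder (X_new_cat is hypothetical); no target is passed at transform time
X_new = encoder.transform(X_new_cat)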
Feature Selection
Select the Top N
Start with the 63 features that remain after dropping the target leakage columns.
Select the top 20 features.
Note that mutual_info_classif is used because this is a classification problem. For a regression problem, use mutual_info_regression instead; a brief sketch of that variant follows the code below.
selector = SelectKBest(mutual_info_classif, k=20)
X_reduced = selector.fit_transform(X, y)
X_reduced.shape
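The reduced array keeps only the 20 selected columns, so the shape should come out as (10108, 20). For a regression target, the same pattern applies with mutual_info_regression; a minimal sketch with hypothetical X_reg and y_reg:
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Same API, but with the regression scoring function (X_reg and y_reg are hypothetical)
reg_selector = SelectKBest(mutual_info_regression, k=20)
X_reg_reduced = reg_selector.fit_transform(X_reg, y_reg)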
The get_support method can be used to retrieve the indices of the features that were kept.
cols = selector.get_support(indices=True)
selected_columns = X.iloc[:,cols].columns.tolist()
selected_columns
['Community_Membership_Family',
'Community_Membership_None',
'Community_Membership_Political',
'Community_Membership_Religious',
'Community_Membership_Support',
'Disability_Cognitive',
'Disability_Hearing',
'Disability_Vision',
'How_You_Heard_About_Survey_Banner',
'How_You_Heard_About_Survey_Mailing_List',
'How_You_Heard_About_Survey_Printed_Media',
'How_You_Heard_About_Survey_Remebered',
'How_You_Heard_About_Survey_Search_Engine',
'How_You_Heard_About_Survey_Usenet_News',
'Race',
'Not_Purchasing_Bad_press',
'Not_Purchasing_Cant_find',
'Not_Purchasing_Enough_info',
'Not_Purchasing_Never_tried',
'Not_Purchasing_Prefer_people']
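The mutual information scores themselves are stored on the fitted selector as scores_, which is useful for ranking features rather than just keeping or dropping them. A short sketch (exact scores can vary slightly between runs, since mutual information is estimated with a randomized nearest-neighbour method):
# Rank all features by their estimated mutual information with the target
mi_scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
mi_scores.head(20)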
Select the Top P%
Select the top 25% of features; with 63 features, this keeps 15.
selector = SelectPercentile(mutual_info_classif, percentile=25)
X_reduced = selector.fit_transform(X, y)
X_reduced.shape
Again, use get_support to generate the list of features that were kept.
cols = selector.get_support(indices=True)
selected_columns = X.iloc[:,cols].columns.tolist()
selected_columns
['Community_Building',
'Community_Membership_Political',
'Community_Membership_Religious',
'Community_Membership_Support',
'Disability_Cognitive',
'Disability_Hearing',
'Disability_Motor',
'Disability_Vision',
'How_You_Heard_About_Survey_Banner',
'How_You_Heard_About_Survey_Printed_Media',
'Not_Purchasing_Bad_press',
'Not_Purchasing_Company_policy',
'Not_Purchasing_No_credit',
'Not_Purchasing_Prefer_people',
'Sexual_Preference']
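In practice, the selector (and ideally the encoder) should be fitted on training data only so that feature selection does not peek at the test set. A minimal end-to-end sketch using a scikit-learn Pipeline with a logistic regression model; the classifier choice is illustrative, not part of the original walkthrough:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hold out a test set; strictly, the leave-one-out encoder should also be fitted on the training split only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature selection happens inside the pipeline, so it is fitted on the training data alone
pipe = Pipeline([
    ('select', SelectPercentile(mutual_info_classif, percentile=25)),
    ('model', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)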