sklearn.impute: Transformers for missing value imputation

Slide 2

What's inside this module?

SimpleImputer: Imputation transformer for completing missing values.

IterativeImputer: Multivariate imputer that estimates each feature from all the others.

MissingIndicator: Binary indicators for missing values.

KNNImputer: Imputation for completing missing values using k-Nearest Neighbors.

Slide 3

SimpleImputer

Imputes values in the i-th feature dimension using only non-missing values in that feature dimension.

IterativeImputer, KNNImputer

The entire set of available feature dimensions may be used to estimate the missing values.
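
The difference between the two families can be seen on a toy array. This is a minimal sketch: the data, the variable names, and the choice of `n_neighbors=2` are illustrative assumptions, not part of the slides.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy data (illustrative): column 1 is twice column 0, with one entry missing.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
    [10.0, 20.0],
])

# Univariate: the gap is filled from column 1 alone (its mean, 8.5).
univariate = SimpleImputer(strategy="mean").fit_transform(X)

# Multivariate: the two rows closest in column 0 (x=2 and x=4) donate
# their column-1 values, giving (4 + 8) / 2 = 6.0.
multivariate = KNNImputer(n_neighbors=2).fit_transform(X)

print(univariate[2, 1], multivariate[2, 1])
```

The univariate result ignores the relationship between the columns, while the multivariate one exploits it.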

Slide 4

All imputers implement methods:
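
A minimal sketch of the shared interface, assuming the methods meant here are the usual scikit-learn transformer methods `fit`, `transform`, and `fit_transform`. The arrays and variable names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative training data and one unseen row made only of missing values.
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
X_new = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="mean")

# fit: learn the per-column statistics (here the means 2.0 and 15.0).
imputer.fit(X_train)

# transform: apply the learned statistics to any data, including unseen rows.
X_new_filled = imputer.transform(X_new)

# fit_transform: fit and transform the same data in one call.
X_train_filled = SimpleImputer(strategy="mean").fit_transform(X_train)

print(X_new_filled)
print(X_train_filled)
```

Fitting on training data and calling `transform` on new data keeps statistics from the new data out of the imputation.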


Slide 5

Simple Imputer: class sklearn.impute.SimpleImputer

SimpleImputer(
    missing_values=nan,
    strategy='mean',
    fill_value=None,
    verbose=0,
    copy=True,
    add_indicator=False
)

missing_values: The placeholder for the missing values.

strategy: The imputation strategy: 'constant', 'mean', 'median' or 'most_frequent'.

fill_value: The replacement value; used only if strategy is 'constant'.

verbose: Controls the verbosity of the imputer (deprecated in scikit-learn 1.1 and removed in 1.3).

copy: If True, a copy of the data will be created.

add_indicator: If True, a MissingIndicator transform will stack onto the output of the imputer's transform.
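
The parameters above can be tried on a small array. This is a sketch under assumed toy data; the variable names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data with one missing value in each column.
X = np.array([[1.0, 7.0], [np.nan, 7.0], [3.0, np.nan], [8.0, 9.0]])

# strategy='median': fill each gap with the median of its own column.
median_filled = SimpleImputer(strategy="median").fit_transform(X)

# strategy='constant': fill every gap with fill_value.
constant_filled = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)

# add_indicator=True: append one 0/1 column per feature that had missing
# values during fit, so the output here has 2 + 2 = 4 columns.
with_flags = SimpleImputer(strategy="median", add_indicator=True).fit_transform(X)

print(median_filled)
print(constant_filled)
print(with_flags)
```

The indicator columns let a downstream model see which entries were imputed rather than observed.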

Slide 7

Iterative Imputer: class sklearn.impute.IterativeImputer

A strategy for imputing missing values by modeling each feature with missing values as a function of other features in a round-robin fashion.

At each step, a feature column is designated as output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y. Then, the regressor is used to predict the missing values of y. This is done for each feature in an iterative fashion, and then is repeated for max_iter imputation rounds. The results of the final imputation round are returned.
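
The steps above can be sketched as a short loop. This is a simplified illustration, not scikit-learn's actual implementation: the function name `round_robin_impute` is hypothetical, the initial guess is assumed to be the column mean, and features are visited left to right with no convergence check.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge


def round_robin_impute(X, max_iter=10):
    """Simplified round-robin imputation (hypothetical helper, for illustration)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: replace each missing entry with its column mean.
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    for _ in range(max_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue  # nothing to impute in this column
            known = ~missing[:, j]
            # Column j is the output y; all other columns are the inputs.
            others = np.delete(filled, j, axis=1)
            model = BayesianRidge().fit(others[known], filled[known, j])
            # Predict only the entries of column j that were missing.
            filled[missing[:, j], j] = model.predict(others[missing[:, j]])
    return filled


# Illustrative data: column 1 is twice column 0, with one entry missing.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan], [4.0, 8.0], [5.0, 10.0]])
X_filled = round_robin_impute(X, max_iter=3)
print(X_filled)
```

Observed values are never overwritten; only the originally missing positions are updated in each round.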

Slide 8

Iterative Imputer: class sklearn.impute.IterativeImputer

IterativeImputer(
    estimator=None,
    missing_values=nan,
    initial_strategy='mean',
    n_nearest_features=None,
    verbose=0,
    imputation_order='ascending',
    random_state=None,
    ...
    many other settings
)

estimator: The estimator to use at each step of the round-robin imputation. default=BayesianRidge()

missing_values: The placeholder for the missing values.

initial_strategy: How to initialize missing data: 'constant', 'mean', 'median' or 'most_frequent'.

n_nearest_features: Number of other features to use to estimate the missing values of each feature column. If None, all features will be used.

verbose: Controls the verbosity of the imputer.

imputation_order: The order in which the features will be imputed. Possible values: 'ascending' (fewest missing values first), 'descending' (most missing values first), 'roman' (left to right), 'arabic' (right to left), 'random'.

random_state: The seed of the pseudo random number generator to use.
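
A minimal usage sketch with the parameters listed above. `IterativeImputer` is experimental in scikit-learn, so the explicit `enable_iterative_imputer` import is required before importing the class. The data and the values chosen for `max_iter` and `random_state` are illustrative assumptions.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come first)
from sklearn.impute import IterativeImputer

# Illustrative data: column 1 is twice column 0, with one entry missing.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan], [4.0, 8.0], [5.0, 10.0]])

imputer = IterativeImputer(
    estimator=None,                # None selects the default BayesianRidge()
    initial_strategy="mean",       # starting guess before the first round
    n_nearest_features=None,       # use all other features as inputs
    imputation_order="ascending",  # fewest missing values first
    max_iter=10,                   # number of imputation rounds
    random_state=0,                # fixed seed for reproducibility
)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Because the second column is a linear function of the first, the imputed entry lands close to 6.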

Slide 10

k-Nearest Neighbors Imputer: class sklearn.impute.KNNImputer

KNNImputer(
    missing_values=nan,
    n_neighbors=5,
    weights='uniform',
    metric='nan_euclidean',
    copy=True,
    add_indicator=False
)

missing_values: The placeholder for the missing values.

n_neighbors: Number of neighboring samples to use for imputation.

weights: Weight function used in prediction. Possible values: 'uniform', 'distance' or a user-defined function.

metric: Distance metric for searching neighbors. Possible values: 'nan_euclidean' or a user-defined function.

copy: If True, a copy of the data will be created.

add_indicator: If True, a MissingIndicator transform will stack onto the output of the imputer's transform.
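
A short sketch of `KNNImputer` on a small array, using `n_neighbors=2` instead of the default 5 because the illustrative data has only four rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Illustrative data with two missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each gap is filled with the uniform average of that feature over the two
# nearest rows; distances skip coordinates that are missing (nan_euclidean).
imputer = KNNImputer(n_neighbors=2, weights="uniform", metric="nan_euclidean")
X_filled = imputer.fit_transform(X)

print(X_filled)
# [[1.  2.  4. ]
#  [3.  4.  3. ]
#  [5.5 6.  5. ]
#  [8.  8.  7. ]]
```

Setting `weights='distance'` instead would give closer neighbors a larger share of the average.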

Slide 12

Marking imputed values: class sklearn.impute.MissingIndicator

MissingIndicator(
    missing_values=nan,
    features='missing-only',
    sparse='auto',
    error_on_new=True
)

missing_values: The placeholder for the missing values.

features: Whether the imputer mask should represent all or a subset of features. Could be 'missing-only' or 'all'.

sparse: Whether the imputer mask format should be sparse or dense: True, False or 'auto'.

error_on_new: If True, transform will raise an error when there are features with missing values in transform that have no missing values in fit. This is applicable only when features='missing-only'.

The MissingIndicator transformer converts a dataset into the corresponding binary matrix that marks where values are missing. This transformation is useful in conjunction with imputation.
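
A minimal sketch of the indicator on its own, with illustrative train and test arrays. With `features='missing-only'`, only the columns that had missing values during `fit` appear in the mask.

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Illustrative data: columns 0 and 2 contain missing values in training.
X_train = np.array([[np.nan, 1.0, 3.0], [4.0, 0.0, np.nan], [8.0, 1.0, 0.0]])
X_test = np.array([[5.0, 1.0, np.nan], [np.nan, 2.0, 3.0], [2.0, 4.0, 0.0]])

# 'missing-only': the mask covers only features with missing values in fit.
indicator = MissingIndicator(features="missing-only")
indicator.fit(X_train)
mask = indicator.transform(X_test)

# 'all': the mask has one column per input feature.
full_mask = MissingIndicator(features="all").fit(X_train).transform(X_test)

print(indicator.features_)  # indices of the features kept in the mask
print(mask)
print(full_mask.shape)
```

If `X_test` had a missing value in column 1, which was complete during `fit`, the default `error_on_new=True` would make `transform` raise an error.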