feature_scoring

Feature scoring functionality

ema_workbench.analysis.feature_scoring.F_REGRESSION(X, y, center=True)

Univariate linear regression tests.

Linear model for testing the individual effect of each of many regressors. This is a scoring function to be used in a feature selection procedure, not a free standing feature selection procedure.

This is done in 2 steps:

  1. The correlation between each regressor and the target is computed, that is, ((X[:, i] - mean(X[:, i])) * (y - mean_y)) / (std(X[:, i]) * std(y)).
  2. It is converted to an F score then to a p-value.

For more on usage see the User Guide.
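The two steps above can be sketched in plain NumPy/SciPy on synthetic data (an illustrative sketch, not the library's implementation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(size=100)  # only feature 0 matters

n_samples = X.shape[0]

# Step 1: correlation between each regressor and the target
corr = ((X - X.mean(axis=0)) * (y - y.mean())[:, None]).mean(axis=0) / (
    X.std(axis=0) * y.std()
)

# Step 2: convert each correlation to an F score, then to a p-value
dof = n_samples - 2
F = corr**2 / (1.0 - corr**2) * dof
pval = stats.f.sf(F, 1, dof)
```

The informative feature ends up with a large F value and a tiny p-value, while the noise features do not.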

Parameters:
  • X ({array-like, sparse matrix} shape = (n_samples, n_features)) – The set of regressors that will be tested sequentially.
  • y (array of shape (n_samples,)) – The target vector.
  • center (bool, optional, default=True) – If True, X and y will be centered.
Returns:

  • F (array, shape=(n_features,)) – F values of features.
  • pval (array, shape=(n_features,)) – p-values of F-scores.

See also

mutual_info_regression()
Mutual information for a continuous target.
f_classif()
ANOVA F-value between label/feature for classification tasks.
chi2()
Chi-squared stats of non-negative features for classification tasks.
SelectKBest()
Select features based on the k highest scores.
SelectFpr()
Select features based on a false positive rate test.
SelectFdr()
Select features based on an estimated false discovery rate.
SelectFwe()
Select features based on family-wise error rate.
SelectPercentile()
Select features based on percentile of the highest scores.
ema_workbench.analysis.feature_scoring.F_CLASSIFICATION(X, y)

Compute the ANOVA F-value for the provided sample.

Read more in the User Guide.

Parameters:
  • X ({array-like, sparse matrix} shape = [n_samples, n_features]) – The set of regressors that will be tested sequentially.
  • y (array of shape (n_samples,)) – The target vector.
Returns:

  • F (array, shape = [n_features,]) – The set of F values.
  • pval (array, shape = [n_features,]) – The set of p-values.
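For intuition, the per-feature ANOVA F-value can be reproduced with scipy.stats.f_oneway on synthetic two-class data (a sketch, not the library code):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)        # two classes, 50 samples each
X = rng.normal(size=(100, 2))
X[y == 1, 0] += 3.0              # feature 0 separates the classes

# ANOVA F-value: between-class variance relative to within-class variance
F0, p0 = f_oneway(X[y == 0, 0], X[y == 1, 0])
F1, p1 = f_oneway(X[y == 0, 1], X[y == 1, 1])
```

The class-separating feature gets a much larger F value than the noise feature.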

See also

chi2()
Chi-squared stats of non-negative features for classification tasks.
f_regression()
F-value between label/feature for regression tasks.
ema_workbench.analysis.feature_scoring.CHI2(X, y)

Compute chi-squared stats between each non-negative feature and class.

This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.

Recall that the chi-square test measures dependence between stochastic variables, so using this function “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification.

Read more in the User Guide.

Parameters:
  • X ({array-like, sparse matrix}, shape = (n_samples, n_features_in)) – Sample vectors.
  • y (array-like, shape = (n_samples,)) – Target vector (class labels).
Returns:

  • chi2 (array, shape = (n_features,)) – chi2 statistics of each feature.
  • pval (array, shape = (n_features,)) – p-values of each feature.

Notes

Complexity of this algorithm is O(n_classes * n_features).
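The statistic can be reproduced by comparing per-class observed feature sums against what class frequencies alone would predict, with n_classes - 1 degrees of freedom (a sketch of the computation on synthetic count data):

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist

rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)
X = rng.poisson(lam=2.0, size=(100, 2)).astype(float)
X[y == 1, 0] += rng.poisson(lam=3.0, size=50)   # feature 0 depends on class

classes = np.unique(y)
# observed: per-class sum of each feature
observed = np.array([X[y == c].sum(axis=0) for c in classes])
# expected: feature totals split by class frequency (independence assumption)
class_prob = np.bincount(y) / len(y)
expected = np.outer(class_prob, X.sum(axis=0))

chi2_stats = ((observed - expected) ** 2 / expected).sum(axis=0)
pvals = chi2_dist.sf(chi2_stats, df=len(classes) - 1)
```

The class-dependent feature gets a large chi-squared statistic; the independent one is "weeded out" by its high p-value.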

See also

f_classif()
ANOVA F-value between label/feature for classification tasks.
f_regression()
F-value between label/feature for regression tasks.
ema_workbench.analysis.feature_scoring.get_univariate_feature_scores(x, y, score_func=<function f_classif>)

Calculate feature scores using univariate statistical tests. For categorical data, the chi-squared statistic or the ANOVA F-value is used; for continuous data, the ANOVA F-value is used.

Parameters:
  • x (structured array) –
  • y (1D nd.array) –
  • score_func ({F_CLASSIFICATION, F_REGRESSION, CHI2}) – the score function to use: F_REGRESSION for regression, or F_CLASSIFICATION or CHI2 for classification.
Returns:

tuples of uncertainty and feature score (p-values in this case), sorted in descending order of importance.

Return type:

pandas DataFrame
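A minimal sketch of this kind of univariate scoring on a structured array (hypothetical field names, f_regression-style scoring; not the library's implementation):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(7)
x = np.zeros(200, dtype=[("a", float), ("b", float)])  # hypothetical uncertainties
x["a"] = rng.normal(size=200)
x["b"] = rng.normal(size=200)
y = 3.0 * x["a"] + rng.normal(size=200)                # outcome driven by "a"

def univariate_scores(x, y):
    """Score each field of a structured array against y (f_regression-style)."""
    n = len(y)
    rows = []
    for name in x.dtype.names:
        corr = np.corrcoef(x[name], y)[0, 1]
        F = corr**2 / (1.0 - corr**2) * (n - 2)
        rows.append((name, stats.f.sf(F, 1, n - 2)))   # p-value as the score
    # a low p-value means important, so sort ascending
    return pd.DataFrame(rows, columns=["uncertainty", "score"]).sort_values("score")

scores = univariate_scores(x, y)
```

The driving uncertainty "a" comes out on top with the smallest p-value.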

ema_workbench.analysis.feature_scoring.get_rf_feature_scores(x, y, mode=<RuleInductionType.CLASSIFICATION: 'classification'>, nr_trees=250, max_features='auto', max_depth=None, min_samples_split=2, min_samples_leaf=1, bootstrap=True, oob_score=True, random_state=None)

Get feature scores using a random forest

Parameters:
  • x (structured array) –
  • y (1D nd.array) –
  • mode ({RuleInductionType.CLASSIFICATION, RuleInductionType.REGRESSION}, optional) –
  • nr_trees (int, optional) – number of trees in the forest (default: 250)
  • max_features, max_depth, min_samples_split, min_samples_leaf, bootstrap, oob_score, random_state – passed on to the underlying RandomForestClassifier or RandomForestRegressor
Returns:

  • pandas DataFrame – sorted in descending order of tuples with uncertainty and feature scores
  • object – either RandomForestClassifier or RandomForestRegressor
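The scoring step amounts to fitting a forest and reading off its impurity-based feature_importances_; a minimal scikit-learn sketch in regression mode (hypothetical uncertainty names; max_features is left at its default, since the 'auto' string is deprecated in recent scikit-learn):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + rng.normal(scale=0.5, size=300)   # feature 0 dominates

forest = RandomForestRegressor(n_estimators=250, bootstrap=True,
                               oob_score=True, random_state=0)
forest.fit(X, y)

scores = (pd.DataFrame({"uncertainty": ["u0", "u1", "u2"],
                        "score": forest.feature_importances_})
          .sort_values("score", ascending=False))
```

The importances are normalized to sum to one, and the dominant feature ranks first.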

ema_workbench.analysis.feature_scoring.get_ex_feature_scores(x, y, mode=<RuleInductionType.CLASSIFICATION: 'classification'>, nr_trees=100, max_features=None, max_depth=None, min_samples_split=2, min_samples_leaf=None, min_weight_fraction_leaf=0, max_leaf_nodes=None, bootstrap=True, oob_score=True, random_state=None)

Get feature scores using extra trees

Parameters:
  • x (structured array) –
  • y (1D nd.array) –
  • mode ({RuleInductionType.CLASSIFICATION, RuleInductionType.REGRESSION}, optional) –
  • nr_trees (int, optional) – number of trees (default: 100)
  • max_features, max_depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_leaf_nodes, bootstrap, oob_score, random_state – passed on to the underlying ExtraTreesClassifier or ExtraTreesRegressor
Returns:

  • pandas DataFrame – sorted in descending order of tuples with uncertainty and feature scores
  • object – either ExtraTreesClassifier or ExtraTreesRegressor
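The extra-trees variant is analogous; in classification mode it fits an ExtraTreesClassifier (note that bootstrap=True is required for oob_score, matching the defaults in the signature above). A sketch with hypothetical names:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)        # class determined by the first feature

est = ExtraTreesClassifier(n_estimators=100, bootstrap=True,
                           oob_score=True, random_state=0)
est.fit(X, y)

scores = (pd.DataFrame({"uncertainty": ["u0", "u1"],
                        "score": est.feature_importances_})
          .sort_values("score", ascending=False))
```

Because oob_score=True, the fitted estimator also exposes an out-of-bag accuracy estimate via est.oob_score_.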

ema_workbench.analysis.feature_scoring.get_feature_scores_all(x, y, alg='extra trees', mode=<RuleInductionType.REGRESSION: 'regression'>, **kwargs)

Perform feature scoring for all outcomes using the specified feature scoring algorithm.

Parameters:
  • x (numpy structured array) –
  • y (dict of 1d numpy arrays) – the outcomes, with a string as key, and a 1D array for each outcome
  • alg ({'extra trees', 'random forest', 'univariate'}, optional) –
  • mode ({RuleInductionType.REGRESSION, RuleInductionType.CLASSIFICATION}, optional) –
  • kwargs (dict, optional) – any remaining keyword arguments will be passed to the specific feature scoring algorithm
Returns:

feature scores, with one column per outcome

Return type:

DataFrame instance
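A sketch of the all-outcomes pattern: apply one scoring function per outcome in the dict and collect the results column-wise (absolute correlation stands in for the actual scoring algorithm; field and outcome names are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
x = np.zeros(100, dtype=[("a", float), ("b", float)])
x["a"] = rng.normal(size=100)
x["b"] = rng.normal(size=100)
# outcomes: string key -> 1D array, as in the docstring
y = {"o1": 2.0 * x["a"] + rng.normal(size=100),
     "o2": -3.0 * x["b"] + rng.normal(size=100)}

def score_one(x, outcome):
    # absolute correlation as a stand-in univariate score
    return pd.Series({name: abs(np.corrcoef(x[name], outcome)[0, 1])
                      for name in x.dtype.names})

all_scores = pd.DataFrame({k: score_one(x, v) for k, v in y.items()})
```

Each outcome becomes a column, each uncertainty a row, so the driving uncertainty differs per column here.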