feature_scoring

Feature scoring functionality

ema_workbench.analysis.feature_scoring.CHI2(X, y)

Compute chi-squared stats between each non-negative feature and class.

This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.

Recall that the chi-square test measures dependence between stochastic variables, so using this function “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification.

Read more in the User Guide.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Sample vectors.

  • y (array-like of shape (n_samples,)) – Target vector (class labels).

Returns:

  • chi2 (ndarray of shape (n_features,)) – Chi2 statistics for each feature.

  • p_values (ndarray of shape (n_features,)) – P-values for each feature.

See also

f_classif

ANOVA F-value between label/feature for classification tasks.

f_regression

F-value between label/feature for regression tasks.

Notes

Complexity of this algorithm is O(n_classes * n_features).

Examples

>>> import numpy as np
>>> from sklearn.feature_selection import chi2
>>> X = np.array([[1, 1, 3],
...               [0, 1, 5],
...               [5, 4, 1],
...               [6, 6, 2],
...               [1, 4, 0],
...               [0, 0, 0]])
>>> y = np.array([1, 1, 0, 0, 2, 2])
>>> chi2_stats, p_values = chi2(X, y)
>>> chi2_stats
array([15.3...,  6.5       ,  8.9...])
>>> p_values
array([0.0004..., 0.0387..., 0.0116... ])
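The statistic is typically consumed through a selector. Continuing the arrays from the example above, a minimal sketch using sklearn's SelectKBest (not part of this module; shown only to illustrate the selection use mentioned above) keeps the two highest-scoring features:

>>> from sklearn.feature_selection import SelectKBest
>>> selector = SelectKBest(chi2, k=2)  # keep the 2 features with the highest chi2
>>> X_selected = selector.fit_transform(X, y)
>>> X_selected.shape
(6, 2)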
ema_workbench.analysis.feature_scoring.F_CLASSIFICATION(X, y)

Compute the ANOVA F-value for the provided sample.

Read more in the User Guide.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The set of regressors that will be tested sequentially.

  • y (array-like of shape (n_samples,)) – The target vector.

Returns:

  • f_statistic (ndarray of shape (n_features,)) – F-statistic for each feature.

  • p_values (ndarray of shape (n_features,)) – P-values associated with the F-statistic.

See also

chi2

Chi-squared stats of non-negative features for classification tasks.

f_regression

F-value between label/feature for regression tasks.

Examples

>>> from sklearn.datasets import make_classification
>>> from sklearn.feature_selection import f_classif
>>> X, y = make_classification(
...     n_samples=100, n_features=10, n_informative=2, n_clusters_per_class=1,
...     shuffle=False, random_state=42
... )
>>> f_statistic, p_values = f_classif(X, y)
>>> f_statistic
array([2.2...e+02, 7.0...e-01, 1.6...e+00, 9.3...e-01,
       5.4...e+00, 3.2...e-01, 4.7...e-02, 5.7...e-01,
       7.5...e-01, 8.9...e-02])
>>> p_values
array([7.1...e-27, 4.0...e-01, 1.9...e-01, 3.3...e-01,
       2.2...e-02, 5.7...e-01, 8.2...e-01, 4.5...e-01,
       3.8...e-01, 7.6...e-01])
ema_workbench.analysis.feature_scoring.F_REGRESSION(X, y, *, center=True, force_finite=True)

Univariate linear regression tests returning F-statistic and p-values.

Quick linear model for testing the effect of a single regressor, sequentially for many regressors.

This is done in 2 steps:

  1. The cross correlation between each regressor and the target is computed using r_regression() as:

    E[(X[:, i] - mean(X[:, i])) * (y - mean(y))] / (std(X[:, i]) * std(y))
    
  2. It is converted to an F score and then to a p-value.
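As a sketch of step 2, assuming centered data (so the degrees of freedom are n_samples - 2), the correlation can be converted to an F score and a p-value as follows; this mirrors the textbook formula, not necessarily the library's exact code path:

>>> from scipy import stats
>>> n_samples, r = 50, 0.6  # illustrative sample size and correlation
>>> dof = n_samples - 2
>>> f_score = r**2 / (1 - r**2) * dof  # F statistic with (1, dof) degrees of freedom
>>> p_value = stats.f.sf(f_score, 1, dof)  # upper-tail probability under F(1, dof)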

f_regression() is derived from r_regression() and will rank features in the same order if all the features are positively correlated with the target.

Note, however, that contrary to f_regression(), r_regression() values lie in [-1, 1] and can thus be negative. f_regression() is therefore recommended as a feature selection criterion to identify potentially predictive features for a downstream classifier, irrespective of the sign of the association with the target variable.

Furthermore f_regression() returns p-values while r_regression() does not.

Read more in the User Guide.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data matrix.

  • y (array-like of shape (n_samples,)) – The target vector.

  • center (bool, default=True) – Whether or not to center the data matrix X and the target vector y. By default, X and y will be centered.

  • force_finite (bool, default=True) –

    Whether or not to force the F-statistics and associated p-values to be finite. There are two cases where the F-statistic is expected to not be finite:

    • when the target y or some features in X are constant. In this case, Pearson’s R correlation is not defined, which yields np.nan values for the F-statistic and p-value. When force_finite=True, the F-statistic is set to 0.0 and the associated p-value is set to 1.0.

    • when a feature in X is perfectly correlated (or anti-correlated) with the target y. In this case, the F-statistic is expected to be np.inf. When force_finite=True, the F-statistic is set to np.finfo(dtype).max and the associated p-value is set to 0.0.

    Added in version 1.1.

Returns:

  • f_statistic (ndarray of shape (n_features,)) – F-statistic for each feature.

  • p_values (ndarray of shape (n_features,)) – P-values associated with the F-statistic.

See also

r_regression

Pearson’s R between label/feature for regression tasks.

f_classif

ANOVA F-value between label/feature for classification tasks.

chi2

Chi-squared stats of non-negative features for classification tasks.

SelectKBest

Select features based on the k highest scores.

SelectFpr

Select features based on a false positive rate test.

SelectFdr

Select features based on an estimated false discovery rate.

SelectFwe

Select features based on family-wise error rate.

SelectPercentile

Select features based on percentile of the highest scores.

Examples

>>> from sklearn.datasets import make_regression
>>> from sklearn.feature_selection import f_regression
>>> X, y = make_regression(
...     n_samples=50, n_features=3, n_informative=1, noise=1e-4, random_state=42
... )
>>> f_statistic, p_values = f_regression(X, y)
>>> f_statistic
array([1.2...+00, 2.6...+13, 2.6...+00])
>>> p_values
array([2.7..., 1.5..., 1.0...])
ema_workbench.analysis.feature_scoring.get_ex_feature_scores(x, y, mode=RuleInductionType.CLASSIFICATION, nr_trees=100, max_features=None, max_depth=None, min_samples_split=2, min_samples_leaf=None, min_weight_fraction_leaf=0, max_leaf_nodes=None, bootstrap=True, oob_score=True, random_state=None)

Get feature scores using extra trees.

Parameters:
  • x (DataFrame)

  • y (1D nd.array)

  • mode ({RuleInductionType.CLASSIFICATION, RuleInductionType.REGRESSION}, optional)

  • nr_trees (int, optional) – the number of trees in the forest

  • max_features, max_depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_leaf_nodes, bootstrap, oob_score, random_state – hyperparameters passed to the underlying ExtraTreesClassifier or ExtraTreesRegressor; see the signature above for the defaults

Returns:

  • pandas DataFrame – one row per uncertainty with its feature score, sorted in descending order of score

  • object – the fitted ExtraTreesClassifier or ExtraTreesRegressor, depending on mode
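A minimal usage sketch on synthetic data; the column names, sample size, and the threshold used to construct the binary outcome are illustrative assumptions, and RuleInductionType is assumed to be importable from ema_workbench.analysis.scenario_discovery_util:

>>> import numpy as np
>>> import pandas as pd
>>> from ema_workbench.analysis.feature_scoring import get_ex_feature_scores
>>> from ema_workbench.analysis.scenario_discovery_util import RuleInductionType
>>> rng = np.random.default_rng(42)
>>> x = pd.DataFrame(rng.uniform(size=(500, 3)), columns=["a", "b", "c"])
>>> y = x["a"].values > 0.5  # binary outcome driven mainly by feature "a"
>>> scores, model = get_ex_feature_scores(x, y, mode=RuleInductionType.CLASSIFICATION)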

ema_workbench.analysis.feature_scoring.get_feature_scores_all(x, y, alg='extra trees', mode=RuleInductionType.REGRESSION, **kwargs)

Perform feature scoring for all outcomes using the specified feature scoring algorithm.

Parameters:
  • x (DataFrame)

  • y (dict of 1d numpy arrays) – the outcomes, keyed by outcome name, with a 1D array for each outcome

  • alg ({'extra trees', 'random forest', 'univariate'}, optional)

  • mode ({RuleInductionType.REGRESSION, RuleInductionType.CLASSIFICATION}, optional)

  • kwargs (dict, optional) – any remaining keyword arguments will be passed to the specific feature scoring algorithm

Return type:

DataFrame instance
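A minimal sketch with synthetic data (the outcome names and their relationships to x are illustrative assumptions); with the defaults, every outcome is scored with extra trees in regression mode:

>>> import numpy as np
>>> import pandas as pd
>>> from ema_workbench.analysis import feature_scoring
>>> rng = np.random.default_rng(1)
>>> x = pd.DataFrame(rng.uniform(size=(200, 4)), columns=["a", "b", "c", "d"])
>>> y = {"o1": 2 * x["a"].values + rng.normal(size=200),
...      "o2": x["c"].values ** 2}
>>> scores = feature_scoring.get_feature_scores_all(x, y)  # scores for all outcomes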

ema_workbench.analysis.feature_scoring.get_rf_feature_scores(x, y, mode=RuleInductionType.CLASSIFICATION, nr_trees=250, max_features='sqrt', max_depth=None, min_samples_split=2, min_samples_leaf=1, bootstrap=True, oob_score=True, random_state=None)

Get feature scores using a random forest.

Parameters:
  • x (DataFrame)

  • y (1D nd.array)

  • mode ({RuleInductionType.CLASSIFICATION, RuleInductionType.REGRESSION}, optional)

  • nr_trees (int, optional) – the number of trees in the forest

  • max_features, max_depth, min_samples_split, min_samples_leaf, bootstrap, oob_score, random_state – hyperparameters passed to the underlying RandomForestClassifier or RandomForestRegressor; see the signature above for the defaults

Returns:

  • pandas DataFrame – one row per uncertainty with its feature score, sorted in descending order of score

  • object – the fitted RandomForestClassifier or RandomForestRegressor, depending on mode
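Usage mirrors get_ex_feature_scores. A minimal sketch on synthetic data (names and threshold are illustrative assumptions); because oob_score=True by default, the fitted forest also exposes its out-of-bag score:

>>> import numpy as np
>>> import pandas as pd
>>> from ema_workbench.analysis.feature_scoring import get_rf_feature_scores
>>> from ema_workbench.analysis.scenario_discovery_util import RuleInductionType
>>> rng = np.random.default_rng(7)
>>> x = pd.DataFrame(rng.uniform(size=(500, 3)), columns=["a", "b", "c"])
>>> y = x["b"].values > 0.5  # binary outcome driven mainly by feature "b"
>>> scores, forest = get_rf_feature_scores(x, y, mode=RuleInductionType.CLASSIFICATION)
>>> oob = forest.oob_score_  # out-of-bag accuracy of the fitted classifier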

ema_workbench.analysis.feature_scoring.get_univariate_feature_scores(x, y, score_func=<function f_classif>)

Calculate feature scores using univariate statistical tests. For categorical data, the chi-square statistic or the ANOVA F value is used; for continuous data, the ANOVA F value is used.

Parameters:
  • x (DataFrame)

  • y (1D nd.array)

  • score_func ({F_CLASSIFICATION, F_REGRESSION, CHI2}) – the score function to use: F_REGRESSION for regression, or F_CLASSIFICATION or CHI2 for classification.

Returns:

one row per uncertainty with its feature score (a p-value in this case), sorted in descending order.

Return type:

pandas DataFrame
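A minimal sketch with a continuous outcome, using the module-level F_REGRESSION score function (synthetic data; names and coefficients are illustrative assumptions):

>>> import numpy as np
>>> import pandas as pd
>>> from ema_workbench.analysis.feature_scoring import F_REGRESSION, get_univariate_feature_scores
>>> rng = np.random.default_rng(3)
>>> x = pd.DataFrame(rng.uniform(size=(300, 3)), columns=["a", "b", "c"])
>>> y = 2 * x["a"].values + rng.normal(scale=0.1, size=300)
>>> scores = get_univariate_feature_scores(x, y, score_func=F_REGRESSION)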