prim
A scenario-discovery-oriented implementation of PRIM.
The implementation of PRIM provided here is data-type aware, so categorical variables are handled appropriately. It also uses a non-standard objective function in the peeling and pasting phases of the algorithm: it looks at the increase in the mean divided by the amount of data removed. In essence, it uses something akin to the first-order derivative of the original objective function.
The implementation is designed for interactive use in combination with the Jupyter notebook.
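To make the lenient objective concrete, the following minimal sketch computes the gain in mean per data point removed for two candidate peels. The helper name `lenient_gain` and the toy data are assumptions made for illustration; this is not the library's own code.

```python
from statistics import mean


def lenient_gain(y_before, y_after):
    """Increase in mean divided by the number of points removed.

    Hypothetical helper, written only to illustrate the idea behind
    the lenient objective described above.
    """
    removed = len(y_before) - len(y_after)
    if removed <= 0:
        raise ValueError("a peel must remove at least one data point")
    return (mean(y_after) - mean(y_before)) / removed


# Binary outcomes of the points currently inside the box (mean = 0.625).
y_box = [0, 0, 1, 1, 1, 0, 1, 1]

# Candidate A removes 2 points, candidate B removes 4 points.
candidate_a = [1, 1, 1, 0, 1, 1]
candidate_b = [1, 1, 1, 1]

gain_a = lenient_gain(y_box, candidate_a)
gain_b = lenient_gain(y_box, candidate_b)

# B reaches the higher mean, but A gains more per point removed.
preferred = "A" if gain_a > gain_b else "B"
```

Under this criterion candidate A is preferred even though candidate B ends with the higher mean, which is what makes the peeling more patient.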
- class ema_workbench.analysis.prim.PRIMObjectiveFunctions(value)¶
An enumeration.
- class ema_workbench.analysis.prim.Prim(x, y, threshold, obj_function=PRIMObjectiveFunctions.LENIENT1, peel_alpha=0.05, paste_alpha=0.05, mass_min=0.05, threshold_type=1, mode=RuleInductionType.BINARY, update_function='default')¶
Patient rule induction algorithm
The implementation of Prim is tailored to interactive use in the context of scenario discovery
- Parameters
x (DataFrame) – the independent variables
y (1d ndarray) – the dependent variable
threshold (float) – the density threshold that a box has to meet
obj_function ({LENIENT1, LENIENT2, ORIGINAL}) – the objective function used by PRIM. Defaults to a lenient objective function based on the gain of mean divided by the loss of mass.
peel_alpha (float, optional) – parameter controlling the peeling stage (default = 0.05).
paste_alpha (float, optional) – parameter controlling the pasting stage (default = 0.05).
mass_min (float, optional) – minimum mass of a box (default = 0.05).
threshold_type ({ABOVE, BELOW}) – whether to look above or below the threshold value
mode ({RuleInductionType.BINARY, RuleInductionType.REGRESSION}, optional) – indicates whether PRIM is used for regression, or for scenario classification, in which case y should be a binary vector
update_function ({'default', 'guivarch'}, optional) – controls the behavior of PRIM after having found a first box. Use either the default behavior, where all points in the box are removed, or the procedure suggested by Guivarch et al. (2016), doi:10.1016/j.envsoft.2016.03.006, which simply sets all points in the box to be no longer of interest (only valid in binary mode).
See also
cart
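The difference between the two update_function options can be sketched as follows. This is an illustrative stand-in that uses plain lists, not the library's implementation; the function name `update_after_box` and the `in_box` mask are assumptions.

```python
def update_after_box(x_rows, y, in_box, update_function="default"):
    """Return the data that a next PRIM iteration would see.

    x_rows : list of experiments (any objects)
    y      : list of 0/1 outcomes (binary mode)
    in_box : list of bools, True if the point lies in the box just found
    """
    if update_function == "default":
        # Remove every point that falls inside the box.
        kept_x = [row for row, flag in zip(x_rows, in_box) if not flag]
        kept_y = [value for value, flag in zip(y, in_box) if not flag]
        return kept_x, kept_y
    if update_function == "guivarch":
        # Keep all points, but mark those in the box as not of interest.
        return list(x_rows), [0 if flag else value for value, flag in zip(y, in_box)]
    raise ValueError("update_function must be 'default' or 'guivarch'")


x_rows = ["a", "b", "c", "d"]
y = [1, 1, 0, 1]
in_box = [True, True, False, False]

default_result = update_after_box(x_rows, y, in_box, "default")
guivarch_result = update_after_box(x_rows, y, in_box, "guivarch")
```

With the default behavior the next box is searched in a smaller dataset, whereas the Guivarch-style update keeps the dataset size constant and only changes the labels.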
- property boxes¶
Property for getting a list of box limits
- determine_coi(indices)¶
Given a set of indices on y, determine how many cases of interest there are in this set.
- Parameters
indices (ndarray) – a valid index for y
- Returns
the number of cases of interest.
- Return type
int
- Raises
ValueError – if threshold_type is neither ABOVE nor BELOW
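A simplified sketch of this kind of counting is shown below. The `ThresholdType` enum is a local stand-in for the library's ABOVE and BELOW constants, and the inclusive comparisons are an assumption; the actual method may differ in detail.

```python
from enum import Enum


class ThresholdType(Enum):
    # Local stand-ins for the library's ABOVE / BELOW constants.
    ABOVE = 1
    BELOW = -1


def count_cases_of_interest(y, indices, threshold, threshold_type):
    """Count the selected values of y that are of interest."""
    selected = [y[i] for i in indices]
    if threshold_type is ThresholdType.ABOVE:
        return sum(1 for value in selected if value >= threshold)
    if threshold_type is ThresholdType.BELOW:
        return sum(1 for value in selected if value <= threshold)
    raise ValueError("threshold_type must be ABOVE or BELOW")


y = [0.2, 0.9, 0.5, 0.7, 0.1]
indices = [0, 1, 2, 3]

n_above = count_cases_of_interest(y, indices, 0.5, ThresholdType.ABOVE)
n_below = count_cases_of_interest(y, indices, 0.5, ThresholdType.BELOW)
```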
- find_box()¶
Execute one iteration of the PRIM algorithm. That is, find one box, starting from the current state of Prim.
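As a rough illustration of what one peeling pass does, the sketch below peels a box along a single numeric dimension. It is heavily simplified and is not the library's algorithm: real PRIM considers all dimensions, handles categorical data, and follows peeling with a pasting phase. The function name and the toy data are assumptions; `peel_alpha` and `mass_min` play the roles described in the parameter list above.

```python
from statistics import mean


def peel_one_dimension(x, y, peel_alpha=0.05, mass_min=0.05):
    """Peel a 1-D box from either end while the mean of y improves."""
    n_total = len(x)
    order = sorted(range(n_total), key=lambda i: x[i])  # indices sorted by x
    lo, hi = 0, n_total  # the box covers order[lo:hi]

    def box_mean(bounds):
        return mean(y[i] for i in order[bounds[0]:bounds[1]])

    while True:
        size = hi - lo
        n_peel = max(1, int(peel_alpha * size))
        # Stop if the peeled box would fall below the minimum mass.
        if (size - n_peel) / n_total < mass_min:
            break
        # Candidate boxes: peel from the low end or from the high end.
        # Both remove the same number of points, so comparing means is
        # equivalent to comparing the gain per point removed.
        candidates = [(lo + n_peel, hi), (lo, hi - n_peel)]
        best = max(candidates, key=box_mean)
        if box_mean(best) <= box_mean((lo, hi)):
            break  # no candidate improves the box
        lo, hi = best
    return x[order[lo]], x[order[hi - 1]]


x = [i / 20 for i in range(20)]
y = [1 if value >= 0.5 else 0 for value in x]
limits = peel_one_dimension(x, y, peel_alpha=0.1, mass_min=0.2)
```

In this toy case the box shrinks from the low end until only the points with x of at least 0.5 remain, at which point further peeling no longer raises the mean.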
- property stats¶
property for getting a list of dicts containing the statistics for each box
- class ema_workbench.analysis.prim.PrimBox(prim, box_lims, indices)¶
A class that holds information for a specific box
- coverage¶
coverage of currently selected box
- Type
float
- density¶
density of currently selected box
- Type
float
- mean¶
mean of currently selected box
- Type
float
- res_dim¶
number of restricted dimensions of currently selected box
- Type
int
- mass¶
mass of currently selected box
- Type
float
- peeling_trajectory¶
stats for each box in peeling trajectory
- Type
DataFrame
- box_lims¶
list of box lims for each box in peeling trajectory
- Type
list
By default, the currently selected box is the last box on the peeling trajectory, unless this is changed via PrimBox.select().
- drop_restriction(uncertainty='', i=-1)¶
Drop the restriction on the specified dimension for box i
- Parameters
i (int, optional) – defaults to the currently selected box, which defaults to the latest box on the trajectory
uncertainty (str) –
Replace the limits in box i with a new box where for the specified uncertainty the limits of the initial box are being used. The resulting box is added to the peeling trajectory.
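The effect of dropping a restriction can be sketched with plain dictionaries, as below. The library stores box limits in DataFrames, so the dict-of-tuples layout and the name `drop_restriction_sketch` are assumptions for illustration only.

```python
def drop_restriction_sketch(box_lims, initial_lims, uncertainty):
    """Return new box limits with one dimension reset to its initial range."""
    new_lims = dict(box_lims)  # copy, so the original box is left untouched
    new_lims[uncertainty] = initial_lims[uncertainty]
    return new_lims


initial_box = {"a": (0.0, 1.0), "b": (0.0, 10.0)}
box_i = {"a": (0.2, 0.6), "b": (3.0, 7.0)}

# The relaxed box would be appended to the peeling trajectory.
relaxed = drop_restriction_sketch(box_i, initial_box, "b")
```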
- inspect(i=None, style='table', **kwargs)¶
Write the stats and box limits of the user-specified box to standard out. If i is not provided, the last box will be printed.
- Parameters
i (int or list of ints, optional) – the index of the box, defaults to currently selected box
style ({'table', 'graph', 'data'}) – the style of the visualization. ‘table’ prints the stats and boxlim. ‘graph’ creates a figure. ‘data’ returns a list of tuples, where each tuple contains the stats and the box_lims.
**kwargs – additional keyword arguments are passed to the helper function that generates the table or graph
- resample(i=None, iterations=10, p=0.5)¶
Calculate resample statistics for candidate box i
- Parameters
i (int, optional) –
iterations (int, optional) –
p (float, optional) –
- Return type
DataFrame
- select(i)¶
select an entry from the peeling and pasting trajectory and update the prim box to this selected box.
- Parameters
i (int) – the index of the box to select.
- show_pairs_scatter(i=None, dims=None, cdf=False)¶
Make a pair wise scatter plot of all the restricted dimensions with color denoting whether a given point is of interest or not and the boxlims superimposed on top.
- Parameters
i (int, optional) –
dims (list of str, optional) – dimensions to show, defaults to all restricted dimensions
cdf (bool, optional) – plot diag as cdf or pdf
- Return type
seaborn PairGrid
- show_ppt()¶
show the peeling and pasting trajectory in a figure
- show_tradeoff(cmap=<matplotlib.colors.ListedColormap object>, annotated=False)¶
Visualize the trade off between coverage and density. Color is used to denote the number of restricted dimensions.
- Parameters
cmap (valid matplotlib colormap) –
annotated (bool, optional) – shows point labels if True
- Return type
a Figure instance
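For reference, the quantities traded off in this plot can be computed as in the sketch below for the binary case, using the usual scenario discovery definitions. The helper name `box_stats` and the boolean `in_box` mask are assumptions; the library computes these statistics internally.

```python
def box_stats(y, in_box):
    """Coverage, density and mass of a box for binary outcomes y."""
    n_total = len(y)
    n_in_box = sum(in_box)
    coi_total = sum(y)
    coi_in_box = sum(value for value, flag in zip(y, in_box) if flag)
    return {
        # Share of all cases of interest that the box captures.
        "coverage": coi_in_box / coi_total,
        # Share of the points in the box that are of interest.
        "density": coi_in_box / n_in_box,
        # Share of all points that lie in the box.
        "mass": n_in_box / n_total,
    }


y = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]
in_box = [True, True, True, True, False, False, False, False, False, False]
stats = box_stats(y, in_box)
```

Shrinking a box typically raises density while lowering coverage, which is the trade-off the figure visualizes.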
- update(box_lims, indices)¶
update the box to the provided box limits.
- Parameters
box_lims (DataFrame) – the new box_lims
indices (ndarray) – the indices of y that are inside the box
- write_ppt_to_stdout()¶
write the peeling and pasting trajectory to stdout
- ema_workbench.analysis.prim.pca_preprocess(experiments, y, subsets=None, exclude={})¶
perform PCA to preprocess experiments before running PRIM
Pre-process the data by performing a PCA-based rotation on it. This effectively turns the algorithm into PCA-PRIM as described in Dalal et al. (2013).
- Parameters
experiments (DataFrame) –
y (ndarray) – one dimensional binary array
subsets (dict, optional) – expects a dictionary with group name as key and a list of uncertainty names as values. If this is used, a constrained PCA-PRIM is executed
exclude (list of str, optional) – the uncertainties that should be excluded from the rotation
- Returns
rotated_experiments – DataFrame
rotation_matrix – DataFrame
- Raises
RuntimeError – if mode is not binary (i.e. y is not a binary classification), or if X contains non-numeric columns
- ema_workbench.analysis.prim.run_constrained_prim(experiments, y, issignificant=True, **kwargs)¶
Run PRIM repeatedly while constraining the maximum number of dimensions available in experiments
Improved usage of PRIM as described in Kwakkel (2019).
- Parameters
experiments (DataFrame) –
y (numpy array) –
issignificant (bool, optional) – if True, run prim only on subsets of dimensions that are significant for the initial PRIM on the entire dataset.
**kwargs – any additional keyword arguments are passed on to PRIM
- Return type
PrimBox instance
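One way to picture the constrained runs is to enumerate the subsets of dimensions on which PRIM could be re-run. The sketch below does only that enumeration and assumes every non-empty subset is a candidate; how the library selects the dimensions and merges the resulting boxes is not shown, and the helper name is hypothetical.

```python
from itertools import combinations


def dimension_subsets(dims):
    """All non-empty subsets of dims, ordered by increasing size."""
    return [
        subset
        for n in range(1, len(dims) + 1)
        for subset in combinations(dims, n)
    ]


# Hypothetical names of the dimensions found to be significant.
subsets = dimension_subsets(["a", "b", "c"])
```

The number of subsets grows as 2^n - 1, which is one reason to restrict the search to significant dimensions via issignificant.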
- ema_workbench.analysis.prim.setup_prim(results, classify, threshold, incl_unc=[], **kwargs)¶
Helper function for setting up the prim algorithm
- Parameters
results (tuple) – tuple of DataFrame and dict with numpy arrays, as returned by perform_experiments().
classify (str or callable) – either a string denoting the outcome of interest to use or a function.
threshold (double) – the minimum score on the density of the last box on the peeling trajectory. In case of a binary classification, this should be between 0 and 1.
incl_unc (list of str, optional) – list of uncertainties to include in prim analysis
kwargs (dict) – valid keyword arguments for prim.Prim
- Return type
a Prim instance
- Raises
PrimException – if data resulting from classify is not a 1-d array.
TypeError – if classify is not a string or a callable.
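The handling of classify described above can be sketched roughly as follows. This is a stand-in that works on plain lists: the function name `resolve_classify` is hypothetical, and a ValueError is used in place of the library's PrimException.

```python
def resolve_classify(outcomes, classify):
    """Turn classify into a one-dimensional sequence y."""
    if isinstance(classify, str):
        # A string names the outcome of interest to use directly.
        y = outcomes[classify]
    elif callable(classify):
        # A function maps the outcomes to the dependent variable.
        y = classify(outcomes)
    else:
        raise TypeError("classify must be a string or a callable")
    if any(isinstance(value, (list, tuple)) for value in y):
        # Stand-in for PrimException: y must be one-dimensional.
        raise ValueError("classify did not produce a 1-d result")
    return y


outcomes = {"cost": [1.0, 5.0, 3.0]}

y_named = resolve_classify(outcomes, "cost")
y_binary = resolve_classify(outcomes, lambda o: [int(c > 2) for c in o["cost"]])
```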