Module pyfx.pricingscience.feature_selection

Functions and utilities for feature selection.

Functions

def build_hierarchy_graph(corr_feature: pandas.core.frame.DataFrame, duplication_threshold: float = 1.0) ‑> igraph.Graph

Build the hierarchy graphs of the features using the asymmetrical correlations.

Args

corr_feature : pd.DataFrame
Pandas DataFrame containing the asymmetrical correlations between the features. The DataFrame must contain the columns "feature1", "feature2", and "correlation". Self-correlations and duplicated features must be removed beforehand. Passing symmetrical correlations will break the hierarchies.

duplication_threshold

Returns

ig.Graph
A tree with directed edges.

Raises

ValueError
if corr_feature contains duplicated features

def correlation(data: pandas.core.frame.DataFrame, **kwargs: Any) ‑> pandas.core.frame.DataFrame

Compute correlations between all the columns of the given dataframe.

The columns can include both numerical and categorical data.

This method uses dython.nominal.associations internally, with the following parameters by default:
- numerical_columns: all features of type np.floating
- nominal_columns: all other features
- nom_nom_assoc: "theil"
- compute_only: True

You can override these parameters or set additional ones using kwargs.

See the dython doc for more details about the parameters of this method: http://shakedzy.xyz/dython/modules/nominal/#associations

Args

data
the DataFrame to use.
kwargs
extra arguments passed to dython

Returns

A pd.DataFrame with 3 columns ["feature1", "feature2", "correlation"].
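
The default nom_nom_assoc of "theil" refers to Theil's U (the uncertainty coefficient), which is asymmetrical: U(x|y) generally differs from U(y|x). The following is a minimal standard-library sketch of that measure, written for illustration only. It is not the dython implementation, and the helper names (entropy, conditional_entropy, theils_u) and the sample data are assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy H(x) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """Conditional entropy H(x | y) of x given y."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    # p(x, y) * log(p(x, y) / p(y)), with p(x, y) / p(y) = count(x, y) / count(y)
    return -sum(
        (c / n) * math.log(c / y_counts[y_value])
        for (_, y_value), c in xy_counts.items()
    )


def theils_u(x, y):
    """Theil's U(x | y): fraction of the uncertainty in x explained by y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so nothing is left to explain
    return (h_x - conditional_entropy(x, y)) / h_x


# Hypothetical data: each city belongs to exactly one country.
city = ["Paris", "Lyon", "Berlin", "Munich"]
country = ["FR", "FR", "DE", "DE"]

u_country_given_city = theils_u(country, city)  # city fully determines country
u_city_given_country = theils_u(city, country)  # country only narrows the city
```

Here knowing the city removes all uncertainty about the country, while knowing the country removes only half of the uncertainty about the city. This kind of directionality is what a hierarchy between features can be read from.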

def detect_duplicated_features(corr_feature: pandas.core.frame.DataFrame, duplicate_threshold: float = 1) ‑> dict[str, set[str]]

Detect duplicated features from the DataFrame and remove them.

Args

corr_feature : pd.DataFrame
Correlation DataFrame. It must contain asymmetrical correlations. Required columns are ("feature1", "feature2", "correlation").
duplicate_threshold
Threshold used to consider a correlation as a duplication

Returns

a dictionary of all duplicated features
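
The documentation does not state how a duplication is decided or how the returned dictionary is organised. The sketch below shows one plausible reading under explicit assumptions: two features count as duplicates when the correlation reaches the threshold in both directions, and the result maps each feature to the set of its duplicates. It uses plain tuples instead of a DataFrame so that it stays dependency-free; find_duplicates and the sample rows are hypothetical.

```python
from collections import defaultdict


def find_duplicates(rows, threshold=1.0):
    """Map each feature to the features it mirrors in both directions.

    rows is an iterable of (feature1, feature2, correlation) tuples holding
    asymmetrical correlations. The two-way rule is an assumption made for
    this sketch, not documented behaviour of the module.
    """
    strength = {(a, b): c for a, b, c in rows}
    duplicates = defaultdict(set)
    for (a, b), c in strength.items():
        # Require the reverse direction to reach the threshold as well.
        if c >= threshold and strength.get((b, a), 0.0) >= threshold:
            duplicates[a].add(b)
    return dict(duplicates)


# Hypothetical correlations: "zip" and "postcode" determine each other,
# while "city" determines "country" but not the other way round.
rows = [
    ("zip", "postcode", 1.0),
    ("postcode", "zip", 1.0),
    ("country", "city", 1.0),
    ("city", "country", 0.5),
]
duplicates = find_duplicates(rows)
```

With these rows only "zip" and "postcode" are reported, because the relation between "city" and "country" reaches the threshold in one direction only.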

def has_node(g: igraph.Graph, name: str) ‑> bool

Verify if the graph has the node specified by its name.

Strangely, igraph doesn't have such a function.

Args

g : ig.Graph
graph the name will be searched in
name : str
the name of the vertex

Returns

True if the graph contains a vertex with the name. False otherwise.

def importance(x: pandas.core.frame.DataFrame, y: pandas.core.series.Series, seed: int = 1234, random_baseline: bool = True, inline: bool = False) ‑> tuple[pandas.core.series.Series, float]

Compute the relative importance of feature candidates regarding the given target and dataset.

By default, random baseline features are added to adjust the importances of the "true" features relative to noise.

Args

x
the feature data whose importance is computed
y
the target values against which importance is computed
seed
the random seed to use (default: 1234)
random_baseline
use random baseline features to adjust computed importance (default: True)
inline
apply mutations inline. If False, a copy of the data is made (default: False)

Returns

A tuple(pd.Series, float) with the first element being the importance of each feature, and the second being the r2 score of the model used to compute the importance.
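
The documentation does not describe the model used or how the random baseline adjusts the scores. The sketch below illustrates the general idea only, under stated assumptions: it swaps in absolute Pearson correlation as a stand-in importance measure, adds seeded random columns, and subtracts the strongest random column's score as a noise floor. The names adjusted_importance, pearson, and n_baselines are hypothetical and not part of the module.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def adjusted_importance(x, y, seed=1234, n_baselines=5):
    """Score each column of x against y, relative to random noise columns."""
    rng = random.Random(seed)
    columns = dict(x)  # work on a copy so the caller's mapping is untouched
    baseline_names = []
    for i in range(n_baselines):
        name = f"__random_{i}"
        columns[name] = [rng.random() for _ in y]
        baseline_names.append(name)
    # Stand-in importance: absolute correlation with the target (assumption).
    raw = {name: abs(pearson(values, y)) for name, values in columns.items()}
    noise_floor = max(raw[name] for name in baseline_names)
    # Anything that does not beat the best random column is clipped to zero.
    return {name: max(0.0, raw[name] - noise_floor) for name in x}


# Hypothetical data: "signal" drives the target, "weak" is unrelated to it.
x = {"signal": list(range(50)), "weak": [(i * 7) % 11 for i in range(50)]}
y = [2 * v + 1 for v in x["signal"]]
scores = adjusted_importance(x, y)
```

Copying the columns before adding the random ones mirrors the documented default of inline=False, where the input data is left unmodified.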