Module pyfx.pricingscience.feature_selection
Functions and utilities for feature selection.
Functions
def build_hierarchy_graph(corr_feature: pandas.core.frame.DataFrame, duplication_threshold: float = 1.0) ‑> igraph.Graph
-
Build the hierarchy graph of the features using the asymmetrical correlations.
Args
corr_feature
:pd.DataFrame
- Pandas DataFrame containing the asymmetrical correlations between the features. The DataFrame must contain the "feature1", "feature2" and "correlation" columns. Self-correlations and duplicated features must be removed beforehand. Passing symmetrical correlations will break the hierarchies.
duplication_threshold
Returns
ig.Graph
- A tree with directed edges.
Raises
ValueError
- if corr_feature contains duplicated features
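The warning about symmetrical correlations can be illustrated with a small sketch. It is not the module's algorithm: the helper `directed_edges` is hypothetical, it works on plain (feature1, feature2, correlation) tuples rather than a DataFrame, and it assumes that a direction can be inferred only when the score reaches the threshold one way but not the other. The edge orientation simply follows the row order, since the source does not state how the module orients its edges.

```python
def directed_edges(rows, threshold=1.0):
    """Return (feature1, feature2) pairs that reach the threshold in one direction only.

    Hypothetical helper for illustration; not part of pyfx.
    """
    scores = {(f1, f2): corr for f1, f2, corr in rows}
    edges = []
    for (f1, f2), corr in scores.items():
        reverse = scores.get((f2, f1), 0.0)
        # A direction exists only if the two scores disagree around the threshold.
        if corr >= threshold and reverse < threshold:
            edges.append((f1, f2))
    return sorted(edges)


# Asymmetrical scores: the two directions differ, so an edge can be oriented.
asymmetric = [("country", "city", 1.0), ("city", "country", 0.5)]
# Symmetrical scores: both directions are equal, so no orientation is possible.
symmetric = [("country", "city", 0.8), ("city", "country", 0.8)]

asymmetric_edges = directed_edges(asymmetric)
symmetric_edges = directed_edges(symmetric)
```

With symmetrical input both directions always carry the same value, so this kind of test can never pick a parent and a child, which is one way to read "will break the hierarchies".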
def correlation(data: pandas.core.frame.DataFrame, **kwargs: Any) ‑> pandas.core.frame.DataFrame
-
Compute correlations between all the columns of the given dataframe.
The columns can include both numerical and categorical data.
This method uses dython.nominal.associations internally, with the following parameters by default:
- numerical_columns: all features of type np.floating
- nominal_columns: all other features
- nom_nom_assoc: "theil"
- compute_only: True
You can override these parameters or set additional ones using kwargs.
See the dython doc for more details about the parameters of this method: http://shakedzy.xyz/dython/modules/nominal/#associations
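The default nom_nom_assoc of "theil" refers to Theil's U (the uncertainty coefficient), which is asymmetrical: U(x|y) generally differs from U(y|x). The following is a minimal standard-library sketch of that measure, written for illustration only. It is not dython's implementation, and the function names and sample data are assumptions.

```python
from collections import Counter
from math import log


def entropy(values):
    """Shannon entropy H(X) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X | Y): entropy of x within each group of equal y, weighted by group size."""
    n = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def theils_u(x, y):
    """Theil's U(x | y): share of the uncertainty in x that is explained by y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so nothing is left to explain
    return (h_x - conditional_entropy(x, y)) / h_x


city = ["Paris", "Lyon", "Berlin", "Munich", "Paris", "Berlin"]
country = ["FR", "FR", "DE", "DE", "FR", "DE"]

u_country_given_city = theils_u(country, city)  # knowing the city fixes the country
u_city_given_country = theils_u(city, country)  # knowing the country only narrows the city
```

Here the first value is 1 and the second is strictly between 0 and 1, which is the kind of asymmetry the hierarchy functions in this module rely on.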
Args
data
- the Dataframe to use.
kwargs
- extra arguments passed to dython
Returns
A pd.DataFrame with 3 columns ["feature1", "feature2", "correlation"].
def detect_duplicated_features(corr_feature: pandas.core.frame.DataFrame, duplicate_threshold: float = 1) ‑> dict[str, set[str]]
-
Detect duplicated features from the DataFrame and remove them.
Args
corr_feature
:pd.DataFrame
- Correlation DataFrame. It must contain asymmetrical correlations. Required columns are ("feature1", "feature2", "correlation").
duplicate_threshold
- Threshold used to consider a correlation as a duplication
Returns
a dictionary of all duplicated features
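As a rough sketch of what threshold-based duplicate detection could look like, the hypothetical helper below treats two features as duplicates when the correlation reaches the threshold in both directions. That criterion is an assumption, not something the source states. The helper works on plain (feature1, feature2, correlation) tuples. With a DataFrame, `corr_feature[["feature1", "feature2", "correlation"]].itertuples(index=False, name=None)` yields tuples of the same shape.

```python
from collections import defaultdict


def find_mutual_duplicates(rows, threshold=1.0):
    """Map each feature to the features it is mutually duplicated with.

    Hypothetical helper for illustration; the duplicate criterion is assumed.
    """
    strong = {(f1, f2) for f1, f2, corr in rows if corr >= threshold}
    duplicates = defaultdict(set)
    for f1, f2 in strong:
        # Keep the pair only if the reverse direction is also at the threshold.
        if (f2, f1) in strong:
            duplicates[f1].add(f2)
    return dict(duplicates)


rows = [
    ("zip_code", "postal_code", 1.0),
    ("postal_code", "zip_code", 1.0),
    ("city", "zip_code", 1.0),  # reaches the threshold in one direction only
    ("zip_code", "city", 0.4),
]
result = find_mutual_duplicates(rows)
```

In this sample only zip_code and postal_code are reported, each listing the other; city is left out because its relation to zip_code is one-directional.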
def has_node(g: igraph.Graph, name: str) ‑> bool
-
Verify if the graph has the node specified by its name.
Strangely, igraph doesn't have such a function.
Args
g
:ig.Graph
- graph the name will be searched in
name
:str
- the name of the vertex
Returns
True if the graph contains a vertex with the name. False otherwise.
def importance(x: pandas.core.frame.DataFrame, y: pandas.core.series.Series, seed: int = 1234, random_baseline: bool = True, inline: bool = False) ‑> tuple[pandas.core.series.Series, float]
-
Compute the relative importance of feature candidates regarding the given target and dataset.
By default, random baseline features are added to adjust the importances of "true" features relative to noise.
Args
x
- the feature data whose importance is computed
y
- the target values against which importance is computed
seed
- the random seed to use (default: 1234)
random_baseline
- use random baseline features to adjust computed importance (default: True)
inline
- apply mutations to the data inline. If False, a copy of the data is made first (default: False)
Returns
A tuple (pd.Series, float) with the first element being the importance of each feature, and the second being the R² score of the model used to compute the importance.
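The random-baseline idea can be sketched without any modelling library. The example below is not the module's method: it swaps in the absolute Pearson correlation as a stand-in importance score, and the helper names, the number of noise features (`n_random`) and the "subtract the strongest noise score" rule are all assumptions made for illustration.

```python
import random
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def baseline_adjusted_scores(features, target, seed=1234, n_random=5):
    """Score each feature, then subtract the best score reached by pure noise.

    Hypothetical helper: |Pearson r| stands in for a model-based importance.
    """
    rng = random.Random(seed)  # seeded so the noise features are reproducible
    raw = {name: abs(pearson(col, target)) for name, col in features.items()}
    noise_scores = [
        abs(pearson([rng.gauss(0.0, 1.0) for _ in target], target))
        for _ in range(n_random)
    ]
    noise_floor = max(noise_scores)
    # Anything that does not beat the noise floor is clipped to zero.
    return {name: max(score - noise_floor, 0.0) for name, score in raw.items()}


target = [float(i) for i in range(50)]
features = {
    "signal": [2.0 * t + 1.0 for t in target],  # linear in the target
    "junk": [float((i * 7) % 5) for i in range(50)],  # short repeating pattern
}
scores = baseline_adjusted_scores(features, target)
```

The point of the adjustment is that a feature is only credited for the part of its score that exceeds what random columns achieve on the same data, and fixing the seed makes that floor repeatable between runs.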