Module pyfx.pricingscience.clustering
Functions and utilities for clustering.
The overall approach uses hierarchical clustering. First we create a dendrogram based on the data, then we select a cutoff point to fix the number of clusters.
The best clustering can be selected by manually evaluating multiple cutoff points across the dendrogram, but we also provide a procedure to automatically select the best cutoff point using an aggregation of multiple clustering metrics.
Here is an example of using this module to get the best clustering for a given dataset:
from pyfx.pricingscience import clustering

dataset = ...
GROUPBY_COL = "somecol"  # The column containing the attributes to be clustered
BASEDON_COL = "someothercol"  # The column used to compare the attributes to be clustered
TARGET = "targetcol"  # The target metric used in the comparison
FORMULA = "mean"
MIN_CLUSTERS = 2
MAX_CLUSTERS = 5
dendro = clustering.Dendrogram.from_metric(
    dataset,
    groupby_dim=GROUPBY_COL,
    basedon_dim=BASEDON_COL,
    target=TARGET,
    formula=FORMULA,
)
clust = clustering.OptimalClustering(dendro, MIN_CLUSTERS, MAX_CLUSTERS)
best = clust.best()
best.clusters()  # The resulting cluster attribution for all the values of GROUPBY_COL
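The two-step approach described above (build a dendrogram, then cut it) can be sketched with SciPy alone. This is only an illustration of the general technique, not this module's implementation: the toy points and the cut level of 1.0 are assumptions chosen for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy data (assumed): two well-separated groups of three points each.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])

# Step 1: build the dendrogram, encoded by SciPy as a linkage matrix.
tree = linkage(points, method="ward")

# Step 2: cut the dendrogram at a given height to fix the clusters.
cut_level = 1.0
labels = fcluster(tree, t=cut_level, criterion="distance")
n_clusters = len(set(labels))
```

Raising or lowering cut_level merges or splits clusters, which is what evaluating multiple cutoff points amounts to.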
Classes
class Clustering (dendrogram: Dendrogram, cut_level: float)
-
A clustering proposal and its associated metrics.
Create the clustering using a Dendrogram at a specific cut level.
Args
dendrogram
- the dendrogram to use as a base for the clustering
cut_level
- the level at which to cut the dendrogram to get the corresponding clusters.
Returns
The resulting clustering
Methods
def clusters(self) ‑> pandas.core.frame.DataFrame
-
The clusters associated with each groupby value.
def cut_level(self) ‑> float
-
The dendrogram cut level corresponding to this clustering.
def nb_clusters(self) ‑> int
-
The number of clusters.
def score_calinski_harabasz(self) ‑> float
-
The Calinski-Harabasz score of the clustering, as defined in https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html#sklearn.metrics.calinski_harabasz_score.
def score_davies_bouldin(self) ‑> float
-
The Davies-Bouldin score of the clustering, as defined in https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html#sklearn.metrics.davies_bouldin_score.
def score_silhouette(self) ‑> float
-
The silhouette coefficient of the clustering, as defined in https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html.
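For reference, the three scores above can be computed directly with scikit-learn, as in the following sketch. The feature matrix and labels are assumed toy values; how this class derives its own features from the dendrogram is not shown here.

```python
import numpy as np
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Toy feature matrix and cluster labels (assumed for illustration).
features = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])
labels = np.array([0, 0, 0, 1, 1, 1])

scores = {
    # In [-1, 1]; higher means better-separated clusters.
    "silhouette": float(silhouette_score(features, labels)),
    # Unbounded above; higher is better.
    "calinski_harabasz": float(calinski_harabasz_score(features, labels)),
    # Non-negative; lower is better.
    "davies_bouldin": float(davies_bouldin_score(features, labels)),
}
```

Note that the three metrics do not share a direction: silhouette and Calinski-Harabasz are maximized, while Davies-Bouldin is minimized.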
class Dendrogram (matrix: pandas.core.frame.DataFrame, linkage_method: str)
-
A Dendrogram representation of the underlying data.
The purpose of this object is to serve as the basis for (potentially multiple iterations of) hierarchical clustering. It has no use on its own.
The expected way to create a new instance of Dendrogram is by using one of Dendrogram.from_expenses() or Dendrogram.from_metric().
The Dendrogram is created
Static methods
def from_expenses(data: pandas.core.frame.DataFrame, groupby_dim: str, basedon_dim: str, target: str, threshold_min_expense_in_a_segment: float, linkage_method: str = 'ward') ‑> Dendrogram
-
Create the dendrogram based on expense patterns in the data.
Args
data
- the data used to generate the dendrogram
groupby_dim
- the dimension used to group the data, i.e. the attribute whose values are to be clustered together (e.g. customers to be grouped into customer clusters). Must be a column without missing values
basedon_dim
- the base dimension for the dendrogram, i.e. the criterion used to compare the groups. It must be a dimension-type column without missing values
target
- the name of the target field, i.e. the attribute used by the metric computation
threshold_min_expense_in_a_segment
- the minimum value for the aggregated target inside a segment
linkage_method
- the linkage method; can be one of the methods listed in https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html (default: "ward")
Returns
The resulting dendrogram
def from_metric(data: pandas.core.frame.DataFrame, groupby_dim: str, basedon_dim: str, target: str, formula: str, linkage_method: str = 'ward') ‑> Dendrogram
-
Create the dendrogram based on a given metric in the data.
Args
data
- the data used to generate the dendrogram
groupby_dim
- the dimension used to group the data, i.e. the attribute whose values are to be clustered together (e.g. customers to be grouped into customer clusters). Must be a column without missing values
basedon_dim
- the base dimension for the dendrogram, i.e. the criterion used to compare the groups. It must be a dimension-type column without missing values
target
- the name of the target field, i.e. the attribute used by the metric computation
formula
- the aggregation formula to use (a pandas agg argument like "mean", "max", …)
linkage_method
- the linkage method; can be one of the methods listed in https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html (default: "ward")
Returns
The resulting dendrogram
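One plausible reading of these arguments is that the data is pivoted into a groupby_dim × basedon_dim matrix of aggregated target values, which is then fed to SciPy's linkage. The sketch below illustrates that reading only; the column names and values are hypothetical, and the actual internals of from_metric may differ.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage

# Hypothetical data: prices paid by customers for products.
data = pd.DataFrame({
    "customer": ["a", "a", "a", "b", "b", "c", "c"],
    "product": ["x", "x", "y", "x", "y", "x", "y"],
    "price": [10.0, 12.0, 20.0, 11.0, 21.0, 50.0, 5.0],
})

# One row per groupby value, one column per basedon value,
# each cell holding the aggregated target (here: the mean price).
matrix = data.pivot_table(
    index="customer",   # plays the role of groupby_dim
    columns="product",  # plays the role of basedon_dim
    values="price",     # plays the role of target
    aggfunc="mean",     # plays the role of formula
)

# Rows with similar profiles across the columns are merged first.
tree = linkage(matrix.to_numpy(), method="ward")
```

In this toy case customers "a" and "b" have nearly identical mean prices per product, so they are the first pair merged in the dendrogram.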
class OptimalClustering (dendrogram: Dendrogram, cluster_min_number: int, cluster_max_number: int)
-
Procedure to determine the best clustering based on a given Dendrogram.
This procedure works by iterating over different cut levels of the Dendrogram and finding the one with the best aggregated score over multiple metrics.
Create and run a new OptimalClustering.
Methods
def best(self) ‑> Optional[Clustering]
-
The best clustering (if any).
def metrics(self) ‑> Optional[pandas.core.frame.DataFrame]
-
All the metrics computed during the procedure.
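The selection procedure can be approximated as follows. This is a sketch under stated assumptions, not the module's algorithm: the documentation does not say how the metrics are aggregated, so averaging per-metric ranks is an assumed choice, and the synthetic data (three Gaussian blobs) is made up for the example.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic data (assumed): three well-separated blobs of 10 points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
features = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])
tree = linkage(features, method="ward")

# Score one candidate clustering per cluster count in [2, 5].
rows = []
for k in range(2, 6):
    labels = fcluster(tree, t=k, criterion="maxclust")
    rows.append({
        "nb_clusters": k,
        "silhouette": silhouette_score(features, labels),
        "calinski_harabasz": calinski_harabasz_score(features, labels),
        "davies_bouldin": davies_bouldin_score(features, labels),
    })
metrics = pd.DataFrame(rows).set_index("nb_clusters")

# Assumed aggregation: rank each metric (1 = best), then average the ranks.
ranks = pd.DataFrame({
    "silhouette": metrics["silhouette"].rank(ascending=False),
    "calinski_harabasz": metrics["calinski_harabasz"].rank(ascending=False),
    "davies_bouldin": metrics["davies_bouldin"].rank(ascending=True),
})
best_k = int(ranks.mean(axis=1).idxmin())
```

Rank averaging sidesteps the fact that the three metrics live on different scales and point in different directions; other aggregations (for example, normalized weighted sums) would be equally defensible.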