Module pyfx.pricingscience.clustering

Functions and utilities for clustering.

The overall approach uses hierarchical clustering. First we create a dendrogram based on the data, then we select a cutoff point to fix the number of clusters.
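The two steps above can be sketched directly with SciPy. This is a minimal illustration on toy data, not this module's implementation; the variable names are hypothetical, and the only assumption taken from the documentation is that Ward linkage is the default.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy data: four observations forming two obvious groups.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])

# Step 1: build the merge tree (the dendrogram) with Ward linkage.
merge_tree = linkage(points, method="ward")

# Step 2: cut the tree so that at most two clusters remain.
labels = fcluster(merge_tree, t=2, criterion="maxclust")
```

Each row of `merge_tree` records one merge and the height at which it happens; cutting at a different height yields a different number of clusters.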

The best clustering can be selected by manually evaluating multiple cutoff points across the dendrogram, but we also provide a procedure to automatically select the best cutoff point using an aggregation of multiple clustering metrics.

Here is an example of using this module to get the best clustering for a given dataset:

from pyfx.pricingscience import clustering

dataset = ...  # A pandas DataFrame

GROUPBY_COL = "somecol"  # The column containing the attributes to be clustered
BASEDON_COL = "someothercol"  # The column used to compare the attributes to be clustered
TARGET = "targetcol"  # The target metric used in the comparison
FORMULA = "mean"

MIN_CLUSTERS = 2
MAX_CLUSTERS = 5

dendro = clustering.Dendrogram.from_metric(
    dataset,
    groupby_dim=GROUPBY_COL,
    basedon_dim=BASEDON_COL,
    target=TARGET,
    formula=FORMULA
)
clust = clustering.OptimalClustering(dendro, MIN_CLUSTERS, MAX_CLUSTERS)

best = clust.best()  # None if no clustering could be selected

best.clusters()  # The resulting cluster attribution for all the values of GROUPBY_COL

Classes

class Clustering (dendrogram: Dendrogram, cut_level: float)

A clustering proposal and its associated metrics.

Create the clustering from a Dendrogram at a specific cut level.

Args

dendrogram
the dendrogram to use as a base for the clustering
cut_level
the level at which to cut the dendrogram to get the corresponding clusters.

Returns

The resulting clustering
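To make the role of a float cut level concrete, here is a small sketch using SciPy's `fcluster` with the `"distance"` criterion. It assumes that the cut level is a height on the dendrogram, which this documentation does not state explicitly; the helper `cut` and the toy matrix are hypothetical.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

# Toy matrix: one row per groupby value (hypothetical labels a..e).
matrix = pd.DataFrame(
    [[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9], [20.0, 20.0]],
    index=["a", "b", "c", "d", "e"],
)
merge_tree = linkage(matrix.to_numpy(), method="ward")


def cut(merge_tree, index, cut_level):
    """Assign a cluster label to each row by cutting the tree at cut_level."""
    labels = fcluster(merge_tree, t=cut_level, criterion="distance")
    return pd.DataFrame({"cluster": labels}, index=index)


# A low cut keeps only the tight pairs together.
low_cut = cut(merge_tree, matrix.index, cut_level=1.0)
# Cutting above the last merge height puts everything in one cluster.
high_cut = cut(merge_tree, matrix.index, cut_level=merge_tree[-1, 2] + 1.0)
```

Raising the cut level can only merge clusters, never split them, so the number of clusters decreases monotonically as the level increases.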

Methods

def clusters(self) ‑> pandas.core.frame.DataFrame

The cluster associated with each groupby value.

def cut_level(self) ‑> float

The dendrogram cut level corresponding to this clustering.

def nb_clusters(self) ‑> int

The number of clusters.

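def score_calinski_harabasz(self) ‑> float

The Calinski-Harabasz score of this clustering.

def score_davies_bouldin(self) ‑> float

The Davies-Bouldin score of this clustering.

def score_silhouette(self) ‑> float

The silhouette score of this clustering.
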
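The three score methods are not described further here. As a point of reference, the sketch below computes the standard metrics of the same names with scikit-learn on toy data. That the module relies on scikit-learn, or on these exact definitions, is an assumption.

```python
import numpy as np
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Toy data: two compact, well-separated groups of three points each.
points = np.array(
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
)
labels = np.array([0, 0, 0, 1, 1, 1])

ch = calinski_harabasz_score(points, labels)  # higher is better
db = davies_bouldin_score(points, labels)  # lower is better, 0 is the minimum
sil = silhouette_score(points, labels)  # in [-1, 1], higher is better
```

Note that the metrics do not share a direction: Davies-Bouldin improves as it decreases, while the other two improve as they increase. Any aggregation across them has to account for this.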
class Dendrogram (matrix: pandas.core.frame.DataFrame, linkage_method: str)

A Dendrogram representation of the underlying data.

The purpose of this object is to serve as the basis for (potentially multiple iterations of) hierarchical clustering. It has no use in itself.

The expected way to create a new instance of Dendrogram is by using one of Dendrogram.from_expenses() or Dendrogram.from_metric().

The Dendrogram is created

Static methods

def from_expenses(data: pandas.core.frame.DataFrame, groupby_dim: str, basedon_dim: str, target: str, threshold_min_expense_in_a_segment: float, linkage_method: str = 'ward') ‑> Dendrogram

Create the dendrogram based on expense patterns in the data.

Args

data
the data used to generate the dendrogram
groupby_dim
the dimension used to group the data, i.e. the attribute whose values are to be clustered together (e.g. customers to be grouped into customer clusters). Must be a column without missing values
basedon_dim
the base dimension for the dendrogram, i.e. the criterion used to compare the groups. Must be a dimension-type column without missing values
target
the name of the target field, i.e. the attribute used in the metric computation
threshold_min_expense_in_a_segment
the minimum value for the aggregated target inside a segment
linkage_method
can be one of the methods listed in https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html (default: "ward")

Returns

The resulting dendrogram

def from_metric(data: pandas.core.frame.DataFrame, groupby_dim: str, basedon_dim: str, target: str, formula: str, linkage_method: str = 'ward') ‑> Dendrogram

Create the dendrogram based on a given metric in the data.

Args

data
the data used to generate the dendrogram
groupby_dim
the dimension used to group the data, i.e. the attribute whose values are to be clustered together (e.g. customers to be grouped into customer clusters). Must be a column without missing values
basedon_dim
the base dimension for the dendrogram, i.e. the criterion used to compare the groups. Must be a dimension-type column without missing values
target
the name of the target field, i.e. the attribute used in the metric computation
formula
the aggregation formula to use (a pandas agg argument like "mean", "max" …)
linkage_method
can be one of the methods listed in https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html (default: "ward")

Returns

The resulting dendrogram
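One plausible reading of the arguments above is that the data is pivoted into a groupby-by-basedon matrix of aggregated target values, which is then fed to the linkage. The sketch below illustrates that reading with pandas and SciPy. It is an assumption about what happens internally, not the actual implementation, and the column names are hypothetical.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage

# Hypothetical long-format data: one row per observation.
data = pd.DataFrame(
    {
        "customer": ["c1", "c1", "c1", "c2", "c2", "c3", "c3"],
        "product": ["p1", "p1", "p2", "p1", "p2", "p1", "p2"],
        "discount": [0.08, 0.12, 0.20, 0.12, 0.18, 0.40, 0.05],
    }
)

# One row per groupby value, one column per basedon value,
# each cell holding the aggregated target (here the mean).
matrix = data.pivot_table(
    index="customer", columns="product", values="discount", aggfunc="mean"
)

# Rows with similar profiles across the basedon values merge first.
merge_tree = linkage(matrix.to_numpy(), method="ward")
```

In this toy data `c1` and `c2` have nearly identical discount profiles across products, so they are merged before `c3` joins them.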

class OptimalClustering (dendrogram: Dendrogram, cluster_min_number: int, cluster_max_number: int)

Procedure to determine the best clustering based on a given Dendrogram.

This procedure works by cutting the Dendrogram at different levels and selecting the cut with the best aggregated score across multiple metrics.

Create and run a new OptimalClustering.
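How the scores are aggregated is not specified here. One simple scheme, shown below purely as an illustration, is to rank the candidates on each metric and pick the one with the best mean rank. The score values are made up for the example, and the mean-rank rule is an assumption rather than this class's documented behaviour.

```python
import pandas as pd

# Made-up scores for three candidate cluster counts (illustrative only).
metrics = pd.DataFrame(
    {
        "calinski_harabasz": [120.0, 150.0, 90.0],
        "davies_bouldin": [0.80, 0.50, 0.95],
        "silhouette": [0.55, 0.62, 0.40],
    },
    index=pd.Index([2, 3, 4], name="nb_clusters"),
)

# Rank each metric so that rank 1 is always the best candidate.
# Davies-Bouldin is "lower is better"; the other two are "higher is better".
ranks = pd.DataFrame(
    {
        "calinski_harabasz": metrics["calinski_harabasz"].rank(ascending=False),
        "davies_bouldin": metrics["davies_bouldin"].rank(ascending=True),
        "silhouette": metrics["silhouette"].rank(ascending=False),
    }
)

# Aggregate by mean rank and keep the candidate with the lowest value.
mean_rank = ranks.mean(axis=1)
best_nb_clusters = mean_rank.idxmin()
```

Ranking avoids having to rescale metrics that live on very different ranges, at the cost of ignoring how large the gaps between candidates are.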

Methods

def best(self) ‑> Optional[Clustering]

The best clustering (if any).

def metrics(self) ‑> Optional[pandas.core.frame.DataFrame]

All the metrics computed during the procedure.