Module `pyfx.pricingscience.similarity`

Functions and utilities for product similarity analysis.

Contains formulas to evaluate distances between products based on textual, categorical, or numerical features. Textual features are extracted from text fields (e.g. the description) with SentenceTransformer. Similar products are identified with fast approximate nearest neighbor search. Groups of similar products are then extracted by partitioning the graph of nearest neighbors with the Leiden algorithm.

Functions

def approx_top_knn(text_embedding: Optional[numpy.ndarray], cat_array: Optional[numpy.ndarray], num_array: Optional[numpy.ndarray], top_k: int, batch_queries: bool = True, text_n_neighbor_connect: int = 30, cat_n_neighbor_connect: int = 30, num_n_neighbor_connect: int = 30, text_query_rate: float = 0.1, cat_query_rate: float = 0.1, num_query_rate: float = 0.1) ‑> Optional[numpy.ndarray]

Identify closest neighbours.

Given arrays of products * descriptors, identify which products are neighbors using fast approximate nearest neighbor search.

Args

text_embedding: 2-D input array of products * textual features,
cat_array: 2-D input array of products * categorical features,
num_array: 2-D input array of products * numerical features,
top_k: number of nearest neighbors to return,
batch_queries: use parallelism for queries,
text_n_neighbor_connect: The number of neighbors to use for ANN search using
textual features. High values increase accuracy at cost of time,
cat_n_neighbor_connect: The number of neighbors to use for ANN search using
categorical features. High values increase accuracy at cost of time,
num_n_neighbor_connect: The number of neighbors to use for ANN search using
numerical features. High values increase accuracy at cost of time,
text_query_rate: backtracking parameter that increases accuracy at cost of time
during textual features based ANN search,
cat_query_rate: backtracking parameter that increases accuracy at cost of time
during categorical features based ANN search,
num_query_rate: backtracking parameter that increases accuracy at cost of time
during numerical features based ANN search.

Returns

The array of approximate nearest neighbors across all kinds of features.
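
For intuition about the shape of such a result, here is a minimal sketch that uses exact brute-force search in NumPy as a stand-in for the approximate search. The helper name `exact_top_k` and the choice of Euclidean distance are assumptions made for illustration; this is not the module's implementation.

```python
import numpy as np


def exact_top_k(features: np.ndarray, top_k: int) -> np.ndarray:
    """Hypothetical helper: indices of the top_k nearest rows for each row."""
    features = np.asarray(features, dtype=float)
    # Pairwise squared Euclidean distances, shape (n_products, n_products).
    diff = features[:, None, :] - features[None, :, :]
    sq_dist = np.einsum("ijk,ijk->ij", diff, diff)
    # A product is not counted as its own neighbor.
    np.fill_diagonal(sq_dist, np.inf)
    # Sort each row by distance and keep the first top_k column indices.
    return np.argsort(sq_dist, axis=1)[:, :top_k]


# Toy data (assumed): two tight pairs of products in a 2-D feature space.
products = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
neighbors = exact_top_k(products, top_k=1)
```

Brute force scales quadratically with the number of products, which is the cost an approximate index is meant to avoid on large catalogs.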

def extract_embeddings(text_list_: List[str], text_transformer_: str) ‑> pandas.core.frame.DataFrame

Create the embeddings from a list of texts using a specific transformer.

Args

text_list_: the list of texts to process,
text_transformer_: the name of the text transformer to use,
see https://www.sbert.net/docs/pretrained_models.html for the available transformers.

Returns

A DataFrame of the embeddings.
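
A minimal sketch of how such a DataFrame could be assembled, assuming the sentence-transformers package, whose `SentenceTransformer(name).encode(texts)` returns one vector per text. The helper `embeddings_frame` and the stand-in encoder `toy_encode` are hypothetical and only keep the example runnable offline; the real function may lay out its columns differently.

```python
from typing import Callable, List, Sequence

import numpy as np
import pandas as pd


def embeddings_frame(texts: List[str], encode: Callable[[List[str]], Sequence]) -> pd.DataFrame:
    """Hypothetical helper: one row per text, one column per embedding dimension."""
    vectors = np.asarray(encode(texts), dtype=float)
    return pd.DataFrame(vectors)


def toy_encode(texts: List[str]) -> np.ndarray:
    """Stand-in encoder (NOT a real model): character count and word count."""
    return np.array([[len(t), len(t.split())] for t in texts], dtype=float)


# With sentence-transformers installed, the encoder could instead be:
#   from sentence_transformers import SentenceTransformer
#   encode = SentenceTransformer("all-MiniLM-L6-v2").encode
frame = embeddings_frame(["red cotton shirt", "blue jeans"], toy_encode)
```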

def get_leiden_partition_opti_modularity(prod_graph: igraph.Graph, _max_communities_size: int) ‑> igraph.Graph

Build the modularity-maximising partition of a graph using the Leiden algorithm.

Args

prod_graph: original graph built using the igraph library,
_max_communities_size: upper bound on the size of each community.

Returns

The graph containing optimal partition information.

def get_leiden_partition_with_resolution_parameter(prod_graph: igraph.Graph, _resolution_param: float, _max_communities_size: int) ‑> igraph.Graph

Build a partition of a graph using the Leiden algorithm.

Args

prod_graph: original graph built using the igraph library,
_resolution_param: higher resolutions lead to more communities and
lower resolutions lead to fewer communities,
_max_communities_size: upper bound on the size of each community.

Returns

The graph containing partition information.
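
To illustrate the effect of the resolution, the sketch below scores two candidate partitions with a resolution-weighted modularity: the sum over communities of e_c / m - resolution * (d_c / 2m)^2, where e_c is the number of edges inside community c, d_c its total degree, and m the number of edges. Whether this module optimises exactly this quality function is an assumption; the helper `quality` and the toy graph are hypothetical. With a resolution of 1 the score is the standard modularity.

```python
from collections import Counter
from typing import Dict, List, Tuple


def quality(edges: List[Tuple[int, int]], membership: Dict[int, int], resolution: float = 1.0) -> float:
    """Hypothetical helper: resolution-weighted modularity of a partition."""
    m = len(edges)
    internal = Counter()  # edges with both endpoints in the same community
    degree = Counter()    # total degree per community
    for u, v in edges:
        degree[membership[u]] += 1
        degree[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(internal[c] / m - resolution * (d / (2 * m)) ** 2 for c, d in degree.items())


# Toy graph (assumed): two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
merged = {node: 0 for node in range(6)}           # one community
split = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}      # one community per triangle

low = (quality(edges, merged, 0.1), quality(edges, split, 0.1))
high = (quality(edges, merged, 1.0), quality(edges, split, 1.0))
```

At resolution 0.1 the single merged community scores higher, while at resolution 1.0 the two-triangle split scores higher, consistent with higher resolutions favouring more communities.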

def pairwise_cos(ndarray: numpy.ndarray, approximate_neighbors: numpy.ndarray) ‑> numpy.ndarray

Compute an array of Cosine similarity.

Compute an array of Cosine similarity between an array (of textual descriptors) of products and their respective neighbours.

Args

ndarray: first 2-D input array of products,
approximate_neighbors: second 2-D input array of neighbours.

Returns

The array of Cosine distance between ndarray and approximate_neighbors.
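
A minimal NumPy sketch of a row-wise cosine distance, defined as 1 minus the cosine similarity. It assumes both arrays hold feature vectors paired row by row, which may differ from how this function interprets `approximate_neighbors`; the helper name `rowwise_cosine_distance` is hypothetical.

```python
import numpy as np


def rowwise_cosine_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Hypothetical helper: cosine distance between a[i] and b[i] for every row i."""
    # Row-wise dot products.
    dot = np.einsum("ij,ij->i", a, b)
    # Product of the row norms; rows are assumed to be non-zero vectors.
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return 1.0 - dot / norms


a = np.array([[1.0, 0.0], [1.0, 1.0]])
b = np.array([[0.0, 1.0], [2.0, 2.0]])  # orthogonal to a[0], parallel to a[1]
cos_dist = rowwise_cosine_distance(a, b)
```

Orthogonal vectors give a distance of 1 and parallel vectors a distance of 0, regardless of their lengths.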

def pairwise_ham(ndarray: numpy.ndarray, approximate_neighbors: numpy.ndarray) ‑> numpy.ndarray

Compute an array of Hamming similarity.

Compute an array of Hamming similarity between an array (of categorical descriptors) of products and their respective neighbours.

Args

ndarray: first 2-D input array of products,
approximate_neighbors: second 2-D input array of neighbours.

Returns

The array of Hamming distance between ndarray and approximate_neighbors.
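
A minimal NumPy sketch of a row-wise Hamming distance, taken here as the fraction of categorical features on which two products disagree (some definitions use the raw count instead). The row-by-row pairing, the integer encoding of categories, and the helper name `rowwise_hamming_distance` are assumptions for illustration.

```python
import numpy as np


def rowwise_hamming_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Hypothetical helper: share of positions where a[i] and b[i] differ."""
    return np.mean(a != b, axis=1)


# Categorical features encoded as integer codes (assumed encoding).
a = np.array([[0, 1, 2], [3, 3, 3]])
b = np.array([[0, 1, 5], [3, 3, 3]])  # first pair differs on one of three features
ham_dist = rowwise_hamming_distance(a, b)
```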

def pairwise_maha(ndarray: numpy.ndarray, approximate_neighbors: numpy.ndarray, vi: numpy.ndarray) ‑> numpy.ndarray

Compute an array of Mahalanobis similarity.

Compute an array of Mahalanobis similarity between an array (of numerical descriptors) of products and their respective neighbours.

Args

ndarray: first 2-D input array of products,
approximate_neighbors: second 2-D input array of neighbours,
vi: the inverse of the covariance matrix.

Returns

The array of Mahalanobis distance between ndarray and approximate_neighbors.
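
A minimal NumPy sketch of a row-wise Mahalanobis distance, sqrt((x - y)^T vi (x - y)), with `vi` the inverse covariance matrix as in the argument list. The row-by-row pairing, the toy covariance, and the helper name `rowwise_mahalanobis` are assumptions for illustration.

```python
import numpy as np


def rowwise_mahalanobis(a: np.ndarray, b: np.ndarray, vi: np.ndarray) -> np.ndarray:
    """Hypothetical helper: Mahalanobis distance between a[i] and b[i] for every row i."""
    diff = a - b
    # Quadratic form diff[i] @ vi @ diff[i], evaluated for all rows at once.
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, vi, diff))


# Toy covariance (assumed): the first feature has variance 4, the second variance 1.
cov = np.array([[4.0, 0.0], [0.0, 1.0]])
vi = np.linalg.inv(cov)
a = np.array([[2.0, 0.0], [0.0, 3.0]])
b = np.zeros((2, 2))
maha = rowwise_mahalanobis(a, b, vi)
```

A gap of 2 along the high-variance feature counts as one standard deviation, while a gap of 3 along the unit-variance feature counts as three, which is why this distance suits numerical features on different scales.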

def similarity(input_df: pandas.core.frame.DataFrame, cat_columns: Optional[List[str]] = None, num_columns: Optional[List[str]] = None, text_embedding: Optional[pandas.core.frame.DataFrame] = None, k_approx_neighbors: int = 400) ‑> pandas.core.frame.DataFrame

Build a table of similar products based on numerical, categorical and textual descriptors.

Args

input_df: input DataFrame of products and their descriptors,
cat_columns: list of categorical column names,
num_columns: list of numerical column names,
text_embedding: DataFrame of products * textual features,
k_approx_neighbors: number of approximate nearest neighbors to return.

Returns

The DataFrame of similar products.