Module pyfx.pricingscience.similarity

Functions and utilities for product similarity analysis.

Contains formulas to evaluate distances between products based on textual, categorical, or numerical features. Uses SentenceTransformer to extract textual features of products from text fields (e.g. the description), fast approximate nearest neighbor (ANN) search to identify similar products, and the Leiden algorithm to extract a partition, i.e. groups of similar products, from the graph of nearest neighbors.

Functions

def approx_top_knn(text_embedding: Optional[numpy.ndarray], cat_array: Optional[numpy.ndarray], num_array: Optional[numpy.ndarray], top_k: int, batch_queries: bool = True, text_n_neighbor_connect: int = 30, cat_n_neighbor_connect: int = 30, num_n_neighbor_connect: int = 30, text_query_rate: float = 0.1, cat_query_rate: float = 0.1, num_query_rate: float = 0.1) ‑> Optional[numpy.ndarray]

Identify closest neighbours.

Identify which products in an array of products * descriptors are neighbors, using fast approximate nearest neighbor search.

Args

text_embedding
2-D input array of products * textual features,
cat_array
2-D input array of products * categorical features,
num_array
2-D input array of products * numerical features,
top_k
number of nearest neighbors to return,
batch_queries
use parallelism for queries,
text_n_neighbor_connect
The number of neighbors to use for ANN search using
textual features. High values increase accuracy at cost of time,
cat_n_neighbor_connect
The number of neighbors to use for ANN search using
categorical features. High values increase accuracy at cost of time,
num_n_neighbor_connect
The number of neighbors to use for ANN search using
numerical features. High values increase accuracy at cost of time,
text_query_rate
backtracking parameter that increases accuracy at cost of time
during textual features based ANN search,
cat_query_rate
backtracking parameter that increases accuracy at cost of time
during categorical features based ANN search,
num_query_rate
backtracking parameter that increases accuracy at cost of time
during numerical features based ANN search.

Returns

The array of approximate nearest neighbors across all kinds of features.
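
On a small sample, approximate results can be sanity-checked against an exact brute-force search. The sketch below is such a baseline, not the approximate search this function performs; the helper name exact_top_k and the choice of squared Euclidean distance are assumptions made for illustration.

```python
import numpy as np


def exact_top_k(features: np.ndarray, top_k: int) -> np.ndarray:
    """Return, for each row, the indices of its top_k nearest other rows."""
    features = np.asarray(features, dtype=float)
    sq_norms = (features ** 2).sum(axis=1)
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    dist2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(dist2, np.inf)  # a product is not its own neighbor
    return np.argsort(dist2, axis=1)[:, :top_k]


# Four hypothetical products described by a single numerical feature.
points = np.array([[0.0], [1.0], [3.0], [7.0]])
neighbors = exact_top_k(points, top_k=2)
```

The brute-force search builds a full products * products distance matrix, so it is only practical for small inputs; this is the cost that approximate search avoids.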

def extract_embeddings(text_list_: List[str], text_transformer_: str) ‑> pandas.core.frame.DataFrame

Create the embeddings from a list of texts using a specific transformer.

Args

text_list_
the list of texts to process,
text_transformer_
the name of the text transformer to use
(see https://www.sbert.net/docs/pretrained_models.html for other available transformers).

Returns

A DataFrame of the embeddings.

def get_leiden_partition_opti_modularity(prod_graph: igraph.Graph, _max_communities_size: int) ‑> igraph.Graph

Build the modularity-maximising partition of a graph using the Leiden algorithm.

Args

prod_graph
original graph built using iGraph library,
_max_communities_size
maximum number of products allowed in a community.

Returns

The graph containing optimal partition information.

def get_leiden_partition_with_resolution_parameter(prod_graph: igraph.Graph, _resolution_param: float, _max_communities_size: int) ‑> igraph.Graph

Build a partition of a graph using the Leiden algorithm.

Args

prod_graph
original graph built using iGraph library,
_resolution_param
higher resolutions lead to more communities and
lower resolutions lead to fewer communities,
_max_communities_size
maximum number of products allowed in a community.

Returns

The graph containing partition information.
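
The effect of the resolution parameter can be illustrated with the generalized modularity Q = (1 / 2m) * sum_ij (A_ij - gamma * k_i * k_j / 2m) * [c_i == c_j], where gamma is the resolution. Whether this function optimizes exactly this quality function is an assumption; the helper modularity and the toy graph below are hypothetical and only show why lower resolutions favour fewer communities.

```python
import numpy as np


def modularity(adjacency, labels, resolution=1.0):
    """Generalized modularity of a partition of an undirected graph."""
    adjacency = np.asarray(adjacency, dtype=float)
    labels = np.asarray(labels)
    degrees = adjacency.sum(axis=1)
    two_m = degrees.sum()  # twice the number of edges
    same_community = labels[:, None] == labels[None, :]
    expected = resolution * np.outer(degrees, degrees) / two_m
    return float(((adjacency - expected) * same_community).sum() / two_m)


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adjacency = np.zeros((6, 6))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1.0

two_groups = [0, 0, 0, 1, 1, 1]
one_group = [0, 0, 0, 0, 0, 0]

q_two_default = modularity(adjacency, two_groups, resolution=1.0)
q_one_default = modularity(adjacency, one_group, resolution=1.0)
q_two_low = modularity(adjacency, two_groups, resolution=0.1)
q_one_low = modularity(adjacency, one_group, resolution=0.1)
```

At resolution 1.0 the split into two triangles scores higher, while at resolution 0.1 the single community scores higher.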

def pairwise_cos(ndarray: numpy.ndarray, approximate_neighbors: numpy.ndarray) ‑> numpy.ndarray

Compute an array of Cosine distance.

Compute an array of Cosine distance between an array (of textual descriptors) of products and their respective neighbours.

Args

ndarray
first 2-D input array of products,
approximate_neighbors
second 2-D input array of neighbours.

Returns

The array of Cosine distance between ndarray and approximate_neighbors.
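
As an illustration of the computation, the sketch below derives cosine distance (one minus cosine similarity) between each product and its neighbours. It assumes that approximate_neighbors holds row indices of shape (products, k); that layout and the helper name cosine_to_neighbors are assumptions, not the documented behaviour.

```python
import numpy as np


def cosine_to_neighbors(features: np.ndarray, neighbor_idx: np.ndarray) -> np.ndarray:
    """Cosine distance between each row and the rows listed as its neighbours."""
    features = np.asarray(features, dtype=float)
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    # unit[neighbor_idx] has shape (products, k, dims)
    similarity = np.einsum("nd,nkd->nk", unit, unit[neighbor_idx])
    return 1.0 - similarity


# Three hypothetical 2-D embeddings and, for each, two neighbour indices.
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbor_idx = np.array([[1, 2], [0, 2], [0, 1]])
cos_dist = cosine_to_neighbors(embeddings, neighbor_idx)
```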

def pairwise_ham(ndarray: numpy.ndarray, approximate_neighbors: numpy.ndarray) ‑> numpy.ndarray

Compute an array of Hamming distance.

Compute an array of Hamming distance between an array (of categorical descriptors) of products and their respective neighbours.

Args

ndarray
first 2-D input array of products,
approximate_neighbors
second 2-D input array of neighbours.

Returns

The array of Hamming distance between ndarray and approximate_neighbors.
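
A minimal sketch of a normalized Hamming distance, i.e. the fraction of categorical features on which a product and a neighbour disagree. It assumes integer-encoded categories and an index array of shape (products, k) for the neighbours; both, along with the helper name hamming_to_neighbors, are assumptions for illustration.

```python
import numpy as np


def hamming_to_neighbors(categories: np.ndarray, neighbor_idx: np.ndarray) -> np.ndarray:
    """Fraction of differing categorical features per (product, neighbour) pair."""
    categories = np.asarray(categories)
    # Compare each row against its neighbours: shape (products, k, features)
    differs = categories[:, None, :] != categories[neighbor_idx]
    return differs.mean(axis=2)


# Three hypothetical products with three integer-encoded categorical features.
categories = np.array([[0, 1, 2], [0, 1, 3], [4, 5, 6]])
neighbor_idx = np.array([[1, 2], [0, 2], [0, 1]])
ham_dist = hamming_to_neighbors(categories, neighbor_idx)
```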

def pairwise_maha(ndarray: numpy.ndarray, approximate_neighbors: numpy.ndarray, vi: numpy.ndarray) ‑> numpy.ndarray

Compute an array of Mahalanobis distance.

Compute an array of Mahalanobis distance between an array (of numerical descriptors) of products and their respective neighbours.

Args

ndarray
first 2-D input array of products,
approximate_neighbors
second 2-D input array of neighbours,
vi
the inverse of the covariance matrix.

Returns

The array of Mahalanobis distance between ndarray and approximate_neighbors.
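
The Mahalanobis distance between x and y is sqrt((x - y)^T VI (x - y)), with VI the inverse covariance matrix. The sketch below applies that formula to each product and its neighbours, assuming the neighbours are given as an index array of shape (products, k); that layout and the helper name mahalanobis_to_neighbors are assumptions.

```python
import numpy as np


def mahalanobis_to_neighbors(
    features: np.ndarray, neighbor_idx: np.ndarray, vi: np.ndarray
) -> np.ndarray:
    """Mahalanobis distance between each row and the rows listed as its neighbours."""
    features = np.asarray(features, dtype=float)
    # Differences to each neighbour: shape (products, k, dims)
    diff = features[:, None, :] - features[neighbor_idx]
    squared = np.einsum("nkd,de,nke->nk", diff, vi, diff)
    return np.sqrt(squared)


# Hypothetical numerical descriptors; vi is the inverse of a diagonal
# covariance with variances 4 and 9, chosen so results are easy to check.
numbers = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
neighbor_idx = np.array([[1, 2], [0, 2], [0, 1]])
vi = np.diag([1.0 / 4.0, 1.0 / 9.0])
maha_dist = mahalanobis_to_neighbors(numbers, neighbor_idx, vi)
```

With real data, vi could be estimated as np.linalg.inv(np.cov(numbers, rowvar=False)), provided the covariance matrix is invertible.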

def similarity(input_df: pandas.core.frame.DataFrame, cat_columns: Optional[List[str]] = None, num_columns: Optional[List[str]] = None, text_embedding: Optional[pandas.core.frame.DataFrame] = None, k_approx_neighbors: int = 400) ‑> pandas.core.frame.DataFrame

Build a table of similar products based on numerical, categorical and textual descriptors.

Args

input_df
input DataFrame of products and their descriptors,
cat_columns
list of categorical column names,
num_columns
list of numerical column names,
text_embedding
DataFrame of products * textual features,
k_approx_neighbors
number of approximate nearest neighbors to return.

Returns

The DataFrame of similar products.